Commit ad9f02c

Add a "sgemm direct" mode for small matrices

OpenBLAS has a fancy algorithm for copying the input data while laying it out in a more CPU-friendly memory layout. This is great for large matrices; the cost of the copy is easily amortized by the gains from the better memory layout.

But for small matrices (on CPUs that can do efficient unaligned loads) this copy can be a net loss.

This patch adds (for SKYLAKEX initially) a "sgemm direct" mode that bypasses the whole copy machinery for ALPHA=1/BETA=0/... standard arguments, for small matrices only.

What is small? For the non-threaded case this has been measured to be in the M*N*K = 28 * 512 * 512 range, while in the threaded case it is less, around M*N*K = 1 * 512 * 512.
1 parent 8771880 commit ad9f02c

4 files changed, 482 additions and 0 deletions

common_level3.h

Lines changed: 8 additions & 0 deletions
@@ -47,6 +47,14 @@ __global__ void cuda_dgemm_kernel(int, int, int, double *, double *, double *);
 extern "C" {
 #endif
 
+extern void sgemm_kernel_direct(BLASLONG M, BLASLONG N, BLASLONG K,
+	float * __restrict__ A, BLASLONG strideA,
+	float * __restrict__ B, BLASLONG strideB,
+	float * __restrict__ R, BLASLONG strideR);
+
+extern int sgemm_kernel_direct_performant(BLASLONG M, BLASLONG N, BLASLONG K);
+
+
 int sgemm_beta(BLASLONG, BLASLONG, BLASLONG, float,
 	       float *, BLASLONG, float *, BLASLONG, float *, BLASLONG);
 int dgemm_beta(BLASLONG, BLASLONG, BLASLONG, double,
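
The size cutoffs quoted in the commit message can be turned into a predicate along the following lines. This is only a sketch under stated assumptions, not the body of sgemm_kernel_direct_performant, which is not shown in this part of the diff. The name direct_mode_worthwhile, the threaded flag, and the use of a strict "less than" comparison at the quoted thresholds are all hypothetical choices made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Thresholds quoted in the commit message, expressed as M*N*K products. */
#define DIRECT_LIMIT_SINGLE_THREAD ((int64_t)28 * 512 * 512)
#define DIRECT_LIMIT_MULTI_THREAD  ((int64_t)1 * 512 * 512)

/* Hypothetical helper, NOT the OpenBLAS implementation.
 * Returns true when M*N*K is below the threshold for the given mode.
 * `threaded` is an assumed flag: true if the call would run on more
 * than one thread. */
static bool direct_mode_worthwhile(int64_t m, int64_t n, int64_t k, bool threaded)
{
    if (m <= 0 || n <= 0 || k <= 0)
        return false;

    int64_t limit = threaded ? DIRECT_LIMIT_MULTI_THREAD
                             : DIRECT_LIMIT_SINGLE_THREAD;

    /* Compare by division first so the product cannot overflow. */
    if (n > limit / m)
        return false;
    int64_t mn = m * n;
    if (k > limit / mn)
        return false;

    return mn * k < limit;
}
```

With these assumptions, a 4 x 512 x 512 problem would qualify in the single-threaded case but not in the threaded one.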

interface/gemm.c

Lines changed: 8 additions & 0 deletions
@@ -271,6 +271,14 @@ void CNAME(enum CBLAS_ORDER order, enum CBLAS_TRANSPOSE TransA, enum CBLAS_TRANS
 
   PRINT_DEBUG_CNAME;
 
+#if !defined(COMPLEX) && !defined(DOUBLE) && defined(USE_SGEMM_KERNEL_DIRECT)
+  if (beta == 0 && alpha == 1.0 && order == CblasRowMajor && TransA == CblasNoTrans && TransB == CblasNoTrans && sgemm_kernel_direct_performant(m,n,k)) {
+    sgemm_kernel_direct(m, n, k, a, lda, b, ldb, c, ldc);
+    return;
+  }
+
+#endif
+
 #ifndef COMPLEX
   args.alpha = (void *)&alpha;
   args.beta = (void *)&beta;
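
For readers who want to see what the fast path in the hunk above has to compute, here is a plain reference loop for the case the guard selects: row-major storage, no transposes, alpha = 1 and beta = 0, so the result is simply R = A * B. This is a sketch of the semantics only, not the SKYLAKEX kernel added by the commit. The name sgemm_direct_reference is hypothetical, and the reading of the stride arguments as "floats between consecutive rows" is an assumption based on lda/ldb/ldc being passed in those positions.

```c
#include <stddef.h>

/* Hypothetical reference helper, NOT the optimized OpenBLAS kernel.
 * Computes R = A * B for row-major matrices (alpha = 1, beta = 0).
 * A is m x k, B is k x n, R is m x n; each stride is the number of
 * floats between the starts of consecutive rows. */
static void sgemm_direct_reference(size_t m, size_t n, size_t k,
                                   const float *a, size_t stride_a,
                                   const float *b, size_t stride_b,
                                   float *r, size_t stride_r)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * stride_a + p] * b[p * stride_b + j];
            /* beta == 0: the old contents of R are overwritten, not read. */
            r[i * stride_r + j] = acc;
        }
    }
}
```

Because it reads A and B in place through their strides, a loop like this needs no packing buffers, which is the property the direct mode relies on for small inputs.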
