I'm comparing the `ssyrk` function against an equivalent `sgemm` call for the following operation:

`C <- t(A) * A`
(AMD Ryzen 7 2700, OpenBLAS 0.3.8)
In MKL, doing this through `ssyrk` is, as expected, slightly faster than using the more general `sgemm`. In OpenBLAS, however, the more specialized `ssyrk` actually takes longer when the input matrix A has far more rows than columns, and in all cases OpenBLAS is slower than MKL, despite running on AMD hardware.
These are the timings I get for different problem sizes:

| A size | routine | OpenBLAS | MKL |
|---|---|---|---|
| 1,000,000 × 100 | `sgemm` | 247 ms | 61.2 ms |
| 1,000,000 × 100 | `ssyrk` | 332 ms | 46.8 ms |
| 1,000 × 10,000 | `sgemm` | 731 ms | 567 ms |
| 1,000 × 10,000 | `ssyrk` | 395 ms | 324 ms |
In the case with more rows than columns, `ssyrk` actually takes longer than `sgemm`, despite only having to fill half of the output matrix, which I find problematic.
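For context (this is background, not part of the report): a symmetric rank-k update needs roughly half the floating-point work of the general product, because C is symmetric and only one triangle must be computed. A minimal NumPy sketch of an upper-triangle-only update, just to illustrate what `ssyrk` gets to skip compared with `sgemm`:

```python
import numpy as np

def syrk_upper(alpha, A):
    """Illustrative C = alpha * A^T A that fills only the upper
    triangle of C -- roughly half the FLOPs of a full gemm."""
    n, k = A.shape
    C = np.zeros((k, k), dtype=A.dtype)
    for i in range(k):
        # row i of C only needs columns j >= i
        C[i, i:] = alpha * (A[:, i] @ A[:, i:])
    return C

A = np.arange(12, dtype=np.float32).reshape(4, 3)
full = A.T @ A                  # what sgemm computes (all k*k entries)
half = syrk_upper(1.0, A)       # what syrk computes (upper triangle only)
assert np.allclose(np.triu(full), half)
```

The lower triangle of `half` stays zero; a real `syrk` leaves it untouched, which is exactly the work saving that the timings above fail to show for `ssyrk` in OpenBLAS.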
Code in C:
```c
#include "cblas.h"

void call_gemm(float *A, float *C, int n, int k, float alpha)
{
    cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                k, k, n,
                alpha, A, k, A, k,
                0., C, k);
}

void call_syrk(float *A, float *C, int n, int k, float alpha)
{
    cblas_ssyrk(CblasRowMajor, CblasUpper, CblasTrans,
                k, n, alpha,
                A, k, 0., C, k);
}
```
Code in Python/Cython:
Cython file:
```cython
%%cython --compile-args=-O3 -lopenblas
import numpy as np
cimport numpy as np

cdef extern from *:
    """
    #include "cblas.h"
    //#include "mkl.h"
    void call_gemm(float *A, float *C, int n, int k, float alpha)
    {
        cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                    k, k, n,
                    alpha, A, k, A, k,
                    0., C, k);
    }
    void call_syrk(float *A, float *C, int n, int k, float alpha)
    {
        cblas_ssyrk(CblasRowMajor, CblasUpper, CblasTrans,
                    k, n, alpha,
                    A, k, 0., C, k);
    }
    """
    void call_gemm(float *A, float *C, int n, int k, float alpha)
    void call_syrk(float *A, float *C, int n, int k, float alpha)

def py_gemm(np.ndarray[float, ndim=2] A, np.ndarray[float, ndim=2] C, int n, int k, float alpha):
    call_gemm(&A[0,0], &C[0,0], n, k, alpha)

def py_syrk(np.ndarray[float, ndim=2] A, np.ndarray[float, ndim=2] C, int n, int k, float alpha):
    call_syrk(&A[0,0], &C[0,0], n, k, alpha)
```
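As a side note (an assumption on my part, not from the original report): the same comparison can be reproduced without compiling anything, since SciPy exposes these BLAS routines directly through `scipy.linalg.blas`, bound to whatever BLAS SciPy was built against. A sketch, with sizes shrunk from the benchmark above and a Fortran-ordered `float32` array to avoid copies in the wrappers:

```python
import numpy as np
from scipy.linalg.blas import sgemm, ssyrk

rng = np.random.default_rng(0)
# 1000 x 100 instead of the 1,000,000 x 100 benchmark size
A = np.asfortranarray(rng.standard_normal((1000, 100)).astype(np.float32))

C_gemm = sgemm(1.0, A, A, trans_a=1)   # full A^T A, all 100 x 100 entries
C_syrk = ssyrk(1.0, A, trans=1)        # upper triangle of A^T A only

# both routines should agree on the upper triangle (float32 tolerance)
assert np.allclose(np.triu(C_gemm), np.triu(C_syrk), rtol=1e-3, atol=1e-2)
```

Timing these two calls with `%timeit` on the full benchmark sizes should show the same `ssyrk`-slower-than-`sgemm` anomaly when the array is linked against OpenBLAS.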
david-cortes changed the title from "ssyrk slower than sgemm equivalence when k>n" to "ssyrk slower than sgemm equivalence when k>n (dsyrk not affected)" on Feb 14, 2020

david-cortes changed the title from "ssyrk slower than sgemm equivalence when k>n (dsyrk not affected)" to "ssyrk slower than sgemm equivalence when k>n" on Feb 14, 2020