ssyrk slower than sgemm equivalence when k>n #2418

david-cortes · 2020-02-14T15:14:23Z

I'm comparing ssyrk against the equivalent sgemm call for the following operation:

C <- t(A)*A

(AMD Ryzen 7 2700, OpenBLAS 0.3.8)

In MKL, doing this through ssyrk is, as expected, slightly faster than using the more general sgemm. In OpenBLAS, however, the more specialized ssyrk actually takes longer when the input matrix A has far more rows than columns, and in all cases OpenBLAS takes longer than MKL, despite running on AMD hardware.

These are the timings I get from different problem sizes:

OpenBLAS:

1,000,000 x 100 sgemm: 247 ms
1,000,000 x 100 ssyrk: 332 ms
1,000 x 10,000 sgemm: 731 ms
1,000 x 10,000 ssyrk: 395 ms

MKL:

1,000,000 x 100 sgemm: 61.2 ms
1,000,000 x 100 ssyrk: 46.8 ms
1,000 x 10,000 sgemm: 567 ms
1,000 x 10,000 ssyrk: 324 ms

In the case with more rows than columns, ssyrk actually takes longer than sgemm despite having to fill only half of the array, which I find problematic.
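To put the timings above on a common scale, here is a rough sketch that converts them to throughput. It assumes the conventional operation counts (2·n·k² flops for the gemm form, n·k·(k+1) for the syrk form, which only computes one triangle); these are nominal counts, not measurements, and the helper names are made up for illustration.

```python
def gemm_flops(n, k):
    """Nominal flops for C = t(A)*A via gemm, A of shape (n, k)."""
    # k*k dot products of length n, each with n multiplies and n adds.
    return 2 * n * k * k


def syrk_flops(n, k):
    """Nominal flops for the same product via syrk (one triangle only)."""
    # k*(k+1)/2 dot products of length n, two flops per element.
    return n * k * (k + 1)


def gflops(flops, millis):
    """Convert an operation count and a wall time in ms to GFLOP/s."""
    return flops / (millis * 1e-3) / 1e9


# Timings copied from the lists above: (library, rows n, cols k) -> ms.
timings_ms = {
    ("OpenBLAS", 1_000_000, 100): {"sgemm": 247.0, "ssyrk": 332.0},
    ("OpenBLAS", 1_000, 10_000): {"sgemm": 731.0, "ssyrk": 395.0},
    ("MKL", 1_000_000, 100): {"sgemm": 61.2, "ssyrk": 46.8},
    ("MKL", 1_000, 10_000): {"sgemm": 567.0, "ssyrk": 324.0},
}

for (lib, n, k), t in timings_ms.items():
    g = gflops(gemm_flops(n, k), t["sgemm"])
    s = gflops(syrk_flops(n, k), t["ssyrk"])
    print(f"{lib:8s} {n:>9,} x {k:<6,} sgemm {g:6.1f} GFLOP/s   ssyrk {s:6.1f} GFLOP/s")
```

Since syrk does roughly half the arithmetic, a syrk time close to or above the gemm time means its effective throughput is less than half that of gemm for that shape.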

Code in C:

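#include "cblas.h"

/* A is n x k, row-major (lda = k); C is k x k (ldc = k).
   Computes C = alpha * t(A)*A. In BLAS terms, N = k and K = n. */
void call_gemm(float *A, float *C, int n, int k, float alpha)
{
        cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                    k, k, n,
                    alpha, A, k, A, k,
                    0.0f, C, k);
}

/* Same product, but only the upper triangle of C is written. */
void call_syrk(float *A, float *C, int n, int k, float alpha)
{
        cblas_ssyrk(CblasRowMajor, CblasUpper, CblasTrans,
                    k, n, alpha,
                    A, k, 0.0f, C, k);
}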

Code in Python/Cython:

Cython file:

%%cython --compile-args=-O3 -lopenblas
import numpy as np
cimport numpy as np
cdef extern from *:
    """
    #include "cblas.h"
    //#include "mkl.h"
    void call_gemm(float *A, float *C, int n, int k, float alpha)
    {
        cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                    k, k, n,
                    alpha, A, k, A, k,
                    0., C, k);
    }
    void call_syrk(float *A, float *C, int n, int k, float alpha)
    {
        cblas_ssyrk(CblasRowMajor, CblasUpper, CblasTrans,
                    k, n, alpha,
                    A, k, 0., C, k);
    }
    """
    void call_gemm(float *A, float *C, int n, int k, float alpha)
    void call_syrk(float *A, float *C, int n, int k, float alpha)

def py_gemm(np.ndarray[float, ndim=2] A, np.ndarray[float, ndim=2] C, int n, int k, float alpha):
    call_gemm(&A[0,0], &C[0,0], n, k, alpha)

def py_syrk(np.ndarray[float, ndim=2] A, np.ndarray[float, ndim=2] C, int n, int k, float alpha):
    call_syrk(&A[0,0], &C[0,0], n, k, alpha)

Python file:

import numpy as np
import ctypes

n = int(1e6)
k = int(1e2)

np.random.seed(123)
A = np.random.normal(size = (n,k)).astype(ctypes.c_float)
C = np.random.normal(size = (k,k)).astype(ctypes.c_float)

%%timeit
py_gemm(A, C, n, k, 1.)

%%timeit
py_syrk(A, C, n, k, 1.)
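One caveat when comparing the two outputs: with CblasUpper, ssyrk only writes the upper triangle of C, so the lower triangle keeps whatever was there before (random values in the script above). The following is a minimal NumPy-only sketch of how the syrk-style result could be mirrored into a full matrix and checked against the gemm-style one. It does not call the Cython wrappers; the syrk output is emulated, and `full_from_upper` is a hypothetical helper name.

```python
import numpy as np


def full_from_upper(C_upper):
    """Build a symmetric matrix from the upper triangle, ignoring the lower one."""
    upper = np.triu(C_upper)                 # diagonal and above
    strict_upper = np.triu(C_upper, 1)       # above the diagonal only
    return upper + strict_upper.T


rng = np.random.default_rng(123)
n, k = 1000, 8                               # small sizes so the check is instant
A = rng.standard_normal((n, k)).astype(np.float32)

ref = A.T @ A                                # what call_gemm computes with alpha = 1

# Emulate the syrk output: valid upper triangle, stale values below the diagonal.
C = np.full((k, k), -1.0, dtype=np.float32)
iu = np.triu_indices(k)
C[iu] = ref[iu]

F = full_from_upper(C)
print(np.allclose(F, ref, atol=1e-3))
```

The loose `atol` is there because float32 results for mirrored entries are not guaranteed to be bit-identical across BLAS code paths.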


martin-frbg · 2020-02-14T21:08:52Z

Possibly related to #1115 (assuming you are running multithreaded)

david-cortes · 2020-02-15T06:10:09Z

Probably. In single-threaded mode the syrk one runs faster than gemm, although it's still about 20% slower than MKL.

OpenBLAS (1 thread):

1,000,000 x 100 sgemm: 449 ms
1,000,000 x 100 ssyrk: 326 ms
1,000 x 10,000 sgemm: 4.32 s
1,000 x 10,000 ssyrk: 2.04 s

MKL (1 thread):

1,000,000 x 100 sgemm: 406 ms
1,000,000 x 100 ssyrk: 269 ms
1,000 x 10,000 sgemm: 3.59 s
1,000 x 10,000 ssyrk: 1.84 s
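As a rough way to see where threading helps, the sketch below divides the single-threaded times by the timings from the first post. It assumes those first timings were taken multithreaded with the default thread count, which the thread suggests but does not state outright; the dictionary layout and the `speedup` helper are illustrative only.

```python
# Times in ms: (library, routine, shape) -> (single-threaded, first-post timing).
times_ms = {
    ("OpenBLAS", "sgemm", "1,000,000 x 100"): (449.0, 247.0),
    ("OpenBLAS", "ssyrk", "1,000,000 x 100"): (326.0, 332.0),
    ("OpenBLAS", "sgemm", "1,000 x 10,000"): (4320.0, 731.0),
    ("OpenBLAS", "ssyrk", "1,000 x 10,000"): (2040.0, 395.0),
    ("MKL", "sgemm", "1,000,000 x 100"): (406.0, 61.2),
    ("MKL", "ssyrk", "1,000,000 x 100"): (269.0, 46.8),
    ("MKL", "sgemm", "1,000 x 10,000"): (3590.0, 567.0),
    ("MKL", "ssyrk", "1,000 x 10,000"): (1840.0, 324.0),
}


def speedup(single_ms, multi_ms):
    """Ratio of single-threaded to multithreaded wall time (>1 means threads help)."""
    return single_ms / multi_ms


for (lib, routine, shape), (single, multi) in times_ms.items():
    print(f"{lib:8s} {routine} {shape:>16s}: {speedup(single, multi):5.2f}x")
```

For the tall 1,000,000 x 100 case, the OpenBLAS ssyrk ratio comes out just under 1, meaning the extra threads give no benefit there, while the other combinations all show a speedup.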

martin-frbg · 2020-12-11T22:09:24Z

Fixed by reverting #747 in #3026

david-cortes changed the title ~~ssyrk slower than sgemm equivalence when k>n~~ ssyrk slower than sgemm equivalence when k>n (dsyrk not affected) Feb 14, 2020

david-cortes changed the title ~~ssyrk slower than sgemm equivalence when k>n (dsyrk not affected)~~ ssyrk slower than sgemm equivalence when k>n Feb 14, 2020

martin-frbg added this to the 0.3.11 milestone Jun 14, 2020

martin-frbg modified the milestones: 0.3.11, 0.3.12, 0.3.13 Oct 15, 2020

martin-frbg closed this as completed Dec 11, 2020
