ssyrk slower than sgemm equivalence when k>n #2418

Closed
david-cortes opened this issue Feb 14, 2020 · 3 comments

@david-cortes
Contributor

david-cortes commented Feb 14, 2020

I'm comparing ssyrk against the equivalent sgemm call for the following operation:

C <- t(A)*A

(AMD Ryzen 7 2700, OpenBLAS 0.3.8)

With MKL, doing this through ssyrk is, as expected, slightly faster than using the more general sgemm. With OpenBLAS, however, the more specialized ssyrk actually takes longer when the input matrix A has far more rows than columns, and in every case OpenBLAS takes longer than MKL, even though this is running on AMD hardware.

These are the timings I get from different problem sizes:

OpenBLAS:

  • 1,000,000 x 100 sgemm: 247 ms
  • 1,000,000 x 100 ssyrk: 332 ms
  • 1,000 x 10,000 sgemm: 731 ms
  • 1,000 x 10,000 ssyrk: 395 ms

MKL:

  • 1,000,000 x 100 sgemm: 61.2 ms
  • 1,000,000 x 100 ssyrk: 46.8 ms
  • 1,000 x 10,000 sgemm: 567 ms
  • 1,000 x 10,000 ssyrk: 324 ms

In the case with more rows than columns, ssyrk actually takes longer than sgemm even though it only has to fill half of the output array, which I find problematic.
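
To make the comparison easier to read, here is a small sketch that turns the multithreaded timings listed above into ssyrk/sgemm ratios. The dictionary layout and the helper name `syrk_over_gemm` are only illustrative; the numbers are copied from the lists above.

```python
# Timings in milliseconds, copied from the multithreaded lists above.
timings_ms = {
    "OpenBLAS": {
        "1,000,000 x 100": {"sgemm": 247.0, "ssyrk": 332.0},
        "1,000 x 10,000": {"sgemm": 731.0, "ssyrk": 395.0},
    },
    "MKL": {
        "1,000,000 x 100": {"sgemm": 61.2, "ssyrk": 46.8},
        "1,000 x 10,000": {"sgemm": 567.0, "ssyrk": 324.0},
    },
}


def syrk_over_gemm(library, shape):
    """Return ssyrk time divided by sgemm time; below 1.0 means ssyrk is faster."""
    t = timings_ms[library][shape]
    return t["ssyrk"] / t["sgemm"]


for library, shapes in timings_ms.items():
    for shape in shapes:
        ratio = syrk_over_gemm(library, shape)
        print(f"{library:8s} {shape:>15s}: ssyrk/sgemm = {ratio:.2f}")
```

The only ratio above 1.0 is the OpenBLAS 1,000,000 x 100 case, which is the one this issue is about.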

Code in C:

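#include "cblas.h"
/* A is an n x k row-major matrix (leading dimension k); C is k x k. */

/* Full product C = alpha * t(A) * A through the general matrix multiply. */
void call_gemm(float *A, float *C, int n, int k, float alpha)
{
        cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                    k, k, n,
                    alpha, A, k, A, k,
                    0., C, k);
}
/* Same product through the symmetric rank-k update;
   only the upper triangle of C is written. */
void call_syrk(float *A, float *C, int n, int k, float alpha)
{
        cblas_ssyrk(CblasRowMajor, CblasUpper, CblasTrans,
                    k, n, alpha,
                    A, k, 0., C, k);
}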
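
For readers unfamiliar with the two routines, here is a plain-Python reference sketch of what the two calls above compute, and why ssyrk would be expected to do roughly half the work. This is not how OpenBLAS or MKL implement either routine; the function names `gemm_ata` and `syrk_upper_ata` are hypothetical, and the sketch starts from a zero-filled C (the beta = 0 case), so the untouched lower triangle shows up as zeros.

```python
def gemm_ata(A):
    """Full C = t(A) * A: every one of the k*k entries is computed."""
    n, k = len(A), len(A[0])
    C = [[0.0] * k for _ in range(k)]
    macs = 0  # number of multiply-adds performed
    for i in range(k):
        for j in range(k):
            for r in range(n):
                C[i][j] += A[r][i] * A[r][j]
                macs += 1
    return C, macs


def syrk_upper_ata(A):
    """Upper triangle of t(A) * A only (j >= i), mirroring CblasUpper.

    The strictly lower triangle is left at its initial value (0.0 here).
    """
    n, k = len(A), len(A[0])
    C = [[0.0] * k for _ in range(k)]
    macs = 0
    for i in range(k):
        for j in range(i, k):
            for r in range(n):
                C[i][j] += A[r][i] * A[r][j]
                macs += 1
    return C, macs


# Tiny example: n = 4 rows, k = 3 columns.
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0],
     [1.0, 0.0, 1.0]]

C_full, macs_full = gemm_ata(A)
C_upper, macs_upper = syrk_upper_ata(A)
print(macs_full, macs_upper)
```

The full product needs n*k*k multiply-adds, while the triangular version needs n*k*(k+1)/2, which approaches half as k grows.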

Code in Python/Cython:

  • Cython file:
%%cython --compile-args=-O3 -lopenblas
import numpy as np
cimport numpy as np
cdef extern from *:
    """
    #include "cblas.h"
    //#include "mkl.h"
    void call_gemm(float *A, float *C, int n, int k, float alpha)
    {
        cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                    k, k, n,
                    alpha, A, k, A, k,
                    0., C, k);
    }
    void call_syrk(float *A, float *C, int n, int k, float alpha)
    {
        cblas_ssyrk(CblasRowMajor, CblasUpper, CblasTrans,
                    k, n, alpha,
                    A, k, 0., C, k);
    }
    """
    void call_gemm(float *A, float *C, int n, int k, float alpha)
    void call_syrk(float *A, float *C, int n, int k, float alpha)

def py_gemm(np.ndarray[float, ndim=2] A, np.ndarray[float, ndim=2] C, int n, int k, float alpha):
    call_gemm(&A[0,0], &C[0,0],n, k, alpha)
def py_syrk(np.ndarray[float, ndim=2] A, np.ndarray[float, ndim=2] C, int n, int k, float alpha):
    call_syrk(&A[0,0], &C[0,0],n, k, alpha)
  • Python file:
import numpy as np
import ctypes

n = int(1e6)
k = int(1e2)

np.random.seed(123)
A = np.random.normal(size = (n,k)).astype(ctypes.c_float)
C = np.random.normal(size = (k,k)).astype(ctypes.c_float)
%%timeit
py_gemm(A, C, n, k, 1.)
%%timeit
py_syrk(A, C, n, k, 1.)
@david-cortes david-cortes changed the title ssyrk slower than sgemm equivalence when k>n ssyrk slower than sgemm equivalence when k>n (dsyrk not affected) Feb 14, 2020
@david-cortes david-cortes changed the title ssyrk slower than sgemm equivalence when k>n (dsyrk not affected) ssyrk slower than sgemm equivalence when k>n Feb 14, 2020
@martin-frbg
Collaborator

Possibly related to #1115 (assuming you are running multithreaded)

@david-cortes
Contributor Author

Probably. In single-threaded mode ssyrk runs faster than sgemm, although it's still roughly 10 to 20% slower than MKL.
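
One common way to get a single-threaded run from Python is to set the libraries' thread-count environment variables before the BLAS library is loaded. The sketch below is an assumption about a reasonable setup, not necessarily how the timings in this thread were obtained; `pin_blas_threads` is a hypothetical helper name, while `OPENBLAS_NUM_THREADS` and `MKL_NUM_THREADS` are the variables read by OpenBLAS and MKL respectively.

```python
import os


def pin_blas_threads(num_threads=1):
    """Request a fixed thread count from OpenBLAS and MKL via environment variables.

    Assumption: this runs before the BLAS library is first loaded
    (for example before importing NumPy or the compiled extension),
    since the variables are read at library start-up.
    """
    value = str(num_threads)
    os.environ["OPENBLAS_NUM_THREADS"] = value
    os.environ["MKL_NUM_THREADS"] = value
    return value


pin_blas_threads(1)
```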

OpenBLAS (1 thread):

  • 1,000,000 x 100 sgemm: 449 ms
  • 1,000,000 x 100 ssyrk: 326 ms
  • 1,000 x 10,000 sgemm: 4.32 s
  • 1,000 x 10,000 ssyrk: 2.04 s

MKL (1 thread):

  • 1,000,000 x 100 sgemm: 406 ms
  • 1,000,000 x 100 ssyrk: 269 ms
  • 1,000 x 10,000 sgemm: 3.59 s
  • 1,000 x 10,000 ssyrk: 1.84 s
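
The gap to MKL can be read off the single-threaded timings above with a few lines of arithmetic. The structure and the helper name `slowdown_vs_mkl` below are illustrative only; the values are the ones listed above, converted to milliseconds.

```python
# Single-threaded timings in milliseconds, copied from the lists above.
single_thread_ms = {
    "OpenBLAS": {
        "1,000,000 x 100": {"sgemm": 449.0, "ssyrk": 326.0},
        "1,000 x 10,000": {"sgemm": 4320.0, "ssyrk": 2040.0},
    },
    "MKL": {
        "1,000,000 x 100": {"sgemm": 406.0, "ssyrk": 269.0},
        "1,000 x 10,000": {"sgemm": 3590.0, "ssyrk": 1840.0},
    },
}


def slowdown_vs_mkl(shape, routine):
    """Fraction by which OpenBLAS is slower than MKL (0.10 means 10% slower)."""
    openblas = single_thread_ms["OpenBLAS"][shape][routine]
    mkl = single_thread_ms["MKL"][shape][routine]
    return openblas / mkl - 1.0


for shape in single_thread_ms["OpenBLAS"]:
    for routine in ("sgemm", "ssyrk"):
        pct = 100.0 * slowdown_vs_mkl(shape, routine)
        print(f"{shape:>15s} {routine}: OpenBLAS is {pct:.0f}% slower than MKL")
```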

@martin-frbg martin-frbg added this to the 0.3.11 milestone Jun 14, 2020
@martin-frbg martin-frbg modified the milestones: 0.3.11, 0.3.12, 0.3.13 Oct 15, 2020
@martin-frbg
Collaborator

Fixed by reverting #747 in #3026
