Performance of dgemm #1840

Open
jmsargado opened this issue Oct 30, 2018 · 7 comments

@jmsargado

When I perform calculations of the type C = (transpose(A))B, I've noticed that with OpenBLAS I don't gain any speedup from calling cblas_dgemm with the flag indicating that A is transposed, compared to doing the transposition manually, i.e. allocating a matrix D, filling in D(i,j) = A(j,i), and only then computing C = DB via cblas_dgemm. I observe the same behavior for A(transpose(B)). With MKL BLAS, on the other hand, the first option (no manual transposition) executes about 50% faster. Is this because OpenBLAS makes hidden copies when transA or transB is set to CblasTrans?
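
To make the comparison concrete, the two variants look roughly like this (a simplified sketch with made-up dimensions and names, not my actual wrapper code):

```cpp
#include <cblas.h>
#include <vector>

// Option 1: let dgemm handle the transpose.
// A is k x m (column-major), B is k x n, C = transpose(A)*B is m x n.
void gemm_trans_flag(int m, int n, int k, const double* A, const double* B, double* C)
{
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                m, n, k, 1.0, A, k, B, k, 0.0, C, m);
}

// Option 2: transpose A into D by hand, then call dgemm without the flag.
void gemm_manual_transpose(int m, int n, int k, const double* A, const double* B, double* C)
{
    std::vector<double> D((size_t)m * k);                  // D = transpose(A), m x k
    for (int j = 0; j < k; j++)
        for (int i = 0; i < m; i++)
            D[i + (size_t)j * m] = A[j + (size_t)i * k];   // D(i,j) = A(j,i)
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, D.data(), m, B, k, 0.0, C, m);
}
```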

@brada4
Contributor

brada4 commented Oct 30, 2018

What are the dimensions (M, N, K) of the matrices? Is this numpy?
Which OpenBLAS version?

Could you escape the asterisks in your posting with backslashes or something? It is hard to tell code from text apart...

Can you run `perf record python sample.py` and then `perf report` against the code you run with OpenBLAS, and post the last ~10 lines of the text output between triple backticks?

The copy/transpose routines, which include the alpha/beta scaling, typically take <1% of GEMM execution time, unless you go to size extremes, like a matrix only 2 values wide, where gemv could probably serve better; see the sketch below.
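
For that extreme case, something like this could replace the gemm call (a rough column-major sketch; the sizes and names are mine):

```cpp
#include <cblas.h>

// C = A * B where B is only 2 columns wide: one dgemv per column.
// A is m x k, B is k x 2, C is m x 2, all column-major.
void gemm_as_gemv(int m, int k, const double* A, const double* B, double* C)
{
    for (int j = 0; j < 2; j++)
        cblas_dgemv(CblasColMajor, CblasNoTrans, m, k, 1.0,
                    A, m, B + (size_t)j * k, 1, 0.0, C + (size_t)j * m, 1);
}
```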

@jmsargado
Author

I'm using OpenBLAS-0.2.20 and calling BLAS from my own C++ code. As the actual calls go through high-level wrappers, I don't really know how to post that code in a way that would be useful. Anyhow, I set up a more straightforward test that performs direct BLAS calls, and now the timings are the same for the transposed and non-transposed dgemm calls. Here's the code:

```cpp
#include <chrono>
#include <iostream>
#include <cblas.h>

void test_dgemm()
{
    std::chrono::high_resolution_clock::time_point tic, toc;
    std::chrono::duration<double> tictoc;

    int n = 4;
    int m = 6;

    // value-initialize so dgemm does not read indeterminate memory
    double *A = new double[n*m]();
    double *B = new double[n*m]();
    double *C = new double[n*m]();

    int nloop = 10000000;

    tic = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
    for ( int i = 0; i < nloop; i++ )
    {
        // A is n x m, B is m x n; note A and C are shared across the OpenMP threads
        A[0] = i;
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, m, 1, A, n, B, m, 0, C, n);
    }
    toc = std::chrono::high_resolution_clock::now();
    tictoc = toc - tic;
    double time1 = tictoc.count();

    tic = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
    for ( int i = 0; i < nloop; i++ )
    {
        // A is m x n, B is m x n; dgemm is asked to transpose A
        A[0] = i;
        cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans, n, n, m, 1, A, m, B, m, 0, C, n);
    }
    toc = std::chrono::high_resolution_clock::now();
    tictoc = toc - tic;
    double time2 = tictoc.count();

    std::cout << "Time without transpose = " << time1 << std::endl;
    std::cout << "Time with transpose = " << time2 << std::endl;

    delete[] A;
    delete[] B;
    delete[] C;
}
```

Here are the timings I get ...
Using OpenBLAS:
Time without transpose = 4.09228
Time with transpose = 4.12796

Using MKL BLAS:
Time without transpose = 0.433103
Time with transpose = 0.44951

This was on a desktop with a Core i7-8700 (6 cores, 3.20 GHz). The main code I'm running performs finite element calculations, so there are a lot of matrix products of the form (B^T)(C)(B). Strangely, I'm not seeing any performance benefit from deferring the manual transposition of the first matrix and letting dgemm handle it. That doesn't jibe with the above results at all; it's as if the manual transposition takes no time, which should not be the case. I also don't know why MKL does so much better in the above test. In my actual simulation the speed-up from using MKL instead of OpenBLAS is around 20%, but that is in terms of overall time, not just BLAS.
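
For context, that product looks roughly like this as two direct dgemm calls (a simplified sketch with hypothetical names, not the actual wrapper code):

```cpp
#include <cblas.h>
#include <vector>

// K = transpose(B) * C * B, with B m x n and C m x m (column-major),
// computed as T = C*B followed by K = transpose(B)*T.
void btcb(int m, int n, const double* B, const double* C, double* K)
{
    std::vector<double> T((size_t)m * n);  // T = C * B, m x n
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, m, 1.0, C, m, B, m, 0.0, T.data(), m);
    // second product: let dgemm transpose B instead of doing it manually
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                n, n, m, 1.0, B, m, T.data(), m, 0.0, K, n);
}
```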

@martin-frbg
Collaborator

You may get slightly better performance with 0.3.3 or the current develop branch compared to 0.2.20, but with 4x6 matrices there will be no speedup from multithreading, and OpenBLAS' code layout may actually add overhead compared to just performing the calculation with the reference BLAS algorithm.
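
One quick check (my suggestion; `openblas_set_num_threads` is the extension OpenBLAS exports for this) is whether forcing single-threaded execution closes part of the gap for these tiny matrices:

```cpp
// OpenBLAS-specific extension, declared in OpenBLAS' cblas.h;
// equivalently, set the environment variable OPENBLAS_NUM_THREADS=1.
extern "C" void openblas_set_num_threads(int num_threads);

int main()
{
    openblas_set_num_threads(1);  // rule out threading overhead on 4x6 matrices
    // ... run test_dgemm() from above ...
}
```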

@brada4
Contributor

brada4 commented Oct 31, 2018

Are you using the Intel compiler? It may have some heuristics for MKL that it does not apply to other BLAS libraries.

@fenrus75
Contributor

#1914 will make the small matrix go faster.

@martin-frbg
Collaborator

martin-frbg commented Dec 13, 2018

> #1914 will make the small matrix go faster

won't help with his DGEMM problem on Kaby Lake though, until this gets implemented for more than "just" SGEMM on AVX512...

@fenrus75
Contributor

fenrus75 commented Dec 13, 2018 via email
