OpenBLAS 'slower' than FOR LOOP for Matrix Multiplication ? #1636

Closed
cefengxu opened this issue Jun 22, 2018 · 7 comments

Comments

@cefengxu

I am comparing the speed of a traditional for loop against OpenBLAS for matrix multiplication; however, the result seems rather confusing.

Test code below:

    double start , finish ;
    int NUM = 100;
    double a1[3*2] = {  1, 2,  /* CblasRowMajor */
                        3, 4,
                        5, 6
                };
    double b[2*3] = {  7, 8, 9,   /* CblasRowMajor */
                       1, 2, 3
                };

    double c[3*3] = {0,0,0,0,0,0,0,0,0};

    start = clock();

    for( int k=0 ; k<NUM ;k++)
    {
        for(int i=0 ; i<3 ; i++)
        {
            for(int j=0;j<3;j++){
                for (int p = 0; p < 2 ; ++p) 
                    c[i*3 +j] = c[i*3 +j] + a1[i*2+p]*b[p*3+j];         
            }
        }
    }

    finish = clock();
    printf("FOR LOOP : %f seconds\n",(finish-start)/1000);

    start = clock();

    for( int k=0 ; k<NUM ;k++)
    {
        /* K must be 2 (A is 3x2, B is 2x3), not 5 */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 3, 3, 2, 1.0, a1, 2, b, 3, 0.0, c, 3);
    }
    finish = clock();
    printf("OpenBLAS : %f seconds\n",(finish-start)/1000);

I ran the code on a PC and on an Android platform, and got the following results:

// run on PC ( build with gcc )

FOR LOOP : 0.012000 seconds
OpenBLAS : 0.186000 seconds

FOR LOOP : 0.008000 seconds
OpenBLAS : 0.130000 seconds

// run on ANDROID ( build with clang )

FOR LOOP : 0.006000 sec
OpenBLAS : 0.303000 sec

FOR LOOP : 0.012000 sec
OpenBLAS : 0.214000 sec

@cefengxu cefengxu changed the title Seem OpenBLAS 'slower' than FOR LOOP for Matrix Multiplication OpenBLAS 'slower' than FOR LOOP for Matrix Multiplication ? Jun 22, 2018
@sandwichmaker

This is not a fair comparison.

  1. This is a trivially small matrix.
  2. OpenBLAS routines do a bunch of setup and error checking. That has overhead.
  3. Also, in your matrix-multiply loop you expose the matrix sizes to the compiler, which can then completely unroll it. OpenBLAS, on the other hand, is built without knowing the matrix sizes and optimizes for memory access instead.

@martin-frbg
Collaborator

martin-frbg commented Jun 22, 2018

I suspect using cblas_dgemm instead of dgemm (with matrix layout adapted to fortran calling conventions) carries an additional overhead (see https://www.christophlassner.de/using-blas-from-c-with-row-major-data.html ). Also you may want to retry with a current snapshot of the "develop" branch which should have significantly reduced thread startup times. (From #1632 you were building 0.2.20 yesterday, if you built for generic ARMV8 rather than CORTEXA57 this will have used the unoptimized C kernels. Current 0.3.0 version uses most of the optimized assembly kernels in generic ARMV8 mode as well)

@cefengxu
Author

Thanks all, I will try later and post my results here.

@cefengxu
Author

Referring to "I suspect using cblas_dgemm instead of dgemm": there seems to be no dgemm API in the develop branch.

I rebuilt from the develop branch and tried again, but the result is still no good. Maybe the matrix should be larger.

@martin-frbg
Collaborator

You will probably need to call it as dgemm_ , but if your matrix size is really just 3x2 (and in particular in the example you gave above, with the compiler able to unroll the loop) the simple loop will still win.
(You should see much less overhead for the thread creation with the develop branch, so the results should be better now than with 0.2.20 - or is that not the case ?)

@brada4
Contributor

brada4 commented Sep 19, 2018

The "test code" already "optimizes out" the scalar constants that DGEMM has to handle:

*  DGEMM  performs one of the matrix-matrix operations
*     C := alpha*op( A )*op( B ) + beta*C,
*  where  op( X ) is one of
*     op( X ) = X   or   op( X ) = X',
*  alpha and beta are scalars, and A, B and C are matrices, with op( A )
*  an m by k matrix,  op( B )  a  k by n matrix and  C an m by n matrix.

Besides, the compiler knows the input is static and can optimize out the 100-fold loop entirely (icl will do it, maybe clang too), as well as inline and subsequently vectorize the inner loops; neither is possible with a call into an external library.

There is also the case where reference BLAS is faster than OpenBLAS for small inputs, because the former does no parallelism setup whatsoever.

@martin-frbg
Collaborator

You could try https://github.com/giaf/blasfeo if your matrix sizes are always going to be very small. (Still not likely to beat the trivial optimizations the compiler can apply to your original example, as has been pointed out already.)
