OpenBLAS 'slower' than FOR LOOP for Matrix Multiplication ? #1636

Closed
cefengxu opened this issue Jun 22, 2018 · 7 comments

Comments

@cefengxu

I am comparing the speed of a traditional for loop against OpenBLAS for matrix multiplication; however, the result seems rather confusing.

Test code below:

    double start , finish ;
    int NUM = 100;
    double a1[3*2] = {  1, 2,  /* CblasRowMajor */
                        3, 4,
                        5, 6
                };
    double b[2*3] = {  7, 8, 9,   /* CblasRowMajor */
                       1, 2, 3
                };

    double c[3*3] = {0,0,0,0,0,0,0,0,0};

    start = clock();

    for( int k=0 ; k<NUM ;k++)
    {
        for(int i=0 ; i<3 ; i++)
        {
            for(int j=0;j<3;j++){
                for (int p = 0; p < 2 ; ++p) 
                    c[i*3 +j] = c[i*3 +j] + a1[i*2+p]*b[p*3+j];         
            }
        }
    }

    finish = clock();
    printf("FOR LOOP : %f seconds\n",(finish-start)/1000);

    start = clock();

    for( int k=0 ; k<NUM ;k++)
    {
        /* K must be 2 (A is 3x2, B is 2x3), not 5 */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 3, 3, 2, 1.0, a1, 2, b, 3, 0.0, c, 3);
    }
    finish = clock();
    printf("OpenBLAS : %f seconds\n",(finish-start)/1000);

I ran the code on a PC and on an Android platform, and got the following results:

// run on PC ( build with gcc )

FOR LOOP : 0.012000 seconds
OpenBLAS : 0.186000 seconds

FOR LOOP : 0.008000 seconds
OpenBLAS : 0.130000 seconds

// run on ANDROID ( build with clang )

FOR LOOP : 0.006000 sec
OpenBLAS : 0.303000 sec

FOR LOOP : 0.012000 sec
OpenBLAS : 0.214000 sec

@cefengxu cefengxu changed the title Seem OpenBLAS 'slower' than FOR LOOP for Matrix Multiplication OpenBLAS 'slower' than FOR LOOP for Matrix Multiplication ? Jun 22, 2018
@sandwichmaker

This is not a fair comparison.

  1. This is a trivially small matrix.
  2. OpenBLAS routines do a bunch of setup and error checking. That has overhead.
  3. Also, in your matrix-multiply loop you expose the matrix sizes to the compiler, which can then completely unroll it. OpenBLAS, on the other hand, is built without knowing the matrix sizes and optimizes for memory access instead.

@martin-frbg
Collaborator

martin-frbg commented Jun 22, 2018

I suspect using cblas_dgemm instead of dgemm (with matrix layout adapted to fortran calling conventions) carries an additional overhead (see https://www.christophlassner.de/using-blas-from-c-with-row-major-data.html ). Also you may want to retry with a current snapshot of the "develop" branch which should have significantly reduced thread startup times. (From #1632 you were building 0.2.20 yesterday, if you built for generic ARMV8 rather than CORTEXA57 this will have used the unoptimized C kernels. Current 0.3.0 version uses most of the optimized assembly kernels in generic ARMV8 mode as well)

@cefengxu
Author

Thanks all, I will try later and post my results here.

@cefengxu
Author

Referring to "I suspect using cblas_dgemm instead of dgemm": there seems to be no dgemm API in the develop branch.

I rebuilt from the develop branch and tried again, but the result is still no good. Maybe the matrix should be larger.

@martin-frbg
Collaborator

You will probably need to call it as dgemm_ , but if your matrix size is really just 3x2 (and in particular in the example you gave above, with the compiler able to unroll the loop) the simple loop will still win.
(You should see much less overhead for the thread creation with the develop branch, so the results should be better now than with 0.2.20 - or is that not the case ?)

@brada4
Contributor

brada4 commented Sep 19, 2018

The "test code" already "optimizes out" the scalar constants that DGEMM has to handle:

*  DGEMM  performs one of the matrix-matrix operations
*     C := alpha*op( A )*op( B ) + beta*C,
*  where  op( X ) is one of
*     op( X ) = X   or   op( X ) = X',
*  alpha and beta are scalars, and A, B and C are matrices, with op( A )
*  an m by k matrix,  op( B )  a  k by n matrix and  C an m by n matrix.

Besides, the compiler knows the input is static and can optimize out the 100-fold loop entirely (icl will do it, maybe clang too), as well as inline and subsequently vectorize the inner loops; neither is possible with a call into an external library.

There is also the case where reference BLAS is faster than OpenBLAS for small inputs, because the former does no parallelism setup whatsoever.

@martin-frbg
Collaborator

You could try https://github.com/giaf/blasfeo if your matrix sizes are always going to be very small. (Still not likely to beat the trivial optimizations the compiler can apply to your original example, as has been pointed out already.)
