OpenBLAS 'slower' than FOR LOOP for Matrix Multiplication ? #1636
Comments
This is not a fair comparison.
I suspect using cblas_dgemm instead of dgemm (with the matrix layout adapted to Fortran calling conventions) carries additional overhead (see https://www.christophlassner.de/using-blas-from-c-with-row-major-data.html ). Also, you may want to retry with a current snapshot of the "develop" branch, which should have significantly reduced thread startup times. (From #1632 you were building 0.2.20 yesterday; if you built for generic ARMV8 rather than CORTEXA57, this will have used the unoptimized C kernels. The current 0.3.0 version uses most of the optimized assembly kernels in generic ARMV8 mode as well.)
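To illustrate what "adapting the layout to Fortran calling conventions" means, here is a minimal sketch. It does not call OpenBLAS at all: `ref_gemm_colmajor` is a hypothetical stand-in for a column-major GEMM without transposes or alpha/beta scaling, and `gemm_rowmajor_via_colmajor` is likewise a made-up name. The point is only the operand swap: a row-major m x n buffer is the same memory as a column-major n x m buffer holding the transpose, so computing C^T = B^T * A^T in column-major terms yields the row-major C without copying anything.

```c
#include <stddef.h>

/* Hypothetical stand-in for a Fortran-style GEMM (assumption: no transposes,
 * no alpha/beta). Computes C = A * B with every matrix stored COLUMN-major.
 * A is m x k with leading dimension lda, B is k x n (ldb), C is m x n (ldc). */
void ref_gemm_colmajor(int m, int n, int k,
                       const double *a, int lda,
                       const double *b, int ldb,
                       double *c, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += a[i + (size_t)p * lda] * b[p + (size_t)j * ldb];
            c[i + (size_t)j * ldc] = sum;
        }
    }
}

/* Row-major product C (m x n) = A (m x k) * B (k x n), using only the
 * column-major routine above. Seen as column-major data, the buffers hold
 * B^T (n x k), A^T (k x m) and C^T (n x m), so we pass B first, A second,
 * and swap m and n. No data is moved or transposed. */
void gemm_rowmajor_via_colmajor(int m, int n, int k,
                                const double *a,
                                const double *b,
                                double *c)
{
    ref_gemm_colmajor(n, m, k, b, n, a, k, c, n);
}
```

Calling `gemm_rowmajor_via_colmajor(3, 3, 2, a, b, c)` with a row-major 3x2 `a` and 2x3 `b` fills `c` with the row-major 3x3 product. The same argument reordering is what one would apply when calling a real column-major dgemm on row-major C arrays.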
Thanks all, I will try later and report the results here.
Referring to "I suspect using cblas_dgemm instead of dgemm": it seems no dgemm API can be found in the dev branch. I rebuilt from the dev branch and tried again, but the result is still no good. Maybe the matrix should be larger.
You will probably need to call it as dgemm_ , but if your matrix size is really just 3x2 (and in particular in the example you gave above, with the compiler able to unroll the loop) the simple loop will still win.
The "test code" already "optimizes" out the scalar constants.
Besides, the compiler knows that the input is static and can optimize out the 100-fold loop (icl will do it, maybe clang too). Actually, there are cases where reference BLAS is faster for small samples, since it involves no parallelism setup whatsoever.
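A rough sketch of how the loop side of such a benchmark could be made harder for the compiler to fold away. All names here (`naive_matmul`, `fill_matrix`, `time_naive_matmul`) are hypothetical and not taken from the original test code. The assumptions are: inputs are filled at run time instead of being static initializers, one input element is nudged on every repetition so the product is not loop-invariant, and one result element per repetition is accumulated into a `volatile` sink so the work cannot be discarded.

```c
#include <stdlib.h>
#include <time.h>

/* Naive row-major product: c (m x n) = a (m x k) * b (k x n). */
void naive_matmul(int m, int n, int k,
                  const double *a, const double *b, double *c)
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Fill x with run-time values in [0.5, 1.5] so the compiler cannot
 * treat the inputs as compile-time constants. */
void fill_matrix(double *x, int count, unsigned seed)
{
    srand(seed);
    for (int i = 0; i < count; ++i)
        x[i] = 0.5 + (double)rand() / RAND_MAX;
}

/* Runs the naive product `reps` times and returns elapsed CPU seconds.
 * Note: clock() measures processor time; for a multithreaded BLAS call
 * a wall-clock timer would be the fairer choice. */
double time_naive_matmul(int m, int n, int k, int reps,
                         double *a, const double *b, double *c,
                         double *checksum)
{
    volatile double sink = 0.0;   /* keeps the results observable */
    clock_t start = clock();
    for (int r = 0; r < reps; ++r) {
        a[r % (m * k)] += 1e-3;   /* change the input on every repetition */
        naive_matmul(m, n, k, a, b, c);
        sink += c[r % (m * n)];   /* consume one element of each result */
    }
    clock_t end = clock();
    *checksum = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Even with these precautions, a 3x2 case will mostly measure call and timer overhead, so the comparison only becomes informative once the matrices are considerably larger.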
You could try https://github.com/giaf/blasfeo if your matrix sizes are always going to be very small. (Still not likely to beat the trivial optimizations possible to the compiler in your original example, as has been pointed out already.)
I am comparing the speed of the traditional method (a plain for loop) and OpenBLAS for matrix multiplication; however, the results I obtained seem rather confusing.
The test code is below:
I ran the code on a PC and on the Android platform, and got the following results: