Blaze (in some cases) 2x faster than OpenBLAS #821
The number of threads used may have played a role (OpenBLAS may have picked too many by default), and of the three systems you mentioned in JuliaLang/julia#810, I suspect only the Xeon E3 has a substantially optimized sgemv kernel in OpenBLAS. (Does this result carry over to other functions as well, seeing that your benchmark calls only sgemv?)
I use the CPU core count as the number of threads. I don't know if that's the proper choice.
Sorry for misreading earlier. Perhaps it could be useful to compare single-thread performance as well; see the sketch below. And as far as I can tell, the segfaults are thought to be fixed post 0.2.15 (#697).
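A minimal sketch of how one might pin OpenBLAS to a single thread for that comparison. `openblas_set_num_threads` is an OpenBLAS extension declared in its `cblas.h`; the benchmark body is left as a placeholder:

```cpp
#include <cblas.h>  // OpenBLAS's cblas.h also declares its extensions

int main() {
    // Pin OpenBLAS to a single thread so the comparison is not skewed
    // by threading overhead. Setting OPENBLAS_NUM_THREADS=1 in the
    // environment before launch has the same effect.
    openblas_set_num_threads(1);

    // ... run the sgemv benchmark here ...
    return 0;
}
```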
Multiplication by 1.0 or adding (float)0 is still a math operation; it is somehow omitted in the Blaze example part.
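To illustrate the point: `cblas_sgemv` always computes `y = alpha*A*x + beta*y`, so the `alpha = 1.0f` / `beta = 0.0f` scaling is part of the call's contract, whereas Blaze's `y = A * x` expresses the bare product. A minimal side-by-side sketch (dimensions and fill values are illustrative, not from the original benchmark):

```cpp
#include <vector>
#include <cblas.h>
#include <blaze/Math.h>

void side_by_side(int n) {
    // CBLAS contract: y = alpha*A*x + beta*y, even when alpha = 1, beta = 0
    std::vector<float> a(n * n, 1.0f), x(n, 1.0f), y(n, 0.0f);
    cblas_sgemv(CblasRowMajor, CblasNoTrans, n, n,
                1.0f, a.data(), n, x.data(), 1, 0.0f, y.data(), 1);

    // Blaze expression: just the product, with no explicit scaling terms
    blaze::DynamicMatrix<float> A(n, n, 1.0f);
    blaze::DynamicVector<float> bx(n, 1.0f), by(n);
    by = A * bx;
}
```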
Just tried the Blaze library and was shocked.
Matrix/vector multiplication test code (used VS2013 and linked against the 0.2.15-mingw binary for maximum performance):
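A minimal sketch of a benchmark along these lines (matrix size, iteration count, and the timing method are assumptions, not the original code):

```cpp
#include <chrono>
#include <iostream>
#include <vector>
#include <cblas.h>
#include <blaze/Math.h>

int main() {
    const int n = 2000, iters = 1000;  // assumed problem size

    // OpenBLAS side: y = A * x via cblas_sgemv (alpha = 1, beta = 0)
    std::vector<float> a(n * n, 1.0f), x(n, 1.0f), y(n, 0.0f);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        cblas_sgemv(CblasRowMajor, CblasNoTrans, n, n,
                    1.0f, a.data(), n, x.data(), 1, 0.0f, y.data(), 1);
    auto t1 = std::chrono::steady_clock::now();

    // Blaze side: the same product written as an expression
    blaze::DynamicMatrix<float> A(n, n, 1.0f);
    blaze::DynamicVector<float> bx(n, 1.0f), by(n);
    auto t2 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        by = A * bx;
    auto t3 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "OpenBLAS sgemv: " << ms(t1 - t0).count() << " ms\n"
              << "Blaze y = A*x:  " << ms(t3 - t2).count() << " ms\n"
              << "(checksums: " << y[0] << ", " << by[0] << ")\n";
}
```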
Result: Blaze was, in some cases, about 2x faster than OpenBLAS.
I tested this on all my computers mentioned in JuliaLang/julia#810 and got similar results. In fact, this was quite consistent with their own benchmark results. Blaze seems to be purely header-only C++ (I might be wrong), and it was REALLY fast.
What's their trick? Could their technique benefit OpenBLAS? Or did I use OpenBLAS incorrectly?