Performance of dgemm #1840
Comments
What are the dimensions (M, N, K) of the matrices? Is this numpy? Could you escape the asterisks in your post with backslashes or similar? It is hard to tell code apart from text. Could you also run "perf record python sample.py" followed by "perf report" against the code you run with OpenBLAS, and post roughly the last 10 lines of the text output between triple backticks? The copy/transpose routines (which include the alpha/beta scaling) typically take <1% of GEMM execution time, unless you go to size extremes, e.g. a matrix only two values wide, where gemv would probably serve better. |
I'm using OpenBLAS-0.2.20 and calling BLAS from my own C++ code. Since the actual calls go through high-level wrappers, I don't really know how to post the code in a way that would be useful. In any case, I set up a more straightforward test that performs direct BLAS calls, and now the timings are the same for transposed and non-transposed dgemm calls. Here's the code:
Here are the timings I get ... Using MKL BLAS: This was on a desktop with a Core i7-8700 (6 cores, 3.20 GHz). The main code I'm running performs finite element calculations, so there are many matrix products of the form (B^T)(C)(B). Strangely, I'm not seeing any performance benefit from deferring the manual transposition of the first matrix and letting dgemm handle it instead. (That doesn't jibe with the results above at all; it's as if the manual transposition takes no time, which should not be the case.) I also don't know why MKL does so much better in the test above. I do notice that in my actual simulation the speed-up is around 20% when I use MKL instead of OpenBLAS, but that is overall time, not just BLAS time. |
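For illustration, the (B^T)(C)(B) product mentioned above can be evaluated with two GEMM calls, using the transpose flag in place of an explicit B^T. This is a minimal sketch using SciPy's low-level dgemm wrapper as a stand-in for the cblas_dgemm calls discussed in this thread; the matrix sizes are made up, not taken from the poster's code:

```python
import numpy as np
from scipy.linalg.blas import dgemm

# Hypothetical element sizes -- not from the thread.
n_rows, n_cols = 4, 6
rng = np.random.default_rng(0)
B = np.asfortranarray(rng.random((n_rows, n_cols)))
C = np.asfortranarray(rng.random((n_rows, n_rows)))

# K = B^T C B via two dgemm calls. trans_a=1 asks BLAS to read the
# first operand as its transpose, so the caller never materialises B^T.
CB = dgemm(1.0, C, B)             # C @ B,    shape (n_rows, n_cols)
K = dgemm(1.0, B, CB, trans_a=1)  # B^T @ CB, shape (n_cols, n_cols)

print(np.allclose(K, B.T @ C @ B))  # True
```

Whether the trans_a path is actually faster than a manual transpose is exactly what this thread is debating; the sketch only shows that the two routes compute the same result.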
You may get slightly better performance with 0.3.3 or the current develop branch. |
Are you using the Intel compiler? It may have heuristics that favor MKL but not other BLAS libraries. |
#1914 will make the small matrix go faster
won't help with his DGEMM problem on Kaby Lake though, until this gets implemented for more than "just" SGEMM on AVX512... |
Uh, yeah, true... too early, need coffee. Sorry for the noise.
|
When I perform calculations of the type C = (transpose(A))B, I noticed that with OpenBLAS I don't gain any speedup from calling cblas_dgemm with the right flag to indicate that A is transposed, compared to doing a manual transposition myself, i.e. allocating a matrix D, filling in D(i,j) = A(j,i), and only then computing C = DB via cblas_dgemm. I observe the same behavior for A times transpose(B). With MKL BLAS, on the other hand, the first option, which skips the manual transposition, runs about 50% faster. Is this because OpenBLAS makes hidden copies when transA or transB is set to CblasTrans?
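The two approaches being compared can be sketched as follows, again with SciPy's low-level dgemm wrapper standing in for cblas_dgemm and with hypothetical sizes; only the performance, not the result, should differ between them:

```python
import numpy as np
from scipy.linalg.blas import dgemm

# Hypothetical sizes; the thread does not give the actual dimensions.
m, k, n = 50, 80, 30
rng = np.random.default_rng(1)
A = np.asfortranarray(rng.random((k, m)))  # we want C = A^T B
B = np.asfortranarray(rng.random((k, n)))

# Option 1: pass the transpose flag and let dgemm read A as A^T.
C1 = dgemm(1.0, A, B, trans_a=1)

# Option 2: materialise D = A^T first (an extra allocation plus an
# element-by-element copy), then call dgemm with no transpose flag.
D = np.asfortranarray(A.T)
C2 = dgemm(1.0, D, B)

print(np.allclose(C1, C2))  # True; the question is which runs faster
```

The question in this thread is whether Option 1 is genuinely cheaper (as it appears to be with MKL) or whether the library internally performs a copy comparable to Option 2 anyway.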