Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM #2422
Could you please post the results as plain text in a comment rather than as a screenshot?
Just out of curiosity, why do you prefer text?
The screenshots are shown scaled down, so I have to click on them to view them at full size. Moreover, one can't copy commands from screenshots. I also can't see how the tests were run, because that part is scrolled out of view.
The new AVX2 SGEMM kernel passed 2 reliability tests (using the test program attached in PR #2300).
The modified AVX512 SGEMM kernel passed a 1-thread test.
Recently I found a performance issue on Haswell-EP and Broadwell-EP processors: executing register-to-register vector permutation instructions can interfere with the execution of FMA instructions (this is not an issue on Skylake-client or Zen2 processors). Changing the vector permutation instructions to their memory-to-register form avoids the issue. With the changes in this PR, single-thread CGEMM and ZGEMM performance on these processors rose by about 15% (to 98% of MKL 2018).
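To make the two instruction forms concrete, here is a minimal C sketch. It is not code from this PR. The function names, the choice of `vpermilps` with immediate `0xb1` (swapping the real and imaginary parts of interleaved complex floats), and the GCC/Clang inline-assembly style are all assumptions made for illustration. It only shows that the register-source and memory-source forms compute the same result; it does not measure the performance effect described above.

```c
#include <string.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX_ASM_PATH 1
#endif

/* Portable reference: swap the real and imaginary parts of four
 * interleaved complex floats (re0, im0, re1, im1, ...). */
static void swap_pairs_scalar(const float *src, float *dst)
{
    for (int i = 0; i < 8; i += 2) {
        dst[i]     = src[i + 1];
        dst[i + 1] = src[i];
    }
}

#ifdef HAVE_AVX_ASM_PATH
/* Register-to-register form (hypothetical helper): load the data into
 * a register first, then permute that register. */
static __attribute__((target("avx")))
void swap_pairs_reg(const float *src, float *dst)
{
    __m256 v = _mm256_loadu_ps(src);
    __m256 r;
    __asm__("vpermilps $0xb1, %1, %0" : "=x"(r) : "x"(v));
    _mm256_storeu_ps(dst, r);
}

/* Memory-to-register form (hypothetical helper): the permute reads
 * its source operand directly from memory. */
static __attribute__((target("avx")))
void swap_pairs_mem(const float *src, float *dst)
{
    __m256 r;
    __asm__("vpermilps $0xb1, %1, %0"
            : "=x"(r)
            : "m"(*(const float (*)[8])src));
    _mm256_storeu_ps(dst, r);
}
#endif

/* Returns 1 if every variant available on this machine matches the
 * scalar reference. On CPUs without AVX only the reference runs. */
static int permute_forms_agree(const float *src)
{
    float ref[8];
    swap_pairs_scalar(src, ref);
#ifdef HAVE_AVX_ASM_PATH
    if (__builtin_cpu_supports("avx")) {
        float a[8], b[8];
        swap_pairs_reg(src, a);
        swap_pairs_mem(src, b);
        return memcmp(a, ref, sizeof ref) == 0 &&
               memcmp(b, ref, sizeof ref) == 0;
    }
#endif
    return 1;
}
```

Inline assembly is used here because, with plain intrinsics, the compiler is free to pick either form itself. A hand-written assembly kernel makes that choice explicitly.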
That is an interesting quirk of the EP models - please advise if it would make coding easier for you if these were detected as a separate TARGET type, or with a specific
Probably not necessary, in my opinion. Vector permutation instructions are not commonly used in level-3 kernels.
May I ask how you noticed that? It's an interesting observation. About the introduction of the
Remove redundant prefetch instructions from the AVX2 c/z GEMM kernels (they actually hurt performance slightly).
@marxin I found it by accident during a test, when I saw a significant performance gain after changing those permute instructions. I confirmed it with the test program in "test_fma_perm_instflow.zip".
Nice. That seems smart.
PS: I forgot to change the SKX GEMM3M parameters when I added those kernels 2 months ago, sorry about that. That is fixed in this PR.