Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM #2422

Merged 12 commits on Feb 29, 2020

Conversation

wjc404
Contributor

@wjc404 wjc404 commented Feb 16, 2020

PS: I forgot to change the SKX GEMM3M parameters when I added those kernels 2 months ago; sorry about that. This PR fixes it.

@martin-frbg changed the title from "AVX512 STRMM kernel" to "Adjust SkylakeX GEMM3M parameters and add an AVX512 STRMM kernel" on Feb 16, 2020
@martin-frbg added this to the 0.3.9 milestone on Feb 16, 2020
@wjc404
Contributor Author

wjc404 commented Feb 17, 2020

The STRMM kernel passed a 1-thread reliability test.
Screenshot from 2020-02-17 08-26-29

@marxin
Contributor

marxin commented Feb 17, 2020

The STRMM kernel passed a 1-thread reliability test.

Could you please post the results directly as text in a comment rather than as a screenshot?

@wjc404
Contributor Author

wjc404 commented Feb 17, 2020

@marxin I've already closed the terminals that ran the STRMM tests, so the text output cannot be retrieved. The test program was attached in a comment on PR #2361.
BTW, have you tested the 1-thread and 8-thread SGEMM performance of MKL, OpenBLAS-0.3.7 and OpenBLAS-0.3.8 on the R7 2700X?

@marxin
Contributor

marxin commented Feb 18, 2020

@marxin I've already closed the terminals that ran the STRMM tests, so the text output cannot be retrieved. The test program was attached in a comment on PR #2361.

Sure, I meant for the next time you post results. Thanks.

@TiborGY
Contributor

TiborGY commented Feb 18, 2020

The STRMM kernel passed a 1-thread reliability test.

Could you please post the results directly as text in a comment rather than as a screenshot?

Just out of curiosity, why do you prefer text?

@marxin
Contributor

marxin commented Feb 18, 2020

Just out of curiosity, why do you prefer text?

The screenshots are displayed scaled down, so I need to click on them to view them at full size. Moreover, one can't copy commands from screenshots. And I also can't see how the tests were run, because the terminal is scrolled down.

@wjc404 changed the title from "Adjust SkylakeX GEMM3M parameters and add an AVX512 STRMM kernel" to "Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix a performance bug in AVX2 SGEMM" on Feb 22, 2020
@wjc404
Contributor Author

wjc404 commented Feb 22, 2020

The new AVX2 SGEMM kernel passed 2 reliability tests (using the test program attached in PR #2300).
testlog_avx2_serial_sgemm_20190223_epycrome.txt
Screenshot from 2020-02-23 00-17-47

testlog_avx2_serial_sgemm_20190223_epycrome_2.txt
Screenshot from 2020-02-23 10-18-16

@wjc404
Contributor Author

wjc404 commented Feb 25, 2020

The modified AVX512 SGEMM kernel passed a 1-thread test.
Screenshot from 2020-02-25 13-24-41
testlog_aliyun_sgemm_avx512_1thread.txt

@wjc404 changed the title from "Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix a performance bug in AVX2 SGEMM" to "Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM" on Feb 26, 2020
@wjc404
Contributor Author

wjc404 commented Feb 26, 2020

Recently I found a performance issue on Haswell-EP and Broadwell-EP processors: the execution of register-to-register vector permutation instructions can interfere with that of FMA instructions (which is not an issue on Skylake-client or Zen2 processors). Changing the vector permutation instructions to their memory-to-register form avoids this issue. The 1-thread performance of CGEMM and ZGEMM rose by about 15% (to 98% of MKL 2018) on these processors after the changes in this PR.

@martin-frbg
Collaborator

That is an interesting quirk of the EP models. Please advise if it would make coding easier for you if these were detected as a separate TARGET type, or with a specific #define available at compile time?

@wjc404
Contributor Author

wjc404 commented Feb 26, 2020

Probably not necessary in my opinion. The usage of vector permutation instructions in level3 kernels is not common.

@marxin
Contributor

marxin commented Feb 27, 2020

Recently I found a performance issue on Haswell-EP and Broadwell-EP processors: the execution of register-to-register vector permutation instructions can interfere with that of FMA instructions (which is not an issue on Skylake-client or Zen2 processors).

May I ask how you noticed that? It's an interesting observation.

About the introduction of the new sgemm_kernel_8x4_haswell_2.c kernel: can you please describe how the implementation is different?

@wjc404
Contributor Author

wjc404 commented Feb 27, 2020

Removed the redundant prefetch instructions in the AVX2 c/z GEMM kernels (they actually do a little harm).

@wjc404
Contributor Author

wjc404 commented Feb 27, 2020

@marxin I found it by accident in a test, when I saw a significant performance gain after changing those permute instructions. I confirmed it with the test program in "test_fma_perm_instflow.zip".
In the kernel "sgemm_kernel_8x4_haswell_2.c", I reordered the calculation steps, rearranged 2 adjacent 8rowX12col stores (to matrix C) to 2 16rowX6col stores, while maintaining the same reading rate from L3 cache.
With a 16rowX6col write to a block of the column-major matrix C in an 8-way set-associative cache, there is little risk of the conflict misses that can occur with an 8rowX12col write.
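
For readers following along, the conflict-miss argument can be checked with a small model. This is only a sketch and not code from the PR. It assumes a 32 KiB, 8-way cache with 64-byte lines (64 sets), single-precision elements, and a block that starts on a cache-line boundary; the comment above does not name the cache level, so the geometry is an assumption. The helper names `max_lines_per_set` and `may_conflict` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache geometry (not stated in the thread): 64-byte lines,
   64 sets, 8 ways, i.e. a 32 KiB 8-way set-associative cache. */
enum { LINE_BYTES = 64, NUM_SETS = 64, WAYS = 8 };

/* Hypothetical helper: for a rows x cols block of floats in a column-major
   matrix with leading dimension ldc (in elements), starting at a
   line-aligned address, return the largest number of distinct cache lines
   that map to any single set. */
int max_lines_per_set(int rows, int cols, size_t ldc)
{
    int per_set[NUM_SETS] = {0};
    int worst = 0;
    int have_prev = 0;
    size_t prev_line = 0;

    assert(rows > 0 && cols > 0 && ldc >= (size_t)rows);

    for (int j = 0; j < cols; j++) {
        /* Byte offsets of the first and last element written in column j. */
        size_t begin = (size_t)j * ldc * sizeof(float);
        size_t end = begin + (size_t)rows * sizeof(float) - 1;
        for (size_t line = begin / LINE_BYTES; line <= end / LINE_BYTES; line++) {
            /* Lines are visited in non-decreasing order, so a repeat can
               only be the line seen immediately before. */
            if (have_prev && line == prev_line)
                continue;
            have_prev = 1;
            prev_line = line;
            int set = (int)(line % NUM_SETS);
            if (++per_set[set] > worst)
                worst = per_set[set];
        }
    }
    return worst;
}

/* More lines in one set than there are ways means the block cannot stay
   fully resident, so conflict misses become possible. */
int may_conflict(int rows, int cols, size_t ldc)
{
    return max_lines_per_set(rows, cols, ldc) > WAYS;
}
```

Under these assumptions, a leading dimension of 1024 floats gives a column stride of 4096 bytes, so every column lands in the same set: `max_lines_per_set(8, 12, 1024)` yields 12, which exceeds the 8 ways, while `max_lines_per_set(16, 6, 1024)` yields 6, which fits. For other leading dimensions the columns generally spread over more sets.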

@marxin
Contributor

marxin commented Feb 28, 2020

In the kernel "sgemm_kernel_8x4_haswell_2.c", I reordered the calculation steps, rearranged 2 adjacent 8rowX12col stores (to matrix C) to 2 16rowX6col stores, while maintaining the same reading rate from L3 cache.

Nice. That seems smart.
Btw, I've spent some time reading "Anatomy of High-Performance Matrix Multiplication" (2008) and also the new kernels you have written. I still find the implementation quite difficult to understand. Is there anything that could help me understand it? Thanks.

@martin-frbg merged commit ea8eec5 into OpenMathLib:develop on Feb 29, 2020