Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM #2422

wjc404 · 2020-02-16T15:03:40Z

PS: I forgot to change the SKX GEMM3M parameters when I added those kernels two months ago; sorry about that. This PR fixes it.

wjc404 · 2020-02-17T00:28:41Z

The STRMM kernel passed a 1-thread reliability test.

marxin · 2020-02-17T08:40:08Z

> The STRMM kernel passed a 1-thread reliability test.

Could you please post the results as text in a comment rather than as a screenshot?

wjc404 · 2020-02-17T15:34:06Z

@marxin I've closed the terminals that were running the STRMM tests, so the text output cannot be retrieved. The test program was attached to a comment on PR #2361.
BTW, have you tested the 1-thread and 8-thread SGEMM performance of MKL, OpenBLAS-0.3.7 and OpenBLAS-0.3.8 on the Ryzen 7 2700X?

marxin · 2020-02-18T09:45:16Z

> @marxin I've closed the terminals that were running the STRMM tests, so the text output cannot be retrieved. The test program was attached to a comment on PR #2361.

Sure, I meant for the next time you post results. Thanks.

TiborGY · 2020-02-18T12:45:08Z

> > The STRMM kernel passed a 1-thread reliability test.
>
> Could you please post the results as text in a comment rather than as a screenshot?

Just out of curiosity, why do you prefer text?

marxin · 2020-02-18T12:51:43Z

> Just out of curiosity, why do you prefer text?

The screenshots are displayed scaled down, so I have to click on them to view them at full size. Moreover, one can't copy commands from a screenshot. I also can't see how the tests were run, because that part has scrolled out of view.

wjc404 · 2020-02-22T16:20:36Z

The new AVX2 SGEMM kernel passed 2 reliability tests (using the test program attached to PR #2300).
testlog_avx2_serial_sgemm_20190223_epycrome.txt

testlog_avx2_serial_sgemm_20190223_epycrome_2.txt

wjc404 · 2020-02-25T05:27:10Z

The modified AVX512 SGEMM kernel passed a 1-thread test.

testlog_aliyun_sgemm_avx512_1thread.txt

wjc404 · 2020-02-26T10:45:15Z

Recently I found a performance issue on Haswell-EP and Broadwell-EP processors: the execution of register-to-register vector permutation instructions can interfere with that of FMA instructions (this is not an issue on Skylake-client or Zen2 processors). Changing the vector permutation instructions to the memory-to-register form avoids the issue. With the changes in this PR, the 1-thread performance of CGEMM and ZGEMM on these processors rose by about 15% (to 98% of MKL 2018).

martin-frbg · 2020-02-26T10:57:59Z

That is an interesting quirk of the EP models. Please advise whether it would make coding easier for you if these were detected as a separate TARGET type, or with a specific #define available at compile time.

wjc404 · 2020-02-26T11:38:17Z

Probably not necessary, in my opinion. Vector permutation instructions are not commonly used in level-3 kernels.

marxin · 2020-02-27T09:44:46Z

> Recently I found a performance issue on Haswell-EP and Broadwell-EP processors: the execution of register-to-register vector permutation instructions can interfere with that of FMA instructions (this is not an issue on Skylake-client or Zen2 processors).

May I ask how you noticed that? It's an interesting observation.

About the introduction of the new sgemm_kernel_8x4_haswell_2.c kernel: can you please describe how its implementation differs?

wjc404 · 2020-02-27T15:03:49Z

Removed redundant prefetch instructions from the AVX2 c/z GEMM kernels (they actually do a little harm).

wjc404 · 2020-02-27T15:08:39Z

@marxin I found it by accident in a test, when I saw a significant performance gain after changing those permute instructions. I confirmed it with the test program in "test_fma_perm_instflow.zip".
In the kernel "sgemm_kernel_8x4_haswell_2.c", I reordered the calculation steps and rearranged 2 adjacent 8rowX12col stores (to matrix C) into 2 16rowX6col stores, while maintaining the same read rate from L3 cache.
A 16rowX6col write to a block of column-major matrix C in an 8-way set-associative cache carries little risk of the conflict misses that can occur in an 8rowX12col write.
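
To make the conflict-miss argument concrete, here is a minimal sketch that is not taken from the PR. It counts how many distinct cache lines of a column-major block of C land in the same cache set. The cache geometry (32 KiB, 8-way, 64-byte lines, hence 64 sets), the 64-byte-aligned start of the block, and the choice of LDC = 1024 floats (the case named in the commit "Fix performance bug when LDC is a multiple of 1024") are assumptions made for illustration, and `max_lines_per_set` is a hypothetical helper name.

```python
from collections import Counter

# Assumed cache geometry (not stated in the PR beyond "8-way"):
LINE_BYTES = 64    # cache-line size in bytes
WAYS = 8           # associativity
NUM_SETS = 64      # 32 KiB / (64 B * 8 ways)
FLOAT_BYTES = 4    # single precision, as in SGEMM


def max_lines_per_set(rows, cols, ldc, base=0):
    """Largest number of distinct cache lines of a rows x cols block of a
    column-major float matrix (leading dimension ldc, in elements) that
    map to a single cache set."""
    lines = set()
    for j in range(cols):
        for i in range(rows):
            addr = base + (j * ldc + i) * FLOAT_BYTES
            lines.add(addr // LINE_BYTES)
    per_set = Counter(line % NUM_SETS for line in lines)
    return max(per_set.values())


if __name__ == "__main__":
    for ldc in (1024, 1000):
        for rows, cols in ((8, 12), (16, 6)):
            worst = max_lines_per_set(rows, cols, ldc)
            verdict = "exceeds" if worst > WAYS else "fits within"
            print(f"LDC={ldc:5d} {rows:2d}x{cols:<2d}: "
                  f"{worst:2d} lines in the busiest set ({verdict} {WAYS} ways)")
```

Under these assumptions, with LDC = 1024 every column of C starts exactly 4096 bytes after the previous one and therefore maps to the same set. The 8x12 block then places 12 lines in one set, more than the 8 available ways, while the 16x6 block places only 6. With a leading dimension such as 1000 the columns spread over different sets and neither shape is at risk.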

marxin · 2020-02-28T09:51:07Z

> In the kernel "sgemm_kernel_8x4_haswell_2.c", I reordered the calculation steps and rearranged 2 adjacent 8rowX12col stores (to matrix C) into 2 16rowX6col stores, while maintaining the same read rate from L3 cache.

Nice. That seems smart.
Btw, I've spent some time reading "Anatomy of High-Performance Matrix Multiplication" (2008) and also the new kernels you have written. I still find the implementation quite difficult to understand. Is there anything that could help me understand it? Thanks.

wjc404 added 3 commits February 16, 2020 22:58

AVX512 STRMM kernel

e3368cb

Update KERNEL.SKYLAKEX

f566787

Update param.h

b0558c1

martin-frbg changed the title ~~AVX512 STRMM kernel~~ Adjust SkylakeX GEMM3M parameters and add an AVX512 STRMM kernel Feb 16, 2020

martin-frbg added this to the 0.3.9 milestone Feb 16, 2020

wjc404 added 5 commits February 22, 2020 23:37

Fix performance bug when LDC is a multiple of 1024

f6fcbd7

Delete sgemm_kernel_8x4_haswell_2.c

f1746e7

Update KERNEL.HASWELL

97a32cb

Update KERNEL.ZEN

a2ff577

Add files via upload

903854c

wjc404 changed the title ~~Adjust SkylakeX GEMM3M parameters and add an AVX512 STRMM kernel~~ Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix a performance bug in AVX2 SGEMM Feb 22, 2020

wjc404 added 2 commits February 26, 2020 18:36

Update cgemm_kernel_8x2_haswell.c

2515e11

Update zgemm_kernel_4x2_haswell.c

1b98000

wjc404 changed the title ~~Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix a performance bug in AVX2 SGEMM~~ Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM Feb 26, 2020

martin-frbg mentioned this pull request Feb 27, 2020

Restore ZEN SGEMM speed after #2361. #2430

Closed

wjc404 added 2 commits February 27, 2020 22:25

Update zgemm_kernel_4x2_haswell.c

2352331

Update cgemm_kernel_8x2_haswell.c

dd22eb7

martin-frbg merged commit ea8eec5 into OpenMathLib:develop Feb 29, 2020
