Optimize AVX512 DGEMM (& DTRMM) #2384


Merged (12 commits) Feb 7, 2020

Conversation

wjc404 (Contributor) commented Feb 3, 2020

Replace KERNEL_8x24 with KERNEL_16x12. This leaves more room to raise GEMM_Q, which reduces how often matrix C (in main memory) is read and written, and thus improves parallel performance.
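As a rough illustration of why this helps (this is a simplified sketch, not the actual OpenBLAS driver code, and it omits the packing of A/B and the M-blocking by GEMM_P), the loop order below touches every element of C once per GEMM_Q-deep pass over K, so C is read and written about ceil(k / GEMM_Q) times in total; raising GEMM_Q directly cuts that traffic:

```c
#include <stddef.h>

/* Unoptimized sketch of a GotoBLAS-style blocked DGEMM loop order.
 * C (m x n, column-major, leading dimension ldc) += A (m x k) * B (k x n). */
static void dgemm_blocked_sketch(size_t m, size_t n, size_t k,
                                 const double *A, size_t lda,
                                 const double *B, size_t ldb,
                                 double *C, size_t ldc,
                                 size_t GEMM_Q)
{
    for (size_t kk = 0; kk < k; kk += GEMM_Q) {      /* one pass per K-block */
        size_t kb = (k - kk < GEMM_Q) ? (k - kk) : GEMM_Q;
        /* In the real kernel this K-block of A is packed so it stays in L2,
         * and the inner update is done by the MR x NR micro-kernel
         * (16x12 after this PR, 8x24 before). */
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < m; ++i) {
                double sum = 0.0;
                for (size_t p = 0; p < kb; ++p)
                    sum += A[i + (kk + p) * lda] * B[(kk + p) + j * ldb];
                C[i + j * ldc] += sum;   /* C is touched once per K-pass */
            }
    }
}
```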

wjc404 (Contributor, Author) commented Feb 3, 2020

DGEMM test on an 8c/16t KVM instance (Huawei Cloud) backed by an Intel Xeon Gold 6266C at 2.3 GHz:

1-thread test with m=n=k=5999 (theoretical peak 73.6 GFLOPS):
old kernel: [screenshot 2020-02-03 22-33-26]
new kernel: [screenshot 2020-02-03 22-33-29]

8-thread test with m=n=k=9999 (theoretical peak 589 GFLOPS):
old kernel: [screenshot 2020-02-03 22-33-06]
new kernel: [screenshot 2020-02-03 22-33-18]
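For reference, the theoretical figures quoted above follow from one AVX-512 core with two FMA units retiring 32 double-precision FLOPs per cycle. A quick sanity-check sketch is below; the time value t is a placeholder, not a measurement from these runs:

```c
/* Peak = frequency * DP FLOPs per cycle:
 *   1 core  @ 2.3 GHz * 32 -> 73.6 GFLOPS
 *   8 cores @ 2.3 GHz * 32 -> 588.8 ~ 589 GFLOPS
 * Measured GFLOPS for square DGEMM is 2*m*n*k / time. */
#include <stdio.h>

int main(void)
{
    const double ghz = 2.3, flops_per_cycle = 32.0;  /* 2 AVX-512 FMA units */
    printf("1-core peak: %.1f GFLOPS\n", ghz * flops_per_cycle);
    printf("8-core peak: %.1f GFLOPS\n", 8.0 * ghz * flops_per_cycle);

    double m = 5999.0, t = 6.0;          /* t is a placeholder wall time, s */
    printf("measured:    %.1f GFLOPS\n", 2.0 * m * m * m / t / 1e9);
    return 0;
}
```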

wjc404 (Contributor, Author) commented Feb 4, 2020

The new DTRMM kernel passed the 1-thread reliability test.
[screenshot 2020-02-04 08-15-33]
Test code: dtrmm_compare_test.zip

wjc404 (Contributor, Author) commented Feb 4, 2020

The new DGEMM kernel passed the 1-thread reliability test.
[screenshot 2020-02-04 16-47-06]

wjc404 (Contributor, Author) commented Feb 4, 2020

Reduced DGEMM_R to avoid a segfault when executing serial DGEMM with m=3000 and n=k=12000.

wjc404 (Contributor, Author) commented Feb 5, 2020

DGEMM test on a KVM instance (Aliyun, bound to 8 physical cores on one die) backed by an Intel Xeon Platinum 8269CY at 2.5 GHz:

| library | 1-thread perf. (dim ~7000) | 8-thread perf. (dim ~20000) |
| --- | --- | --- |
| MKL 2019 | 69.8 GFLOPS | 591 GFLOPS |
| OpenBLAS develop, before this PR | 69.6 GFLOPS | 537 GFLOPS |
| OpenBLAS develop, after this PR | 72.6 GFLOPS | 579 GFLOPS |
| Theoretical peak | 80.0 GFLOPS | 640 GFLOPS |

@martin-frbg Have you tested the performance change on a SKX platform? (I cannot access my i7-9800X computer now because of the outbreak of 2019 Novel Coronavirus in my country.)

martin-frbg (Collaborator) commented:

Will try to get to that in a minute (though all I have is an older W2123, only four cores / eight threads).

martin-frbg (Collaborator) commented:

Preliminary results on a Xeon W2125, 4c/8t (before the very latest commits); the OpenBLAS numbers for 6/8 threads look wrong. dgemmtest_new was obtained from your GEMM_AVX2_FMA3 repository.

Results in GFLOPS; the matrix sizes tested were 2000x2000, 7000x7000 and 20000x20000 (not every size was run at every thread count).

1 thread
MKL: 86.1 / 97.0 / 98.5
OB before: 91.9 / 97.1
OB after: 94.0 / 103.8 / 105.3

4 threads
MKL: 380.9 / 394.3
OB before: 304.2 / 328.5
OB after: 370.3 / 397.5

6 threads
MKL: 394.0
OB before: 220.0
OB after: 224.5

8 threads
MKL: 394.3
OB before: 281.6
OB after: 250.6

wjc404 (Contributor, Author) commented Feb 6, 2020

@martin-frbg Thank you very much.

martin-frbg (Collaborator) commented:

20000 may be too small to justify more than 4 threads. Watching a rerun of the OMP_NUM_THREADS=6 case, I see it start out with 4 threads (MKL?) and switch to 6 later, where top shows 4x100 percent utilisation but only around 45 percent on the other two.

wjc404 (Contributor, Author) commented Feb 6, 2020

Probably MKL has a mechanism to detect the number of physical cores and limit the number of threads accordingly (to one thread per core). I noticed something strange when running DGEMM tests on the "8c/16t" Huawei cloud instance mentioned above (not the Aliyun one): OpenBLAS (after this PR) got ~1000 GFLOPS with OMP_NUM_THREADS=16, while MKL got only 500-600 GFLOPS on that VM. Latency tests between the logical cores suggested there were actually 16 physical cores, but lscpu showed 16 threads on 8 cores.

martin-frbg (Collaborator) commented Feb 6, 2020

Quite likely, and/or it puts an upper limit on the number of threads based on problem size. There is code by Agner Fog to do the physical/logical core counting on Intel and AMD in https://github.com/vectorclass/add-on/tree/master/physical_processors, but it seems even he cannot do it on Intel processors without resorting to assumptions.
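For what it's worth, a rough Linux-only alternative (not Agner Fog's CPUID-based method) is to count distinct (package, core) pairs from sysfs, which is essentially what lscpu reports; inside a VM this only reflects whatever topology the hypervisor chooses to expose, which is why it can disagree with latency measurements:

```c
#include <stdio.h>

/* Read one integer from a sysfs topology file for a given logical CPU. */
static int read_int(const char *fmt, int cpu, int *out)
{
    char path[128];
    snprintf(path, sizeof path, fmt, cpu);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    int ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok;
}

int main(void)
{
    int seen[4096][2], nseen = 0;   /* (package_id, core_id) pairs counted */

    for (int cpu = 0; ; ++cpu) {
        int pkg, core;
        if (!read_int("/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                      cpu, &pkg) ||
            !read_int("/sys/devices/system/cpu/cpu%d/topology/core_id",
                      cpu, &core))
            break;                  /* no more online logical CPUs */

        int dup = 0;
        for (int i = 0; i < nseen; ++i)
            if (seen[i][0] == pkg && seen[i][1] == core) { dup = 1; break; }
        if (!dup && nseen < 4096) {
            seen[nseen][0] = pkg;
            seen[nseen][1] = core;
            nseen++;
        }
    }
    printf("distinct physical cores exposed to this guest: %d\n", nseen);
    return 0;
}
```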

wjc404 (Contributor, Author) commented Feb 6, 2020

Performance degradation with more than one thread per core is expected with the new kernel. The packed block of matrix A occupies 576 kB (it cannot be made much smaller, because the limited main-memory bandwidth requires GEMM_Q and GEMM_P to be large enough), which fits in L2 when each core runs one thread. With two threads on a core, however, L2 is not large enough to hold the packed blocks from both threads. I saw a 14% performance degradation with 16 threads (compared to 8 threads) on an 8c/16t Aliyun KVM.
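To make the arithmetic explicit: the GEMM_P/GEMM_Q values below are illustrative ones that reproduce the quoted 576 kB, not necessarily the exact parameters chosen in this PR, and 1 MiB is the per-core L2 size of Skylake-SP / Cascade Lake-SP parts.

```c
/* Back-of-the-envelope check of the L2 argument above. */
#include <stddef.h>
#include <stdio.h>

int main(void)
{
    const size_t GEMM_P = 192, GEMM_Q = 384;       /* illustrative values  */
    const size_t packed_A = GEMM_P * GEMM_Q * sizeof(double);   /* bytes   */
    const size_t l2_per_core = 1024 * 1024;        /* 1 MiB L2 per core    */

    printf("packed A block:  %zu kB\n", packed_A / 1024);       /* 576 kB  */
    printf("1 thread/core:   %zu kB of %zu kB L2\n",
           packed_A / 1024, l2_per_core / 1024);                /* fits    */
    printf("2 threads/core:  %zu kB of %zu kB L2\n",
           2 * packed_A / 1024, l2_per_core / 1024);            /* too big */
    return 0;
}
```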

martin-frbg (Collaborator) commented:

Oh, right. This should be acceptable given the serious improvement overall, though it might make sense to put a note about this cache-flushing effect in the README and/or wiki.
