Optimize AVX512 DGEMM (& DTRMM) #2384


Merged (12 commits) Feb 7, 2020

Conversation

wjc404 (Contributor) commented Feb 3, 2020

Replace KERNEL_8x24 with KERNEL_16x12. This leaves more room to raise GEMM_Q, which reduces how often matrix C (in main memory) is read and written, and thus improves parallel performance.
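As a rough illustration of why this helps (this is a simplified sketch, not the actual OpenBLAS driver code, and it omits the packing of A/B and the M-blocking by GEMM_P), the loop order below touches every element of C once per GEMM_Q-deep pass over K, so C is read and written about ceil(k / GEMM_Q) times in total; raising GEMM_Q directly cuts that traffic:

```c
#include <stddef.h>

/* Unoptimized sketch of a GotoBLAS-style blocked DGEMM loop order.
 * C (m x n, column-major, leading dimension ldc) += A (m x k) * B (k x n). */
static void dgemm_blocked_sketch(size_t m, size_t n, size_t k,
                                 const double *A, size_t lda,
                                 const double *B, size_t ldb,
                                 double *C, size_t ldc,
                                 size_t GEMM_Q)
{
    for (size_t kk = 0; kk < k; kk += GEMM_Q) {      /* one pass per K-block */
        size_t kb = (k - kk < GEMM_Q) ? (k - kk) : GEMM_Q;
        /* In the real kernel this K-block of A is packed so it stays in L2,
         * and the inner update is done by the MR x NR micro-kernel
         * (16x12 after this PR, 8x24 before). */
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < m; ++i) {
                double sum = 0.0;
                for (size_t p = 0; p < kb; ++p)
                    sum += A[i + (kk + p) * lda] * B[(kk + p) + j * ldb];
                C[i + j * ldc] += sum;   /* C is touched once per K-pass */
            }
    }
}
```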

wjc404 (Contributor, Author) commented Feb 3, 2020

DGEMM test on an 8c/16t KVM instance (Huawei Cloud) backed by an Intel Xeon Gold 6266C at 2.3 GHz:

1-thread test with m=n=k=5999 (theoretical peak 73.6 GFLOPS):
old kernel: [screenshot 2020-02-03 22-33-26]
new kernel: [screenshot 2020-02-03 22-33-29]

8-thread test with m=n=k=9999 (theoretical peak 589 GFLOPS):
old kernel: [screenshot 2020-02-03 22-33-06]
new kernel: [screenshot 2020-02-03 22-33-18]
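For reference, the theoretical figures quoted above follow from one AVX-512 core with two FMA units retiring 32 double-precision FLOPs per cycle. A quick sanity-check sketch is below; the time value t is a placeholder, not a measurement from these runs:

```c
/* Peak = frequency * DP FLOPs per cycle:
 *   1 core  @ 2.3 GHz * 32 -> 73.6 GFLOPS
 *   8 cores @ 2.3 GHz * 32 -> 588.8 ~ 589 GFLOPS
 * Measured GFLOPS for square DGEMM is 2*m*n*k / time. */
#include <stdio.h>

int main(void)
{
    const double ghz = 2.3, flops_per_cycle = 32.0;  /* 2 AVX-512 FMA units */
    printf("1-core peak: %.1f GFLOPS\n", ghz * flops_per_cycle);
    printf("8-core peak: %.1f GFLOPS\n", 8.0 * ghz * flops_per_cycle);

    double m = 5999.0, t = 6.0;          /* t is a placeholder wall time, s */
    printf("measured:    %.1f GFLOPS\n", 2.0 * m * m * m / t / 1e9);
    return 0;
}
```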

wjc404 (Contributor, Author) commented Feb 4, 2020

The new DTRMM kernel passed the 1-thread reliability test.
[screenshot 2020-02-04 08-15-33]
Test code: dtrmm_compare_test.zip

wjc404 (Contributor, Author) commented Feb 4, 2020

The new DGEMM kernel passed the 1-thread reliability test.
[screenshot 2020-02-04 16-47-06]

wjc404 (Contributor, Author) commented Feb 4, 2020

Reduced DGEMM_R to avoid a segfault when executing serial DGEMM with m=3000 and n=k=12000.

wjc404 (Contributor, Author) commented Feb 5, 2020

DGEMM test on a KVM instance (Aliyun, bound to 8 physical cores on one die) backed by an Intel Xeon Platinum 8269CY at 2.5 GHz:

| library | 1-thread perf. (dim ~7000) | 8-thread perf. (dim ~20000) |
| --- | --- | --- |
| MKL 2019 | 69.8 GFLOPS | 591 GFLOPS |
| OpenBLAS develop, before this PR | 69.6 GFLOPS | 537 GFLOPS |
| OpenBLAS develop, after this PR | 72.6 GFLOPS | 579 GFLOPS |
| Theoretical peak | 80.0 GFLOPS | 640 GFLOPS |

@martin-frbg Have you tested the performance change on a SKX platform? (I cannot access my i7-9800X computer now because of the outbreak of 2019 Novel Coronavirus in my country.)

martin-frbg (Collaborator) commented:

Will try to get to that in a minute (though all I have is an older W2123, only four cores / eight threads).

martin-frbg (Collaborator) commented:

Preliminary results on a Xeon W2125, 4c/8t (before the very latest commits); the OpenBLAS numbers for 6/8 threads look wrong. dgemmtest_new was obtained from your GEMM_AVX2_FMA3 repository.

Results in GFLOPS; the matrix sizes tested were 2000x2000, 7000x7000 and 20000x20000 (not every size was run at every thread count).

1 thread
MKL: 86.1 / 97.0 / 98.5
OB before: 91.9 / 97.1
OB after: 94.0 / 103.8 / 105.3

4 threads
MKL: 380.9 / 394.3
OB before: 304.2 / 328.5
OB after: 370.3 / 397.5

6 threads
MKL: 394.0
OB before: 220.0
OB after: 224.5

8 threads
MKL: 394.3
OB before: 281.6
OB after: 250.6

wjc404 (Contributor, Author) commented Feb 6, 2020

@martin-frbg Thank you very much.

martin-frbg (Collaborator) commented:

20000 may be too small to justify more than 4 threads. Watching a rerun of the OMP_NUM_THREADS=6 case, I see it start out with 4 threads (MKL?) and switch to 6 later, where top shows 4x100 percent utilisation but only around 45 percent on the other two.

wjc404 (Contributor, Author) commented Feb 6, 2020

Probably MKL has a mechanism to detect the number of physical cores and limit the number of threads accordingly (to one thread per core). I noticed something strange when running DGEMM tests on the "8c/16t" Huawei cloud instance mentioned above (not the Aliyun one): OpenBLAS (after this PR) got ~1000 GFLOPS with OMP_NUM_THREADS=16, while MKL got only 500-600 GFLOPS on that VM. Latency tests between the logical cores suggested there were actually 16 physical cores, but lscpu showed 16 threads on 8 cores.

martin-frbg (Collaborator) commented Feb 6, 2020

Quite likely, and/or it puts an upper limit on the number of threads based on problem size. There is code by Agner Fog to do the physical/logical core counting on Intel and AMD in https://github.com/vectorclass/add-on/tree/master/physical_processors, but it seems even he cannot do it on Intel processors without resorting to assumptions.
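For what it's worth, a rough Linux-only alternative (not Agner Fog's CPUID-based method) is to count distinct (package, core) pairs from sysfs, which is essentially what lscpu reports; inside a VM this only reflects whatever topology the hypervisor chooses to expose, which is why it can disagree with latency measurements:

```c
#include <stdio.h>

/* Read one integer from a sysfs topology file for a given logical CPU. */
static int read_int(const char *fmt, int cpu, int *out)
{
    char path[128];
    snprintf(path, sizeof path, fmt, cpu);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    int ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok;
}

int main(void)
{
    int seen[4096][2], nseen = 0;   /* (package_id, core_id) pairs counted */

    for (int cpu = 0; ; ++cpu) {
        int pkg, core;
        if (!read_int("/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                      cpu, &pkg) ||
            !read_int("/sys/devices/system/cpu/cpu%d/topology/core_id",
                      cpu, &core))
            break;                  /* no more online logical CPUs */

        int dup = 0;
        for (int i = 0; i < nseen; ++i)
            if (seen[i][0] == pkg && seen[i][1] == core) { dup = 1; break; }
        if (!dup && nseen < 4096) {
            seen[nseen][0] = pkg;
            seen[nseen][1] = core;
            nseen++;
        }
    }
    printf("distinct physical cores exposed to this guest: %d\n", nseen);
    return 0;
}
```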

wjc404 (Contributor, Author) commented Feb 6, 2020

Performance degradation with more than one thread per core is expected with the new kernel. The packed block of matrix A occupies 576 kB (it cannot be made much smaller, because the limited main-memory bandwidth requires GEMM_Q and GEMM_P to be large enough), which fits in L2 when each core runs one thread. With two threads on a core, however, L2 is not large enough to hold the packed blocks from both threads. I saw a 14% performance degradation with 16 threads (compared to 8 threads) on an 8c/16t Aliyun KVM.
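To make the arithmetic explicit: the GEMM_P/GEMM_Q values below are illustrative ones that reproduce the quoted 576 kB, not necessarily the exact parameters chosen in this PR, and 1 MiB is the per-core L2 size of Skylake-SP / Cascade Lake-SP parts.

```c
/* Back-of-the-envelope check of the L2 argument above. */
#include <stddef.h>
#include <stdio.h>

int main(void)
{
    const size_t GEMM_P = 192, GEMM_Q = 384;       /* illustrative values  */
    const size_t packed_A = GEMM_P * GEMM_Q * sizeof(double);   /* bytes   */
    const size_t l2_per_core = 1024 * 1024;        /* 1 MiB L2 per core    */

    printf("packed A block:  %zu kB\n", packed_A / 1024);       /* 576 kB  */
    printf("1 thread/core:   %zu kB of %zu kB L2\n",
           packed_A / 1024, l2_per_core / 1024);                /* fits    */
    printf("2 threads/core:  %zu kB of %zu kB L2\n",
           2 * packed_A / 1024, l2_per_core / 1024);            /* too big */
    return 0;
}
```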

martin-frbg (Collaborator) commented:

Oh, right. This should be acceptable given the serious improvement overall, though it might make sense to put a note about this cache-flushing effect in the README and/or wiki.
