Optimize AVX512 DGEMM (& DTRMM) #2384
Conversation
The new DTRMM kernel passed the 1-thread reliability test.
Reduce DGEMM_R to avoid a segfault when executing serial dgemm with m=3000 and n=k=12000.
DGEMM test on a KVM (aliyun, bound to 8 physical cores on 1 die) with an Intel Xeon Platinum 8269CY at 2.5 GHz:

[table: library | 1-thread perf. (dim ~7000) | 8-thread perf. (dim ~20000)]

@martin-frbg Have you tested the performance change on a SKX platform? (I cannot access my i7-9800X computer now because of the outbreak of the 2019 novel coronavirus in my country.)
Will try to get to that in a minute (but all I have is an older W2123, only four cores / eight threads).
Preliminary results on Xeon W2125, 4c/8t (before the very latest commits) - the OpenBLAS numbers for 6/8 threads look wrong. dgemmtest_new was obtained from your GEMM_AVX2_FMA3 repository.
@martin-frbg Thank you very much.
A dimension of 20000 may be too small to justify more than 4 threads - watching a rerun of the OMP_NUM_THREADS=6 case, I see it start out with 4 threads (MKL?) and switch to 6 later (where top shows 4x100 percent utilisation, but only around 45 percent on the other two).
Probably MKL has a mechanism to detect the number of physical cores and limit the number of threads accordingly (to 1 thread per core). I noticed something strange when running dgemm tests on the "8c/16t" Huawei cloud mentioned above (not aliyun): OpenBLAS (after this PR) got ~1000 GFLOPS with OMP_NUM_THREADS=16 while MKL got only 500-600 on that VM. Latency tests between logical cores suggested there should be 16 physical cores, but lscpu showed 16 threads on 8 cores.
Quite likely - and/or it puts an upper limit on the number of threads based on problem size. There is code by Agner Fog to do the physical/logical core counting on Intel and AMD in https://github.com/vectorclass/add-on/tree/master/physical_processors but it seems even he cannot do it on Intel processors without resorting to assumptions.
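For illustration, here is a minimal sketch of the kind of query such detection relies on: CPUID leaf 0x0B (extended topology enumeration) reports how many logical processors share one physical core. This is a heavy simplification of what Agner Fog's code does in full - it only covers recent Intel CPUs, and the function name and fallback behaviour are assumptions, not taken from either codebase.

```c
#include <cpuid.h>
#include <stdio.h>

/* Minimal sketch: query CPUID leaf 0x0B, sub-leaf 0 (the SMT level on
 * recent Intel CPUs) to find how many logical processors share one core.
 * Robust detection (cf. Agner Fog's code linked above) must also handle
 * older Intel CPUs and AMD, where this leaf may be absent or differ. */
static int threads_per_core(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Leaf not supported: assume no Hyper-Threading. */
    if (!__get_cpuid_count(0x0B, 0, &eax, &ebx, &ecx, &edx))
        return 1;
    /* ECX[15:8] is the level type; 1 means SMT. Bail out otherwise. */
    if (((ecx >> 8) & 0xFF) != 1)
        return 1;
    /* EBX[15:0] holds the logical processor count at this level. */
    return (ebx & 0xFFFF) ? (int)(ebx & 0xFFFF) : 1;
}

int main(void)
{
    printf("logical processors per core: %d\n", threads_per_core());
    return 0;
}
```

On a 2-way Hyper-Threaded core this prints 2, so dividing the OS-visible CPU count by this value gives the physical core count - modulo exactly the VM-topology lies described above, which is why even this approach can be fooled on cloud instances.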
Performance degradation with more than one thread per core is to be expected with the new kernel. The packed matrix A occupies 576 kB (it can't be much smaller, because the limited bandwidth of main memory requires GEMM_Q and GEMM_P to be big enough), which fits in L2 when each core runs 1 thread. With 2 threads on a core, however, the L2 is not large enough to hold the packed panels of both threads. I saw a 14% performance degradation with 16 threads (compared to 8 threads) on an 8c/16t aliyun KVM.
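A quick sanity check of the arithmetic behind this claim. The DGEMM_P/DGEMM_Q values below are assumptions chosen to reproduce the quoted 576 kB figure; the real blocking parameters live in OpenBLAS's param.h and may differ.

```c
#include <stddef.h>
#include <stdio.h>

#define DGEMM_P 192             /* rows of the packed A panel (assumed)  */
#define DGEMM_Q 384             /* shared k-dimension depth (assumed)    */
#define L2_SIZE (1024 * 1024)   /* 1 MB private L2 per Skylake-SP core   */

int main(void)
{
    size_t panel_a = (size_t)DGEMM_P * DGEMM_Q * sizeof(double);

    printf("packed A panel: %zu kB\n", panel_a / 1024);  /* 576 kB */
    printf("1 thread/core:  panel fits in L2? %s\n",
           panel_a <= L2_SIZE ? "yes" : "no");            /* yes */
    printf("2 threads/core: both panels fit?  %s\n",
           2 * panel_a <= L2_SIZE ? "yes" : "no");        /* no  */
    return 0;
}
```

One 576 kB panel fits in the 1 MB L2; two of them (1152 kB) do not, so with two threads per core the panels evict each other and the kernel falls back to L3/memory bandwidth, consistent with the observed 14% slowdown.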
Oh, right. This should be acceptable given the serious improvement overall, though it might make sense to put a note about this cache flushing effect in the readme and/or wiki.
Replace KERNEL_8x24 with KERNEL_16x12, which makes more room for raising GEMM_Q to reduce the frequency of reads and writes of matrix C (in main memory), thus improving parallel performance.
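To visualise what a 16x12 register tile means: two zmm vectors cover 16 rows of A, 12 broadcast elements of B give 24 accumulator registers, and the whole tile fits in the 32 zmm registers with room to spare. The actual kernel in this PR is hand-written assembly; the intrinsics sketch below is a hypothetical simplification that assumes alpha == 1, unit-stride packed panels, column-major C, and no edge handling.

```c
#include <immintrin.h>

/* Sketch of a 16x12 AVX-512 DGEMM micro-kernel (not the PR's assembly).
 * packed_a: k blocks of 16 consecutive doubles (16 rows of A per k step).
 * packed_b: k blocks of 12 consecutive doubles (12 cols of B per k step).
 * c: column-major 16x12 tile of C with leading dimension ldc.           */
static void dgemm_kernel_16x12(long k, const double *packed_a,
                               const double *packed_b, double *c, long ldc)
{
    __m512d acc[12][2];                       /* 24 accumulator registers */
    for (int j = 0; j < 12; j++) {
        acc[j][0] = _mm512_setzero_pd();
        acc[j][1] = _mm512_setzero_pd();
    }
    for (long p = 0; p < k; p++) {
        __m512d a0 = _mm512_loadu_pd(packed_a + 16 * p);      /* rows 0..7  */
        __m512d a1 = _mm512_loadu_pd(packed_a + 16 * p + 8);  /* rows 8..15 */
        for (int j = 0; j < 12; j++) {        /* fully unrolled in the asm */
            __m512d b = _mm512_set1_pd(packed_b[12 * p + j]);
            acc[j][0] = _mm512_fmadd_pd(a0, b, acc[j][0]);
            acc[j][1] = _mm512_fmadd_pd(a1, b, acc[j][1]);
        }
    }
    for (int j = 0; j < 12; j++) {            /* C += tile (alpha == 1) */
        double *cj = c + j * ldc;
        _mm512_storeu_pd(cj,     _mm512_add_pd(_mm512_loadu_pd(cj),     acc[j][0]));
        _mm512_storeu_pd(cj + 8, _mm512_add_pd(_mm512_loadu_pd(cj + 8), acc[j][1]));
    }
}
```

Compared with an 8x24 shape (also 24 accumulators), the 16x12 tile issues 12 B broadcasts instead of 24 per k iteration and has a narrower footprint in the n direction, which is presumably what frees cache room for the larger GEMM_Q the comment describes.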