Optimize AVX2 SGEMM & STRMM #2361
Conversation
1) 1-thread SGEMM test with m=n=k=7023, transa=N and transb=T, on i9 9900K at 4.4 GHz, theoretical 140.8 GFLOPS:
2) 4-thread SGEMM test with m=n=k=10000, transa=T and transb=N, on the same CPU at 4.2 GHz, theoretical 538 GFLOPS:
3) 8-thread SGEMM test with m=n=k=20000, transa=transb=N, on the same CPU at 4.1 GHz, theoretical 1050 GFLOPS:
4) 1-thread STRMM test with m=n=6971, side=L, uplo=U, transa=N and diag=N, on the same CPU at 4.4 GHz, theoretical 140.8 GFLOPS:
The STRMM kernel passed the 1-thread reliability test. Test code:
Great, thanks a lot. Still fascinating to see how much performance can be improved by making things slower. |
I partially reverted the changes in OpenMathLib#2361 and saw the following speed-up with:
./xsl3blastst -R gemm -N 2048 2048 1 -a 5 1 1 1 1 1
AMD Ryzen 7 2700X (Zen+): 61400 to 63300 MFlops
AMD EPYC 7742 (Zen 2): 91400 to 94500 MFlops
These numbers are single-threaded performance.
Replace KERNEL_16x6 with KERNEL_8x12 to slow down reading of packed matrix A (in L3 cache), as mentioned in issue #2210. Performance catches up with MKL 2019 after the change.