Improve performance of SGEMM and STRMM on Arm Cortex-A53 #2618
Conversation
Thank you for the patch. Does your optimized kernel also apply to other ARMV8 targets, or is it specific to the Cortex-A53?
This optimization is specially tuned to the Cortex-A53's pipeline, so I have changed it to be selected only for the Cortex-A53 target.
Interesting design. I once noticed that the instruction "fmov vA.d[1], xB" (0 <= A, B < 32) does not interfere with the execution of 128-bit fmla on the Cortex-A55, but significantly slows it down on the Cortex-A76. Does OpenBLAS have a mechanism for dispatching different kernels to different cores in a parallel computation?
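For context, here is a minimal sketch of the trick being discussed, assuming a simple lane-wise accumulation rather than the PR's actual 8x8 micro-kernel (the function name, register numbers and loop bookkeeping are illustrative): two floats of B are fetched with an integer load into a GPR and then inserted into the upper half of a NEON register with fmov vN.d[1], xM, in between the fmla instructions.

```c
#include <arm_neon.h>

/* Accumulate a[i]*b[i] into four partial sums; k must be a positive
 * multiple of 4.  The interesting part is how the upper half of v1 is
 * filled from a general-purpose register instead of via a NEON load. */
void fmov_insert_demo(const float *a, const float *b, float *out, long k)
{
    float32x4_t acc = vdupq_n_f32(0.0f);

    __asm__ volatile(
        "1:                                \n\t"
        "ldr   q0, [%[a]], #16             \n\t" /* 4 floats of A            */
        "ldr   d1, [%[b]], #16             \n\t" /* low 2 floats of B        */
        "ldr   x4, [%[b], #-8]             \n\t" /* high 2 floats, via a GPR */
        "fmov  v1.d[1], x4                 \n\t" /* insert into upper half   */
        "fmla  %[acc].4s, v0.4s, v1.4s     \n\t"
        "subs  %[k], %[k], #4              \n\t"
        "b.gt  1b                          \n\t"
        : [acc] "+w"(acc), [a] "+r"(a), [b] "+r"(b), [k] "+r"(k)
        :
        : "v0", "v1", "x4", "cc", "memory");

    vst1q_f32(out, acc);
}
```

Whether the ldr/fmov pair can actually be issued alongside the fmla stream is exactly the per-core difference (A53 vs. A55 vs. A76) discussed in the rest of this thread.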
If threaded level-3 BLAS were to run on big and little cores in parallel, you would need a static load-balancing strategy, which would make the code design very complicated.
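For illustration only, a static split of the kind mentioned above might look like the sketch below; OpenBLAS does not do this, and the per-cluster GFLOPS figures and the helper name split_rows are made-up assumptions.

```c
#include <stdio.h>

typedef struct { int m_start, m_end; } row_range;

/* Split the M dimension of C into two contiguous blocks sized by the
 * big:little throughput ratio, so that both clusters (each running its
 * own tuned kernel) would finish at roughly the same time. */
static void split_rows(int m, double big_gflops, double little_gflops,
                       row_range *big, row_range *little)
{
    double share = big_gflops / (big_gflops + little_gflops);
    int m_big = (int)(m * share + 0.5);

    big->m_start = 0;
    big->m_end = m_big;
    little->m_start = m_big;
    little->m_end = m;
}

int main(void)
{
    row_range big, little;

    /* Assumed throughputs for an RK3399-like SoC, purely illustrative. */
    split_rows(1024, 10.0, 5.3, &big, &little);
    printf("big cluster: rows [%d,%d), little cluster: rows [%d,%d)\n",
           big.m_start, big.m_end, little.m_start, little.m_end);
    return 0;
}
```

The split itself is trivial; the complication is everything around it (per-core kernel selection, thread placement, and keeping the ratio accurate).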
There is currently no mechanism in OpenBLAS to select different kinds of kernels for different cores in a mixed configuration at runtime, and no active handling of big.LITTLE configurations in general (basically, whichever core type gets scheduled first by the operating system "wins" the autodetection). Adding this would indeed be quite complicated, and does not look particularly attractive given that only the ARMV8 architecture would currently make use of it.
Thanks for your answers. The relevant kernel macros are .macro KERNEL8x8_I, .macro KERNEL8x8_M1, .macro KERNEL8x8_M2 and .macro KERNEL8x8_E.
Thanks for your advice; this is my micro-kernel benchmark result. Since there is a difference between the A53 and the A55, fmov cannot dual-issue with 128-bit fmla on the A53.
Thank you very much. I've learned a lot from your answer. |
How about storing the elements of matrix B in four 64-bit NEON registers per iteration? I think this would eliminate half of the fmov instructions.
Sorry about my English; I hope I can explain it clearly.
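A rough sketch of that suggestion, again illustrative rather than the PR's kernel: keep the current slice of B in 64-bit d-registers and multiply by element, instead of assembling 128-bit registers with ldr/fmov pairs. The 4x4 update size, the function name and the register choices are simplifying assumptions.

```c
#include <arm_neon.h>

/* One rank-1 update of a 4x4 block of C: four columns of C are held in
 * c[0..3], four floats of A in q0, and four floats of B in two d-registers
 * (v2, v3) that are consumed by-element, so no fmov is needed for B. */
void rank1_update_4x4(const float *a, const float *b, float32x4_t c[4])
{
    __asm__ volatile(
        "ldr  q0, [%[a]]                 \n\t" /* a[0..3]    */
        "ldr  d2, [%[b]]                 \n\t" /* b[0], b[1] */
        "ldr  d3, [%[b], #8]             \n\t" /* b[2], b[3] */
        "fmla %[c0].4s, v0.4s, v2.s[0]   \n\t"
        "fmla %[c1].4s, v0.4s, v2.s[1]   \n\t"
        "fmla %[c2].4s, v0.4s, v3.s[0]   \n\t"
        "fmla %[c3].4s, v0.4s, v3.s[1]   \n\t"
        : [c0] "+w"(c[0]), [c1] "+w"(c[1]), [c2] "+w"(c[2]), [c3] "+w"(c[3])
        : [a] "r"(a), [b] "r"(b)
        : "v0", "v2", "v3", "memory");
}
```

The trade-off the following replies get at is whether those 64-bit NEON loads (or any NEON-register access) can be paired with the 128-bit fmla on the A53 at all.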
Sorry for my stupid suggestions. So on the A53 there are no instructions operating on NEON registers (read or write) that are dual-issuable with 128-bit fmla, is that correct? (If this is true, there is no way to reach the theoretical maximum FLOPS in GEMM; a pity...)
This is something worth discussing. According to my experiments and inferences it should be the case, but I can't guarantee it. :)
Signed-off-by: zhangdanfeng <[email protected]>
Just one more stupid question: how many cycles does a post-indexed load instruction need to update the address register on the A53 (from issue to the updated address being available to a subsequent load)?
Sorry, I haven't measured the latency; maybe one or two cycles.
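In case someone wants to measure it, a rough microbenchmark sketch is below. Assumptions: the loop is limited by the base-register writeback chain (each load's address depends on the previous load's post-index update, not on the loaded data), and the 1.4 GHz used to convert seconds to cycles is a placeholder for the actual little-cluster clock.

```c
#include <stdio.h>
#include <time.h>

int main(void)
{
    long buf[4] = {0};
    long *p = buf;
    long iters = 100000000L;
    long n = iters;
    long scratch;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    __asm__ volatile(
        "1:                          \n\t"
        "ldr  %[s], [%[p]], #8       \n\t" /* writes back p += 8  */
        "ldr  %[s], [%[p]], #-8      \n\t" /* needs the updated p */
        "subs %[n], %[n], #2         \n\t"
        "b.gt 1b                     \n\t"
        : [p] "+r"(p), [n] "+r"(n), [s] "=&r"(scratch)
        :
        : "cc", "memory");
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* assumed little-cluster clock; replace with the real frequency */
    double cycles_per_load = sec * 1.4e9 / (double)iters;
    printf("~%.2f cycles per dependent post-indexed load\n", cycles_per_load);
    return 0;
}
```

If the reported number is close to 1, the writeback is effectively hidden behind the load; 2 or more would mean a dependent chain of post-indexed loads pays an extra cycle per load.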
Optimization for SGEMM on Cortex-A53; single-thread benchmark result on the RK3399 little cluster.
before:
depth,rows,cols,latency(s),GFLOPS
1,1,1,4.538e-07,0.004408 GFLOPS
2,2,2,4.962e-07,0.03224 GFLOPS
4,4,4,6.011e-07,0.213 GFLOPS
8,8,8,1.06e-06,0.9657 GFLOPS
11,11,11,2.632e-06,1.012 GFLOPS
16,16,16,3.727e-06,2.198 GFLOPS
19,19,19,7.425e-06,1.848 GFLOPS
23,23,23,1.19e-05,2.045 GFLOPS
27,27,27,1.666e-05,2.363 GFLOPS
32,32,32,1.841e-05,3.56 GFLOPS
38,38,38,3.598e-05,3.05 GFLOPS
45,45,45,5.855e-05,3.113 GFLOPS
54,54,54,9.529e-05,3.305 GFLOPS
64,64,64,0.0001362,3.848 GFLOPS
76,76,76,0.000234,3.752 GFLOPS
91,91,91,0.0004197,3.591 GFLOPS
108,108,108,0.0006323,3.984 GFLOPS
128,128,128,0.001005,4.172 GFLOPS
152,152,152,0.001727,4.067 GFLOPS
181,181,181,0.003109,3.815 GFLOPS
215,215,215,0.005348,3.717 GFLOPS
256,256,256,0.008465,3.964 GFLOPS
304,304,304,0.01417,3.967 GFLOPS
362,362,362,0.02484,3.819 GFLOPS
431,431,431,0.04246,3.771 GFLOPS
512,512,512,0.06607,4.063 GFLOPS
724,724,724,0.1867,4.066 GFLOPS
1024,1024,1024,0.5178,4.148 GFLOPS
1448,1448,1448,1.466,4.141 GFLOPS
2048,2048,2048,4.051,4.241 GFLOPS
after:
depth,rows,cols,latency(s),GFLOPS
1,1,1,4.094e-07,0.004885 GFLOPS
2,2,2,4.542e-07,0.03523 GFLOPS
4,4,4,5.48e-07,0.2336 GFLOPS
8,8,8,1.003e-06,1.021 GFLOPS
11,11,11,2.491e-06,1.068 GFLOPS
16,16,16,3.186e-06,2.571 GFLOPS
19,19,19,6.784e-06,2.022 GFLOPS
23,23,23,1.116e-05,2.18 GFLOPS
27,27,27,1.443e-05,2.728 GFLOPS
32,32,32,1.394e-05,4.7 GFLOPS
38,38,38,2.94e-05,3.733 GFLOPS
45,45,45,4.78e-05,3.813 GFLOPS
54,54,54,7.736e-05,4.071 GFLOPS
64,64,64,9.676e-05,5.419 GFLOPS
76,76,76,0.0001761,4.985 GFLOPS
91,91,91,0.0003204,4.703 GFLOPS
108,108,108,0.0004628,5.444 GFLOPS
128,128,128,0.0007058,5.943 GFLOPS
152,152,152,0.001178,5.963 GFLOPS
181,181,181,0.002462,4.817 GFLOPS
215,215,215,0.004601,4.32 GFLOPS
256,256,256,0.007187,4.669 GFLOPS
304,304,304,0.01287,4.366 GFLOPS
362,362,362,0.02115,4.487 GFLOPS
431,431,431,0.03498,4.578 GFLOPS
512,512,512,0.05367,5.002 GFLOPS
724,724,724,0.1465,5.18 GFLOPS
1024,1024,1024,0.4089,5.252 GFLOPS
1448,1448,1448,1.115,5.445 GFLOPS
2048,2048,2048,3.206,5.359 GFLOPS