
Improve performance of SGEMM and STRMM on Arm Cortex-A53 #2618


Merged: 7 commits merged into OpenMathLib:develop on May 25, 2020

Conversation

craft-zhang
Contributor

Optimization for SGEMM on Cortex-A53. Single-thread benchmark results on the RK3399 little cluster:

before:
depth,rows,cols,latency(s),GFLOPS/s
1,1,1,4.538e-07,0.004408 GFLOPS
2,2,2,4.962e-07,0.03224 GFLOPS
4,4,4,6.011e-07,0.213 GFLOPS
8,8,8,1.06e-06,0.9657 GFLOPS
11,11,11,2.632e-06,1.012 GFLOPS
16,16,16,3.727e-06,2.198 GFLOPS
19,19,19,7.425e-06,1.848 GFLOPS
23,23,23,1.19e-05,2.045 GFLOPS
27,27,27,1.666e-05,2.363 GFLOPS
32,32,32,1.841e-05,3.56 GFLOPS
38,38,38,3.598e-05,3.05 GFLOPS
45,45,45,5.855e-05,3.113 GFLOPS
54,54,54,9.529e-05,3.305 GFLOPS
64,64,64,0.0001362,3.848 GFLOPS
76,76,76,0.000234,3.752 GFLOPS
91,91,91,0.0004197,3.591 GFLOPS
108,108,108,0.0006323,3.984 GFLOPS
128,128,128,0.001005,4.172 GFLOPS
152,152,152,0.001727,4.067 GFLOPS
181,181,181,0.003109,3.815 GFLOPS
215,215,215,0.005348,3.717 GFLOPS
256,256,256,0.008465,3.964 GFLOPS
304,304,304,0.01417,3.967 GFLOPS
362,362,362,0.02484,3.819 GFLOPS
431,431,431,0.04246,3.771 GFLOPS
512,512,512,0.06607,4.063 GFLOPS
724,724,724,0.1867,4.066 GFLOPS
1024,1024,1024,0.5178,4.148 GFLOPS
1448,1448,1448,1.466,4.141 GFLOPS
2048,2048,2048,4.051,4.241 GFLOPS

after:
depth,rows,cols,latency(s),GFLOPS/s
1,1,1,4.094e-07,0.004885 GFLOPS
2,2,2,4.542e-07,0.03523 GFLOPS
4,4,4,5.48e-07,0.2336 GFLOPS
8,8,8,1.003e-06,1.021 GFLOPS
11,11,11,2.491e-06,1.068 GFLOPS
16,16,16,3.186e-06,2.571 GFLOPS
19,19,19,6.784e-06,2.022 GFLOPS
23,23,23,1.116e-05,2.18 GFLOPS
27,27,27,1.443e-05,2.728 GFLOPS
32,32,32,1.394e-05,4.7 GFLOPS
38,38,38,2.94e-05,3.733 GFLOPS
45,45,45,4.78e-05,3.813 GFLOPS
54,54,54,7.736e-05,4.071 GFLOPS
64,64,64,9.676e-05,5.419 GFLOPS
76,76,76,0.0001761,4.985 GFLOPS
91,91,91,0.0003204,4.703 GFLOPS
108,108,108,0.0004628,5.444 GFLOPS
128,128,128,0.0007058,5.943 GFLOPS
152,152,152,0.001178,5.963 GFLOPS
181,181,181,0.002462,4.817 GFLOPS
215,215,215,0.004601,4.32 GFLOPS
256,256,256,0.007187,4.669 GFLOPS
304,304,304,0.01287,4.366 GFLOPS
362,362,362,0.02115,4.487 GFLOPS
431,431,431,0.03498,4.578 GFLOPS
512,512,512,0.05367,5.002 GFLOPS
724,724,724,0.1465,5.18 GFLOPS
1024,1024,1024,0.4089,5.252 GFLOPS
1448,1448,1448,1.115,5.445 GFLOPS
2048,2048,2048,3.206,5.359 GFLOPS

@xianyi
Collaborator

xianyi commented May 18, 2020

Thank you for the patch.

Does your optimized sgemm_kernel_8x8 perform better than the old sgemm_kernel_8x8 implementation on other ARMv8 CPUs (e.g. Cortex-A73)?

@craft-zhang
Contributor Author

Thank you for the patch.

Does your optimized sgemm_kernel_8x8 perform better than the old sgemm_kernel_8x8 implementation on other ARMv8 CPUs (e.g. Cortex-A73)?

This optimization is customized specifically for the Cortex-A53 pipeline, so I have changed it to apply only to Cortex-A53.

@wjc404
Contributor

wjc404 commented May 18, 2020

Interesting design. I once noticed that the instruction "fmov vA.d[1], xB" (0 <= A, B < 32) does not interfere with the execution of 128-bit fmla on Cortex-A55, but significantly slows it down on Cortex-A76. Does OpenBLAS have a mechanism for dispatching different kernels to different cores in a parallel computation?

@craft-zhang
Contributor Author

Interesting design. I once noticed that the instruction "fmov vA.d[1], xB" (0 <= A, B < 32) does not interfere with the execution of 128-bit fmla on Cortex-A55, but significantly slows it down on Cortex-A76. Does OpenBLAS have a mechanism for dispatching different kernels to different cores in a parallel computation?

If threaded level-3 BLAS were to run on big and little cores in parallel, you would need a static load-balancing strategy, which would make the code design very complicated.

@martin-frbg
Collaborator

There is currently no mechanism in OpenBLAS to dispatch different kinds of kernels to different cores in a mixed configuration at runtime, and no active handling of big.LITTLE configurations in general. (Basically, whichever core type gets scheduled first by the operating system "wins" in the autodetection.) Adding this would indeed be quite complicated, and does not look particularly attractive given that only the ARMV8 architecture would currently make use of it.

@wjc404
Contributor

wjc404 commented May 18, 2020

Thanks for your answers.
I suspect there is still room for improvement by reordering some instructions:

// prologue iteration: loads the first A/B panel with ld1, initializes the
// accumulators v16-v31 with fmul, and begins loading the next panel into v2/v3/v6/v7
.macro KERNEL8x8_I
ld1 {v0.4s, v1.4s}, [pA], #32
ld1 {v4.4s, v5.4s}, [pB], #32
ldr d2, [pA], #8
ldr d6, [pB], #8
ldr d3, [pA, #8]
ldr d7, [pB, #8]
ldr x20, [pA], #16
fmul v16.4s, v0.4s, v4.s[0]
ldr x24, [pB], #16
fmul v17.4s, v1.4s, v4.s[0]
ldr x21, [pA], #8
fmul v18.4s, v0.4s, v4.s[1]
ldr x25, [pB], #8
fmul v19.4s, v1.4s, v4.s[1]
fmul v20.4s, v0.4s, v4.s[2]
fmul v21.4s, v1.4s, v4.s[2]
fmul v22.4s, v0.4s, v4.s[3]
fmul v23.4s, v1.4s, v4.s[3]
fmul v24.4s, v0.4s, v5.s[0]
fmul v25.4s, v1.4s, v5.s[0]
fmov v2.d[1], x20
fmul v26.4s, v0.4s, v5.s[1]
fmov v3.d[1], x21
fmul v27.4s, v1.4s, v5.s[1]
fmov v6.d[1], x24
fmul v28.4s, v0.4s, v5.s[2]
fmov v7.d[1], x25
fmul v29.4s, v1.4s, v5.s[2]
fmul v30.4s, v0.4s, v5.s[3]
fmul v31.4s, v1.4s, v5.s[3]
.endm

// main-loop iteration: computes with v0/v1 (A) and v4/v5 (B) while loading
// the next panel into v2/v3 and v6/v7
.macro KERNEL8x8_M1
ldr d2, [pA], #32
fmla v16.4s, v0.4s, v4.s[0]
ldr d6, [pB], #32
fmla v17.4s, v1.4s, v4.s[0]
ldr x20, [pA, #-24]
fmla v18.4s, v0.4s, v4.s[1]
ldr d3, [pA, #-16]
fmla v19.4s, v1.4s, v4.s[1]
ldr x21, [pA, #-8]
fmla v20.4s, v0.4s, v4.s[2]
ldr x24, [pB, #-24]
fmla v21.4s, v1.4s, v4.s[2]
ldr d7, [pB, #-16]
fmla v22.4s, v0.4s, v4.s[3]
ldr x25, [pB, #-8]
fmla v23.4s, v1.4s, v4.s[3]
fmla v24.4s, v0.4s, v5.s[0]
fmov v2.d[1], x20
fmla v25.4s, v1.4s, v5.s[0]
fmla v26.4s, v0.4s, v5.s[1]
fmov v3.d[1], x21
fmla v27.4s, v1.4s, v5.s[1]
fmov v6.d[1], x24
fmla v28.4s, v0.4s, v5.s[2]
fmov v7.d[1], x25
fmla v29.4s, v1.4s, v5.s[2]
fmla v30.4s, v0.4s, v5.s[3]
fmla v31.4s, v1.4s, v5.s[3]
.endm

// main-loop iteration: computes with v2/v3 (A) and v6/v7 (B) while loading
// the next panel back into v0/v1 and v4/v5
.macro KERNEL8x8_M2
ldr d0, [pA], #32
fmla v16.4s, v2.4s, v6.s[0]
ldr d4, [pB], #32
fmla v17.4s, v3.4s, v6.s[0]
ldr x18, [pA, #-24]
fmla v18.4s, v2.4s, v6.s[1]
ldr d1, [pA, #-16]
fmla v19.4s, v3.4s, v6.s[1]
ldr x19, [pA, #-8]
fmla v20.4s, v2.4s, v6.s[2]
ldr x22, [pB, #-24]
fmla v21.4s, v3.4s, v6.s[2]
ldr d5, [pB, #-16]
fmla v22.4s, v2.4s, v6.s[3]
ldr x23, [pB, #-8]
fmla v23.4s, v3.4s, v6.s[3]
fmla v24.4s, v2.4s, v7.s[0]
fmov v0.d[1], x18
fmla v25.4s, v3.4s, v7.s[0]
fmla v26.4s, v2.4s, v7.s[1]
fmov v1.d[1], x19
fmla v27.4s, v3.4s, v7.s[1]
fmov v4.d[1], x22
fmla v28.4s, v2.4s, v7.s[2]
fmov v5.d[1], x23
fmla v29.4s, v3.4s, v7.s[2]
fmla v30.4s, v2.4s, v7.s[3]
fmla v31.4s, v3.4s, v7.s[3]
.endm

// epilogue iteration: consumes the last v2/v3/v6/v7 panel with no further loads
.macro KERNEL8x8_E
fmla v16.4s, v2.4s, v6.s[0]
fmla v17.4s, v3.4s, v6.s[0]
fmla v18.4s, v2.4s, v6.s[1]
fmla v19.4s, v3.4s, v6.s[1]
fmla v20.4s, v2.4s, v6.s[2]
fmla v21.4s, v3.4s, v6.s[2]
fmla v22.4s, v2.4s, v6.s[3]
fmla v23.4s, v3.4s, v6.s[3]
fmla v24.4s, v2.4s, v7.s[0]
fmla v25.4s, v3.4s, v7.s[0]
fmla v26.4s, v2.4s, v7.s[1]
fmla v27.4s, v3.4s, v7.s[1]
fmla v28.4s, v2.4s, v7.s[2]
fmla v29.4s, v3.4s, v7.s[2]
fmla v30.4s, v2.4s, v7.s[3]
fmla v31.4s, v3.4s, v7.s[3]
.endm

@craft-zhang
Contributor Author

Thanks for your advice. Here are my micro-kernel benchmark results.
before reordering:
Run on (6 X 1416 MHz CPU s)
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:128/ldc:8 2310 ns 2310 ns 303036 items_per_second=7.09287G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:256/ldc:8 4496 ns 4496 ns 155735 items_per_second=7.28775G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:512/ldc:8 9282 ns 9282 ns 75405 items_per_second=7.06079G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:1024/ldc:8 20103 ns 20102 ns 34971 items_per_second=6.52021G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:2048/ldc:8 39452 ns 39432 ns 17773 items_per_second=6.64797G/s
after reordering:
Run on (6 X 1416 MHz CPU s)
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:128/ldc:8 2749 ns 2749 ns 247725 items_per_second=5.96G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:256/ldc:8 5378 ns 5378 ns 130119 items_per_second=6.09242G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:512/ldc:8 11226 ns 11226 ns 62352 items_per_second=5.838G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:1024/ldc:8 23292 ns 23292 ns 30037 items_per_second=5.62729G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:2048/ldc:8 45528 ns 45527 ns 15373 items_per_second=5.75803G/s

Since the A53 differs from the A55 in that fmov cannot dual-issue with fmla on the A53, the reordering may reduce performance.

@wjc404
Contributor

wjc404 commented May 18, 2020

Thank you very much. I've learned a lot from your answer.

@wjc404
Contributor

wjc404 commented May 18, 2020

How about storing the elements from matrix B in 4 64-bit NEON registers per iteration? I think this could eliminate half of the fmov instructions.
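
A minimal sketch of what that suggestion could look like, using the same operand names as the macros above (pB for the B pointer, v0/v1 for the A panel, v16-v31 for the accumulators); this is only an illustration of the idea, not code from this pull request:

// hypothetical fragment: keep the 8 B values of one iteration in four 64-bit
// d registers, so no fmov is needed to fill their upper halves
ldr d4, [pB]                    // B[0..1]
ldr d5, [pB, #8]                // B[2..3]
ldr d6, [pB, #16]               // B[4..5]
ldr d7, [pB, #24]               // B[6..7]
add pB, pB, #32
fmla v16.4s, v0.4s, v4.s[0]     // column 0, rows 0-3
fmla v17.4s, v1.4s, v4.s[0]     // column 0, rows 4-7
fmla v18.4s, v0.4s, v4.s[1]     // column 1 uses lane 1, still inside d4
fmla v19.4s, v1.4s, v4.s[1]
fmla v20.4s, v0.4s, v5.s[0]     // columns 2-7 continue with v5, v6 and v7
// ...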

@craft-zhang
Contributor Author

Sorry about my English; I hope I can explain this clearly.
Firstly, the A53's fmla pipeline issues the accumulate 4 cycles after the multiply; in other words, fmla cannot be dual-issued with any other SIMD instruction, and once another SIMD instruction is inserted it has to wait 4 cycles to issue. To amortize this, the micro-kernel needs the maximum ratio of computation to memory access, which is 8x8.
Secondly, fmov can be dual-issued with ld1, and fmla can be dual-issued with an ldr of a general-purpose register. So we only need to ensure that ld1 and fmov are fully dual-issued.
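
A condensed sketch of the pairings described above, with the same operand names as the kernel macros; which instructions actually dual-issue on the A53 is the assumption stated in this comment, not something verified here:

ldr x20, [pA], #16              // GPR load: can pair with the fmla below
fmla v16.4s, v0.4s, v4.s[0]
ld1 {v1.4s}, [pA], #16          // vector load: can pair with the fmov below
fmov v2.d[1], x20               // fills the upper half of v2 from the GPR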

@wjc404
Contributor

wjc404 commented May 18, 2020

Sorry for my stupid suggestions. So on the A53 there are no instructions operating on NEON registers (read or write) that can dual-issue with a 128-bit fmla, is that correct? (If this is true, there is no way to reach the theoretical peak FLOPS in GEMM; a pity...)

@craft-zhang
Contributor Author

craft-zhang commented May 18, 2020

This is something worth discussing. According to some of my experiments and inferences, it should be like this, but I can't guarantee it. :)

@craft-zhang craft-zhang marked this pull request as draft May 19, 2020 06:39
@craft-zhang craft-zhang marked this pull request as ready for review May 20, 2020 14:39
@craft-zhang craft-zhang changed the title Cortex a53 sgemm optimization Improve performance of SGEMM and STRMM on Arm Cortex-A53 May 20, 2020
@wjc404
Contributor

wjc404 commented May 21, 2020

Just one more stupid question: how many cycles does a post-indexed load instruction need to update its address register on the A53 (from issue to the register being available for a subsequent load)?

@craft-zhang
Contributor Author

Just one more stupid question: how many cycles does a post-indexed load instruction need to update its address register on the A53 (from issue to the register being available for a subsequent load)?

Sorry, I haven't measured the latency; maybe one or two cycles.
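
One hypothetical way to measure it (not run in this thread) is a loop of loads whose base register depends only on the previous load's post-index write-back; x0 points at any readable address and x2 holds the iteration count:

// each ldr writes back x0 = x0 + 0, and the next ldr uses x0 as its base,
// so the cycles per load are dominated by the write-back-to-address latency
1:
    ldr x1, [x0], #0
    ldr x1, [x0], #0
    ldr x1, [x0], #0
    ldr x1, [x0], #0
    subs x2, x2, #1
    b.ne 1b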

@martin-frbg martin-frbg merged commit f1a18d2 into OpenMathLib:develop May 25, 2020
@craft-zhang craft-zhang deleted the cortex-A53 branch June 4, 2020 06:06