
Improve performance of SGEMM and STRMM on Arm Cortex-A53 #2618


Merged: 7 commits merged into OpenMathLib:develop on May 25, 2020

Conversation

craft-zhang
Contributor

Optimization for SGEMM on Cortex-A53. Single-thread benchmark results on the RK3399 little cluster:

before:
depth,rows,cols,latency(s),GFLOPS/s
1,1,1,4.538e-07,0.004408 GFLOPS
2,2,2,4.962e-07,0.03224 GFLOPS
4,4,4,6.011e-07,0.213 GFLOPS
8,8,8,1.06e-06,0.9657 GFLOPS
11,11,11,2.632e-06,1.012 GFLOPS
16,16,16,3.727e-06,2.198 GFLOPS
19,19,19,7.425e-06,1.848 GFLOPS
23,23,23,1.19e-05,2.045 GFLOPS
27,27,27,1.666e-05,2.363 GFLOPS
32,32,32,1.841e-05,3.56 GFLOPS
38,38,38,3.598e-05,3.05 GFLOPS
45,45,45,5.855e-05,3.113 GFLOPS
54,54,54,9.529e-05,3.305 GFLOPS
64,64,64,0.0001362,3.848 GFLOPS
76,76,76,0.000234,3.752 GFLOPS
91,91,91,0.0004197,3.591 GFLOPS
108,108,108,0.0006323,3.984 GFLOPS
128,128,128,0.001005,4.172 GFLOPS
152,152,152,0.001727,4.067 GFLOPS
181,181,181,0.003109,3.815 GFLOPS
215,215,215,0.005348,3.717 GFLOPS
256,256,256,0.008465,3.964 GFLOPS
304,304,304,0.01417,3.967 GFLOPS
362,362,362,0.02484,3.819 GFLOPS
431,431,431,0.04246,3.771 GFLOPS
512,512,512,0.06607,4.063 GFLOPS
724,724,724,0.1867,4.066 GFLOPS
1024,1024,1024,0.5178,4.148 GFLOPS
1448,1448,1448,1.466,4.141 GFLOPS
2048,2048,2048,4.051,4.241 GFLOPS

after:
depth,rows,cols,latency(s),GFLOPS/s
1,1,1,4.094e-07,0.004885 GFLOPS
2,2,2,4.542e-07,0.03523 GFLOPS
4,4,4,5.48e-07,0.2336 GFLOPS
8,8,8,1.003e-06,1.021 GFLOPS
11,11,11,2.491e-06,1.068 GFLOPS
16,16,16,3.186e-06,2.571 GFLOPS
19,19,19,6.784e-06,2.022 GFLOPS
23,23,23,1.116e-05,2.18 GFLOPS
27,27,27,1.443e-05,2.728 GFLOPS
32,32,32,1.394e-05,4.7 GFLOPS
38,38,38,2.94e-05,3.733 GFLOPS
45,45,45,4.78e-05,3.813 GFLOPS
54,54,54,7.736e-05,4.071 GFLOPS
64,64,64,9.676e-05,5.419 GFLOPS
76,76,76,0.0001761,4.985 GFLOPS
91,91,91,0.0003204,4.703 GFLOPS
108,108,108,0.0004628,5.444 GFLOPS
128,128,128,0.0007058,5.943 GFLOPS
152,152,152,0.001178,5.963 GFLOPS
181,181,181,0.002462,4.817 GFLOPS
215,215,215,0.004601,4.32 GFLOPS
256,256,256,0.007187,4.669 GFLOPS
304,304,304,0.01287,4.366 GFLOPS
362,362,362,0.02115,4.487 GFLOPS
431,431,431,0.03498,4.578 GFLOPS
512,512,512,0.05367,5.002 GFLOPS
724,724,724,0.1465,5.18 GFLOPS
1024,1024,1024,0.4089,5.252 GFLOPS
1448,1448,1448,1.115,5.445 GFLOPS
2048,2048,2048,3.206,5.359 GFLOPS

@xianyi
Collaborator

xianyi commented May 18, 2020

Thank you for the patch.

Does your optimized sgemm_kernel_8x8 perform better than the old sgemm_kernel_8x8 implementation on other ARMv8 CPUs (e.g. Cortex-A73)?

@craft-zhang
Contributor Author

Thank you for the patch.

Does your optimized sgemm_kernel_8x8 perform better than the old sgemm_kernel_8x8 implementation on other ARMv8 CPUs (e.g. Cortex-A73)?

This optimization is customized specifically for the Cortex-A53 pipeline, so I have changed it to apply only to Cortex-A53.

@wjc404
Contributor

wjc404 commented May 18, 2020

Interesting design. I once noticed that the instruction "fmov vA.d[1], xB" (0 <= A, B < 32) does not interfere with the execution of 128-bit fmla on Cortex-A55, but significantly slows it down on Cortex-A76. Does OpenBLAS have a mechanism for dispatching different kernels to different cores in a parallel computation?

@craft-zhang
Contributor Author

Interesting design. I once noticed that the instruction "fmov vA.d[1], xB" (0 <= A, B < 32) does not interfere with the execution of 128-bit fmla on Cortex-A55, but significantly slows it down on Cortex-A76. Does OpenBLAS have a mechanism for dispatching different kernels to different cores in a parallel computation?

If threaded level-3 BLAS were to run on big and little cores in parallel, you would need a static load-balancing strategy, which would make the code design very complicated.

@martin-frbg
Collaborator

There is currently no mechanism in OpenBLAS to dispatch different kinds of kernels to different cores in a mixed configuration at runtime, and no active handling of big.LITTLE configurations in general. (Basically, whichever core type gets scheduled first by the operating system "wins" in the autodetection.) Adding this would indeed be quite complicated, and does not look particularly attractive given that only the ARMV8 architecture would currently make use of it.

@wjc404
Contributor

wjc404 commented May 18, 2020

Thanks for your answers.
I suspect there is still room for improvement by reordering some instructions:

// prologue iteration: loads the first A/B panel with ld1, initializes the
// accumulators v16-v31 with fmul, and begins loading the next panel into v2/v3/v6/v7
.macro KERNEL8x8_I
ld1 {v0.4s, v1.4s}, [pA], #32
ld1 {v4.4s, v5.4s}, [pB], #32
ldr d2, [pA], #8
ldr d6, [pB], #8
ldr d3, [pA, #8]
ldr d7, [pB, #8]
ldr x20, [pA], #16
fmul v16.4s, v0.4s, v4.s[0]
ldr x24, [pB], #16
fmul v17.4s, v1.4s, v4.s[0]
ldr x21, [pA], #8
fmul v18.4s, v0.4s, v4.s[1]
ldr x25, [pB], #8
fmul v19.4s, v1.4s, v4.s[1]
fmul v20.4s, v0.4s, v4.s[2]
fmul v21.4s, v1.4s, v4.s[2]
fmul v22.4s, v0.4s, v4.s[3]
fmul v23.4s, v1.4s, v4.s[3]
fmul v24.4s, v0.4s, v5.s[0]
fmul v25.4s, v1.4s, v5.s[0]
fmov v2.d[1], x20
fmul v26.4s, v0.4s, v5.s[1]
fmov v3.d[1], x21
fmul v27.4s, v1.4s, v5.s[1]
fmov v6.d[1], x24
fmul v28.4s, v0.4s, v5.s[2]
fmov v7.d[1], x25
fmul v29.4s, v1.4s, v5.s[2]
fmul v30.4s, v0.4s, v5.s[3]
fmul v31.4s, v1.4s, v5.s[3]
.endm

// main-loop iteration: computes with v0/v1 (A) and v4/v5 (B) while loading
// the next panel into v2/v3 and v6/v7
.macro KERNEL8x8_M1
ldr d2, [pA], #32
fmla v16.4s, v0.4s, v4.s[0]
ldr d6, [pB], #32
fmla v17.4s, v1.4s, v4.s[0]
ldr x20, [pA, #-24]
fmla v18.4s, v0.4s, v4.s[1]
ldr d3, [pA, #-16]
fmla v19.4s, v1.4s, v4.s[1]
ldr x21, [pA, #-8]
fmla v20.4s, v0.4s, v4.s[2]
ldr x24, [pB, #-24]
fmla v21.4s, v1.4s, v4.s[2]
ldr d7, [pB, #-16]
fmla v22.4s, v0.4s, v4.s[3]
ldr x25, [pB, #-8]
fmla v23.4s, v1.4s, v4.s[3]
fmla v24.4s, v0.4s, v5.s[0]
fmov v2.d[1], x20
fmla v25.4s, v1.4s, v5.s[0]
fmla v26.4s, v0.4s, v5.s[1]
fmov v3.d[1], x21
fmla v27.4s, v1.4s, v5.s[1]
fmov v6.d[1], x24
fmla v28.4s, v0.4s, v5.s[2]
fmov v7.d[1], x25
fmla v29.4s, v1.4s, v5.s[2]
fmla v30.4s, v0.4s, v5.s[3]
fmla v31.4s, v1.4s, v5.s[3]
.endm

// main-loop iteration: computes with v2/v3 (A) and v6/v7 (B) while loading
// the next panel back into v0/v1 and v4/v5
.macro KERNEL8x8_M2
ldr d0, [pA], #32
fmla v16.4s, v2.4s, v6.s[0]
ldr d4, [pB], #32
fmla v17.4s, v3.4s, v6.s[0]
ldr x18, [pA, #-24]
fmla v18.4s, v2.4s, v6.s[1]
ldr d1, [pA, #-16]
fmla v19.4s, v3.4s, v6.s[1]
ldr x19, [pA, #-8]
fmla v20.4s, v2.4s, v6.s[2]
ldr x22, [pB, #-24]
fmla v21.4s, v3.4s, v6.s[2]
ldr d5, [pB, #-16]
fmla v22.4s, v2.4s, v6.s[3]
ldr x23, [pB, #-8]
fmla v23.4s, v3.4s, v6.s[3]
fmla v24.4s, v2.4s, v7.s[0]
fmov v0.d[1], x18
fmla v25.4s, v3.4s, v7.s[0]
fmla v26.4s, v2.4s, v7.s[1]
fmov v1.d[1], x19
fmla v27.4s, v3.4s, v7.s[1]
fmov v4.d[1], x22
fmla v28.4s, v2.4s, v7.s[2]
fmov v5.d[1], x23
fmla v29.4s, v3.4s, v7.s[2]
fmla v30.4s, v2.4s, v7.s[3]
fmla v31.4s, v3.4s, v7.s[3]
.endm

// epilogue iteration: consumes the last v2/v3/v6/v7 panel with no further loads
.macro KERNEL8x8_E
fmla v16.4s, v2.4s, v6.s[0]
fmla v17.4s, v3.4s, v6.s[0]
fmla v18.4s, v2.4s, v6.s[1]
fmla v19.4s, v3.4s, v6.s[1]
fmla v20.4s, v2.4s, v6.s[2]
fmla v21.4s, v3.4s, v6.s[2]
fmla v22.4s, v2.4s, v6.s[3]
fmla v23.4s, v3.4s, v6.s[3]
fmla v24.4s, v2.4s, v7.s[0]
fmla v25.4s, v3.4s, v7.s[0]
fmla v26.4s, v2.4s, v7.s[1]
fmla v27.4s, v3.4s, v7.s[1]
fmla v28.4s, v2.4s, v7.s[2]
fmla v29.4s, v3.4s, v7.s[2]
fmla v30.4s, v2.4s, v7.s[3]
fmla v31.4s, v3.4s, v7.s[3]
.endm

@craft-zhang
Contributor Author

Thanks for your advice. Here are my micro-kernel benchmark results.
before reordering:
Run on (6 X 1416 MHz CPU s)
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:128/ldc:8 2310 ns 2310 ns 303036 items_per_second=7.09287G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:256/ldc:8 4496 ns 4496 ns 155735 items_per_second=7.28775G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:512/ldc:8 9282 ns 9282 ns 75405 items_per_second=7.06079G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:1024/ldc:8 20103 ns 20102 ns 34971 items_per_second=6.52021G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:2048/ldc:8 39452 ns 39432 ns 17773 items_per_second=6.64797G/s
after reordering:
Run on (6 X 1416 MHz CPU s)
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:128/ldc:8 2749 ns 2749 ns 247725 items_per_second=5.96G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:256/ldc:8 5378 ns 5378 ns 130119 items_per_second=6.09242G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:512/ldc:8 11226 ns 11226 ns 62352 items_per_second=5.838G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:1024/ldc:8 23292 ns 23292 ns 30037 items_per_second=5.62729G/s
sgemm_kernel_8x8/kernel_8x8/mr:8/nr:8/k:2048/ldc:8 45528 ns 45527 ns 15373 items_per_second=5.75803G/s

Since the A53 differs from the A55 in that fmov cannot dual-issue with fmla on the A53, the reordering may reduce performance.

@wjc404
Contributor

wjc404 commented May 18, 2020

Thank you very much. I've learned a lot from your answer.

@wjc404
Contributor

wjc404 commented May 18, 2020

How about storing the elements from matrix B in 4 64-bit NEON registers per iteration? I think this could eliminate half of the fmov instructions.
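
A minimal sketch of what that suggestion could look like, using the same operand names as the macros above (pB for the B pointer, v0/v1 for the A panel, v16-v31 for the accumulators); this is only an illustration of the idea, not code from this pull request:

// hypothetical fragment: keep the 8 B values of one iteration in four 64-bit
// d registers, so no fmov is needed to fill their upper halves
ldr d4, [pB]                    // B[0..1]
ldr d5, [pB, #8]                // B[2..3]
ldr d6, [pB, #16]               // B[4..5]
ldr d7, [pB, #24]               // B[6..7]
add pB, pB, #32
fmla v16.4s, v0.4s, v4.s[0]     // column 0, rows 0-3
fmla v17.4s, v1.4s, v4.s[0]     // column 0, rows 4-7
fmla v18.4s, v0.4s, v4.s[1]     // column 1 uses lane 1, still inside d4
fmla v19.4s, v1.4s, v4.s[1]
fmla v20.4s, v0.4s, v5.s[0]     // columns 2-7 continue with v5, v6 and v7
// ...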

@craft-zhang
Contributor Author

Sorry about my English; I hope I can explain this clearly.
Firstly, the A53's fmla pipeline issues the accumulate 4 cycles after the multiply; in other words, fmla cannot be dual-issued with any other SIMD instruction, and once another SIMD instruction is inserted it has to wait 4 cycles to issue. To amortize this, the micro-kernel needs the maximum ratio of computation to memory access, which is 8x8.
Secondly, fmov can be dual-issued with ld1, and fmla can be dual-issued with an ldr of a general-purpose register. So we only need to ensure that ld1 and fmov are fully dual-issued.
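
A condensed sketch of the pairings described above, with the same operand names as the kernel macros; which instructions actually dual-issue on the A53 is the assumption stated in this comment, not something verified here:

ldr x20, [pA], #16              // GPR load: can pair with the fmla below
fmla v16.4s, v0.4s, v4.s[0]
ld1 {v1.4s}, [pA], #16          // vector load: can pair with the fmov below
fmov v2.d[1], x20               // fills the upper half of v2 from the GPR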

@wjc404
Contributor

wjc404 commented May 18, 2020

Sorry for my stupid suggestions. So on the A53 there are no instructions operating on NEON registers (read or write) that can dual-issue with a 128-bit fmla, is that correct? (If this is true, there is no way to reach the theoretical peak FLOPS in GEMM; a pity...)

@craft-zhang
Contributor Author

craft-zhang commented May 18, 2020

This is something worth discussing. According to some of my experiments and inferences, it should be like this, but I can't guarantee it. :)

@craft-zhang craft-zhang marked this pull request as draft May 19, 2020 06:39
@craft-zhang craft-zhang marked this pull request as ready for review May 20, 2020 14:39
@craft-zhang craft-zhang changed the title Cortex a53 sgemm optimization Improve performance of SGEMM and STRMM on Arm Cortex-A53 May 20, 2020
@wjc404
Contributor

wjc404 commented May 21, 2020

Just one more stupid question: how many cycles does a post-indexed load instruction need to update its address register on the A53 (from issue to the register being available for a subsequent load)?

@craft-zhang
Contributor Author

Just one more stupid question: how many cycles does a post-indexed load instruction need to update its address register on the A53 (from issue to the register being available for a subsequent load)?

Sorry, I haven't measured the latency; maybe one or two cycles.
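
One hypothetical way to measure it (not run in this thread) is a loop of loads whose base register depends only on the previous load's post-index write-back; x0 points at any readable address and x2 holds the iteration count:

// each ldr writes back x0 = x0 + 0, and the next ldr uses x0 as its base,
// so the cycles per load are dominated by the write-back-to-address latency
1:
    ldr x1, [x0], #0
    ldr x1, [x0], #0
    ldr x1, [x0], #0
    ldr x1, [x0], #0
    subs x2, x2, #1
    b.ne 1b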

@martin-frbg martin-frbg merged commit f1a18d2 into OpenMathLib:develop May 25, 2020
@craft-zhang craft-zhang deleted the cortex-A53 branch June 4, 2020 06:06