Optimize the gemv_t_vector.c kernel for RISCV64_ZVL256B target #5427
Conversation
Great job!
I did some performance testing outside of the OpenBLAS benchmarks and I'm seeing a 4X slowdown for this MR.
vr1 = VFMULVV_FLOAT(va1, vx, gvl);
vr2 = VFMULVV_FLOAT(va2, vx, gvl);
vr3 = VFMULVV_FLOAT(va3, vx, gvl);
// Floating-point addition does not satisfy the associative law, that is, (a + b) + c ≠ a + (b + c),
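For illustration (not part of the PR), a minimal standalone C example of the non-associativity the comment refers to:

#include <stdio.h>

int main(void) {
    /* The small term survives only if the large terms cancel first. */
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    printf("%f\n", (a + b) + c);  /* prints 1.000000 */
    printf("%f\n", a + (b + c));  /* prints 0.000000: c is absorbed into b */
    return 0;
}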
Have you tried ordered (vs unordered) reduction sums?
Yes, I tried vfredosum. Like vfredusum, it has precision issues and failed the correctness verification.
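For reference, these are the two rvv 1.0 reduction intrinsics being compared (a sketch, assuming the current __riscv_-prefixed intrinsics API; the kernel itself reaches them through VFREDSUM_FLOAT-style macros):

#include <riscv_vector.h>

/* Unordered reduction: tree order, fast, rounding varies with VL. */
vfloat32m1_t sum_unordered(vfloat32m8_t vs2, vfloat32m1_t vs1, size_t vl) {
    return __riscv_vfredusum_vs_f32m8_f32m1(vs2, vs1, vl);
}

/* Ordered reduction: strict element 0..vl-1 order, matches a scalar
 * loop bit-for-bit, but is typically much slower. */
vfloat32m1_t sum_ordered(vfloat32m8_t vs2, vfloat32m1_t vs1, size_t vl) {
    return __riscv_vfredosum_vs_f32m8_f32m1(vs2, vs1, vl);
}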
I'd like to ask whether you enabled -O3 optimization when compiling? Or could you provide the test code and test process so that I can reproduce the problem? Negative optimization shouldn't happen; the performance degradation is too severe.
Note that when building for rvv 1.0 targets, the inner loop in this kernel is unrolled four times, and vector temporaries are used to hold the accumulator and the input:
#ifndef RISCV_0p10_INTRINSICS
...
vec0 = VFMVVF_FLOAT_M1(0.0, vlmax);
vec1 = VFMVVF_FLOAT_M1(0.0, vlmax);
vec2 = VFMVVF_FLOAT_M1(0.0, vlmax);
vec3 = VFMVVF_FLOAT_M1(0.0, vlmax);
...
for (k = 0; k < m / gvl; k++) {
va0 = VLEV_FLOAT(a_ptrs[0] + j, gvl);
va1 = VLEV_FLOAT(a_ptrs[1] + j, gvl);
va2 = VLEV_FLOAT(a_ptrs[2] + j, gvl);
va3 = VLEV_FLOAT(a_ptrs[3] + j, gvl);
This means 8 vector temporaries in total. However, there are only 32 architectural vector registers, and when LMUL=8 each temporary uses 8 architectural vector registers; it is therefore impossible to avoid stack spills and catastrophic performance with LMUL=8.
The code block that is /not/ using rvv 1.0 intrinsics is not unrolled in this way. Do you perhaps not have RISCV_0p10_INTRINSICS defined in your build?
Some of the other kernels use a different LMUL depending on build parameters. Perhaps that is the way forward here: LMUL=2 when the rvv 1.0 path is in use, LMUL=8 when it is not?
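A minimal sketch of that kind of build-dependent selection (the macro name is illustrative, not the kernel's actual one):

/* gemv_t_vector.c sketch: pick LMUL per code path. */
#ifdef RISCV_0p10_INTRINSICS
/* Non-unrolled path: a single accumulator fits comfortably at LMUL=8. */
#define GEMV_LMUL 8
#else
/* Four-way unrolled rvv 1.0 path: 8 live vector temporaries times LMUL
 * must stay within 32 architectural registers, so use LMUL=2. */
#define GEMV_LMUL 2
#endif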
@yuanjia111 What tests are failing if you remove the vfredusum_vs out of the inner loop? gemv_t_rvv.c doesn't have it in the loop and is used by some platforms.
I had a similar problem for PowerPC with GEMV (for BF16). It was where the beta value was applied that caused the issue.
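For comparison, the shape gemv_t_rvv.c-style code takes — accumulate with a fused multiply-add inside the loop, reduce once afterwards — looks roughly like this (a simplified sketch, not the literal kernel):

#include <riscv_vector.h>

/* Dot product of one unit-stride column of A with x. */
static float column_dot(const float *a_col, const float *x, size_t m) {
    size_t vlmax = __riscv_vsetvlmax_e32m8();
    vfloat32m8_t vacc = __riscv_vfmv_v_f_f32m8(0.0f, vlmax);
    for (size_t i = 0; i < m;) {
        size_t vl = __riscv_vsetvl_e32m8(m - i);
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a_col + i, vl);
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x + i, vl);
        /* Tail-undisturbed so a short final iteration does not clobber
         * accumulated lanes; no reduction inside the loop. */
        vacc = __riscv_vfmacc_vv_f32m8_tu(vacc, va, vx, vl);
        i += vl;
    }
    /* A single unordered reduction after the loop. */
    vfloat32m1_t vzero = __riscv_vfmv_v_f_f32m1(0.0f, 1);
    vfloat32m1_t vsum = __riscv_vfredusum_vs_f32m8_f32m1(vacc, vzero, vlmax);
    return __riscv_vfmv_f_s_f32m1_f32(vsum);
}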
@ChipKerchner
(1) My gcc version is: gcc version 14.2.1 20240801 (openEuler 14.2.1-6.oe2403sp1)
(2) If I remove the vfredusum_vs out of the inner loop, utest reports an error during the build, as follows:
It turns out that the checksum here (accurate to 12 decimal places) fails:
@sergei-lewis I added a print to the code and confirmed that the code takes the rvv 1.0 branch, not the RISCV_0p10_INTRINSICS one. I will take some time today to analyze the performance test and the inconsistency between your results and mine.
@ChipKerchner, thank you, I will test removing the manual unrolling in gemv_t_vector.c.
@ChipKerchner, I retested and found that removing the manual four-way loop unrolling in gemv_t_vector.c and simply changing LMUL to 8 gave the best performance. (Comparative test scenarios: manual four-way unrolling with LMUL=4; manual four-way unrolling with LMUL=8; no manual unrolling with LMUL=2.)
@ChipKerchner, @sergei-lewis, @martin-frbg, thank you very much for your suggestions and discussions!
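For the record, that winning combination — no manual unrolling, LMUL=8, reduction kept in the loop — implies an inner loop of roughly this shape (illustrative only, not the merged kernel):

#include <riscv_vector.h>

static float column_dot_lmul8(const float *a_col, const float *x, size_t m) {
    vfloat32m1_t vsum = __riscv_vfmv_v_f_f32m1(0.0f, 1);
    for (size_t i = 0; i < m;) {
        size_t vl = __riscv_vsetvl_e32m8(m - i);
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a_col + i, vl);
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x + i, vl);
        vfloat32m8_t vr = __riscv_vfmul_vv_f32m8(va, vx, vl);
        /* Per-chunk reduction preserves the summation order that the
         * utest checksum expects. */
        vsum = __riscv_vfredusum_vs_f32m8_f32m1(vr, vsum, vl);
        i += vl;
    }
    return __riscv_vfmv_f_s_f32m1_f32(vsum);
}

Only three m8 temporaries (va, vx, vr = 24 registers) plus the m1 accumulator are live at once, so the 32-register budget holds.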
vec0 = VFREDSUM_FLOAT(vr0, vec0, gvl);
vec1 = VFREDSUM_FLOAT(vr1, vec1, gvl);
vec2 = VFREDSUM_FLOAT(vr2, vec2, gvl);
vec3 = VFREDSUM_FLOAT(vr3, vec3, gvl);
Part of the issue is register spillage of the vectors. It looks like you have more than 32 architectural registers in play here. You shouldn't use more than m4 (instead of m8).
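A rough budget, going by the assembly below: the unrolled loop keeps four column vectors plus the shared x vector live (the products reuse the column registers), plus four LMUL=1 accumulators. At m4 that is 5 × 4 + 4 = 24 of the 32 architectural registers and it fits; at m8 it is 5 × 8 + 4 = 44 and spills are unavoidable.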
This is the inner loop with m4:
.LBB1_7: # Parent Loop BB1_4 Depth=1
# => This Inner Loop Header: Depth=2
vsetvli zero, a0, e32, m4, ta, ma
vle32.v v16, (s0)
add s1, s0, t5
vle32.v v20, (s1)
add s1, s0, t6
vle32.v v24, (s1)
add s1, s0, s6
vle32.v v28, (a2)
add s8, s8, t4
addi s3, s3, 1
add s0, s0, s7
vle32.v v4, (s1)
vfmul.vv v16, v16, v28
vfmul.vv v20, v20, v28
vfmul.vv v24, v24, v28
vfmul.vv v28, v4, v28
vfredusum.vs v12, v16, v12
vfredusum.vs v11, v20, v11
vfredusum.vs v10, v24, v10
vfredusum.vs v9, v28, v9
add a2, a2, s7
bltu s3, t2, .LBB1_7
With m8 (see how bad it is):
.LBB1_8: # Parent Loop BB1_5 Depth=1
# => This Inner Loop Header: Depth=2
csrr s0, vlenb
li s4, 24
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vs1r.v v18, (s0) # Unknown-size Folded Spill
csrr s0, vlenb
li s4, 27
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vs1r.v v17, (s0) # Unknown-size Folded Spill
csrr s0, vlenb
li s4, 25
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vs1r.v v24, (s0) # Unknown-size Folded Spill
csrr s0, vlenb
li s4, 26
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vs1r.v v9, (s0) # Unknown-size Folded Spill
vsetvli zero, a0, e32, m8, ta, ma
vle32.v v16, (s1)
add s0, s1, t5
vle32.v v24, (s0)
add s0, s1, t6
vle32.v v0, (a2)
vle32.v v8, (s0)
csrr s0, vlenb
slli s0, s0, 4
add s0, s0, sp
addi s0, s0, 16
vs8r.v v8, (s0) # Unknown-size Folded Spill
add s0, s1, s6
vle32.v v8, (s0)
csrr s0, vlenb
slli s0, s0, 3
add s0, s0, sp
addi s0, s0, 16
vs8r.v v8, (s0) # Unknown-size Folded Spill
vfmul.vv v16, v16, v0
vfmul.vv v24, v24, v0
csrr s0, vlenb
slli s0, s0, 4
add s0, s0, sp
addi s0, s0, 16
vl8r.v v8, (s0) # Unknown-size Folded Reload
vfmul.vv v8, v8, v0
csrr s0, vlenb
slli s0, s0, 4
add s0, s0, sp
addi s0, s0, 16
vs8r.v v8, (s0) # Unknown-size Folded Spill
csrr s0, vlenb
slli s0, s0, 3
add s0, s0, sp
addi s0, s0, 16
vl8r.v v8, (s0) # Unknown-size Folded Reload
vfmul.vv v8, v8, v0
csrr s0, vlenb
li s4, 27
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vl1r.v v7, (s0) # Unknown-size Folded Reload
vfredusum.vs v7, v16, v7
csrr s0, vlenb
li s4, 24
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vl1r.v v18, (s0) # Unknown-size Folded Reload
csrr s0, vlenb
li s4, 27
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vs1r.v v7, (s0) # Unknown-size Folded Spill
csrr s0, vlenb
li s4, 27
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vl1r.v v17, (s0) # Unknown-size Folded Reload
vfredusum.vs v18, v24, v18
csrr s0, vlenb
li s4, 25
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vl1r.v v24, (s0) # Unknown-size Folded Reload
csrr s0, vlenb
slli s0, s0, 4
add s0, s0, sp
addi s0, s0, 16
vl8r.v v0, (s0) # Unknown-size Folded Reload
vfredusum.vs v24, v0, v24
csrr s0, vlenb
li s4, 26
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vl1r.v v16, (s0) # Unknown-size Folded Reload
vfredusum.vs v16, v8, v16
csrr s0, vlenb
li s4, 26
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vs1r.v v16, (s0) # Unknown-size Folded Spill
csrr s0, vlenb
li s4, 26
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vl1r.v v9, (s0) # Unknown-size Folded Reload
add s8, s8, t4
addi s3, s3, 1
add s1, s1, s7
add a2, a2, s7
bltu s3, t2, .LBB1_8
BTW, even with m4, the code is slower than before.
I seem to recall that the benchmarks do not use non-trivial values of alpha and beta by default, which may be part of the difference (if I interpret the parameters to your private testOpenBLAS correctly).
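For completeness, a correctness check that does exercise non-trivial alpha and beta on the transposed path could look like this (a standalone sketch against the CBLAS interface, not the project's benchmark driver):

#include <cblas.h>
#include <stdio.h>

int main(void) {
    /* 2x2 column-major A; computes y = alpha*A^T*x + beta*y. */
    float A[] = {1.0f, 2.0f, 3.0f, 4.0f};
    float x[] = {1.0f, 1.0f};
    float y[] = {10.0f, 20.0f};
    cblas_sgemv(CblasColMajor, CblasTrans, 2, 2, 2.5f, A, 2,
                x, 1, -1.5f, y, 1);
    /* Expected: y[0] = 2.5*(1+2) - 1.5*10 = -7.5
     *           y[1] = 2.5*(3+4) - 1.5*20 = -12.5 */
    printf("%f %f\n", y[0], y[1]);
    return 0;
}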
I'm also seeing register spillage in gemv_n_vector.c due to the manual 4-way unrolling and the use of LMUL=8. This should be reworked as well.
The best performance I've found so far comes from removing the manual unrolling in gemv_t_vector.c. gemv_t_rvv.c has better performance.
1. The following changes were made:
(1) Adjusted LMUL from 2 to 8;
(2) The optimization targets the inc_x = 1 case, mainly increasing the parallel processing of data along the n direction (see the scalar sketch at the end of this description);
(3) Adjusted the code format.
2. All BLAS tests passed:
[test result screenshots]
3. The performance was verified on K1 [C908, vlen = 256]:
(1) Using the built-in benchmark for testing, the optimized performance data is as follows:
[benchmark screenshots]
(2) The complete performance comparison data before and after optimization is as follows:
[performance comparison screenshots]
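As a scalar illustration of change (2) — the dispatch on inc_x, with the contiguous branch being the one this PR vectorizes at LMUL=8 (hypothetical helper, not the kernel's real signature):

#include <stddef.h>

static void gemv_t_sketch(size_t m, size_t n, float alpha,
                          const float *a, size_t lda,
                          const float *x, size_t inc_x, float *y)
{
    for (size_t j = 0; j < n; j++) {
        float sum = 0.0f;
        if (inc_x == 1) {
            /* Contiguous x: unit-stride loads, amenable to vle32/vfmul. */
            for (size_t i = 0; i < m; i++)
                sum += a[j * lda + i] * x[i];
        } else {
            /* Strided x: stays on the original code path. */
            for (size_t i = 0; i < m; i++)
                sum += a[j * lda + i] * x[i * inc_x];
        }
        y[j] += alpha * sum;
    }
}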