Optimize the gemv_t_vector.c kernel for RISCV64_ZVL256B target #5427
Conversation
Great job!
I did some performance testing outside of the OpenBLAS benchmarks and I'm seeing a 4X slowdown for this MR.
vr1 = VFMULVV_FLOAT(va1, vx, gvl);
vr2 = VFMULVV_FLOAT(va2, vx, gvl);
vr3 = VFMULVV_FLOAT(va3, vx, gvl);
// Floating-point addition does not satisfy the associative law, that is, (a + b) + c ≠ a + (b + c),
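For illustration (not part of the PR), a minimal standalone C example of the non-associativity the comment refers to:

#include <stdio.h>

int main(void) {
    /* The small term survives only if the large terms cancel first. */
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    printf("%f\n", (a + b) + c);  /* prints 1.000000 */
    printf("%f\n", a + (b + c));  /* prints 0.000000: c is absorbed into b */
    return 0;
}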
Have you tried ordered (vs unordered) reduction sums?
Yes, I tried vfredosum. Like vfredusum, it has precision issues and failed the correctness verification.
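For reference, these are the two rvv 1.0 reduction intrinsics being compared (a sketch, assuming the current __riscv_-prefixed intrinsics API; the kernel itself reaches them through VFREDSUM_FLOAT-style macros):

#include <riscv_vector.h>

/* Unordered reduction: tree order, fast, rounding varies with VL. */
vfloat32m1_t sum_unordered(vfloat32m8_t vs2, vfloat32m1_t vs1, size_t vl) {
    return __riscv_vfredusum_vs_f32m8_f32m1(vs2, vs1, vl);
}

/* Ordered reduction: strict element 0..vl-1 order, matches a scalar
 * loop bit-for-bit, but is typically much slower. */
vfloat32m1_t sum_ordered(vfloat32m8_t vs2, vfloat32m1_t vs1, size_t vl) {
    return __riscv_vfredosum_vs_f32m8_f32m1(vs2, vs1, vl);
}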
I'd like to ask whether you enabled -O3 optimization when compiling? Or could you provide the test code and test process so that I can reproduce the problem? Negative optimization shouldn't happen; the performance degradation is too severe.
Note that when building for rvv 1.0 targets, the inner loop in this kernel is unrolled four times, and vector temporaries are used to hold the accumulator and the input:
#ifndef RISCV_0p10_INTRINSICS
...
vec0 = VFMVVF_FLOAT_M1(0.0, vlmax);
vec1 = VFMVVF_FLOAT_M1(0.0, vlmax);
vec2 = VFMVVF_FLOAT_M1(0.0, vlmax);
vec3 = VFMVVF_FLOAT_M1(0.0, vlmax);
...
for (k = 0; k < m / gvl; k++) {
va0 = VLEV_FLOAT(a_ptrs[0] + j, gvl);
va1 = VLEV_FLOAT(a_ptrs[1] + j, gvl);
va2 = VLEV_FLOAT(a_ptrs[2] + j, gvl);
va3 = VLEV_FLOAT(a_ptrs[3] + j, gvl);
This means 8 vector temporaries in total. However, there are only 32 architectural vector registers, and when LMUL=8 each temporary uses 8 architectural vector registers; it is therefore impossible to avoid stack spills and catastrophic performance with LMUL=8.
The code block that is /not/ using rvv 1.0 intrinsics is not unrolled in this way. Do you perhaps not have RISCV_0p10_INTRINSICS defined in your build?
Some of the other kernels use a different LMUL depending on build parameters. Perhaps that is the way forward here: LMUL=2 when the rvv 1.0 path is in use, LMUL=8 when it is not?
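A minimal sketch of that kind of build-dependent selection (the macro name is illustrative, not the kernel's actual one):

/* gemv_t_vector.c sketch: pick LMUL per code path. */
#ifdef RISCV_0p10_INTRINSICS
/* Non-unrolled path: a single accumulator fits comfortably at LMUL=8. */
#define GEMV_LMUL 8
#else
/* Four-way unrolled rvv 1.0 path: 8 live vector temporaries times LMUL
 * must stay within 32 architectural registers, so use LMUL=2. */
#define GEMV_LMUL 2
#endif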
@yuanjia111 What tests are failing if you remove the vfredusum_vs out of the inner loop? gemv_t_rvv.c doesn't have it in the loop and is used by some platforms.
I had a similar problem for PowerPC with GEMV (for BF16). It was where the beta value was applied that caused the issue.
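For comparison, the shape gemv_t_rvv.c-style code takes — accumulate with a fused multiply-add inside the loop, reduce once afterwards — looks roughly like this (a simplified sketch, not the literal kernel):

#include <riscv_vector.h>

/* Dot product of one unit-stride column of A with x. */
static float column_dot(const float *a_col, const float *x, size_t m) {
    size_t vlmax = __riscv_vsetvlmax_e32m8();
    vfloat32m8_t vacc = __riscv_vfmv_v_f_f32m8(0.0f, vlmax);
    for (size_t i = 0; i < m;) {
        size_t vl = __riscv_vsetvl_e32m8(m - i);
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a_col + i, vl);
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x + i, vl);
        /* Tail-undisturbed so a short final iteration does not clobber
         * accumulated lanes; no reduction inside the loop. */
        vacc = __riscv_vfmacc_vv_f32m8_tu(vacc, va, vx, vl);
        i += vl;
    }
    /* A single unordered reduction after the loop. */
    vfloat32m1_t vzero = __riscv_vfmv_v_f_f32m1(0.0f, 1);
    vfloat32m1_t vsum = __riscv_vfredusum_vs_f32m8_f32m1(vacc, vzero, vlmax);
    return __riscv_vfmv_f_s_f32m1_f32(vsum);
}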
@ChipKerchner
(1) My gcc version is: gcc version 14.2.1 20240801 (openEuler 14.2.1-6.oe2403sp1)
(2) If I remove the vfredusum_vs out of the inner loop, utest reports an error during the build, as follows:
It turns out that the checksum here (accurate to 12 decimal places) fails:
@sergei-lewis I added a print to the code and confirmed that the code takes the rvv 1.0 branch, not the RISCV_0p10_INTRINSICS one. I will take some time today to analyze the performance test and the inconsistency between your results and mine.
@ChipKerchner, thank you, I will test removing the manual unrolling in gemv_t_vector.c.
@ChipKerchner, I retested and found that removing the manual four-way loop unrolling in gemv_t_vector.c and simply changing LMUL to 8 gave the best performance. (Comparative test scenarios: manual four-way unrolling with LMUL=4; manual four-way unrolling with LMUL=8; no manual unrolling with LMUL=2.)
@ChipKerchner, @sergei-lewis, @martin-frbg, thank you very much for your suggestions and discussions!
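For the record, that winning combination — no manual unrolling, LMUL=8, reduction kept in the loop — implies an inner loop of roughly this shape (illustrative only, not the merged kernel):

#include <riscv_vector.h>

static float column_dot_lmul8(const float *a_col, const float *x, size_t m) {
    vfloat32m1_t vsum = __riscv_vfmv_v_f_f32m1(0.0f, 1);
    for (size_t i = 0; i < m;) {
        size_t vl = __riscv_vsetvl_e32m8(m - i);
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a_col + i, vl);
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x + i, vl);
        vfloat32m8_t vr = __riscv_vfmul_vv_f32m8(va, vx, vl);
        /* Per-chunk reduction preserves the summation order that the
         * utest checksum expects. */
        vsum = __riscv_vfredusum_vs_f32m8_f32m1(vr, vsum, vl);
        i += vl;
    }
    return __riscv_vfmv_f_s_f32m1_f32(vsum);
}

Only three m8 temporaries (va, vx, vr = 24 registers) plus the m1 accumulator are live at once, so the 32-register budget holds.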
vec0 = VFREDSUM_FLOAT(vr0, vec0, gvl);
vec1 = VFREDSUM_FLOAT(vr1, vec1, gvl);
vec2 = VFREDSUM_FLOAT(vr2, vec2, gvl);
vec3 = VFREDSUM_FLOAT(vr3, vec3, gvl);
Part of the issue is register spillage of the vectors. It looks like you have more than 32 architectural registers in play here. You shouldn't use more than m4 (instead of m8).
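A rough budget, going by the assembly below: the unrolled loop keeps four column vectors plus the shared x vector live (the products reuse the column registers), plus four LMUL=1 accumulators. At m4 that is 5 × 4 + 4 = 24 of the 32 architectural registers and it fits; at m8 it is 5 × 8 + 4 = 44 and spills are unavoidable.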
This is the inner loop with m4:
.LBB1_7: # Parent Loop BB1_4 Depth=1
# => This Inner Loop Header: Depth=2
vsetvli zero, a0, e32, m4, ta, ma
vle32.v v16, (s0)
add s1, s0, t5
vle32.v v20, (s1)
add s1, s0, t6
vle32.v v24, (s1)
add s1, s0, s6
vle32.v v28, (a2)
add s8, s8, t4
addi s3, s3, 1
add s0, s0, s7
vle32.v v4, (s1)
vfmul.vv v16, v16, v28
vfmul.vv v20, v20, v28
vfmul.vv v24, v24, v28
vfmul.vv v28, v4, v28
vfredusum.vs v12, v16, v12
vfredusum.vs v11, v20, v11
vfredusum.vs v10, v24, v10
vfredusum.vs v9, v28, v9
add a2, a2, s7
bltu s3, t2, .LBB1_7
With m8 (see how bad it is):
.LBB1_8: # Parent Loop BB1_5 Depth=1
# => This Inner Loop Header: Depth=2
csrr s0, vlenb
li s4, 24
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vs1r.v v18, (s0) # Unknown-size Folded Spill
csrr s0, vlenb
li s4, 27
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vs1r.v v17, (s0) # Unknown-size Folded Spill
csrr s0, vlenb
li s4, 25
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vs1r.v v24, (s0) # Unknown-size Folded Spill
csrr s0, vlenb
li s4, 26
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vs1r.v v9, (s0) # Unknown-size Folded Spill
vsetvli zero, a0, e32, m8, ta, ma
vle32.v v16, (s1)
add s0, s1, t5
vle32.v v24, (s0)
add s0, s1, t6
vle32.v v0, (a2)
vle32.v v8, (s0)
csrr s0, vlenb
slli s0, s0, 4
add s0, s0, sp
addi s0, s0, 16
vs8r.v v8, (s0) # Unknown-size Folded Spill
add s0, s1, s6
vle32.v v8, (s0)
csrr s0, vlenb
slli s0, s0, 3
add s0, s0, sp
addi s0, s0, 16
vs8r.v v8, (s0) # Unknown-size Folded Spill
vfmul.vv v16, v16, v0
vfmul.vv v24, v24, v0
csrr s0, vlenb
slli s0, s0, 4
add s0, s0, sp
addi s0, s0, 16
vl8r.v v8, (s0) # Unknown-size Folded Reload
vfmul.vv v8, v8, v0
csrr s0, vlenb
slli s0, s0, 4
add s0, s0, sp
addi s0, s0, 16
vs8r.v v8, (s0) # Unknown-size Folded Spill
csrr s0, vlenb
slli s0, s0, 3
add s0, s0, sp
addi s0, s0, 16
vl8r.v v8, (s0) # Unknown-size Folded Reload
vfmul.vv v8, v8, v0
csrr s0, vlenb
li s4, 27
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vl1r.v v7, (s0) # Unknown-size Folded Reload
vfredusum.vs v7, v16, v7
csrr s0, vlenb
li s4, 24
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vl1r.v v18, (s0) # Unknown-size Folded Reload
csrr s0, vlenb
li s4, 27
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vs1r.v v7, (s0) # Unknown-size Folded Spill
csrr s0, vlenb
li s4, 27
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vl1r.v v17, (s0) # Unknown-size Folded Reload
vfredusum.vs v18, v24, v18
csrr s0, vlenb
li s4, 25
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vl1r.v v24, (s0) # Unknown-size Folded Reload
csrr s0, vlenb
slli s0, s0, 4
add s0, s0, sp
addi s0, s0, 16
vl8r.v v0, (s0) # Unknown-size Folded Reload
vfredusum.vs v24, v0, v24
csrr s0, vlenb
li s4, 26
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vl1r.v v16, (s0) # Unknown-size Folded Reload
vfredusum.vs v16, v8, v16
csrr s0, vlenb
li s4, 26
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vs1r.v v16, (s0) # Unknown-size Folded Spill
csrr s0, vlenb
li s4, 26
mul s0, s0, s4
add s0, s0, sp
addi s0, s0, 16
vl1r.v v9, (s0) # Unknown-size Folded Reload
add s8, s8, t4
addi s3, s3, 1
add s1, s1, s7
add a2, a2, s7
bltu s3, t2, .LBB1_8
BTW, even with m4, the code is slower than before.
I seem to recall that the benchmarks do not use non-trivial values of alpha and beta by default, which may be part of the difference (if I interpret the parameters to your private testOpenBLAS correctly).
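For completeness, a correctness check that does exercise non-trivial alpha and beta on the transposed path could look like this (a standalone sketch against the CBLAS interface, not the project's benchmark driver):

#include <cblas.h>
#include <stdio.h>

int main(void) {
    /* 2x2 column-major A; computes y = alpha*A^T*x + beta*y. */
    float A[] = {1.0f, 2.0f, 3.0f, 4.0f};
    float x[] = {1.0f, 1.0f};
    float y[] = {10.0f, 20.0f};
    cblas_sgemv(CblasColMajor, CblasTrans, 2, 2, 2.5f, A, 2,
                x, 1, -1.5f, y, 1);
    /* Expected: y[0] = 2.5*(1+2) - 1.5*10 = -7.5
     *           y[1] = 2.5*(3+4) - 1.5*20 = -12.5 */
    printf("%f %f\n", y[0], y[1]);
    return 0;
}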
I'm also seeing register spillage in gemv_n_vector.c due to the manual 4-way unrolling and the use of LMUL=8. This should be reworked as well.
The best performance I've found so far comes from removing the manual unrolling in gemv_t_vector.c. gemv_t_rvv.c has better performance.
1. The following changes were made:
(1) Adjusted LMUL from 2 to 8;
(2) The optimization targets the inc_x = 1 case, mainly increasing the parallel processing of data along the n direction (see the scalar sketch at the end of this description);
(3) Adjusted the code format.
2. All BLAS tests passed:
[test result screenshots]
3. The performance was verified on K1 [C908, vlen = 256]:
(1) Using the built-in benchmark for testing, the optimized performance data is as follows:
[benchmark screenshots]
(2) The complete performance comparison data before and after optimization is as follows:
[performance comparison screenshots]
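As a scalar illustration of change (2) — the dispatch on inc_x, with the contiguous branch being the one this PR vectorizes at LMUL=8 (hypothetical helper, not the kernel's real signature):

#include <stddef.h>

static void gemv_t_sketch(size_t m, size_t n, float alpha,
                          const float *a, size_t lda,
                          const float *x, size_t inc_x, float *y)
{
    for (size_t j = 0; j < n; j++) {
        float sum = 0.0f;
        if (inc_x == 1) {
            /* Contiguous x: unit-stride loads, amenable to vle32/vfmul. */
            for (size_t i = 0; i < m; i++)
                sum += a[j * lda + i] * x[i];
        } else {
            /* Strided x: stays on the original code path. */
            for (size_t i = 0; i < m; i++)
                sum += a[j * lda + i] * x[i * inc_x];
        }
        y[j] += alpha * sum;
    }
}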