
Conversation

yuanjia111
Contributor

1. The following changes were made:
(1) Adjusted LMUL from 2 to 8 (a macro-level sketch follows this description);
(2) Optimized the inc_x=1 case, mainly to increase parallel processing of data along the n direction;
(3) Adjusted the code formatting.

2. All BLAS tests passed:
[screenshots: BLAS test results]

3. Performance was verified on K1 [C908, vlen = 256].
(1) Using the built-in benchmark, the optimized performance data is as follows:
[screenshots: benchmark results]
(2) The complete performance comparison before and after optimization is as follows:
[screenshots: before/after comparison]
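
A macro-level sketch of what change (1) amounts to, assuming the usual OpenBLAS RVV macro scheme; the names below are illustrative, not the exact diff:

/* Illustrative sketch only, not the actual diff: the RVV kernels select
 * element width and LMUL through macros like these, and change (1) bumps
 * the LMUL suffix from m2 to m8 so each vector instruction covers four
 * times as many elements per iteration. */
#if !defined(DOUBLE)
#define VSETVL(n)        __riscv_vsetvl_e32m8(n)     /* was ..._e32m2 */
#define FLOAT_V_T        vfloat32m8_t                /* was vfloat32m2_t */
#define VLEV_FLOAT       __riscv_vle32_v_f32m8
#define VFMULVV_FLOAT    __riscv_vfmul_vv_f32m8
#endif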

@yuanjia111 yuanjia111 changed the title Optimize the gemv_t_vector.c kernel for RISCV64_ZVL256B targets Optimize the gemv_t_vector.c kernel for RISCV64_ZVL256B target Aug 22, 2025
@martin-frbg martin-frbg added this to the 0.3.31 milestone Aug 24, 2025
@ChipKerchner
Contributor

Great job!

@martin-frbg martin-frbg merged commit da7d0f4 into OpenMathLib:develop Aug 25, 2025
91 of 95 checks passed
@ChipKerchner
Contributor

I did some performance testing outside of the OpenBLAS benchmarks and I'm seeing a 4X slowdown from this MR.

Before:
./testOpenBLAS 1 1 1 1023 1023 1023 100
Testing LLVM FP32 GEMV 1 1 1 1023 1023 1023 100  2.0  4.0  1

Total time =        112235786

After:
./testOpenBLAS 1 1 1 1023 1023 1023 100
Testing LLVM FP32 GEMV 1 1 1 1023 1023 1023 100  2.0  4.0  1

Total time =        443030260

vr1 = VFMULVV_FLOAT(va1, vx, gvl);
vr2 = VFMULVV_FLOAT(va2, vx, gvl);
vr3 = VFMULVV_FLOAT(va3, vx, gvl);
// Floating-point addition does not satisfy the associative law, that is, (a + b) + c ≠ a + (b + c),
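
(A minimal standalone C illustration of the non-associativity the comment above refers to; the values are chosen purely for demonstration:)

#include <stdio.h>

int main(void) {
    /* 1.0f is below the rounding granularity of float near 1e8 (one ulp
     * there is 8), so the grouping of the additions changes the result. */
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("(a + b) + c = %f\n", (a + b) + c);  /* prints 1.000000 */
    printf("a + (b + c) = %f\n", a + (b + c)); /* prints 0.000000 */
    return 0;
}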
Contributor

Have you tried ordered (vs unordered) reduction sums?

Contributor Author

Yes, I tried vfredosum. Like vfredusum, it has precision issues and failed the correctness verification.

Contributor Author

I'd like to ask whether you enabled -O3 optimization when compiling. Could you provide the test code and test procedure so that I can reproduce the problem? Such negative optimization shouldn't happen; the performance degradation is too severe.

Contributor

@sergei-lewis sergei-lewis Aug 27, 2025

Note that when building for rvv 1.0 targets, the inner loop in this kernel is unrolled four times, and vector temporaries are used to hold the accumulator and the input:

#ifndef RISCV_0p10_INTRINSICS
...
            vec0 = VFMVVF_FLOAT_M1(0.0, vlmax);
            vec1 = VFMVVF_FLOAT_M1(0.0, vlmax);
            vec2 = VFMVVF_FLOAT_M1(0.0, vlmax);
            vec3 = VFMVVF_FLOAT_M1(0.0, vlmax);
...
            for (k = 0; k < m / gvl; k++) {
                va0 = VLEV_FLOAT(a_ptrs[0] + j, gvl);
                va1 = VLEV_FLOAT(a_ptrs[1] + j, gvl);
                va2 = VLEV_FLOAT(a_ptrs[2] + j, gvl);
                va3 = VLEV_FLOAT(a_ptrs[3] + j, gvl);

This means 8 vector temporaries in total. However, there are only 32 architectural vector registers, and with LMUL=8 each temporary occupies 8 of them, for 64 registers' worth of live state; stack spills, and with them catastrophic performance, are therefore unavoidable at LMUL=8.

The code block that is /not/ using rvv 1.0 intrinsics is not unrolled in this way. Do you perhaps not have RISCV_0p10_INTRINSICS defined in your build?

Some of the other kernels use a different LMUL depending on build parameters. Perhaps that is the way forward here: LMUL=2 when the rvv 1.0 path is in use, LMUL=8 when it is not?
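
A rough sketch of what that could look like, reusing the RISCV_0p10_INTRINSICS guard the kernel already has; the macro and intrinsic names here are illustrative assumptions, not the final patch:

/* Sketch only: pick LMUL per build path. The unrolled rvv 1.0 path keeps
 * 8 vector temporaries live, so LMUL=2 lets them all fit into the 32
 * architectural registers; the non-unrolled 0.10 path can afford LMUL=8. */
#ifdef RISCV_0p10_INTRINSICS
#define VSETVL(n)   vsetvl_e32m8(n)           /* 0.10 intrinsics, LMUL=8 */
#define FLOAT_V_T   vfloat32m8_t
#else
#define VSETVL(n)   __riscv_vsetvl_e32m2(n)   /* rvv 1.0 path, LMUL=2 */
#define FLOAT_V_T   vfloat32m2_t
#endif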

Contributor Author

[screenshot: compilation log] This is my compilation log. You can see that rvv 1.0 is used. However, I found that -O2 is enabled by default when compiling OpenBLAS, so I will compile with -O3 and add a performance comparison test.

Contributor

@ChipKerchner ChipKerchner Aug 27, 2025

@yuanjia111 What tests fail if you move the vfredusum_vs out of the inner loop? gemv_t_rvv.c doesn't have it in the loop and is used by some platforms.

I had a similar problem for PowerPC with GEMV (for BF16). It was where the beta value was applied that caused the issue.
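
For reference, a sketch of the suggested pattern (as used in gemv_t_rvv.c): accumulate with fused multiply-adds inside the loop and reduce once at loop exit. The macro names follow the kernel's conventions but are illustrative here:

/* Sketch only: vector accumulator, single reduction after the loop. */
vr0 = VFMVVF_FLOAT(0.0, gvl);                 /* zero the accumulator */
for (k = 0; k < m / gvl; k++) {
    va0 = VLEV_FLOAT(a_ptr + j, gvl);         /* slice of the A column */
    vx  = VLEV_FLOAT(x_ptr + j, gvl);         /* matching slice of x */
    vr0 = VFMACCVV_FLOAT(vr0, va0, vx, gvl);  /* vr0 += va0 * vx */
    j += gvl;
}
/* one vfredusum at loop exit instead of one per iteration */
v_res = VFREDSUM_FLOAT(vr0, v_zero_m1, gvl);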

Contributor Author

@ChipKerchner
(1) My gcc version is: gcc version 14.2.1 20240801 (openEuler 14.2.1-6.oe2403sp1)
[screenshot: gcc version]
(2) If I move the vfredusum_vs out of the inner loop, utest reports an error, as follows:
[screenshot: utest failure]
It turns out that the checksum here (accurate to 12 decimal places) fails:
[screenshot: failing checksum]

Contributor Author

@sergei-lewis I added prints to the code and found that it takes the rvv 1.0 branch, not RISCV_0p10_INTRINSICS. I will take some time today to analyze the inconsistency between my performance results and yours.

Contributor Author

@ChipKerchner, thank you, I will test with the manual unrolling removed from gemv_t_vector.c.

Contributor Author

@ChipKerchner, I retested and found that removing the manual four-way loop unrolling in gemv_t_vector.c and simply changing LMUL to 8 gave the best performance. (Comparison scenarios: manual four-way unrolling with LMUL=4; manual four-way unrolling with LMUL=8; no manual unrolling with LMUL=2.)
@ChipKerchner @sergei-lewis @martin-frbg, thank you very much for your suggestions and discussion!

vec0 = VFREDSUM_FLOAT(vr0, vec0, gvl);
vec1 = VFREDSUM_FLOAT(vr1, vec1, gvl);
vec2 = VFREDSUM_FLOAT(vr2, vec2, gvl);
vec3 = VFREDSUM_FLOAT(vr3, vec3, gvl);
Contributor

Part of the issue is register spillage of the vectors. It looks like you have more than 32 architectural vector registers' worth in play here. You shouldn't use more than m4 (instead of m8).

Contributor

@ChipKerchner ChipKerchner Aug 26, 2025

This is the inner loop with m4:

.LBB1_7:                                #   Parent Loop BB1_4 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vsetvli zero, a0, e32, m4, ta, ma
        vle32.v v16, (s0)
        add     s1, s0, t5
        vle32.v v20, (s1)
        add     s1, s0, t6
        vle32.v v24, (s1)
        add     s1, s0, s6
        vle32.v v28, (a2)
        add     s8, s8, t4
        addi    s3, s3, 1
        add     s0, s0, s7
        vle32.v v4, (s1)
        vfmul.vv        v16, v16, v28
        vfmul.vv        v20, v20, v28
        vfmul.vv        v24, v24, v28
        vfmul.vv        v28, v4, v28
        vfredusum.vs    v12, v16, v12
        vfredusum.vs    v11, v20, v11
        vfredusum.vs    v10, v24, v10
        vfredusum.vs    v9, v28, v9
        add     a2, a2, s7
        bltu    s3, t2, .LBB1_7

With m8 (see how bad it is):

.LBB1_8:                                #   Parent Loop BB1_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        csrr    s0, vlenb
        li      s4, 24
        mul     s0, s0, s4
        add     s0, s0, sp
        addi    s0, s0, 16
        vs1r.v  v18, (s0)                       # Unknown-size Folded Spill
        csrr    s0, vlenb
        li      s4, 27
        mul     s0, s0, s4
        add     s0, s0, sp
        addi    s0, s0, 16
        vs1r.v  v17, (s0)                       # Unknown-size Folded Spill
        csrr    s0, vlenb
        li      s4, 25
        mul     s0, s0, s4
        add     s0, s0, sp
        addi    s0, s0, 16
        vs1r.v  v24, (s0)                       # Unknown-size Folded Spill
        csrr    s0, vlenb
        li      s4, 26
        mul     s0, s0, s4
        add     s0, s0, sp
        addi    s0, s0, 16
        vs1r.v  v9, (s0)                        # Unknown-size Folded Spill
        vsetvli zero, a0, e32, m8, ta, ma
        vle32.v v16, (s1)
        add     s0, s1, t5
        vle32.v v24, (s0)
        add     s0, s1, t6
        vle32.v v0, (a2)
        vle32.v v8, (s0)
        csrr    s0, vlenb
        slli    s0, s0, 4
        add     s0, s0, sp
        addi    s0, s0, 16
        vs8r.v  v8, (s0)                        # Unknown-size Folded Spill
        add     s0, s1, s6
        vle32.v v8, (s0)
        csrr    s0, vlenb
        slli    s0, s0, 3
        add     s0, s0, sp
        addi    s0, s0, 16
        vs8r.v  v8, (s0)                        # Unknown-size Folded Spill
        vfmul.vv        v16, v16, v0
        vfmul.vv        v24, v24, v0
        csrr    s0, vlenb
        slli    s0, s0, 4
        add     s0, s0, sp
        addi    s0, s0, 16
        vl8r.v  v8, (s0)                        # Unknown-size Folded Reload
        vfmul.vv        v8, v8, v0
        csrr    s0, vlenb
        slli    s0, s0, 4
        add     s0, s0, sp
        addi    s0, s0, 16
        vs8r.v  v8, (s0)                        # Unknown-size Folded Spill
        csrr    s0, vlenb
        slli    s0, s0, 3
        add     s0, s0, sp
        addi    s0, s0, 16
        vl8r.v  v8, (s0)                        # Unknown-size Folded Reload
        vfmul.vv        v8, v8, v0
        csrr    s0, vlenb
        li      s4, 27
        mul     s0, s0, s4
        add     s0, s0, sp
        addi    s0, s0, 16
        vl1r.v  v7, (s0)                        # Unknown-size Folded Reload
        vfredusum.vs    v7, v16, v7
        csrr    s0, vlenb
        li      s4, 24
        mul     s0, s0, s4
        add     s0, s0, sp
        addi    s0, s0, 16
        vl1r.v  v18, (s0)                       # Unknown-size Folded Reload
        csrr    s0, vlenb
        li      s4, 27
        mul     s0, s0, s4
        add     s0, s0, sp
        addi    s0, s0, 16
        vs1r.v  v7, (s0)                        # Unknown-size Folded Spill
        csrr    s0, vlenb
        li      s4, 27
        mul     s0, s0, s4
        add     s0, s0, sp
        addi    s0, s0, 16
        vl1r.v  v17, (s0)                       # Unknown-size Folded Reload
        vfredusum.vs    v18, v24, v18
        csrr    s0, vlenb
        li      s4, 25
        mul     s0, s0, s4
        add     s0, s0, sp
        addi    s0, s0, 16
        vl1r.v  v24, (s0)                       # Unknown-size Folded Reload
        csrr    s0, vlenb
        slli    s0, s0, 4
        add     s0, s0, sp
        addi    s0, s0, 16
        vl8r.v  v0, (s0)                        # Unknown-size Folded Reload
        vfredusum.vs    v24, v0, v24
        csrr    s0, vlenb
        li      s4, 26
        mul     s0, s0, s4
        add     s0, s0, sp
        addi    s0, s0, 16
        vl1r.v  v16, (s0)                       # Unknown-size Folded Reload
        vfredusum.vs    v16, v8, v16
        csrr    s0, vlenb
        li      s4, 26
        mul     s0, s0, s4
        add     s0, s0, sp
        addi    s0, s0, 16
        vs1r.v  v16, (s0)                       # Unknown-size Folded Spill
        csrr    s0, vlenb
        li      s4, 26
        mul     s0, s0, s4
        add     s0, s0, sp
        addi    s0, s0, 16
        vl1r.v  v9, (s0)                        # Unknown-size Folded Reload
        add     s8, s8, t4
        addi    s3, s3, 1
        add     s1, s1, s7
        add     a2, a2, s7
        bltu    s3, t2, .LBB1_8

Contributor

BTW, even with m4, the code is slower than before.

Collaborator

ISTR that the benchmarks do not use non-trivial values of alpha and beta by default, which may be part of the difference (if I interpret the parameters of your private testOpenBLAS correctly).

Contributor Author

I considered the register reuse issue with LMUL=8, which is why I initially used m4 instead of m8. However, in a comparative test on the K1 [C908, vlen = 256], m8 performed better. The comparative test data is as follows:
[screenshots: m4 vs m8 comparison]

@ChipKerchner
Contributor

I'm also seeing register spillage in gemv_n_vector.c due to the manual 4-way unrolling and the use of LMUL = 8. This should be reworked as well.

@ChipKerchner
Contributor

ChipKerchner commented Aug 27, 2025

The best performance I've found so far comes from removing the manual unrolling in gemv_t_vector.c. gemv_t_rvv.c has better performance.
