Implementing SVE in `[SD]AXPY` Kernels for `A64FX` and `Graviton3E` #5426

hideaki-motoki · 2025-08-21T13:25:38Z

Resolves #5417.
This change improves the performance of [SD]AXPY on both A64FX and Graviton3E.
The graphs below show the single-threaded performance improvement of [D]AXPY on A64FX and Graviton3E, respectively.

Performance improved by a factor of 2.57 on A64FX and by a factor of 1.13 on Graviton3E.
I have confirmed that this optimization also benefits Level 2 BLAS kernels that call [SD]AXPY, such as [SD]SPMV and [SD]GER.
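To illustrate why a faster AXPY carries over to a Level 2 routine such as GER, here is a minimal sketch of a rank-1 update written as one AXPY per column. It assumes column-major storage and unit strides. The names `axpy_ref` and `ger_via_axpy` are hypothetical and are not taken from this PR or from OpenBLAS.

```c
#include <stddef.h>

/* Hypothetical scalar stand-in for the DAXPY kernel: y[i] += a * x[i]. */
static void axpy_ref(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Rank-1 update A += alpha * x * y^T for an m x n column-major matrix
 * with leading dimension lda. Column j receives (alpha * y[j]) * x,
 * which is exactly one AXPY of length m. Unit strides are assumed. */
static void ger_via_axpy(size_t m, size_t n, double alpha,
                         const double *x, const double *y,
                         double *a, size_t lda)
{
    for (size_t j = 0; j < n; j++)
        axpy_ref(m, alpha * y[j], x, a + j * lda);
}
```

In this formulation all floating-point work happens inside the AXPY call, so any speedup of that kernel applies to each column update.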

snadampal · 2025-08-21T16:29:36Z

kernel/arm64/KERNEL.NEOVERSEV1

@@ -32,6 +32,10 @@ SGEMVNKERNEL = gemv_n_sve_v1x3.c
 DGEMVNKERNEL = gemv_n_sve_v1x3.c
 SGEMVTKERNEL = gemv_t_sve_v1x3.c
 DGEMVTKERNEL = gemv_t_sve_v1x3.c
+
+SAXPYKERNEL = axpy_sve.c
+DAXPYKERNEL = axpy_sve.c


Since you used the SVL in the implementation instead of hardcoding the vector width, the kernel should work on NEOVERSEV2 as well. Please check this on Graviton4 and add it to KERNEL.NEOVERSEV2 too.

hideaki-motoki · 2025-08-26

I ran the performance evaluation on a Grace system with Neoverse V2 cores, since I did not have access to a Graviton4. AXPY with SVE showed no significant performance improvement over the original version.

This graph shows the single-threaded performance of DAXPY on Grace. For this pull request, there is little advantage to enabling the SVE kernel on Neoverse V2.
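For readers unfamiliar with the vector-length-agnostic style discussed in this thread, the following is a minimal sketch of such a DAXPY loop. It is not the code in `axpy_sve.c`: the function name `daxpy_vla_sketch` is hypothetical, unit stride is assumed, and a scalar fallback is included so the sketch also compiles on targets without SVE.

```c
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical sketch: y[i] += a * x[i] for i in [0, n), unit stride only. */
static void daxpy_vla_sketch(int64_t n, double a, const double *x, double *y)
{
    if (n <= 0 || a == 0.0)
        return; /* nothing to do */

#if defined(__ARM_FEATURE_SVE)
    /* The step is queried at run time with svcntd(), so the same binary
     * adapts to whatever vector length the hardware provides. */
    svfloat64_t va = svdup_f64(a);
    for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {
        /* The predicate masks off lanes at or beyond n, so no scalar tail loop is needed. */
        svbool_t pg = svwhilelt_b64(i, n);
        svfloat64_t vx = svld1_f64(pg, x + i);
        svfloat64_t vy = svld1_f64(pg, y + i);
        vy = svmla_f64_x(pg, vy, vx, va); /* vy + vx * va */
        svst1_f64(pg, y + i, vy);
    }
#else
    /* Scalar fallback for targets without SVE. */
    for (int64_t i = 0; i < n; i++)
        y[i] += a * x[i];
#endif
}
```

Because nothing in the loop hardcodes a vector width, the code is portable across SVE implementations, but as the Grace measurement above shows, portability alone does not guarantee a speedup on a given core.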

snadampal · 2025-08-21T16:30:13Z

kernel/arm64/axpy_sve.c

+  BLASLONG sve_size = SV_COUNT();
+
+  if (n < 0) return (0);
+  if (da == 0.0) return (0);


Why can't these two checks be combined into one?

hideaki-motoki · 2025-08-22

Thank you for your comments.
The checks could be combined as you suggest, but I followed the existing code in kernel/arm/axpy.c#L45-L46.
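As a small illustration of the two styles discussed here, the sketch below expresses the early-exit condition both ways. The helper names are hypothetical and only model the condition; the actual kernel returns 0 directly.

```c
#include <stdbool.h>

/* Style used in the PR (and in kernel/arm/axpy.c): two separate checks. */
static bool skip_two_checks(long n, double da)
{
    if (n < 0) return true;
    if (da == 0.0) return true;
    return false;
}

/* Style suggested in review: one combined condition. */
static bool skip_combined(long n, double da)
{
    return n < 0 || da == 0.0;
}
```

Both forms take the early exit for exactly the same inputs, so the choice is about readability and consistency with neighboring kernels rather than behavior.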

snadampal · 2025-08-21T16:34:56Z

Hi @hideaki-motoki, thanks for the PR! I have added a few comments.

hideaki-motoki added 2 commits August 21, 2025 20:56

Implementing SVE in [SD]AXPY Kernels for A64FX and Graviton3E

855945b

Merge remote-tracking branch 'upstream/develop' into issue5417_axpy_sve

e23f9c6

martin-frbg added this to the 0.3.31 milestone Aug 21, 2025

snadampal reviewed Aug 21, 2025

martin-frbg merged commit 06c09de into OpenMathLib:develop Aug 26, 2025
98 of 103 checks passed
