
[AArch64] Disable consecutive store merging when Neon is unavailable #111519


Merged
merged 5 commits into llvm:main from MacDue:sme_stores on Oct 11, 2024

Conversation

@MacDue (Member) commented on Oct 8, 2024

Lowering fixed-size BUILD_VECTORs without Neon may introduce stack spills, leading to more stores/reloads than if the stores were not merged. In some cases, it can also prevent the use of paired store instructions.

In the future, we may want to relax this restriction when SVE is available, but currently the SVE lowerings for BUILD_VECTOR are limited to a few specific cases.
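
For illustration (this IR is not from the PR; the function name is made up), the minimal shape that the store-merging combine targets is a pair of adjacent scalar stores of lane-0 extracts, which it can rewrite into a single <2 x float> store and hence a BUILD_VECTOR during lowering:

define void @merge_candidate(ptr %dst, <4 x float> %a, <4 x float> %b) {
  ; Two adjacent float stores whose values are lane-0 extracts; the combine
  ; may turn these into one <2 x float> (BUILD_VECTOR) store.
  %lo = extractelement <4 x float> %a, i64 0
  %hi = extractelement <4 x float> %b, i64 0
  %dst1 = getelementptr inbounds float, ptr %dst, i64 1
  store float %lo, ptr %dst, align 4
  store float %hi, ptr %dst1, align 4
  ret void
}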

@llvmbot (Member) commented on Oct 8, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Benjamin Maxwell (MacDue)

Changes

Lowering fixed-size BUILD_VECTORs without Neon may introduce stack spills, leading to more stores/reloads than if the stores were not merged. In some cases, it can also prevent the use of paired store instructions.

In the future, we may want to relax this restriction when SVE is available, but currently the SVE lowerings for BUILD_VECTOR are limited to a few specific cases.


Full diff: https://github.com/llvm/llvm-project/pull/111519.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+17)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.h (+1-10)
  • (added) llvm/test/CodeGen/AArch64/consecutive-stores-of-faddv.ll (+138)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 48e1b96d841efb..6e19bf1b4b175a 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -27879,6 +27879,23 @@ bool AArch64TargetLowering::isIntDivCheap(EVT VT, AttributeList Attr) const {
   return OptSize && !VT.isVector();
 }
 
+bool AArch64TargetLowering::canMergeStoresTo(unsigned AddressSpace, EVT MemVT,
+                                             const MachineFunction &MF) const {
+  // Avoid merging stores into fixed-length vectors when Neon is unavailable.
+  // Until we have more general SVE lowerings for BUILD_VECTOR this may
+  // introduce stack spills.
+  if (MemVT.isFixedLengthVector() && !Subtarget->isNeonAvailable())
+    return false;
+
+  // Do not merge to float value size (128 bytes) if no implicit
+  // float attribute is set.
+  bool NoFloat = MF.getFunction().hasFnAttribute(Attribute::NoImplicitFloat);
+
+  if (NoFloat)
+    return (MemVT.getSizeInBits() <= 64);
+  return true;
+}
+
 bool AArch64TargetLowering::preferIncOfAddToSubOfNot(EVT VT) const {
   // We want inc-of-add for scalars and sub-of-not for vectors.
   return VT.isScalarInteger();
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index 480bf60360bf55..04ab5d974ccbf0 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -848,16 +848,7 @@ class AArch64TargetLowering : public TargetLowering {
   bool isIntDivCheap(EVT VT, AttributeList Attr) const override;
 
   bool canMergeStoresTo(unsigned AddressSpace, EVT MemVT,
-                        const MachineFunction &MF) const override {
-    // Do not merge to float value size (128 bytes) if no implicit
-    // float attribute is set.
-
-    bool NoFloat = MF.getFunction().hasFnAttribute(Attribute::NoImplicitFloat);
-
-    if (NoFloat)
-      return (MemVT.getSizeInBits() <= 64);
-    return true;
-  }
+                        const MachineFunction &MF) const override;
 
   bool isCheapToSpeculateCttz(Type *) const override {
     return true;
diff --git a/llvm/test/CodeGen/AArch64/consecutive-stores-of-faddv.ll b/llvm/test/CodeGen/AArch64/consecutive-stores-of-faddv.ll
new file mode 100644
index 00000000000000..13ef983501d26f
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/consecutive-stores-of-faddv.ll
@@ -0,0 +1,138 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=aarch64-none-linux-gnu -mattr=+sve -O3 < %s -o - | FileCheck %s --check-prefixes=CHECK
+
+; Tests consecutive stores of @llvm.aarch64.sve.faddv. Within SDAG faddv is
+; lowered as a FADDV + EXTRACT_VECTOR_ELT (of lane 0). Stores of extracts can
+; be matched by DAGCombiner::mergeConsecutiveStores(), which we want to avoid in
+; some cases as it can lead to worse codegen.
+
+; TODO: A single `stp s0, s1, [x0]` may be preferred here.
+define void @consecutive_stores_pair(ptr noalias %dest0, ptr noalias %src0) {
+; CHECK-LABEL: consecutive_stores_pair:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x1]
+; CHECK-NEXT:    ld1w { z1.s }, p0/z, [x1, #1, mul vl]
+; CHECK-NEXT:    faddv s0, p0, z0.s
+; CHECK-NEXT:    faddv s1, p0, z1.s
+; CHECK-NEXT:    mov v0.s[1], v1.s[0]
+; CHECK-NEXT:    str d0, [x0]
+; CHECK-NEXT:    ret
+  %ptrue = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
+  %vscale = call i64 @llvm.vscale.i64()
+  %c4_vscale = shl i64 %vscale, 2
+  %src1 = getelementptr inbounds float, ptr %src0, i64 %c4_vscale
+  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
+  %vec0 = load <vscale x 4 x float>, ptr %src0, align 4
+  %vec1 = load <vscale x 4 x float>, ptr %src1, align 4
+  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec0)
+  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec1)
+  store float %reduce0, ptr %dest0, align 4
+  store float %reduce1, ptr %dest1, align 4
+  ret void
+}
+
+define void @consecutive_stores_quadruple(ptr noalias %dest0, ptr noalias %src0) {
+; CHECK-LABEL: consecutive_stores_quadruple:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x1]
+; CHECK-NEXT:    ld1w { z1.s }, p0/z, [x1, #1, mul vl]
+; CHECK-NEXT:    ld1w { z2.s }, p0/z, [x1, #2, mul vl]
+; CHECK-NEXT:    ld1w { z3.s }, p0/z, [x1, #3, mul vl]
+; CHECK-NEXT:    faddv s0, p0, z0.s
+; CHECK-NEXT:    faddv s1, p0, z1.s
+; CHECK-NEXT:    faddv s2, p0, z2.s
+; CHECK-NEXT:    mov v0.s[1], v1.s[0]
+; CHECK-NEXT:    faddv s3, p0, z3.s
+; CHECK-NEXT:    mov v2.s[1], v3.s[0]
+; CHECK-NEXT:    stp d0, d2, [x0]
+; CHECK-NEXT:    ret
+  %ptrue = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
+  %vscale = call i64 @llvm.vscale.i64()
+  %c4_vscale = shl i64 %vscale, 2
+  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
+  %dest2 = getelementptr inbounds i8, ptr %dest1, i64 4
+  %dest3 = getelementptr inbounds i8, ptr %dest2, i64 4
+  %src1 = getelementptr inbounds float, ptr %src0, i64 %c4_vscale
+  %src2 = getelementptr inbounds float, ptr %src1, i64 %c4_vscale
+  %src3 = getelementptr inbounds float, ptr %src2, i64 %c4_vscale
+  %vec0 = load <vscale x 4 x float>, ptr %src0, align 4
+  %vec1 = load <vscale x 4 x float>, ptr %src1, align 4
+  %vec2 = load <vscale x 4 x float>, ptr %src2, align 4
+  %vec3 = load <vscale x 4 x float>, ptr %src3, align 4
+  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec0)
+  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec1)
+  %reduce2 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec2)
+  %reduce3 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec3)
+  store float %reduce0, ptr %dest0, align 4
+  store float %reduce1, ptr %dest1, align 4
+  store float %reduce2, ptr %dest2, align 4
+  store float %reduce3, ptr %dest3, align 4
+  ret void
+}
+
+define void @consecutive_stores_pair_streaming_function(ptr noalias %dest0, ptr noalias %src0) #0 "aarch64_pstate_sm_enabled"  {
+; CHECK-LABEL: consecutive_stores_pair_streaming_function:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x1]
+; CHECK-NEXT:    ld1w { z1.s }, p0/z, [x1, #1, mul vl]
+; CHECK-NEXT:    faddv s0, p0, z0.s
+; CHECK-NEXT:    faddv s1, p0, z1.s
+; CHECK-NEXT:    stp s0, s1, [x0]
+; CHECK-NEXT:    ret
+  %ptrue = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
+  %vscale = call i64 @llvm.vscale.i64()
+  %c4_vscale = shl i64 %vscale, 2
+  %src1 = getelementptr inbounds float, ptr %src0, i64 %c4_vscale
+  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
+  %vec0 = load <vscale x 4 x float>, ptr %src0, align 4
+  %vec1 = load <vscale x 4 x float>, ptr %src1, align 4
+  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec0)
+  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec1)
+  store float %reduce0, ptr %dest0, align 4
+  store float %reduce1, ptr %dest1, align 4
+  ret void
+}
+
+define void @consecutive_stores_quadruple_streaming_function(ptr noalias %dest0, ptr noalias %src0) #0 "aarch64_pstate_sm_enabled" {
+; CHECK-LABEL: consecutive_stores_quadruple_streaming_function:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x1]
+; CHECK-NEXT:    ld1w { z1.s }, p0/z, [x1, #1, mul vl]
+; CHECK-NEXT:    ld1w { z2.s }, p0/z, [x1, #2, mul vl]
+; CHECK-NEXT:    ld1w { z3.s }, p0/z, [x1, #3, mul vl]
+; CHECK-NEXT:    faddv s0, p0, z0.s
+; CHECK-NEXT:    faddv s1, p0, z1.s
+; CHECK-NEXT:    faddv s2, p0, z2.s
+; CHECK-NEXT:    stp s0, s1, [x0]
+; CHECK-NEXT:    faddv s3, p0, z3.s
+; CHECK-NEXT:    stp s2, s3, [x0, #8]
+; CHECK-NEXT:    ret
+  %ptrue = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
+  %vscale = call i64 @llvm.vscale.i64()
+  %c4_vscale = shl i64 %vscale, 2
+  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
+  %dest2 = getelementptr inbounds i8, ptr %dest1, i64 4
+  %dest3 = getelementptr inbounds i8, ptr %dest2, i64 4
+  %src1 = getelementptr inbounds float, ptr %src0, i64 %c4_vscale
+  %src2 = getelementptr inbounds float, ptr %src1, i64 %c4_vscale
+  %src3 = getelementptr inbounds float, ptr %src2, i64 %c4_vscale
+  %vec0 = load <vscale x 4 x float>, ptr %src0, align 4
+  %vec1 = load <vscale x 4 x float>, ptr %src1, align 4
+  %vec2 = load <vscale x 4 x float>, ptr %src2, align 4
+  %vec3 = load <vscale x 4 x float>, ptr %src3, align 4
+  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec0)
+  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec1)
+  %reduce2 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec2)
+  %reduce3 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec3)
+  store float %reduce0, ptr %dest0, align 4
+  store float %reduce1, ptr %dest1, align 4
+  store float %reduce2, ptr %dest2, align 4
+  store float %reduce3, ptr %dest3, align 4
+  ret void
+}
+
+attributes #0 = { vscale_range(1, 16) "target-features"="+sve,+sme" }

@MacDue (Member, Author) commented on Oct 8, 2024

Note: See a635d7a for the difference between the precommit test output and the output with these changes.

@paulwalker-arm (Collaborator) left a comment:

When I rewrite consecutive_stores_quadruple_streaming_function to use llvm.vector.reduce.fadd instead of the target-specific intrinsic, the problem goes away. It's worth investigating whether we're missing a useful combine or something before flicking the canMergeStoresTo switch.

@MacDue (Member, Author) commented on Oct 8, 2024

> When I rewrite consecutive_stores_quadruple_streaming_function to use llvm.vector.reduce.fadd instead of the target-specific intrinsic, the problem goes away. It's worth investigating whether we're missing a useful combine or something before flicking the canMergeStoresTo switch.

Could you share your IR? (I've probably done something silly, but I'm hitting "Expanding reductions for scalable vectors is undefined." when simply subbing the operation https://godbolt.org/z/zvMEnnKzP).

That said, the reason it may appear fixed comes down to when DAGCombiner::mergeConsecutiveStores() applies. It only applies in a few specific cases: stores of constants, loads, or extracts. A normal scalar result is usually not an extract; however, for SVE operations such as sve.faddv, the result is modeled as a vector plus an extract of lane 0 (to model the lane zeroing). Because of this, the pattern can be matched by tryStoreMergeOfExtracts(), and the stores are merged.

@paulwalker-arm (Collaborator) commented:

define void @consecutive_stores_quadruple(ptr noalias %dest0, <vscale x 4 x float> %vec0, <vscale x 4 x float> %vec1, <vscale x 4 x float> %vec2, <vscale x 4 x float> %vec3) "aarch64_pstate_sm_enabled" {
  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
  %dest2 = getelementptr inbounds i8, ptr %dest1, i64 4
  %dest3 = getelementptr inbounds i8, ptr %dest2, i64 4
  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec0)
  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec1)
  %reduce2 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec2)
  %reduce3 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec3)
  store float %reduce0, ptr %dest0, align 4
  store float %reduce1, ptr %dest1, align 4
  store float %reduce2, ptr %dest2, align 4
  store float %reduce3, ptr %dest3, align 4
  ret void
}

define void @consecutive_stores_quadruple2(ptr noalias %dest0, <vscale x 4 x float> %vec0, <vscale x 4 x float> %vec1, <vscale x 4 x float> %vec2, <vscale x 4 x float> %vec3) "aarch64_pstate_sm_enabled" {
  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
  %dest2 = getelementptr inbounds i8, ptr %dest1, i64 4
  %dest3 = getelementptr inbounds i8, ptr %dest2, i64 4
  %reduce0 = call fast float @llvm.vector.reduce.fadd.f32.nxv4f32(float zeroinitializer, <vscale x 4 x float> %vec0)
  %reduce1 = call fast float @llvm.vector.reduce.fadd.f32.nxv4f32(float zeroinitializer, <vscale x 4 x float> %vec1)
  %reduce2 = call fast float @llvm.vector.reduce.fadd.f32.nxv4f32(float zeroinitializer, <vscale x 4 x float> %vec2)
  %reduce3 = call fast float @llvm.vector.reduce.fadd.f32.nxv4f32(float zeroinitializer, <vscale x 4 x float> %vec3)
  store float %reduce0, ptr %dest0, align 4
  store float %reduce1, ptr %dest1, align 4
  store float %reduce2, ptr %dest2, align 4
  store float %reduce3, ptr %dest3, align 4
  ret void
}

@MacDue
Copy link
Member Author

MacDue commented Oct 8, 2024

Yeah, so with consecutive_stores_quadruple() for each @llvm.aarch64.sve.faddv.nxv4f32 you get something like:

t33: nxv4f32 = AArch64ISD::FADDV_PRED t20, t10
t35: f32 = extract_vector_elt t33, Constant:i64<0>

Which matches tryStoreMergeOfExtracts().

With consecutive_stores_quadruple2() for each @llvm.vector.reduce.fadd.f32.nxv4f32 you get:

t20: f32 = vecreduce_fadd nnan ninf nsz arcp contract afn reassoc t8

Which does not match any of the store merging patterns.
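
For reference, here is a simplified sketch (not the actual DAGCombiner source; the helper name is made up) of how the stored value is classified before a merge is attempted — only these three buckets are considered, which is why a lane-0 extract is picked up while a plain vecreduce_fadd scalar is not:

#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Hypothetical helper mirroring the classification mergeConsecutiveStores()
// performs on the stored value before attempting a merge.
enum class StoreSource { Unknown, Constant, Extract, Load };

static StoreSource classifyStoredValue(SDValue Val) {
  switch (Val.getOpcode()) {
  case ISD::Constant:
  case ISD::ConstantFP:
    return StoreSource::Constant;   // merged as a constant vector store
  case ISD::EXTRACT_VECTOR_ELT:
  case ISD::EXTRACT_SUBVECTOR:
    return StoreSource::Extract;    // handled by tryStoreMergeOfExtracts()
  case ISD::LOAD:
    return StoreSource::Load;       // merged as a wider load + store
  default:
    return StoreSource::Unknown;    // e.g. vecreduce_fadd: never merged
  }
}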

MacDue added 4 commits October 8, 2024 13:53
Lowering fixed-size BUILD_VECTORS without Neon may introduce stack
spills, leading to more stores/reloads than if the stores were not
merged. In some cases, it can also prevent using paired store
instructions.

In the future, we may want to relax when SVE is available, but
currently, the SVE lowerings for BUILD_VECTOR are limited to a few
specific cases.
Comment on lines +27929 to +27932
// Avoid merging stores into fixed-length vectors when Neon is unavailable.
// In future, we could allow this when SVE is available, but currently,
// the SVE lowerings for BUILD_VECTOR are limited to a few specific cases (and
// the general lowering may introduce stack spills/reloads).
A Collaborator commented:

[Just thinking out loud here] My understanding is that, for the example in the test, the reason we don't want to do this optimisation is that we can use stp instructions instead; there is no upside to merging the stores, while there is a possible downside: the insert operation is expensive. At the moment it is expensive because we use a spill/reload, but for streaming[-compatible] SVE we could implement the operation using the SVE INSR instruction, which may be no less efficient than the NEON operation if the value being inserted is also in an FPR/SIMD register. Given the lack of upside, disabling the merging of stores avoids this complexity altogether, which is understandably the route chosen here.

I guess the question is: for which cases is merging stores beneficial when NEON is available? And for those cases, can we implement them efficiently using Streaming SVE?

@MacDue (Member, Author) replied:

I think there are two (slightly) independent cases here. There's unwanted store merging (for non-streaming functions), because we could just use a stp instead. E.g.

mov v0.s[1], v1.s[0]
str d0, [x0]

->

stp s0, s1, [x0]

That's not fixed in this PR.

Then there's streaming-mode store merging, which results in stack spills due to the BUILD_VECTOR lowering. Disabling store merging means that, in some cases, we use a preferable stp in streaming mode, but that's a secondary goal here; the main aim is to avoid the stack spills.

As for a streaming-mode/SVE BUILD_VECTOR lowering, I think there are a few options, but likely not as efficient as NEON (though maybe others have better ideas 😄).

E.g. for <4 x float>:

You could make a chain of INSR:

insr    z3.s, s2
insr    z3.s, s1
insr    z3.s, s0
str     q3, [x0]

But INSR has a higher latency than a MOV. Also, there is a dependency chain here, as each INSR depends on the previous one.

Another option is a chain of ZIP1:

zip1    z2.s, z2.s, z3.s
zip1    z0.s, z0.s, z1.s
zip1    z0.d, z0.d, z2.d
str     q0, [x0]

This seems like it may be more efficient than INSR, and also allows for a shorter dependency chain (log n depth), but it is still likely not as efficient as just MOVs.

@sdesmalen-arm (Collaborator) replied:

I agree, ZIP would indeed be a (much) better choice than INSR. I'm also happy with the intent of this PR. The part that isn't entirely clear to me yet is for which cases we'd want to enable this merging when we do have optimal SVE codegen.

A Collaborator commented:

FWIW, I just manually tried the <4 x float> case: https://godbolt.org/z/MYzrdahjh
I'd say that the SVE version using zip1 is no less efficient than this.

@MacDue (Member, Author) replied:

> The part that isn't entirely clear to me yet is for which cases we'd want to enable this merging when we do have optimal SVE codegen.

I'm not sure either, but the original commit (from way back when) for the DAG combine only handled stores of constants and loads, and noted that it's generally not profitable (see: 7cbc12a). The way it merges stores of extracts here is a little odd and maybe unintentional? It would make more sense to do the merge if it were storing lanes from the same vector, which is not the case here.
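
For contrast, the case described above — storing lanes of the same vector — would look roughly like this (illustrative IR, not from this PR; the function name is made up), and merging it needs no BUILD_VECTOR at all since both values already live in one register:

define void @same_vector_lanes(ptr %dst, <4 x float> %v) {
  ; Both stored scalars are lanes of the same vector, so a merged store is
  ; simply a store of the low 64 bits of %v.
  %lane0 = extractelement <4 x float> %v, i64 0
  %lane1 = extractelement <4 x float> %v, i64 1
  %dst1 = getelementptr inbounds float, ptr %dst, i64 1
  store float %lane0, ptr %dst, align 4
  store float %lane1, ptr %dst1, align 4
  ret void
}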

@sdesmalen-arm (Collaborator) left a comment:

LGTM with nits addressed.


@MacDue merged commit c3a10dc into llvm:main on Oct 11, 2024
9 checks passed
@MacDue deleted the sme_stores branch on October 11, 2024 13:15
ichaer added a commit to splunk/ichaer-llvm-project that referenced this pull request Oct 11, 2024
…ent-indentonly

* llvm-trunk/main: (6379 commits), including [AArch64] Disable consecutive store merging when Neon is unavailable (llvm#111519)
DanielCChen pushed a commit to DanielCChen/llvm-project that referenced this pull request Oct 16, 2024
…lvm#111519)
