
[AArch64] Disable consecutive store merging when Neon is unavailable #111519


Merged
merged 5 commits into llvm:main from MacDue:sme_stores on Oct 11, 2024

Conversation

@MacDue (Member) commented on Oct 8, 2024

Lowering fixed-size BUILD_VECTORs without Neon may introduce stack spills, leading to more stores/reloads than if the stores were not merged. In some cases, it can also prevent the use of paired store instructions.

In the future, we may want to relax this restriction when SVE is available, but currently the SVE lowerings for BUILD_VECTOR are limited to a few specific cases.
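
For illustration (this IR is not from the PR; the function name is made up), the minimal shape that the store-merging combine targets is a pair of adjacent scalar stores of lane-0 extracts, which it can rewrite into a single <2 x float> store and hence a BUILD_VECTOR during lowering:

define void @merge_candidate(ptr %dst, <4 x float> %a, <4 x float> %b) {
  ; Two adjacent float stores whose values are lane-0 extracts; the combine
  ; may turn these into one <2 x float> (BUILD_VECTOR) store.
  %lo = extractelement <4 x float> %a, i64 0
  %hi = extractelement <4 x float> %b, i64 0
  %dst1 = getelementptr inbounds float, ptr %dst, i64 1
  store float %lo, ptr %dst, align 4
  store float %hi, ptr %dst1, align 4
  ret void
}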

@llvmbot (Member) commented on Oct 8, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Benjamin Maxwell (MacDue)

Changes

Lowering fixed-size BUILD_VECTORs without Neon may introduce stack spills, leading to more stores/reloads than if the stores were not merged. In some cases, it can also prevent the use of paired store instructions.

In the future, we may want to relax this restriction when SVE is available, but currently the SVE lowerings for BUILD_VECTOR are limited to a few specific cases.


Full diff: https://github.com/llvm/llvm-project/pull/111519.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+17)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.h (+1-10)
  • (added) llvm/test/CodeGen/AArch64/consecutive-stores-of-faddv.ll (+138)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 48e1b96d841efb..6e19bf1b4b175a 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -27879,6 +27879,23 @@ bool AArch64TargetLowering::isIntDivCheap(EVT VT, AttributeList Attr) const {
   return OptSize && !VT.isVector();
 }
 
+bool AArch64TargetLowering::canMergeStoresTo(unsigned AddressSpace, EVT MemVT,
+                                             const MachineFunction &MF) const {
+  // Avoid merging stores into fixed-length vectors when Neon is unavailable.
+  // Until we have more general SVE lowerings for BUILD_VECTOR this may
+  // introduce stack spills.
+  if (MemVT.isFixedLengthVector() && !Subtarget->isNeonAvailable())
+    return false;
+
+  // Do not merge to float value size (128 bytes) if no implicit
+  // float attribute is set.
+  bool NoFloat = MF.getFunction().hasFnAttribute(Attribute::NoImplicitFloat);
+
+  if (NoFloat)
+    return (MemVT.getSizeInBits() <= 64);
+  return true;
+}
+
 bool AArch64TargetLowering::preferIncOfAddToSubOfNot(EVT VT) const {
   // We want inc-of-add for scalars and sub-of-not for vectors.
   return VT.isScalarInteger();
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index 480bf60360bf55..04ab5d974ccbf0 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -848,16 +848,7 @@ class AArch64TargetLowering : public TargetLowering {
   bool isIntDivCheap(EVT VT, AttributeList Attr) const override;
 
   bool canMergeStoresTo(unsigned AddressSpace, EVT MemVT,
-                        const MachineFunction &MF) const override {
-    // Do not merge to float value size (128 bytes) if no implicit
-    // float attribute is set.
-
-    bool NoFloat = MF.getFunction().hasFnAttribute(Attribute::NoImplicitFloat);
-
-    if (NoFloat)
-      return (MemVT.getSizeInBits() <= 64);
-    return true;
-  }
+                        const MachineFunction &MF) const override;
 
   bool isCheapToSpeculateCttz(Type *) const override {
     return true;
diff --git a/llvm/test/CodeGen/AArch64/consecutive-stores-of-faddv.ll b/llvm/test/CodeGen/AArch64/consecutive-stores-of-faddv.ll
new file mode 100644
index 00000000000000..13ef983501d26f
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/consecutive-stores-of-faddv.ll
@@ -0,0 +1,138 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=aarch64-none-linux-gnu -mattr=+sve -O3 < %s -o - | FileCheck %s --check-prefixes=CHECK
+
+; Tests consecutive stores of @llvm.aarch64.sve.faddv. Within SDAG faddv is
+; lowered as a FADDV + EXTRACT_VECTOR_ELT (of lane 0). Stores of extracts can
+; be matched by DAGCombiner::mergeConsecutiveStores(), which we want to avoid in
+; some cases as it can lead to worse codegen.
+
+; TODO: A single `stp s0, s1, [x0]` may be preferred here.
+define void @consecutive_stores_pair(ptr noalias %dest0, ptr noalias %src0) {
+; CHECK-LABEL: consecutive_stores_pair:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x1]
+; CHECK-NEXT:    ld1w { z1.s }, p0/z, [x1, #1, mul vl]
+; CHECK-NEXT:    faddv s0, p0, z0.s
+; CHECK-NEXT:    faddv s1, p0, z1.s
+; CHECK-NEXT:    mov v0.s[1], v1.s[0]
+; CHECK-NEXT:    str d0, [x0]
+; CHECK-NEXT:    ret
+  %ptrue = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
+  %vscale = call i64 @llvm.vscale.i64()
+  %c4_vscale = shl i64 %vscale, 2
+  %src1 = getelementptr inbounds float, ptr %src0, i64 %c4_vscale
+  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
+  %vec0 = load <vscale x 4 x float>, ptr %src0, align 4
+  %vec1 = load <vscale x 4 x float>, ptr %src1, align 4
+  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec0)
+  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec1)
+  store float %reduce0, ptr %dest0, align 4
+  store float %reduce1, ptr %dest1, align 4
+  ret void
+}
+
+define void @consecutive_stores_quadruple(ptr noalias %dest0, ptr noalias %src0) {
+; CHECK-LABEL: consecutive_stores_quadruple:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x1]
+; CHECK-NEXT:    ld1w { z1.s }, p0/z, [x1, #1, mul vl]
+; CHECK-NEXT:    ld1w { z2.s }, p0/z, [x1, #2, mul vl]
+; CHECK-NEXT:    ld1w { z3.s }, p0/z, [x1, #3, mul vl]
+; CHECK-NEXT:    faddv s0, p0, z0.s
+; CHECK-NEXT:    faddv s1, p0, z1.s
+; CHECK-NEXT:    faddv s2, p0, z2.s
+; CHECK-NEXT:    mov v0.s[1], v1.s[0]
+; CHECK-NEXT:    faddv s3, p0, z3.s
+; CHECK-NEXT:    mov v2.s[1], v3.s[0]
+; CHECK-NEXT:    stp d0, d2, [x0]
+; CHECK-NEXT:    ret
+  %ptrue = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
+  %vscale = call i64 @llvm.vscale.i64()
+  %c4_vscale = shl i64 %vscale, 2
+  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
+  %dest2 = getelementptr inbounds i8, ptr %dest1, i64 4
+  %dest3 = getelementptr inbounds i8, ptr %dest2, i64 4
+  %src1 = getelementptr inbounds float, ptr %src0, i64 %c4_vscale
+  %src2 = getelementptr inbounds float, ptr %src1, i64 %c4_vscale
+  %src3 = getelementptr inbounds float, ptr %src2, i64 %c4_vscale
+  %vec0 = load <vscale x 4 x float>, ptr %src0, align 4
+  %vec1 = load <vscale x 4 x float>, ptr %src1, align 4
+  %vec2 = load <vscale x 4 x float>, ptr %src2, align 4
+  %vec3 = load <vscale x 4 x float>, ptr %src3, align 4
+  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec0)
+  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec1)
+  %reduce2 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec2)
+  %reduce3 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec3)
+  store float %reduce0, ptr %dest0, align 4
+  store float %reduce1, ptr %dest1, align 4
+  store float %reduce2, ptr %dest2, align 4
+  store float %reduce3, ptr %dest3, align 4
+  ret void
+}
+
+define void @consecutive_stores_pair_streaming_function(ptr noalias %dest0, ptr noalias %src0) #0 "aarch64_pstate_sm_enabled"  {
+; CHECK-LABEL: consecutive_stores_pair_streaming_function:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x1]
+; CHECK-NEXT:    ld1w { z1.s }, p0/z, [x1, #1, mul vl]
+; CHECK-NEXT:    faddv s0, p0, z0.s
+; CHECK-NEXT:    faddv s1, p0, z1.s
+; CHECK-NEXT:    stp s0, s1, [x0]
+; CHECK-NEXT:    ret
+  %ptrue = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
+  %vscale = call i64 @llvm.vscale.i64()
+  %c4_vscale = shl i64 %vscale, 2
+  %src1 = getelementptr inbounds float, ptr %src0, i64 %c4_vscale
+  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
+  %vec0 = load <vscale x 4 x float>, ptr %src0, align 4
+  %vec1 = load <vscale x 4 x float>, ptr %src1, align 4
+  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec0)
+  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec1)
+  store float %reduce0, ptr %dest0, align 4
+  store float %reduce1, ptr %dest1, align 4
+  ret void
+}
+
+define void @consecutive_stores_quadruple_streaming_function(ptr noalias %dest0, ptr noalias %src0) #0 "aarch64_pstate_sm_enabled" {
+; CHECK-LABEL: consecutive_stores_quadruple_streaming_function:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x1]
+; CHECK-NEXT:    ld1w { z1.s }, p0/z, [x1, #1, mul vl]
+; CHECK-NEXT:    ld1w { z2.s }, p0/z, [x1, #2, mul vl]
+; CHECK-NEXT:    ld1w { z3.s }, p0/z, [x1, #3, mul vl]
+; CHECK-NEXT:    faddv s0, p0, z0.s
+; CHECK-NEXT:    faddv s1, p0, z1.s
+; CHECK-NEXT:    faddv s2, p0, z2.s
+; CHECK-NEXT:    stp s0, s1, [x0]
+; CHECK-NEXT:    faddv s3, p0, z3.s
+; CHECK-NEXT:    stp s2, s3, [x0, #8]
+; CHECK-NEXT:    ret
+  %ptrue = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
+  %vscale = call i64 @llvm.vscale.i64()
+  %c4_vscale = shl i64 %vscale, 2
+  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
+  %dest2 = getelementptr inbounds i8, ptr %dest1, i64 4
+  %dest3 = getelementptr inbounds i8, ptr %dest2, i64 4
+  %src1 = getelementptr inbounds float, ptr %src0, i64 %c4_vscale
+  %src2 = getelementptr inbounds float, ptr %src1, i64 %c4_vscale
+  %src3 = getelementptr inbounds float, ptr %src2, i64 %c4_vscale
+  %vec0 = load <vscale x 4 x float>, ptr %src0, align 4
+  %vec1 = load <vscale x 4 x float>, ptr %src1, align 4
+  %vec2 = load <vscale x 4 x float>, ptr %src2, align 4
+  %vec3 = load <vscale x 4 x float>, ptr %src3, align 4
+  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec0)
+  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec1)
+  %reduce2 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec2)
+  %reduce3 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> %ptrue, <vscale x 4 x float> %vec3)
+  store float %reduce0, ptr %dest0, align 4
+  store float %reduce1, ptr %dest1, align 4
+  store float %reduce2, ptr %dest2, align 4
+  store float %reduce3, ptr %dest3, align 4
+  ret void
+}
+
+attributes #0 = { vscale_range(1, 16) "target-features"="+sve,+sme" }

@MacDue (Member, Author) commented on Oct 8, 2024

Note: See a635d7a for the difference between the precommit test output and the output with these changes.

@paulwalker-arm (Collaborator) left a comment:

When I rewrite consecutive_stores_quadruple_streaming_function to use llvm.vector.reduce.fadd instead of the target-specific intrinsic, the problem goes away. It's worth investigating whether we're missing a useful combine or something before flicking the canMergeStoresTo switch.

@MacDue (Member, Author) commented on Oct 8, 2024

> When I rewrite consecutive_stores_quadruple_streaming_function to use llvm.vector.reduce.fadd instead of the target-specific intrinsic, the problem goes away. It's worth investigating whether we're missing a useful combine or something before flicking the canMergeStoresTo switch.

Could you share your IR? (I've probably done something silly, but I'm hitting "Expanding reductions for scalable vectors is undefined." when simply subbing the operation https://godbolt.org/z/zvMEnnKzP).

That said, the reason it may appear fixed comes down to when DAGCombiner::mergeConsecutiveStores() applies. It only applies in a few specific cases: stores of constants, loads, or extracts. A normal scalar result is usually not an extract; however, for SVE operations such as sve.faddv, the result is modeled as a vector plus an extract of lane 0 (to model the lane zeroing). Because of this, the pattern can be matched by tryStoreMergeOfExtracts(), and the stores are merged.

@paulwalker-arm (Collaborator) commented:

define void @consecutive_stores_quadruple(ptr noalias %dest0, <vscale x 4 x float> %vec0, <vscale x 4 x float> %vec1, <vscale x 4 x float> %vec2, <vscale x 4 x float> %vec3) "aarch64_pstate_sm_enabled" {
  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
  %dest2 = getelementptr inbounds i8, ptr %dest1, i64 4
  %dest3 = getelementptr inbounds i8, ptr %dest2, i64 4
  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec0)
  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec1)
  %reduce2 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec2)
  %reduce3 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec3)
  store float %reduce0, ptr %dest0, align 4
  store float %reduce1, ptr %dest1, align 4
  store float %reduce2, ptr %dest2, align 4
  store float %reduce3, ptr %dest3, align 4
  ret void
}

define void @consecutive_stores_quadruple2(ptr noalias %dest0, <vscale x 4 x float> %vec0, <vscale x 4 x float> %vec1, <vscale x 4 x float> %vec2, <vscale x 4 x float> %vec3) "aarch64_pstate_sm_enabled" {
  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
  %dest2 = getelementptr inbounds i8, ptr %dest1, i64 4
  %dest3 = getelementptr inbounds i8, ptr %dest2, i64 4
  %reduce0 = call fast float @llvm.vector.reduce.fadd.f32.nxv4f32(float zeroinitializer, <vscale x 4 x float> %vec0)
  %reduce1 = call fast float @llvm.vector.reduce.fadd.f32.nxv4f32(float zeroinitializer, <vscale x 4 x float> %vec1)
  %reduce2 = call fast float @llvm.vector.reduce.fadd.f32.nxv4f32(float zeroinitializer, <vscale x 4 x float> %vec2)
  %reduce3 = call fast float @llvm.vector.reduce.fadd.f32.nxv4f32(float zeroinitializer, <vscale x 4 x float> %vec3)
  store float %reduce0, ptr %dest0, align 4
  store float %reduce1, ptr %dest1, align 4
  store float %reduce2, ptr %dest2, align 4
  store float %reduce3, ptr %dest3, align 4
  ret void
}

@MacDue
Copy link
Member Author

MacDue commented Oct 8, 2024

Yeah, so with consecutive_stores_quadruple() for each @llvm.aarch64.sve.faddv.nxv4f32 you get something like:

t33: nxv4f32 = AArch64ISD::FADDV_PRED t20, t10
t35: f32 = extract_vector_elt t33, Constant:i64<0>

Which matches tryStoreMergeOfExtracts().

With consecutive_stores_quadruple2() for each @llvm.vector.reduce.fadd.f32.nxv4f32 you get:

t20: f32 = vecreduce_fadd nnan ninf nsz arcp contract afn reassoc t8

Which does not match any of the store merging patterns.
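
For reference, here is a simplified sketch (not the actual DAGCombiner source; the helper name is made up) of how the stored value is classified before a merge is attempted — only these three buckets are considered, which is why a lane-0 extract is picked up while a plain vecreduce_fadd scalar is not:

#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Hypothetical helper mirroring the classification mergeConsecutiveStores()
// performs on the stored value before attempting a merge.
enum class StoreSource { Unknown, Constant, Extract, Load };

static StoreSource classifyStoredValue(SDValue Val) {
  switch (Val.getOpcode()) {
  case ISD::Constant:
  case ISD::ConstantFP:
    return StoreSource::Constant;   // merged as a constant vector store
  case ISD::EXTRACT_VECTOR_ELT:
  case ISD::EXTRACT_SUBVECTOR:
    return StoreSource::Extract;    // handled by tryStoreMergeOfExtracts()
  case ISD::LOAD:
    return StoreSource::Load;       // merged as a wider load + store
  default:
    return StoreSource::Unknown;    // e.g. vecreduce_fadd: never merged
  }
}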

MacDue added 4 commits October 8, 2024 13:53
Lowering fixed-size BUILD_VECTORS without Neon may introduce stack
spills, leading to more stores/reloads than if the stores were not
merged. In some cases, it can also prevent using paired store
instructions.

In the future, we may want to relax when SVE is available, but
currently, the SVE lowerings for BUILD_VECTOR are limited to a few
specific cases.
Comment on lines +27929 to +27932
// Avoid merging stores into fixed-length vectors when Neon is unavailable.
// In future, we could allow this when SVE is available, but currently,
// the SVE lowerings for BUILD_VECTOR are limited to a few specific cases (and
// the general lowering may introduce stack spills/reloads).
A Collaborator commented:

[Just thinking out loud here] My understanding is that, for the example in the test, the reason we don't want to do this optimisation is that we can use stp instructions instead; there is no upside to merging the stores, while there is a possible downside: the insert operation is expensive. At the moment it is expensive because we use a spill/reload, but for streaming[-compatible] SVE we could implement the operation using the SVE INSR instruction, which may be no less efficient than the NEON operation if the value being inserted is also in an FPR/SIMD register. Given the lack of upside, disabling the merging of stores avoids this complexity altogether, which is understandably the route chosen here.

I guess the question is: for which cases is merging stores beneficial when NEON is available? And for those cases, can we implement them efficiently using Streaming SVE?

@MacDue (Member, Author) replied:

I think there are two (slightly) independent cases here. There's unwanted store merging (for non-streaming functions), because we could just use a stp instead. E.g.

mov v0.s[1], v1.s[0]
str d0, [x0]

->

stp s0, s1, [x0]

That's not fixed in this PR.

Then there's streaming-mode store merging, which results in stack spills due to the BUILD_VECTOR lowering. Disabling store merging means that, in some cases, we use a preferable stp in streaming mode, but that's a secondary goal here; the main aim is to avoid the stack spills.

As for a streaming-mode/SVE BUILD_VECTOR lowering, I think there are a few options, but likely not as efficient as NEON (though maybe others have better ideas 😄).

E.g. for <4 x float>:

You could make a chain of INSR:

insr    z3.s, s2
insr    z3.s, s1
insr    z3.s, s0
str     q3, [x0]

But INSR has a higher latency than a MOV. Also, there is a dependency chain here, as each INSR depends on the previous one.

Another option is a chain of ZIP1:

zip1    z2.s, z2.s, z3.s
zip1    z0.s, z0.s, z1.s
zip1    z0.d, z0.d, z2.d
str     q0, [x0]

This seems like it may be more efficient than INSR, and also allows for a shorter dependency chain (log n depth), but it is still likely not as efficient as just MOVs.

@sdesmalen-arm (Collaborator) replied:

I agree, ZIP would indeed be a (much) better choice than INSR. I'm also happy with the intent of this PR. The part that isn't entirely clear to me yet is for which cases we'd want to enable this merging when we do have optimal SVE codegen.

A Collaborator commented:

FWIW, I just manually tried the <4 x float> case: https://godbolt.org/z/MYzrdahjh
I'd say that the SVE version using zip1 is no less efficient than this.

@MacDue (Member, Author) replied:

> The part that isn't entirely clear to me yet is for which cases we'd want to enable this merging when we do have optimal SVE codegen.

I'm not sure either, but the original commit (from way back when) for the DAG combine only handled stores of constants and loads, and noted that it's generally not profitable (see: 7cbc12a). The way it merges stores of extracts here is a little odd and maybe unintentional? It would make more sense to do the merge if it were storing lanes from the same vector, which is not the case here.
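
For contrast, the case described above — storing lanes of the same vector — would look roughly like this (illustrative IR, not from this PR; the function name is made up), and merging it needs no BUILD_VECTOR at all since both values already live in one register:

define void @same_vector_lanes(ptr %dst, <4 x float> %v) {
  ; Both stored scalars are lanes of the same vector, so a merged store is
  ; simply a store of the low 64 bits of %v.
  %lane0 = extractelement <4 x float> %v, i64 0
  %lane1 = extractelement <4 x float> %v, i64 1
  %dst1 = getelementptr inbounds float, ptr %dst, i64 1
  store float %lane0, ptr %dst, align 4
  store float %lane1, ptr %dst1, align 4
  ret void
}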

@sdesmalen-arm (Collaborator) left a comment:

LGTM with nits addressed.


@MacDue merged commit c3a10dc into llvm:main on Oct 11, 2024
9 checks passed
@MacDue deleted the sme_stores branch on October 11, 2024 13:15
ichaer added a commit to splunk/ichaer-llvm-project that referenced this pull request Oct 11, 2024
…ent-indentonly

* llvm-trunk/main: (6379 commits), including [AArch64] Disable consecutive store merging when Neon is unavailable (llvm#111519)
DanielCChen pushed a commit to DanielCChen/llvm-project that referenced this pull request Oct 16, 2024
…lvm#111519)
