[AArch64] Disable consecutive store merging when Neon is unavailable #111519

Merged
merged 5 commits on Oct 11, 2024
15 changes: 15 additions & 0 deletions llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -27924,6 +27924,21 @@ bool AArch64TargetLowering::isIntDivCheap(EVT VT, AttributeList Attr) const {
return OptSize && !VT.isVector();
}

bool AArch64TargetLowering::canMergeStoresTo(unsigned AddressSpace, EVT MemVT,
const MachineFunction &MF) const {
// Avoid merging stores into fixed-length vectors when Neon is unavailable.
// In future, we could allow this when SVE is available, but currently,
// the SVE lowerings for BUILD_VECTOR are limited to a few specific cases (and
// the general lowering may introduce stack spills/reloads).
Comment on lines +27929 to +27932
Collaborator


[Just thinking out loud here] My understanding is that, for the example in the test, we don't want to do this optimisation because we can use stp instructions instead: there is no upside to merging the stores, while there is a possible downside in that the insert operation is expensive. At the moment it is expensive because we use a spill/reload, but for streaming[-compatible] SVE we could implement the operation using the SVE INSR instruction, which may be no less efficient than the NEON operation if the value being inserted is also in an FPR/SIMD register. With the lack of upside, disabling the merging of stores avoids this complexity altogether, which is understandably the route chosen here.

I guess the question is: for which cases is merging stores beneficial when NEON is available? And for those cases, can we implement them efficiently using Streaming SVE?

Member Author


I think there are two (slightly) independent cases here. There's unwanted store merging (for non-streaming functions), where we could just use an stp instead. E.g.

mov v0.s[1], v1.s[0]
str d0, [x0]

->

stp s0, s1, [x0]

That's not fixed in this PR.

Then there's streaming-mode store merging, which results in stack spills due to the BUILD_VECTOR lowering. Disabling store merging means that, in some cases, we use a preferable stp in streaming mode, but that's a secondary goal here; the main aim is to avoid the stack spills.

As for a streaming-mode/SVE BUILD_VECTOR lowering, I think there are a few options, but likely not as efficient as NEON (though maybe others have better ideas 😄).

E.g. for <4 x float>:

You could make a chain of INSR:

insr    z3.s, s2
insr    z3.s, s1
insr    z3.s, s0
str     q3, [x0]

But INSR has a higher latency than a MOV. Also, there is a dependency chain here, as each INSR depends on the previous one.

Another option is a chain of ZIP1:

zip1    z2.s, z2.s, z3.s
zip1    z0.s, z0.s, z1.s
zip1    z0.d, z0.d, z2.d
str     q0, [x0]

This seems like it may be more efficient than INSR, and it also allows for a shorter dependency chain (log n), but it is still likely not as efficient as just MOVs.

Collaborator


I agree, ZIP would indeed be a (much) better choice than INSR. I'm also happy with the intent of this PR. The part that isn't entirely clear to me yet is for which cases we'd want to enable this merging when we do have optimal SVE codegen.

Collaborator


FWIW, I just manually tried the <4 x float> case: https://godbolt.org/z/MYzrdahjh
I'd say that the SVE version using zip1 is no less efficient than this.

Member Author


The part that isn't entirely clear to me yet is for which cases we'd want to enable this merging when we do have optimal SVE codegen.

I'm not sure either, but the original commit (from way back when) for the DAG combine only handled stores of constants and loads, and noted that it's generally not profitable (see: 7cbc12a). The way it merges stores of extracts here is a little odd and maybe unintentional? It would make more sense to do the merge if it were storing lanes from the same vector, which is not the case here.
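
For illustration, a minimal reduced IR pattern of the kind this combine picks up (a hypothetical example, not taken from the test file) is two consecutive scalar stores of lane-0 extracts from different vectors:

; Hypothetical reduced example: consecutive stores of extracts from two
; different vectors, which mergeConsecutiveStores() can combine into a
; single vector store built via BUILD_VECTOR.
define void @store_extracts(ptr %dest0, <4 x float> %a, <4 x float> %b) {
  %lane0 = extractelement <4 x float> %a, i64 0
  %lane1 = extractelement <4 x float> %b, i64 0
  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
  store float %lane0, ptr %dest0, align 4
  store float %lane1, ptr %dest1, align 4
  ret void
}

When that BUILD_VECTOR then has to be lowered without NEON, we get the spill/reload discussed above, which is what disabling the merge avoids.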

if (MemVT.isFixedLengthVector() && !Subtarget->isNeonAvailable())
return false;

// Do not merge to float value size (128 bytes) if no implicit float attribute
// is set.
bool NoFloat = MF.getFunction().hasFnAttribute(Attribute::NoImplicitFloat);
return !NoFloat || MemVT.getSizeInBits() <= 64;
}

bool AArch64TargetLowering::preferIncOfAddToSubOfNot(EVT VT) const {
// We want inc-of-add for scalars and sub-of-not for vectors.
return VT.isScalarInteger();
11 changes: 1 addition & 10 deletions llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -849,16 +849,7 @@ class AArch64TargetLowering : public TargetLowering {
bool isIntDivCheap(EVT VT, AttributeList Attr) const override;

bool canMergeStoresTo(unsigned AddressSpace, EVT MemVT,
const MachineFunction &MF) const override {
// Do not merge to float value size (128 bytes) if no implicit
// float attribute is set.

bool NoFloat = MF.getFunction().hasFnAttribute(Attribute::NoImplicitFloat);

if (NoFloat)
return (MemVT.getSizeInBits() <= 64);
return true;
}
const MachineFunction &MF) const override;

bool isCheapToSpeculateCttz(Type *) const override {
return true;
92 changes: 92 additions & 0 deletions llvm/test/CodeGen/AArch64/consecutive-stores-of-faddv.ll
@@ -0,0 +1,92 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve,+sme -O3 < %s -o - | FileCheck %s --check-prefixes=CHECK

; Tests consecutive stores of @llvm.aarch64.sve.faddv. Within SDAG faddv is
; lowered as a FADDV + EXTRACT_VECTOR_ELT (of lane 0). Stores of extracts can
; be matched by DAGCombiner::mergeConsecutiveStores(), which we want to avoid in
; some cases as it can lead to worse codegen.

; TODO: A single `stp s0, s1, [x0]` may be preferred here.
define void @consecutive_stores_pair(ptr %dest0, <vscale x 4 x float> %vec0, <vscale x 4 x float> %vec1) {
; CHECK-LABEL: consecutive_stores_pair:
; CHECK: // %bb.0:
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: faddv s0, p0, z0.s
; CHECK-NEXT: faddv s1, p0, z1.s
; CHECK-NEXT: mov v0.s[1], v1.s[0]
; CHECK-NEXT: str d0, [x0]
; CHECK-NEXT: ret
%dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
%reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec0)
%reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec1)
store float %reduce0, ptr %dest0, align 4
store float %reduce1, ptr %dest1, align 4
ret void
}

define void @consecutive_stores_quadruple(ptr %dest0, <vscale x 4 x float> %vec0, <vscale x 4 x float> %vec1, <vscale x 4 x float> %vec2, <vscale x 4 x float> %vec3) {
; CHECK-LABEL: consecutive_stores_quadruple:
; CHECK: // %bb.0:
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: faddv s0, p0, z0.s
; CHECK-NEXT: faddv s1, p0, z1.s
; CHECK-NEXT: faddv s2, p0, z2.s
; CHECK-NEXT: mov v0.s[1], v1.s[0]
; CHECK-NEXT: faddv s3, p0, z3.s
; CHECK-NEXT: mov v2.s[1], v3.s[0]
; CHECK-NEXT: stp d0, d2, [x0]
; CHECK-NEXT: ret
%dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
%dest2 = getelementptr inbounds i8, ptr %dest1, i64 4
%dest3 = getelementptr inbounds i8, ptr %dest2, i64 4
%reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec0)
%reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec1)
%reduce2 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec2)
%reduce3 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec3)
store float %reduce0, ptr %dest0, align 4
store float %reduce1, ptr %dest1, align 4
store float %reduce2, ptr %dest2, align 4
store float %reduce3, ptr %dest3, align 4
ret void
}

define void @consecutive_stores_pair_streaming_function(ptr %dest0, <vscale x 4 x float> %vec0, <vscale x 4 x float> %vec1) "aarch64_pstate_sm_enabled" {
; CHECK-LABEL: consecutive_stores_pair_streaming_function:
; CHECK: // %bb.0:
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: faddv s0, p0, z0.s
; CHECK-NEXT: faddv s1, p0, z1.s
; CHECK-NEXT: stp s0, s1, [x0]
; CHECK-NEXT: ret
%dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
%reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec0)
%reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec1)
store float %reduce0, ptr %dest0, align 4
store float %reduce1, ptr %dest1, align 4
ret void
}

define void @consecutive_stores_quadruple_streaming_function(ptr %dest0, <vscale x 4 x float> %vec0, <vscale x 4 x float> %vec1, <vscale x 4 x float> %vec2, <vscale x 4 x float> %vec3) "aarch64_pstate_sm_enabled" {
; CHECK-LABEL: consecutive_stores_quadruple_streaming_function:
; CHECK: // %bb.0:
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: faddv s0, p0, z0.s
; CHECK-NEXT: faddv s1, p0, z1.s
; CHECK-NEXT: faddv s2, p0, z2.s
; CHECK-NEXT: stp s0, s1, [x0]
; CHECK-NEXT: faddv s3, p0, z3.s
; CHECK-NEXT: stp s2, s3, [x0, #8]
; CHECK-NEXT: ret
%dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
%dest2 = getelementptr inbounds i8, ptr %dest1, i64 4
%dest3 = getelementptr inbounds i8, ptr %dest2, i64 4
%reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec0)
%reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec1)
%reduce2 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec2)
%reduce3 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec3)
store float %reduce0, ptr %dest0, align 4
store float %reduce1, ptr %dest1, align 4
store float %reduce2, ptr %dest2, align 4
store float %reduce3, ptr %dest3, align 4
ret void
}