[AArch64] Disable consecutive store merging when Neon is unavailable #111519
Merged
@@ -0,0 +1,92 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve,+sme -O3 < %s -o - | FileCheck %s --check-prefixes=CHECK

; Tests consecutive stores of @llvm.aarch64.sve.faddv. Within SDAG faddv is
; lowered as a FADDV + EXTRACT_VECTOR_ELT (of lane 0). Stores of extracts can
; be matched by DAGCombiner::mergeConsecutiveStores(), which we want to avoid in
; some cases as it can lead to worse codegen.

; TODO: A single `stp s0, s1, [x0]` may be preferred here.
define void @consecutive_stores_pair(ptr %dest0, <vscale x 4 x float> %vec0, <vscale x 4 x float> %vec1) {
; CHECK-LABEL: consecutive_stores_pair:
; CHECK:       // %bb.0:
; CHECK-NEXT:    ptrue p0.s
; CHECK-NEXT:    faddv s0, p0, z0.s
; CHECK-NEXT:    faddv s1, p0, z1.s
; CHECK-NEXT:    mov v0.s[1], v1.s[0]
; CHECK-NEXT:    str d0, [x0]
; CHECK-NEXT:    ret
  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec0)
  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec1)
  store float %reduce0, ptr %dest0, align 4
  store float %reduce1, ptr %dest1, align 4
  ret void
}

define void @consecutive_stores_quadruple(ptr %dest0, <vscale x 4 x float> %vec0, <vscale x 4 x float> %vec1, <vscale x 4 x float> %vec2, <vscale x 4 x float> %vec3) {
; CHECK-LABEL: consecutive_stores_quadruple:
; CHECK:       // %bb.0:
; CHECK-NEXT:    ptrue p0.s
; CHECK-NEXT:    faddv s0, p0, z0.s
; CHECK-NEXT:    faddv s1, p0, z1.s
; CHECK-NEXT:    faddv s2, p0, z2.s
; CHECK-NEXT:    mov v0.s[1], v1.s[0]
; CHECK-NEXT:    faddv s3, p0, z3.s
; CHECK-NEXT:    mov v2.s[1], v3.s[0]
; CHECK-NEXT:    stp d0, d2, [x0]
; CHECK-NEXT:    ret
  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
  %dest2 = getelementptr inbounds i8, ptr %dest1, i64 4
  %dest3 = getelementptr inbounds i8, ptr %dest2, i64 4
  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec0)
  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec1)
  %reduce2 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec2)
  %reduce3 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec3)
  store float %reduce0, ptr %dest0, align 4
  store float %reduce1, ptr %dest1, align 4
  store float %reduce2, ptr %dest2, align 4
  store float %reduce3, ptr %dest3, align 4
  ret void
}

define void @consecutive_stores_pair_streaming_function(ptr %dest0, <vscale x 4 x float> %vec0, <vscale x 4 x float> %vec1) "aarch64_pstate_sm_enabled" {
; CHECK-LABEL: consecutive_stores_pair_streaming_function:
; CHECK:       // %bb.0:
; CHECK-NEXT:    ptrue p0.s
; CHECK-NEXT:    faddv s0, p0, z0.s
; CHECK-NEXT:    faddv s1, p0, z1.s
; CHECK-NEXT:    stp s0, s1, [x0]
; CHECK-NEXT:    ret
  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec0)
  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec1)
  store float %reduce0, ptr %dest0, align 4
  store float %reduce1, ptr %dest1, align 4
  ret void
}

define void @consecutive_stores_quadruple_streaming_function(ptr %dest0, <vscale x 4 x float> %vec0, <vscale x 4 x float> %vec1, <vscale x 4 x float> %vec2, <vscale x 4 x float> %vec3) "aarch64_pstate_sm_enabled" {
; CHECK-LABEL: consecutive_stores_quadruple_streaming_function:
; CHECK:       // %bb.0:
; CHECK-NEXT:    ptrue p0.s
; CHECK-NEXT:    faddv s0, p0, z0.s
; CHECK-NEXT:    faddv s1, p0, z1.s
; CHECK-NEXT:    faddv s2, p0, z2.s
; CHECK-NEXT:    stp s0, s1, [x0]
; CHECK-NEXT:    faddv s3, p0, z3.s
; CHECK-NEXT:    stp s2, s3, [x0, #8]
; CHECK-NEXT:    ret
  %dest1 = getelementptr inbounds i8, ptr %dest0, i64 4
  %dest2 = getelementptr inbounds i8, ptr %dest1, i64 4
  %dest3 = getelementptr inbounds i8, ptr %dest2, i64 4
  %reduce0 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec0)
  %reduce1 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec1)
  %reduce2 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec2)
  %reduce3 = call float @llvm.aarch64.sve.faddv.nxv4f32(<vscale x 4 x i1> splat(i1 true), <vscale x 4 x float> %vec3)
  store float %reduce0, ptr %dest0, align 4
  store float %reduce1, ptr %dest1, align 4
  store float %reduce2, ptr %dest2, align 4
  store float %reduce3, ptr %dest3, align 4
  ret void
}
[Just thinking out loud here] My understanding is that, for the example in the test, the reason we don't want to do this optimisation is because we can use `stp` instructions instead; there is no upside to merging the stores, although there is a possible downside that the insert operation is expensive. At the moment it is expensive because we use a spill/reload, but for streaming[-compatible] SVE we could implement the operation using the SVE `INSR` instruction, which may not be any less efficient than the NEON operation if the value being inserted is also in an FPR/SIMD register. With the lack of upside, disabling the merging of stores avoids this complexity altogether, which understandably is the route chosen here.

I guess the question is: for which cases is merging stores beneficial when NEON is available? And for those cases, can we implement them efficiently using streaming SVE?
I think there are two (slightly) independent cases here. There's unwanted store merging (for non-streaming functions), because we could just use an `stp` instead. E.g.:
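(A rough sketch of the kind of difference meant here, based on the `consecutive_stores_pair` case from the test; the register assignments are illustrative, not taken from the original comment.)

  // merged stores, as currently generated:
  mov v0.s[1], v1.s[0]
  str d0, [x0]

->

  // preferred:
  stp s0, s1, [x0]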
That's not fixed in this PR.
Then there's streaming-mode store merging, which results in stack spills due to the BUILD_VECTOR lowering. Disabling store merging means, in some cases, we use a more preferable `stp` in streaming mode, but that's a secondary goal here; the main aim is to avoid the stack spills.

As for a streaming-mode/SVE BUILD_VECTOR lowering, I think there are a few options, but likely not as efficient as NEON (though maybe others have better ideas 😄).
E.g. for <4 x float>:
You could make a chain of `INSR`:
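(A rough sketch, assuming the four floats are in s0-s3, i.e. lane 0 of z0-z3; register choices are illustrative.)

  insr z3.s, s2    // z3 = { s2, s3, ... }
  insr z3.s, s1    // z3 = { s1, s2, s3, ... }
  insr z3.s, s0    // z3 = { s0, s1, s2, s3, ... }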
But `INSR` has a higher latency than a `MOV`. Also, there is a dependency chain here, as each `INSR` depends on the previous one.

Another option is a chain of `ZIP1`:
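(Again a rough sketch, with the four floats in lane 0 of z0-z3.)

  zip1 z0.s, z0.s, z1.s    // z0 = { s0, s1, ... }
  zip1 z2.s, z2.s, z3.s    // z2 = { s2, s3, ... } (independent of the first zip)
  zip1 z0.d, z0.d, z2.d    // z0 = { s0, s1, s2, s3, ... }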
This seems like it may be more efficient than `INSR`, and also allows for a shorter dependency chain (log n), but it is still likely not as efficient as just `MOV`s.
I agree, ZIP would indeed be a (much) better choice than INSR. I'm also happy with the intent of this PR. The part that isn't entirely clear to me yet is for which cases we'd want to enable this merging when we do have optimal SVE codegen.
FWIW, I just manually tried the <4 x float> case: https://godbolt.org/z/MYzrdahjh
I'd say that the SVE version using zip1 is no less efficient than this.
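(For reference, without reproducing the Godbolt output here: the NEON-style build of a <4 x float> from four scalars in lane 0 of v0-v3 is roughly three lane moves, each a read-modify-write of v0.)

  mov v0.s[1], v1.s[0]
  mov v0.s[2], v2.s[0]
  mov v0.s[3], v3.s[0]

Both forms are three instructions, and the zip1 version has the shorter dependency chain.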
I'm not sure either, but the original commit (from way back when) for the DAG combine only handled stores of constants and loads and noted it's generally not profitable (see: 7cbc12a). The way it's merging stores of extracts here is a little odd and maybe unintentional? It'd make more sense to do the merge if it were storing lanes from the same vector, which is not the case here.