-
Notifications
You must be signed in to change notification settings - Fork 13.6k
[WIP][X86] lowerBuildVectorAsBroadcast - don't convert constant vectors to broadcasts on AVX512VL targets #73509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
You can test this locally with the following command:git-clang-format --diff HEAD~1 HEAD --extensions cpp -- llvm/lib/Target/X86/X86FixupVectorConstants.cpp llvm/lib/Target/X86/X86ISelLowering.cpp View the diff from clang-format here.diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index b548f82f6..9b97b9d97 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -7758,7 +7758,6 @@ static SDValue lowerBuildVectorAsBroadcast(BuildVectorSDNode *BVOp,
unsigned ScalarSize = Ld.getValueSizeInBits();
bool IsGE256 = (VT.getSizeInBits() >= 256);
-
// Handle broadcasting a single constant scalar from the constant pool
// into a vector.
// On Sandybridge (no AVX2), it is still better to load a constant vector
|
21fad09
to
dae6506
Compare
fc410b2
to
c5db884
Compare
Added basic handling for non-VLX AVX512 targets when dealing with 512-bit constant vectors |
@@ -948,7 +948,8 @@ define void @load_i8_stride4_vf32(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, ptr | |||
; AVX512F-NEXT: vpshufb %ymm0, %ymm1, %ymm2 | |||
; AVX512F-NEXT: vmovdqa 64(%rdi), %ymm3 | |||
; AVX512F-NEXT: vpshufb %ymm0, %ymm3, %ymm0 | |||
; AVX512F-NEXT: vmovdqa {{.*#+}} ymm4 = [0,4,0,4,0,4,8,12] | |||
; AVX512F-NEXT: vbroadcasti128 {{.*#+}} ymm4 = [0,4,8,12,0,4,8,12] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New broadcast?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - still addressing regressions, that's why its still a draft :)
Can't memory ops microfuse on all targets? Why is this avx512vl only? |
c5db884
to
4fedfe0
Compare
We were using VPTERNLOGQ for everything but i32 types, which made broadcasts wider than necessary Noticed in #73509
I don't understand what you're asking - we already load-fold for all targets. This patch is about improving broadcast-load-fold. By prematurely converting to constant broadcasts in DAG we're hindering later optimizations - MachineLICM is a good example (we end up hoisting the broadcast which then often spills the full width broadcasted vector......). By keeping to full vector width until X86FixupVectorConstants we avoid a lot of this. I will eventually be disabling constant broadcasting in lowerBuildVectorAsBroadcast for all AVX targets later but theres a lot of regressions to still deal with - AVX512VL (and AVX512F for 512-bit vectors) is the first step. |
adc89f0
to
3af0810
Compare
f738150
to
6d1519e
Compare
Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register This will more significantly help improve 'fixup constant' comments from #73509
Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register This will more significantly help improve 'fixup constant' comments from #73509
86ed907
to
925a8d0
Compare
Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register This will more significantly help improve 'fixup constant' comments from llvm#73509
Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register This will more significantly help improve 'fixup constant' comments from llvm#73509
925a8d0
to
77629d5
Compare
77629d5
to
e8d60f1
Compare
e8d60f1
to
27c0a8a
Compare
27c0a8a
to
c6e33bf
Compare
7f72657
to
0a147dc
Compare
// TODO: Add support for RegBitWidth, but currently rebuildSplatCst | ||
// doesn't require it (defaults to Constant::getPrimitiveSizeInBits). | ||
if (FixupConstant(Fixups, 0, OpNoBcst64)) | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make inside of the condition a lambda to avoid 3x duplicate?
Long term WIP patch while I (slowly) address the various regressions it exposes - don't bother reviewing for now :) |
0a147dc
to
d9462dc
Compare
d9462dc
to
6eed82f
Compare
… to handle single bitwidth at a time. Attempt 32-bit broadcasts first, and then fallback to 64-bit broadcasts on failure. We lose an explicit assertion for matching operand numbers but X86InstrFoldTables already does something similar. Pulled out of WIP patch #73509
bf2be9b
to
5970ecc
Compare
5970ecc
to
a30549e
Compare
… targets On AVX512 targets we're better off keeping constant vector at full width to ensure that they can be load folded into vector instructions, reducing register pressure. If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us. Non-VLX targets are still seeing some regressions due to these being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so I've limited this to just 512-bit vectors for now. We still use lowerBuildVectorAsBroadcast on AVX512 targets if we're optimizing for size.
a30549e
to
1cb1cb2
Compare
On AVX512VL targets we're better off keeping constant vectors at full width to ensure that they can be load folded into vector instructions, reducing register pressure. If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us.
This is still a WIP patch - as can be seen by the changes to X86InstrFoldTables.cpp, we have very poor coverage for the BroadcastFoldTables (Issue #66360). I don't know whether just to continue manually extending these tables or to wait for #66360 to be done.
Non-VLX AVX512 targets are still seeing some regressions due to main instructions being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so for now lets keep them as it is (same for AVX1/AVX2 targets). For AVX1/AVX2, broadcasting constants via lowerBuildVectorAsBroadcast helps a lot, as long as we don't cause register spills, which is major problem on larger vectorized hot loops. I'm currently thinking we should add a x86 pass, similar to MachineLICM, that unfolds broadcastable constant loads as long as we have spare registers; we could then remove the remaining lowerBuildVectorAsBroadcast constant handling entirely - any thoughts?
My goal is to improve AVX1/AVX2 vector constant handling but getting AVX512 out of the way appears to be an easier first step.