Skip to content

[WIP][X86] lowerBuildVectorAsBroadcast - don't convert constant vectors to broadcasts on AVX512VL targets #73509

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

RKSimon
Copy link
Collaborator

@RKSimon RKSimon commented Nov 27, 2023

On AVX512VL targets we're better off keeping constant vectors at full width to ensure that they can be load folded into vector instructions, reducing register pressure. If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us.

This is still a WIP patch - as can be seen by the changes to X86InstrFoldTables.cpp, we have very poor coverage for the BroadcastFoldTables (Issue #66360). I don't know whether just to continue manually extending these tables or to wait for #66360 to be done.

Non-VLX AVX512 targets are still seeing some regressions due to main instructions being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so for now lets keep them as it is (same for AVX1/AVX2 targets). For AVX1/AVX2, broadcasting constants via lowerBuildVectorAsBroadcast helps a lot, as long as we don't cause register spills, which is major problem on larger vectorized hot loops. I'm currently thinking we should add a x86 pass, similar to MachineLICM, that unfolds broadcastable constant loads as long as we have spare registers; we could then remove the remaining lowerBuildVectorAsBroadcast constant handling entirely - any thoughts?

My goal is to improve AVX1/AVX2 vector constant handling but getting AVX512 out of the way appears to be an easier first step.

Copy link

github-actions bot commented Nov 27, 2023

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions cpp -- llvm/lib/Target/X86/X86FixupVectorConstants.cpp llvm/lib/Target/X86/X86ISelLowering.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index b548f82f6..9b97b9d97 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -7758,7 +7758,6 @@ static SDValue lowerBuildVectorAsBroadcast(BuildVectorSDNode *BVOp,
   unsigned ScalarSize = Ld.getValueSizeInBits();
   bool IsGE256 = (VT.getSizeInBits() >= 256);
 
-
   // Handle broadcasting a single constant scalar from the constant pool
   // into a vector.
   // On Sandybridge (no AVX2), it is still better to load a constant vector

@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch 8 times, most recently from 21fad09 to dae6506 Compare November 30, 2023 13:37
RKSimon added a commit that referenced this pull request Nov 30, 2023
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch 5 times, most recently from fc410b2 to c5db884 Compare December 7, 2023 14:22
@RKSimon
Copy link
Collaborator Author

RKSimon commented Dec 7, 2023

Added basic handling for non-VLX AVX512 targets when dealing with 512-bit constant vectors

@@ -948,7 +948,8 @@ define void @load_i8_stride4_vf32(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, ptr
; AVX512F-NEXT: vpshufb %ymm0, %ymm1, %ymm2
; AVX512F-NEXT: vmovdqa 64(%rdi), %ymm3
; AVX512F-NEXT: vpshufb %ymm0, %ymm3, %ymm0
; AVX512F-NEXT: vmovdqa {{.*#+}} ymm4 = [0,4,0,4,0,4,8,12]
; AVX512F-NEXT: vbroadcasti128 {{.*#+}} ymm4 = [0,4,8,12,0,4,8,12]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New broadcast?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - still addressing regressions, that's why its still a draft :)

@goldsteinn
Copy link
Contributor

Can't memory ops microfuse on all targets? Why is this avx512vl only?

@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch from c5db884 to 4fedfe0 Compare December 8, 2023 11:17
RKSimon added a commit that referenced this pull request Dec 8, 2023
We were using VPTERNLOGQ for everything but i32 types, which made broadcasts wider than necessary

Noticed in #73509
@RKSimon
Copy link
Collaborator Author

RKSimon commented Dec 8, 2023

Can't memory ops microfuse on all targets? Why is this avx512vl only?

I don't understand what you're asking - we already load-fold for all targets. This patch is about improving broadcast-load-fold. By prematurely converting to constant broadcasts in DAG we're hindering later optimizations - MachineLICM is a good example (we end up hoisting the broadcast which then often spills the full width broadcasted vector......). By keeping to full vector width until X86FixupVectorConstants we avoid a lot of this.

I will eventually be disabling constant broadcasting in lowerBuildVectorAsBroadcast for all AVX targets later but theres a lot of regressions to still deal with - AVX512VL (and AVX512F for 512-bit vectors) is the first step.

@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from adc89f0 to 3af0810 Compare December 8, 2023 13:21
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch from f738150 to 6d1519e Compare February 5, 2024 12:58
RKSimon added a commit that referenced this pull request Feb 5, 2024
Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from #73509
RKSimon added a commit that referenced this pull request Feb 5, 2024
Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from #73509
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from 86ed907 to 925a8d0 Compare February 5, 2024 18:09
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from llvm#73509
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from llvm#73509
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch from 925a8d0 to 77629d5 Compare February 28, 2024 10:58
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch from 77629d5 to e8d60f1 Compare April 8, 2024 11:10
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch from e8d60f1 to 27c0a8a Compare April 18, 2024 21:14
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch from 27c0a8a to c6e33bf Compare June 12, 2024 10:46
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from 7f72657 to 0a147dc Compare June 19, 2024 13:16
// TODO: Add support for RegBitWidth, but currently rebuildSplatCst
// doesn't require it (defaults to Constant::getPrimitiveSizeInBits).
if (FixupConstant(Fixups, 0, OpNoBcst64))
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make inside of the condition a lambda to avoid 3x duplicate?

@RKSimon
Copy link
Collaborator Author

RKSimon commented Jun 21, 2024

Long term WIP patch while I (slowly) address the various regressions it exposes - don't bother reviewing for now :)

@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch from 0a147dc to d9462dc Compare December 23, 2024 14:10
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch from d9462dc to 6eed82f Compare January 10, 2025 16:36
RKSimon added a commit that referenced this pull request Jan 18, 2025
… to handle single bitwidth at a time.

Attempt 32-bit broadcasts first, and then fallback to 64-bit broadcasts on failure.

We lose an explicit assertion for matching operand numbers but X86InstrFoldTables already does something similar.

Pulled out of WIP patch #73509
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch 3 times, most recently from bf2be9b to 5970ecc Compare January 22, 2025 17:41
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch from 5970ecc to a30549e Compare April 27, 2025 15:47
… targets

On AVX512 targets we're better off keeping constant vector at full width to ensure that they can be load folded into vector instructions, reducing register pressure.

If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us.

Non-VLX targets are still seeing some regressions due to these being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so I've limited this to just 512-bit vectors for now.

We still use lowerBuildVectorAsBroadcast on AVX512 targets if we're optimizing for size.
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch from a30549e to 1cb1cb2 Compare April 27, 2025 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants