[WIP][X86] lowerBuildVectorAsBroadcast - don't convert constant vectors to broadcasts on AVX512VL targets #73509

RKSimon · 2023-11-27T12:36:48Z

On AVX512VL targets we're better off keeping constant vectors at full width to ensure that they can be load folded into vector instructions, reducing register pressure. If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us.

This is still a WIP patch - as can be seen by the changes to X86InstrFoldTables.cpp, we have very poor coverage for the BroadcastFoldTables (Issue #66360). I don't know whether just to continue manually extending these tables or to wait for #66360 to be done.

Non-VLX AVX512 targets are still seeing some regressions due to main instructions being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so for now lets keep them as it is (same for AVX1/AVX2 targets). For AVX1/AVX2, broadcasting constants via lowerBuildVectorAsBroadcast helps a lot, as long as we don't cause register spills, which is major problem on larger vectorized hot loops. I'm currently thinking we should add a x86 pass, similar to MachineLICM, that unfolds broadcastable constant loads as long as we have spare registers; we could then remove the remaining lowerBuildVectorAsBroadcast constant handling entirely - any thoughts?

My goal is to improve AVX1/AVX2 vector constant handling but getting AVX512 out of the way appears to be an easier first step.

github-actions · 2023-11-27T12:39:18Z

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format --diff HEAD~1 HEAD --extensions cpp -- llvm/lib/Target/X86/X86FixupVectorConstants.cpp llvm/lib/Target/X86/X86ISelLowering.cpp

View the diff from clang-format here.

diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index b548f82f6..9b97b9d97 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -7758,7 +7758,6 @@ static SDValue lowerBuildVectorAsBroadcast(BuildVectorSDNode *BVOp,
   unsigned ScalarSize = Ld.getValueSizeInBits();
   bool IsGE256 = (VT.getSizeInBits() >= 256);
 
-
   // Handle broadcasting a single constant scalar from the constant pool
   // into a vector.
   // On Sandybridge (no AVX2), it is still better to load a constant vector

llvm/lib/Target/X86/X86ISelLowering.cpp

llvm/test/CodeGen/X86/avx512cfma-intrinsics.ll

…tries Prep work for #73509 (missed in #73654)

RKSimon · 2023-12-07T14:23:26Z

Added basic handling for non-VLX AVX512 targets when dealing with 512-bit constant vectors

llvm/test/CodeGen/X86/avx512fp16-arith.ll

goldsteinn · 2023-12-07T18:07:45Z

llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-4.ll

@@ -948,7 +948,8 @@ define void @load_i8_stride4_vf32(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, ptr
 ; AVX512F-NEXT:    vpshufb %ymm0, %ymm1, %ymm2
 ; AVX512F-NEXT:    vmovdqa 64(%rdi), %ymm3
 ; AVX512F-NEXT:    vpshufb %ymm0, %ymm3, %ymm0
-; AVX512F-NEXT:    vmovdqa {{.*#+}} ymm4 = [0,4,0,4,0,4,8,12]
+; AVX512F-NEXT:    vbroadcasti128 {{.*#+}} ymm4 = [0,4,8,12,0,4,8,12]


New broadcast?

Yes - still addressing regressions, that's why its still a draft :)

goldsteinn · 2023-12-07T18:09:59Z

Can't memory ops microfuse on all targets? Why is this avx512vl only?

We were using VPTERNLOGQ for everything but i32 types, which made broadcasts wider than necessary Noticed in #73509

RKSimon · 2023-12-08T11:45:44Z

Can't memory ops microfuse on all targets? Why is this avx512vl only?

I don't understand what you're asking - we already load-fold for all targets. This patch is about improving broadcast-load-fold. By prematurely converting to constant broadcasts in DAG we're hindering later optimizations - MachineLICM is a good example (we end up hoisting the broadcast which then often spills the full width broadcasted vector......). By keeping to full vector width until X86FixupVectorConstants we avoid a lot of this.

I will eventually be disabling constant broadcasting in lowerBuildVectorAsBroadcast for all AVX targets later but theres a lot of regressions to still deal with - AVX512VL (and AVX512F for 512-bit vectors) is the first step.

Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register This will more significantly help improve 'fixup constant' comments from #73509

Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register This will more significantly help improve 'fixup constant' comments from #73509

Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register This will more significantly help improve 'fixup constant' comments from llvm#73509

Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register This will more significantly help improve 'fixup constant' comments from llvm#73509

goldsteinn · 2024-06-21T06:30:35Z

llvm/lib/Target/X86/X86FixupVectorConstants.cpp

+        // TODO: Add support for RegBitWidth, but currently rebuildSplatCst
+        // doesn't require it (defaults to Constant::getPrimitiveSizeInBits).
+        if (FixupConstant(Fixups, 0, OpNoBcst64))
+          return true;


Maybe make inside of the condition a lambda to avoid 3x duplicate?

RKSimon · 2024-06-21T09:54:59Z

Long term WIP patch while I (slowly) address the various regressions it exposes - don't bother reviewing for now :)

… to handle single bitwidth at a time. Attempt 32-bit broadcasts first, and then fallback to 64-bit broadcasts on failure. We lose an explicit assertion for matching operand numbers but X86InstrFoldTables already does something similar. Pulled out of WIP patch #73509

… targets On AVX512 targets we're better off keeping constant vector at full width to ensure that they can be load folded into vector instructions, reducing register pressure. If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us. Non-VLX targets are still seeing some regressions due to these being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so I've limited this to just 512-bit vectors for now. We still use lowerBuildVectorAsBroadcast on AVX512 targets if we're optimizing for size.

RKSimon requested review from phoebewang, KanRobert, goldsteinn and yubingex007-a11y November 27, 2023 12:36

phoebewang reviewed Nov 27, 2023

View reviewed changes

llvm/lib/Target/X86/X86ISelLowering.cpp Outdated Show resolved Hide resolved

phoebewang reviewed Nov 27, 2023

View reviewed changes

llvm/test/CodeGen/X86/avx512cfma-intrinsics.ll Outdated Show resolved Hide resolved

RKSimon force-pushed the perf/broadcast-avx512 branch 8 times, most recently from 21fad09 to dae6506 Compare November 30, 2023 13:37

RKSimon added a commit that referenced this pull request Nov 30, 2023

[X86] X86InstrFoldTables.cpp - add Op4 Broadcast Fold/Unfold table en…

b8bbd5f

…tries Prep work for #73509 (missed in #73654)

RKSimon force-pushed the perf/broadcast-avx512 branch 5 times, most recently from fc410b2 to c5db884 Compare December 7, 2023 14:22

goldsteinn reviewed Dec 7, 2023

View reviewed changes

llvm/test/CodeGen/X86/avx512fp16-arith.ll Outdated Show resolved Hide resolved

goldsteinn reviewed Dec 7, 2023

View reviewed changes

RKSimon force-pushed the perf/broadcast-avx512 branch from c5db884 to 4fedfe0 Compare December 8, 2023 11:17

RKSimon added a commit that referenced this pull request Dec 8, 2023

[X86] canonicalizeBitSelect - always use VPTERNLOGD for sub-32bit types

5f91335

We were using VPTERNLOGQ for everything but i32 types, which made broadcasts wider than necessary Noticed in #73509

RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from adc89f0 to 3af0810 Compare December 8, 2023 13:21

RKSimon force-pushed the perf/broadcast-avx512 branch from f738150 to 6d1519e Compare February 5, 2024 12:58

RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from 86ed907 to 925a8d0 Compare February 5, 2024 18:09

RKSimon force-pushed the perf/broadcast-avx512 branch from 925a8d0 to 77629d5 Compare February 28, 2024 10:58

RKSimon force-pushed the perf/broadcast-avx512 branch from 77629d5 to e8d60f1 Compare April 8, 2024 11:10

RKSimon mentioned this pull request Apr 18, 2024

[X86] Use GFNI for vXi8 shifts/rotates #89115

Merged

RKSimon force-pushed the perf/broadcast-avx512 branch from e8d60f1 to 27c0a8a Compare April 18, 2024 21:14

RKSimon force-pushed the perf/broadcast-avx512 branch from 27c0a8a to c6e33bf Compare June 12, 2024 10:46

RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from 7f72657 to 0a147dc Compare June 19, 2024 13:16

goldsteinn reviewed Jun 21, 2024

View reviewed changes

RKSimon mentioned this pull request Dec 16, 2024

AVX mem broadcasts are cached on the stack #120015

Open

RKSimon force-pushed the perf/broadcast-avx512 branch from 0a147dc to d9462dc Compare December 23, 2024 14:10

RKSimon force-pushed the perf/broadcast-avx512 branch from d9462dc to 6eed82f Compare January 10, 2025 16:36

RKSimon force-pushed the perf/broadcast-avx512 branch 3 times, most recently from bf2be9b to 5970ecc Compare January 22, 2025 17:41

RKSimon force-pushed the perf/broadcast-avx512 branch from 5970ecc to a30549e Compare April 27, 2025 15:47

RKSimon force-pushed the perf/broadcast-avx512 branch from a30549e to 1cb1cb2 Compare April 27, 2025 17:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[WIP][X86] lowerBuildVectorAsBroadcast - don't convert constant vectors to broadcasts on AVX512VL targets #73509

[WIP][X86] lowerBuildVectorAsBroadcast - don't convert constant vectors to broadcasts on AVX512VL targets #73509

Uh oh!

RKSimon commented Nov 27, 2023

Uh oh!

github-actions bot commented Nov 27, 2023 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

RKSimon commented Dec 7, 2023

Uh oh!

Uh oh!

goldsteinn Dec 7, 2023

Uh oh!

RKSimon Dec 8, 2023

Uh oh!

goldsteinn commented Dec 7, 2023

Uh oh!

RKSimon commented Dec 8, 2023

Uh oh!

goldsteinn Jun 21, 2024

Uh oh!

RKSimon commented Jun 21, 2024

Uh oh!

Uh oh!

[WIP][X86] lowerBuildVectorAsBroadcast - don't convert constant vectors to broadcasts on AVX512VL targets #73509

Are you sure you want to change the base?

[WIP][X86] lowerBuildVectorAsBroadcast - don't convert constant vectors to broadcasts on AVX512VL targets #73509

Uh oh!

Conversation

RKSimon commented Nov 27, 2023

Uh oh!

github-actions bot commented Nov 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

RKSimon commented Dec 7, 2023

Uh oh!

Uh oh!

goldsteinn Dec 7, 2023

Choose a reason for hiding this comment

Uh oh!

RKSimon Dec 8, 2023

Choose a reason for hiding this comment

Uh oh!

goldsteinn commented Dec 7, 2023

Uh oh!

RKSimon commented Dec 8, 2023

Uh oh!

goldsteinn Jun 21, 2024

Choose a reason for hiding this comment

Uh oh!

RKSimon commented Jun 21, 2024

Uh oh!

Uh oh!

github-actions bot commented Nov 27, 2023 •

edited

Loading