[X86] Combine VPERMV3 to VPERMV for i8/i16 #96414
@llvm/pr-subscribers-backend-x86

Author: Phoebe Wang (phoebewang)

Changes: VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use VPERMV.

Full diff: https://github.com/llvm/llvm-project/pull/96414.diff

2 Files Affected:
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 82d2b301d854e..c4d08206593f0 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -14123,7 +14123,25 @@ static SDValue lowerShuffleWithPERMV(const SDLoc &DL, MVT VT,
MVT MaskVT = VT.changeTypeToInteger();
SDValue MaskNode;
MVT ShuffleVT = VT;
- if (!VT.is512BitVector() && !Subtarget.hasVLX()) {
+ MVT ElementVT = VT.getVectorElementType();
+ // VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use VPERMV.
+ if (Subtarget.hasVBMI() && !V2.isUndef() &&
+ (VT.is128BitVector() ||
+ (VT.is256BitVector() && Subtarget.hasEVEX512())) &&
+ (ElementVT == MVT::i8 || ElementVT == MVT::i16) &&
+ V1.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
+ V1.getConstantOperandVal(1) == 0 &&
+ V2.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
+ V2.getConstantOperandVal(1) == VT.getVectorNumElements() &&
+ V1.getOperand(0) == V2.getOperand(0)) {
+ SmallVector<int, 32> AdjustedMask(Mask);
+ AdjustedMask.resize(VT.getVectorNumElements() * 2, SM_SentinelUndef);
+ ShuffleVT = VT.getDoubleNumVectorElementsVT();
+ V1 = V1.getOperand(0);
+ V2 = DAG.getUNDEF(ShuffleVT);
+ MaskNode = getConstVector(
+ AdjustedMask, MaskVT.getDoubleNumVectorElementsVT(), DAG, DL, true);
+ } else if (!VT.is512BitVector() && !Subtarget.hasVLX()) {
V1 = widenSubVector(V1, false, Subtarget, DAG, DL, 512);
V2 = widenSubVector(V2, false, Subtarget, DAG, DL, 512);
ShuffleVT = V1.getSimpleValueType();
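The hunk above keys on a narrow pattern: both shuffle operands are the low and high halves (EXTRACT_SUBVECTOR at offsets 0 and NumElts) of the same wider source, so the two-source VPERMV3 can become a one-source VPERMV over the full vector, with the mask simply widened with undef sentinels. A minimal Python sketch of just the mask transformation (the helper name is hypothetical; the real code builds SDNodes, not lists):

```python
SM_SentinelUndef = -1  # mirrors LLVM's undef shuffle-mask sentinel

def vpermv3_to_vpermv_mask(mask, num_elts):
    """Given a shuffle mask over two num_elts-wide halves of one source,
    return the equivalent mask for a single permute of the full vector.
    Indices 0..num_elts-1 pick from the low half (V1) and
    num_elts..2*num_elts-1 from the high half (V2), which is exactly how
    they index the concatenated full vector -- so the indices are reused
    as-is and the mask is only padded out with undef entries."""
    return mask + [SM_SentinelUndef] * (2 * num_elts - len(mask))

# 16 meaningful byte indices, as in the test below; they span both halves
mask = [1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 62]
wide = vpermv3_to_vpermv_mask(mask, 32)  # 32 i8 elements per ymm half
assert len(wide) == 64                   # one zmm-wide vpermb mask
assert wide[:16] == mask                 # original indices unchanged
assert all(i == SM_SentinelUndef for i in wide[16:])
```

This corresponds to the `AdjustedMask.resize(VT.getVectorNumElements() * 2, SM_SentinelUndef)` step in the patch: no index remapping is needed, only widening.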
diff --git a/llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll b/llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll
index a481aaef4257d..95e249984e184 100644
--- a/llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll
+++ b/llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll
@@ -348,8 +348,7 @@ define <16 x i8> @trunc_shuffle_v64i8_01_05_09_13_17_21_25_29_33_37_41_45_49_53_
; AVX512VBMIVL-LABEL: trunc_shuffle_v64i8_01_05_09_13_17_21_25_29_33_37_41_45_49_53_57_62:
; AVX512VBMIVL: # %bb.0:
; AVX512VBMIVL-NEXT: vmovdqa {{.*#+}} xmm1 = [1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,62]
-; AVX512VBMIVL-NEXT: vextracti64x4 $1, %zmm0, %ymm2
-; AVX512VBMIVL-NEXT: vpermt2b %ymm2, %ymm1, %ymm0
+; AVX512VBMIVL-NEXT: vpermb %zmm0, %zmm1, %zmm0
; AVX512VBMIVL-NEXT: # kill: def $xmm0 killed $xmm0 killed $zmm0
; AVX512VBMIVL-NEXT: vzeroupper
; AVX512VBMIVL-NEXT: retq
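The test diff shows the payoff: instead of extracting the high ymm and doing a two-source vpermt2b, a single zmm-wide vpermb indexes all 64 source bytes directly. A rough Python model of why the two forms produce the same bytes (semantics only; write-masking and zeroing are omitted):

```python
def vpermb(src, idx):
    # zmm form: each result byte selects one of the 64 source bytes
    # (hardware uses the low 6 bits of each control byte, hence % 64)
    return [src[i % 64] for i in idx]

def vpermt2b(a, b, idx):
    # two-source ymm form: 32 + 32 source bytes; index bit 5 selects
    # between the two inputs, i.e. the inputs act as one 64-byte table
    both = a + b
    return [both[i % 64] for i in idx]

src = list(range(64))                      # a zmm register's 64 bytes
idx = [1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 62]
lo, hi = src[:32], src[32:]                # the two extracted ymm halves
assert vpermb(src, idx) == vpermt2b(lo, hi, idx)  # identical 16 bytes
```

Since the two ymm halves came from the same zmm, concatenating them reconstructs the original register, so the single-source permute is a drop-in replacement and the extract disappears.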
It might be better to handle this inside combineTargetShuffle.
Good point! Done.
// VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use
// VPERMV.
if (VT.is512BitVector() || (VT.is256BitVector() && !Subtarget.hasEVEX512()))
  return SDValue();
Not sure, but it might be better to check the NVT type instead of VT?
The drawback is we need to construct NVT before we know if it's necessary. Let me know if you still think checking NVT is better.
MVTs are cheap to construct - I'd say go for it.
Done.
  return SDValue();
MVT ElementVT = VT.getVectorElementType();
if (ElementVT != MVT::i8 && ElementVT != MVT::i16)
  return SDValue();
Although VPERM2B/W are slower, wouldn't we still avoid an extract subvector by widening the D/Q/PS/PD cases too?
But we are also trying to avoid generating 512-bit instructions. I think I can put it in a separate patch in case it has negative effects.
Can we try to scale the i8/i16 cases?
Do you mean to turn them into i32? I think that relies on specific mask patterns; I'm not sure whether we already do that. Maybe worth a try if not.
It does rely on specific mask patterns (see scaleShuffleElements).
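Widening an i8 mask to i32 only works when each aligned group of four byte indices moves together, which is the property scaleShuffleElements checks. A simplified Python sketch of that constraint (the real helper also handles sentinels and mask narrowing):

```python
def scale_shuffle_mask(mask, scale):
    """Try to rewrite a byte-level mask as an element mask `scale` times
    wider. Succeeds only if every aligned group of `scale` indices reads
    `scale` consecutive, aligned source bytes; returns None otherwise."""
    if len(mask) % scale:
        return None
    out = []
    for g in range(0, len(mask), scale):
        group = mask[g:g + scale]
        if group[0] % scale or any(group[j] != group[0] + j
                                   for j in range(scale)):
            return None  # group not contiguous/aligned: cannot widen
        out.append(group[0] // scale)
    return out

# whole aligned 4-byte chunks move together: widens to an i32 mask
assert scale_shuffle_mask([4, 5, 6, 7, 0, 1, 2, 3], 4) == [1, 0]
# the truncation pattern from the test strides by 4: cannot widen
assert scale_shuffle_mask([1, 5, 9, 13], 4) is None
```

This is why the trunc-shuffle tests in this patch cannot simply be scaled to i32 and must stay as byte permutes.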
if (V1.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
    V1.getConstantOperandVal(1) == 0 &&
    V2.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
    V2.getConstantOperandVal(1) == VT.getVectorNumElements() &&
Should we also handle the inverse (V2.getConstantOperandVal(1) == 0)?
When we built the VECTOR_SHUFFLE from these two EXTRACT_SUBVECTORs, the order is V1 extracted from index 0 and V2 from the upper half. So unless intermediate combines switch them (which I haven't found), we don't need to handle the inverse.
VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use VPERMV if possible.
LGTM
…ERMV combine
A follow up of #96414