Skip to content

Conversation

phoebewang
Copy link
Contributor

VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use VPERMV.

@llvmbot
Copy link
Member

llvmbot commented Jun 23, 2024

@llvm/pr-subscribers-backend-x86

Author: Phoebe Wang (phoebewang)

Changes

VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use VPERMV.


Full diff: https://github.com/llvm/llvm-project/pull/96414.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+19-1)
  • (modified) llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll (+1-2)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 82d2b301d854e..c4d08206593f0 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -14123,7 +14123,25 @@ static SDValue lowerShuffleWithPERMV(const SDLoc &DL, MVT VT,
   MVT MaskVT = VT.changeTypeToInteger();
   SDValue MaskNode;
   MVT ShuffleVT = VT;
-  if (!VT.is512BitVector() && !Subtarget.hasVLX()) {
+  MVT ElementVT = VT.getVectorElementType();
+  // VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use VPERMV.
+  if (Subtarget.hasVBMI() && !V2.isUndef() &&
+      (VT.is128BitVector() ||
+       (VT.is256BitVector() && Subtarget.hasEVEX512())) &&
+      (ElementVT == MVT::i8 || ElementVT == MVT::i16) &&
+      V1.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
+      V1.getConstantOperandVal(1) == 0 &&
+      V2.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
+      V2.getConstantOperandVal(1) == VT.getVectorNumElements() &&
+      V1.getOperand(0) == V2.getOperand(0)) {
+    SmallVector<int, 32> AdjustedMask(Mask);
+    AdjustedMask.resize(VT.getVectorNumElements() * 2, SM_SentinelUndef);
+    ShuffleVT = VT.getDoubleNumVectorElementsVT();
+    V1 = V1.getOperand(0);
+    V2 = DAG.getUNDEF(ShuffleVT);
+    MaskNode = getConstVector(
+        AdjustedMask, MaskVT.getDoubleNumVectorElementsVT(), DAG, DL, true);
+  } else if (!VT.is512BitVector() && !Subtarget.hasVLX()) {
     V1 = widenSubVector(V1, false, Subtarget, DAG, DL, 512);
     V2 = widenSubVector(V2, false, Subtarget, DAG, DL, 512);
     ShuffleVT = V1.getSimpleValueType();
diff --git a/llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll b/llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll
index a481aaef4257d..95e249984e184 100644
--- a/llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll
+++ b/llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll
@@ -348,8 +348,7 @@ define <16 x i8> @trunc_shuffle_v64i8_01_05_09_13_17_21_25_29_33_37_41_45_49_53_
 ; AVX512VBMIVL-LABEL: trunc_shuffle_v64i8_01_05_09_13_17_21_25_29_33_37_41_45_49_53_57_62:
 ; AVX512VBMIVL:       # %bb.0:
 ; AVX512VBMIVL-NEXT:    vmovdqa {{.*#+}} xmm1 = [1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,62]
-; AVX512VBMIVL-NEXT:    vextracti64x4 $1, %zmm0, %ymm2
-; AVX512VBMIVL-NEXT:    vpermt2b %ymm2, %ymm1, %ymm0
+; AVX512VBMIVL-NEXT:    vpermb %zmm0, %zmm1, %zmm0
 ; AVX512VBMIVL-NEXT:    # kill: def $xmm0 killed $xmm0 killed $zmm0
 ; AVX512VBMIVL-NEXT:    vzeroupper
 ; AVX512VBMIVL-NEXT:    retq

Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to handle this inside combineTargetShufle

@phoebewang
Copy link
Contributor Author

It might be better to handle this inside combineTargetShufle

Good point! Done.

@phoebewang phoebewang changed the title [X86] Widden binary shuffle to unary for i8/i16 [X86] Combine VPERMV3 to VPERMV for i8/i16 Jun 23, 2024
// VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use
// VPERMV.
if (VT.is512BitVector() || (VT.is256BitVector() && !Subtarget.hasEVEX512()))
return SDValue();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but it might be better to check the NVT type instead of VT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The drawback is we need to construct NVT before we know if it's necessary. Let me know if you still think checking NVT is better.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MVTs are cheap to construct - I'd say go for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return SDValue();
MVT ElementVT = VT.getVectorElementType();
if (ElementVT != MVT::i8 && ElementVT != MVT::i16)
return SDValue();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although VPERM2B/W are slower, do would not still avoid an extract subvector by widening D/Q/PS/PD cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we are also trying to avoid generating 512-bit instructions. I think I can put it in a separate patch in case it has negative effects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we try to scale the i8/i16 cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to turn them into i32? I think it relies on specific mask patterns, I'm not sure if we have done that or not. Maybe worth a try if not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does rely on specific mask patterns (see scaleShuffleElements).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although VPERM2B/W are slower, do would not still avoid an extract subvector by widening D/Q/PS/PD cases?

#97206

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does rely on specific mask patterns (see scaleShuffleElements).

#97212

if (V1.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
V1.getConstantOperandVal(1) == 0 &&
V2.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
V2.getConstantOperandVal(1) == VT.getVectorNumElements() &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also handle the inverse (V2.getConstantOperandVal(1) == 0)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we built VECTOR_SHUFFLE with these two EXTRACT_SUBVECTOR, the order is V1 from 0 and V2 from the half. So unless there are intermediate combines switch them (which I haven't found), we don't need to handle the inverse one.

VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use VPERMV if possible.
Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@phoebewang phoebewang merged commit d9f1166 into llvm:main Jun 30, 2024
@phoebewang phoebewang deleted the vperm branch June 30, 2024 08:53
phoebewang added a commit that referenced this pull request Jun 30, 2024
phoebewang added a commit to phoebewang/llvm-project that referenced this pull request Jun 30, 2024
phoebewang added a commit that referenced this pull request Jul 1, 2024
cpiaseque pushed a commit to cpiaseque/llvm-project that referenced this pull request Jul 3, 2024
VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use
VPERMV.
cpiaseque pushed a commit to cpiaseque/llvm-project that referenced this pull request Jul 3, 2024
cpiaseque pushed a commit to cpiaseque/llvm-project that referenced this pull request Jul 3, 2024
RKSimon added a commit that referenced this pull request Jul 8, 2024
#96414 + #97206 didn't ensure that we were extracting subvectors from a vector double the width of the destination.

We can relax this in a future patch, but fix the #97968 crash first.

Fixes #97968
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants