[X86] Combine VPERMV3 to VPERMV for i8/i16 #96414
@llvm/pr-subscribers-backend-x86

Author: Phoebe Wang (phoebewang)

Changes: VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use VPERMV.

Full diff: https://github.com/llvm/llvm-project/pull/96414.diff

2 Files Affected:
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 82d2b301d854e..c4d08206593f0 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -14123,7 +14123,25 @@ static SDValue lowerShuffleWithPERMV(const SDLoc &DL, MVT VT,
MVT MaskVT = VT.changeTypeToInteger();
SDValue MaskNode;
MVT ShuffleVT = VT;
- if (!VT.is512BitVector() && !Subtarget.hasVLX()) {
+ MVT ElementVT = VT.getVectorElementType();
+ // VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use VPERMV.
+ if (Subtarget.hasVBMI() && !V2.isUndef() &&
+ (VT.is128BitVector() ||
+ (VT.is256BitVector() && Subtarget.hasEVEX512())) &&
+ (ElementVT == MVT::i8 || ElementVT == MVT::i16) &&
+ V1.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
+ V1.getConstantOperandVal(1) == 0 &&
+ V2.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
+ V2.getConstantOperandVal(1) == VT.getVectorNumElements() &&
+ V1.getOperand(0) == V2.getOperand(0)) {
+ SmallVector<int, 32> AdjustedMask(Mask);
+ AdjustedMask.resize(VT.getVectorNumElements() * 2, SM_SentinelUndef);
+ ShuffleVT = VT.getDoubleNumVectorElementsVT();
+ V1 = V1.getOperand(0);
+ V2 = DAG.getUNDEF(ShuffleVT);
+ MaskNode = getConstVector(
+ AdjustedMask, MaskVT.getDoubleNumVectorElementsVT(), DAG, DL, true);
+ } else if (!VT.is512BitVector() && !Subtarget.hasVLX()) {
V1 = widenSubVector(V1, false, Subtarget, DAG, DL, 512);
V2 = widenSubVector(V2, false, Subtarget, DAG, DL, 512);
ShuffleVT = V1.getSimpleValueType();
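The hunk above keys on a narrow pattern: both shuffle operands are the low and high halves (EXTRACT_SUBVECTOR at offsets 0 and NumElts) of the same wider source, so the two-source VPERMV3 can become a one-source VPERMV over the full vector, with the mask simply widened with undef sentinels. A minimal Python sketch of just the mask transformation (the helper name is hypothetical; the real code builds SDNodes, not lists):

```python
SM_SentinelUndef = -1  # mirrors LLVM's undef shuffle-mask sentinel

def vpermv3_to_vpermv_mask(mask, num_elts):
    """Given a shuffle mask over two num_elts-wide halves of one source,
    return the equivalent mask for a single permute of the full vector.
    Indices 0..num_elts-1 pick from the low half (V1) and
    num_elts..2*num_elts-1 from the high half (V2), which is exactly how
    they index the concatenated full vector -- so the indices are reused
    as-is and the mask is only padded out with undef entries."""
    return mask + [SM_SentinelUndef] * (2 * num_elts - len(mask))

# 16 meaningful byte indices, as in the test below; they span both halves
mask = [1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 62]
wide = vpermv3_to_vpermv_mask(mask, 32)  # 32 i8 elements per ymm half
assert len(wide) == 64                   # one zmm-wide vpermb mask
assert wide[:16] == mask                 # original indices unchanged
assert all(i == SM_SentinelUndef for i in wide[16:])
```

This corresponds to the `AdjustedMask.resize(VT.getVectorNumElements() * 2, SM_SentinelUndef)` step in the patch: no index remapping is needed, only widening.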
diff --git a/llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll b/llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll
index a481aaef4257d..95e249984e184 100644
--- a/llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll
+++ b/llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll
@@ -348,8 +348,7 @@ define <16 x i8> @trunc_shuffle_v64i8_01_05_09_13_17_21_25_29_33_37_41_45_49_53_
; AVX512VBMIVL-LABEL: trunc_shuffle_v64i8_01_05_09_13_17_21_25_29_33_37_41_45_49_53_57_62:
; AVX512VBMIVL: # %bb.0:
; AVX512VBMIVL-NEXT: vmovdqa {{.*#+}} xmm1 = [1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,62]
-; AVX512VBMIVL-NEXT: vextracti64x4 $1, %zmm0, %ymm2
-; AVX512VBMIVL-NEXT: vpermt2b %ymm2, %ymm1, %ymm0
+; AVX512VBMIVL-NEXT: vpermb %zmm0, %zmm1, %zmm0
; AVX512VBMIVL-NEXT: # kill: def $xmm0 killed $xmm0 killed $zmm0
; AVX512VBMIVL-NEXT: vzeroupper
; AVX512VBMIVL-NEXT: retq
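The test diff shows the payoff: instead of extracting the high ymm and doing a two-source vpermt2b, a single zmm-wide vpermb indexes all 64 source bytes directly. A rough Python model of why the two forms produce the same bytes (semantics only; write-masking and zeroing are omitted):

```python
def vpermb(src, idx):
    # zmm form: each result byte selects one of the 64 source bytes
    # (hardware uses the low 6 bits of each control byte, hence % 64)
    return [src[i % 64] for i in idx]

def vpermt2b(a, b, idx):
    # two-source ymm form: 32 + 32 source bytes; index bit 5 selects
    # between the two inputs, i.e. the inputs act as one 64-byte table
    both = a + b
    return [both[i % 64] for i in idx]

src = list(range(64))                      # a zmm register's 64 bytes
idx = [1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 62]
lo, hi = src[:32], src[32:]                # the two extracted ymm halves
assert vpermb(src, idx) == vpermt2b(lo, hi, idx)  # identical 16 bytes
```

Since the two ymm halves came from the same zmm, concatenating them reconstructs the original register, so the single-source permute is a drop-in replacement and the extract disappears.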
It might be better to handle this inside combineTargetShuffle.
Good point! Done.
// VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use
// VPERMV.
if (VT.is512BitVector() || (VT.is256BitVector() && !Subtarget.hasEVEX512()))
  return SDValue();
Not sure, but it might be better to check the NVT type instead of VT?
The drawback is we need to construct NVT before we know if it's necessary. Let me know if you still think checking NVT is better.
MVTs are cheap to construct - I'd say go for it.
Done.
  return SDValue();
MVT ElementVT = VT.getVectorElementType();
if (ElementVT != MVT::i8 && ElementVT != MVT::i16)
  return SDValue();
Although VPERM2B/W are slower, wouldn't we still avoid an extract subvector by widening the D/Q/PS/PD cases too?
But we are also trying to avoid generating 512-bit instructions. I think I can put it in a separate patch in case it has negative effects.
Can we try to scale the i8/i16 cases?
Do you mean to turn them into i32? I think that relies on specific mask patterns; I'm not sure whether we already do that. Maybe worth a try if not.
It does rely on specific mask patterns (see scaleShuffleElements).
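Widening an i8 mask to i32 only works when each aligned group of four byte indices moves together, which is the property scaleShuffleElements checks. A simplified Python sketch of that constraint (the real helper also handles sentinels and mask narrowing):

```python
def scale_shuffle_mask(mask, scale):
    """Try to rewrite a byte-level mask as an element mask `scale` times
    wider. Succeeds only if every aligned group of `scale` indices reads
    `scale` consecutive, aligned source bytes; returns None otherwise."""
    if len(mask) % scale:
        return None
    out = []
    for g in range(0, len(mask), scale):
        group = mask[g:g + scale]
        if group[0] % scale or any(group[j] != group[0] + j
                                   for j in range(scale)):
            return None  # group not contiguous/aligned: cannot widen
        out.append(group[0] // scale)
    return out

# whole aligned 4-byte chunks move together: widens to an i32 mask
assert scale_shuffle_mask([4, 5, 6, 7, 0, 1, 2, 3], 4) == [1, 0]
# the truncation pattern from the test strides by 4: cannot widen
assert scale_shuffle_mask([1, 5, 9, 13], 4) is None
```

This is why the trunc-shuffle tests in this patch cannot simply be scaled to i32 and must stay as byte permutes.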
if (V1.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
    V1.getConstantOperandVal(1) == 0 &&
    V2.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
    V2.getConstantOperandVal(1) == VT.getVectorNumElements() &&
Should we also handle the inverse (V2.getConstantOperandVal(1) == 0)?
When we built the VECTOR_SHUFFLE from these two EXTRACT_SUBVECTORs, the order is V1 extracted from index 0 and V2 from the upper half. So unless intermediate combines switch them (which I haven't found), we don't need to handle the inverse.
VPERM[I,T]2[B,W] are 3 uops on Skylake and Icelake so we try to use VPERMV if possible.
LGTM
…ERMV combine
A follow up of #96414