[VPlan] Don't convert widen recipes to VP intrinsics in EVL transform #126177
Conversation
@llvm/pr-subscribers-vectorizers
Author: Luke Lau (lukel97)
Changes
This patch proposes to avoid converting widening recipes to VP intrinsics during the EVL transform. IIUC we initially did this to avoid `vl` toggles on RISC-V. However, we now have the RISCVVLOptimizer pass, which mostly makes this redundant.
Emitting regular IR instead of VP intrinsics allows more generic optimisations, both in the middle end and DAGCombiner, and we generally have better patterns in the RISC-V backend for non-VP nodes. Sticking to regular IR instructions is likely a lot less work than reimplementing all of these optimisations for VP intrinsics.
On SPEC CPU 2017 we get noticeably better code generation:
- vzext.vf2 v16, v14
- vzext.vf2 v14, v15
- vwsubu.vv v15, v16, v14
- vsetvli zero, zero, e32, m1, ta, ma
- vmul.vv v14, v15, v15
+ vwsubu.vv v16, v14, v15
+ vsetvli zero, zero, e16, mf2, ta, ma
+ vwmul.vv v14, v16, v16
@@ -6896,24 +6825,19 @@
sub s5, t4, t6
add s1, a2, t6
vsetvli s5, s5, e32, m2, ta, ma
- vle8.v v8, (s1)
+ vle8.v v10, (s1)
add s1, t5, t6
sub s0, s0, t1
- vmv.v.i v12, 0
- vzext.vf4 v14, v8
- vmadd.vx v14, s2, v10
- vsra.vx v8, v14, s3
- vadd.vx v14, v8, s4
- vmsgt.vi v0, v14, 0
- vmsltu.vx v8, v14, t0
- vmerge.vim v12, v12, -1, v0
- vmv1r.v v0, v8
- vmerge.vvm v8, v12, v14, v0
- vsetvli zero, zero, e16, m1, ta, ma
- vnsrl.wi v12, v8, 0
+ vzext.vf4 v12, v10
+ vmadd.vx v12, s2, v8
+ vsra.vx v10, v12, s3
+ vadd.vx v10, v10, s4
+ vmax.vx v10, v10, zero
+ vsetvli zero, zero, e16, m1, ta, ma
+ vnclipu.wi v12, v10, 0
vluxei64.v v10, (a6), v12
sub a5, a5, a4
vmadd.vv v14, v10, v9
- vdiv.vx v10, v14, s1
+ vmulh.vx v10, v14, s0
+ vsra.vi v10, v10, 12
+ vsrl.vi v11, v10, 31
+ vadd.vv v10, v10, v11
vand.vx v9, v10, s4
vmsne.vi v0, v9, 0
- vmv.v.i v9, 0
- vmerge.vim v9, v9, 1, v0
- vsetvli zero, zero, e32, m1, tu, ma
- vadd.vv v8, v8, v9
+ sub a4, a4, a3
+ vsetvli zero, zero, e32, m1, tu, mu
+ vadd.vi v8, v8, 1, v0.t
- vsetvli s1, a4, e32, m1, ta, ma
+ vsetvli a4, a4, e32, m1, ta, ma
vle64.v v10, (a5)
sub a2, a2, s8
vluxei64.v v9, (zero), v10
vsetvli a5, zero, e64, m2, ta, ma
- vmv.v.x v10, s1
+ vmv.v.x v10, a4
vmsleu.vv v12, v10, v14
- vmsltu.vx v10, v14, s1
+ vmsltu.vx v10, v14, a4
vmand.mm v11, v8, v12
- vsetvli zero, a4, e32, m1, ta, ma
+ vsetvli zero, zero, e32, m1, ta, ma
vand.vx v9, v9, s6
- vsetvli a4, zero, e32, m1, ta, ma
I've removed the VPWidenEVLRecipe in this patch since it's no longer used, but if people would prefer to keep it for other targets (PPC?) then I'd be happy to add it back in and gate it under a target hook.
Patch is 107.47 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126177.diff
23 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index fac207287e0bcc7..064e916d6d81718 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -523,7 +523,6 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPValue {
case VPRecipeBase::VPWidenGEPSC:
case VPRecipeBase::VPWidenIntrinsicSC:
case VPRecipeBase::VPWidenSC:
- case VPRecipeBase::VPWidenEVLSC:
case VPRecipeBase::VPWidenSelectSC:
case VPRecipeBase::VPBlendSC:
case VPRecipeBase::VPPredInstPHISC:
@@ -710,7 +709,6 @@ class VPRecipeWithIRFlags : public VPSingleDefRecipe {
static inline bool classof(const VPRecipeBase *R) {
return R->getVPDefID() == VPRecipeBase::VPInstructionSC ||
R->getVPDefID() == VPRecipeBase::VPWidenSC ||
- R->getVPDefID() == VPRecipeBase::VPWidenEVLSC ||
R->getVPDefID() == VPRecipeBase::VPWidenGEPSC ||
R->getVPDefID() == VPRecipeBase::VPWidenCastSC ||
R->getVPDefID() == VPRecipeBase::VPWidenIntrinsicSC ||
@@ -1116,8 +1114,7 @@ class VPWidenRecipe : public VPRecipeWithIRFlags {
}
static inline bool classof(const VPRecipeBase *R) {
- return R->getVPDefID() == VPRecipeBase::VPWidenSC ||
- R->getVPDefID() == VPRecipeBase::VPWidenEVLSC;
+ return R->getVPDefID() == VPRecipeBase::VPWidenSC;
}
static inline bool classof(const VPUser *U) {
@@ -1142,54 +1139,6 @@ class VPWidenRecipe : public VPRecipeWithIRFlags {
#endif
};
-/// A recipe for widening operations with vector-predication intrinsics with
-/// explicit vector length (EVL).
-class VPWidenEVLRecipe : public VPWidenRecipe {
- using VPRecipeWithIRFlags::transferFlags;
-
-public:
- template <typename IterT>
- VPWidenEVLRecipe(Instruction &I, iterator_range<IterT> Operands, VPValue &EVL)
- : VPWidenRecipe(VPDef::VPWidenEVLSC, I, Operands) {
- addOperand(&EVL);
- }
- VPWidenEVLRecipe(VPWidenRecipe &W, VPValue &EVL)
- : VPWidenEVLRecipe(*W.getUnderlyingInstr(), W.operands(), EVL) {
- transferFlags(W);
- }
-
- ~VPWidenEVLRecipe() override = default;
-
- VPWidenRecipe *clone() override final {
- llvm_unreachable("VPWidenEVLRecipe cannot be cloned");
- return nullptr;
- }
-
- VP_CLASSOF_IMPL(VPDef::VPWidenEVLSC);
-
- VPValue *getEVL() { return getOperand(getNumOperands() - 1); }
- const VPValue *getEVL() const { return getOperand(getNumOperands() - 1); }
-
- /// Produce a vp-intrinsic using the opcode and operands of the recipe,
- /// processing EVL elements.
- void execute(VPTransformState &State) override final;
-
- /// Returns true if the recipe only uses the first lane of operand \p Op.
- bool onlyFirstLaneUsed(const VPValue *Op) const override {
- assert(is_contained(operands(), Op) &&
- "Op must be an operand of the recipe");
- // EVL in that recipe is always the last operand, thus any use before means
- // the VPValue should be vectorized.
- return getEVL() == Op;
- }
-
-#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
- /// Print the recipe.
- void print(raw_ostream &O, const Twine &Indent,
- VPSlotTracker &SlotTracker) const override final;
-#endif
-};
-
/// VPWidenCastRecipe is a recipe to create vector cast instructions.
class VPWidenCastRecipe : public VPRecipeWithIRFlags {
/// Cast instruction opcode.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 71fb6d42116cfe4..dcd8eeecdc15b06 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -245,9 +245,8 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
VPPartialReductionRecipe>([this](const VPRecipeBase *R) {
return inferScalarType(R->getOperand(0));
})
- .Case<VPBlendRecipe, VPInstruction, VPWidenRecipe, VPWidenEVLRecipe,
- VPReplicateRecipe, VPWidenCallRecipe, VPWidenMemoryRecipe,
- VPWidenSelectRecipe>(
+ .Case<VPBlendRecipe, VPInstruction, VPWidenRecipe, VPReplicateRecipe,
+ VPWidenCallRecipe, VPWidenMemoryRecipe, VPWidenSelectRecipe>(
[this](const auto *R) { return inferScalarTypeForRecipe(R); })
.Case<VPWidenIntrinsicRecipe>([](const VPWidenIntrinsicRecipe *R) {
return R->getResultType();
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index c84a93d7398f73b..83c764b87953d9d 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -85,7 +85,6 @@ bool VPRecipeBase::mayWriteToMemory() const {
case VPWidenLoadSC:
case VPWidenPHISC:
case VPWidenSC:
- case VPWidenEVLSC:
case VPWidenSelectSC: {
const Instruction *I =
dyn_cast_or_null<Instruction>(getVPSingleValue()->getUnderlyingValue());
@@ -131,7 +130,6 @@ bool VPRecipeBase::mayReadFromMemory() const {
case VPWidenIntOrFpInductionSC:
case VPWidenPHISC:
case VPWidenSC:
- case VPWidenEVLSC:
case VPWidenSelectSC: {
const Instruction *I =
dyn_cast_or_null<Instruction>(getVPSingleValue()->getUnderlyingValue());
@@ -172,7 +170,6 @@ bool VPRecipeBase::mayHaveSideEffects() const {
case VPWidenPHISC:
case VPWidenPointerInductionSC:
case VPWidenSC:
- case VPWidenEVLSC:
case VPWidenSelectSC: {
const Instruction *I =
dyn_cast_or_null<Instruction>(getVPSingleValue()->getUnderlyingValue());
@@ -1550,42 +1547,6 @@ InstructionCost VPWidenRecipe::computeCost(ElementCount VF,
}
}
-void VPWidenEVLRecipe::execute(VPTransformState &State) {
- unsigned Opcode = getOpcode();
- // TODO: Support other opcodes
- if (!Instruction::isBinaryOp(Opcode) && !Instruction::isUnaryOp(Opcode))
- llvm_unreachable("Unsupported opcode in VPWidenEVLRecipe::execute");
-
- State.setDebugLocFrom(getDebugLoc());
-
- assert(State.get(getOperand(0))->getType()->isVectorTy() &&
- "VPWidenEVLRecipe should not be used for scalars");
-
- VPValue *EVL = getEVL();
- Value *EVLArg = State.get(EVL, /*NeedsScalar=*/true);
- IRBuilderBase &BuilderIR = State.Builder;
- VectorBuilder Builder(BuilderIR);
- Value *Mask = BuilderIR.CreateVectorSplat(State.VF, BuilderIR.getTrue());
-
- SmallVector<Value *, 4> Ops;
- for (unsigned I = 0, E = getNumOperands() - 1; I < E; ++I) {
- VPValue *VPOp = getOperand(I);
- Ops.push_back(State.get(VPOp));
- }
-
- Builder.setMask(Mask).setEVL(EVLArg);
- Value *VPInst =
- Builder.createVectorInstruction(Opcode, Ops[0]->getType(), Ops, "vp.op");
- // Currently vp-intrinsics only accept FMF flags.
- // TODO: Enable other flags when support is added.
- if (isa<FPMathOperator>(VPInst))
- setFlags(cast<Instruction>(VPInst));
-
- State.set(this, VPInst);
- State.addMetadata(VPInst,
- dyn_cast_or_null<Instruction>(getUnderlyingValue()));
-}
-
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
void VPWidenRecipe::print(raw_ostream &O, const Twine &Indent,
VPSlotTracker &SlotTracker) const {
@@ -1595,15 +1556,6 @@ void VPWidenRecipe::print(raw_ostream &O, const Twine &Indent,
printFlags(O);
printOperands(O, SlotTracker);
}
-
-void VPWidenEVLRecipe::print(raw_ostream &O, const Twine &Indent,
- VPSlotTracker &SlotTracker) const {
- O << Indent << "WIDEN ";
- printAsOperand(O, SlotTracker);
- O << " = vp." << Instruction::getOpcodeName(getOpcode());
- printFlags(O);
- printOperands(O, SlotTracker);
-}
#endif
void VPWidenCastRecipe::execute(VPTransformState &State) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 7e9ef46133936d7..18323a078f1ee0c 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -1665,40 +1665,10 @@ static VPRecipeBase *createEVLRecipe(VPValue *HeaderMask,
VPValue *NewMask = GetNewMask(S->getMask());
return new VPWidenStoreEVLRecipe(*S, EVL, NewMask);
})
- .Case<VPWidenRecipe>([&](VPWidenRecipe *W) -> VPRecipeBase * {
- unsigned Opcode = W->getOpcode();
- if (!Instruction::isBinaryOp(Opcode) && !Instruction::isUnaryOp(Opcode))
- return nullptr;
- return new VPWidenEVLRecipe(*W, EVL);
- })
.Case<VPReductionRecipe>([&](VPReductionRecipe *Red) {
VPValue *NewMask = GetNewMask(Red->getCondOp());
return new VPReductionEVLRecipe(*Red, EVL, NewMask);
})
- .Case<VPWidenIntrinsicRecipe, VPWidenCastRecipe>(
- [&](auto *CR) -> VPRecipeBase * {
- Intrinsic::ID VPID;
- if (auto *CallR = dyn_cast<VPWidenIntrinsicRecipe>(CR)) {
- VPID =
- VPIntrinsic::getForIntrinsic(CallR->getVectorIntrinsicID());
- } else {
- auto *CastR = cast<VPWidenCastRecipe>(CR);
- VPID = VPIntrinsic::getForOpcode(CastR->getOpcode());
- }
-
- // Not all intrinsics have a corresponding VP intrinsic.
- if (VPID == Intrinsic::not_intrinsic)
- return nullptr;
- assert(VPIntrinsic::getMaskParamPos(VPID) &&
- VPIntrinsic::getVectorLengthParamPos(VPID) &&
- "Expected VP intrinsic to have mask and EVL");
-
- SmallVector<VPValue *> Ops(CR->operands());
- Ops.push_back(&AllOneMask);
- Ops.push_back(&EVL);
- return new VPWidenIntrinsicRecipe(
- VPID, Ops, TypeInfo.inferScalarType(CR), CR->getDebugLoc());
- })
.Case<VPWidenSelectRecipe>([&](VPWidenSelectRecipe *Sel) {
SmallVector<VPValue *> Ops(Sel->operands());
Ops.push_back(&EVL);
diff --git a/llvm/lib/Transforms/Vectorize/VPlanValue.h b/llvm/lib/Transforms/Vectorize/VPlanValue.h
index aabc4ab571e7a91..a058b2a121d59f3 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanValue.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanValue.h
@@ -351,7 +351,6 @@ class VPDef {
VPWidenStoreEVLSC,
VPWidenStoreSC,
VPWidenSC,
- VPWidenEVLSC,
VPWidenSelectSC,
VPBlendSC,
VPHistogramSC,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
index 96156de444f8813..a4b309d6dcd9fe3 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
@@ -145,10 +145,6 @@ bool VPlanVerifier::verifyEVLRecipe(const VPInstruction &EVL) const {
[&](const VPRecipeBase *S) { return VerifyEVLUse(*S, 2); })
.Case<VPWidenLoadEVLRecipe, VPReverseVectorPointerRecipe>(
[&](const VPRecipeBase *R) { return VerifyEVLUse(*R, 1); })
- .Case<VPWidenEVLRecipe>([&](const VPWidenEVLRecipe *W) {
- return VerifyEVLUse(*W,
- Instruction::isUnaryOp(W->getOpcode()) ? 1 : 2);
- })
.Case<VPScalarCastRecipe>(
[&](const VPScalarCastRecipe *S) { return VerifyEVLUse(*S, 0); })
.Case<VPInstruction>([&](const VPInstruction *I) {
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll b/llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll
index 14818199072c285..b96a44a546a141a 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll
@@ -143,8 +143,8 @@ define i32 @add_i16_i32(ptr nocapture readonly %x, i32 %n) {
; IF-EVL-OUTLOOP-NEXT: [[TMP7:%.*]] = getelementptr inbounds i16, ptr [[X:%.*]], i32 [[TMP6]]
; IF-EVL-OUTLOOP-NEXT: [[TMP8:%.*]] = getelementptr inbounds i16, ptr [[TMP7]], i32 0
; IF-EVL-OUTLOOP-NEXT: [[VP_OP_LOAD:%.*]] = call <vscale x 4 x i16> @llvm.vp.load.nxv4i16.p0(ptr align 2 [[TMP8]], <vscale x 4 x i1> splat (i1 true), i32 [[TMP5]])
-; IF-EVL-OUTLOOP-NEXT: [[TMP9:%.*]] = call <vscale x 4 x i32> @llvm.vp.sext.nxv4i32.nxv4i16(<vscale x 4 x i16> [[VP_OP_LOAD]], <vscale x 4 x i1> splat (i1 true), i32 [[TMP5]])
-; IF-EVL-OUTLOOP-NEXT: [[VP_OP:%.*]] = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(<vscale x 4 x i32> [[VEC_PHI]], <vscale x 4 x i32> [[TMP9]], <vscale x 4 x i1> splat (i1 true), i32 [[TMP5]])
+; IF-EVL-OUTLOOP-NEXT: [[TMP9:%.*]] = sext <vscale x 4 x i16> [[VP_OP_LOAD]] to <vscale x 4 x i32>
+; IF-EVL-OUTLOOP-NEXT: [[VP_OP:%.*]] = add <vscale x 4 x i32> [[VEC_PHI]], [[TMP9]]
; IF-EVL-OUTLOOP-NEXT: [[TMP10]] = call <vscale x 4 x i32> @llvm.vp.merge.nxv4i32(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i32> [[VP_OP]], <vscale x 4 x i32> [[VEC_PHI]], i32 [[TMP5]])
; IF-EVL-OUTLOOP-NEXT: [[INDEX_EVL_NEXT]] = add nuw i32 [[TMP5]], [[EVL_BASED_IV]]
; IF-EVL-OUTLOOP-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP4]]
@@ -200,7 +200,7 @@ define i32 @add_i16_i32(ptr nocapture readonly %x, i32 %n) {
; IF-EVL-INLOOP-NEXT: [[TMP8:%.*]] = getelementptr inbounds i16, ptr [[X:%.*]], i32 [[TMP7]]
; IF-EVL-INLOOP-NEXT: [[TMP9:%.*]] = getelementptr inbounds i16, ptr [[TMP8]], i32 0
; IF-EVL-INLOOP-NEXT: [[VP_OP_LOAD:%.*]] = call <vscale x 8 x i16> @llvm.vp.load.nxv8i16.p0(ptr align 2 [[TMP9]], <vscale x 8 x i1> splat (i1 true), i32 [[TMP6]])
-; IF-EVL-INLOOP-NEXT: [[TMP14:%.*]] = call <vscale x 8 x i32> @llvm.vp.sext.nxv8i32.nxv8i16(<vscale x 8 x i16> [[VP_OP_LOAD]], <vscale x 8 x i1> splat (i1 true), i32 [[TMP6]])
+; IF-EVL-INLOOP-NEXT: [[TMP14:%.*]] = sext <vscale x 8 x i16> [[VP_OP_LOAD]] to <vscale x 8 x i32>
; IF-EVL-INLOOP-NEXT: [[TMP10:%.*]] = call i32 @llvm.vp.reduce.add.nxv8i32(i32 0, <vscale x 8 x i32> [[TMP14]], <vscale x 8 x i1> splat (i1 true), i32 [[TMP6]])
; IF-EVL-INLOOP-NEXT: [[TMP11]] = add i32 [[TMP10]], [[VEC_PHI]]
; IF-EVL-INLOOP-NEXT: [[INDEX_EVL_NEXT]] = add nuw i32 [[TMP6]], [[EVL_BASED_IV]]
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/truncate-to-minimal-bitwidth-evl-crash.ll b/llvm/test/Transforms/LoopVectorize/RISCV/truncate-to-minimal-bitwidth-evl-crash.ll
index 68b36f23de4b053..ba7158eb02d904a 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/truncate-to-minimal-bitwidth-evl-crash.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/truncate-to-minimal-bitwidth-evl-crash.ll
@@ -27,10 +27,10 @@ define void @truncate_to_minimal_bitwidths_widen_cast_recipe(ptr %src) {
; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[SRC]], i64 [[TMP4]]
; CHECK-NEXT: [[TMP6:%.*]] = getelementptr i8, ptr [[TMP5]], i32 0
; CHECK-NEXT: [[VP_OP_LOAD:%.*]] = call <vscale x 1 x i8> @llvm.vp.load.nxv1i8.p0(ptr align 1 [[TMP6]], <vscale x 1 x i1> splat (i1 true), i32 [[TMP3]])
-; CHECK-NEXT: [[TMP7:%.*]] = call <vscale x 1 x i16> @llvm.vp.zext.nxv1i16.nxv1i8(<vscale x 1 x i8> [[VP_OP_LOAD]], <vscale x 1 x i1> splat (i1 true), i32 [[TMP3]])
-; CHECK-NEXT: [[VP_OP:%.*]] = call <vscale x 1 x i16> @llvm.vp.mul.nxv1i16(<vscale x 1 x i16> zeroinitializer, <vscale x 1 x i16> [[TMP7]], <vscale x 1 x i1> splat (i1 true), i32 [[TMP3]])
-; CHECK-NEXT: [[VP_OP1:%.*]] = call <vscale x 1 x i16> @llvm.vp.lshr.nxv1i16(<vscale x 1 x i16> [[VP_OP]], <vscale x 1 x i16> trunc (<vscale x 1 x i32> splat (i32 1) to <vscale x 1 x i16>), <vscale x 1 x i1> splat (i1 true), i32 [[TMP3]])
-; CHECK-NEXT: [[TMP8:%.*]] = call <vscale x 1 x i8> @llvm.vp.trunc.nxv1i8.nxv1i16(<vscale x 1 x i16> [[VP_OP1]], <vscale x 1 x i1> splat (i1 true), i32 [[TMP3]])
+; CHECK-NEXT: [[TMP7:%.*]] = zext <vscale x 1 x i8> [[VP_OP_LOAD]] to <vscale x 1 x i16>
+; CHECK-NEXT: [[TMP12:%.*]] = mul <vscale x 1 x i16> zeroinitializer, [[TMP7]]
+; CHECK-NEXT: [[VP_OP1:%.*]] = lshr <vscale x 1 x i16> [[TMP12]], trunc (<vscale x 1 x i32> splat (i32 1) to <vscale x 1 x i16>)
+; CHECK-NEXT: [[TMP8:%.*]] = trunc <vscale x 1 x i16> [[VP_OP1]] to <vscale x 1 x i8>
; CHECK-NEXT: call void @llvm.vp.scatter.nxv1i8.nxv1p0(<vscale x 1 x i8> [[TMP8]], <vscale x 1 x ptr> align 1 zeroinitializer, <vscale x 1 x i1> splat (i1 true), i32 [[TMP3]])
; CHECK-NEXT: [[TMP9:%.*]] = zext i32 [[TMP3]] to i64
; CHECK-NEXT: [[INDEX_EVL_NEXT]] = add nuw i64 [[TMP9]], [[EVL_BASED_IV]]
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/type-info-cache-evl-crash.ll b/llvm/test/Transforms/LoopVectorize/RISCV/type-info-cache-evl-crash.ll
index 48b73c7f1a4deef..03e4b661a941bdc 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/type-info-cache-evl-crash.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/type-info-cache-evl-crash.ll
@@ -44,15 +44,15 @@ define void @type_info_cache_clobber(ptr %dstv, ptr %src, i64 %wide.trip.count)
; CHECK-NEXT: [[TMP13:%.*]] = getelementptr i8, ptr [[SRC]], i64 [[TMP12]]
; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i8, ptr [[TMP13]], i32 0
; CHECK-NEXT: [[VP_OP_LOAD:%.*]] = call <vscale x 8 x i8> @llvm.vp.load.nxv8i8.p0(ptr align 1 [[TMP14]], <vscale x 8 x i1> splat (i1 true), i32 [[TMP11]]), !alias.scope [[META0:![0-9]+]]
-; CHECK-NEXT: [[TMP15:%.*]] = call <vscale x 8 x i32> @llvm.vp.zext.nxv8i32.nxv8i8(<vscale x 8 x i8> [[VP_OP_LOAD]], <vscale x 8 x i1> splat (i1 true), i32 [[TMP11]])
-; CHECK-NEXT: [[VP_OP:%.*]] = call <vscale x 8 x i32> @llvm.vp.mul.nxv8i32(<vscale x 8 x i32> [[TMP15]], <vscale x 8 x i32> zeroinitializer, <vscale x 8 x i1> splat (i1 true), i32 [[TMP11]])
-; CHECK-NEXT: [[VP_OP2:%.*]] = call <vscale x 8 x i32> @llvm.vp.ashr.nxv8i32(<vscale x 8 x i32> [[TMP15]], <vscale x 8 x i32> zeroinitializer, <vscale x 8 x i1> splat (i1 true), i32 [[TMP11]])
-; CHECK-NEXT: [[VP_OP3:%.*]] = call <vscale x 8 x i32> @llvm.vp.or.nxv8i32(<vscale x 8 x i32> [[VP_OP2]], <vscale x 8 x i32> zeroinitializer, <vscale x 8 x i1> splat (i1 true), i32 [[TMP11]])
+; CHECK-NEXT: [[TMP15:%.*]] = zext <vscale x 8 x i8> [[VP_OP_LOAD]] to <vscale x 8 x i32>
+; CHECK-NEXT: [[VP_OP:%.*]] = mul <vscale x 8 x i32> [[TMP15]], zeroinitializer
+; CHECK-NEXT: [[TMP23:%.*]] = ashr <vscale x 8 x i32> [[TMP15]], zeroinitializer
+; CHECK-NEXT: [[VP_OP3:%.*]] = or <vscale x 8 x i32> [[TMP23]], zeroinitializer
; CHECK-NEXT: [[TMP16:%.*]] = icmp ult <vscale x 8 x i32> [[TMP15]], zeroinitializer
; CHECK-NEXT: [[TMP17:%.*]] = call <vscale x 8 x i32> @llvm.vp.select.nxv8i32(<vscale x 8 x i1> [[TMP16]], <vscale x 8 x i32> [[VP_OP3]], <vscale x 8 x i32> zeroinitializer, i32 [[TMP11]])
-; CHECK-NEXT: [[TMP18:%.*]] = call <vscale x 8 x i8> @llvm.vp.trunc.nxv8i8.nxv8i32(<vscale x 8 x i32> [[TMP17]], <vscale x 8 x i1> splat (i1 true), i32 [[TMP11]])
-; CHECK-NEXT: call void @llvm.vp.scatter.nxv8i8.nxv8p0(<vscale x 8 x i8> [[TMP18]], <vscale x 8 x ptr> align 1 [[BROADCAST_SPLAT]], <vscale x 8 x i1> splat (i1 true), i32 [[TMP11]]), !alias.scope [[META3:![0-9]+]], !noalias [[META0]]
-; CHECK-NEXT: [[TMP19:%.*]] = call <vscale x 8 x i16> @llvm.vp.trunc.nxv8i16.nxv8i32(<vscale x 8 x i32> [[VP_OP]], <vscale x 8 x i1> splat (i1 true), i32 [[TMP11]])
+; CHECK-NEXT: [[TMP24:%.*]] = trunc <vscale x 8 x i32> [[TMP17]] to <vscale x 8 x i8>
+; CHECK-NEXT: call void @llvm.vp.scatter.nxv8i8.nxv8p0(<vscale x 8 x i8> [[TMP24]], <vscale x 8 x ptr> align 1 [[BROADCAST_SPLAT]], <vscale x 8 x i1> splat (i1 true), i32 [[TMP11]]), !alias.scope [[META3:![0-9]+]], !noalias [[META0]]
+; CHECK-NEXT: [[TMP19:%.*]] = trunc <vscale x 8 x i32> [[VP_OP]] to <vscale x 8 x i16>
; CHECK-NEXT: call void @llvm.vp.scatter.nxv8i16.nxv8p0(<vscale x 8 x i16> [[TMP19]], <vscale x 8 x ptr> align 2 zeroinitializer, <vscale x 8 x i1> splat (i1 true), i32 [[TMP11]])
; CHECK-NEXT: [[TMP20:%.*]] = zext i32 [[TMP11]] to i64
; CHECK-NEXT: [[INDEX_EVL_NEXT]] = add i64 [[TMP20]], [[EVL_BASED_IV]]
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-bin-unary-ops-args.ll b/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-bin-unary-ops-args.ll
index df9ca218aad7031..2c111ff674eaea8 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-bin-unary-ops-args.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-bin-unary-ops-args.ll
@@ -42,7 +42,7 @@ define void @test_and(ptr nocapture %a, ptr nocapture readonly %b) {
; IF-EVL-NEXT: [[TMP13:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP12]]
; IF-EVL-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[TMP13]], i32 0
; IF-EV...
[truncated]
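To make the difference concrete, here is a minimal before/after sketch adapted from the inloop-reduction.ll update in the patch above (value names shortened for illustration; not a complete module):

; Before this patch: the widened cast and add were rewritten into VP intrinsics
; carrying an all-true mask and the EVL.
%load = call <vscale x 4 x i16> @llvm.vp.load.nxv4i16.p0(ptr align 2 %p, <vscale x 4 x i1> splat (i1 true), i32 %evl)
%ext = call <vscale x 4 x i32> @llvm.vp.sext.nxv4i32.nxv4i16(<vscale x 4 x i16> %load, <vscale x 4 x i1> splat (i1 true), i32 %evl)
%add = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(<vscale x 4 x i32> %phi, <vscale x 4 x i32> %ext, <vscale x 4 x i1> splat (i1 true), i32 %evl)

; After this patch: only the memory access remains a VP intrinsic; the widened
; arithmetic is emitted as regular IR that existing combines understand.
%load2 = call <vscale x 4 x i16> @llvm.vp.load.nxv4i16.p0(ptr align 2 %p, <vscale x 4 x i1> splat (i1 true), i32 %evl)
%ext2 = sext <vscale x 4 x i16> %load2 to <vscale x 4 x i32>
%add2 = add <vscale x 4 x i32> %phi, %ext2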
I don't know if this is the right approach, but it really improves the codegen quality now. We still lack the ability to handle VP intrinsics in InstCombine and DAGCombiner.
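As an illustration of that gap (my own example, not taken from the patch): an identity fold that InstCombine performs trivially on plain vector IR is generally not performed on the equivalent VP call, because few combines are taught to match vp.* intrinsics.

; Plain IR: folded away to just %x by InstSimplify/InstCombine.
%r = add <vscale x 4 x i32> %x, zeroinitializer

; VP form of the same operation: typically survives the middle end untouched.
%r.vp = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(<vscale x 4 x i32> %x, <vscale x 4 x i32> zeroinitializer, <vscale x 4 x i1> splat (i1 true), i32 %evl)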
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is a good start. :)
I have a question: could you check whether div is handled by VPWidenEVLRecipe? If so, we may need to handle div separately to avoid division-by-zero issues, rather than simply discarding VPWidenEVLRecipe.
I had the same question too. I checked beforehand, and it looks like LoopVectorizationLegality is used when the VPlan is constructed to mask off any lanes that would trap, e.g.:
void f(int *x, int *y, int n) {
  for (int i = 0; i < n; i++)
    x[i] /= y[i];
}
The vectorized loop body comes out as:
.LBB0_5: # %vector.body
# =>This Inner Loop Header: Depth=1
sub t0, a2, a3
sh2add a6, a3, a1
sh2add a4, a3, a0
vsetvli t0, t0, e8, mf2, ta, ma
vmv2r.v v10, v8
vle32.v v12, (a4)
vsetvli zero, zero, e32, m2, tu, ma
vle32.v v10, (a6)
sub a5, a5, a7
vsetvli zero, zero, e32, m2, ta, ma
vdiv.vv v10, v12, v10
vse32.v v10, (a4)
add a3, a3, t0
bnez a5, .LBB0_5
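In IR terms this corresponds roughly to the following shape (a sketch with made-up names, not the actual vectorizer output): the divisor is replaced with a safe value on inactive lanes, so the unpredicated sdiv cannot trap and needs no VP intrinsic.

; Inactive lanes get a divisor of 1, so the plain sdiv cannot fault.
%safe.divisor = select <vscale x 2 x i1> %active.mask, <vscale x 2 x i32> %divisor, <vscale x 2 x i32> splat (i32 1)
%quotient = sdiv <vscale x 2 x i32> %dividend, %safe.divisor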
Very excited to see this! I think this was also proposed originally when EVL support was added, but at that time apparently there were benefits for having EVL everywhere. Glad to hear that has changed!
What a surprise! I didn't realize that converting select to vp.merge had this use case.
I could easily have missed some decision of the EVL vectorizer, as I rarely check what's going on there. So if you and @alexey-bataev @Mel-Chen @arcbbb @ppenzin @preames agree to come back to the proposal of converting instructions to VP intrinsics after the vectorizer, then this change seems OK.
The patch relies on VLOptimizer, which is supposed (at least) to optimize VL for the instructions.
@npanchen As Alexey said, the RISCVVLOptimizer pass does exactly that; see #108640. It's more general than the EVL on VP intrinsics, as it's able to reason about demanded elements, and with this patch it looks like it's actually able to reduce VL in a few more places due to there being fewer constraints.
Ideally yes, but I think this is a large amount of work that's unlikely to be finished any time soon. The goal here is to enable EVL tail folding within a reasonable time frame, and this addresses the codegen issues due to the lack of VP support in InstCombine etc.
I messed up and opened this PR from the main branch of my fork, and force pushing to it seems to have automatically closed this PR with no option to reopen it, so I had to open up a new PR here: #127180. Sorry for the noise!
I see. So the current approach is to generate explicit EVL only for memory operations and reductions?
…#127180) This is a copy of #126177, since it was automatically and permanently closed because I messed up the source branch on my remote. This patch proposes to avoid converting widening recipes to VP intrinsics during the EVL transform. IIUC we initially did this to avoid `vl` toggles on RISC-V. However, we now have the RISCVVLOptimizer pass, which mostly makes this redundant. Emitting regular IR instead of VP intrinsics allows more generic optimisations, both in the middle end and DAGCombiner, and we generally have better patterns in the RISC-V backend for non-VP nodes. Sticking to regular IR instructions is likely a lot less work than reimplementing all of these optimisations for VP intrinsics, and on SPEC CPU 2017 we get noticeably better code generation.
I'm sorry for the delayed response. Initially, I only considered converting recipes with a mask operand, such as memory accesses and reductions, as you mentioned, because these recipes might directly use the header mask as an operand. However, with what @arcbbb is introducing, fixed-order recurrence is a special case: due to the non-VF×UF penultimate EVL issue, llvm.splice must be converted to vp.splice. In summary, the strategy I currently recommend is to check whether a recipe has a mask operand or relies on VF to determine if it needs to be transformed into VP intrinsics.
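For reference, a hand-written sketch of that splice difference (intrinsic signatures as I understand them from the LangRef; not code from this patch): llvm.vector.splice uses a fixed offset and so assumes the previous vector holds a full VF of valid elements, while vp.splice takes the EVL of each input, which keeps the recurrence correct when the penultimate iteration processes fewer than VF×UF elements.

; Fixed-offset splice: takes the last element of %prev followed by the first
; VF-1 elements of %cur, assuming all lanes of %prev are valid.
%recur = call <vscale x 4 x i32> @llvm.vector.splice.nxv4i32(<vscale x 4 x i32> %prev, <vscale x 4 x i32> %cur, i32 -1)

; EVL-aware splice: the "last" element of %prev is determined by %prev.evl, so a
; short penultimate iteration still feeds the correct value into the recurrence.
%recur.evl = call <vscale x 4 x i32> @llvm.experimental.vp.splice.nxv4i32(<vscale x 4 x i32> %prev, <vscale x 4 x i32> %cur, i32 -1, <vscale x 4 x i1> splat (i1 true), i32 %prev.evl, i32 %cur.evl)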
Don't you need to generate vp-load/vp-store for unmasked loads/stores too? What if EVL crosses the cache line? That may result in a segfault.
That's interesting. I assume the main reason for having EVL there is the nowrap flags on a GEP. Having a "wrap" flag will impact subsequent optimizations, right?
All loads and stores are still converted to VP intrinsics after this patch, since widening recipes are only used for non-trapping instructions, or when the trapping parts have been selected off, e.g. div.
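Putting it together, the loop body after the EVL transform roughly has this shape (a simplified sketch based on the updated tests in this patch; names and the trip-count computation are illustrative): the memory operations carry the EVL, while the non-trapping arithmetic between them stays as regular IR.

%evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %remaining, i32 1, i1 true)
%v = call <vscale x 1 x i8> @llvm.vp.load.nxv1i8.p0(ptr align 1 %src, <vscale x 1 x i1> splat (i1 true), i32 %evl)
; Non-trapping widened arithmetic: plain IR, no mask or EVL operands needed.
%ext = zext <vscale x 1 x i8> %v to <vscale x 1 x i16>
%mul = mul <vscale x 1 x i16> %ext, %ext
%res = trunc <vscale x 1 x i16> %mul to <vscale x 1 x i8>
call void @llvm.vp.store.nxv1i8.p0(<vscale x 1 x i8> %res, ptr align 1 %dst, <vscale x 1 x i1> splat (i1 true), i32 %evl)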
This patch proposes to avoid converting widening recipes to VP intrinsics during the EVL transform.
IIUC we initially did this to avoid `vl` toggles on RISC-V. However, we now have the RISCVVLOptimizer pass, which mostly makes this redundant.
Emitting regular IR instead of VP intrinsics allows more generic optimisations, both in the middle end and DAGCombiner, and we generally have better patterns in the RISC-V backend for non-VP nodes. Sticking to regular IR instructions is likely a lot less work than reimplementing all of these optimisations for VP intrinsics.
On SPEC CPU 2017 we get noticeably better code generation: in some cases the `vl` is reduced, which allows RISCVVLOptimizer to remove some vsetvlis.
I've removed the VPWidenEVLRecipe in this patch since it's no longer used, but if people would prefer to keep it for other targets (PPC?) then I'd be happy to add it back in and gate it under a target hook.