[VPlan] Impl VPlan-based pattern match for ExtendedRed and MulAccRed #113903
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: Elvis Wang (ElvisWang123)

Changes

This patch implements the VPlan-based pattern match for ExtendedReduction and MulAccReduction. In these reduction patterns, the extend instructions and the mul instruction can be folded into the reduction instruction, so their cost is free. We add matching for reduce(ext(...)) (ExtendedReduction) and reduce.add(mul(ext(...), ext(...))) (MulAccReduction).

Ref: original instruction-based implementation: https://reviews.llvm.org/D93476

This patch is based on #113902.

Patch is 21.33 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113903.diff 5 Files Affected:
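As a concrete illustration (my own example, not taken from the patch), a scalar loop of the kind this folding targets; on targets such as MVE the extends and the multiply below fold into a single mul-accumulate reduction instruction:

#include <cstddef>
#include <cstdint>

// Scalar form of reduce.add(ext(mul(ext(A[i]), ext(B[i])))).
int64_t mulAccReduce(const int8_t *A, const int8_t *B, size_t N) {
  int64_t Sum = 0;
  for (size_t I = 0; I < N; ++I)
    Sum += (int64_t)((int32_t)A[I] * (int32_t)B[I]); // inner exts + mul + outer ext
  return Sum;
}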
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 60a94ca1f86e42..483e039fe133d6 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7303,51 +7303,6 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
Cost += ReductionCost;
continue;
}
-
- const auto &ChainOps = RdxDesc.getReductionOpChain(RedPhi, OrigLoop);
- SetVector<Instruction *> ChainOpsAndOperands(ChainOps.begin(),
- ChainOps.end());
- auto IsZExtOrSExt = [](const unsigned Opcode) -> bool {
- return Opcode == Instruction::ZExt || Opcode == Instruction::SExt;
- };
- // Also include the operands of instructions in the chain, as the cost-model
- // may mark extends as free.
- //
- // For ARM, some of the instruction can folded into the reducion
- // instruction. So we need to mark all folded instructions free.
- // For example: We can fold reduce(mul(ext(A), ext(B))) into one
- // instruction.
- for (auto *ChainOp : ChainOps) {
- for (Value *Op : ChainOp->operands()) {
- if (auto *I = dyn_cast<Instruction>(Op)) {
- ChainOpsAndOperands.insert(I);
- if (I->getOpcode() == Instruction::Mul) {
- auto *Ext0 = dyn_cast<Instruction>(I->getOperand(0));
- auto *Ext1 = dyn_cast<Instruction>(I->getOperand(1));
- if (Ext0 && IsZExtOrSExt(Ext0->getOpcode()) && Ext1 &&
- Ext0->getOpcode() == Ext1->getOpcode()) {
- ChainOpsAndOperands.insert(Ext0);
- ChainOpsAndOperands.insert(Ext1);
- }
- }
- }
- }
- }
-
- // Pre-compute the cost for I, if it has a reduction pattern cost.
- for (Instruction *I : ChainOpsAndOperands) {
- auto ReductionCost = CM.getReductionPatternCost(
- I, VF, ToVectorTy(I->getType(), VF), TTI::TCK_RecipThroughput);
- if (!ReductionCost)
- continue;
-
- assert(!CostCtx.SkipCostComputation.contains(I) &&
- "reduction op visited multiple times");
- CostCtx.SkipCostComputation.insert(I);
- LLVM_DEBUG(dbgs() << "Cost of " << ReductionCost << " for VF " << VF
- << ":\n in-loop reduction " << *I << "\n");
- Cost += *ReductionCost;
- }
}
// Pre-compute the costs for branches except for the backedge, as the number
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 6ab8fb45c351b4..49e93e1e7b5501 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -785,7 +785,7 @@ void VPRegionBlock::execute(VPTransformState *State) {
InstructionCost VPBasicBlock::cost(ElementCount VF, VPCostContext &Ctx) {
InstructionCost Cost = 0;
- for (VPRecipeBase &R : Recipes)
+ for (VPRecipeBase &R : reverse(Recipes))
Cost += R.cost(VF, Ctx);
return Cost;
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 6a192bdf01c4ff..b26fd460a278f5 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -725,6 +725,8 @@ struct VPCostContext {
LLVMContext &LLVMCtx;
LoopVectorizationCostModel &CM;
SmallPtrSet<Instruction *, 8> SkipCostComputation;
+ /// Contains recipes that are folded into other recipes.
+ SmallDenseMap<ElementCount, SmallPtrSet<VPRecipeBase *, 4>, 4> FoldedRecipes;
VPCostContext(const TargetTransformInfo &TTI, const TargetLibraryInfo &TLI,
Type *CanIVTy, LoopVectorizationCostModel &CM)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 0eb4f7c7c88cee..5f59a1e96df9f8 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -299,7 +299,9 @@ InstructionCost VPRecipeBase::cost(ElementCount VF, VPCostContext &Ctx) {
UI = &WidenMem->getIngredient();
InstructionCost RecipeCost;
- if (UI && Ctx.skipCostComputation(UI, VF.isVector())) {
+ if ((UI && Ctx.skipCostComputation(UI, VF.isVector())) ||
+ (Ctx.FoldedRecipes.contains(VF) &&
+ Ctx.FoldedRecipes.at(VF).contains(this))) {
RecipeCost = 0;
} else {
RecipeCost = computeCost(VF, Ctx);
@@ -2188,30 +2190,143 @@ InstructionCost VPReductionRecipe::computeCost(ElementCount VF,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
unsigned Opcode = RdxDesc.getOpcode();
- // TODO: Support any-of and in-loop reductions.
+ // TODO: Support any-of reductions.
assert(
(!RecurrenceDescriptor::isAnyOfRecurrenceKind(RdxKind) ||
ForceTargetInstructionCost.getNumOccurrences() > 0) &&
"Any-of reduction not implemented in VPlan-based cost model currently.");
- assert(
- (!cast<VPReductionPHIRecipe>(getOperand(0))->isInLoop() ||
- ForceTargetInstructionCost.getNumOccurrences() > 0) &&
- "In-loop reduction not implemented in VPlan-based cost model currently.");
assert(ElementTy->getTypeID() == RdxDesc.getRecurrenceType()->getTypeID() &&
"Inferred type and recurrence type mismatch.");
- // Cost = Reduction cost + BinOp cost
- InstructionCost Cost =
+ // BaseCost = Reduction cost + BinOp cost
+ InstructionCost BaseCost =
Ctx.TTI.getArithmeticInstrCost(Opcode, ElementTy, CostKind);
if (RecurrenceDescriptor::isMinMaxRecurrenceKind(RdxKind)) {
Intrinsic::ID Id = getMinMaxReductionIntrinsicOp(RdxKind);
- return Cost + Ctx.TTI.getMinMaxReductionCost(
- Id, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+ BaseCost += Ctx.TTI.getMinMaxReductionCost(
+ Id, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+ } else {
+ BaseCost += Ctx.TTI.getArithmeticReductionCost(
+ Opcode, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
}
- return Cost + Ctx.TTI.getArithmeticReductionCost(
- Opcode, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+ using namespace llvm::VPlanPatternMatch;
+ auto GetMulAccReductionCost =
+ [&](const VPReductionRecipe *Red) -> InstructionCost {
+ VPValue *A, *B;
+ InstructionCost InnerExt0Cost = 0;
+ InstructionCost InnerExt1Cost = 0;
+ InstructionCost ExtCost = 0;
+ InstructionCost MulCost = 0;
+
+ VectorType *SrcVecTy = VectorTy;
+ Type *InnerExt0Ty;
+ Type *InnerExt1Ty;
+ Type *MaxInnerExtTy;
+ bool IsUnsigned = true;
+ bool HasOuterExt = false;
+
+ auto *Ext = dyn_cast_if_present<VPWidenCastRecipe>(
+ Red->getVecOp()->getDefiningRecipe());
+ VPRecipeBase *Mul;
+ // Try to match outer extend reduce.add(ext(...))
+ if (Ext && match(Ext, m_ZExtOrSExt(m_VPValue())) &&
+ cast<VPWidenCastRecipe>(Ext)->getNumUsers() == 1) {
+ IsUnsigned =
+ Ext->getOpcode() == Instruction::CastOps::ZExt ? true : false;
+ ExtCost = Ext->computeCost(VF, Ctx);
+ Mul = Ext->getOperand(0)->getDefiningRecipe();
+ HasOuterExt = true;
+ } else {
+ Mul = Red->getVecOp()->getDefiningRecipe();
+ }
+
+ // Match reduce.add(mul())
+ if (Mul && match(Mul, m_Mul(m_VPValue(A), m_VPValue(B))) &&
+ cast<VPWidenRecipe>(Mul)->getNumUsers() == 1) {
+ MulCost = cast<VPWidenRecipe>(Mul)->computeCost(VF, Ctx);
+ auto *InnerExt0 =
+ dyn_cast_if_present<VPWidenCastRecipe>(A->getDefiningRecipe());
+ auto *InnerExt1 =
+ dyn_cast_if_present<VPWidenCastRecipe>(B->getDefiningRecipe());
+ bool HasInnerExt = false;
+ // Try to match inner extends.
+ if (InnerExt0 && InnerExt1 &&
+ match(InnerExt0, m_ZExtOrSExt(m_VPValue())) &&
+ match(InnerExt1, m_ZExtOrSExt(m_VPValue())) &&
+ InnerExt0->getOpcode() == InnerExt1->getOpcode() &&
+ (InnerExt0->getNumUsers() > 0 &&
+ !InnerExt0->hasMoreThanOneUniqueUser()) &&
+ (InnerExt1->getNumUsers() > 0 &&
+ !InnerExt1->hasMoreThanOneUniqueUser())) {
+ InnerExt0Cost = InnerExt0->computeCost(VF, Ctx);
+ InnerExt1Cost = InnerExt1->computeCost(VF, Ctx);
+ Type *InnerExt0Ty = Ctx.Types.inferScalarType(InnerExt0->getOperand(0));
+ Type *InnerExt1Ty = Ctx.Types.inferScalarType(InnerExt1->getOperand(0));
+ Type *MaxInnerExtTy = InnerExt0Ty->getIntegerBitWidth() >
+ InnerExt1Ty->getIntegerBitWidth()
+ ? InnerExt0Ty
+ : InnerExt1Ty;
+ SrcVecTy = cast<VectorType>(ToVectorTy(MaxInnerExtTy, VF));
+ IsUnsigned = true;
+ HasInnerExt = true;
+ }
+ InstructionCost MulAccRedCost = Ctx.TTI.getMulAccReductionCost(
+ IsUnsigned, ElementTy, SrcVecTy, CostKind);
+ // Check if folding ext/mul into MulAccReduction is profitable.
+ if (MulAccRedCost.isValid() &&
+ MulAccRedCost <
+ ExtCost + MulCost + InnerExt0Cost + InnerExt1Cost + BaseCost) {
+ if (HasInnerExt) {
+ Ctx.FoldedRecipes[VF].insert(InnerExt0);
+ Ctx.FoldedRecipes[VF].insert(InnerExt1);
+ }
+ Ctx.FoldedRecipes[VF].insert(Mul);
+ if (HasOuterExt)
+ Ctx.FoldedRecipes[VF].insert(Ext);
+ return MulAccRedCost;
+ }
+ }
+ return InstructionCost::getInvalid();
+ };
+
+ // Match reduce(ext(...))
+ auto GetExtendedReductionCost =
+ [&](const VPReductionRecipe *Red) -> InstructionCost {
+ VPValue *VecOp = Red->getVecOp();
+ VPValue *A;
+ if (match(VecOp, m_ZExtOrSExt(m_VPValue(A))) && VecOp->getNumUsers() == 1) {
+ VPWidenCastRecipe *Ext =
+ cast<VPWidenCastRecipe>(VecOp->getDefiningRecipe());
+ bool IsUnsigned = Ext->getOpcode() == Instruction::CastOps::ZExt;
+ InstructionCost ExtCost = Ext->computeCost(VF, Ctx);
+ auto *ExtVecTy =
+ cast<VectorType>(ToVectorTy(Ctx.Types.inferScalarType(A), VF));
+ InstructionCost ExtendedRedCost = Ctx.TTI.getExtendedReductionCost(
+ Opcode, IsUnsigned, ElementTy, ExtVecTy, RdxDesc.getFastMathFlags(),
+ CostKind);
+ // Check if folding ext into ExtendedReduction is profitable.
+ if (ExtendedRedCost.isValid() && ExtendedRedCost < ExtCost + BaseCost) {
+ Ctx.FoldedRecipes[VF].insert(Ext);
+ return ExtendedRedCost;
+ }
+ }
+ return InstructionCost::getInvalid();
+ };
+
+ // Match MulAccReduction patterns.
+ InstructionCost MulAccCost = GetMulAccReductionCost(this);
+ if (MulAccCost.isValid())
+ return MulAccCost;
+
+ // Match ExtendedReduction patterns.
+ InstructionCost ExtendedCost = GetExtendedReductionCost(this);
+ if (ExtendedCost.isValid())
+ return ExtendedCost;
+
+ // Default cost.
+ return BaseCost;
}
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
index fa346b4eac02d4..f2e36399c85f5d 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
@@ -6,26 +6,26 @@ define void @i8_factor_2(ptr %data, i64 %n) {
entry:
br label %for.body
; CHECK-LABEL: Checking a loop in 'i8_factor_2'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
for.body:
%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
%p0 = getelementptr inbounds %i8.2, ptr %data, i64 %i, i32 0
@@ -49,16 +49,16 @@ define void @i8_factor_3(ptr %data, i64 %n) {
entry:
br label %for.body
; CHECK-LABEL: Checking a loop in 'i8_factor_3'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
for.body:
%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
%p0 = getelementptr inbounds %i8.3, ptr %data, i64 %i, i32 0
@@ -86,16 +86,16 @@ define void @i8_factor_4(ptr %data, i64 %n) {
entry:
br label %for.body
; CHECK-LABEL: Checking a loop in 'i8_factor_4'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
for.body:
%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
%p0 = getelementptr inbounds %i8.4, ptr %data, i64 %i, i32 0
@@ -127,14 +127,14 @@ define void @i8_factor_5(ptr %data, i64 %n) {
entry:
br label %for.body
; CHECK-LABEL: Checking a loop in 'i8_factor_5'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
for.body:
%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
%p0 = getelementptr inbounds %i8.5, ptr %data, i64 %i, i32 0
@@ -170,14 +170,14 @@ define void @i8_factor_6(ptr %data, i64 %n) {
entry:
br label %for.body
; CHECK-LABEL: Checking a loop in 'i8_factor_6'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
for.body:
%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
%p0 = getelementptr inbounds %i8.6, ptr %data, i64 %i, i32 0
@@ -217,14 +217,14 @@ define void @i8_factor_7(ptr %data, i64 %n) {
entry:
br label %for.body
; CHECK-LABEL: Checking a loop in 'i8_factor_7'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at %...
[truncated]
The way I had imagined this working in a vplan-based cost model would be that the vplan nodes more closely mirror what the back-end would produce. So there would be a vplan recipe for the extended-reduction, which would be created if it was profitable but would otherwise be relatively easy to cost-model. I'm not sure if that is still the current plan or not.
Yes, I think ideally we would model the add-extend/mul-extend operations explicitly in VPlan, especially if matching them in VPlan would require changes to the order in which the cost is computed. Would add-extend/mul-extend be sufficient or would other recipes be needed as well?
I think currently we need a mul-extend-reduction recipe and an extend-reduction recipe to model these reduction patterns in VPlan. Yes, generating new recipes is good, but using new recipes to model these patterns would duplicate a lot of code for codegen.
@davemgreen do you think those would be enough?
This might be a case where gradual lowering would help. We could have a more abstract recipe early on which combines mul-extend in a single recipe, facilitating simple cost computation. Before codegen, we can replace the recipe with wide recipes for the adds and extends, so there is no need to duplicate codegen for those, similar to how things are sketched for scalar phis in #114305.
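To make the gradual-lowering idea concrete, here is a minimal standalone sketch (generic names of my own, not the patch's classes): the abstract node carries one fused cost for the cost model, and is only expanded into the primitive nodes right before code generation, so the primitive codegen paths are reused:

#include <memory>
#include <vector>

struct Node { virtual ~Node() = default; virtual unsigned cost() const = 0; };
struct Ext : Node { unsigned cost() const override { return 1; } };
struct Mul : Node { unsigned cost() const override { return 1; } };
struct Reduce : Node { unsigned cost() const override { return 2; } };

// Abstract mul-accumulate node: costed as the single fused instruction
// the target would execute.
struct MulAccumulate : Node {
  unsigned cost() const override { return 2; }
};

// Lowering, run just before codegen: replace the abstract node with the
// primitive nodes it stands for, reusing their existing codegen.
std::vector<std::unique_ptr<Node>> lower(const MulAccumulate &) {
  std::vector<std::unique_ptr<Node>> Out;
  Out.push_back(std::make_unique<Ext>());
  Out.push_back(std::make_unique<Ext>());
  Out.push_back(std::make_unique<Mul>());
  Out.push_back(std::make_unique<Reduce>());
  return Out;
}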
Hello. I believe the basic patterns are the ones listed in the summary, which are an extended-add-reduction or an extended mla reduction (sketched in scalar form below):
In the case of MVE, both the exts will be the same. The add can be done by setting one of the operands to 1. #92418 is similar, but produces a vector instead of a single output value (a dot product). Dot product has udot, sdot and usdot forms, which do unsigned, signed, and mixed-sign extension of the operands. There are also other patterns that come up too. The first I believe should be equivalent to
AArch64 has a stand-alone umull instruction (both for scalar and for vector, although the type sizes differ) that performs a widening multiply of the extended operands.
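In scalar C++ form, the two basic patterns read roughly as follows (my own restatement, assuming 8-bit inputs accumulated into 32 bits; not code from the thread):

#include <cstddef>
#include <cstdint>

// Extended-add-reduction: reduce.add(ext(A[i])).
int32_t extAddReduce(const int8_t *A, size_t N) {
  int32_t Sum = 0;
  for (size_t I = 0; I < N; ++I)
    Sum += (int32_t)A[I]; // the sext folds into the reduction
  return Sum;
}

// Extended mla reduction: reduce.add(mul(ext(A[i]), ext(B[i]))).
int32_t extMlaReduce(const int8_t *A, const int8_t *B, size_t N) {
  int32_t Sum = 0;
  for (size_t I = 0; I < N; ++I)
    Sum += (int32_t)A[I] * (int32_t)B[I]; // both exts and the mul fold
  return Sum;
}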
BTW, I believe this patch is currently changing the scores computed for
Thanks for your advice; I'm working in this direction.
Thanks, I will only model these three patterns for reduction.
Thanks for catching that. I misunderstood how many patterns can be folded into MVE reduction-like instructions in the original patch.
I think we already model the instruction cost for those. In summary, I think we only need two new recipes for reductions: VPExtendedReductionRecipe and VPMulAccumulateReductionRecipe. If there is any question, please let me know.
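For orientation, a rough standalone skeleton of the two recipes being discussed (my own sketch; the member names mirror the accessors quoted later in this review, while everything else, including the stub types, is illustrative):

// Stub types so the skeleton compiles on its own.
struct VPValue {};
namespace Instruction { enum CastOps { ZExt, SExt, CastOpsEnd }; }
using Type = void;

// Models reduce(ext(VecOp)); the extend is folded into the reduction.
class VPExtendedReductionRecipe {
  VPValue *VecOp = nullptr;
  Instruction::CastOps ExtOp = Instruction::SExt;
  Type *ResultTy = nullptr; // type after the extend
public:
  VPValue *getVecOp() const { return VecOp; }
  Instruction::CastOps getExtOpcode() const { return ExtOp; }
  Type *getResultType() const { return ResultTy; }
};

// Models reduce.add(mul((ext)VecOp0, (ext)VecOp1)).
class VPMulAccumulateReductionRecipe {
  VPValue *VecOp0 = nullptr, *VecOp1 = nullptr;
  Instruction::CastOps ExtOp = Instruction::CastOpsEnd; // CastOpsEnd = no extend
public:
  VPValue *getVecOp0() const { return VecOp0; }
  VPValue *getVecOp1() const { return VecOp1; }
  bool isExtended() const { return ExtOp != Instruction::CastOpsEnd; }
};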
I believe that
IIRC CCH (CastContextHint) was added for getting the extension costs (more) correct under Arm/MVE, so it hopefully does OK in the current scheme. We might need to be a little careful about which one of
Force-pushed from 442d1dd to bcccb13.
Updated to the recipe-based implementation. Note that the EVL version is not implemented yet, so the test cases in RISCV changed.
Hello. Let me upload some tests of examples that produce different results with this now. I think it might be that
This might show it picking a different VF now: https://godbolt.org/z/zda3dvcrx
Thanks for the information. I will fix the MulAcc pattern match for this pattern.
Thanks - there might be some more, and that might have just been one of the issues. I will keep trying to test the others and see if anything else comes up.
This is another one that is behaving differently, I think due to subtleties about when one-use checks are beneficial: https://godbolt.org/z/fYW9Y5TqG. There is a third, more awkward case with interleave groups that I have not looked into much yet. I will try and make those into test cases to ensure we have test coverage for them. I will have to check again later whether anything else remains that behaves differently and hits the assert.
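As a concrete illustration of the one-use subtlety (my own example, not one of the godbolt cases above): when the extend has a second user, folding it into the reduction does not remove it, so costing it as free would under-count:

#include <cstddef>
#include <cstdint>

int32_t sumAndStore(const int8_t *A, int32_t *Out, size_t N) {
  int32_t Sum = 0;
  for (size_t I = 0; I < N; ++I) {
    int32_t E = (int32_t)A[I]; // one sext with two users
    Out[I] = E;                // second user keeps the extend alive
    Sum += E;                  // reduction user
  }
  return Sum;
}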
Some tests in ab9178e.
Thanks for the update, some initial suggestions inline.
Would be good if we could avoid any references to IR operands, and possibly unify both recipes (and avoid duplication with the existing reduction recipe).
Force-pushed from bcccb13 to 967b370.
Thanks for adding new tests. Updated the condition for creating a MulAccRecipe. To support
Force-pushed from 967b370 to 234e81e.
@@ -811,6 +823,8 @@ class VPRecipeWithIRFlags : public VPSingleDefRecipe {

  FastMathFlags getFastMathFlags() const;

  bool isNonNeg() const { return NonNegFlags.NonNeg; }
Can we assert that OpType == OperationType::NonNegOp here, like for other accessors? otherwise we would access a non-active union element.
Added, thanks!
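Presumably the accessor now follows the existing flag-accessor pattern in VPlan.h, roughly like this (a sketch, not copied from the patch):

bool isNonNeg() const {
  assert(OpType == OperationType::NonNegOp &&
         "recipe doesn't have a NNEG flag");
  return NonNegFlags.NonNeg;
}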
@@ -2341,6 +2373,28 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
    setUnderlyingValue(I);
  }

  /// For VPExtendedReductionRecipe.
  /// Note that IsNonNeg flag and the debug location are from the extend.
Suggested change:
- /// Note that IsNonNeg flag and the debug location are from the extend.
+ /// Note that the IsNonNeg flag and the debug location are from the extend.
Done.
  }

  /// For VPMulAccumulateReductionRecipe.
  /// Note that the NUW/NSW and DL are from the Mul.
Suggested change:
- /// Note that the NUW/NSW and DL are from the Mul.
+ /// Note that the NUW/NSW flags and the debug location are from the Mul.
(for consistency with comment above)
Updated, thanks!
/// A recipe to represent inloop extended reduction operations, performing a
/// reduction on a extended vector operand into a scalar value, and adding the
/// result to a chain. This recipe is abstract and needs to be lowered to
/// concrete recipes before codegen. The Operands are {ChainOp, VecOp,
Suggested change:
- /// concrete recipes before codegen. The Operands are {ChainOp, VecOp,
+ /// concrete recipes before codegen. The operands are {ChainOp, VecOp,
Done!
This is not done yet?
Oops. I missed that. Updated, thanks!
  VPValue *getVecOp0() const { return getOperand(1); }
  VPValue *getVecOp1() const { return getOperand(2); }

  /// Return if this MulAcc recipe contains extend instructions.
More accurate? Would be good to strip the reference to instructions.
Suggested change:
- /// Return if this MulAcc recipe contains extend instructions.
+ /// Return if this MulAcc recipe contains extended operands.
Updated, thanks!
  /// Return the opcode of the underlying extend.
  Instruction::CastOps getExtOpcode() const { return ExtOp; }

  /// Return if the extend opcode is ZExt.
Suggested change:
- /// Return if the extend opcode is ZExt.
+ /// Return if the operands are zero extended.
Done.
  /// Return if this MulAcc recipe contains extend instructions.
  bool isExtended() const { return ExtOp != Instruction::CastOps::CastOpsEnd; }

  /// Return the opcode of the underlying extend.
Suggested change:
- /// Return the opcode of the underlying extend.
+ /// Return the opcode of the extends for the operands.
Done!
.Case<VPExtendedReductionRecipe>(
    [](const VPExtendedReductionRecipe *R) {
      return R->getResultType();
    })
.Case<VPMulAccumulateReductionRecipe>(
    [](const VPMulAccumulateReductionRecipe *R) {
      return R->getResultType();
    })
I think something like this should work:
Suggested change:
- .Case<VPExtendedReductionRecipe>(
-     [](const VPExtendedReductionRecipe *R) {
-       return R->getResultType();
-     })
- .Case<VPMulAccumulateReductionRecipe>(
-     [](const VPMulAccumulateReductionRecipe *R) {
-       return R->getResultType();
-     })
+ .Case<VPExtendedReductionRecipe, VPMulAccumulateReductionRecipe>(
+     [](const auto *R) { return R->getResultType(); })
The only common super class for them is VPReductionRecipe, so the lambda parameter would need to be cast to VPExtendedReductionRecipe or VPMulAccumulateReductionRecipe. In which case it's probably cleaner to keep the structure as it is.
The lambda should have used const auto *R (updated the suggestion), which is something that should be supported by the type switch; I think we use that feature in a number of places.
Updated, thanks!
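For reference, a small self-contained illustration of the TypeSwitch feature mentioned here, using generic types of my own rather than the VPlan recipes (assumes the LLVM headers are available):

#include "llvm/ADT/TypeSwitch.h"

struct Base {
  enum Kind { KA, KB } K;
  Base(Kind K) : K(K) {}
};
struct A : Base {
  A() : Base(KA) {}
  static bool classof(const Base *V) { return V->K == KA; }
  int result() const { return 1; }
};
struct B : Base {
  B() : Base(KB) {}
  static bool classof(const Base *V) { return V->K == KB; }
  int result() const { return 2; }
};

// Several case types can share one lambda via `const auto *`; the lambda
// is instantiated separately for each listed type with the concrete pointer.
int dispatch(const Base *V) {
  return llvm::TypeSwitch<const Base *, int>(V)
      .Case<A, B>([](const auto *R) { return R->result(); })
      .Default([](const Base *) { return 0; });
}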
bool contains(const ElementCount &VF) const {
  return VF.getKnownMinValue() >= Start.getKnownMinValue() &&
         VF.getKnownMinValue() < End.getKnownMinValue();
}
Still needed in the latest version?
Removed, thanks!
    return Copy;
  }

  VP_CLASSOF_IMPL(VPDef::VPMulAccumulateReductionSC);
Since building on top of this, I've noticed that this class needs a case in VPRecipeBase::mayHaveSideEffects, probably above the VPReductionSC case.
Added. Thanks!
Only a VPWidenCastRecipe with ZExt contains the nneg flag. Use transferFlags() to avoid taking the nneg flag from operands that do not have it.
A few more minor comments.
/// A recipe to represent inloop extended reduction operations, performing a
/// reduction on a extended vector operand into a scalar value, and adding the
/// result to a chain. This recipe is abstract and needs to be lowered to
/// concrete recipes before codegen. The Operands are {ChainOp, VecOp,
This is not done yet?
Thanks for addressing the comments @ElvisWang123!
@@ -155,6 +155,8 @@ bool VPRecipeBase::mayHaveSideEffects() const {
  case VPBlendSC:
  case VPReductionEVLSC:
  case VPReductionSC:
  case VPExtendedReductionSC:
Should also be added to mayReadFromMemory and mayWriteToMemory for completeness?
Added, thanks!
Thanks for all the updates. Basically looks good, but I think it would be good to split up the patch into adding the new recipes + transform + lowering but without the cost changes. And then land the cost changes separately, if possible?
That way there will be less churn in case we need to revert due to cost failures.
This patch adds a test for the fmuladd reduction to show the test change/failure for the cost-model change. Note that without the fp128 load and trunc, there is no failure. Pre-commit test for #113903.
…to abstract recipe. This patch implements the transformation that matches the following patterns in the VPlan and converts them to abstract recipes for better cost estimation.
* VPExtendedReductionRecipe - cast + reduction.
* VPMulAccumulateReductionRecipe - (cast) + mul + reduction.
The converted abstract recipes will be lowered to the concrete recipes (widen-cast + widen-mul + reduction) just before vector codegen. This should be a cost-model-based decision, which will be implemented in a following patch. For now, we still rely on the legacy cost model to calculate the right cost. Split from llvm#113903.
Split the new recipe implementations to #137745. Will update this patch after the above patches land.
… to abstract recipe. This patch introduces two new recipes:
* VPExtendedReductionRecipe - cast + reduction.
* VPMulAccumulateReductionRecipe - (cast) + mul + reduction.
This patch also implements the transformation that matches the following patterns via VPlan and converts them to abstract recipes for better cost estimation:
* VPExtendedReductionRecipe - reduce(cast(...))
* VPMulAccumulateReductionRecipe - reduce.add(mul(...)), reduce.add(mul(ext(...), ext(...))), reduce.add(ext(mul(ext(...), ext(...))))
The converted abstract recipes will be lowered to the concrete recipes (widen-cast + widen-mul + reduction) just before recipe execution. Split from llvm#113903.
…xit`. (llvm#135294) This patch checks if the plan contains a scalar VF via the VFRange instead of the Plan. This patch also clamps the range to contain either only scalar or only vector VFs, to prevent miscompiles. Split from llvm#113903.
This patch implements the VPlan-based pattern match for extended-reduction and MulAccReduction. In these reduction patterns, the extend instructions and the mul instruction can be folded into the reduction instruction, and their cost is free.

This patch creates two new abstract recipes, which will be lowered to the corresponding VPWidenCastRecipe, VPWidenRecipe and VPReductionRecipe before execution:

VPExtendedReductionRecipe:
  VPWidenCastRecipe + VPReductionRecipe

VPMulAccumulateReductionRecipe:
  (VPWidenCastRecipe +) VPWidenRecipe + VPReductionRecipe

Ref: original instruction-based implementation: https://reviews.llvm.org/D93476
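For clarity, the profitability check the patch applies before folding, restated standalone (it mirrors the comparison in VPReductionRecipe::computeCost in the diff above; plain integers stand in for InstructionCost):

#include <cstdint>

// Fold ext/mul into a MulAccReduction only when the fused cost beats the
// sum of the separately-costed parts.
bool shouldFoldMulAcc(uint64_t MulAccRedCost, uint64_t ExtCost,
                      uint64_t MulCost, uint64_t InnerExt0Cost,
                      uint64_t InnerExt1Cost, uint64_t BaseCost) {
  return MulAccRedCost <
         ExtCost + MulCost + InnerExt0Cost + InnerExt1Cost + BaseCost;
}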