[LoopVectorizer] Bundle partial reductions inside VPMulAccumulateReductionRecipe #136173

Open
wants to merge 10 commits into
base: users/SamTebbs33/elvis-vp-arm-mve-transform
from
2 changes: 2 additions & 0 deletions llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -219,6 +219,8 @@ class TargetTransformInfo {
   /// Get the kind of extension that an instruction represents.
   static PartialReductionExtendKind
   getPartialReductionExtendKind(Instruction *I);
+  static PartialReductionExtendKind
+  getPartialReductionExtendKind(Instruction::CastOps ExtOpcode);

   /// Construct a TTI object using a type implementing the \c Concept
   /// API below.
19 changes: 15 additions & 4 deletions llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -986,13 +986,24 @@ InstructionCost TargetTransformInfo::getShuffleCost(

 TargetTransformInfo::PartialReductionExtendKind
 TargetTransformInfo::getPartialReductionExtendKind(Instruction *I) {
-  if (isa<SExtInst>(I))
-    return PR_SignExtend;
-  if (isa<ZExtInst>(I))
-    return PR_ZeroExtend;
+  if (auto *Cast = dyn_cast<CastInst>(I))
+    return getPartialReductionExtendKind(Cast->getOpcode());
   return PR_None;
 }
+
+TargetTransformInfo::PartialReductionExtendKind
+TargetTransformInfo::getPartialReductionExtendKind(
+    Instruction::CastOps ExtOpcode) {
+  switch (ExtOpcode) {
+  case Instruction::CastOps::ZExt:
+    return PR_ZeroExtend;
+  case Instruction::CastOps::SExt:
+    return PR_SignExtend;
+  default:
+    llvm_unreachable("Unexpected cast opcode");
+  }
+}

 TTI::CastContextHint
 TargetTransformInfo::getCastContextHint(const Instruction *I) {
   if (!I)
8 changes: 3 additions & 5 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8879,17 +8879,15 @@ VPRecipeBuilder::tryToCreatePartialReduction(Instruction *Reduction,
     ReductionOpcode = Instruction::Add;
   }

+  VPValue *Cond = nullptr;
   if (CM.blockNeedsPredicationForAnyReason(Reduction->getParent())) {
     assert((ReductionOpcode == Instruction::Add ||
             ReductionOpcode == Instruction::Sub) &&
            "Expected an ADD or SUB operation for predicated partial "
            "reductions (because the neutral element in the mask is zero)!");
-    VPValue *Mask = getBlockInMask(Reduction->getParent());
-    VPValue *Zero =
-        Plan.getOrAddLiveIn(ConstantInt::get(Reduction->getType(), 0));
-    BinOp = Builder.createSelect(Mask, BinOp, Zero, Reduction->getDebugLoc());
+    Cond = getBlockInMask(Reduction->getParent());
   }
-  return new VPPartialReductionRecipe(ReductionOpcode, BinOp, Accumulator,
+  return new VPPartialReductionRecipe(ReductionOpcode, Accumulator, BinOp, Cond,
                                       Reduction);
 }

117 changes: 65 additions & 52 deletions llvm/lib/Transforms/Vectorize/VPlan.h
Contributor

I tried downloading this patch and applying it to the HEAD of LLVM, and patch reported that the diff had already been applied. Does the PR need rebasing?

Contributor

Ah perhaps this is my mistake. You did say it depends upon #113903. :)

Collaborator Author

Yeah, that's the case :). Let me know if you have any issues applying it on top of #113903.

@@ -2056,55 +2056,6 @@ class VPReductionPHIRecipe : public VPHeaderPHIRecipe,
   }
 };

-/// A recipe for forming partial reductions. In the loop, an accumulator and
Collaborator

Would it be possible to land the change from VPPartialReductionRecipe : public VPSingleDefRecipe to VPPartialReductionRecipe : public VPReductionRecipe as a separate NFC change? (For the cases around VPMulAccumulateReductionRecipe, you can initially add asserts that the recipe isn't a partial reduction, because that won't be supported until this PR lands.)

Collaborator Author

I don't think I could make it an NFC change, since to conform to VPReductionRecipe, the accumulator and binop have to be swapped around.

Collaborator

(We discussed this offline.) Swapping the operands in the recipe's debug-print function is not something that really concerns me, and I think there's still value in making this a functionally (from an end-user perspective) NFC change.

Collaborator Author

I've pre-committed the NFC change, but rebasing Elvis's changes on top of it has been pretty challenging given the number of commits on that branch. So I will cherry-pick the NFC change onto this branch, and it'll just go away once Elvis's PR lands and I rebase this PR on top of main.
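
To make the operand-order point in this thread concrete, here is a minimal standalone sketch (not LLVM code). It assumes, as the getChainOp() and getBinOp() accessors in this diff suggest, that the reduction base class reads the accumulator from operand 0 and the value being reduced from operand 1. The names Value, ReductionLike, and PartialReductionLike are hypothetical stand-ins.

```cpp
#include <vector>

// Hypothetical stand-in for a VPValue; only identity matters here.
struct Value {
  int Id;
};

// Hypothetical base class: its accessors fix the meaning of each operand slot.
class ReductionLike {
  std::vector<Value *> Operands;

public:
  ReductionLike(Value *Chain, Value *Vec) : Operands{Chain, Vec} {}
  Value *getOperand(unsigned I) const { return Operands[I]; }
  Value *getChainOp() const { return getOperand(0); } // accumulator
  Value *getVecOp() const { return getOperand(1); }   // value being reduced
};

// A subclass that wants to reuse the inherited accessors has to pass the
// accumulator first and the binary op second.
class PartialReductionLike : public ReductionLike {
public:
  PartialReductionLike(Value *Accumulator, Value *BinOp)
      : ReductionLike(Accumulator, BinOp) {}
  Value *getBinOp() const { return getOperand(1); }
};
```

If the subclass kept the old (bin op, accumulator) order, the inherited getChainOp() would silently return the bin op, which is presumably why the swap cannot be avoided once the recipe derives from the reduction base class.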

-/// vector operand are added together and passed to the next iteration as the
-/// next accumulator. After the loop body, the accumulator is reduced to a
-/// scalar value.
-class VPPartialReductionRecipe : public VPSingleDefRecipe {
-  unsigned Opcode;
-
-public:
-  VPPartialReductionRecipe(Instruction *ReductionInst, VPValue *Op0,
-                           VPValue *Op1)
-      : VPPartialReductionRecipe(ReductionInst->getOpcode(), Op0, Op1,
-                                 ReductionInst) {}
-  VPPartialReductionRecipe(unsigned Opcode, VPValue *Op0, VPValue *Op1,
-                           Instruction *ReductionInst = nullptr)
-      : VPSingleDefRecipe(VPDef::VPPartialReductionSC,
-                          ArrayRef<VPValue *>({Op0, Op1}), ReductionInst),
-        Opcode(Opcode) {
-    [[maybe_unused]] auto *AccumulatorRecipe =
-        getOperand(1)->getDefiningRecipe();
-    assert((isa<VPReductionPHIRecipe>(AccumulatorRecipe) ||
-            isa<VPPartialReductionRecipe>(AccumulatorRecipe)) &&
-           "Unexpected operand order for partial reduction recipe");
-  }
-  ~VPPartialReductionRecipe() override = default;
-
-  VPPartialReductionRecipe *clone() override {
-    return new VPPartialReductionRecipe(Opcode, getOperand(0), getOperand(1),
-                                        getUnderlyingInstr());
-  }
-
-  VP_CLASSOF_IMPL(VPDef::VPPartialReductionSC)
-
-  /// Generate the reduction in the loop.
-  void execute(VPTransformState &State) override;
-
-  /// Return the cost of this VPPartialReductionRecipe.
-  InstructionCost computeCost(ElementCount VF,
-                              VPCostContext &Ctx) const override;
-
-  /// Get the binary op's opcode.
-  unsigned getOpcode() const { return Opcode; }
-
-#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
-  /// Print the recipe.
-  void print(raw_ostream &O, const Twine &Indent,
-             VPSlotTracker &SlotTracker) const override;
-#endif
-};
-
 /// A recipe for vectorizing a phi-node as a sequence of mask-based select
 /// instructions.
 class VPBlendRecipe : public VPSingleDefRecipe {
@@ -2339,7 +2290,8 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
     return R->getVPDefID() == VPRecipeBase::VPReductionSC ||
            R->getVPDefID() == VPRecipeBase::VPReductionEVLSC ||
            R->getVPDefID() == VPRecipeBase::VPExtendedReductionSC ||
-           R->getVPDefID() == VPRecipeBase::VPMulAccumulateReductionSC;
+           R->getVPDefID() == VPRecipeBase::VPMulAccumulateReductionSC ||
+           R->getVPDefID() == VPRecipeBase::VPPartialReductionSC;
   }

   static inline bool classof(const VPUser *U) {
@@ -2376,6 +2328,59 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
   }
 };

+/// A recipe for forming partial reductions. In the loop, an accumulator and
+/// vector operand are added together and passed to the next iteration as the
+/// next accumulator. After the loop body, the accumulator is reduced to a
+/// scalar value.
+class VPPartialReductionRecipe : public VPReductionRecipe {
Member

Should the classof for VPReductionRecipe now include VPPartialReductionRecipe?

Collaborator Author

Done.
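
For readers less familiar with LLVM-style RTTI, the following is a minimal standalone sketch of why the base class's classof has to list the new ID. It is not the VPlan code: RecipeID, Recipe, ReductionRecipe, PartialReductionRecipe, and isaRecipe are hypothetical names, and the real isa<> machinery is more involved.

```cpp
// Hypothetical, simplified stand-ins for recipe IDs (not LLVM's real enum).
enum class RecipeID { Reduction, PartialReduction, Blend };

struct Recipe {
  RecipeID ID;
  explicit Recipe(RecipeID ID) : ID(ID) {}
  RecipeID getID() const { return ID; }
};

struct ReductionRecipe : Recipe {
  explicit ReductionRecipe(RecipeID ID = RecipeID::Reduction) : Recipe(ID) {}
  // The base class must accept every ID used by its subclasses; otherwise an
  // isa-style query on a PartialReductionRecipe would report false.
  static bool classof(const Recipe *R) {
    return R->getID() == RecipeID::Reduction ||
           R->getID() == RecipeID::PartialReduction;
  }
};

struct PartialReductionRecipe : ReductionRecipe {
  PartialReductionRecipe() : ReductionRecipe(RecipeID::PartialReduction) {}
  static bool classof(const Recipe *R) {
    return R->getID() == RecipeID::PartialReduction;
  }
};

// Minimal isa<>-like helper for this sketch only.
template <typename To> bool isaRecipe(const Recipe *R) {
  return To::classof(R);
}
```

With the PartialReduction ID left out of ReductionRecipe::classof, a dyn_cast-style query for the base type would fail on a partial reduction even though the C++ inheritance relationship holds.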

+  unsigned Opcode;
+
+public:
+  VPPartialReductionRecipe(Instruction *ReductionInst, VPValue *Op0,
+                           VPValue *Op1, VPValue *Cond)
+      : VPPartialReductionRecipe(ReductionInst->getOpcode(), Op0, Op1, Cond,
+                                 ReductionInst) {}
+  VPPartialReductionRecipe(unsigned Opcode, VPValue *Op0, VPValue *Op1,
+                           VPValue *Cond, Instruction *ReductionInst = nullptr)
+      : VPReductionRecipe(VPDef::VPPartialReductionSC, RecurKind::Add,
+                          FastMathFlags(), ReductionInst,
+                          ArrayRef<VPValue *>({Op0, Op1}), Cond, false, {}),
+        Opcode(Opcode) {
+    [[maybe_unused]] auto *AccumulatorRecipe =
+        getChainOp()->getDefiningRecipe();
+    assert((isa<VPReductionPHIRecipe>(AccumulatorRecipe) ||
+            isa<VPPartialReductionRecipe>(AccumulatorRecipe)) &&
+           "Unexpected operand order for partial reduction recipe");
+  }
+  ~VPPartialReductionRecipe() override = default;
+
+  VPPartialReductionRecipe *clone() override {
+    return new VPPartialReductionRecipe(Opcode, getOperand(0), getOperand(1),
+                                        getCondOp(), getUnderlyingInstr());
+  }
+
+  VP_CLASSOF_IMPL(VPDef::VPPartialReductionSC)
+
+  /// Generate the reduction in the loop.
+  void execute(VPTransformState &State) override;
+
+  /// Return the cost of this VPPartialReductionRecipe.
+  InstructionCost computeCost(ElementCount VF,
+                              VPCostContext &Ctx) const override;
+
+  /// Get the binary op's opcode.
+  unsigned getOpcode() const { return Opcode; }
+
+  /// Get the binary op this reduction is applied to.
+  VPValue *getBinOp() const { return getOperand(1); }
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  /// Print the recipe.
+  void print(raw_ostream &O, const Twine &Indent,
+             VPSlotTracker &SlotTracker) const override;
+#endif
+};
+
 /// A recipe to represent inloop reduction operations with vector-predication
 /// intrinsics, performing a reduction on a vector operand with the explicit
 /// vector length (EVL) into a scalar value, and adding the result to a chain.
@@ -2496,6 +2501,9 @@ class VPMulAccumulateReductionRecipe : public VPReductionRecipe {

   Type *ResultTy;

+  /// If the reduction this is based on is a partial reduction.
+  bool IsPartialReduction = false;
+
   /// For cloning VPMulAccumulateReductionRecipe.
   VPMulAccumulateReductionRecipe(VPMulAccumulateReductionRecipe *MulAcc)
       : VPReductionRecipe(
@@ -2505,7 +2513,8 @@ class VPMulAccumulateReductionRecipe : public VPReductionRecipe {
             WrapFlagsTy(MulAcc->hasNoUnsignedWrap(), MulAcc->hasNoSignedWrap()),
             MulAcc->getDebugLoc()),
         ExtOp(MulAcc->getExtOpcode()), IsNonNeg(MulAcc->isNonNeg()),
-        ResultTy(MulAcc->getResultType()) {}
+        ResultTy(MulAcc->getResultType()),
+        IsPartialReduction(MulAcc->isPartialReduction()) {}

 public:
   VPMulAccumulateReductionRecipe(VPReductionRecipe *R, VPWidenRecipe *Mul,
@@ -2518,7 +2527,8 @@ class VPMulAccumulateReductionRecipe : public VPReductionRecipe {
             WrapFlagsTy(Mul->hasNoUnsignedWrap(), Mul->hasNoSignedWrap()),
             R->getDebugLoc()),
         ExtOp(Ext0->getOpcode()), IsNonNeg(Ext0->isNonNeg()),
-        ResultTy(ResultTy) {
+        ResultTy(ResultTy),
+        IsPartialReduction(isa<VPPartialReductionRecipe>(R)) {
     assert(RecurrenceDescriptor::getOpcode(getRecurrenceKind()) ==
                Instruction::Add &&
            "The reduction instruction in MulAccumulateteReductionRecipe must "
@@ -2589,6 +2599,9 @@ class VPMulAccumulateReductionRecipe : public VPReductionRecipe {

   /// Return the non negative flag of the ext recipe.
   bool isNonNeg() const { return IsNonNeg; }
+
+  /// Return if the underlying reduction recipe is a partial reduction.
+  bool isPartialReduction() const { return IsPartialReduction; }
 };

 /// VPReplicateRecipe replicates a given instruction producing multiple scalar
28 changes: 20 additions & 8 deletions llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -159,6 +159,7 @@ bool VPRecipeBase::mayHaveSideEffects() const {
   case VPWidenIntrinsicSC:
     return cast<VPWidenIntrinsicRecipe>(this)->mayHaveSideEffects();
   case VPBlendSC:
+  case VPPartialReductionSC:
   case VPReductionEVLSC:
   case VPReductionSC:
   case VPScalarIVStepsSC:
@@ -287,14 +288,9 @@ InstructionCost
 VPPartialReductionRecipe::computeCost(ElementCount VF,
                                       VPCostContext &Ctx) const {
   std::optional<unsigned> Opcode = std::nullopt;
-  VPValue *BinOp = getOperand(0);
+  VPValue *BinOp = getBinOp();

-  // If the partial reduction is predicated, a select will be operand 0 rather
-  // than the binary op
   using namespace llvm::VPlanPatternMatch;
-  if (match(getOperand(0), m_Select(m_VPValue(), m_VPValue(), m_VPValue())))
-    BinOp = BinOp->getDefiningRecipe()->getOperand(1);
-
   // If BinOp is a negation, use the side effect of match to assign the actual
   // binary operation to BinOp
   match(BinOp, m_Binary<Instruction::Sub>(m_SpecificInt(0), m_VPValue(BinOp)));
@@ -338,12 +334,18 @@ void VPPartialReductionRecipe::execute(VPTransformState &State) {
   assert(getOpcode() == Instruction::Add &&
          "Unhandled partial reduction opcode");

-  Value *BinOpVal = State.get(getOperand(0));
-  Value *PhiVal = State.get(getOperand(1));
+  Value *BinOpVal = State.get(getBinOp());
+  Value *PhiVal = State.get(getChainOp());
   assert(PhiVal && BinOpVal && "Phi and Mul must be set");

   Type *RetTy = PhiVal->getType();

+  /// Mask the bin op output.
+  if (VPValue *Cond = getCondOp()) {
+    Value *Zero = ConstantInt::get(BinOpVal->getType(), 0);
+    BinOpVal = Builder.CreateSelect(State.get(Cond), BinOpVal, Zero);
+  }
+
   CallInst *V = Builder.CreateIntrinsic(
       RetTy, Intrinsic::experimental_vector_partial_reduce_add,
       {PhiVal, BinOpVal}, nullptr, "partial.reduce");
@@ -2432,6 +2434,14 @@ VPExtendedReductionRecipe::computeCost(ElementCount VF,
 InstructionCost
 VPMulAccumulateReductionRecipe::computeCost(ElementCount VF,
                                             VPCostContext &Ctx) const {
+  if (isPartialReduction()) {
+    return Ctx.TTI.getPartialReductionCost(
+        Instruction::Add, Ctx.Types.inferScalarType(getVecOp0()),
+        Ctx.Types.inferScalarType(getVecOp1()), getResultType(), VF,
+        TTI::getPartialReductionExtendKind(getExtOpcode()),
+        TTI::getPartialReductionExtendKind(getExtOpcode()), Instruction::Mul);
+  }
+
   Type *RedTy = Ctx.Types.inferScalarType(this);
   auto *SrcVecTy =
       cast<VectorType>(toVectorTy(Ctx.Types.inferScalarType(getVecOp0()), VF));
@@ -2509,6 +2519,8 @@ void VPMulAccumulateReductionRecipe::print(raw_ostream &O, const Twine &Indent,
   O << " = ";
   getChainOp()->printAsOperand(O, SlotTracker);
   O << " + ";
+  if (isPartialReduction())
+    O << "partial.";
   O << "reduce."
     << Instruction::getOpcodeName(
            RecurrenceDescriptor::getOpcode(getRecurrenceKind()))
43 changes: 39 additions & 4 deletions llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -2158,9 +2158,14 @@ expandVPMulAccumulateReduction(VPMulAccumulateReductionRecipe *MulAcc) {
   Mul->insertBefore(MulAcc);

   // Generate VPReductionRecipe.
-  auto *Red = new VPReductionRecipe(
-      MulAcc->getRecurrenceKind(), FastMathFlags(), MulAcc->getChainOp(), Mul,
-      MulAcc->getCondOp(), MulAcc->isOrdered(), MulAcc->getDebugLoc());
+  VPReductionRecipe *Red = nullptr;
+  if (MulAcc->isPartialReduction())
+    Red = new VPPartialReductionRecipe(Instruction::Add, MulAcc->getChainOp(),
+                                       Mul, MulAcc->getCondOp());
+  else
+    Red = new VPReductionRecipe(MulAcc->getRecurrenceKind(), FastMathFlags(),
+                                MulAcc->getChainOp(), Mul, MulAcc->getCondOp(),
+                                MulAcc->isOrdered(), MulAcc->getDebugLoc());
   Red->insertBefore(MulAcc);

   MulAcc->replaceAllUsesWith(Red);
@@ -2432,12 +2437,42 @@ static void tryToCreateAbstractReductionRecipe(VPReductionRecipe *Red,
   Red->replaceAllUsesWith(AbstractR);
 }

+/// This function tries to create an abstract recipe from a partial reduction to
+/// hide its mul and extends from cost estimation.
+static void
+tryToCreateAbstractPartialReductionRecipe(VPPartialReductionRecipe *PRed) {
Collaborator

Does this need to take the Range & and clamp that range if the pattern doesn't match, or if the cost is higher than that of the individual operations (similar to what happens in tryToCreateAbstractReductionRecipe)?

(Note that the cost part is still missing.)

Collaborator Author

At this point we've already created the partial reduction and clamped the range so I don't think we need to do any costing (like tryToMatchAndCreateMulAccumulateReduction does with getMulAccReductionCost) since we already know it's worthwhile (see getScaledReductions in LoopVectorize.cpp). This part of the code just puts the partial reduction inside the abstract recipe, which shouldn't need to consider any costing.

Collaborator

As I read it, by the time we reach this code a reduction has already been recognised, so there is a VP[Partial]ReductionRecipe. The code then analyses whether that recipe can be transformed into a VPMulAccumulateReductionRecipe. For a VPReductionRecipe it clamps the range to the VFs that can be turned into a VPMulAccumulateReductionRecipe, but for a VPPartialReductionRecipe it doesn't. I don't see why we'd do something different for partial reductions.

In fact, why wouldn't the tryToMatchAndCreateMulAccumulateReduction code be sufficient here? Now that you've made VPPartialReductionRecipe a subclass of VPReductionRecipe, I'd expect that code to function roughly the same.
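
As a rough illustration of the range clamping being discussed, here is a standalone sketch of a decide-and-clamp helper over power-of-two vectorization factors. It is only a model under assumptions: SimpleVFRange and decideAndClamp are hypothetical names, the range is treated as half-open [Start, End), and none of this is the actual VFRange or planner API.

```cpp
#include <cassert>
#include <functional>

// Hypothetical model of a half-open range [Start, End) of power-of-two
// vectorization factors; not LLVM's actual VFRange type.
struct SimpleVFRange {
  unsigned Start;
  unsigned End;
};

// Evaluate Pred at Range.Start, then shrink Range.End to the first VF whose
// answer differs, so one decision holds for every VF left in the range.
bool decideAndClamp(const std::function<bool(unsigned)> &Pred,
                    SimpleVFRange &Range) {
  assert(Range.Start > 0 && Range.Start < Range.End &&
         "expected a non-empty range with a non-zero start");
  bool AtStart = Pred(Range.Start);
  for (unsigned VF = Range.Start * 2; VF < Range.End; VF *= 2) {
    if (Pred(VF) != AtStart) {
      Range.End = VF;
      break;
    }
  }
  return AtStart;
}
```

For example, with a range of [2, 32) and a predicate that holds only for VF <= 8, the helper returns true and clamps the range to [2, 16). The reviewer's question is essentially whether bundling a partial reduction should go through a step like this, so that the bundling decision is uniform across the VFs a plan covers.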

+  if (PRed->getOpcode() != Instruction::Add)
+    return;
+
+  using namespace llvm::VPlanPatternMatch;
+  auto *BinOp = PRed->getBinOp();
+  if (!match(BinOp,
+             m_Mul(m_ZExtOrSExt(m_VPValue()), m_ZExtOrSExt(m_VPValue()))))
+    return;
+
+  auto *BinOpR = cast<VPWidenRecipe>(BinOp->getDefiningRecipe());
+  VPWidenCastRecipe *Ext0R = dyn_cast<VPWidenCastRecipe>(BinOpR->getOperand(0));
+  VPWidenCastRecipe *Ext1R = dyn_cast<VPWidenCastRecipe>(BinOpR->getOperand(1));
+
+  // TODO: Make work with extends of different signedness
+  if (Ext0R->hasMoreThanOneUniqueUser() || Ext1R->hasMoreThanOneUniqueUser() ||
+      Ext0R->getOpcode() != Ext1R->getOpcode())
+    return;
+
+  auto *AbstractR = new VPMulAccumulateReductionRecipe(
+      PRed, BinOpR, Ext0R, Ext1R, Ext0R->getResultType());
+  AbstractR->insertBefore(PRed);
+  PRed->replaceAllUsesWith(AbstractR);
+}
+
 void VPlanTransforms::convertToAbstractRecipes(VPlan &Plan, VPCostContext &Ctx,
                                                VFRange &Range) {
   for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
            vp_depth_first_deep(Plan.getVectorLoopRegion()))) {
     for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {
-      if (auto *Red = dyn_cast<VPReductionRecipe>(&R))
+      if (auto *PRed = dyn_cast<VPPartialReductionRecipe>(&R))
+        tryToCreateAbstractPartialReductionRecipe(PRed);
+      else if (auto *Red = dyn_cast<VPReductionRecipe>(&R))
         tryToCreateAbstractReductionRecipe(Red, Ctx, Range);
     }
   }