
[VPlan] Impl VPlan-based pattern match for ExtendedRed and MulAccRed #113903


Open
wants to merge 56 commits into base: main

Conversation

ElvisWang123
Contributor

@ElvisWang123 ElvisWang123 commented Oct 28, 2024

This patch implements the VPlan-based pattern match for extended-reduction and MulAccReduction. In these reduction patterns, the extend instructions and the mul instruction can be folded into the reduction instruction, so their cost is free.

This patch creates two new abstract recipes, which will be lowered to the corresponding VPWidenCastRecipe, VPWidenRecipe, and VPReductionRecipe recipes before execution.

  1. VPExtendedReductionRecipe:

    • Pattern: reduce(ext(...))
    • Lowered to VPWidenCastRecipe + VPReductionRecipe.
  2. VPMulAccumulateReductionRecipe:

    • Patterns: reduce.add(mul(...)), reduce.add(mul(ext(...), ext(...))), reduce.add(ext(mul(ext(...), ext(...)))) with all extend instructions having the same opcode, and reduce.add(sext(mul(zext(A), zext(A)))).
    • Lowered to VPWidenCastRecipe + VPWidenRecipe + VPReductionRecipe.

Ref: Original instruction based implementation: https://reviews.llvm.org/D93476
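For illustration, here is a minimal scalar source loop (not from the PR; an assumed example) that the loop vectorizer typically turns into the reduce.add(mul(ext(...), ext(...))) shape the MulAccReduction recipe models:

```cpp
#include <cstddef>
#include <cstdint>

// Dot product of two i8 arrays accumulated into i32. Each element is
// sign-extended before the multiply, so the vectorized loop contains
// reduce.add(mul(sext(a[i]), sext(b[i]))) -- the MulAccReduction pattern.
int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n) {
  int32_t sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += (int32_t)a[i] * (int32_t)b[i]; // mul of two sign-extended i8s
  return sum;
}
```

On targets with fused multiply-accumulate reductions (e.g. MVE's VMLADAV-style instructions), the extends and the mul fold into the reduction, which is why the recipe treats their cost as free.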

@llvmbot
Member

llvmbot commented Oct 28, 2024

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Elvis Wang (ElvisWang123)

Changes

This patch implements the VPlan-based pattern match for ExtendedReduction and MulAccReduction. In these reduction patterns, the extend instructions and the mul instruction can be folded into the reduction instruction, so their cost is free.

We add FoldedRecipes to the VPCostContext to record recipes that can be folded into other recipes.

ExtendedReductionPatterns:
reduce(ext(...))
MulAccReductionPatterns:
reduce.add(mul(...))
reduce.add(mul(ext(...), ext(...)))
reduce.add(ext(mul(...)))
reduce.add(ext(mul(ext(...), ext(...))))

Ref: Original instruction based implementation:
https://reviews.llvm.org/D93476

This patch is based on #113902.


Patch is 21.33 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113903.diff

5 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (-45)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+1-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+127-12)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll (+36-36)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 60a94ca1f86e42..483e039fe133d6 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7303,51 +7303,6 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
       Cost += ReductionCost;
       continue;
     }
-
-    const auto &ChainOps = RdxDesc.getReductionOpChain(RedPhi, OrigLoop);
-    SetVector<Instruction *> ChainOpsAndOperands(ChainOps.begin(),
-                                                 ChainOps.end());
-    auto IsZExtOrSExt = [](const unsigned Opcode) -> bool {
-      return Opcode == Instruction::ZExt || Opcode == Instruction::SExt;
-    };
-    // Also include the operands of instructions in the chain, as the cost-model
-    // may mark extends as free.
-    //
-    // For ARM, some of the instruction can folded into the reducion
-    // instruction. So we need to mark all folded instructions free.
-    // For example: We can fold reduce(mul(ext(A), ext(B))) into one
-    // instruction.
-    for (auto *ChainOp : ChainOps) {
-      for (Value *Op : ChainOp->operands()) {
-        if (auto *I = dyn_cast<Instruction>(Op)) {
-          ChainOpsAndOperands.insert(I);
-          if (I->getOpcode() == Instruction::Mul) {
-            auto *Ext0 = dyn_cast<Instruction>(I->getOperand(0));
-            auto *Ext1 = dyn_cast<Instruction>(I->getOperand(1));
-            if (Ext0 && IsZExtOrSExt(Ext0->getOpcode()) && Ext1 &&
-                Ext0->getOpcode() == Ext1->getOpcode()) {
-              ChainOpsAndOperands.insert(Ext0);
-              ChainOpsAndOperands.insert(Ext1);
-            }
-          }
-        }
-      }
-    }
-
-    // Pre-compute the cost for I, if it has a reduction pattern cost.
-    for (Instruction *I : ChainOpsAndOperands) {
-      auto ReductionCost = CM.getReductionPatternCost(
-          I, VF, ToVectorTy(I->getType(), VF), TTI::TCK_RecipThroughput);
-      if (!ReductionCost)
-        continue;
-
-      assert(!CostCtx.SkipCostComputation.contains(I) &&
-             "reduction op visited multiple times");
-      CostCtx.SkipCostComputation.insert(I);
-      LLVM_DEBUG(dbgs() << "Cost of " << ReductionCost << " for VF " << VF
-                        << ":\n in-loop reduction " << *I << "\n");
-      Cost += *ReductionCost;
-    }
   }
 
   // Pre-compute the costs for branches except for the backedge, as the number
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 6ab8fb45c351b4..49e93e1e7b5501 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -785,7 +785,7 @@ void VPRegionBlock::execute(VPTransformState *State) {
 
 InstructionCost VPBasicBlock::cost(ElementCount VF, VPCostContext &Ctx) {
   InstructionCost Cost = 0;
-  for (VPRecipeBase &R : Recipes)
+  for (VPRecipeBase &R : reverse(Recipes))
     Cost += R.cost(VF, Ctx);
   return Cost;
 }
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 6a192bdf01c4ff..b26fd460a278f5 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -725,6 +725,8 @@ struct VPCostContext {
   LLVMContext &LLVMCtx;
   LoopVectorizationCostModel &CM;
   SmallPtrSet<Instruction *, 8> SkipCostComputation;
+  /// Contains recipes that are folded into other recipes.
+  SmallDenseMap<ElementCount, SmallPtrSet<VPRecipeBase *, 4>, 4> FoldedRecipes;
 
   VPCostContext(const TargetTransformInfo &TTI, const TargetLibraryInfo &TLI,
                 Type *CanIVTy, LoopVectorizationCostModel &CM)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 0eb4f7c7c88cee..5f59a1e96df9f8 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -299,7 +299,9 @@ InstructionCost VPRecipeBase::cost(ElementCount VF, VPCostContext &Ctx) {
     UI = &WidenMem->getIngredient();
 
   InstructionCost RecipeCost;
-  if (UI && Ctx.skipCostComputation(UI, VF.isVector())) {
+  if ((UI && Ctx.skipCostComputation(UI, VF.isVector())) ||
+      (Ctx.FoldedRecipes.contains(VF) &&
+       Ctx.FoldedRecipes.at(VF).contains(this))) {
     RecipeCost = 0;
   } else {
     RecipeCost = computeCost(VF, Ctx);
@@ -2188,30 +2190,143 @@ InstructionCost VPReductionRecipe::computeCost(ElementCount VF,
   TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
   unsigned Opcode = RdxDesc.getOpcode();
 
-  // TODO: Support any-of and in-loop reductions.
+  // TODO: Support any-of reductions.
   assert(
       (!RecurrenceDescriptor::isAnyOfRecurrenceKind(RdxKind) ||
        ForceTargetInstructionCost.getNumOccurrences() > 0) &&
       "Any-of reduction not implemented in VPlan-based cost model currently.");
-  assert(
-      (!cast<VPReductionPHIRecipe>(getOperand(0))->isInLoop() ||
-       ForceTargetInstructionCost.getNumOccurrences() > 0) &&
-      "In-loop reduction not implemented in VPlan-based cost model currently.");
 
   assert(ElementTy->getTypeID() == RdxDesc.getRecurrenceType()->getTypeID() &&
          "Inferred type and recurrence type mismatch.");
 
-  // Cost = Reduction cost + BinOp cost
-  InstructionCost Cost =
+  // BaseCost = Reduction cost + BinOp cost
+  InstructionCost BaseCost =
       Ctx.TTI.getArithmeticInstrCost(Opcode, ElementTy, CostKind);
   if (RecurrenceDescriptor::isMinMaxRecurrenceKind(RdxKind)) {
     Intrinsic::ID Id = getMinMaxReductionIntrinsicOp(RdxKind);
-    return Cost + Ctx.TTI.getMinMaxReductionCost(
-                      Id, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+    BaseCost += Ctx.TTI.getMinMaxReductionCost(
+        Id, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+  } else {
+    BaseCost += Ctx.TTI.getArithmeticReductionCost(
+        Opcode, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
   }
 
-  return Cost + Ctx.TTI.getArithmeticReductionCost(
-                    Opcode, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+  using namespace llvm::VPlanPatternMatch;
+  auto GetMulAccReductionCost =
+      [&](const VPReductionRecipe *Red) -> InstructionCost {
+    VPValue *A, *B;
+    InstructionCost InnerExt0Cost = 0;
+    InstructionCost InnerExt1Cost = 0;
+    InstructionCost ExtCost = 0;
+    InstructionCost MulCost = 0;
+
+    VectorType *SrcVecTy = VectorTy;
+    Type *InnerExt0Ty;
+    Type *InnerExt1Ty;
+    Type *MaxInnerExtTy;
+    bool IsUnsigned = true;
+    bool HasOuterExt = false;
+
+    auto *Ext = dyn_cast_if_present<VPWidenCastRecipe>(
+        Red->getVecOp()->getDefiningRecipe());
+    VPRecipeBase *Mul;
+    // Try to match outer extend reduce.add(ext(...))
+    if (Ext && match(Ext, m_ZExtOrSExt(m_VPValue())) &&
+        cast<VPWidenCastRecipe>(Ext)->getNumUsers() == 1) {
+      IsUnsigned =
+          Ext->getOpcode() == Instruction::CastOps::ZExt ? true : false;
+      ExtCost = Ext->computeCost(VF, Ctx);
+      Mul = Ext->getOperand(0)->getDefiningRecipe();
+      HasOuterExt = true;
+    } else {
+      Mul = Red->getVecOp()->getDefiningRecipe();
+    }
+
+    // Match reduce.add(mul())
+    if (Mul && match(Mul, m_Mul(m_VPValue(A), m_VPValue(B))) &&
+        cast<VPWidenRecipe>(Mul)->getNumUsers() == 1) {
+      MulCost = cast<VPWidenRecipe>(Mul)->computeCost(VF, Ctx);
+      auto *InnerExt0 =
+          dyn_cast_if_present<VPWidenCastRecipe>(A->getDefiningRecipe());
+      auto *InnerExt1 =
+          dyn_cast_if_present<VPWidenCastRecipe>(B->getDefiningRecipe());
+      bool HasInnerExt = false;
+      // Try to match inner extends.
+      if (InnerExt0 && InnerExt1 &&
+          match(InnerExt0, m_ZExtOrSExt(m_VPValue())) &&
+          match(InnerExt1, m_ZExtOrSExt(m_VPValue())) &&
+          InnerExt0->getOpcode() == InnerExt1->getOpcode() &&
+          (InnerExt0->getNumUsers() > 0 &&
+           !InnerExt0->hasMoreThanOneUniqueUser()) &&
+          (InnerExt1->getNumUsers() > 0 &&
+           !InnerExt1->hasMoreThanOneUniqueUser())) {
+        InnerExt0Cost = InnerExt0->computeCost(VF, Ctx);
+        InnerExt1Cost = InnerExt1->computeCost(VF, Ctx);
+        Type *InnerExt0Ty = Ctx.Types.inferScalarType(InnerExt0->getOperand(0));
+        Type *InnerExt1Ty = Ctx.Types.inferScalarType(InnerExt1->getOperand(0));
+        Type *MaxInnerExtTy = InnerExt0Ty->getIntegerBitWidth() >
+                                      InnerExt1Ty->getIntegerBitWidth()
+                                  ? InnerExt0Ty
+                                  : InnerExt1Ty;
+        SrcVecTy = cast<VectorType>(ToVectorTy(MaxInnerExtTy, VF));
+        IsUnsigned = true;
+        HasInnerExt = true;
+      }
+      InstructionCost MulAccRedCost = Ctx.TTI.getMulAccReductionCost(
+          IsUnsigned, ElementTy, SrcVecTy, CostKind);
+      // Check if folding ext/mul into MulAccReduction is profitable.
+      if (MulAccRedCost.isValid() &&
+          MulAccRedCost <
+              ExtCost + MulCost + InnerExt0Cost + InnerExt1Cost + BaseCost) {
+        if (HasInnerExt) {
+          Ctx.FoldedRecipes[VF].insert(InnerExt0);
+          Ctx.FoldedRecipes[VF].insert(InnerExt1);
+        }
+        Ctx.FoldedRecipes[VF].insert(Mul);
+        if (HasOuterExt)
+          Ctx.FoldedRecipes[VF].insert(Ext);
+        return MulAccRedCost;
+      }
+    }
+    return InstructionCost::getInvalid();
+  };
+
+  // Match reduce(ext(...))
+  auto GetExtendedReductionCost =
+      [&](const VPReductionRecipe *Red) -> InstructionCost {
+    VPValue *VecOp = Red->getVecOp();
+    VPValue *A;
+    if (match(VecOp, m_ZExtOrSExt(m_VPValue(A))) && VecOp->getNumUsers() == 1) {
+      VPWidenCastRecipe *Ext =
+          cast<VPWidenCastRecipe>(VecOp->getDefiningRecipe());
+      bool IsUnsigned = Ext->getOpcode() == Instruction::CastOps::ZExt;
+      InstructionCost ExtCost = Ext->computeCost(VF, Ctx);
+      auto *ExtVecTy =
+          cast<VectorType>(ToVectorTy(Ctx.Types.inferScalarType(A), VF));
+      InstructionCost ExtendedRedCost = Ctx.TTI.getExtendedReductionCost(
+          Opcode, IsUnsigned, ElementTy, ExtVecTy, RdxDesc.getFastMathFlags(),
+          CostKind);
+      // Check if folding ext into ExtendedReduction is profitable.
+      if (ExtendedRedCost.isValid() && ExtendedRedCost < ExtCost + BaseCost) {
+        Ctx.FoldedRecipes[VF].insert(Ext);
+        return ExtendedRedCost;
+      }
+    }
+    return InstructionCost::getInvalid();
+  };
+
+  // Match MulAccReduction patterns.
+  InstructionCost MulAccCost = GetMulAccReductionCost(this);
+  if (MulAccCost.isValid())
+    return MulAccCost;
+
+  // Match ExtendedReduction patterns.
+  InstructionCost ExtendedCost = GetExtendedReductionCost(this);
+  if (ExtendedCost.isValid())
+    return ExtendedCost;
+
+  // Default cost.
+  return BaseCost;
 }
 
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
index fa346b4eac02d4..f2e36399c85f5d 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
@@ -6,26 +6,26 @@ define void @i8_factor_2(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_2'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.2, ptr %data, i64 %i, i32 0
@@ -49,16 +49,16 @@ define void @i8_factor_3(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_3'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.3, ptr %data, i64 %i, i32 0
@@ -86,16 +86,16 @@ define void @i8_factor_4(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_4'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
 ; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.4, ptr %data, i64 %i, i32 0
@@ -127,14 +127,14 @@ define void @i8_factor_5(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_5'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 ; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.5, ptr %data, i64 %i, i32 0
@@ -170,14 +170,14 @@ define void @i8_factor_6(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_6'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
 ; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.6, ptr %data, i64 %i, i32 0
@@ -217,14 +217,14 @@ define void @i8_factor_7(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_7'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
 ; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at %...
[truncated]
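The profitability check embedded in the GetMulAccReductionCost lambda above can be summarized as a standalone sketch. This is a simplified model, not LLVM API; all names are illustrative:

```cpp
#include <cstdint>
#include <optional>

using Cost = int64_t;

// Fold the extends and the mul into a single MulAccReduction only when the
// fused cost beats the sum of the individual recipe costs; the folded
// recipes are then costed as free (via FoldedRecipes in the patch).
std::optional<Cost> foldedMulAccCost(Cost mulAccRedCost, Cost baseRedCost,
                                     Cost extCost, Cost mulCost,
                                     Cost innerExt0Cost, Cost innerExt1Cost) {
  Cost unfolded =
      extCost + mulCost + innerExt0Cost + innerExt1Cost + baseRedCost;
  if (mulAccRedCost < unfolded)
    return mulAccRedCost; // profitable: use the fused reduction cost
  return std::nullopt;    // not profitable: cost each recipe separately
}
```

This mirrors the `MulAccRedCost < ExtCost + MulCost + InnerExt0Cost + InnerExt1Cost + BaseCost` comparison in the diff; the real code additionally checks that the fused cost is valid for the target.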

@davemgreen davemgreen requested a review from david-arm October 28, 2024 12:57
@davemgreen
Collaborator

The way I think I had imagined this working in a vplan-based cost model would be that the vplan nodes more closely mirrored what the back-end would produce. So there would be a vplan recipe for the extended-reduction, which would be created if it was profitable but would otherwise be relatively easy to cost-model.

I'm not sure if that is still the current plan or not.

@fhahn
Contributor

fhahn commented Oct 28, 2024

The way I think I had imagined this working in a vplan-based cost model would be that the vplan nodes more closely mirrored what the back-end would produce. So there would be a vplan recipe for the extended-reduction, which would be created if it was profitable but would otherwise be relatively easy to cost-model.

Yes I think ideally we would model the add-extend/mul-extend operations explicitly in VPlan, especially if matching it in VPlan would require changes to the order in which the cost is computed. Would add-extend/mul-extend be sufficient or would other recipes be needed as well?

@ElvisWang123
Contributor Author

I think currently we need mul-extend-reduction and extend-reduction recipes to model these reduction patterns in VPlan.

Yes, generating new recipes is good, but using new recipes to model these patterns would duplicate a lot of code in the execute() function, since we lack middle-end IR for these patterns. The new recipes would still need to generate all the vector instructions for the recipes that have been folded into them.

@fhahn
Contributor

fhahn commented Oct 30, 2024

I think currently we need mul-extend-reduction and extend-reduction recipes to model these reduction patterns in VPlan.

@davemgreen do you think those would be enough?

Yes, generating new recipes is good, but using new recipes to model these patterns would duplicate a lot of code in the execute() function, since we lack middle-end IR for these patterns. The new recipes would still need to generate all the vector instructions for the recipes that have been folded into them.

This might be a case where gradual lowering would help. We could have a more abstract recipe early on which combines mul-extend in a single recipe, facilitating simple cost-computation. Before code-gen, we can replace the recipe with wide recipes for the adds and extends, so there is no need to duplicate codegen for those, similar to how things are sketched for scalar phis in #114305

@davemgreen
Collaborator

Hello. I believe the basic patterns are the ones listed in the summary, which are an extended-add-reduction or an extended mla reduction:

reduce(ext(...))
reduce.add(mul(...))
reduce.add(mul(ext(...), ext(...)))

In the case of MVE, both exts will be the same. The add can be done by setting one of the operands to 1. #92418 is similar, but produces a vector instead of a single output value (a dot product). Dot products have udot, sdot, and usdot, which do partialreduce.add(mul(sext, zext)), but for MVE the extends need to be the same.
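As a concrete illustration of the extended-add-reduction case, here is an assumed (not from the PR) scalar loop that produces reduce(ext(...)):

```cpp
#include <cstddef>
#include <cstdint>

// Widening sum: each i8 element is sign-extended before the add, so the
// vectorized loop contains reduce.add(sext(...)) -- the ExtendedReduction
// pattern. On MVE this can use the same mla-style reduction instruction
// with one multiplicand fixed to 1, as described above.
int32_t widening_sum(const int8_t *a, size_t n) {
  int32_t sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += (int32_t)a[i]; // sext i8 -> i32 feeding the reduction
  return sum;
}
```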

There are also other patterns that come up too. The first I believe should be equivalent to vecreduce(mul(ext, ext)), providing the ext nodes are the correct types. I don't remember about the second.

reduce.add(ext(mul(ext(...), ext(...))))
reduce.add(ext(mul(...)))

AArch64 has a stand-alone umull instruction (both for scalar and for vector, although the type sizes differ), that performs a mul(ext, ext). Sometimes it might be better to fold towards ext(load) though, depending on the types.
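The mul(ext, ext) shape that umull covers can be shown in isolation (an assumed illustrative snippet, not from the PR):

```cpp
#include <cstdint>

// mul(zext(a), zext(b)): both operands are widened before the multiply,
// so the 32x32->64 product cannot overflow. AArch64's umull performs this
// widening multiply in a single instruction; smull is the sext analogue.
uint64_t widening_mul(uint32_t a, uint32_t b) {
  return (uint64_t)a * (uint64_t)b;
}
```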

@davemgreen
Collaborator

BTW, I believe this patch is currently changing the scores computed for reduce.add(ext(mul(ext(...), ext(...)))), in a way that makes it pick a different VF. I wasn't sure if that was because this was better or not, but I think it is picking different type sizes. https://godbolt.org/z/7TYb8heEM should hopefully show the difference? Let me know if it doesn't.

@ElvisWang123
Contributor Author

This might be a case where gradual lowering would help. We could have a more abstract recipe early on which combines mul-extend in a single recipe, facilitating simple cost-computation. Before code-gen, we can replace the recipe with wide recipes for the adds and extends, so there is no need to duplicate codegen for those, similar to how things are sketched for scalar phis in #114305

Thanks for your advice; I'm working in this direction.

Hello. I believe the basic patterns are the ones listed in the summary, which are an extended-add-reduction or an extended mla reduction:

reduce(ext(...))
reduce.add(mul(...))
reduce.add(mul(ext(...), ext(...)))

Thanks, I will only model these three patterns for reduction.

There are also other patterns that come up too. The first I believe should be equivalent to vecreduce(mul(ext, ext)), providing the ext nodes are the correct types. I don't remember about the second.

reduce.add(ext(mul(ext(...), ext(...))))
reduce.add(ext(mul(...)))

Thanks for catching that. I misunderstood how many patterns can be folded into MVE reduction-like instructions in the original patch.

AArch64 has a stand-alone umull instruction (both for scalar and for vector, although the type sizes differ), that performs a mul(ext, ext). Sometimes it might be better to fold towards ext(load) though, depending on the types.

I think we already model the instruction cost for ext(load) in VPWidenCastRecipe::computeCost(): we compute the CastContextHint, which depends on the load/store, for the ext instructions. But I am not quite sure whether ARMTTI handles this pattern correctly.

In summary, I think we only need two new recipes for reductions: reduce(ext) and reduce.add(mul(<optional>(ext), <optional>(ext)))?

If there are any questions, please let me know.

@davemgreen
Collaborator

Thanks, I will only model these three patterns for reduction.

I believe that reduce.add(ext(mul(ext(...), ext(...)))) is mathematically equivalent to reduce.add(mul(ext(...), ext(...))), so that sounds good so long as we can match both. I don't remember about the reduce.add(ext(mul(..., ...))) form, it doesn't sound like it would be equivalent. If it turns out we do need it, it should hopefully be fine to look into adding it later.
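The equivalence claim for the i8 case can be checked exhaustively. A sketch (assumed illustrative code, not from the PR): the product of two sign-extended i8 values always fits in i16 (range -16256..16384), so sign-extending the i16 product to i32 yields the same value as multiplying directly in i32.

```cpp
#include <cstdint>

// Checks, over all i8 pairs, that sext(i16 mul(sext(a), sext(b))) equals
// the i32 multiply, i.e. reduce.add(sext(mul(sext, sext))) contributes the
// same terms as reduce.add(mul(sext, sext)).
bool innerOuterSextEquivalent() {
  for (int a = -128; a <= 127; ++a)
    for (int b = -128; b <= 127; ++b) {
      int16_t narrow = (int16_t)(a * b);        // mul(sext, sext) in i16
      int32_t viaExt = (int32_t)narrow;         // sext(mul(...)) to i32
      int32_t direct = (int32_t)a * (int32_t)b; // mul done directly in i32
      if (viaExt != direct)
        return false;
    }
  return true;
}
```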

I think we already model the instruction cost for ext(load) in VPWidenCastRecipe::computeCost(). We compute the CastContextHint which depends on the load/store for the ext instructions. But I am not quite sure will ARMTTI handle this pattern correctly or not.

In summary, I think we only need two new recipes for reductions: reduce(ext) and reduce.add(mul(<optional>(ext), <optional>(ext)))?

IIRC CCH was added for getting the extension costs (more) correct under Arm/MVE, so it hopefully does OK in the current scheme. We might need to be a little careful about which one of red-ext + load vs red + ext-load we prefer for each vplan.

@ElvisWang123
Contributor Author

Updated to the recipe-based implementation.

Note that the EVL version is not implemented yet, so the RISC-V test cases changed.

@davemgreen
Collaborator

Hello. Let me upload some tests of examples that produce different results with this now. I think it might be that add(sext(mul(sext, sext))) can be optimized to add(zext(mul(sext, sext))) nowadays, or that there are multiple uses.

@davemgreen
Collaborator

This might show it picking a different VF now: https://godbolt.org/z/zda3dvcrx

@ElvisWang123
Contributor Author

Hello. Let me upload some tests of examples that produce different results with this now. I think it might be that add(sext(mul(sext, sext))) can be optimized to add(zext(mul(sext, sext))) nowadays, or that there are multiple uses.

Thanks for the information. I will fix the MulAcc pattern match for this pattern.

@davemgreen
Collaborator

Thanks. There might be some more, and that might have been just one of the issues; I will keep testing the others and see if anything else comes up.

@davemgreen
Collaborator

This is another one that is behaving differently, I think due to subtleties about when one-use checks are beneficial. https://godbolt.org/z/fYW9Y5TqG. There is a third more awkward case with interleave groups I have not looked into much yet.

I will try and make those into test cases to ensure we have the test coverage for them. I will have to check again later if anything else remains that is behaving differently and hitting the assert.

@davemgreen
Collaborator

Some tests in ab9178e.

Contributor

@fhahn fhahn left a comment


Thanks for the update, some initial suggestions inline.

Would be good if we could avoid any references to IR operands, and possibly unify both recipes (and avoid duplication with the existing reduction recipe)

@ElvisWang123
Contributor Author

ElvisWang123 commented Nov 11, 2024

Some tests in ab9178e.

Thanks for adding new tests.

Updated the condition for creating MulAccRecipe for reduce.add(zext(mul(sext(A), sext(B)))) when A == B. The new patch passes all assertion checks.

To support mla_and_add_together_16_64, which contains two reduction patterns in the same loop (reduce.add(zext(mul(sext(A), sext(A)))) and reduce(sext(A))), I removed the check that the extend recipes have only one user, to align with the behavior of the legacy model. This change might miscalculate the cost if the extend recipes are used by other recipes outside the reduction pattern, since we would still need to account for the cost of those extends. But I think the legacy cost model has the same issue, and we can fix it in the future.
Also, this may generate duplicate extend instructions, but I think they will be removed by later optimizations. 234e81e
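The A == B case corresponds to a sum of squares; a hedged C++ sketch (hypothetical names, not from the PR) of the reduce.add(zext(mul(sext(A), sext(A)))) shape, where the outer extend may be a zext because the 32-bit square of an i16 value is always non-negative:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sum of squares of i16 values accumulated in i64. The inner extends are
// sexts, but the 32-bit square is always >= 0 (max is 32768^2 = 2^30), so
// the outer extend may be a zext: reduce.add(zext(mul(sext(A), sext(A)))).
int64_t sumsq16(const int16_t *a, size_t n) {
  int64_t acc = 0;
  for (size_t i = 0; i < n; ++i) {
    int32_t sq = (int32_t)a[i] * (int32_t)a[i]; // non-negative for all i16
    acc += (int64_t)(uint32_t)sq;               // zext of the product
  }
  return acc;
}
```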

@@ -811,6 +823,8 @@ class VPRecipeWithIRFlags : public VPSingleDefRecipe {

FastMathFlags getFastMathFlags() const;

bool isNonNeg() const { return NonNegFlags.NonNeg; }
Contributor

Can we assert that OpType == OperationType::NonNegOp here, like for other accessors? Otherwise we would access a non-active union element.

Contributor Author

Added, thanks!
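The guarded accessor discussed above can be sketched as follows (a simplified, hypothetical model of the VPRecipeWithIRFlags union, not the actual LLVM class):

```cpp
#include <cassert>

// Simplified model of a recipe that stores one of several flag sets in a
// union, discriminated by OpType. Accessors assert the matching tag before
// reading, so a non-active union member is never accessed.
struct RecipeFlags {
  enum class OperationType { NonNegOp, OverflowingBinOp } OpType;
  union {
    struct { bool NonNeg; } NonNegFlags;
    struct { bool NUW, NSW; } WrapFlags;
  };

  bool isNonNeg() const {
    assert(OpType == OperationType::NonNegOp &&
           "recipe doesn't have the NonNeg flag");
    return NonNegFlags.NonNeg;
  }
};
```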

@@ -2341,6 +2373,28 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
setUnderlyingValue(I);
}

/// For VPExtendedReductionRecipe.
/// Note that IsNonNeg flag and the debug location are from the extend.
Contributor

Suggested change
/// Note that IsNonNeg flag and the debug location are from the extend.
/// Note that the IsNonNeg flag and the debug location are from the extend.

Contributor Author

Done.

}

/// For VPMulAccumulateReductionRecipe.
/// Note that the NUW/NSW and DL are from the Mul.
Contributor

Suggested change
/// Note that the NUW/NSW and DL are from the Mul.
/// Note that the NUW/NSW flags and the debug location are from the Mul.

(for consistency with comment above)

Contributor Author

Updated, thanks!

/// A recipe to represent inloop extended reduction operations, performing a
/// reduction on a extended vector operand into a scalar value, and adding the
/// result to a chain. This recipe is abstract and needs to be lowered to
/// concrete recipes before codegen. The Operands are {ChainOp, VecOp,
Contributor

Suggested change
/// concrete recipes before codegen. The Operands are {ChainOp, VecOp,
/// concrete recipes before codegen. The operands are {ChainOp, VecOp,

Contributor Author

Done!

Collaborator

This is not done yet?

Contributor Author

Oops. I missed that. Updated, thanks!

VPValue *getVecOp0() const { return getOperand(1); }
VPValue *getVecOp1() const { return getOperand(2); }

/// Return if this MulAcc recipe contains extend instructions.
Contributor

More accurate? It would be good to strip the reference to instructions.

Suggested change
/// Return if this MulAcc recipe contains extend instructions.
/// Return if this MulAcc recipe contains extended operands.

Contributor Author

Updated, thanks!

/// Return the opcode of the underlying extend.
Instruction::CastOps getExtOpcode() const { return ExtOp; }

/// Return if the extend opcode is ZExt.
Contributor

Suggested change
/// Return if the extend opcode is ZExt.
/// Return if the operands are zero extended.

Contributor Author

Done.

/// Return if this MulAcc recipe contains extend instructions.
bool isExtended() const { return ExtOp != Instruction::CastOps::CastOpsEnd; }

/// Return the opcode of the underlying extend.
Contributor

Suggested change
/// Return the opcode of the underlying extend.
/// Return the opcode of the extends for the operands.

Contributor Author

Done!

Comment on lines 275 to 282
.Case<VPExtendedReductionRecipe>(
[](const VPExtendedReductionRecipe *R) {
return R->getResultType();
})
.Case<VPMulAccumulateReductionRecipe>(
[](const VPMulAccumulateReductionRecipe *R) {
return R->getResultType();
})
Contributor

@fhahn fhahn Apr 22, 2025

I think something like this should work:

Suggested change
.Case<VPExtendedReductionRecipe>(
[](const VPExtendedReductionRecipe *R) {
return R->getResultType();
})
.Case<VPMulAccumulateReductionRecipe>(
[](const VPMulAccumulateReductionRecipe *R) {
return R->getResultType();
})
.Case<VPExtendedReductionRecipe, VPMulAccumulateReductionRecipe>(
[](const auto *R) {
return R->getResultType();
})

Collaborator

@SamTebbs33 SamTebbs33 Apr 22, 2025

The only common super class for them is VPReductionRecipe, so the lambda parameter would need to be cast to VPExtendedReductionRecipe or VPMulAccumulateReductionRecipe. In which case it's probably cleaner to keep the structure as it is.

Contributor

The lambda should have used const auto *R (I updated the suggestion), which should be supported by the type switch; I think we use that feature in a number of places.

Contributor Author

Updated, thanks!
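For reference, the reason one `.Case<A, B>` with a generic lambda works across unrelated recipe types is that the lambda body is instantiated separately for each concrete type. A minimal standalone C++ sketch (hypothetical stand-in types, not llvm::TypeSwitch itself):

```cpp
#include <cassert>
#include <string>

// Two unrelated types that both happen to expose getResultType(), just like
// VPExtendedReductionRecipe and VPMulAccumulateReductionRecipe.
struct ExtendedRed { std::string getResultType() const { return "i32"; } };
struct MulAccRed   { std::string getResultType() const { return "i64"; } };

// A generic lambda compiles once per concrete argument type, so a single
// body can serve both cases even without a common base class exposing
// getResultType() -- the same mechanism .Case<A, B>([](const auto *R) {...})
// relies on in TypeSwitch.
auto resultType = [](const auto *R) { return R->getResultType(); };
```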

Comment on lines 81 to 85
bool contains(const ElementCount &VF) const {
return VF.getKnownMinValue() >= Start.getKnownMinValue() &&
VF.getKnownMinValue() < End.getKnownMinValue();
}

Contributor

Still needed in the latest version?

Contributor Author

Removed, thanks!

return Copy;
}

VP_CLASSOF_IMPL(VPDef::VPMulAccumulateReductionSC);
Collaborator

Since building on top of this I've noticed that this class needs a case in VPRecipeBase::mayHaveSideEffects, probably above the VPReductionSC case.

Contributor Author

Added. Thanks!

Only a VPWidenCastRecipe with ZExt contains the nneg flag. Use
transferFlags() to prevent taking the nneg flag from operands that don't
have nneg flags.
Collaborator

@sdesmalen-arm sdesmalen-arm left a comment

A few more minor comments.

/// A recipe to represent inloop extended reduction operations, performing a
/// reduction on a extended vector operand into a scalar value, and adding the
/// result to a chain. This recipe is abstract and needs to be lowered to
/// concrete recipes before codegen. The Operands are {ChainOp, VecOp,
Collaborator

This is not done yet?

Collaborator

@sdesmalen-arm sdesmalen-arm left a comment

Thanks for addressing the comments @ElvisWang123!

@@ -155,6 +155,8 @@ bool VPRecipeBase::mayHaveSideEffects() const {
case VPBlendSC:
case VPReductionEVLSC:
case VPReductionSC:
case VPExtendedReductionSC:
Contributor

Should also be added to mayReadFromMemory and mayWriteToMemory for completeness?

Contributor Author

Added, thanks!

Contributor

@fhahn fhahn left a comment

Thanks for all the updates. Basically looks good, but I think it would be good to split up the patch into adding the new recipes + transform + lowering but without the cost changes. And then land the cost changes separately, if possible?

@fhahn
Contributor

fhahn commented Apr 27, 2025

That way there will be less churn in case we need to revert due to cost failures.

ElvisWang123 added a commit that referenced this pull request Apr 29, 2025
This patch adds a test for the fmuladd reduction to show the test
changes/failures for the cost model change.

Note that without the fp128 load and trunc, there is no failure.

Pre-commit test for #113903.
ElvisWang123 added a commit to ElvisWang123/llvm-project that referenced this pull request Apr 29, 2025
…to abstract recipe.

This patch implements the transformation that matches the following
patterns in the VPlan and converts them to abstract recipes for better cost
estimation.

* VPExtendedReductionRecipe
  - cast + reduction.

* VPMulAccumulateReductionRecipe
  - (cast) + mul + reduction.

The converted abstract recipes will be lowered to the concrete recipes
(widen-cast + widen-mul + reduction) just before vector codegen.

This should be a cost-model-based decision, which will be implemented in a
following patch. For now, we still rely on the legacy cost model to
calculate the right cost.

Split from llvm#113903.
@ElvisWang123
Contributor Author

Split the new recipe implementations into #137745.
Split the transformations (convertTo{Abstract|Concrete}Recipes) into #137746.

Will update this patch after the above patches land.
This patch will then focus on removing the dependency on the legacy cost model and enabling this transformation via the VPlan-based cost model.

gizmondo pushed a commit to gizmondo/llvm-project that referenced this pull request Apr 29, 2025
This patch adds a test for the fmuladd reduction to show the test
changes/failures for the cost model change.

Note that without the fp128 load and trunc, there is no failure.

Pre-commit test for llvm#113903.
@fhahn
Contributor

fhahn commented Apr 29, 2025

Split new recipes implementations to #137745. Split transformations (convertTo{Abstract|Concrete}Recipes) to #137746.

Would be good to combine #137745 & #137746, so the recipes are used already, sorry if that wasn't clearer from the previous comment

ElvisWang123 added a commit to ElvisWang123/llvm-project that referenced this pull request Apr 29, 2025
… to abstract recipe.

This patch introduces two new recipes.

* VPExtendedReductionRecipe
  - cast + reduction.

* VPMulAccumulateReductionRecipe
  - (cast) + mul + reduction.

This patch also implements the transformation that matches the following
patterns via VPlan and converts them to abstract recipes for better cost
estimation.

* VPExtendedReduction
  - reduce(cast(...))

* VPMulAccumulateReductionRecipe
  - reduce.add(mul(...))
  - reduce.add(mul(ext(...), ext(...)))
  - reduce.add(ext(mul(ext(...), ext(...))))

The converted abstract recipes will be lowered to the concrete recipes
(widen-cast + widen-mul + reduction) just before recipe execution.

Split from llvm#113903.
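The lowering is semantically a no-op: the abstract recipe and the widen-cast + widen-mul + reduction sequence compute the same value. A hedged scalar C++ model (hypothetical helper names, not VPlan code) of that equivalence:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Abstract form: one fused mul-accumulate reduction.
int64_t mulAccFolded(const std::vector<int16_t> &a,
                     const std::vector<int16_t> &b) {
  int64_t acc = 0;
  for (size_t i = 0; i < a.size(); ++i)
    acc += (int64_t)a[i] * (int64_t)b[i];
  return acc;
}

// Lowered form: explicit extend (widen-cast), multiply (widen-mul), then
// a separate reduction, mirroring the concrete recipes.
int64_t mulAccLowered(const std::vector<int16_t> &a,
                      const std::vector<int16_t> &b) {
  std::vector<int64_t> ea(a.begin(), a.end()), eb(b.begin(), b.end()); // widen-cast
  std::vector<int64_t> prod(a.size());
  for (size_t i = 0; i < a.size(); ++i)
    prod[i] = ea[i] * eb[i];                                           // widen-mul
  int64_t acc = 0;
  for (int64_t p : prod)
    acc += p;                                                          // reduction
  return acc;
}
```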
@ElvisWang123
Contributor Author

Would be good to combine #137745 & #137746, so the recipes are used already, sorry if that wasn't clearer from the previous comment

Ah it is fine, I misunderstood the previous comment 😃. Combined all the changes into #137746.

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…xit`. (llvm#135294)

This patch checks whether the plan contains a scalar VF via the VFRange
instead of the Plan.
It also clamps the range to contain either only scalar or only vector VFs
to prevent miscompiles.

Split from llvm#113903.
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
This patch adds a test for the fmuladd reduction to show the test
changes/failures for the cost model change.

Note that without the fp128 load and trunc, there is no failure.

Pre-commit test for llvm#113903.
GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025
This patch adds a test for the fmuladd reduction to show the test
changes/failures for the cost model change.

Note that without the fp128 load and trunc, there is no failure.

Pre-commit test for llvm#113903.
ElvisWang123 added a commit to ElvisWang123/llvm-project that referenced this pull request May 8, 2025
… to abstract recipe.

This patch introduces two new recipes.

* VPExtendedReductionRecipe
  - cast + reduction.

* VPMulAccumulateReductionRecipe
  - (cast) + mul + reduction.

This patch also implements the transformation that matches the following
patterns via VPlan and converts them to abstract recipes for better cost
estimation.

* VPExtendedReduction
  - reduce(cast(...))

* VPMulAccumulateReductionRecipe
  - reduce.add(mul(...))
  - reduce.add(mul(ext(...), ext(...)))
  - reduce.add(ext(mul(ext(...), ext(...))))

The converted abstract recipes will be lowered to the concrete recipes
(widen-cast + widen-mul + reduction) just before recipe execution.

Split from llvm#113903.
Ankur-0429 pushed a commit to Ankur-0429/llvm-project that referenced this pull request May 9, 2025
This patch adds a test for the fmuladd reduction to show the test
changes/failures for the cost model change.

Note that without the fp128 load and trunc, there is no failure.

Pre-commit test for llvm#113903.

8 participants