[RISCV][CostModel] Estimate cost of Extract/InsertElement with non-constant index when vector instructions are not available #67334

skachkov-sc · 2023-09-25T14:20:39Z

This patch fixes the compilation time issue of matrix-types-spec test from test-suite.

Reproduction of the problem:

clang++ -DNDEBUG --target=riscv64-linux-gnu --sysroot=<sysroot path> --gcc-toolchain=<gcc path> -O2 -fenable-matrix <test-suite-path>/SingleSource/UnitTests/matrix-types-spec.cpp

On my machine, compilation takes 50.44s. In comparison, the same test with RVV (-march=rv64gcv) compiles in 3.06s, and for x86-64 target it takes 1.71s. It turns out that the main issue is unrolling of loop in multiplySpec function, that has extractelements with non-constant index:

for.body9.i:                                      ; preds = %for.body9.i, %for.cond6.preheader.i
  %indvars.iv.i92 = phi i64 [ 0, %for.cond6.preheader.i ], [ %indvars.iv.next.i93, %for.body9.i ]
  %Elt.033.i = phi double [ 0.000000e+00, %for.cond6.preheader.i ], [ %80, %for.body9.i ]
  %77 = mul nuw nsw i64 %indvars.iv.i92, 25
  %78 = add nuw nsw i64 %77, %indvars.iv39.i91
  %matrixext.i = extractelement <475 x double> %62, i64 %78
  %79 = add nuw nsw i64 %indvars.iv.i92, %74
  %matrixext13.i = extractelement <209 x double> %73, i64 %79
  %80 = tail call double @llvm.fmuladd.f64(double %matrixext.i, double %matrixext13.i, double %Elt.033.i)
  %indvars.iv.next.i93 = add nuw nsw i64 %indvars.iv.i92, 1
  %exitcond.not.i94 = icmp eq i64 %indvars.iv.next.i93, 19
  br i1 %exitcond.not.i94, label %for.cond.cleanup8.i, label %for.body9.i, !llvm.loop !21

When RVV is supported, extractelement/insertelement with non-constant index can be lowered quite efficiently with vslidedown/vslideup; otherwise it's lowered via stack memory operations, i.e. for extractelement each vector element is stored on stack and then the needed element is loaded back; for insertelement is stores all vector elements, rewrites the required element value and then loads vector back. Currently the cost of such expensive operation is estimated as zero, so loop unroll processes the loop more aggresively. The proper estimation of cost (in a way like in X86 target) prohibits unrolling of this loop and fixes compilation time (2.77s on my machine).

llvmbot · 2023-09-25T14:21:54Z

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-risc-v

Changes

This patch fixes the compilation time issue of matrix-types-spec test from test-suite.

Reproduction of the problem:

clang++ -DNDEBUG --target=riscv64-linux-gnu --sysroot=&lt;sysroot path&gt; --gcc-toolchain=&lt;gcc path&gt; -O2 -fenable-matrix &lt;test-suite-path&gt;/SingleSource/UnitTests/matrix-types-spec.cpp

On my machine, compilation takes 50.44s. In comparison, the same test with RVV (-march=rv64gcv) compiles in 3.06s, and for x86-64 target it takes 1.71s. It turns out that the main issue is unrolling of loop in multiplySpec function, that has extractelements with non-constant index:

for.body9.i:                                      ; preds = %for.body9.i, %for.cond6.preheader.i
  %indvars.iv.i92 = phi i64 [ 0, %for.cond6.preheader.i ], [ %indvars.iv.next.i93, %for.body9.i ]
  %Elt.033.i = phi double [ 0.000000e+00, %for.cond6.preheader.i ], [ %80, %for.body9.i ]
  %77 = mul nuw nsw i64 %indvars.iv.i92, 25
  %78 = add nuw nsw i64 %77, %indvars.iv39.i91
  %matrixext.i = extractelement &lt;475 x double&gt; %62, i64 %78
  %79 = add nuw nsw i64 %indvars.iv.i92, %74
  %matrixext13.i = extractelement &lt;209 x double&gt; %73, i64 %79
  %80 = tail call double @<!-- -->llvm.fmuladd.f64(double %matrixext.i, double %matrixext13.i, double %Elt.033.i)
  %indvars.iv.next.i93 = add nuw nsw i64 %indvars.iv.i92, 1
  %exitcond.not.i94 = icmp eq i64 %indvars.iv.next.i93, 19
  br i1 %exitcond.not.i94, label %for.cond.cleanup8.i, label %for.body9.i, !llvm.loop !21

When RVV is supported, extractelement/insertelement with non-constant index can be lowered quite efficiently with vslidedown/vslideup; otherwise it's lowered via stack memory operations, i.e. for extractelement each vector element is stored on stack and then the needed element is loaded back; for insertelement is stores all vector elements, rewrites the required element value and then loads vector back. Currently the cost of such expensive operation is estimated as zero, so loop unroll processes the loop more aggresively. The proper estimation of cost (in a way like in X86 target) prohibits unrolling of this loop and fixes compilation time (2.77s on my machine).

Patch is 21.08 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67334.diff

3 Files Affected:

(modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+18)
(added) llvm/test/Analysis/CostModel/RISCV/extractelement.ll (+114)
(added) llvm/test/Analysis/CostModel/RISCV/insertelement.ll (+114)

diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 4a34362a792f150..2df4ea77033f1c6 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -1456,6 +1456,24 @@ InstructionCost RISCVTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
       Opcode != Instruction::InsertElement)
     return BaseT::getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1);
 
+  // Extract/InsertElement with non-constant index is very costly without
+  // V instructions; estimate cost of loads/stores sequence via the stack:
+  // ExtractElement cost: store vector to stack, load scalar;
+  // InsertElement cost: store vector to stack, store scalar, load vector.
+  if (Index == -1U && !ST->hasVInstructions()) {
+    auto *VecTy = cast<VectorType>(Val);
+    auto *ElemTy = VecTy->getElementType();
+    auto NumElems = VecTy->getElementCount().getKnownMinValue();
+    auto Align = DL.getPrefTypeAlign(ElemTy);
+    InstructionCost LoadCost =
+        getMemoryOpCost(Instruction::Load, ElemTy, Align, 0, CostKind);
+    InstructionCost StoreCost =
+        getMemoryOpCost(Instruction::Store, ElemTy, Align, 0, CostKind);
+    return Opcode == Instruction::ExtractElement
+               ? StoreCost * NumElems + LoadCost
+               : (StoreCost + LoadCost) * NumElems + StoreCost;
+  }
+
   // Legalize the type.
   std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Val);
 
diff --git a/llvm/test/Analysis/CostModel/RISCV/extractelement.ll b/llvm/test/Analysis/CostModel/RISCV/extractelement.ll
new file mode 100644
index 000000000000000..8f9275f717f1176
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/RISCV/extractelement.ll
@@ -0,0 +1,114 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 3
+; RUN: opt -passes="print<cost-model>" 2>&1 -disable-output -mtriple=riscv32 -mattr=+f,+d,+zfh < %s | FileCheck %s --check-prefixes=RV32
+; RUN: opt -passes="print<cost-model>" 2>&1 -disable-output -mtriple=riscv64 -mattr=+f,+d,+zfh < %s | FileCheck %s --check-prefixes=RV64
+
+define void @extractelement_int(i32 %x) {
+; RV32-LABEL: 'extractelement_int'
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i8 = extractelement <2 x i8> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i8 = extractelement <4 x i8> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i8 = extractelement <8 x i8> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i8 = extractelement <16 x i8> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i16 = extractelement <2 x i16> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i16 = extractelement <4 x i16> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i16 = extractelement <8 x i16> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i16 = extractelement <16 x i16> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i32 = extractelement <2 x i32> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i32 = extractelement <4 x i32> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i32 = extractelement <8 x i32> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i32 = extractelement <16 x i32> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v2i64 = extractelement <2 x i64> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %v4i64 = extractelement <4 x i64> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %v8i64 = extractelement <8 x i64> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %v16i64 = extractelement <16 x i64> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; RV64-LABEL: 'extractelement_int'
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i8 = extractelement <2 x i8> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i8 = extractelement <4 x i8> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i8 = extractelement <8 x i8> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i8 = extractelement <16 x i8> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i16 = extractelement <2 x i16> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i16 = extractelement <4 x i16> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i16 = extractelement <8 x i16> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i16 = extractelement <16 x i16> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i32 = extractelement <2 x i32> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i32 = extractelement <4 x i32> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i32 = extractelement <8 x i32> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i32 = extractelement <16 x i32> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i64 = extractelement <2 x i64> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i64 = extractelement <4 x i64> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i64 = extractelement <8 x i64> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i64 = extractelement <16 x i64> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+  %v2i8 = extractelement <2 x i8> undef, i32 %x
+  %v4i8 = extractelement <4 x i8> undef, i32 %x
+  %v8i8 = extractelement <8 x i8> undef, i32 %x
+  %v16i8 = extractelement <16 x i8> undef, i32 %x
+
+  %v2i16 = extractelement <2 x i16> undef, i32 %x
+  %v4i16 = extractelement <4 x i16> undef, i32 %x
+  %v8i16 = extractelement <8 x i16> undef, i32 %x
+  %v16i16 = extractelement <16 x i16> undef, i32 %x
+
+  %v2i32 = extractelement <2 x i32> undef, i32 %x
+  %v4i32 = extractelement <4 x i32> undef, i32 %x
+  %v8i32 = extractelement <8 x i32> undef, i32 %x
+  %v16i32 = extractelement <16 x i32> undef, i32 %x
+
+  %v2i64 = extractelement <2 x i64> undef, i32 %x
+  %v4i64 = extractelement <4 x i64> undef, i32 %x
+  %v8i64 = extractelement <8 x i64> undef, i32 %x
+  %v16i64 = extractelement <16 x i64> undef, i32 %x
+
+  ret void
+}
+
+define void @extractelement_fp(i32 %x) {
+; RV32-LABEL: 'extractelement_fp'
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f16 = extractelement <2 x half> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4f16 = extractelement <4 x half> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8f16 = extractelement <8 x half> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16f16 = extractelement <16 x half> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f32 = extractelement <2 x float> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4f32 = extractelement <4 x float> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8f32 = extractelement <8 x float> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16f32 = extractelement <16 x float> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f64 = extractelement <2 x double> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4f64 = extractelement <4 x double> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8f64 = extractelement <8 x double> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16f64 = extractelement <16 x double> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; RV64-LABEL: 'extractelement_fp'
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f16 = extractelement <2 x half> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4f16 = extractelement <4 x half> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8f16 = extractelement <8 x half> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16f16 = extractelement <16 x half> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f32 = extractelement <2 x float> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4f32 = extractelement <4 x float> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8f32 = extractelement <8 x float> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16f32 = extractelement <16 x float> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f64 = extractelement <2 x double> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4f64 = extractelement <4 x double> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8f64 = extractelement <8 x double> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16f64 = extractelement <16 x double> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+  %v2f16 = extractelement <2 x half> undef, i32 %x
+  %v4f16 = extractelement <4 x half> undef, i32 %x
+  %v8f16 = extractelement <8 x half> undef, i32 %x
+  %v16f16 = extractelement <16 x half> undef, i32 %x
+
+  %v2f32 = extractelement <2 x float> undef, i32 %x
+  %v4f32 = extractelement <4 x float> undef, i32 %x
+  %v8f32 = extractelement <8 x float> undef, i32 %x
+  %v16f32 = extractelement <16 x float> undef, i32 %x
+
+  %v2f64 = extractelement <2 x double> undef, i32 %x
+  %v4f64 = extractelement <4 x double> undef, i32 %x
+  %v8f64 = extractelement <8 x double> undef, i32 %x
+  %v16f64 = extractelement <16 x double> undef, i32 %x
+
+  ret void
+}
diff --git a/llvm/test/Analysis/CostModel/RISCV/insertelement.ll b/llvm/test/Analysis/CostModel/RISCV/insertelement.ll
new file mode 100644
index 000000000000000..4cdeaf26e19e4d1
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/RISCV/insertelement.ll
@@ -0,0 +1,114 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 3
+; RUN: opt -passes="print<cost-model>" 2>&1 -disable-output -mtriple=riscv32 -mattr=+f,+d,+zfh < %s | FileCheck %s --check-prefixes=RV32
+; RUN: opt -passes="print<cost-model>" 2>&1 -disable-output -mtriple=riscv64 -mattr=+f,+d,+zfh < %s | FileCheck %s --check-prefixes=RV64
+
+define void @insertelement_int(i32 %x) {
+; RV32-LABEL: 'insertelement_int'
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i8 = insertelement <2 x i8> undef, i8 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i8 = insertelement <4 x i8> undef, i8 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i8 = insertelement <8 x i8> undef, i8 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i8 = insertelement <16 x i8> undef, i8 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i16 = insertelement <2 x i16> undef, i16 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i16 = insertelement <4 x i16> undef, i16 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i16 = insertelement <8 x i16> undef, i16 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i16 = insertelement <16 x i16> undef, i16 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i32 = insertelement <2 x i32> undef, i32 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i32 = insertelement <4 x i32> undef, i32 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i32 = insertelement <8 x i32> undef, i32 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i32 = insertelement <16 x i32> undef, i32 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %v2i64 = insertelement <2 x i64> undef, i64 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %v4i64 = insertelement <4 x i64> undef, i64 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %v8i64 = insertelement <8 x i64> undef, i64 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 66 for instruction: %v16i64 = insertelement <16 x i64> undef, i64 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; RV64-LABEL: 'insertelement_int'
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i8 = insertelement <2 x i8> undef, i8 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i8 = insertelement <4 x i8> undef, i8 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i8 = insertelement <8 x i8> undef, i8 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i8 = insertelement <16 x i8> undef, i8 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i16 = insertelement <2 x i16> undef, i16 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i16 = insertelement <4 x i16> undef, i16 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i16 = insertelement <8 x i16> undef, i16 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i16 = insertelement <16 x i16> undef, i16 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i32 = insertelement <2 x i32> undef, i32 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i32 = insertelement <4 x i32> undef, i32 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i32 = insertelement <8 x i32> undef, i32 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i32 = insertelement <16 x i32> undef, i32 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i64 = insertelement <2 x i64> undef, i64 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i64 = insertelement <4 x i64> undef, i64 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i64 = insertelement <8 x i64> undef, i64 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i64 = insertelement <16 x i64> undef, i64 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+  %v2i8 = insertelement <2 x i8> undef, i8 undef, i32 %x
+  %v4i8 = insertelement <4 x i8> undef, i8 undef, i32 %x
+  %v8i8 = insertelement <8 x i8> undef, i8 undef, i32 %x
+  %v16i8 = insertelement <16 x i8> undef, i8 undef, i32 %x
+
+  %v2i16 = insertelement <2 x i16> undef, i16 undef, i32 %x
+  %v4i16 = insertelement <4 x i16> undef, i16 undef, i32 %x
+  %v8i16 = insertelement <8 x i16> undef, i16 undef, i32 %x
+  %v16i16 = insertelement <16 x i16> undef, i16 undef, i32 %x
+
+  %v2i32 = insertelement <2 x i32> undef, i32 undef, i32 %x
+  %v4i32 = insertelement <4 x i32> undef, i32 undef, i32 %x
+  %v8i32 = insertelement <8 x i32> undef, i32 undef, i32 %x
+  %v16i32 = insertelement <16 x i32> undef, i32 undef, i32 %x
+
+  %v2i64 = insertelement <2 x i64> undef, i64 undef, i32 %x
+  %v4i64 = insertelement <4 x i64> undef, i64 undef, i32 %x
+  %v8i64 = insertelement <8 x i64> undef, i64 undef, i32 %x
+  %v16i64 = insertelement <16 x i64> undef, i64 undef, i32 %x
+
+  ret void
+}
+
+define void @insertelement_fp(i32 %x) {
+; RV32-LABEL: 'insertelement_fp'
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2f16 = insertelement <2 x half> undef, half undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4f16 = insertelement <4 x half> undef, half undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8f16 = insertelement <8 x half> undef, half undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16f16 = insertelement <16 x half> undef, half undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2f32 = insertelement <2 x float> undef, float undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4f32 = insertelement <4 x float> undef, float undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8f32 = insertelement <8 x float> undef, float undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16f32 = insertelement <16 x float> undef, float undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2f64 = insertelement <2 x double> undef, double undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4f64 = insertelement <4 x double> undef, double undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8f64 = insertelement <8 x double> undef, double undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16f64 = insertelement <16 x double> undef, double undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; RV64-LABEL: 'insertelement_fp'
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2f16 = insertelement <2 x half> undef, half undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4f16 = insertelement <4 x half> undef, half undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8f16 = insertelement <8 x half> undef, half undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16f16 = insertelement <16 x half> undef, half undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2f32 = insertelement <2 x float> undef, float undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4f32 = insertelement <4 x float> undef, float undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8f32 = insertelement <8 x float> undef, float undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16f32 = insertelement <16 x float> undef, float undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2f64 = insertelement <2 x double> undef, double undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4f64 = insertelement <4 x double> undef, double undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for inst...
[truncated]

skachkov-sc · 2023-09-25T14:31:58Z

FYI @fhahn (looks like that access to builtin matrix type through non-constant index is very inefficient on targets without vector registers)

preames

Ouch, good catch. Minor review comments inline, this looks straight forward and likely to converge quickly.

p.s. Mostly for my own later self... I don't think this interacts with build_vector or explode_vector costing and/or SLP because those cases should only involve constant indices.

preames · 2023-09-25T17:21:02Z

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

+  if (Index == -1U && !ST->hasVInstructions()) {
+    auto *VecTy = cast<VectorType>(Val);
+    auto *ElemTy = VecTy->getElementType();
+    auto NumElems = VecTy->getElementCount().getKnownMinValue();


This will produce a significant under-estimate on cost for scalable vectors as the number elements will be very low. See getEstimatedVLFor.

Actually, I've realized that scalable vectors types can't be lowered with scalars (backend fails with "Don't know how to legalize this scalable vector type" error), and currently the reported cost of these operations is Invalid. So probably it's better to return Invalid cost instead of numeric value to show that it can't be lowered (added some test cases with vscale vectors)

I'm a bit confused by your comment. I wouldn't expect scalable vectors to be lowered via scalars, I'd expect this case to be lowered via the stack and thus your new costing would be relevant. Are you saying that we fail to lower this case entirely with scalable vectors?

Sorry for the bad wording, I mean that is possible to lower only fixed-width vector insertelement/extractelement via stack, scalable vectors will fail now: https://godbolt.org/z/9qj68zfTv

That test crashes in argument lowering, not insertelement handling. However, your point stands, here's an alternative example: https://godbolt.org/z/13v643fGe

It does look like we're trying to scalarize the type rather than legalize the operation via the stack.

preames · 2023-09-25T17:22:24Z

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

@@ -1456,6 +1456,24 @@ InstructionCost RISCVTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
      Opcode != Instruction::InsertElement)
    return BaseT::getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1);

+  // Extract/InsertElement with non-constant index is very costly without


Placement wise, this should probably be under the if (!LT.second.isVector()) check below.

Thank you, after moving the code I've discovered one more missing case here: there are floating-point tests in rvv-insertelement.ll, rvv-extractelement.ll and with zve64x extension the reported cost was zero. However, the actual cost is much higher because in absence of zve64f/zve64d extensions it's also lowered via stack (proof: https://godbolt.org/z/aob3MEMPP). Now these cases are also processed (and reflected in tests)

preames · 2023-09-25T17:24:09Z

llvm/test/Analysis/CostModel/RISCV/insertelement.ll

@@ -0,0 +1,114 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 3


Can you precommit the tests so that we can see the diff? You don't need separate review for that.

Pull request consists from 2 commits (pre-commit test and code change), that can be checked separately (by choosing the required one on GitHub's commits tab)

preames · 2023-09-26T16:17:59Z

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

+  if (!LT.second.isVector()) {
+    auto *FixedVecTy = dyn_cast<FixedVectorType>(Val);
+    // Scalable vectors can't be legalized with scalars, return invalid cost.
+    if (!FixedVecTy)


Does this bit actually get hit? I'd have expected it to go through the if (LT.second.isScalableVector() && !LT.first.isValid()) check just below.

Yes, the condition below handles it, removed this

preames

LGTM

…onstant index when vector instructions are not available When vector instructions are not available, Extract/InsertElement with N-element vector types and non-constant indexes are lowered in this way: ExtractElement: N stores of each element on stack, 1 load of required element InsertElement: N stores of each element on stack, 1 store to rewrite required element, N loads of each element back This patch implements cost calculation of these operations to fix compilation time of matrix-types-spec test.

…nstant index when vector instructions are not available (llvm#67334) This patch fixes the compilation time issue of matrix-types-spec test from test-suite. Reproduction of the problem: ``` clang++ -DNDEBUG --target=riscv64-linux-gnu --sysroot=<sysroot path> --gcc-toolchain=<gcc path> -O2 -fenable-matrix <test-suite-path>/SingleSource/UnitTests/matrix-types-spec.cpp ``` On my machine, compilation takes 50.44s. In comparison, the same test with RVV (-march=rv64gcv) compiles in 3.06s, and for x86-64 target it takes 1.71s. It turns out that the main issue is unrolling of loop in multiplySpec function, that has extractelements with non-constant index: ``` for.body9.i: ; preds = %for.body9.i, %for.cond6.preheader.i %indvars.iv.i92 = phi i64 [ 0, %for.cond6.preheader.i ], [ %indvars.iv.next.i93, %for.body9.i ] %Elt.033.i = phi double [ 0.000000e+00, %for.cond6.preheader.i ], [ %80, %for.body9.i ] %77 = mul nuw nsw i64 %indvars.iv.i92, 25 %78 = add nuw nsw i64 %77, %indvars.iv39.i91 %matrixext.i = extractelement <475 x double> %62, i64 %78 %79 = add nuw nsw i64 %indvars.iv.i92, %74 %matrixext13.i = extractelement <209 x double> %73, i64 %79 %80 = tail call double @llvm.fmuladd.f64(double %matrixext.i, double %matrixext13.i, double %Elt.033.i) %indvars.iv.next.i93 = add nuw nsw i64 %indvars.iv.i92, 1 %exitcond.not.i94 = icmp eq i64 %indvars.iv.next.i93, 19 br i1 %exitcond.not.i94, label %for.cond.cleanup8.i, label %for.body9.i, !llvm.loop !21 ``` When RVV is supported, extractelement/insertelement with non-constant index can be lowered quite efficiently with vslidedown/vslideup; otherwise it's lowered via stack memory operations, i.e. for extractelement each vector element is stored on stack and then the needed element is loaded back; for insertelement is stores all vector elements, rewrites the required element value and then loads vector back. Currently the cost of such expensive operation is estimated as zero, so loop unroll processes the loop more aggresively. The proper estimation of cost (in a way like in X86 target) prohibits unrolling of this loop and fixes compilation time (2.77s on my machine).

llvmbot added backend:RISC-V llvm:analysis Includes value tracking, cost tables and constant folding labels Sep 25, 2023

skachkov-sc requested review from michaelmaitland and preames September 25, 2023 14:23

preames reviewed Sep 25, 2023

View reviewed changes

skachkov-sc force-pushed the insert-extract-cost branch from a863a5d to 5c0cd0c Compare September 26, 2023 12:50

preames reviewed Sep 26, 2023

View reviewed changes

preames approved these changes Sep 26, 2023

View reviewed changes

skachkov-sc added 2 commits September 27, 2023 12:05

[RISCV][CostModel] Add pre-commit test

bf83e01

skachkov-sc force-pushed the insert-extract-cost branch from 62ea144 to 27394ae Compare September 27, 2023 10:05

skachkov-sc merged commit 3d7df0a into llvm:main Sep 27, 2023

dmsh11 mentioned this pull request Oct 17, 2023

New Landscape Entry - Syntacore riscv-admin/riscv-landscape#539

Closed

joe-img mentioned this pull request May 16, 2024

[RISCV][LoopUnroll/CostModel] Loops with Insert/Extract element using induction variable without vector are not unrolled #92366

Open

lukel97 mentioned this pull request Nov 25, 2024

[RISCV] Add cost for @llvm.experimental.vp.splat #117313

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RISCV][CostModel] Estimate cost of Extract/InsertElement with non-constant index when vector instructions are not available #67334

[RISCV][CostModel] Estimate cost of Extract/InsertElement with non-constant index when vector instructions are not available #67334

Uh oh!

skachkov-sc commented Sep 25, 2023

Uh oh!

llvmbot commented Sep 25, 2023 •

edited

Loading

Uh oh!

skachkov-sc commented Sep 25, 2023

Uh oh!

preames left a comment

Uh oh!

preames Sep 25, 2023

Uh oh!

skachkov-sc Sep 26, 2023

Uh oh!

preames Sep 26, 2023

Uh oh!

skachkov-sc Sep 26, 2023

Uh oh!

preames Sep 26, 2023 •

edited

Loading

Uh oh!

preames Sep 25, 2023

Uh oh!

skachkov-sc Sep 26, 2023

Uh oh!

preames Sep 25, 2023

Uh oh!

skachkov-sc Sep 26, 2023

Uh oh!

preames Sep 26, 2023

Uh oh!

skachkov-sc Sep 26, 2023

Uh oh!

preames left a comment

Uh oh!

Uh oh!

		@@ -0,0 +1,114 @@
		; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 3

[RISCV][CostModel] Estimate cost of Extract/InsertElement with non-constant index when vector instructions are not available #67334

[RISCV][CostModel] Estimate cost of Extract/InsertElement with non-constant index when vector instructions are not available #67334

Uh oh!

Conversation

skachkov-sc commented Sep 25, 2023

Uh oh!

llvmbot commented Sep 25, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

skachkov-sc commented Sep 25, 2023

Uh oh!

preames left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

preames Sep 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

preames left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

llvmbot commented Sep 25, 2023 •

edited

Loading

preames Sep 26, 2023 •

edited

Loading