Skip to content

[RISCV][CostModel] Estimate cost of Extract/InsertElement with non-constant index when vector instructions are not available #67334

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 27, 2023

Conversation

skachkov-sc
Copy link
Contributor

This patch fixes the compilation time issue of matrix-types-spec test from test-suite.

Reproduction of the problem:

clang++ -DNDEBUG --target=riscv64-linux-gnu --sysroot=<sysroot path> --gcc-toolchain=<gcc path> -O2 -fenable-matrix <test-suite-path>/SingleSource/UnitTests/matrix-types-spec.cpp

On my machine, compilation takes 50.44s. In comparison, the same test with RVV (-march=rv64gcv) compiles in 3.06s, and for x86-64 target it takes 1.71s. It turns out that the main issue is unrolling of loop in multiplySpec function, that has extractelements with non-constant index:

for.body9.i:                                      ; preds = %for.body9.i, %for.cond6.preheader.i
  %indvars.iv.i92 = phi i64 [ 0, %for.cond6.preheader.i ], [ %indvars.iv.next.i93, %for.body9.i ]
  %Elt.033.i = phi double [ 0.000000e+00, %for.cond6.preheader.i ], [ %80, %for.body9.i ]
  %77 = mul nuw nsw i64 %indvars.iv.i92, 25
  %78 = add nuw nsw i64 %77, %indvars.iv39.i91
  %matrixext.i = extractelement <475 x double> %62, i64 %78
  %79 = add nuw nsw i64 %indvars.iv.i92, %74
  %matrixext13.i = extractelement <209 x double> %73, i64 %79
  %80 = tail call double @llvm.fmuladd.f64(double %matrixext.i, double %matrixext13.i, double %Elt.033.i)
  %indvars.iv.next.i93 = add nuw nsw i64 %indvars.iv.i92, 1
  %exitcond.not.i94 = icmp eq i64 %indvars.iv.next.i93, 19
  br i1 %exitcond.not.i94, label %for.cond.cleanup8.i, label %for.body9.i, !llvm.loop !21

When RVV is supported, extractelement/insertelement with non-constant index can be lowered quite efficiently with vslidedown/vslideup; otherwise it's lowered via stack memory operations, i.e. for extractelement each vector element is stored on stack and then the needed element is loaded back; for insertelement is stores all vector elements, rewrites the required element value and then loads vector back. Currently the cost of such expensive operation is estimated as zero, so loop unroll processes the loop more aggresively. The proper estimation of cost (in a way like in X86 target) prohibits unrolling of this loop and fixes compilation time (2.77s on my machine).

@llvmbot
Copy link
Member

llvmbot commented Sep 25, 2023

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-risc-v

Changes

This patch fixes the compilation time issue of matrix-types-spec test from test-suite.

Reproduction of the problem:

clang++ -DNDEBUG --target=riscv64-linux-gnu --sysroot=&lt;sysroot path&gt; --gcc-toolchain=&lt;gcc path&gt; -O2 -fenable-matrix &lt;test-suite-path&gt;/SingleSource/UnitTests/matrix-types-spec.cpp

On my machine, compilation takes 50.44s. In comparison, the same test with RVV (-march=rv64gcv) compiles in 3.06s, and for x86-64 target it takes 1.71s. It turns out that the main issue is unrolling of loop in multiplySpec function, that has extractelements with non-constant index:

for.body9.i:                                      ; preds = %for.body9.i, %for.cond6.preheader.i
  %indvars.iv.i92 = phi i64 [ 0, %for.cond6.preheader.i ], [ %indvars.iv.next.i93, %for.body9.i ]
  %Elt.033.i = phi double [ 0.000000e+00, %for.cond6.preheader.i ], [ %80, %for.body9.i ]
  %77 = mul nuw nsw i64 %indvars.iv.i92, 25
  %78 = add nuw nsw i64 %77, %indvars.iv39.i91
  %matrixext.i = extractelement &lt;475 x double&gt; %62, i64 %78
  %79 = add nuw nsw i64 %indvars.iv.i92, %74
  %matrixext13.i = extractelement &lt;209 x double&gt; %73, i64 %79
  %80 = tail call double @<!-- -->llvm.fmuladd.f64(double %matrixext.i, double %matrixext13.i, double %Elt.033.i)
  %indvars.iv.next.i93 = add nuw nsw i64 %indvars.iv.i92, 1
  %exitcond.not.i94 = icmp eq i64 %indvars.iv.next.i93, 19
  br i1 %exitcond.not.i94, label %for.cond.cleanup8.i, label %for.body9.i, !llvm.loop !21

When RVV is supported, extractelement/insertelement with non-constant index can be lowered quite efficiently with vslidedown/vslideup; otherwise it's lowered via stack memory operations, i.e. for extractelement each vector element is stored on stack and then the needed element is loaded back; for insertelement is stores all vector elements, rewrites the required element value and then loads vector back. Currently the cost of such expensive operation is estimated as zero, so loop unroll processes the loop more aggresively. The proper estimation of cost (in a way like in X86 target) prohibits unrolling of this loop and fixes compilation time (2.77s on my machine).


Patch is 21.08 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67334.diff

3 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+18)
  • (added) llvm/test/Analysis/CostModel/RISCV/extractelement.ll (+114)
  • (added) llvm/test/Analysis/CostModel/RISCV/insertelement.ll (+114)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 4a34362a792f150..2df4ea77033f1c6 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -1456,6 +1456,24 @@ InstructionCost RISCVTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
       Opcode != Instruction::InsertElement)
     return BaseT::getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1);
 
+  // Extract/InsertElement with non-constant index is very costly without
+  // V instructions; estimate cost of loads/stores sequence via the stack:
+  // ExtractElement cost: store vector to stack, load scalar;
+  // InsertElement cost: store vector to stack, store scalar, load vector.
+  if (Index == -1U && !ST->hasVInstructions()) {
+    auto *VecTy = cast<VectorType>(Val);
+    auto *ElemTy = VecTy->getElementType();
+    auto NumElems = VecTy->getElementCount().getKnownMinValue();
+    auto Align = DL.getPrefTypeAlign(ElemTy);
+    InstructionCost LoadCost =
+        getMemoryOpCost(Instruction::Load, ElemTy, Align, 0, CostKind);
+    InstructionCost StoreCost =
+        getMemoryOpCost(Instruction::Store, ElemTy, Align, 0, CostKind);
+    return Opcode == Instruction::ExtractElement
+               ? StoreCost * NumElems + LoadCost
+               : (StoreCost + LoadCost) * NumElems + StoreCost;
+  }
+
   // Legalize the type.
   std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Val);
 
diff --git a/llvm/test/Analysis/CostModel/RISCV/extractelement.ll b/llvm/test/Analysis/CostModel/RISCV/extractelement.ll
new file mode 100644
index 000000000000000..8f9275f717f1176
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/RISCV/extractelement.ll
@@ -0,0 +1,114 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 3
+; RUN: opt -passes="print<cost-model>" 2>&1 -disable-output -mtriple=riscv32 -mattr=+f,+d,+zfh < %s | FileCheck %s --check-prefixes=RV32
+; RUN: opt -passes="print<cost-model>" 2>&1 -disable-output -mtriple=riscv64 -mattr=+f,+d,+zfh < %s | FileCheck %s --check-prefixes=RV64
+
+define void @extractelement_int(i32 %x) {
+; RV32-LABEL: 'extractelement_int'
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i8 = extractelement <2 x i8> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i8 = extractelement <4 x i8> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i8 = extractelement <8 x i8> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i8 = extractelement <16 x i8> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i16 = extractelement <2 x i16> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i16 = extractelement <4 x i16> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i16 = extractelement <8 x i16> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i16 = extractelement <16 x i16> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i32 = extractelement <2 x i32> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i32 = extractelement <4 x i32> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i32 = extractelement <8 x i32> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i32 = extractelement <16 x i32> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v2i64 = extractelement <2 x i64> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %v4i64 = extractelement <4 x i64> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %v8i64 = extractelement <8 x i64> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %v16i64 = extractelement <16 x i64> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; RV64-LABEL: 'extractelement_int'
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i8 = extractelement <2 x i8> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i8 = extractelement <4 x i8> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i8 = extractelement <8 x i8> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i8 = extractelement <16 x i8> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i16 = extractelement <2 x i16> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i16 = extractelement <4 x i16> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i16 = extractelement <8 x i16> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i16 = extractelement <16 x i16> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i32 = extractelement <2 x i32> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i32 = extractelement <4 x i32> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i32 = extractelement <8 x i32> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i32 = extractelement <16 x i32> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2i64 = extractelement <2 x i64> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4i64 = extractelement <4 x i64> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8i64 = extractelement <8 x i64> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16i64 = extractelement <16 x i64> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+  %v2i8 = extractelement <2 x i8> undef, i32 %x
+  %v4i8 = extractelement <4 x i8> undef, i32 %x
+  %v8i8 = extractelement <8 x i8> undef, i32 %x
+  %v16i8 = extractelement <16 x i8> undef, i32 %x
+
+  %v2i16 = extractelement <2 x i16> undef, i32 %x
+  %v4i16 = extractelement <4 x i16> undef, i32 %x
+  %v8i16 = extractelement <8 x i16> undef, i32 %x
+  %v16i16 = extractelement <16 x i16> undef, i32 %x
+
+  %v2i32 = extractelement <2 x i32> undef, i32 %x
+  %v4i32 = extractelement <4 x i32> undef, i32 %x
+  %v8i32 = extractelement <8 x i32> undef, i32 %x
+  %v16i32 = extractelement <16 x i32> undef, i32 %x
+
+  %v2i64 = extractelement <2 x i64> undef, i32 %x
+  %v4i64 = extractelement <4 x i64> undef, i32 %x
+  %v8i64 = extractelement <8 x i64> undef, i32 %x
+  %v16i64 = extractelement <16 x i64> undef, i32 %x
+
+  ret void
+}
+
+define void @extractelement_fp(i32 %x) {
+; RV32-LABEL: 'extractelement_fp'
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f16 = extractelement <2 x half> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4f16 = extractelement <4 x half> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8f16 = extractelement <8 x half> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16f16 = extractelement <16 x half> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f32 = extractelement <2 x float> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4f32 = extractelement <4 x float> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8f32 = extractelement <8 x float> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16f32 = extractelement <16 x float> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f64 = extractelement <2 x double> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4f64 = extractelement <4 x double> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8f64 = extractelement <8 x double> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16f64 = extractelement <16 x double> undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; RV64-LABEL: 'extractelement_fp'
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f16 = extractelement <2 x half> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4f16 = extractelement <4 x half> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8f16 = extractelement <8 x half> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16f16 = extractelement <16 x half> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f32 = extractelement <2 x float> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4f32 = extractelement <4 x float> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8f32 = extractelement <8 x float> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16f32 = extractelement <16 x float> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f64 = extractelement <2 x double> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v4f64 = extractelement <4 x double> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v8f64 = extractelement <8 x double> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v16f64 = extractelement <16 x double> undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+  %v2f16 = extractelement <2 x half> undef, i32 %x
+  %v4f16 = extractelement <4 x half> undef, i32 %x
+  %v8f16 = extractelement <8 x half> undef, i32 %x
+  %v16f16 = extractelement <16 x half> undef, i32 %x
+
+  %v2f32 = extractelement <2 x float> undef, i32 %x
+  %v4f32 = extractelement <4 x float> undef, i32 %x
+  %v8f32 = extractelement <8 x float> undef, i32 %x
+  %v16f32 = extractelement <16 x float> undef, i32 %x
+
+  %v2f64 = extractelement <2 x double> undef, i32 %x
+  %v4f64 = extractelement <4 x double> undef, i32 %x
+  %v8f64 = extractelement <8 x double> undef, i32 %x
+  %v16f64 = extractelement <16 x double> undef, i32 %x
+
+  ret void
+}
diff --git a/llvm/test/Analysis/CostModel/RISCV/insertelement.ll b/llvm/test/Analysis/CostModel/RISCV/insertelement.ll
new file mode 100644
index 000000000000000..4cdeaf26e19e4d1
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/RISCV/insertelement.ll
@@ -0,0 +1,114 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 3
+; RUN: opt -passes="print<cost-model>" 2>&1 -disable-output -mtriple=riscv32 -mattr=+f,+d,+zfh < %s | FileCheck %s --check-prefixes=RV32
+; RUN: opt -passes="print<cost-model>" 2>&1 -disable-output -mtriple=riscv64 -mattr=+f,+d,+zfh < %s | FileCheck %s --check-prefixes=RV64
+
+define void @insertelement_int(i32 %x) {
+; RV32-LABEL: 'insertelement_int'
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i8 = insertelement <2 x i8> undef, i8 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i8 = insertelement <4 x i8> undef, i8 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i8 = insertelement <8 x i8> undef, i8 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i8 = insertelement <16 x i8> undef, i8 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i16 = insertelement <2 x i16> undef, i16 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i16 = insertelement <4 x i16> undef, i16 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i16 = insertelement <8 x i16> undef, i16 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i16 = insertelement <16 x i16> undef, i16 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i32 = insertelement <2 x i32> undef, i32 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i32 = insertelement <4 x i32> undef, i32 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i32 = insertelement <8 x i32> undef, i32 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i32 = insertelement <16 x i32> undef, i32 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %v2i64 = insertelement <2 x i64> undef, i64 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %v4i64 = insertelement <4 x i64> undef, i64 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %v8i64 = insertelement <8 x i64> undef, i64 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 66 for instruction: %v16i64 = insertelement <16 x i64> undef, i64 undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; RV64-LABEL: 'insertelement_int'
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i8 = insertelement <2 x i8> undef, i8 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i8 = insertelement <4 x i8> undef, i8 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i8 = insertelement <8 x i8> undef, i8 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i8 = insertelement <16 x i8> undef, i8 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i16 = insertelement <2 x i16> undef, i16 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i16 = insertelement <4 x i16> undef, i16 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i16 = insertelement <8 x i16> undef, i16 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i16 = insertelement <16 x i16> undef, i16 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i32 = insertelement <2 x i32> undef, i32 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i32 = insertelement <4 x i32> undef, i32 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i32 = insertelement <8 x i32> undef, i32 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i32 = insertelement <16 x i32> undef, i32 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2i64 = insertelement <2 x i64> undef, i64 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4i64 = insertelement <4 x i64> undef, i64 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8i64 = insertelement <8 x i64> undef, i64 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16i64 = insertelement <16 x i64> undef, i64 undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+  %v2i8 = insertelement <2 x i8> undef, i8 undef, i32 %x
+  %v4i8 = insertelement <4 x i8> undef, i8 undef, i32 %x
+  %v8i8 = insertelement <8 x i8> undef, i8 undef, i32 %x
+  %v16i8 = insertelement <16 x i8> undef, i8 undef, i32 %x
+
+  %v2i16 = insertelement <2 x i16> undef, i16 undef, i32 %x
+  %v4i16 = insertelement <4 x i16> undef, i16 undef, i32 %x
+  %v8i16 = insertelement <8 x i16> undef, i16 undef, i32 %x
+  %v16i16 = insertelement <16 x i16> undef, i16 undef, i32 %x
+
+  %v2i32 = insertelement <2 x i32> undef, i32 undef, i32 %x
+  %v4i32 = insertelement <4 x i32> undef, i32 undef, i32 %x
+  %v8i32 = insertelement <8 x i32> undef, i32 undef, i32 %x
+  %v16i32 = insertelement <16 x i32> undef, i32 undef, i32 %x
+
+  %v2i64 = insertelement <2 x i64> undef, i64 undef, i32 %x
+  %v4i64 = insertelement <4 x i64> undef, i64 undef, i32 %x
+  %v8i64 = insertelement <8 x i64> undef, i64 undef, i32 %x
+  %v16i64 = insertelement <16 x i64> undef, i64 undef, i32 %x
+
+  ret void
+}
+
+define void @insertelement_fp(i32 %x) {
+; RV32-LABEL: 'insertelement_fp'
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2f16 = insertelement <2 x half> undef, half undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4f16 = insertelement <4 x half> undef, half undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8f16 = insertelement <8 x half> undef, half undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16f16 = insertelement <16 x half> undef, half undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2f32 = insertelement <2 x float> undef, float undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4f32 = insertelement <4 x float> undef, float undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8f32 = insertelement <8 x float> undef, float undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16f32 = insertelement <16 x float> undef, float undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2f64 = insertelement <2 x double> undef, double undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4f64 = insertelement <4 x double> undef, double undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8f64 = insertelement <8 x double> undef, double undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16f64 = insertelement <16 x double> undef, double undef, i32 %x
+; RV32-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; RV64-LABEL: 'insertelement_fp'
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2f16 = insertelement <2 x half> undef, half undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4f16 = insertelement <4 x half> undef, half undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8f16 = insertelement <8 x half> undef, half undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16f16 = insertelement <16 x half> undef, half undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2f32 = insertelement <2 x float> undef, float undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4f32 = insertelement <4 x float> undef, float undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v8f32 = insertelement <8 x float> undef, float undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v16f32 = insertelement <16 x float> undef, float undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v2f64 = insertelement <2 x double> undef, double undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v4f64 = insertelement <4 x double> undef, double undef, i32 %x
+; RV64-NEXT:  Cost Model: Found an estimated cost of 17 for inst...
[truncated]

@skachkov-sc
Copy link
Contributor Author

FYI @fhahn (looks like that access to builtin matrix type through non-constant index is very inefficient on targets without vector registers)

Copy link
Collaborator

@preames preames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch, good catch. Minor review comments inline, this looks straight forward and likely to converge quickly.

p.s. Mostly for my own later self... I don't think this interacts with build_vector or explode_vector costing and/or SLP because those cases should only involve constant indices.

if (Index == -1U && !ST->hasVInstructions()) {
auto *VecTy = cast<VectorType>(Val);
auto *ElemTy = VecTy->getElementType();
auto NumElems = VecTy->getElementCount().getKnownMinValue();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will produce a significant under-estimate on cost for scalable vectors as the number elements will be very low. See getEstimatedVLFor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I've realized that scalable vectors types can't be lowered with scalars (backend fails with "Don't know how to legalize this scalable vector type" error), and currently the reported cost of these operations is Invalid. So probably it's better to return Invalid cost instead of numeric value to show that it can't be lowered (added some test cases with vscale vectors)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by your comment. I wouldn't expect scalable vectors to be lowered via scalars, I'd expect this case to be lowered via the stack and thus your new costing would be relevant. Are you saying that we fail to lower this case entirely with scalable vectors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the bad wording, I mean that is possible to lower only fixed-width vector insertelement/extractelement via stack, scalable vectors will fail now: https://godbolt.org/z/9qj68zfTv

Copy link
Collaborator

@preames preames Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That test crashes in argument lowering, not insertelement handling. However, your point stands, here's an alternative example: https://godbolt.org/z/13v643fGe

It does look like we're trying to scalarize the type rather than legalize the operation via the stack.

@@ -1456,6 +1456,24 @@ InstructionCost RISCVTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
Opcode != Instruction::InsertElement)
return BaseT::getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1);

// Extract/InsertElement with non-constant index is very costly without
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placement wise, this should probably be under the if (!LT.second.isVector()) check below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, after moving the code I've discovered one more missing case here: there are floating-point tests in rvv-insertelement.ll, rvv-extractelement.ll and with zve64x extension the reported cost was zero. However, the actual cost is much higher because in absence of zve64f/zve64d extensions it's also lowered via stack (proof: https://godbolt.org/z/aob3MEMPP). Now these cases are also processed (and reflected in tests)

@@ -0,0 +1,114 @@
; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you precommit the tests so that we can see the diff? You don't need separate review for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request consists from 2 commits (pre-commit test and code change), that can be checked separately (by choosing the required one on GitHub's commits tab)

if (!LT.second.isVector()) {
auto *FixedVecTy = dyn_cast<FixedVectorType>(Val);
// Scalable vectors can't be legalized with scalars, return invalid cost.
if (!FixedVecTy)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this bit actually get hit? I'd have expected it to go through the if (LT.second.isScalableVector() && !LT.first.isValid()) check just below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the condition below handles it, removed this

Copy link
Collaborator

@preames preames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

…onstant index when vector instructions are not available

When vector instructions are not available, Extract/InsertElement with N-element
vector types and non-constant indexes are lowered in this way:
ExtractElement: N stores of each element on stack, 1 load of required element
InsertElement: N stores of each element on stack, 1 store to rewrite required
element, N loads of each element back

This patch implements cost calculation of these operations to fix compilation
time of matrix-types-spec test.
@skachkov-sc skachkov-sc merged commit 3d7df0a into llvm:main Sep 27, 2023
legrosbuffle pushed a commit to legrosbuffle/llvm-project that referenced this pull request Sep 29, 2023
…nstant index when vector instructions are not available (llvm#67334)

This patch fixes the compilation time issue of matrix-types-spec test
from test-suite.

Reproduction of the problem:
```
clang++ -DNDEBUG --target=riscv64-linux-gnu --sysroot=<sysroot path> --gcc-toolchain=<gcc path> -O2 -fenable-matrix <test-suite-path>/SingleSource/UnitTests/matrix-types-spec.cpp
```

On my machine, compilation takes 50.44s. In comparison, the same test
with RVV (-march=rv64gcv) compiles in 3.06s, and for x86-64 target it
takes 1.71s. It turns out that the main issue is unrolling of loop in
multiplySpec function, that has extractelements with non-constant index:
```
for.body9.i:                                      ; preds = %for.body9.i, %for.cond6.preheader.i
  %indvars.iv.i92 = phi i64 [ 0, %for.cond6.preheader.i ], [ %indvars.iv.next.i93, %for.body9.i ]
  %Elt.033.i = phi double [ 0.000000e+00, %for.cond6.preheader.i ], [ %80, %for.body9.i ]
  %77 = mul nuw nsw i64 %indvars.iv.i92, 25
  %78 = add nuw nsw i64 %77, %indvars.iv39.i91
  %matrixext.i = extractelement <475 x double> %62, i64 %78
  %79 = add nuw nsw i64 %indvars.iv.i92, %74
  %matrixext13.i = extractelement <209 x double> %73, i64 %79
  %80 = tail call double @llvm.fmuladd.f64(double %matrixext.i, double %matrixext13.i, double %Elt.033.i)
  %indvars.iv.next.i93 = add nuw nsw i64 %indvars.iv.i92, 1
  %exitcond.not.i94 = icmp eq i64 %indvars.iv.next.i93, 19
  br i1 %exitcond.not.i94, label %for.cond.cleanup8.i, label %for.body9.i, !llvm.loop !21
```

When RVV is supported, extractelement/insertelement with non-constant
index can be lowered quite efficiently with vslidedown/vslideup;
otherwise it's lowered via stack memory operations, i.e. for
extractelement each vector element is stored on stack and then the
needed element is loaded back; for insertelement is stores all vector
elements, rewrites the required element value and then loads vector
back. Currently the cost of such expensive operation is estimated as
zero, so loop unroll processes the loop more aggresively. The proper
estimation of cost (in a way like in X86 target) prohibits unrolling of
this loop and fixes compilation time (2.77s on my machine).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants