[LV] Enable considering higher VFs when data extend ops are present in the loop #137593
base: main
Conversation
LV currently limits the VF based on the widest type in the loop. This might not be beneficial for loops that contain data extend ops. In some cases, this strategy has been found to inhibit considering higher VFs even though a higher VF might be profitable. This patch relaxes that constraint to enable higher VFs and leaves the decision of whether a particular VF is beneficial to the cost model.
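To illustrate the kind of loop this affects, here is a hypothetical i8-to-i32 multiply-accumulate loop (my own example, not taken from the patch or the benchmark discussed below): the widest type is the i32 accumulator, so with 128-bit vectors the current heuristic caps the VF at 4, even though the i8 inputs could feed 16 lanes if the cost model found that profitable.

```cpp
// Hypothetical example, not from the PR: an i8 -> i32 dot-product loop.
// WidestType = i32 (the accumulator), SmallestType = i8 (the loads).
// With 128-bit vectors, capping the VF by the widest type gives 128/32 = 4,
// whereas the smallest type would allow a VF of 128/8 = 16.
int dot_u8(const unsigned char *a, const unsigned char *b, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += (int)a[i] * (int)b[i]; // zero-extend i8 -> i32 before the multiply
  return sum;
}
```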
@llvm/pr-subscribers-backend-aarch64 @llvm/pr-subscribers-backend-powerpc

Author: Sushant Gokhale (sushgokh)

Patch is 650.15 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/137593.diff

50 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index f985e883d0dde..84444435bacbd 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4123,6 +4123,15 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
auto MaxVectorElementCount = ElementCount::get(
llvm::bit_floor(WidestRegister.getKnownMinValue() / WidestType),
ComputeScalableMaxVF);
+
+ // For loops with extend operations (e.g. zext, sext), limiting the max VF
+ // based on the widest type inhibits considering higher VFs even though
+ // vectorizing with a higher VF might be profitable. In such cases, limit
+ // the max VF based on the smallest type and leave the decision of whether
+ // a particular VF is beneficial to the cost model.
+ if (WidestType != SmallestType)
+ MaximizeBandwidth = true;
+
MaxVectorElementCount = MinVF(MaxVectorElementCount, MaxSafeVF);
LLVM_DEBUG(dbgs() << "LV: The Widest register safe to use is: "
<< (MaxVectorElementCount * WidestType) << " bits.\n");
diff --git a/llvm/test/CodeGen/WebAssembly/int-mac-reduction-loops.ll b/llvm/test/CodeGen/WebAssembly/int-mac-reduction-loops.ll
index 0184e22a3b40d..ae31672854077 100644
--- a/llvm/test/CodeGen/WebAssembly/int-mac-reduction-loops.ll
+++ b/llvm/test/CodeGen/WebAssembly/int-mac-reduction-loops.ll
@@ -1,27 +1,20 @@
; RUN: opt -mattr=+simd128 -passes=loop-vectorize %s | llc -mtriple=wasm32 -mattr=+simd128 -verify-machineinstrs -o - | FileCheck %s
-; RUN: opt -mattr=+simd128 -passes=loop-vectorize -vectorizer-maximize-bandwidth %s | llc -mtriple=wasm32 -mattr=+simd128 -verify-machineinstrs -o - | FileCheck %s --check-prefix=MAX-BANDWIDTH
+; RUN: opt -mattr=+simd128 -passes=loop-vectorize -vectorizer-maximize-bandwidth %s | llc -mtriple=wasm32 -mattr=+simd128 -verify-machineinstrs -o - | FileCheck %s
target triple = "wasm32"
define hidden i32 @i32_mac_s8(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b, i32 noundef %N) {
; CHECK-LABEL: i32_mac_s8:
-; CHECK: v128.load32_zero 0:p2align=0
-; CHECK: i16x8.extend_low_i8x16_s
-; CHECK: v128.load32_zero 0:p2align=0
-; CHECK: i16x8.extend_low_i8x16_s
-; CHECK: i32x4.extmul_low_i16x8_s
-; CHECK: i32x4.add
-
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i16x8.extend_low_i8x16_s
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i16x8.extend_low_i8x16_s
-; MAX-BANDWIDTH: i32x4.dot_i16x8_s
-; MAX-BANDWIDTH: i16x8.extend_high_i8x16_s
-; MAX-BANDWIDTH: i16x8.extend_high_i8x16_s
-; MAX-BANDWIDTH: i32x4.dot_i16x8_s
-; MAX-BANDWIDTH: i32x4.add
-; MAX-BANDWIDTH: i32x4.add
+; CHECK: v128.load
+; CHECK: i16x8.extend_low_i8x16_s
+; CHECK: v128.load
+; CHECK: i16x8.extend_low_i8x16_s
+; CHECK: i32x4.dot_i16x8_s
+; CHECK: i16x8.extend_high_i8x16_s
+; CHECK: i16x8.extend_high_i8x16_s
+; CHECK: i32x4.dot_i16x8_s
+; CHECK: i32x4.add
+; CHECK: i32x4.add
entry:
%cmp7.not = icmp eq i32 %N, 0
@@ -49,14 +42,9 @@ for.body: ; preds = %entry, %for.body
define hidden i32 @i32_mac_s16(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b, i32 noundef %N) {
; CHECK-LABEL: i32_mac_s16:
-; CHECK: i32x4.load16x4_s 0:p2align=1
-; CHECK: i32x4.load16x4_s 0:p2align=1
-; CHECK: i32x4.mul
-; CHECK: i32x4.add
-
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i32x4.dot_i16x8_s
+; CHECK: v128.load
+; CHECK: v128.load
+; CHECK: i32x4.dot_i16x8_s
entry:
%cmp7.not = icmp eq i32 %N, 0
@@ -84,37 +72,30 @@ for.body: ; preds = %entry, %for.body
define hidden i64 @i64_mac_s16(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b, i32 noundef %N) {
; CHECK-LABEL: i64_mac_s16:
-; CHECK: v128.load32_zero 0:p2align=1
-; CHECK: i32x4.extend_low_i16x8_s
-; CHECK: v128.load32_zero 0:p2align=1
-; CHECK: i32x4.extend_low_i16x8_s
-; CHECK: i64x2.extmul_low_i32x4_s
-; CHECK: i64x2.add
-
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i8x16.shuffle 12, 13, 14, 15, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_s
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i8x16.shuffle 12, 13, 14, 15, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_s
-; MAX-BANDWIDTH: i64x2.extmul_low_i32x4_s
-; MAX-BANDWIDTH: i64x2.add
-; MAX-BANDWIDTH: i8x16.shuffle 8, 9, 10, 11, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_s
-; MAX-BANDWIDTH: i8x16.shuffle 8, 9, 10, 11, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_s
-; MAX-BANDWIDTH: i64x2.extmul_low_i32x4_s
-; MAX-BANDWIDTH: i64x2.add
-; MAX-BANDWIDTH: i8x16.shuffle 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_s
-; MAX-BANDWIDTH: i8x16.shuffle 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_s
-; MAX-BANDWIDTH: i64x2.extmul_low_i32x4_s
-; MAX-BANDWIDTH: i64x2.add
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_s
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_s
-; MAX-BANDWIDTH: i64x2.extmul_low_i32x4_s
-; MAX-BANDWIDTH: i64x2.add
+; CHECK: v128.load
+; CHECK: i8x16.shuffle 12, 13, 14, 15, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK: i32x4.extend_low_i16x8_s
+; CHECK: v128.load
+; CHECK: i8x16.shuffle 12, 13, 14, 15, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK: i32x4.extend_low_i16x8_s
+; CHECK: i64x2.extmul_low_i32x4_s
+; CHECK: i64x2.add
+; CHECK: i8x16.shuffle 8, 9, 10, 11, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK: i32x4.extend_low_i16x8_s
+; CHECK: i8x16.shuffle 8, 9, 10, 11, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK: i32x4.extend_low_i16x8_s
+; CHECK: i64x2.extmul_low_i32x4_s
+; CHECK: i64x2.add
+; CHECK: i8x16.shuffle 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK: i32x4.extend_low_i16x8_s
+; CHECK: i8x16.shuffle 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK: i32x4.extend_low_i16x8_s
+; CHECK: i64x2.extmul_low_i32x4_s
+; CHECK: i64x2.add
+; CHECK: i32x4.extend_low_i16x8_s
+; CHECK: i32x4.extend_low_i16x8_s
+; CHECK: i64x2.extmul_low_i32x4_s
+; CHECK: i64x2.add
entry:
%cmp7.not = icmp eq i32 %N, 0
@@ -142,19 +123,13 @@ for.body: ; preds = %entry, %for.body
define hidden i64 @i64_mac_s32(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b, i32 noundef %N) {
; CHECK-LABEL: i64_mac_s32:
-; CHECK: v128.load64_zero 0:p2align=2
-; CHECK: v128.load64_zero 0:p2align=2
-; CHECK: i32x4.mul
-; CHECK: i64x2.extend_low_i32x4_s
-; CHECK: i64x2.add
-
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i32x4.mul
-; MAX-BANDWIDTH: i64x2.extend_high_i32x4_s
-; MAX-BANDWIDTH: i64x2.add
-; MAX-BANDWIDTH: i64x2.extend_low_i32x4_s
-; MAX-BANDWIDTH: i64x2.add
+; CHECK: v128.load
+; CHECK: v128.load
+; CHECK: i32x4.mul
+; CHECK: i64x2.extend_high_i32x4_s
+; CHECK: i64x2.add
+; CHECK: i64x2.extend_low_i32x4_s
+; CHECK: i64x2.add
entry:
%cmp6.not = icmp eq i32 %N, 0
@@ -181,25 +156,18 @@ for.body: ; preds = %entry, %for.body
define hidden i32 @i32_mac_u8(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b, i32 noundef %N) {
; CHECK-LABEL: i32_mac_u8:
-; CHECK: v128.load32_zero 0:p2align=0
-; CHECK: i16x8.extend_low_i8x16_u
-; CHECK: v128.load32_zero 0:p2align=0
-; CHECK: i16x8.extend_low_i8x16_u
-; CHECK: i32x4.extmul_low_i16x8_u
-; CHECK: i32x4.add
-
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i16x8.extmul_low_i8x16_u
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_u
-; MAX-BANDWIDTH: i32x4.extend_high_i16x8_u
-; MAX-BANDWIDTH: i32x4.add
-; MAX-BANDWIDTH: i16x8.extmul_high_i8x16_u
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_u
-; MAX-BANDWIDTH: i32x4.extend_high_i16x8_u
-; MAX-BANDWIDTH: i32x4.add
-; MAX-BANDWIDTH: i32x4.add
-; MAX-BANDWIDTH: i32x4.add
+; CHECK: v128.load
+; CHECK: v128.load
+; CHECK: i16x8.extmul_low_i8x16_u
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i32x4.extend_high_i16x8_u
+; CHECK: i32x4.add
+; CHECK: i16x8.extmul_high_i8x16_u
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i32x4.extend_high_i16x8_u
+; CHECK: i32x4.add
+; CHECK: i32x4.add
+; CHECK: i32x4.add
entry:
%cmp7.not = icmp eq i32 %N, 0
@@ -227,17 +195,12 @@ for.body: ; preds = %entry, %for.body
define hidden i32 @i32_mac_u16(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b, i32 noundef %N) {
; CHECK-LABEL: i32_mac_u16:
-; CHECK: i32x4.load16x4_u 0:p2align=1
-; CHECK: i32x4.load16x4_u 0:p2align=1
-; CHECK: i32x4.mul
-; CHECK: i32x4.add
-
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i32x4.extmul_low_i16x8_u
-; MAX-BANDWIDTH: i32x4.extmul_high_i16x8_u
-; MAX-BANDWIDTH: i32x4.add
-; MAX-BANDWIDTH: i32x4.add
+; CHECK: v128.load
+; CHECK: v128.load
+; CHECK: i32x4.extmul_low_i16x8_u
+; CHECK: i32x4.extmul_high_i16x8_u
+; CHECK: i32x4.add
+; CHECK: i32x4.add
entry:
%cmp7.not = icmp eq i32 %N, 0
@@ -265,21 +228,16 @@ for.body: ; preds = %entry, %for.body
define hidden i32 @i32_mac_u16_s16(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b, i32 noundef %N) {
; CHECK-LABEL: i32_mac_u16_s16:
-; CHECK: i32x4.load16x4_s 0:p2align=1
-; CHECK: i32x4.load16x4_u 0:p2align=1
-; CHECK: i32x4.mul
-; CHECK: i32x4.add
-
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i32x4.extend_high_i16x8_s
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i32x4.extend_high_i16x8_u
-; MAX-BANDWIDTH: i32x4.mul
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_s
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_u
-; MAX-BANDWIDTH: i32x4.mul
-; MAX-BANDWIDTH: i32x4.add
-; MAX-BANDWIDTH: i32x4.add
+; CHECK: v128.load
+; CHECK: i32x4.extend_high_i16x8_s
+; CHECK: v128.load
+; CHECK: i32x4.extend_high_i16x8_u
+; CHECK: i32x4.mul
+; CHECK: i32x4.extend_low_i16x8_s
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i32x4.mul
+; CHECK: i32x4.add
+; CHECK: i32x4.add
entry:
%cmp7.not = icmp eq i32 %N, 0
@@ -307,37 +265,30 @@ for.body: ; preds = %entry, %for.body
define hidden i64 @i64_mac_u16(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b, i32 noundef %N) {
; CHECK-LABEL: i64_mac_u16:
-; CHECK: v128.load32_zero 0:p2align=1
-; CHECK: i32x4.extend_low_i16x8_u
-; CHECK: v128.load32_zero 0:p2align=1
-; CHECK: i32x4.extend_low_i16x8_u
-; CHECK: i64x2.extmul_low_i32x4_u
-; CHECK: i64x2.add
-
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i8x16.shuffle 12, 13, 14, 15, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_u
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i8x16.shuffle 12, 13, 14, 15, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_u
-; MAX-BANDWIDTH: i64x2.extmul_low_i32x4_u
-; MAX-BANDWIDTH: i64x2.add
-; MAX-BANDWIDTH: i8x16.shuffle 8, 9, 10, 11, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_u
-; MAX-BANDWIDTH: i8x16.shuffle 8, 9, 10, 11, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_u
-; MAX-BANDWIDTH: i64x2.extmul_low_i32x4_u
-; MAX-BANDWIDTH: i64x2.add
-; MAX-BANDWIDTH: i8x16.shuffle 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_u
-; MAX-BANDWIDTH: i8x16.shuffle 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_u
-; MAX-BANDWIDTH: i64x2.extmul_low_i32x4_u
-; MAX-BANDWIDTH: i64x2.add
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_u
-; MAX-BANDWIDTH: i32x4.extend_low_i16x8_u
-; MAX-BANDWIDTH: i64x2.extmul_low_i32x4_u
-; MAX-BANDWIDTH: i64x2.add
+; CHECK: v128.load
+; CHECK: i8x16.shuffle 12, 13, 14, 15, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: v128.load
+; CHECK: i8x16.shuffle 12, 13, 14, 15, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i64x2.extmul_low_i32x4_u
+; CHECK: i64x2.add
+; CHECK: i8x16.shuffle 8, 9, 10, 11, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i8x16.shuffle 8, 9, 10, 11, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i64x2.extmul_low_i32x4_u
+; CHECK: i64x2.add
+; CHECK: i8x16.shuffle 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i8x16.shuffle 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i64x2.extmul_low_i32x4_u
+; CHECK: i64x2.add
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i64x2.extmul_low_i32x4_u
+; CHECK: i64x2.add
entry:
%cmp8.not = icmp eq i32 %N, 0
@@ -365,19 +316,13 @@ for.body: ; preds = %entry, %for.body
define hidden i64 @i64_mac_u32(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b, i32 noundef %N) {
; CHECK-LABEL: i64_mac_u32:
-; CHECK: v128.load64_zero 0:p2align=2
-; CHECK: v128.load64_zero 0:p2align=2
-; CHECK: i32x4.mul
-; CHECK: i64x2.extend_low_i32x4_u
-; CHECK: i64x2.add
-
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: v128.load
-; MAX-BANDWIDTH: i32x4.mul
-; MAX-BANDWIDTH: i64x2.extend_high_i32x4_u
-; MAX-BANDWIDTH: i64x2.add
-; MAX-BANDWIDTH: i64x2.extend_low_i32x4_u
-; MAX-BANDWIDTH: i64x2.add
+; CHECK: v128.load
+; CHECK: v128.load
+; CHECK: i32x4.mul
+; CHECK: i64x2.extend_high_i32x4_u
+; CHECK: i64x2.add
+; CHECK: i64x2.extend_low_i32x4_u
+; CHECK: i64x2.add
entry:
%cmp6.not = icmp eq i32 %N, 0
diff --git a/llvm/test/CodeGen/WebAssembly/interleave.ll b/llvm/test/CodeGen/WebAssembly/interleave.ll
index c20b5e42c4850..5572510fa02ea 100644
--- a/llvm/test/CodeGen/WebAssembly/interleave.ll
+++ b/llvm/test/CodeGen/WebAssembly/interleave.ll
@@ -15,13 +15,37 @@ target datalayout = "e-m:e-p:32:32-p10:8:8-p20:8:8-i64:64-i128:128-n32:64-S128-n
; Function Attrs: nofree norecurse nosync nounwind memory(argmem: readwrite)
define hidden void @accumulate8x2(ptr dead_on_unwind noalias writable sret(%struct.Output32x2) align 4 captures(none) %0, ptr noundef readonly captures(none) %1, i32 noundef %2) local_unnamed_addr #0 {
; CHECK-LABEL: accumulate8x2:
-; CHECK: loop
-; CHECK: v128.load64_zero
-; CHECK: i8x16.shuffle 1, 3, 5, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK: v128.load 16:p2align=0
+; CHECK: i8x16.shuffle 9, 11, 13, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
; CHECK: i16x8.extend_low_i8x16_u
; CHECK: i32x4.extend_low_i16x8_u
-; CHECK: i32x4.add
-; CHECK: i8x16.shuffle 0, 2, 4, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK: i32x4.add
+; CHECK: i8x16.shuffle 1, 3, 5, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK: i16x8.extend_low_i8x16_u
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i32x4.add
+; CHECK: v128.load 0:p2align=0
+; CHECK: i8x16.shuffle 9, 11, 13, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK: i16x8.extend_low_i8x16_u
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i32x4.add
+; CHECK: i8x16.shuffle 1, 3, 5, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK: i16x8.extend_low_i8x16_u
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i32x4.add
+; CHECK: i8x16.shuffle 8, 10, 12, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK: i16x8.extend_low_i8x16_u
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i32x4.add
+; CHECK: i8x16.shuffle 0, 2, 4, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK: i16x8.extend_low_i8x16_u
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i32x4.add
+; CHECK: i8x16.shuffle 8, 10, 12, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK: i16x8.extend_low_i8x16_u
+; CHECK: i32x4.extend_low_i16x8_u
+; CHECK: i32x4.add
+; CHECK: i8x16.shuffle 0, 2, 4, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
; CHECK: i16x8.extend_low_i8x16_u
; CHECK: i32x4.extend_low_i16x8_u
; CHECK: i32x4.add
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
index b96a768bba24d..a34079387e246 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
@@ -661,29 +661,59 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
; DEFAULT-LABEL: define void @multiple_exit_conditions(
; DEFAULT-SAME: ptr [[SRC:%.*]], ptr noalias [[DST:%.*]]) #[[ATTR2:[0-9]+]] {
; DEFAULT-NEXT: [[ENTRY:.*]]:
-; DEFAULT-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; DEFAULT-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP6:%.*]] = mul i64 [[TMP0]], 16
+; DEFAULT-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 257, [[TMP6]]
+; DEFAULT-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
; DEFAULT: [[VECTOR_PH]]:
-; DEFAULT-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 2048
-; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
-; DEFAULT: [[VECTOR_BODY]]:
-; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; DEFAULT-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 16
+; DEFAULT-NEXT: [[N_MOD_VF:%.*]] = urem i64 257, [[TMP3]]
+; DEFAULT-NEXT: [[INDEX:%.*]] = sub i64 257, [[N_MOD_VF]]
+; DEFAULT-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 16
; DEFAULT-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
; DEFAULT-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX]]
+; DEFAULT-NEXT: [[TMP8:%.*]] = mul i64 [[INDEX]], 2
+; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
+; DEFAULT: [[VECTOR_BODY]]:
+; DEFAULT-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; DEFAULT-NEXT: [[OFFSET_IDX1:%.*]] = mul i64 [[INDEX1]], 8
+; DEFAULT-NEXT: [[NEXT_GEP1:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX1]]
; DEFAULT-NEXT: [[TMP1:%.*]] = load i16, ptr [[SRC]], align 2
-; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i16> poison, i16 [[TMP1]], i64 0
-; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i16> [[BROADCAST_SPLATINSERT]], <8 x i16> poison, <8 x i32> zeroinitializer
-; DEFAULT-NEXT: [[TMP2:%.*]] = or <8 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
-; DEFAULT-NEXT: [[TMP3:%.*]] = uitofp <8 x i16> [[TMP2]] to <8 x double>
-; DEFAULT-NEXT: [[TMP4:%.*]] = getelementptr double, ptr [[NEXT_GEP]], i32 0
-; DEFAULT-NEXT: store <8 x double> [[TMP3]], ptr [[TMP4]], align 8
-; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
-; DEFAULT-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
-; DEFAULT-NEXT: br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP22:![0-9]+]]
+; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i16> poison, i16 [[TMP1]], i64 0
+; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i16> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i16> poison, <vscale x 4 x i32> zeroinitializer
+; DEFAULT-NEXT: [[TMP10:%.*]] = or <vscale x 4 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
+; DEFAULT-NEXT: [[TMP11:%.*]] = or <vscale x 4 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
+; DEFAULT-NEXT: [[TMP12:%.*]] = or <vscale x 4 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
+; DEFAULT-NEXT: [[TMP13:%.*]] = or <vscale x 4 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
+; DEFAULT-NEXT: [[TMP14:%.*]] = uitofp <vscale x 4 x i16> [[TMP10]] to <vscale x 4 x double>
+; DEFAULT-NEXT: [[TMP15:%.*]] = uitofp <vscale x 4 x i16> [[TMP11]] to <vscale x 4 x double>
+; DEFAULT-NEXT: [[TMP16:%.*]] = uitofp <vscale x 4 x i16> [[TMP12]] to <vscale x 4 x double>
+; DEFAULT-NEXT: [[TMP17:%.*]] = uitofp <vscale x 4 x i16> [[TMP13]] to <vscale x 4 x double>
+; DEFAULT-NEXT: [[TMP18:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i32 0
+; DEFAUL...
[truncated]
It sounds like you are looking for shouldMaximizeVectorBandwidth? It is currently a target-dependent decision. @huntergr-arm was working on enabling it for SVE, but it can lead to issues with cost modelling and performance, which I believe were being worked through.
Maybe. Looking at the code for AArch64, I think it just checks whether the vector register kind is fixed-width; other than that, there are no other conditions.
I tested the patch on Neoverse V2 with SPEC2017 and there are no regressions. Are there any specific cost-modelling/performance issues you know of?
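For reference, a minimal sketch of what that description implies the AArch64 hook looks like — paraphrased from this thread, not copied from AArch64TargetTransformInfo.cpp, so the real implementation may differ:

```cpp
// Sketch only (assumption based on the comment above): the hook reportedly
// just checks whether the queried register kind is the fixed-width (NEON)
// one, with no other conditions.
bool AArch64TTIImpl::shouldMaximizeVectorBandwidth(
    TargetTransformInfo::RegisterKind K) const {
  return K == TargetTransformInfo::RGK_FixedWidthVector;
}
```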
I agree. I think the correct place to do this would be in shouldMaximizeVectorBandwidth.
+1. I am also wondering what the compile-time impact of this patch will be, given that you are now exploring more VFs.
Ok, thanks @david-arm. Will try to make this target-specific and amend AArch64 accordingly.
Since this is enabling more profitable code, I didn't bother much to measure this, but will try to do so.
I think @SamTebbs33 is working on improving register pressure calculations for partial reductions. It looks like in general with a better cost model we will maximise the vector bandwidth automatically because the phi nodes for the larger types disappear.
Thanks. But this issue came up in one of our internal benchmarks, which does not have a partial reduction.
Could you explain what the performance difference was and why it led to improvements? What did the two versions of the assembly look like? Thanks
For the benchmark, the difference is in which VF the cost model currently selects. Code that is somewhat similar to the benchmark can be found here: https://godbolt.org/z/Wfjhb8PPT
ping |
// vectorizing with higher VF might be profitable. In such cases, we should
// limit the max VF based on smallest type and the decision whether a
// particular VF is beneficial or not be left to cost model.
return WidestType != SmallestType; |
I don't really understand why this is any different to simply returning true here, because LoopVectorizationCostModel::getMaximizedVFForTarget will only change the MaxVF if the types are different anyway. For example, MaxVectorElementCount will be identical to MaxVectorElementCountMaxBW when all types are the same. @fhahn any thoughts?
Also, do we still need to test whether SVE or NEON is available? For example, something like this:
return ST->isNeonAvailable() || ST->isSVEAvailable();
@huntergr-arm definitely found performance regressions when maximising the bandwidth for SVE. I'll see if I can find some examples.
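As a toy illustration of the point about identical caps (my own arithmetic, not LLVM code): when the widest and smallest types in the loop have the same width, the VF cap derived from either one is the same, so forcing bandwidth maximization changes nothing.

```cpp
#include <cstdio>

// Toy illustration, not LLVM code: with a 128-bit register, the default VF
// cap comes from the widest type and the max-bandwidth cap from the smallest
// type. For an all-i32 loop the two caps coincide.
int main() {
  const unsigned RegBits = 128;
  const unsigned WidestBits = 32, SmallestBits = 32; // all types are i32
  std::printf("default VF cap: %u\n", RegBits / WidestBits);   // prints 4
  std::printf("max-bw  VF cap: %u\n", RegBits / SmallestBits); // prints 4
  return 0;
}
```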
This PR does conflict somewhat with current (undocumented) plans; the intent is to enable max bandwidth for the scalable vector register kind by default as soon as we've fixed some regressions we're aware of (at least for cores implementing SVE2 or higher, specifics tbd.). I did try enabling this before but reverted once we found the regressions. We don't need to know the smallest and largest types to do so, as we're improving the cost model to reject suboptimal VFs.
Some PRs that should hopefully let us enable maxbw by default once they all land:
- [VPlan] Implement VPExtendedReduction, VPMulAccumulateReductionRecipe and corresponding vplan transformations. #137746 -- Allows vplan to bundle up sequences of operations into a meta-recipe (VPMulAccumulateReductionRecipe) before modeling the cost.
- [VPlan] Impl VPlan-based pattern match for ExtendedRed and MulAccRed #113903 -- Implements cost modeling for the above PR. (Or will do once rebased, as the work was split up).
- [LoopVectorizer] Bundle partial reductions with different extensions #136997 -- Extends the VPMulAccumulateReductionRecipe to support differing extension types to support usdot instructions.
- [LoopVectorizer] Prune VFs based on plan register pressure #132190 -- Prunes vplans with wider VFs if the estimated register pressure would be too high; doing it here after we know about partial reductions lets us model things better instead of assuming we'll have phi nodes with too-wide types taking up multiple registers as the legacy cost model does now.
We'll probably need a few more improvements later as we run more benchmarks, but those PRs cover the basic mechanisms needed for now.
Thanks @huntergr-arm for the pointers.