[LV] Adding/modifying pre-commit tests for changing loop interleaving count computation #74689

nilanjana87 · 2023-12-07T02:39:40Z

Added/modified tests for evaluating changes to the loop interleaving count (IC) computation in #73766. The new tests cover the change in IC computation that minimizes the remainder trip count (TC) of the vectorized loop and, when the remainder TC is the same, maximizes the IC.
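
To make the remainder arithmetic behind the test comments concrete, here is a minimal sketch in Python. It assumes the remainder TC is simply the trip count modulo the number of scalar iterations covered by one vector-loop iteration (VF * IC); the helper name `remainder_tc` is hypothetical and not part of LLVM.

```python
# Vectorization factor used throughout the test comments in this PR.
VF = 16

def remainder_tc(tc: int, vf: int, ic: int) -> int:
    """Scalar iterations left for the remainder (epilogue) loop.

    Assumption: one vector-loop iteration covers vf * ic scalar iterations,
    so the remainder is tc modulo that step.
    """
    return tc % (vf * ic)

# Known trip counts exercised by the tests in interleave_count.ll.
KNOWN_TRIP_COUNTS = [32, 33, 39, 48, 49, 55]

for tc in KNOWN_TRIP_COUNTS:
    r1 = remainder_tc(tc, VF, 1)
    r2 = remainder_tc(tc, VF, 2)
    print(f"TC {tc}: remainder with IC 1 = {r1}, with IC 2 = {r2}")
```

For TC 49 this gives remainders of 1 (IC 1) and 17 (IC 2), and for TC 55 it gives 7 and 23, matching the numbers quoted in the test comments. For TC 33 and TC 39 both interleave counts leave the same remainder, which is the tie case where the larger IC is preferred.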

llvmbot · 2023-12-07T02:40:09Z

@llvm/pr-subscribers-llvm-transforms

Author: Nilanjana Basu (nilanjana87)

Changes

Added/modified tests for evaluating changes to the loop interleaving count (IC) computation in #73766. The new tests cover the change in IC computation that minimizes the remainder trip count (TC) of the vectorized loop and, when the remainder TC is the same, maximizes the IC.

Full diff: https://github.com/llvm/llvm-project/pull/74689.diff

1 Files Affected:

(modified) llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll (+179-10)

diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll b/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll
index 061cdb5643671..a6642b72993ef 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll
@@ -1,4 +1,4 @@
-; RUN: opt < %s -tiny-trip-count-interleave-threshold=32 -p loop-vectorize -S -pass-remarks=loop-vectorize -disable-output 2>&1 | FileCheck %s
+; RUN: opt < %s -tiny-trip-count-interleave-threshold=16 -p loop-vectorize -S -pass-remarks=loop-vectorize -disable-output 2>&1 | FileCheck %s
 ; TODO: remove -tiny-trip-count-interleave-threshold once the interleave threshold is removed
 
 target triple = "aarch64-linux-gnu"
@@ -6,7 +6,7 @@ target triple = "aarch64-linux-gnu"
 %pair = type { i8, i8 }
 
 ; For this loop with known TC of 32, when the auto-vectorizer chooses VF 16, it should choose
-; IC 2 since there is no remainder loop run needed when the vector loop runs.
+; IC 2 since there is no remainder loop run needed after the vector loop runs.
 ; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
 define void @loop_with_tc_32(ptr noalias %p, ptr noalias %q) {
 entry:
@@ -29,8 +29,8 @@ for.end:
   ret void
 }
 
-; TODO: For this loop with known TC of 33, when the auto-vectorizer chooses VF 16, it should choose
-; IC 1 since there may be a remainder loop that needs to run after the vector loop.
+; For this loop with known TC of 33, when the auto-vectorizer chooses VF 16, it should choose
+; IC 2 since there is a small remainder loop TC that needs to run after the vector loop.
 ; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
 define void @loop_with_tc_33(ptr noalias %p, ptr noalias %q) {
 entry:
@@ -53,9 +53,104 @@ for.end:
   ret void
 }
 
-; For a loop with unknown trip count but a profile showing an approx TC estimate of 32, when the
-; auto-vectorizer chooses VF 16, it should choose IC 2 since chances are high that the remainder loop
-; won't need to run
+; For this loop with known TC of 39, when the auto-vectorizer chooses VF 16, it should choose
+; IC 2 since there is a small remainder loop that needs to run after the vector loop.
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+define void @loop_with_tc_39(ptr noalias %p, ptr noalias %q) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %tmp1 = load i8, ptr %tmp0, align 1
+  %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %tmp3 = load i8, ptr %tmp2, align 1
+  %add = add i8 %tmp1, %tmp3
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, 39
+  br i1 %cond, label %for.end, label %for.body
+
+for.end:
+  ret void
+}
+
+; TODO: For this loop with known TC of 48, when the auto-vectorizer chooses VF 16, it should choose
+; IC 1 since there will be no remainder loop that needs to run after the vector loop.
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+define void @loop_with_tc_48(ptr noalias %p, ptr noalias %q) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %tmp1 = load i8, ptr %tmp0, align 1
+  %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %tmp3 = load i8, ptr %tmp2, align 1
+  %add = add i8 %tmp1, %tmp3
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, 48
+  br i1 %cond, label %for.end, label %for.body
+
+for.end:
+  ret void
+}
+
+; TODO: For this loop with known TC of 49, when the auto-vectorizer chooses VF 16, it should choose
+; IC 1 since a remainder loop TC of 1 is more efficient than remainder loop TC of 17 with IC 2
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+define void @loop_with_tc_49(ptr noalias %p, ptr noalias %q) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %tmp1 = load i8, ptr %tmp0, align 1
+  %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %tmp3 = load i8, ptr %tmp2, align 1
+  %add = add i8 %tmp1, %tmp3
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, 49
+  br i1 %cond, label %for.end, label %for.body
+
+for.end:
+  ret void
+}
+
+; TODO: For this loop with known TC of 55, when the auto-vectorizer chooses VF 16, it should choose
+; IC 1 since a remainder loop TC of 7 is more efficient than remainder loop TC of 23 with IC 2
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+define void @loop_with_tc_55(ptr noalias %p, ptr noalias %q) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %tmp1 = load i8, ptr %tmp0, align 1
+  %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %tmp3 = load i8, ptr %tmp2, align 1
+  %add = add i8 %tmp1, %tmp3
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, 55
+  br i1 %cond, label %for.end, label %for.body
+
+for.end:
+  ret void
+}
+
+; TODO: For a loop with a profile-guided estimated TC of 32, when the auto-vectorizer chooses VF 16, 
+; it should conservatively choose IC 1 so that the vector loop runs twice at least
 ; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
 define void @loop_with_profile_tc_32(ptr noalias %p, ptr noalias %q, i64 %n) {
 entry:
@@ -78,9 +173,8 @@ for.end:
   ret void
 }
 
-; TODO: For a loop with unknown trip count but a profile showing an approx TC estimate of 33, 
-; when the auto-vectorizer chooses VF 16, it should choose IC 1 since chances are high that the 
-; remainder loop will need to run
+; TODO: For a loop with a profile-guided estimated TC of 33, when the auto-vectorizer chooses VF 16, 
+; it should conservatively choose IC 1 so that the vector loop runs twice at least
 ; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
 define void @loop_with_profile_tc_33(ptr noalias %p, ptr noalias %q, i64 %n) {
 entry:
@@ -103,5 +197,80 @@ for.end:
   ret void
 }
 
+; TODO: For a loop with a profile-guided estimated TC of 48, when the auto-vectorizer chooses VF 16, 
+; it should conservatively choose IC 1 so that the vector loop runs twice at least
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+define void @loop_with_profile_tc_48(ptr noalias %p, ptr noalias %q, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %tmp1 = load i8, ptr %tmp0, align 1
+  %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %tmp3 = load i8, ptr %tmp2, align 1
+  %add = add i8 %tmp1, %tmp3
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, %n
+  br i1 %cond, label %for.end, label %for.body, !prof !2
+
+for.end:
+  ret void
+}
+
+; TODO: For a loop with a profile-guided estimated TC of 63, when the auto-vectorizer chooses VF 16, 
+; it should conservatively choose IC 1 so that the vector loop runs twice at least
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+define void @loop_with_profile_tc_63(ptr noalias %p, ptr noalias %q, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %tmp1 = load i8, ptr %tmp0, align 1
+  %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %tmp3 = load i8, ptr %tmp2, align 1
+  %add = add i8 %tmp1, %tmp3
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, %n
+  br i1 %cond, label %for.end, label %for.body, !prof !3
+
+for.end:
+  ret void
+}
+
+; For a loop with a profile-guided estimated TC of 64, when the auto-vectorizer chooses VF 16, 
+; it should choose conservatively IC 2 so that the vector loop runs twice at least
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+define void @loop_with_profile_tc_64(ptr noalias %p, ptr noalias %q, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %tmp1 = load i8, ptr %tmp0, align 1
+  %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %tmp3 = load i8, ptr %tmp2, align 1
+  %add = add i8 %tmp1, %tmp3
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, %n
+  br i1 %cond, label %for.end, label %for.body, !prof !4
+
+for.end:
+  ret void
+}
+
 !0 = !{!"branch_weights", i32 1, i32 31}
 !1 = !{!"branch_weights", i32 1, i32 32}
+!2 = !{!"branch_weights", i32 1, i32 47}
+!3 = !{!"branch_weights", i32 1, i32 62}
+!4 = !{!"branch_weights", i32 1, i32 63}
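
The profile-guided tests rely on `branch_weights` metadata rather than a constant trip count. The following Python sketch shows how the estimated TC values in the test comments can be derived from those weights. It assumes the estimate is the backedge weight divided by the exit weight (rounded to nearest) plus one, which agrees with the TC values the comments attach to `!2`, `!3`, and `!4`; the function names are hypothetical and this is not LLVM's API.

```python
def estimated_trip_count(exit_weight: int, backedge_weight: int) -> int:
    """Estimate a loop's trip count from its latch branch weights.

    Assumption: the loop takes the backedge backedge_weight / exit_weight
    times per exit, and the trip count is that number plus one.
    """
    if exit_weight <= 0:
        raise ValueError("exit weight must be positive")
    # Round-to-nearest integer division; exact when exit_weight == 1.
    backedges_per_exit = (backedge_weight + exit_weight // 2) // exit_weight
    return backedges_per_exit + 1

def vector_loop_iterations(tc: int, vf: int, ic: int) -> int:
    """How many times the vector loop body runs for a given VF and IC."""
    return tc // (vf * ic)

# (exit weight, backedge weight) pairs copied from the metadata in the diff.
PROFILE_WEIGHTS = {
    "!0": (1, 31),
    "!1": (1, 32),
    "!2": (1, 47),
    "!3": (1, 62),
    "!4": (1, 63),
}

for name, (exit_w, back_w) in PROFILE_WEIGHTS.items():
    tc = estimated_trip_count(exit_w, back_w)
    runs_ic1 = vector_loop_iterations(tc, 16, 1)
    runs_ic2 = vector_loop_iterations(tc, 16, 2)
    print(f"{name}: estimated TC {tc}, vector loop runs {runs_ic1}x (IC 1), {runs_ic2}x (IC 2)")
```

Under these assumptions, only the estimated TC of 64 lets the vector loop run at least twice with VF 16 and IC 2, which is why the comment for `loop_with_profile_tc_64` is the only profile-guided case without a TODO.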

fhahn

LGTM, thanks!

[LV] Change loops' interleave count computation

A set of microbenchmarks in llvm-test-suite (llvm/llvm-test-suite#56), when tested on an AArch64 platform, demonstrates that loop interleaving is beneficial when the vector loop runs at least twice or when the epilogue loop trip count (TC) is minimal. Therefore, we choose an interleaving count (IC) between TC/VF and TC/(2*VF) (VF = vectorization factor) such that the remainder TC for the epilogue loop is minimal; if the remainder TC is the same for both, we choose the larger IC. The initial tests for this change were submitted in PRs #70272 and #74689.
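
The selection rule in this commit message can be sketched in Python as follows. This is an illustration under stated assumptions, not the code from #73766: it assumes both bounds are rounded down to a power of two and capped at a target maximum `max_ic`, and the names `bit_floor` and `choose_interleave_count` are hypothetical.

```python
def bit_floor(n: int) -> int:
    """Largest power of two that is <= n (0 for n <= 0)."""
    return 1 << (n.bit_length() - 1) if n > 0 else 0

def choose_interleave_count(tc: int, vf: int, max_ic: int) -> int:
    """Pick an IC between TC/(2*VF) and TC/VF.

    Prefers the candidate with the smaller remainder TC; on a tie,
    prefers the larger IC. Power-of-two rounding and the max_ic cap
    are assumptions made for this sketch.
    """
    upper = bit_floor(max(1, min(tc // vf, max_ic)))
    lower = bit_floor(max(1, min(tc // (2 * vf), max_ic)))
    if upper == lower:
        return lower
    tail_upper = tc % (vf * upper)  # remainder TC with the larger IC
    tail_lower = tc % (vf * lower)  # remainder TC with the smaller IC
    # Keep the larger IC only if it does not leave a bigger remainder.
    return upper if tail_upper <= tail_lower else lower

# Known trip counts from interleave_count.ll, with VF 16 and max IC 2 assumed.
for tc in [32, 33, 39, 48, 49, 55]:
    print(f"TC {tc}: IC {choose_interleave_count(tc, 16, 2)}")
```

With VF 16 and an assumed maximum IC of 2, the sketch returns IC 2 for TC 32, 33, and 39 and IC 1 for TC 48, 49, and 55, which are the values the test comments (including the TODOs) say should be chosen for the known-trip-count loops.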

nilanjana87 requested a review from fhahn December 7, 2023 02:39

llvmbot added the llvm:transforms label Dec 7, 2023

nilanjana87 requested a review from david-arm December 7, 2023 02:39

nilanjana87 mentioned this pull request Dec 7, 2023

[LV] Change loops' interleave count computation #73766

Merged

fhahn reviewed Dec 7, 2023

Added tests for loops with larger trip counts and increased the target's max interleaving count

9c5740b

fhahn approved these changes Dec 10, 2023

Splitting off the tests into two files with tests for known trip count in one and for profile-guided estimated trip count in another.

31dfca6

nilanjana87 merged commit 41a3828 into llvm:main Dec 12, 2023

nilanjana87 mentioned this pull request Jan 4, 2024

[LV] Relax high loop trip count threshold for deciding to interleave a loop #67725

Merged
