[LoopVectorize] Refine runtime memory check costs when there is an outer loop #76034
Conversation
@llvm/pr-subscribers-llvm-transforms

Author: David Sherwood (david-arm)

Changes: When we generate runtime memory checks for an inner loop it's possible that these checks are invariant in the outer loop and so will get hoisted out. In such cases, the effective cost of the checks should reduce to reflect the outer loop trip count. This fixes a 25% performance regression introduced by commit 49b0e6d when building the SPEC2017 x264 benchmark with PGO, where we decided the inner loop trip count wasn't high enough to warrant the (incorrect) high cost of the runtime checks. Also, when runtime memory checks consist entirely of diff checks these are likely to be outer loop invariant.

Full diff: https://github.com/llvm/llvm-project/pull/76034.diff

2 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index f82e161fb846d1..8287090800dd37 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -2070,7 +2070,7 @@ class GeneratedRTChecks {
}
}
- InstructionCost getCost() {
+ InstructionCost getCost(Loop *OuterLoop) {
if (SCEVCheckBlock || MemCheckBlock)
LLVM_DEBUG(dbgs() << "Calculating cost of runtime checks:\n");
@@ -2091,16 +2091,45 @@ class GeneratedRTChecks {
LLVM_DEBUG(dbgs() << " " << C << " for " << I << "\n");
RTCheckCost += C;
}
- if (MemCheckBlock)
+ if (MemCheckBlock) {
+ InstructionCost MemCheckCost = 0;
for (Instruction &I : *MemCheckBlock) {
if (MemCheckBlock->getTerminator() == &I)
continue;
InstructionCost C =
TTI->getInstructionCost(&I, TTI::TCK_RecipThroughput);
LLVM_DEBUG(dbgs() << " " << C << " for " << I << "\n");
- RTCheckCost += C;
+ MemCheckCost += C;
+ }
+
+ // If the runtime memory checks are being created inside an outer loop
+ // we should find out if these checks are outer loop invariant. If so,
+ // the checks will be hoisted out and so the effective cost will reduce
+ // according to the outer loop trip count.
+ if (OuterLoop) {
+ ScalarEvolution *SE = MemCheckExp.getSE();
+ const SCEV *Cond = SE->getSCEV(MemRuntimeCheckCond);
+ if (SE->isLoopInvariant(Cond, OuterLoop)) {
+ if (std::optional<unsigned> OuterTC =
+ getSmallBestKnownTC(*SE, OuterLoop))
+ MemCheckCost /= *OuterTC;
+ else {
+ // It seems reasonable to assume that we can reduce the effective
+ // cost of the checks even when we know nothing about the trip
+ // count. Here I've assumed that the outer loop executes at least
+ // twice.
+ MemCheckCost /= 2;
+ }
+
+ // Let's ensure the cost is always at least 1.
+ if (MemCheckCost == 0)
+ MemCheckCost = 1;
+ }
}
+ RTCheckCost += MemCheckCost;
+ }
+
if (SCEVCheckBlock || MemCheckBlock)
LLVM_DEBUG(dbgs() << "Total cost of runtime checks: " << RTCheckCost
<< "\n");
@@ -9754,7 +9783,7 @@ static bool areRuntimeChecksProfitable(GeneratedRTChecks &Checks,
std::optional<unsigned> VScale, Loop *L,
ScalarEvolution &SE,
ScalarEpilogueLowering SEL) {
- InstructionCost CheckCost = Checks.getCost();
+ InstructionCost CheckCost = Checks.getCost(L->getParentLoop());
if (!CheckCost.isValid())
return false;
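The refinement in `getCost` above can be modelled as a small standalone function. This is an illustrative sketch, not LLVM API: the name `refineMemCheckCost` and its parameters are assumptions, standing in for the already-summed memory-check cost and for the invariance/trip-count queries the patch makes through ScalarEvolution.

```cpp
#include <algorithm>
#include <optional>

// Standalone model of the new cost refinement in GeneratedRTChecks::getCost().
// If the checks are invariant in the outer loop they will be hoisted out, so
// their cost is amortised over the outer loop's trip count.
unsigned refineMemCheckCost(unsigned MemCheckCost, bool OuterLoopInvariant,
                            std::optional<unsigned> OuterTripCount) {
  if (!OuterLoopInvariant)
    return MemCheckCost; // Checks re-execute on every outer iteration.
  if (OuterTripCount)
    MemCheckCost /= *OuterTripCount; // Amortise over the known trip count.
  else
    MemCheckCost /= 2; // Unknown trip count: assume at least two iterations.
  return std::max(MemCheckCost, 1u); // Ensure the cost is always at least 1.
}
```

If the raw memory-check cost were 6, this would reproduce the pattern of totals checked in the new test below: 2 for a known outer trip count of 3, 1 (clamped) for a trip count of 64, and 3 when the trip count is unknown.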
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_memcheck_cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_memcheck_cost.ll
new file mode 100644
index 00000000000000..b740b055822991
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_memcheck_cost.ll
@@ -0,0 +1,193 @@
+; RUN: opt -p loop-vectorize -mattr=+sve -debug-only=loop-vectorize -S -disable-output < %s 2>&1 | FileCheck %s
+
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64-unknown-linux-gnu"
+
+
+define void @outer_no_tc(ptr nocapture noundef %a, ptr nocapture noundef readonly %b, i64 noundef %m, i64 noundef %n) vscale_range(1,16) {
+; CHECK-LABEL: LV: Checking a loop in 'outer_no_tc'
+; CHECK: Calculating cost of runtime checks:
+; CHECK: Total cost of runtime checks: 3
+; CHECK-NEXT: LV: Minimum required TC for runtime checks to be profitable:32
+entry:
+ %cmp24 = icmp sgt i64 %m, 0
+ %cmp222 = icmp sgt i64 %n, 0
+ %or.cond = and i1 %cmp24, %cmp222
+ br i1 %or.cond, label %for.cond1.preheader.us, label %for.cond.cleanup
+
+for.cond1.preheader.us:
+ %i.025.us = phi i64 [ %inc12.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %entry ]
+ %mul.us = mul nsw i64 %i.025.us, %n
+ br label %for.body4.us
+
+for.body4.us:
+ %j.023.us = phi i64 [ 0, %for.cond1.preheader.us ], [ %inc.us, %for.body4.us ]
+ %add.us = add nuw nsw i64 %j.023.us, %mul.us
+ %arrayidx.us = getelementptr inbounds i8, ptr %b, i64 %add.us
+ %0 = load i8, ptr %arrayidx.us, align 1
+ %arrayidx7.us = getelementptr inbounds i8, ptr %a, i64 %add.us
+ %1 = load i8, ptr %arrayidx7.us, align 1
+ %add9.us = add i8 %1, %0
+ store i8 %add9.us, ptr %arrayidx7.us, align 1
+ %inc.us = add nuw nsw i64 %j.023.us, 1
+ %exitcond.not = icmp eq i64 %inc.us, %n
+ br i1 %exitcond.not, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us
+
+for.cond1.for.cond.cleanup3_crit_edge.us:
+ %inc12.us = add nuw nsw i64 %i.025.us, 1
+ %exitcond27.not = icmp eq i64 %inc12.us, %m
+ br i1 %exitcond27.not, label %for.cond.cleanup, label %for.cond1.preheader.us
+
+for.cond.cleanup:
+ ret void
+}
+
+
+define void @outer_known_tc3(ptr nocapture noundef %a, ptr nocapture noundef readonly %b, i64 noundef %n) vscale_range(1,16) {
+; CHECK-LABEL: LV: Checking a loop in 'outer_known_tc3'
+; CHECK: Calculating cost of runtime checks:
+; CHECK: Total cost of runtime checks: 2
+; CHECK-NEXT: LV: Minimum required TC for runtime checks to be profitable:32
+entry:
+ %cmp222 = icmp sgt i64 %n, 0
+ br i1 %cmp222, label %for.cond1.preheader.us, label %for.cond.cleanup
+
+for.cond1.preheader.us:
+ %i.024.us = phi i64 [ %inc12.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %entry ]
+ %mul.us = mul nsw i64 %i.024.us, %n
+ br label %for.body4.us
+
+for.body4.us:
+ %j.023.us = phi i64 [ 0, %for.cond1.preheader.us ], [ %inc.us, %for.body4.us ]
+ %add.us = add nuw nsw i64 %j.023.us, %mul.us
+ %arrayidx.us = getelementptr inbounds i8, ptr %b, i64 %add.us
+ %0 = load i8, ptr %arrayidx.us, align 1
+ %arrayidx7.us = getelementptr inbounds i8, ptr %a, i64 %add.us
+ %1 = load i8, ptr %arrayidx7.us, align 1
+ %add9.us = add i8 %1, %0
+ store i8 %add9.us, ptr %arrayidx7.us, align 1
+ %inc.us = add nuw nsw i64 %j.023.us, 1
+ %exitcond.not = icmp eq i64 %inc.us, %n
+ br i1 %exitcond.not, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us
+
+for.cond1.for.cond.cleanup3_crit_edge.us:
+ %inc12.us = add nuw nsw i64 %i.024.us, 1
+ %exitcond26.not = icmp eq i64 %inc12.us, 3
+ br i1 %exitcond26.not, label %for.cond.cleanup, label %for.cond1.preheader.us
+
+for.cond.cleanup:
+ ret void
+}
+
+
+define void @outer_known_tc64(ptr nocapture noundef %a, ptr nocapture noundef readonly %b, i64 noundef %n) vscale_range(1,16) {
+; CHECK-LABEL: LV: Checking a loop in 'outer_known_tc64'
+; CHECK: Calculating cost of runtime checks:
+; CHECK: Total cost of runtime checks: 1
+; CHECK-NEXT: LV: Minimum required TC for runtime checks to be profitable:32
+entry:
+ %cmp222 = icmp sgt i64 %n, 0
+ br i1 %cmp222, label %for.cond1.preheader.us, label %for.cond.cleanup
+
+for.cond1.preheader.us:
+ %i.024.us = phi i64 [ %inc12.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %entry ]
+ %mul.us = mul nsw i64 %i.024.us, %n
+ br label %for.body4.us
+
+for.body4.us:
+ %j.023.us = phi i64 [ 0, %for.cond1.preheader.us ], [ %inc.us, %for.body4.us ]
+ %add.us = add nuw nsw i64 %j.023.us, %mul.us
+ %arrayidx.us = getelementptr inbounds i8, ptr %b, i64 %add.us
+ %0 = load i8, ptr %arrayidx.us, align 1
+ %arrayidx7.us = getelementptr inbounds i8, ptr %a, i64 %add.us
+ %1 = load i8, ptr %arrayidx7.us, align 1
+ %add9.us = add i8 %1, %0
+ store i8 %add9.us, ptr %arrayidx7.us, align 1
+ %inc.us = add nuw nsw i64 %j.023.us, 1
+ %exitcond.not = icmp eq i64 %inc.us, %n
+ br i1 %exitcond.not, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us
+
+for.cond1.for.cond.cleanup3_crit_edge.us:
+ %inc12.us = add nuw nsw i64 %i.024.us, 1
+ %exitcond26.not = icmp eq i64 %inc12.us, 64
+ br i1 %exitcond26.not, label %for.cond.cleanup, label %for.cond1.preheader.us
+
+for.cond.cleanup:
+ ret void
+}
+
+
+define void @outer_pgo_3(ptr nocapture noundef %a, ptr nocapture noundef readonly %b, i64 noundef %m, i64 noundef %n) vscale_range(1,16) {
+; CHECK-LABEL: LV: Checking a loop in 'outer_pgo_3'
+; CHECK: Calculating cost of runtime checks:
+; CHECK: Total cost of runtime checks: 2
+; CHECK-NEXT: LV: Minimum required TC for runtime checks to be profitable:32
+entry:
+ %cmp222 = icmp sgt i64 %n, 0
+ br i1 %cmp222, label %for.cond1.preheader.us, label %for.cond.cleanup
+
+for.cond1.preheader.us:
+ %i.024.us = phi i64 [ %inc12.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %entry ]
+ %mul.us = mul nsw i64 %i.024.us, %n
+ br label %for.body4.us
+
+for.body4.us:
+ %j.023.us = phi i64 [ 0, %for.cond1.preheader.us ], [ %inc.us, %for.body4.us ]
+ %add.us = add nuw nsw i64 %j.023.us, %mul.us
+ %arrayidx.us = getelementptr inbounds i8, ptr %b, i64 %add.us
+ %0 = load i8, ptr %arrayidx.us, align 1
+ %arrayidx7.us = getelementptr inbounds i8, ptr %a, i64 %add.us
+ %1 = load i8, ptr %arrayidx7.us, align 1
+ %add9.us = add i8 %1, %0
+ store i8 %add9.us, ptr %arrayidx7.us, align 1
+ %inc.us = add nuw nsw i64 %j.023.us, 1
+ %exitcond.not = icmp eq i64 %inc.us, %n
+ br i1 %exitcond.not, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us
+
+for.cond1.for.cond.cleanup3_crit_edge.us:
+ %inc12.us = add nuw nsw i64 %i.024.us, 1
+ %exitcond26.not = icmp eq i64 %inc12.us, %m
+ br i1 %exitcond26.not, label %for.cond.cleanup, label %for.cond1.preheader.us, !prof !0
+
+for.cond.cleanup:
+ ret void
+}
+
+
+define void @outer_known_tc3_full_range_checks(ptr nocapture noundef %dst, ptr nocapture noundef readonly %src, i64 noundef %n) {
+; CHECK-LABEL: LV: Checking a loop in 'outer_known_tc3_full_range_checks'
+; CHECK: Calculating cost of runtime checks:
+; CHECK: Total cost of runtime checks: 2
+; CHECK-NEXT: LV: Minimum required TC for runtime checks to be profitable:8
+entry:
+ br label %outer.loop
+
+outer.loop:
+ %outer.iv = phi i64 [ 0, %entry ], [ %outer.iv.next, %inner.exit ]
+ %0 = mul nsw i64 %outer.iv, %n
+ br label %inner.loop
+
+inner.loop:
+ %iv.inner = phi i64 [ 0, %outer.loop ], [ %iv.inner.next, %inner.loop ]
+ %1 = add nuw nsw i64 %iv.inner, %0
+ %arrayidx.us = getelementptr inbounds i32, ptr %src, i64 %1
+ %2 = load i32, ptr %arrayidx.us, align 4
+ %arrayidx8.us = getelementptr inbounds i32, ptr %dst, i64 %1
+ %3 = load i32, ptr %arrayidx8.us, align 4
+ %add9.us = add nsw i32 %3, %2
+ store i32 %add9.us, ptr %arrayidx8.us, align 4
+ %iv.inner.next = add nuw nsw i64 %iv.inner, 1
+ %inner.exit.cond = icmp eq i64 %iv.inner.next, %n
+ br i1 %inner.exit.cond, label %inner.exit, label %inner.loop
+
+inner.exit:
+ %outer.iv.next = add nuw nsw i64 %outer.iv, 1
+ %outer.exit.cond = icmp eq i64 %outer.iv.next, 3
+ br i1 %outer.exit.cond, label %outer.exit, label %outer.loop
+
+outer.exit:
+ ret void
+}
+
+
+!0 = !{!"branch_weights", i32 10, i32 20}
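The `outer_pgo_3` function above has no constant outer trip count, but its latch branch carries the `!prof !0` branch weights. LLVM derives an estimated trip count from such weights (via a helper like `getLoopEstimatedTripCount`); the function below is a rough illustrative approximation of that idea, not the LLVM implementation.

```cpp
// Rough approximation (illustrative, not LLVM's implementation) of turning
// latch branch weights into an estimated trip count: the loop body runs once
// per taken backedge, plus once for the iteration that finally exits.
unsigned estimatedTripCount(unsigned ExitWeight, unsigned BackedgeWeight) {
  return BackedgeWeight / ExitWeight + 1;
}
```

For `!{!"branch_weights", i32 10, i32 20}` (exit weight 10, backedge weight 20) this gives an estimate of 3, which is why `outer_pgo_3` checks the same total runtime-check cost (2) as `outer_known_tc3`.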
This seems OK from what I can tell. LGTM
@@ -2070,7 +2070,7 @@ class GeneratedRTChecks {
     }
   }

-  InstructionCost getCost() {
+  InstructionCost getCost(Loop *OuterLoop) {
It might be better to store the inner or outer loop directly in the struct; the inner loop is passed in to `Create`, and there are 2 places that check for an outer loop already (via `if (auto *PL = LI->getLoopFor(LoopVectorPreHeader))`).
Could you update the places that look up the outer loop to use the new variable as well?
Gentle ping!
One small comment, but otherwise LGTM! I'll leave someone else more familiar with the code to approve the change.
…ter loop When we generate runtime memory checks for an inner loop it's possible that these checks are invariant in the outer loop and so will get hoisted out. In such cases, the effective cost of the checks should reduce to reflect the outer loop trip count. This fixes a 25% performance regression introduced by commit 49b0e6d when building the SPEC2017 x264 benchmark with PGO, where we decided the inner loop trip count wasn't high enough to warrant the (incorrect) high cost of the runtime checks. Also, when runtime memory checks consist entirely of diff checks these are likely to be outer loop invariant.
…is an outer loop" This reverts commit a152314.
This reverts commit a4caa47.
Trying a rebase to see if that fixes the failing RISCV test that appears unrelated to my change.
Thanks for the updates!
Trying a rebase to see if that fixes the failing RISCV test that appears unrelated to my change
Alternatively you could use the `Update Branch` button to merge in the changes from current main, which I think is preferred to rebases, as the GH UI may lose comments after rebases.
@@ -2070,7 +2070,7 @@ class GeneratedRTChecks {
     }
   }

-  InstructionCost getCost() {
+  InstructionCost getCost(Loop *OuterLoop) {
Could you update the places that look up the outer loop to use the new variable as well?
Yeah I used to use the "Update Branch" button, but I find it makes life very difficult for reviewers (and authors) and I'd like to stop using it in future. The problem is that in Phabricator we could do stacked patches quite easily, i.e. an NFC test patch followed by the code change. The purpose of this is to allow people to see the impact of the code changes. In this new world we're supposed to always add a new commit every time we address comments, which completely defeats the whole purpose of splitting the test patch up from the code change. Quite often reviewers ask authors to rewrite tests or add new ones, which means that when you add a new commit you can no longer see the impact of your code change. It feels like we should either:

To be honest either approach is unsatisfactory, but I feel option 2) gives the reviewer a more fighting chance of seeing cause and effect!
IMO including a separate commit for precommitting tests in a feature PR isn't ideal, as the default view in GH shows the diff for all changes compared to the main branch, and when merging the PR all commits get squashed into one. It's now possible to get something close to Phabricator's stacked reviews, discussed here: https://discourse.llvm.org/t/update-on-github-pull-requests/71540/146.
That's true, but to be honest the idea of creating loads of user branches on main fills me with dread. :) I've not tried it myself so I don't know how much extra overhead is involved. If you have examples where you have done this successfully/efficiently I'd be genuinely curious to see how it's done?
Hi @fhahn, I couldn't find any other instances in GeneratedRTChecks where we call …
There are 2 places that do something like …
Hi @fhahn, ok thanks. It wasn't obvious to me that those cases were checking for outer loops, so thanks for pointing them out! I've changed them now.
LGTM, thanks!
…ter loop (llvm#76034) When we generate runtime memory checks for an inner loop it's possible that these checks are invariant in the outer loop and so will get hoisted out. In such cases, the effective cost of the checks should reduce to reflect the outer loop trip count. This fixes a 25% performance regression introduced by commit 49b0e6d when building the SPEC2017 x264 benchmark with PGO, where we decided the inner loop trip count wasn't high enough to warrant the (incorrect) high cost of the runtime checks. Also, when runtime memory checks consist entirely of diff checks these are likely to be outer loop invariant. (cherry picked from commit 962fbaf)