[LV] Change loops' interleave count computation #70141

Conversation
@llvm/pr-subscribers-llvm-transforms

Author: Nilanjana Basu (nilanjana87)

Changes

[LV] Change loops' interleave count computation

A set of microbenchmarks in llvm-test-suite (llvm/llvm-test-suite#26), when tested on an AArch64 platform, demonstrates that loop interleaving is beneficial in two cases:
1) when TC > 2 * VW * IC, such that the interleaved vectorized portion of the loop runs at least twice
2) when TC is an exact multiple of VW * IC, such that there is no epilogue loop to run
where TC = trip count, VW = vectorization width, IC = interleaving count

We change the interleave count computation based on this information, but we leave it unchanged when the flag InterleaveSmallLoopScalarReduction is set to true, since that flag handles a special case (https://reviews.llvm.org/D81416).

Full diff: https://github.com/llvm/llvm-project/pull/70141.diff

2 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index cc17d91d4f43727..5786e8e38cb4974 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5745,8 +5745,12 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
}
// If trip count is known or estimated compile time constant, limit the
- // interleave count to be less than the trip count divided by VF, provided it
- // is at least 1.
+ // interleave count to be less than the trip count divided by VF * 2,
+ // provided VF is at least 1 and the trip count is not an exact multiple of
+ // VF, such that the vector loop runs at least twice to make interleaving seem
+ // profitable when there is an epilogue loop present. When
+ // InterleaveSmallLoopScalarReduction is true or trip count is an exact
+ // multiple of VF, we allow interleaving even when the vector loop runs once.
//
// For scalable vectors we can't know if interleaving is beneficial. It may
// not be beneficial for small loops if none of the lanes in the second vector
@@ -5755,10 +5759,15 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
// the InterleaveCount as if vscale is '1', although if some information about
// the vector is known (e.g. min vector size), we can make a better decision.
if (BestKnownTC) {
- MaxInterleaveCount =
- std::min(*BestKnownTC / VF.getKnownMinValue(), MaxInterleaveCount);
- // Make sure MaxInterleaveCount is greater than 0.
- MaxInterleaveCount = std::max(1u, MaxInterleaveCount);
+ if (InterleaveSmallLoopScalarReduction ||
+ (*BestKnownTC % VF.getKnownMinValue() == 0))
+ MaxInterleaveCount =
+ std::min(*BestKnownTC / VF.getKnownMinValue(), MaxInterleaveCount);
+ else
+ MaxInterleaveCount = std::min(*BestKnownTC / (VF.getKnownMinValue() * 2),
+ MaxInterleaveCount);
+ // Make sure MaxInterleaveCount is greater than 0 & a power of 2.
+ MaxInterleaveCount = llvm::bit_floor(std::max(1u, MaxInterleaveCount));
}
assert(MaxInterleaveCount > 0 &&
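
To make the new clamping rule concrete, here is a minimal standalone sketch (C++20, for std::bit_floor) of the logic above. clampInterleaveCount is an illustrative name, the InterleaveSmallLoopScalarReduction path is omitted, and the parameters stand in for BestKnownTC, VF.getKnownMinValue(), and the incoming MaxInterleaveCount; this is a sketch of the patch's arithmetic, not the LLVM API:

```cpp
#include <algorithm>
#include <bit>
#include <cstdio>

// Clamp the candidate interleave count for a loop with a known or estimated
// trip count TC and vectorization factor VF, per the rule in this patch.
unsigned clampInterleaveCount(unsigned TC, unsigned VF, unsigned MaxIC) {
  if (TC % VF == 0)
    // No epilogue loop: allow interleaving even if the vector body runs once.
    MaxIC = std::min(TC / VF, MaxIC);
  else
    // Epilogue loop present: require the vector body to run at least twice.
    MaxIC = std::min(TC / (VF * 2), MaxIC);
  // Keep the result at least 1 and round down to a power of 2.
  return std::bit_floor(std::max(1u, MaxIC));
}

int main() {
  std::printf("%u\n", clampInterleaveCount(128, 64, 8)); // 2: 128 is a multiple of 64
  std::printf("%u\n", clampInterleaveCount(129, 64, 8)); // 1: 129 / (64 * 2) = 1
}
```

With VF 64 this reproduces the expectations in the tests below: a trip count of 128 keeps IC 2, while 129 is clamped to IC 1.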
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll b/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll
new file mode 100644
index 000000000000000..501fc334bc8bf48
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll
@@ -0,0 +1,106 @@
+; RUN: opt < %s -force-vector-width=64 -O3 -S -pass-remarks=loop-vectorize 2>&1 | FileCheck %s
+
+target triple = "aarch64-linux-gnu"
+
+%pair = type { i8, i8 }
+
+; For a loop with known trip count of 128, when we force VF 64, it should use
+; IC 2, since there is no remainder loop needed when the vector loop runs.
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 64, interleaved count: 2)
+define void @loop_with_tc_128(ptr %p, ptr %q) {
+entry:
+ br label %for.body
+
+for.body:
+ %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+ %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+ %tmp1 = load i8, ptr %tmp0, align 1
+ %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+ %tmp3 = load i8, ptr %tmp2, align 1
+ %add = add i8 %tmp1, %tmp3
+ %qi = getelementptr i8, ptr %q, i64 %i
+ store i8 %add, ptr %qi, align 1
+ %i.next = add nuw nsw i64 %i, 1
+ %cond = icmp eq i64 %i.next, 128
+ br i1 %cond, label %for.end, label %for.body
+
+for.end:
+ ret void
+}
+
+; For a loop with known trip count of 129, when we force VF 64, it should use
+; IC 1, since there may be a remainder loop that needs to run after the vector loop.
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 64, interleaved count: 1)
+define void @loop_with_tc_129(ptr %p, ptr %q) {
+entry:
+ br label %for.body
+
+for.body:
+ %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+ %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+ %tmp1 = load i8, ptr %tmp0, align 1
+ %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+ %tmp3 = load i8, ptr %tmp2, align 1
+ %add = add i8 %tmp1, %tmp3
+ %qi = getelementptr i8, ptr %q, i64 %i
+ store i8 %add, ptr %qi, align 1
+ %i.next = add nuw nsw i64 %i, 1
+ %cond = icmp eq i64 %i.next, 129
+ br i1 %cond, label %for.end, label %for.body
+
+for.end:
+ ret void
+}
+
+; For a loop with unknown trip count but a profile showing an approx TC estimate of 128,
+; when we force VF 64, it should use IC 2, since chances are high that the remainder loop
+; won't need to run
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 64, interleaved count: 2)
+define void @loop_with_profile_tc_128(ptr %p, ptr %q, i64 %n) {
+entry:
+ br label %for.body
+
+for.body:
+ %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+ %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+ %tmp1 = load i8, ptr %tmp0, align 1
+ %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+ %tmp3 = load i8, ptr %tmp2, align 1
+ %add = add i8 %tmp1, %tmp3
+ %qi = getelementptr i8, ptr %q, i64 %i
+ store i8 %add, ptr %qi, align 1
+ %i.next = add nuw nsw i64 %i, 1
+ %cond = icmp eq i64 %i.next, %n
+ br i1 %cond, label %for.end, label %for.body, !prof !0
+
+for.end:
+ ret void
+}
+
+; For a loop with unknown trip count but a profile showing an approx TC estimate of 129,
+; when we force VF 64, it should use IC 1, since chances are high that the remainder loop
+; will need to run
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 64, interleaved count: 1)
+define void @loop_with_profile_tc_129(ptr %p, ptr %q, i64 %n) {
+entry:
+ br label %for.body
+
+for.body:
+ %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+ %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+ %tmp1 = load i8, ptr %tmp0, align 1
+ %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+ %tmp3 = load i8, ptr %tmp2, align 1
+ %add = add i8 %tmp1, %tmp3
+ %qi = getelementptr i8, ptr %q, i64 %i
+ store i8 %add, ptr %qi, align 1
+ %i.next = add nuw nsw i64 %i, 1
+ %cond = icmp eq i64 %i.next, %n
+ br i1 %cond, label %for.end, label %for.body, !prof !1
+
+for.end:
+ ret void
+}
+
+!0 = !{!"branch_weights", i32 1, i32 127}
+!1 = !{!"branch_weights", i32 1, i32 128}
The initial tests are added in a separate PR, #70272.
You can test this locally with the following command:

git-clang-format --diff 2f4328e6979004fbf531d69a40c2e06d43d3128c 3104a425e31aad4defdeda4e0bb8a8c2d0549703 -- llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

View the diff from clang-format here.

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 5786e8e38c..45f2c442cd 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3841,8 +3841,7 @@ void InnerLoopVectorizer::fixReduction(VPReductionPHIRecipe *PhiR,
TTI->preferPredicatedReductionSelect(
RdxDesc.getOpcode(), PhiTy,
TargetTransformInfo::ReductionFlags())) {
- auto *VecRdxPhi =
- cast<PHINode>(State.get(PhiR, Part));
+ auto *VecRdxPhi = cast<PHINode>(State.get(PhiR, Part));
VecRdxPhi->setIncomingValueForBlock(VectorLoopLatch, Sel);
}
}
@@ -8309,18 +8308,17 @@ VPWidenCallRecipe *VPRecipeBuilder::tryToWidenCall(CallInst *CI,
// Is it beneficial to perform intrinsic call compared to lib call?
bool ShouldUseVectorIntrinsic =
- ID && LoopVectorizationPlanner::getDecisionAndClampRange(
- [&](ElementCount VF) -> bool {
- Function *Variant;
- // Is it beneficial to perform intrinsic call compared to lib
- // call?
- InstructionCost CallCost =
- CM.getVectorCallCost(CI, VF, &Variant);
- InstructionCost IntrinsicCost =
- CM.getVectorIntrinsicCost(CI, VF);
- return IntrinsicCost <= CallCost;
- },
- Range);
+ ID &&
+ LoopVectorizationPlanner::getDecisionAndClampRange(
+ [&](ElementCount VF) -> bool {
+ Function *Variant;
+ // Is it beneficial to perform intrinsic call compared to lib
+ // call?
+ InstructionCost CallCost = CM.getVectorCallCost(CI, VF, &Variant);
+ InstructionCost IntrinsicCost = CM.getVectorIntrinsicCost(CI, VF);
+ return IntrinsicCost <= CallCost;
+ },
+ Range);
if (ShouldUseVectorIntrinsic)
return new VPWidenCallRecipe(*CI, make_range(Ops.begin(), Ops.end()), ID);
With #70272 landed, could you rebase this one again?
TargetLowering.h shouldn't include any passes and thus pull in the entire pass infrastructure. Replace the include with forward declarations.
…vm#72444) This patch removes the val field from the `MapInfoOp`. Previously when lowering `TargetOp`, the bounds information for the `BoxValues` were also being mapped. Instead these ops are now cloned inside the target region to prevent mapping of non reference typed values.
The following pattern fails on recent GCC versions with the -std=c++20 flag passed and succeeds with -std=c++17. Such behavior is not observed on Clang 16.0. ``` template <typename T> struct Foo { Foo<T>(int a) {} }; ``` This patch removes the template parameter from the constructor in two occurrences to make the following command complete successfully: bazel build -c fastbuild --cxxopt=-std=c++20 --host_cxxopt=-std=c++20 @llvm-project//mlir/... This patch is similar to https://reviews.llvm.org/D154782 Co-authored-by: Alexander Batashev <[email protected]>
Struct Module::Header is not a POD type. As such, qsort() and llvm::array_pod_sort() must not be used to sort it. This became an issue with the new implementation of qsort() in glibc 2.39 that is not guaranteed to be a stable sort, causing Headers to be re-ordered and corrupted. Replace the usage of llvm::array_pod_sort() with std::stable_sort() in order to fix this issue. The signature of compareModuleHeaders() has to be modified. Fixes llvm#73145.
Update regex to _explicitly_ show which exp versions are added. The previous regex used `exp[^e]` to avoid matching calls like: `@llvm.experimental.stepvector`. Note: ArmPL Mappings for scalable types are not yet utilized (eg, `llvm.exp10.nxv2f64`, `llvm.exp10.nxv4f32`), as `replace-with-veclib` pass needs improvements.
…lvm#72526) The code in the CloneInstructionsIntoPredec... function modified by this patch has a long history that dates back to 2011, see d715ec8. There, when folding branches, all dbg.value intrinsics seen when folding would be saved and then re-inserted at the end of whatever was folded. Over the last 12 years this behaviour has been preserved. However, IMO it's bad behaviour. If we have: inst1 dbg.value1 inst2 dbg.value2 And we fold that sequence into a different block, then we would want the instructions and variable assignments to appear in the same order. However because of this old behaviour, the dbg.values are sunk, and we get: inst1 inst2 dbg.value1 dbg.value2 This clustering of dbg.values can make assignments to the same variable invisible, as well as reducing the coverage of other assignments. This patch relaxes the CloneInstructions... function and allows it to clone and update dbg.values in-place, causing them to appear in the original order in the destination block. I've added some extra dbg.values to the updated test: without the changes to the pass, the dbg.values sink into a blob ahead of the select. The RemoveDIs code can't cope with this right now so I've removed the "--try..." flag, restored in a commit to land in a couple of hours. (Metadata changes to make the LLVM-IR parser not drop the debug-info for it being out of date. The RemoveDIs related RUN line has been removed because it was spuriously passing due to the debug-info being dropped).
…n type of function. (llvm#69724) Import of a function with `auto` return type that is expanded to a `SubstTemplateTypeParmType` could fail if the function itself is the template specialization where the parameter was replaced.
) This patch is rearranging code a bit to add WatchpointResources to Process. A WatchpointResource is meant to represent a hardware watchpoint register in the inferior process. It has an address, a size, a type, and a list of Watchpoints that are using this WatchpointResource. This current patch doesn't add any of the features of WatchpointResources that make them interesting -- a user asking to watch a 24 byte object could watch this with three 8 byte WatchpointResources. Or a Watchpoint on 1 byte at 0x1002 and a second watchpoint on 1 byte at 0x1003; these must both be served by a single WatchpointResource on that doubleword at 0x1000 on a 64-bit target, because if two hardware watchpoint registers were used to track these separately, one of them may not be hit. Or if you have one Watchpoint on a variable with a condition set, and another Watchpoint on that same variable with a command defined or different condition, or ignorecount, both of those Watchpoints need to evaluate their criteria/commands when their WatchpointResource has been hit. There's a bit of code movement to rearrange things in the direction I'll need for implementing this feature, so I want to start with reviewing & landing this mostly NFC patch and we can focus on the algorithmic choices about how WatchpointResources are shared and handled as they're triggered, separately. This patch also stops printing "Watchpoint <n> hit: old value: <x>, new value: <y>" for Read watchpoints. I could make an argument for printing "Watchpoint <n> hit: current value <x>" but the current output doesn't make any sense, and the user can print the value if they are particularly interested. Read watchpoints are used primarily to understand what code is reading a variable. This patch adds more fallbacks for how to print the objects being watched if we have types, instead of assuming they are all integral values, so a struct will print its elements. As large watchpoints are added, we'll be doing a lot more of those. To track the WatchpointSP in the WatchpointResources, I changed the internal API which took a WatchpointSP and devolved it to a Watchpoint*, which meant touching several different Process files. I removed the watchpoint code in ProcessKDP which only reported that watchpoints aren't supported; the base class does that already. I haven't yet changed how we receive a watchpoint to identify the WatchpointResource responsible for the trigger, and identify all Watchpoints that are using this Resource to evaluate their conditions etc. This is the same work that a BreakpointSite needs to do when it has been triggered, where multiple Breakpoints may be at the same address. There is not yet any printing of the Resources that a Watchpoint is implemented in terms of ("watchpoint list", or SBWatchpoint::GetDescription). "watchpoint set var" and "watchpoint set expression" take a size argument which was previously 1, 2, 4, or 8 (an enum). I've changed this to an unsigned int. Most hardware implementations can only watch 1, 2, 4, 8 byte ranges, but with Resources we'll allow a user to ask for different sized watchpoints and set them in hardware-expressible terms soon. I've annotated areas where I know there is work still needed with LWP_TODO that I'll be working on once this is landed. I've tested this on aarch64 macOS, aarch64 Linux, and Intel macOS. https://discourse.llvm.org/t/rfc-large-watchpoint-support-in-lldb/72116
…lvm#68926) This PR adds support for the PPA2 fields. --------- Co-authored-by: Yusra Syeda <[email protected]>
The previous optional class would call the destructor on a non-trivially destructible object regardless of whether it had already been reset. This patch fixes that by moving the tracking of whether the object exists into optional's internal storage class.
On macOS <vector> was already included via another header file, but this failed on the CI bots. I'd tested on linux earlier in this patch's life when the headers were differently arranged.
…3439) In the FreeBSD base system, re-executing the main binary when ASLR is detected was implemented in the following commits: * freebsd/freebsd-src@7cafe89f9ce33 * freebsd/freebsd-src@96fe7c8ab0f65 * freebsd/freebsd-src@930a7c2ac67e1 * freebsd/freebsd-src@0a736f0a6aeb0 * freebsd/freebsd-src@4c9a0adad1826 Squash all these to bring them into upstream compiler-rt. When ASLR is detected to be enabled, this first force-disables ASLR for the current process, then calls ReExec(). The ReExec() function gets a FreeBSD specific implementation for finding the path of the executed program, via the ELF auxiliary vector. This is done without calling into the regular elf_aux_info(3) function, as that makes use of several already-intercepted functions.
Summary: These tests are currently failing, disable them so we can keep the bots green until we find a better solution. The x64 tests are not the core target so this is low priority.
This adds functionality to tbd files similar to what `lipo -extract` and `lipo -remove` does for binaries.
The "Dim" prefix is a legacy left-over that no longer makes sense, since we have a very strict "Dimension" vs. "Level" definition for sparse tensor types and their storage.
…lvm#72666) This is the first in a planned patch series to teach our vector lowering how to exploit register boundaries in LMUL>1 types when VLEN is known to be an exact constant. This corresponds to code compiled by clang with the -mrvv-vector-bits=zvl option. For extract_vector_elt, if we have a constant index and a known vlen, then we can identify which register out of a register group is being accessed. Given this, we can do a sub-register extract for that register, and then shift any remaining index. This results in all constant index extracts becoming m1 operations, and thus eliminates the complexity concern for explode-vector idioms at high lmul.
This sanitizes the string for printing.
std::hash and ADT/Hashing::hash_value are non-deterministic functions whose results might vary across implementation/process/execution. Using xxh3 instead for computing hashes of BinaryFunctions and BinaryBasicBlock for stale profile matching. (A possible alternative is to use ADT/StableHashing.h based on FNV hashing but xxh3 seems to be more popular in LLVM) This is to address llvm#65241.
I recently moved from using vim to lunarvim. I noticed that `gcc` (lol) wasn't able to comment out lines, because commentstring is not set. Thanks to Chase Colman for the suggestion in LunarVim/LunarVim#4394.
This pass isn't used anywhere upstream and thus has no test coverage. For these reasons, remove it.
Since this patch is the first part of a two-part change (first changing the loops' interleave count computation, then removing the loop interleaving threshold), these tests remain unchanged for the current (first) patch but will be modified for the second patch. The original PR for these tests is llvm#70272.
I think something has gone wrong with the rebase - your patch now has 250 commits.
llvm#70209) …to also take into account the op itself
…vm#70907) Unless specified as a "raw" string, Python will try to interpret backslashes, which means they don't get passed on correctly to the underlying regex-engine, unless escaped manually (`\\`). Use raw strings in `llvm/test/lit.cfg.py` to avoid a `SyntaxWarning`: ``` llvm/test/lit.cfg.py:275: SyntaxWarning: invalid escape sequence '\d' match = re.search("release (\d+)\.(\d+)", ptxas_out) ``` Other such warning present in 17.0.x were already fixed in 7ed0f5b.
Not sure if this was because I had multiple remotes configured, but the rebasing got completely messed up! I have tried several ways to fix it, but it's not going back to the previous state. I'll close this PR & create a new one with the final version.
[LV] Change loops' interleave count computation
A set of microbenchmarks in llvm-test-suite (llvm/llvm-test-suite#26), when tested on an AArch64 platform, demonstrates that loop interleaving is beneficial in two cases:
1) when TC > 2 * VW * IC, such that the interleaved vectorized portion of the loop runs at least twice
2) when TC is an exact multiple of VW * IC, such that there is no epilogue loop to run
where TC = trip count, VW = vectorization width, IC = interleaving count
We change the interleave count computation based on this information, but we leave it unchanged when the flag InterleaveSmallLoopScalarReduction is set to true, since that flag handles a special case (https://reviews.llvm.org/D81416).
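
To make the two cases concrete, here is a small predicate capturing the profitability condition described above; the function is illustrative, not part of the patch:

```cpp
#include <cstdint>

// Case 1: the interleaved vector body runs at least twice (TC > 2 * VW * IC).
// Case 2: TC is an exact multiple of VW * IC, so no epilogue loop is needed.
bool interleavingLikelyProfitable(uint64_t TC, uint64_t VW, uint64_t IC) {
  return TC > 2 * VW * IC || TC % (VW * IC) == 0;
}
```

For example, with VW = 64 and IC = 2, TC = 128 satisfies case 2, while TC = 129 satisfies neither, which is why the patch selects IC 1 there.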