[MicroBenchmark,LoopInterleaving] Check performance impact of Loop Interleaving Count with varying loop iterations #26
…nterleaving Count with varying loop iterations. This microbenchmark attempts to find the right loop trip count threshold for deciding whether or not to interleave a loop, for different types of loops: loops with or without a reduction, and loops with or without vectorization. Note: An interleaving count of 1 means interleaving is disabled. Differential Revision: https://reviews.llvm.org/D159475
IIUC this adds benchmarks for the best case for interleaving. Would it be possible to also add variants where interleaving is less beneficial?
…ed a redundant function, added test cases where the compiler selects the vectorization configuration
This file adds benchmarks for testing cases that may or may not benefit from interleaving. For example, loops with low trip counts are better off without interleaving, whereas a loop starts showing a performance benefit once its interleaved portion runs at least twice. If we need to add more cases where interleaving is less beneficial, can you suggest some?
Some additional cases could be loops with a reduction plus a number of additional independent memory and compute operation chains (i.e. a larger loop body with multiple independent operation chains), and a loop with a larger body without a reduction (in that case, LLVM will likely decide not to interleave). They might provide additional insight into current issues with the cost modeling for interleaving.
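To make the suggested loop shapes concrete, here is a minimal sketch of the two kinds of loops being discussed. The function names and loop bodies are hypothetical and are not taken from the patch; the `#pragma clang loop` hints are just one way to pin the vectorization width and interleave count in Clang, and the guard keeps other compilers from seeing them.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: a small loop whose only work is a reduction.
// Every iteration feeds the single dependency chain through `sum`.
inline int sumReduction(const std::vector<int> &a) {
  int sum = 0;
#if defined(__clang__)
#pragma clang loop vectorize_width(4) interleave_count(2)
#endif
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += a[i];
  return sum;
}

// Hypothetical example: a bigger loop body that combines a reduction with
// two independent memory/compute chains (the stores to `c` and `d`).
// Assumption: b, c and d each hold at least a.size() elements.
inline int bigLoopWithReduction(const std::vector<int> &a,
                                const std::vector<int> &b,
                                std::vector<int> &c,
                                std::vector<int> &d) {
  int sum = 0;
  const std::size_t n = a.size();
  for (std::size_t i = 0; i < n; ++i) {
    sum += a[i];          // reduction chain
    c[i] = a[i] * 3 + 1;  // independent chain 1
    d[i] = b[i] - a[i];   // independent chain 2
  }
  return sum;
}
```

In the first loop the reduction is the only work, so interleaving mainly helps by breaking up that one dependency chain. The second loop already exposes independent work per iteration, which is the situation where the benefit is expected to be smaller.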
A set of microbenchmarks in llvm-test-suite (llvm/llvm-test-suite#26), when tested on an AArch64 platform, demonstrates that loop interleaving is beneficial in two cases:
1) when TC > 2 * VW * IC, such that the interleaved vectorized portion of the loop runs at least twice
2) when TC is an exact multiple of VW * IC, such that there is no epilogue loop to run
where TC = trip count, VW = vectorization width, IC = interleaving count.
We change the interleave count computation based on this information, but we leave it the same when the flag InterleaveSmallLoopScalarReductionTrue is set to true, since it handles a special case (https://reviews.llvm.org/D81416).
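The two conditions in this commit message can be written down as simple predicates. This is only a sketch: the helper names are hypothetical, and it does not reproduce LLVM's actual interleave count computation, which weighs other factors as well.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; this is not LLVM's cost model.
// tc = trip count (TC), vw = vectorization width (VW),
// ic = interleaving count (IC).

// Case 1: TC > 2 * VW * IC, i.e. the interleaved vector body runs
// at least twice.
inline bool interleavedBodyRunsAtLeastTwice(std::uint64_t tc,
                                            std::uint64_t vw,
                                            std::uint64_t ic) {
  return tc > 2 * vw * ic;
}

// Case 2: TC is an exact multiple of VW * IC, so no epilogue loop runs.
inline bool hasNoEpilogue(std::uint64_t tc, std::uint64_t vw,
                          std::uint64_t ic) {
  const std::uint64_t step = vw * ic;
  return step != 0 && tc % step == 0;
}

// Interleaving is treated as likely beneficial when either case holds.
inline bool interleavingLikelyBeneficial(std::uint64_t tc, std::uint64_t vw,
                                         std::uint64_t ic) {
  return interleavedBodyRunsAtLeastTwice(tc, vw, ic) ||
         hasNoEpilogue(tc, vw, ic);
}
```

For instance, with TC = 24, VW = 4 and IC = 2, both cases hold (24 > 16 and 24 = 3 * 8). With IC = 4 neither does, since 24 is not greater than 32 and 24 is not a multiple of 16.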
…d vectorization hints with preprocessor-macro-based template functions, as per reviewer comments
Replicated the existing tests for loops with bigger bodies that contain additional independent memory operations. These can be found in the latest patch by doing a case-insensitive search of the function names for "BigLoop".
Thanks for adding the extra loops, a few more comments inline
LGTM, thanks!
Hello. I think this patch broke several bots:
Could you please look at this?
Reverted it. Hope that it's OK and that it was indeed the culprit commit. (edit: yes it was)
Thanks for figuring this out & reverting it. Seems like adding too many tests ended up hitting the timeout. I'll try to reduce the test points to keep it within the time limit & re-land it. The tests passed for me locally though, even with the On a separate matter, I was checking the bot links you posted above, but I don't see myself or this patch in the |
Yes, the test does complete successfully here as well. I don't know the rules for adding new tests, but this one seems really long. As for the other thing, the responsible users and changes tabs should be read with care.
Also, I wonder why only the arm/aarch64 bots failed with this test. Maybe there's something to analyze here.
It runs for a little more than 40 minutes on my machine, but locally it doesn't hit the timeout. I added a new PR, #56, which runs a reduced test set by default and runs the whole set when compiled with a flag.
…terleaving Count with varying loop iterations (llvm#26) * [MicroBenchmarks,LoopInterleaving] Check performance impact of Loop Interleaving Count with varying loop iterations. This microbenchmark attempts to find the impact of loop interleaving count for different types of loops (big or small, with or without reductions inside them) over different vectorization factors for varying loop trip counts. Note: Interleaving count of 1 means interleaving is disabled. These microbenchmarks are to help guide changes in loop interleaving count computation and removal of trip count threshold for interleaving loops in llvm/llvm-project#67725 & related patches.
…the loop The current loop trip count threshold to allow loop interleaving is 128, which seems arbitrarily high & uncorrelated with factors like VW, IC, register pressure etc. A set of microbenchmarks in llvm-test-suite (llvm/llvm-test-suite#26), when tested on an AArch64 platform, shows that loop interleaving is beneficial even for loops with low trip counts. We have also found similar evidence in an application benchmark that, when compiled with PGO, shows a 40% regression when its hot loop with a profile-guided trip count of 24 doesn't get interleaved because of this threshold. Therefore, it seems reasonable to eliminate this threshold and use the trip count for computing the interleaving count instead (llvm#73766).
…ave a loop (#67725) A set of microbenchmarks (llvm/llvm-test-suite#26) showed that loop interleaving can be beneficial for loops with low trip count as well. Loop interleaving count computation is updated accordingly in prior patches while this patch removes the loop trip count threshold for interleaving.
This microbenchmark attempts to find the right loop trip count threshold for deciding whether or not to interleave a loop, for different types of loops: loops with or without a reduction, and loops with or without vectorization. For context, the loop vectorizer uses a threshold, TinyTripCountInterleaveThreshold, that is currently set to 128.