[SystemZ] Large compile time regression in SystemZTTIImpl::adjustInliningThreshold()
#134714
Ok, I take that back: there don't actually appear to be any globals with a disproportionately high number of users, but this is still an awful lot of globals (and users thereof) to be iterating over on ~every inlining decision.
Thanks for the report, I will take a look. Any chance of a test case, like a single file that takes a lot longer to compile?
I can try to get you a bitcode file tomorrow. It's going to be fairly large; do you have a preference for where I should upload it? In the meantime, if you feel like it and have cores that need exercise (it's fine if you don't!), the issue is reproducible by cloning zig-bootstrap and running e.g.
Would it work to upload it here if compressed? Otherwise I don't know one way or the other; I guess there should be some place for large files we could use.
I think the limit here is 25M, but let's see...
zig.txt (actually a bitcode file; renamed so it can be attached here). Weirdly enough, I can't actually reproduce the issue when I run LLVM 20's llc on it. Maybe I'm missing some flag?
Ah, the inliner is run in the middle-end, while llc runs the backend passes. It should work if you run the 'opt' tool instead of llc (with the IR emitted by the front-end). If you use -save-temps with clang, you should find a .bc file you could use with e.g. opt -O3 -mtriple=s390x-linux-gnu -mcpu=z16
Ah, that'll do it.
I tried it, and on my machine it takes 19 minutes with opt, but just 5 if I comment out the section about GlobalVariables (per above). I wonder if you would get rid of the compile time issue, and also not see any performance regressions, if that part simply didn't run on your huge module? It seems that there may not be any easy way to improve compile time except for guarding it against huge modules / callees like this. I tried std::unordered_set, but that did not help for me. The only thing that improved it a little (19 -> 17 minutes) was a guard that the GlobalValue has at least 22 uses, per the checks below it.
The test does appear unnecessarily complex to me. The following should have the same effect, while removing all memory allocation as well as the second loop nest:
Might be interesting to try. But this still might be too much if called for every inlining decision...
Some more observations: zig.bc is a huge module with ~61k GlobalVariables, ~600k calls into adjustInliningThreshold(), and no less than ~1 billion iterations in the loop over M->globals(). For comparison, there were ~10 million iterations over all of SPEC. Looking at the number of uses each GV has, this is concentrated in the lower numbers, but goes all the way up to ~2300 with zig.bc. Over SPEC, when the bonus is actually given, it goes even up to ~6500 users.

However, all this was aimed at the regression with perlbench in a particular file - regexec.c - namely inlining S_regcppop() and S_regcppush() into S_regmatch(). In this particular file, the number of globals is only 71. Those functions are inlined about 4 times each and they are relatively small: ~100 instructions. Over SPEC, this bonus is given ~18k times (out of ~660k calls), which may be too much, as IIRC the only benefit seen so far was with perl. Disregarding compile time, I thought this should be a good idea generally, but I guess if there is no benefit it could be restricted a bit.

I tried guarding all this with double the values of the number of globals and callee instruction count in regexec.c: if (M->global_size() < 150 && Callee->getInstructionCount() < 200) {. It was actually 1s faster on perlbench (preliminary), and might be one way forward. This would exclude the heuristic completely in modules like zig.bc.

I would say that the inlining case that needs to be handled is a relatively small function called 3+ times from Caller. The guard on the number of globals in the module is more of a reluctant workaround, as it doesn't really matter for the result. Maybe a value higher than 100 and less than 61k would be useful to avoid this problem. Using the instruction count alone is not enough.
I tried this:
but this was, to my surprise, 1 minute slower on zig.bc than the original. It seemed to be about the same across SPEC.
You could instead scan the instructions in the caller and callee, but that would have other pathological cases. You could implement a cached analysis that analyzes global usage (a la GlobalModRef). Though more generally, if you have an inlining heuristic that would get immediately rejected (and for very good reason!) if you tried to contribute it to InlineCost/InlineAdvisor, that heuristic probably should not exist in target-specific code either...
@uweigand Should we use a limiter like 'if (M->global_size() < 150 && Callee->getInstructionCount() < 200) {' or similar?
@nikic I didn't find anything named GlobalModRef, but maybe a cache could work. How would that work? I am wondering if it could work to have a cache in the TargetLowering (SystemZTargetLowering), much like the IsInternalCache already there. The difference here is that the Functions can now still be modified. Could each inliner pass call (at startup) a TTI hook to allow the target to clear such a cache and then assume that only inlining will happen, so that a cache of Function attributes, such as the set of users of a particular Value, could be used? Alternatively, maybe there could be an explicit object passed by the inliner to adjustInliningThreshold() that would hold this cache? Something like an InliningHeurCache base type that the target would derive and extend.
It seems to work well to first scan a small Callee for any such GlobalVariable that is used 10+ times, and then scan Caller only for those GVs already found. This way there doesn't have to be a limit on the number of GVs in the Module. See #137527.
…lvm#137527) Instead of always iterating over all GlobalVariables in the Module to find the case where both Caller and Callee are using the same GV heavily, first scan Callee (only if less than 200 instructions) for all GVs used more than 10 times, and then do the counting for the Caller for just those relevant GVs. The limit of 200 instructions makes sense as this aims to inline a relatively small function using a GV 10+ times. This resolves the compile time problem with zig, where main (compared to removing the heuristic) shows a 380% increase, but with this change a <0.5% increase (total user compile time with opt). Fixes llvm#134714. (cherry picked from commit 98b895d)
#106058 appears to have caused a large compile time regression in LLVM 20. zig-bootstrap used to be able to compile and link the zig binary for s390x-linux-(gnu,musl) in a matter of minutes when we were on LLVM 19. Now on LLVM 20, I decided to cancel the build after it had been running for about an hour.

Some investigation shows that a lot of time is being spent in SystemZTTIImpl::adjustInliningThreshold(), in particular in this section:

llvm-project/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
Lines 98 to 105 in c1c0d55

Some thoughts:

- This profile is from a RelWithDebInfo build of LLVM, where perhaps Global.users() was optimized out. I'm building a Debug compiler as I file this to hopefully find out...
- std::set seems less than ideal compared to std::unordered_set here. I tried std::unordered_set but it didn't seem to have too much of an impact. Still seems desirable to do, though.
- In general, all of this seems like a lot of work to be doing on ~every inlining decision.

cc @JonPsson1 @uweigand