
[SystemZ] Large compile time regression in SystemZTTIImpl::adjustInliningThreshold() #134714


Closed
alexrp opened this issue Apr 7, 2025 · 16 comments · Fixed by #137527
Comments

@alexrp
Member

alexrp commented Apr 7, 2025

#106058 appears to have caused a large compile time regression in LLVM 20. zig-bootstrap used to be able to compile and link the zig binary for s390x-linux-(gnu,musl) in a matter of minutes on LLVM 19. On LLVM 20, I canceled the build after it had been running for about an hour.

Some investigation shows that a lot of time is being spent in SystemZTTIImpl::adjustInliningThreshold(), in particular in this section:

for (const GlobalVariable &Global : M->globals())
  for (const User *U : Global.users())
    if (const Instruction *User = dyn_cast<Instruction>(U)) {
      if (User->getParent()->getParent() == Callee)
        CalleeGlobals.insert(&Global);
      if (User->getParent()->getParent() == Caller)
        CallerGlobals.insert(&Global);
    }

Some thoughts:

  • A module can have lots of globals.
    • This doesn't appear to be the problem here, however.
  • A global can have lots of users (think common deduplicated/merged constant values).
    • This seems to be the case here, though I couldn't determine exactly how many users we're talking about because I was using a RelWithDebInfo build of LLVM where p Global.users() was optimized out.
      • Building a Debug compiler as I file this to hopefully find out...
  • std::set seems less than ideal compared to std::unordered_set here.
    • I tried changing this to std::unordered_set but it didn't seem to have too much of an impact. Still seems desirable to do, though.

In general, all of this seems like a lot of work to be doing on ~every inlining decision.

cc @JonPsson1 @uweigand

@alexrp
Member Author

alexrp commented Apr 7, 2025

  • A module can have lots of globals.
    • This doesn't appear to be the problem here, however.

Ok, I take that back:

(gdb) p M->GlobalList.size()
$24 = 61011

There don't actually appear to be any globals with a disproportionately high number of users, but this is still an awful lot of globals (and users thereof) to be iterating over on ~every inlining decision.

@JonPsson1
Contributor

Thanks for the report, I will take a look. Any chance of a test case, like a single file that takes a lot longer to compile?

@JonPsson1 JonPsson1 self-assigned this Apr 7, 2025
@alexrp
Member Author

alexrp commented Apr 7, 2025

I can try to get you a bitcode file tomorrow. It's going to be fairly large; do you have a preference for where I should upload it?

In the meantime, if you feel like it and have cores that need exercise (it's fine if you don't!), the issue is reproducible by cloning zig-bootstrap and running e.g. CMAKE_GENERATOR=Ninja CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) ./build s390x-linux-gnu baseline. That'll build host LLVM + Zig and then use those to build LLVM + Zig to run on s390x-linux-gnu. The latter step is where the slowdown kicks in.

@JonPsson1
Contributor

Would it work to upload it here if compressed? Otherwise I don't know one way or the other; I guess there should be some place for large files we could use.

@alexrp
Member Author

alexrp commented Apr 8, 2025

Would it work to upload it here if compressed?

I think the limit on here is 25M, but let's see...

@alexrp
Member Author

alexrp commented Apr 8, 2025

zig.txt (actually a .tar.xz containing zig.bc but GitHub is silly about file extensions)

Weirdly enough, I can't actually reproduce the issue when I run LLVM 20's llc on this file, whether with -O0 or -O3. I don't quite understand why, because I don't think we do anything particularly unusual when setting up the optimization pipeline in Zig.

Maybe I'm missing some llc flags?

@JonPsson1
Contributor

Ah, the inliner runs in the middle-end, while llc only runs the backend passes. It should work if you run the opt tool instead of llc (on the IR emitted by the front-end). If you use -save-temps with clang, you should find a .bc file you could use with e.g.

opt -O3 -mtriple=s390x-linux-gnu -mcpu=z16

@alexrp
Member Author

alexrp commented Apr 8, 2025

Ah, that'll do it. opt -O3 -mtriple s390x-linux-gnu zig.bc on the file uploaded above seems to exhibit the problem.

@JonPsson1
Contributor

I tried it - and on my machine it takes 19 minutes with opt, but just 5 if I comment out the section about GlobalVariables (per above).

I wonder if you would get rid of the compile time issue, and also not see any performance regressions, if that part simply didn't run on your huge module?

It seems that there may not be any easy way to improve compile time except for guarding against huge modules / callees like this. I tried std::unordered_set, but that did not help for me. The only thing that improved it a little (19 -> 17 minutes) was a guard that the GlobalValue has at least 22 uses, per the checks below it.

@uweigand
Member

The test does appear unnecessarily complex to me. The following should have the same effect, while removing all memory allocation as well as the second loop nest:

  for (const GlobalVariable &Global : M->globals()) {
    bool UsedByCaller = false, UsedByCallee = false;
    for (const User *U : Global.users())
      if (const Instruction *User = dyn_cast<Instruction>(U)) {
        if (User->getParent()->getParent() == Callee)
          UsedByCallee = true;
        if (User->getParent()->getParent() == Caller)
          UsedByCaller = true;
      }
    if (UsedByCallee && UsedByCaller) {
      [...]
    }
  }

Might be interesting to try. But this still might be too much if called for every inlining decision ...

@JonPsson1
Contributor

Some more observations:

zig.bc is a huge module with ~61k GlobalVariables, ~600k calls into adjustInliningThreshold(), and no less than ~1 billion iterations in the loop over M->globals(). For comparison, there were ~10 million iterations over all of SPEC.

Looking at the number of uses each GV has, the distribution is concentrated at lower numbers, but goes all the way up to ~2300 with zig.bc. Over SPEC, when the bonus is actually given, it goes up to ~6500 users.

However, all this was aimed at the regression with perlbench in a particular file - regexec.c - namely inlining S_regcppop() and S_regcppush() into S_regmatch(). In this particular file, the number of globals is only 71. Those functions are inlined about 4 times each and they are relatively small: ~100 instructions.

Over SPEC, this bonus is given ~18k times (out of ~660k calls), which may be too much, as IIRC the only benefit seen so far was with perl. Disregarding compile-time, (I thought) this should be a good idea generally, but I guess if there is no benefit it could be restricted a bit.

I tried guarding all this with double the values of the number of globals and the callee instruction count in regexec.c:

if (M->global_size() < 150 && Callee->getInstructionCount() < 200) {
...

It was actually 1s faster on perlbench (preliminary), and might be one way forward. This would exclude the heuristic completely in modules like zig.bc.

I would say that the inlining case that needs to be handled is a relatively small function called 3+ times from the Caller. The guard on the number of globals in the module is more arbitrary, as it doesn't really matter for the result. Maybe a value higher than 100 and less than 61k would be useful to avoid this problem. Using the instruction count alone is not enough.

@JonPsson1
Contributor

I tried this:

  // Give bonus for globals used heavily in both caller and callee.
  for (const GlobalVariable &Global : M->globals()) {
    bool UsedByCaller = false, UsedByCallee = false;
    for (const User *U : Global.users()) {
      if (const Instruction *User = dyn_cast<Instruction>(U)) {
        if (User->getParent()->getParent() == Callee)
          UsedByCallee = true;
        if (User->getParent()->getParent() == Caller)
          UsedByCaller = true;
      }
      if (UsedByCallee && UsedByCaller) {
        unsigned CalleeStores = 0, CalleeLoads = 0;
        unsigned CallerStores = 0, CallerLoads = 0;
        countNumMemAccesses(&Global, CalleeStores, CalleeLoads, Callee);
        countNumMemAccesses(&Global, CallerStores, CallerLoads, Caller);
        if ((CalleeStores + CalleeLoads) > 10 &&
            (CallerStores + CallerLoads) > 10)
          Bonus = 1000;
        break;
      }
    }
    if (Bonus == 1000)
      break;
  }

but this was to my surprise 1 minute slower on zig.bc than the original. It seemed to be about the same across SPEC.

@nikic
Contributor

nikic commented Apr 11, 2025

You could scan the instructions in the caller and callee instead, but that would have other pathological cases. You could also implement a cached analysis that analyzes global usage (a la GlobalModRef).

Though more generally, if you have an inlining heuristic that would get immediately rejected (and for very good reason!) if you tried to contribute it to InlineCost/InlineAdvisor, that heuristic probably should not exist in target-specific code either...

@JonPsson1
Contributor

@uweigand Should we use a limiter like 'if (M->global_size() < 150 && Callee->getInstructionCount() < 200) {' or similar?

@JonPsson1
Contributor

@nikic I didn't find anything named GlobalModRef, but maybe a cache could work. How would that work?

I am wondering if it could work to have a cache in the TargetLowering (SystemZTargetLowering), much like the IsInternalCache already there. The difference here is that the Functions can still be modified. Could each inliner pass call (at startup) a TTI hook to allow the target to clear such a cache and then assume that only inlining will happen, so that a cache of Function attributes, such as the set of users of a particular Value, could be used?

Alternatively, maybe there could be an explicit object passed by the inliner to adjustInliningThreshold() that would hold this cache? Something like an InliningHeurCache base type that the target would derive from and extend.

@JonPsson1
Contributor

It seems to work well to first scan a small Callee for any GlobalVariable that is used 10+ times, and then scan the Caller only for those GVs already found. This way there doesn't have to be a limit on the number of GVs in the Module. See #137527.

swift-ci pushed a commit to swiftlang/llvm-project that referenced this issue Apr 29, 2025
…lvm#137527)

Instead of always iterating over all GlobalVariables in the Module to
find the case where both Caller and Callee use the same GV heavily,
first scan the Callee (only if it has fewer than 200 instructions) for
all GVs used more than 10 times, and then do the counting in the Caller
for just those relevant GVs.

The limit of 200 instructions makes sense as this aims to inline a
relatively small function using a GV 10+ times.

This resolves the compile time problem with zig, where main (compared
to removing the heuristic) shows a 380% increase, but with this change
the increase is <0.5% (total user compile time with opt).

Fixes llvm#134714.

(cherry picked from commit 98b895d)
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this issue May 6, 2025
Ankur-0429 pushed a commit to Ankur-0429/llvm-project that referenced this issue May 9, 2025