-
Notifications
You must be signed in to change notification settings - Fork 13.6k
Closed
Labels
I-slowIssue: Problems and improvements with respect to performance of generated code.Issue: Problems and improvements with respect to performance of generated code.P-highHigh priorityHigh priorityT-compilerRelevant to the compiler team, which will review and decide on the PR/issue.Relevant to the compiler team, which will review and decide on the PR/issue.
Description
I noticed a significant regression since nightly 2018-01-27 when enabling LTO. Here is a simplified example that has 10x difference.
fn fill<T: Clone + 'static>(init: &'static T) -> Box<Fn(&usize) -> Vec<T>> {
Box::new(move |&i| {
let mut vec = Vec::<T>::new();
vec.resize(i, init.clone());
vec
})
}
fn main() {
let zeroes = fill(&0);
for _ in 0..1_000_000 {
zeroes(&1000);
}
}
Without LTO enabled:
$ perf stat -B -e cycles,instructions ./target/release/columnar
Performance counter stats for './target/release/regr':
335,160,070 cycles
539,078,439 instructions # 1.61 insn per cycle
0.112408607 seconds time elapsed
with LTO enabled:
Performance counter stats for './target/release/regr':
1,121,648,955 cycles
5,233,309,105 instructions # 4.67 insn per cycle
0.361163946 seconds time elapsed
Tested variants
Bad: -C lto=fat
Good: -C lto=thin
Good: -C lto=fat -C codegen-units=1
Metadata
Metadata
Assignees
Labels
I-slowIssue: Problems and improvements with respect to performance of generated code.Issue: Problems and improvements with respect to performance of generated code.P-highHigh priorityHigh priorityT-compilerRelevant to the compiler team, which will review and decide on the PR/issue.Relevant to the compiler team, which will review and decide on the PR/issue.
Type
Projects
Milestone
Relationships
Development
Select code repository
Activity
siddontang commentedon Jan 30, 2018
Refer http://www.brendangregg.com/perf.html#CPUstatistics, Gregg says:
But I am confused that why LTO uses so many instructions.
ollie27 commentedon Jan 30, 2018
What results do you get with
-Ccodegen-units=1
?luben commentedon Jan 30, 2018
@siddontang , higher IPC is good if we execute the same (or similar) number if instructions. Here with LTO it executes 10x number of instructions but without 10x increase of IPC that leads to 3x increased run-time.
@ollie27 , just tested, it solves it. Another correction, -C lto=thin also solves it, I didn't test it properly before.
alexcrichton commentedon Jan 30, 2018
The change list for this regression is unfortunately pretty large (between
rust-nightly-2018-01-26
andrust-nightly-2018-01-27
). Like @luben I can confirm that-C lto=fat
produces poor performance while-C lto=thin
produces good performance.The primary change in that change list though was switching the default back to "fat" LTO, it seems at some point in the past our fat lto regressed? I'm running out of time to help narrow this down but what I've found so far is that we had a surprising regression of this program between
nightly-2017-12-15
andnightly-2017-12-16
of just the optimized version (not even the LTO version). The change list doesn't have anything suspicious though.nikomatsakis commentedon Feb 1, 2018
(Just to be clear, this is a regression in the perf of the compiled code, right?)
nikomatsakis commentedon Feb 1, 2018
@alexcrichton do you think you will have time to chase this down? Should we try to find someone to assign?
nikomatsakis commentedon Feb 1, 2018
triage: P-high
Need to figure out what's happening here.
alexcrichton commentedon Feb 1, 2018
@nikomatsakis unfortunately I'm pretty tight on time, so it's probably best to find another soul to help out with this.
alexcrichton commentedon Feb 1, 2018
Er ok had some time to debug this. @eddyb it looks like this regression may be caused by #46623 or at least something is caused by that.
The logs I have are:
before
after
That is, with #46623 the benchmark without LTO regressed by 10x
eddyb commentedon Feb 1, 2018
@alexcrichton Maybe it's similar to #47062 (comment)? Is there any change around that nightly?
alexcrichton commentedon Feb 1, 2018
False alarm, @eddyb is right in that the commit I bisected above was fixed by #47007
alexcrichton commentedon Feb 1, 2018
Hm ok some more investigation. The numbers here all over the place and there's a whole bunch of changes that were in flight over the past few months. That being said I don't think that this is a regression!
So first up "fat" LTO is much slower than "thin" LTO on this benchmark. That doesn't make a lot of sense because it's supposed to be the other way around! I tried to hunt backwards seeing what could have changed around "fat" LTO but I went back pretty far and it was basically always constant. On my machine going all the way back to
nightly-2017-11-01
the benchmark takes ~0.4s when compiling with fat LTO and 16 CGUs.The next suspicion was that fat LTO was slower than turning off LTO (odd!). Turns out #46382 somehow magically fixed this. More specifically f9b0897 (the merge before #46382 ) resulted in 1.5s runtime w/o LTO and 0.4s with it (both using 16 cgus). The next merge (#46382) had a runtime of 0.08s w/o LTO (yes, that's without) and 0.4s with it (fat LTO that is). So somehow #46382 improved non LTO builds on this benchmark but kept normal LTO builds the same.
Now the final change here is the difference between thin and fat LTO. The switch to ThinLTO happened in #46382 as well, which means a
-C lto
argument actually improves to 0.08s on that commit. (I forcibly disabled thin via-Z thinlto=no
). It turns out though that this change was always there.So tl;dr;
All in all this is probably bug in LLVM why fat LTO isn't optimizing as well as a non-LTO build, but I'm overall going to remove the regression tag as I don't believe this is a regression any more as stable is a "fat" LTO by default as are all other channels right now.
dotdash commentedon Feb 1, 2018
Reverting 8bde2ac "rustc: Add
-C lto=val
option" fixes this.alexcrichton commentedon Feb 1, 2018
@dotdash indeed yeah but all that's doing is switching the default from fat LTO to ThinLTO, you can test that out by passing
-C lto=thin
and-C lto=fat
to see the difference for that.I'm not sure why, but it appears that this benchmark gets worse when compiled with full LTO. It happens to be faster with ThinLTO, which was the default on beta/nightly for a bit but is no longer the case.
nikomatsakis commentedon Feb 2, 2018
@alexcrichton great investigation! Seems like we can close the issue, and that @luben ought to use thin lto =)
I imagine this can happen, I mean it's all heuristics in the end.
alexcrichton commentedon Feb 2, 2018
Ah yes I believe we can close, @luben if you've got any more questions though please just let me know!