
LTO: 10x regression #47866

Description

@luben

I noticed a significant regression since nightly 2018-01-27 when enabling LTO. Here is a simplified example that has 10x difference.

fn fill<T: Clone + 'static>(init: &'static T) -> Box<Fn(&usize) -> Vec<T>> {
    Box::new(move |&i| {
        let mut vec = Vec::<T>::new();
        vec.resize(i, init.clone());
        vec
    })
}

fn main() {
    let zeroes = fill(&0); 
    for _ in 0..1_000_000 {
        zeroes(&1000);
    }
}

Without LTO enabled:

$ perf stat -B -e cycles,instructions ./target/release/regr

 Performance counter stats for './target/release/regr':

       335,160,070      cycles                                                      
       539,078,439      instructions              #    1.61  insn per cycle         

       0.112408607 seconds time elapsed

With LTO enabled:


 Performance counter stats for './target/release/regr':

     1,121,648,955      cycles                                                      
     5,233,309,105      instructions              #    4.67  insn per cycle         

       0.361163946 seconds time elapsed

Tested variants (reproduction commands are sketched below):

Bad:  -C lto=fat
Good: -C lto=thin
Good: -C lto=fat -C codegen-units=1
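
For reference, a minimal sketch of how these variants can be built and measured. The file name regr.rs and the exact invocations are illustrative, not taken from the report; the flags are the ones listed above.

$ rustc -O regr.rs                                  # no LTO: fast
$ rustc -O -C lto=fat regr.rs                       # fat LTO, default codegen units: slow
$ rustc -O -C lto=thin regr.rs                      # thin LTO: fast
$ rustc -O -C lto=fat -C codegen-units=1 regr.rs    # fat LTO, one codegen unit: fast
$ perf stat -B -e cycles,instructions ./regr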

Activity

@siddontang commented on Jan 30, 2018

Referring to http://www.brendangregg.com/perf.html#CPUstatistics, Gregg says:

This includes instructions per cycle (IPC), labeled "insns per cycle", or in earlier versions, "IPC". This is a commonly examined metric, either IPC or its invert, CPI. Higher IPC values mean higher instruction throughput, and lower values indicate more stall cycles. I'd generally interpret high IPC values (eg, over 1.0) as good, indicating optimal processing of work.

But I am confused about why LTO uses so many more instructions.

@ollie27 (Member) commented on Jan 30, 2018

What results do you get with -Ccodegen-units=1?

Label I-slow (Issue: Problems and improvements with respect to performance of generated code) added on Jan 30, 2018.

@luben (Author) commented on Jan 30, 2018

@siddontang, higher IPC is good if we execute the same (or similar) number of instructions. Here the LTO build executes roughly 10x the instructions (5.23B vs 0.54B) but IPC only rises about 3x (4.67 vs 1.61), which works out to roughly 3x the cycles and run-time.

@ollie27, just tested: -C codegen-units=1 solves it. Another correction: -C lto=thin also solves it; I didn't test it properly before.

@alexcrichton (Member) commented on Jan 30, 2018

The change list for this regression is unfortunately pretty large (between rust-nightly-2018-01-26 and rust-nightly-2018-01-27). Like @luben I can confirm that -C lto=fat produces poor performance while -C lto=thin produces good performance.

The primary change in that change list, though, was switching the default back to "fat" LTO, so it seems that at some point in the past our fat LTO regressed. I'm running out of time to help narrow this down, but what I've found so far is that this program had a surprising regression between nightly-2017-12-15 and nightly-2017-12-16 in just the optimized version (not even the LTO version). That change list doesn't have anything suspicious in it though.

Label T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue) added on Jan 30, 2018.

@nikomatsakis (Contributor) commented on Feb 1, 2018

(Just to be clear, this is a regression in the perf of the compiled code, right?)

@nikomatsakis (Contributor) commented on Feb 1, 2018

@alexcrichton do you think you will have time to chase this down? Should we try to find someone to assign?

@nikomatsakis (Contributor) commented on Feb 1, 2018

triage: P-high

Need to figure out what's happening here.

@alexcrichton (Member) commented on Feb 1, 2018

@nikomatsakis unfortunately I'm pretty tight on time, so it's probably best to find another soul to help out with this.

@alexcrichton (Member) commented on Feb 1, 2018

Er, ok, I had some time to debug this. @eddyb, it looks like this regression may be caused by #46623, or at least something here is caused by that.

The logs I have are:

before

+ rustc -vV
rustc 1.24.0-nightly (50f6c3ece 2017-12-15)
binary: rustc
commit-hash: 50f6c3ece0ec738da48f8e77e6379a14bd02d1f4
commit-date: 2017-12-15
host: x86_64-unknown-linux-gnu
release: 1.24.0-nightly
LLVM version: 4.0
+ rustc foo.rs -O -C codegen-units=1
+ perf stat -e cycles ./foo

 Performance counter stats for './foo':

       331,404,612      cycles

       0.085892079 seconds time elapsed

after

+ rustc -vV
rustc 1.24.0-nightly (77efd6800 2017-12-15)
binary: rustc
commit-hash: 77efd6800c57ba83923dddbbabf03c7afa6a34a4
commit-date: 2017-12-15
host: x86_64-unknown-linux-gnu
release: 1.24.0-nightly
LLVM version: 4.0
+ rustc foo.rs -O -C codegen-units=1
+ perf stat -e cycles ./foo

 Performance counter stats for './foo':

     3,129,021,965      cycles

       0.806494586 seconds time elapsed

That is, with #46623 the benchmark without LTO regressed by 10x

@eddyb (Member) commented on Feb 1, 2018

@alexcrichton Maybe it's similar to #47062 (comment)? Is there any change around that nightly?

@alexcrichton (Member) commented on Feb 1, 2018

False alarm: @eddyb is right in that the regression from the commit I bisected above was fixed by #47007.

@alexcrichton (Member) commented on Feb 1, 2018

Hm, ok, some more investigation. The numbers here are all over the place and there are a whole bunch of changes that were in flight over the past few months. That being said, I don't think that this is a regression!

So first up, "fat" LTO is much slower than "thin" LTO on this benchmark. That doesn't make a lot of sense because it's supposed to be the other way around! I tried to hunt backwards to see what could have changed around "fat" LTO, but I went back pretty far and it was basically always constant: on my machine, going all the way back to nightly-2017-11-01, the benchmark takes ~0.4s when compiled with fat LTO and 16 CGUs.

The next suspicion was that fat LTO was slower than turning LTO off (odd!). It turns out #46382 somehow magically fixed this. More specifically, f9b0897 (the merge before #46382) resulted in a 1.5s runtime without LTO and 0.4s with it (both using 16 CGUs). The next merge (#46382) had a runtime of 0.08s without LTO (yes, that's without) and 0.4s with it (fat LTO, that is). So somehow #46382 improved non-LTO builds on this benchmark but kept normal LTO builds the same.

Now the final piece here is the difference between thin and fat LTO. The switch to ThinLTO happened in #46382 as well, which means passing -C lto actually improves the runtime to 0.08s on that commit (to measure fat LTO I forcibly disabled thin via -Z thinlto=no). It turns out, though, that this difference was always there.
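
A rough sketch of the two builds being compared on that commit (foo.rs matches the file name in the logs above; the exact invocations are illustrative, and -Z thinlto=no is the unstable flag mentioned here):

$ rustc foo.rs -O -C lto                  # ThinLTO (the default for -C lto on that commit): ~0.08s
$ rustc foo.rs -O -C lto -Z thinlto=no    # thin forcibly disabled, i.e. fat LTO: ~0.4s
$ perf stat -e cycles ./foo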


So, tl;dr:

All in all this is probably a bug in LLVM, in that fat LTO isn't optimizing as well as a non-LTO build. I'm going to remove the regression tag, though, as I don't believe this is a regression any more: stable defaults to "fat" LTO, as do all other channels right now.

@dotdash (Contributor) commented on Feb 1, 2018

Reverting 8bde2ac "rustc: Add -C lto=val option" fixes this.

@alexcrichton (Member) commented on Feb 1, 2018

@dotdash indeed, but all that's doing is switching the default from fat LTO to ThinLTO; you can test that out by passing -C lto=thin and -C lto=fat to see the difference.
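
For instance (illustrative invocations, reusing the foo.rs example from the earlier logs):

$ rustc foo.rs -O -C lto=thin    # fast on this benchmark
$ rustc foo.rs -O -C lto=fat     # noticeably slower on this benchmark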

I'm not sure why, but it appears that this benchmark gets worse when compiled with full LTO. It happens to be faster with ThinLTO, which was the default on beta/nightly for a bit but is no longer the case.

@nikomatsakis (Contributor) commented on Feb 2, 2018

@alexcrichton great investigation! Seems like we can close the issue, and that @luben ought to use thin lto =)

> I'm not sure why, but it appears that this benchmark gets worse when compiled with full LTO. It happens to be faster with ThinLTO, which was the default on beta/nightly for a bit but is no longer the case.

I imagine this can happen, I mean it's all heuristics in the end.

@alexcrichton (Member) commented on Feb 2, 2018

Ah yes, I believe we can close. @luben, if you've got any more questions though, please just let me know!
