Skip to content

derive(Debug) creates a fair bit of reduntant inlining work for the llvm optimizer #43843

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
oyvindln opened this issue Aug 13, 2017 · 2 comments

Comments

@oyvindln
Copy link
Contributor

After discussing it on the forums I bit I decided to look into what the optimizer spends time on, and inlining seems to be a significant part of it. There seems to be a lot of room for improvement here.
Tested by compiling syntex in rustc-benchmarks (though the inlining time seem quite significant in other crates as well.).

I counted which functions were inlined the most, and fmt::debug_tuple and to a lesser extent fmt::debug_struct, which are emitted when automatically deriving Debug, and which simply wrap two other functions inside fmt::builder show up a lot. Maybe derive(debug) could output the wrapped functions instead, or the equivalent code, so llvm doesn't have to spend time inlining them (unless this causes problems for debug info). fmt::arguments::new_v1 is also inlined a lot, and also plays a role in derive(Debug). Maybe something could be done about that one as well, as it's simply a function that creates a struct but doesn't do anything else.

Output from time-llvm-passes Using `RUSTFLAGS="-Z time-llvm-passes" cargo +nightly build --release 2> llvm_time.txt` in the syntex-0.42.2 directory in rustc-benchmarks.

It's a bit hard to say which crate is which, as the output from llvm and cargo isn't synced, though these are the bottom three (excluding the benchmark dummy) in reverse order:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 6.3800 seconds (6.2624 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.7920 ( 14.1%)   0.0960 ( 12.8%)   0.8880 ( 13.9%)   0.8489 ( 13.6%)  Function Integration/Inlining
   0.5400 (  9.6%)   0.0760 ( 10.2%)   0.6160 (  9.7%)   0.5365 (  8.6%)  X86 DAG->DAG Instruction Selection
   0.3080 (  5.5%)   0.0280 (  3.7%)   0.3360 (  5.3%)   0.3031 (  4.8%)  Combine redundant instructions
   0.2400 (  4.3%)   0.0080 (  1.1%)   0.2480 (  3.9%)   0.2227 (  3.6%)  Dead Store Elimination
   0.2040 (  3.6%)   0.0120 (  1.6%)   0.2160 (  3.4%)   0.2147 (  3.4%)  SROA
...
  Total Execution Time: 96.6840 seconds (96.6925 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   9.5480 ( 10.4%)   0.5440 ( 10.3%)  10.0920 ( 10.4%)  10.1854 ( 10.5%)  Function Integration/Inlining
   6.9000 (  7.5%)   0.7120 ( 13.5%)   7.6120 (  7.9%)   7.7547 (  8.0%)  X86 DAG->DAG Instruction Selection
   5.8280 (  6.4%)   0.1120 (  2.1%)   5.9400 (  6.1%)   6.0239 (  6.2%)  Loop Invariant Code Motion
   4.5640 (  5.0%)   0.1680 (  3.2%)   4.7320 (  4.9%)   4.8229 (  5.0%)  Global Value Numbering
   3.8240 (  4.2%)   0.1920 (  3.6%)   4.0160 (  4.2%)   4.1534 (  4.3%)  Combine redundant instructions
...
  Total Execution Time: 5.4880 seconds (5.5137 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.4040 (  8.3%)   0.0680 ( 10.9%)   0.4720 (  8.6%)   0.4675 (  8.5%)  Function Integration/Inlining
   0.3000 (  6.2%)   0.0720 ( 11.5%)   0.3720 (  6.8%)   0.3340 (  6.1%)  X86 DAG->DAG Instruction Selection
   0.2720 (  5.6%)   0.0320 (  5.1%)   0.3040 (  5.5%)   0.2744 (  5.0%)  Global Value Numbering
   0.2440 (  5.0%)   0.0120 (  1.9%)   0.2560 (  4.7%)   0.2738 (  5.0%)  Loop Invariant Code Motion
   0.2280 (  4.7%)   0.0120 (  1.9%)   0.2400 (  4.4%)   0.2527 (  4.6%)  Combine redundant instructions
...
Most inlined functions: Running `RUSTFLAGS="-C remark=inline -g" cargo +nightly build --release 2> inlined.txt`, and filtering the output using this: ``` grep '.*[[:alnum:]]\{5\} inlined into' inlined.txt | sed 's/.*[0-9]: //g; s/d into.*/d into/' | sort | uniq -c | sort -gr > unique_inlines.txt ``` (Yeah it's a bit of a monstrosity, and takes a long time, maybe I could look into making some tool analyze this if it turns out to be useful.) gives:
   1865 _ZN4core3ptr13drop_in_place17hffb7ed55089717f8E inlined into
    917 _ZN50_$LT$T$u20$as$u20$core..convert..From$LT$T$GT$$GT$4from17h99c56b7d2123957eE inlined into
    677 _ZN4core3fmt9Arguments6new_v117h0a32116e8ade096aE inlined into
    549 _ZN4core3ptr13drop_in_place17h8905708a4bea3261E inlined into
    517 _ZN4core3fmt9Formatter11debug_tuple17h312acfb009988478E inlined into
    484 _ZN50_$LT$T$u20$as$u20$core..convert..From$LT$T$GT$$GT$4from17hf6536c6dc214691bE inlined into
    438 _ZN5alloc9allocator6Layout4size17h278602cb6317171dE inlined into
    402 _ZN63_$LT$alloc..allocator..Layout$u20$as$u20$core..clone..Clone$GT$5clone17hba86f6db001903cdE inlined into
    381 _ZN61_$LT$alloc..heap..Heap$u20$as$u20$alloc..allocator..Alloc$GT$7dealloc17h3db9762b4fdfd21eE inlined into
    344 _ZN5alloc4heap15exchange_malloc17he9800f033a6330e4E inlined into
    341 _ZN4core3ptr13drop_in_place17hb5325e51b98a161aE inlined into
...
    82 _ZN4core3fmt9Formatter12debug_struct17h5cfb920e90696799E inlined into
...
rustc --version ``` rustc 1.21.0-nightly (f774bce 2017-08-12) ```
@sfackler
Copy link
Member

We should just drop the #[inline] from those functions - debug printing is something we want to be optimizing on code size/compile time rather than perf.

@oyvindln
Copy link
Contributor Author

oyvindln commented Aug 13, 2017

Yup, that's a good point.

Though, for debug_tuple/debug_struct, it may still be worth to avoid that extra indirection given that the redundant function still has to be instantiated for each type (and option and other used containers that derive debug) even if it's not inlined.
EDIT: Actually since they're just generic over lifetimes it looks like that's not an issue.

@oyvindln oyvindln changed the title derive(Debug) creates a fair bit reduntant inlining work for the llvm optimizer derive(Debug) creates a fair bit of reduntant inlining work for the llvm optimizer Aug 13, 2017
sfackler added a commit to sfackler/rust that referenced this issue Aug 14, 2017
The inner methods aren't inlined, so this puts more pressure on LLVM for
literally no benefit.

Closes rust-lang#43843
bors added a commit that referenced this issue Aug 14, 2017
Don't inline debug methods

The inner methods aren't inlined, so this puts more pressure on LLVM for
literally no benefit.

Closes #43843
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants