Skip to content

LLVM IR for generic functions sometimes fails to inline post MergeFuncs with LTO #97552

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Nils-TUD opened this issue May 30, 2022 · 12 comments
Closed
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Nils-TUD
Copy link

For some reason, the compiler decides to not inline Vec::deref anymore if LTO is enabled, which ruins performance in my case. I tried to force the compiler to inline it via -Znew-llvm-pass-manager=no -Cinline-threshold=N, but even N=10000 (resulting in insane compile times and binary sizes) doesn't convince the compiler to inline the function. This behavior changed between Rust version 76d770a (inlined) and 6af09d2 (not inlined).

Code

I tried to create a more or less minimal example that reproduces the problem, which resulted in the following Rust program:

#[inline(never)]
fn do_something(args: &Vec<String>) -> bool {
    args[0] == "test"
}

fn main() {
    println!("{}", do_something(&std::env::args().collect::<Vec<_>>()));
}

I expected to see this happen: the Deref implementation of Vec is inlined regardless of whether LTO is used or not.

Instead, this happened: If LTO is enabled, the Deref implementation of Vec is not inlined. If LTO is disabled, it is inlined. That is, with LTO, the generated assembly code looks like this:

0000000000037bd0 <vec_deref_inline::do_something>:
   37bd0:       50                      push   %rax
   37bd1:       48 83 7f 10 00          cmpq   $0x0,0x10(%rdi)
   37bd6:       74 1d                   je     37bf5 <vec_deref_inline::do_something+0x25>
   37bd8:       48 8b 3f                mov    (%rdi),%rdi
   37bdb:       e8 30 00 00 00          call   37c10 <<alloc::vec::Vec<T,A> as core::ops::deref::Deref>::deref>
   37be0:       48 83 fa 04             cmp    $0x4,%rdx
   37be4:       75 0b                   jne    37bf1 <vec_deref_inline::do_something+0x21>
   37be6:       81 38 74 65 73 74       cmpl   $0x74736574,(%rax)
   37bec:       0f 94 c0                sete   %al
   37bef:       59                      pop    %rcx
   37bf0:       c3                      ret    
   37bf1:       31 c0                   xor    %eax,%eax
   37bf3:       59                      pop    %rcx
   37bf4:       c3                      ret    
   37bf5:       48 8d 15 c4 df 00 00    lea    0xdfc4(%rip),%rdx        # 45bc0 <__do_global_dtors_aux_fini_array_entry+0x1b68>
   37bfc:       31 ff                   xor    %edi,%edi
   37bfe:       31 f6                   xor    %esi,%esi
   37c00:       e8 7b c5 fc ff          call   4180 <core::panicking::panic_bounds_check>
   37c05:       0f 0b                   ud2    
   37c07:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
   37c0e:       00 00 

0000000000037c10 <<alloc::vec::Vec<T,A> as core::ops::deref::Deref>::deref>:
   37c10:       48 8b 07                mov    (%rdi),%rax
   37c13:       48 8b 57 10             mov    0x10(%rdi),%rdx
   37c17:       c3                      ret    

Version it worked on

rustc --version --verbose:

rustc 1.61.0-nightly (76d770ac2 2022-04-02)
binary: rustc
commit-hash: 76d770ac21d9521db6a92a48c7b3d5b2cc535941
commit-date: 2022-04-02
host: x86_64-unknown-linux-gnu
release: 1.61.0-nightly
LLVM version: 14.0.0

In this version, Vec::deref is inlined in every Rust program I have (LTO=on). In the example above, it is also inlined regardless of whether LTO is used or not.

Version with regression

rustc --version --verbose:

rustc 1.61.0-nightly (6af09d250 2022-04-03)
binary: rustc
commit-hash: 6af09d2505f38e4f1df291df56d497fb2ad935ed
commit-date: 2022-04-03
host: x86_64-unknown-linux-gnu
release: 1.61.0-nightly
LLVM version: 14.0.0

In this version, Vec::deref is never inlined as it seems (LTO=on). In the example above, it is only inlined if LTO is disabled.

@Nils-TUD Nils-TUD added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels May 30, 2022
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label May 30, 2022
@saethlin
Copy link
Member

See also #89389

@nbdd0121
Copy link
Contributor

I think this is just due to the missing #[inline] attribute -- I have seen multiple cases where putting #[inline] on a polymorphic function helps the inlining decision during optimisation.

@tmiasko
Copy link
Contributor

tmiasko commented May 30, 2022

I think this is just due to the missing #[inline] attribute -- I have seen multiple cases where putting #[inline] on a polymorphic function helps the inlining decision during optimisation.

#[inline] attribute should indeed work around the issue. The problem is more about the form of the IR than about inlining decision per se.

The MergeFunc merges functions with compatible but different struct return types. When linking modules for LTO, callsites to merged functions will now include extra casts. Such calls cannot be inlined unless placed into the form expected by the inliner, but InstCombine doesn't have necessary transform:

https://github.com/llvm/llvm-project/blob/2e101cca690645d63ae4de1eb7b0e11d322448fd/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp#L3218-L3219

I would expect that this would be a non-issue with opaque pointers.

@tmiasko tmiasko added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels May 30, 2022
@klensy
Copy link
Contributor

klensy commented May 30, 2022

I can't find deref in your example locally, even with rustc tt.rs --emit=asm -Copt-level=1 -Clto -C "llvm-args=-x86-asm-syntax=intel"
rustc 1.63.0-nightly (ebbcbfc23 2022-05-27)

opt-level=1:

_ZN2tt12do_something17h17355b1b96872ed5E:
.Lfunc_begin9:
.seh_proc _ZN2tt12do_something17h17355b1b96872ed5E
	sub	rsp, 40
	.seh_stackalloc 40
	.seh_endprologue
	cmp	qword ptr [rcx + 16], 0
	je	.LBB9_5
	mov	rax, qword ptr [rcx]
	mov	r8, qword ptr [rax + 16]
	cmp	r8, 4
	jne	.LBB9_2
	mov	rcx, qword ptr [rax]
	lea	rdx, [rip + anon.8da7dae456594b2e933d3f3e66d20965.4]
	call	memcmp
	test	eax, eax
	sete	al
	jmp	.LBB9_4
.LBB9_2:
	xor	eax, eax
.LBB9_4:
	add	rsp, 40
	ret
.LBB9_5:
	lea	r8, [rip + anon.8da7dae456594b2e933d3f3e66d20965.3]
	xor	ecx, ecx
	xor	edx, edx
	call	_ZN4core9panicking18panic_bounds_check17h9bd8ac36ac287196E
	ud2
.Lfunc_end9: 

opt-level=2:

_ZN2tt12do_something17h17355b1b96872ed5E:
.Lfunc_begin6:
.seh_proc _ZN2tt12do_something17h17355b1b96872ed5E
	sub	rsp, 40
	.seh_stackalloc 40
	.seh_endprologue
	cmp	qword ptr [rcx + 16], 0
	je	.LBB6_5
	mov	rax, qword ptr [rcx]
	cmp	qword ptr [rax + 16], 4
	jne	.LBB6_2
	mov	rax, qword ptr [rax]
	cmp	dword ptr [rax], 1953719668
	sete	al
	jmp	.LBB6_4
.LBB6_2:
	xor	eax, eax
.LBB6_4:
	add	rsp, 40
	ret
.LBB6_5:
	lea	r8, [rip + anon.8da7dae456594b2e933d3f3e66d20965.3]
	xor	ecx, ecx
	xor	edx, edx
	call	_ZN4core9panicking18panic_bounds_check17h9bd8ac36ac287196E
	ud2
.Lfunc_end6: 

@Nils-TUD
Copy link
Author

I can't find deref in your example locally, even with rustc tt.rs --emit=asm -Copt-level=1 -Clto -C "llvm-args=-x86-asm-syntax=intel"

hm, yes, that also results in no calls to Vec::deref for me.

I used this Cargo.toml:

[package]
name = "vec-deref-inline"
version = "0.1.0"

[profile.release]
lto = true
panic = 'abort'

And ran cargo build --release to obtain the assembly above.

@tmiasko
Copy link
Contributor

tmiasko commented May 30, 2022

I can't find deref in your example locally, even with rustc tt.rs --emit=asm -Copt-level=1 -Clto -C "llvm-args=-x86-asm-syntax=intel" rustc 1.63.0-nightly (ebbcbfc23 2022-05-27)

Try without --emit=asm or with -C codegen-units=16.

--emit=asm overrides the default number of codegen-units to 1, and multiple codegen units with particularly unfavorable partitioning are essential to reproduce the issue.

@thomcc
Copy link
Member

thomcc commented May 30, 2022

I wonder if this is related at all to the issue mentioned by @nikic here #96624 (comment) (which motivated that PR). This is different in that it happens on LTO, though, although I think the builds I was doing where that came up may have had LTO as well...

@tmiasko
Copy link
Contributor

tmiasko commented May 30, 2022

The minimal LLVM reproducer (opt is able to inline the first call in h, but not the second one, for reasons described in #97552 (comment)):

; ModuleID = 'llvm-link'
source_filename = "llvm-link"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@g = unnamed_addr alias { [0 x i64]*, i64 } (), bitcast ({ [0 x i8]*, i64 } ()* @f to { [0 x i64]*, i64 } ()*)

define { [0 x i8]*, i64 } @f() unnamed_addr {
start:
  ret { [0 x i8]*, i64 } undef
}

define void @h() unnamed_addr {
start:
  %0 = tail call { [0 x i8]*, i64 } @f()
  %1 = tail call { [0 x i64]*, i64 } @g()
  ret void
}

Nils-TUD added a commit to Barkhausen-Institut/M3 that referenced this issue May 31, 2022
Note that we only update to nightly-2022-04-03 due to a problem in newer
versions that leads to not inlining Vec::deref and thereby ruining
performance. See rust-lang/rust#97552 for
details.
@apiraino
Copy link
Contributor

Assigning priority as discussed in the Zulip thread of the Prioritization Working Group.

@rustbot label -I-prioritize +P-high +T-compiler

@rustbot rustbot added P-high High priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels May 31, 2022
bors added a commit to rust-lang-ci/rust that referenced this issue Jun 1, 2022
Add `#[inline]` to `Vec`'s `Deref/DerefMut`

This should help rust-lang#97552 (although I haven't verified).
@pnkfelix pnkfelix changed the title Vec::deref no longer inlined if LTO is enabled LLVM IR for generic functions sometimes fails to inline post MergeFuncs with LTO Oct 28, 2022
@pnkfelix
Copy link
Member

visiting for T-compiler 2022 Q3 P-high review

Retitled issue since the problem with Vec itself has been "addressed" by adding #[inline] attributes to the specific problematic impls. So the issue now tracks the more general problem that optimizations are not always working out as well as expected when mixing LTO and MergeFuncs and inlining.

Also downgrading priority since this is no longer a critical issue for the noted hot-path in Vec itself.

(also: We may want to form a project group or something about shifting to using opaque pointers in our LLVM backend.)

@rustbot label: -P-high +P-medium

@rustbot rustbot added P-medium Medium priority and removed P-high High priority labels Oct 28, 2022
@pnkfelix
Copy link
Member

(also @nikic notes that we already use opaque pointers, so this issue may actually be resolved? We should test that hypothesis.)

@pnkfelix
Copy link
Member

pnkfelix commented Nov 1, 2022

I just went and confirmed the hypothesis; namely, I checked 1. that I could reproduce the original problem atop nightly-2022-04-04, and 2. that even after I locally removed the #[inline] annotations that had been added in PR #97553 as a workaround, the problem still does not arise anew (presumably because the more recently introduced opaque pointers in LLVM are allowing the code to be successfully inlined).

@pnkfelix pnkfelix closed this as completed Nov 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

9 participants