Status: Closed
Description
The benchmarks in crate matrixmultiply version 0.1.8 degrade with MIR enabled. (commit bluss/matrixmultiply@3d83647)
Tested using rustc 1.12.0-nightly (1deb02ea6 2016-08-12).
Typical output:
// -C target-cpu=native
test mat_mul_f32::m127 ... bench: 2,703,773 ns/iter (+/- 636,432)
// -Z orbit=off -C target-cpu=native
test mat_mul_f32::m127 ... bench: 648,817 ns/iter (+/- 22,379)
Sure, the matrix multiplication kernel uses some major muckery that it expects the compiler to optimize down and autovectorize, but since this is technically a regression, it gets a report.
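For context, the kernel in question relies on the optimizer keeping a small fixed-size accumulator in registers and vectorizing the tight inner loops. A minimal sketch of that shape (names and sizes are made up for illustration; this is not the crate's actual kernel):

```rust
// Hypothetical, simplified sketch of the kind of microkernel loop that
// matrixmultiply expects the optimizer to autovectorize: a nested
// fixed-size accumulator updated in a tight loop over the shared
// dimension k. Not the crate's actual code.
fn kernel(a: &[f32], b: &[f32], k: usize) -> [[f32; 4]; 4] {
    // Accumulator the optimizer should keep in registers (no memset,
    // no loads/stores in the hot loop after optimization).
    let mut ab = [[0.0f32; 4]; 4];
    for l in 0..k {
        for i in 0..4 {
            for j in 0..4 {
                ab[i][j] += a[l * 4 + i] * b[l * 4 + j];
            }
        }
    }
    ab
}

fn main() {
    // One k-step: a panel holds 0..4, b panel is all ones,
    // so ab[i][j] = a[i] * b[j] = i.
    let a = [0.0, 1.0, 2.0, 3.0];
    let b = [1.0f32; 4];
    let ab = kernel(&a, &b, 1);
    assert_eq!(ab[2][0], 2.0);
    println!("{:?}", ab[2]);
}
```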
Labels:
- Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
- Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html
- Issue: Problems and improvements with respect to performance of generated code.
- High priority
- Relevant to the compiler team, which will review and decide on the PR/issue.
- Performance or correctness regression from stable to beta.
Activity
pmarcelll commented on Aug 14, 2016
This might be caused by the LLVM upgrade.
rustc 1.12.0-nightly (7333c4a 2016-07-31) (before the LLVM upgrade):
rustc 1.12.0-nightly (28ce3e8 2016-08-01) (after the LLVM upgrade, old-trans is still the default):
Measured on an Intel Haswell processor.
bluss commented on Aug 14, 2016
Thank you, great info.
Title changed: "MIR autovectorization regression" → "LLVM autovectorization regression"

eddyb commented on Aug 14, 2016
Thanks to @dikaiosune I produced this minimization (for a different case): https://godbolt.org/g/3Nuofl. Seems to reproduce with clang 3.8 vs. running LLVM 3.9's opt -O3.

eddyb commented on Aug 14, 2016
@majnemer on IRC pointed out https://reviews.llvm.org/rL268972.
Title changed: "LLVM autovectorization regression" → "LLVM/MIR autovectorization regression"

eddyb commented on Aug 14, 2016
@pmarcelll I misread some of those reports, so it's a combination of LLVM 3.9 and MIR trans being used?
On a Haswell, -C target-cpu=native should imply sse4.2 AFAIK, so I'm not sure the linked issue explains the original problem observed here, only @dikaiosune's.

bluss commented on Aug 15, 2016
@eddyb Haswell has avx and avx2 too, doesn't it? So it should imply those as well.
pmarcelll commented on Aug 15, 2016
If my benchmarking is correct and i386 means no SIMD at all, then it's not just an autovectorization regression.
rustc 1.12.0-nightly (7333c4a 2016-07-31):
rustc 1.12.0-nightly (28ce3e8 2016-08-01):
The last one is especially interesting because the i386 version is faster than the haswell version (512,995 ns/iter vs. 570,056 ns/iter).

EDIT: same on the latest nightly.
[30 remaining items]
arielb1 commented on Aug 21, 2016
Problem code:
eddyb commented on Aug 21, 2016
@arielb1 The root problem can be seen in the IR generated by #35662 (comment) - which is left with a constant-length memset that isn't removed due to pass ordering problems. Solving that should help the more complex matrix multiplication code.
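The pattern being discussed can be illustrated with a small sketch (an assumed shape for illustration, not the actual testcase from the linked comment): a constant-length zero-initialization that lowers to a memset, immediately made dead by a full overwrite. The optimizer should delete the memset; with the pass ordering described above, it did not.

```rust
// Sketch of the dead-memset pattern: `[0.0f32; 8]` lowers to a
// constant-length memset(buf, 0, 32), but every byte is rewritten
// before any read, so the memset is dead and should be eliminated.
fn init_then_overwrite() -> [f32; 8] {
    let mut buf = [0.0f32; 8]; // lowers to a constant-length memset
    for (i, x) in buf.iter_mut().enumerate() {
        *x = i as f32; // full overwrite: the memset is dead code
    }
    buf
}

fn main() {
    let buf = init_then_overwrite();
    assert_eq!(buf[7], 7.0);
    println!("{:?}", buf);
}
```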
brson commented on Aug 25, 2016
Is this fixed after #35740?
Edit: Seems no.
eddyb commented on Aug 26, 2016
I've experimented with this change to LLVM:
It seems to result in the constant-length memset being removed in the simpler cases. However, I ended up re-doing the reduction and arrived at something similar.
That testcase does 2ns with old trans (beta) and 9ns with MIR trans + the modified LLVM.
The only real difference is the nesting of the ab array, which optimizes really poorly.

eddyb commented on Aug 29, 2016
I found the remaining problem: initializing the arrays right now uses < (UGT) while our iterators and C++ use != (NE) for the stop condition of the pointer. Fixing that and running GVN twice fixes the performance of @bluss' benchmark.
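The two stop conditions can be sketched in source form (a hypothetical illustration of the shapes involved, not the compiler's actual lowering): an indexed loop whose bound is a `<` comparison, versus an iterator loop, which compiles down to a pointer-against-end `!=` comparison.

```rust
// `<` stop condition: an index compared against the length, which
// becomes an unsigned < / UGT compare in LLVM IR.
fn fill_lt(buf: &mut [f32]) {
    let mut i = 0;
    while i < buf.len() {
        buf[i] = 1.0;
        i += 1;
    }
}

// `!=` stop condition: slice iterators advance a pointer until it
// equals the end pointer, an NE compare in LLVM IR.
fn fill_ne(buf: &mut [f32]) {
    for x in buf.iter_mut() {
        *x = 1.0;
    }
}

fn main() {
    let mut a = [0.0f32; 4];
    let mut b = [0.0f32; 4];
    fill_lt(&mut a);
    fill_ne(&mut b);
    assert_eq!(a, b); // same result; only the loop exit test differs
    println!("{:?}", a);
}
```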
nikomatsakis commented on Aug 31, 2016
This is so beautifully fragile.
eddyb commented on Aug 31, 2016
@nikomatsakis See #36124 (comment) for a quick explanation of why LLVM's reluctance is correct in general (even though it has enough information to optimize nested < loops working on nested local arrays).

bluss commented on Oct 19, 2016
More or less reopened this issue as #37276. It's not affecting matrixmultiply because I think the uninitialized + assignments workaround is sound (until they take uninitialized away from us).
This issue is left closed since it did end up finding & fixing a problem.
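The workaround bluss refers to, skipping the zeroing entirely and assigning every element before use, originally relied on std::mem::uninitialized. A sketch of the same idea in today's spelling with MaybeUninit (an assumption about the modern equivalent, not the crate's original code):

```rust
use std::mem::MaybeUninit;

// Sketch of the "uninitialized + assignments" workaround: avoid the
// zeroing memset entirely by starting from uninitialized storage and
// writing every element before any read.
fn kernel_no_memset() -> [f32; 8] {
    // An array of MaybeUninit needs no initialization, so no memset
    // is emitted here.
    let mut ab: [MaybeUninit<f32>; 8] =
        unsafe { MaybeUninit::uninit().assume_init() };
    for (i, x) in ab.iter_mut().enumerate() {
        x.write(i as f32); // every element assigned before use
    }
    // Safety: all 8 elements were initialized in the loop above.
    unsafe { std::mem::transmute::<_, [f32; 8]>(ab) }
}

fn main() {
    let ab = kernel_no_memset();
    assert_eq!(ab[3], 3.0);
    println!("{:?}", ab);
}
```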