[WIP] impl FromIterator for Option/Result via scan #59605

pnkfelix · 2019-04-01T14:52:43Z

This PR consists of three main things:

It swaps in a simpler (at least in terms of lines-of-code) implementation of FromIterator for Option and Result that uses the scan method to do the bulk of the work rather than the specialized adapter struct that the old implementation used.
It adds a micro-benchmark of FromIterator for Result to measure the performance of this operation, so that this change (or other future changes) does not cause the operation to slow down significantly.
It revises the implementations of Vec::extend and Iterator::scan in order to address performance issues uncovered by the above micro-benchmark.
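
For readers who have not seen the idea, here is a minimal sketch of collecting a Result through scan. It is not the code from this PR: the free function name collect_result_via_scan is hypothetical (a free function is used because an outside crate cannot add impls to Result), and it only assumes the stable Iterator::scan API.

```rust
/// Collects `Ok` values into `C`, stopping at the first `Err`.
/// Hypothetical helper for illustration; not the libcore implementation.
fn collect_result_via_scan<T, E, I, C>(iter: I) -> Result<C, E>
where
    I: IntoIterator<Item = Result<T, E>>,
    C: std::iter::FromIterator<T>,
{
    // Slot that records the first error seen, if any.
    let mut err = None;
    let collected: C = iter
        .into_iter()
        // The scan state is a mutable borrow of `err`; returning `None`
        // from the closure ends the iteration early.
        .scan(&mut err, |err, item| match item {
            Ok(value) => Some(value),
            Err(e) => {
                **err = Some(e);
                None
            }
        })
        .collect();
    match err {
        Some(e) => Err(e),
        None => Ok(collected),
    }
}
```

The whole early-exit logic lives in the closure, which is why this version is shorter than a dedicated adapter struct; whether the optimizer sees through it equally well is exactly what the benchmark is meant to check.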

Some (lightly edited) notes from the original PR post follow, but I have removed three of the four original benchmarks (you can find more about them in #11084).

The PR was initially marked WIP, because in my experiments on my Linux desktop machine, even when I compile with optimize=true, debug=false, codegen-units=1 and incremental=false, I still see a performance regression on this particular micro-benchmark.

I have a bunch of notes and data about this in the comment thread here; it has an LTO off/on comparison. Compare the lines that say "using_baseline" (or "using_adapter", which should be roughly equivalent) with the lines that say "with_scan", to get an idea of the effect of this PR in various contexts.
The only way I have seen to reliably bring the performance back in line with expectations is to enable LTO in some form.
In particular, if you have codegen-units=1, you need to explicitly enable -C lto=thin in order to get competitive performance out of the "with_scan" implementation. The default with codegen-units=1 is suboptimal; -C lto=thin gives you "whole crate graph" LTO.
And if you have codegen-units > 1, then the default (which corresponds to something called "local Thin LTO") will yield the "best" performance.
- I put "best" in quotes, because the performance for codegen-units > 1 here is far worse than codegen-units = 1.

Anyway, the micro-benchmarks added here include an explicit encoding of the adapter-based implementation of FromIterator for Result, so that one can see how the new implementation compares out of the box (that is, without enabling ThinLTO for codegen-units=1 on the bootstrapped benchmark build).
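
As a rough illustration of what "adapter-based" means here, the following sketch wraps the source iterator in a small struct that yields the Ok values and remembers the first Err. The names Adapter and collect_result_via_adapter are placeholders chosen for this example, and the details may differ from the code actually benchmarked.

```rust
/// Yields the `Ok` payloads of the inner iterator and records the first
/// error. Illustrative only; field and type names are assumptions.
struct Adapter<I, E> {
    iter: I,
    err: Option<E>,
}

impl<T, E, I> Iterator for Adapter<I, E>
where
    I: Iterator<Item = Result<T, E>>,
{
    type Item = T;

    fn next(&mut self) -> Option<T> {
        match self.iter.next() {
            Some(Ok(value)) => Some(value),
            Some(Err(e)) => {
                // Remember the error and signal end-of-iteration.
                self.err = Some(e);
                None
            }
            None => None,
        }
    }
}

fn collect_result_via_adapter<T, E, I, C>(iter: I) -> Result<C, E>
where
    I: IntoIterator<Item = Result<T, E>>,
    C: std::iter::FromIterator<T>,
{
    let mut adapter = Adapter { iter: iter.into_iter(), err: None };
    // `by_ref` lets us inspect `adapter.err` after collecting.
    let collected: C = adapter.by_ref().collect();
    match adapter.err {
        Some(e) => Err(e),
        None => Ok(collected),
    }
}
```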

rust-highfive · 2019-04-01T14:52:54Z

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

pnkfelix · 2019-04-01T15:01:09Z

As I noted in the description, I included micro-benchmarks for the old implementation.

This makes it really easy to see the performance regression that this re-implementation currently introduces: just run the benchmark (x.py bench src/libcore) and look at the lines that look like this (these are taken from my Linux desktop box, a dual-processor (= 8-core) Intel i7-4790 @ 3.60GHz):

test iter::bench_result_from_iter_into_last                     ... bench:       3,640 ns/iter (+/- 121)
test iter::bench_result_from_iter_into_last_old                 ... bench:         665 ns/iter (+/- 0)
test iter::bench_result_from_iter_into_vec                      ... bench:       3,483 ns/iter (+/- 12)
test iter::bench_result_from_iter_into_vec_old                  ... bench:       1,929 ns/iter (+/- 33)

As you can see, when collecting into a Vec, we see a slowdown of 1.8x.

When you collect into Last (an "anti-collection" type that side-steps allocation costs for benchmarking purposes), you see the slowdown is 5.5x.
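
The thread does not show the definition of Last, so the following is only a guess at its shape: a collection-like type whose FromIterator impl keeps the final element and allocates nothing.

```rust
/// Hypothetical "anti-collection": consumes every element but stores only
/// the last one, so collecting into it performs no heap allocation.
struct Last<T>(Option<T>);

impl<T> std::iter::FromIterator<T> for Last<T> {
    fn from_iter<I: IntoIterator<Item = T>>(iter: I) -> Self {
        // `Iterator::last` drives the iterator to completion.
        Last(iter.into_iter().last())
    }
}
```

Collecting a range such as 1..=5 into Last<i32> leaves Some(5) in the wrapper, which makes the per-element iteration overhead the dominant cost being measured.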

That's why I've marked this PR a WIP: I don't want to blindly commit this change in the name of "code simplification" without seeing evidence that the microbenchmarks proposed in this PR are irrelevant.

Nonetheless, I still posted the PR itself (rather than abandoning these bits of code entirely). I did this for three main reasons:

If we do reject this change to the impl FromIterator for these two types, then these benchmarks should probably be added to the benchmark suite.
Also, if we do reject this change to the impl FromIterator for these two types, then we should also remove the // FIXME in each of them that suggests switching to a scan-based implementation after issue #11084 ("rust doesn't optimize closure in scan iterator") is resolved.

rust/src/libcore/option.rs

Lines 1341 to 1342 in 6315221

    // FIXME(#11084): This could be replaced with Iterator::scan when this
    // performance bug is closed.

rust/src/libcore/result.rs

Lines 1235 to 1236 in 6315221

    // FIXME(#11084): This could be replaced with Iterator::scan when this
    // performance bug is closed.

There's a decent chance that these micro-benchmarks actually are irrelevant, and that this change to the impl FromIterator for these two types should still land.

alexcrichton · 2019-04-01T17:29:21Z

Some interesting numbers! @pnkfelix have you run a profiler to see if there are any particular hot spots in the new implementation that weren't in the old one? If it requires LTO to be turned on to be fast, that probably means that something performance critical isn't getting inlined across crates and requires #[inline] maybe?

pnkfelix · 2019-04-02T09:24:24Z

@shepmaster ran a profiler early on in the investigation and found some "interesting" codegen in the hotspot: #11084 (comment)

I myself haven't run a profiler, not yet. I'll give it a quick whirl.

pnkfelix · 2019-04-02T10:04:39Z

Running the profiler on the benchmark iter::bench_result_from_iter_into_last (the one that shows the most egregious regression) shows a code sequence similar to that identified in @shepmaster's hotspot quoted above:

       │ 80:┌─→cmpq   $0x1,-0x8(%rbx)
 30.06 │    │↓ jne    b0
       │    │  lea    0x30(%rsp),%rdi
       │    │  mov    %rbx,%rsi
       │    │→ callq  *0x68a7b(%rip)        # 74ff0 <<alloc::string::String as core::clone::Clone>::clone>
       │    │  mov    0x30(%rsp),%rax
       │    │  lea    0x38(%rsp),%rcx
       │    │  movups (%rcx),%xmm0
       │    │  movaps %xmm0,(%rsp)
       │    │  mov    $0x1,%ecx
       │    │↓ jmp    b5
       │    │  nop
  0.24 │ b0:│  mov    (%rbx),%rax
       │    │  xor    %ecx,%ecx
       │ b5:│  mov    %rcx,0x50(%rsp)
 25.21 │    │  mov    %rax,0x58(%rsp)
 26.02 │    │  movaps (%rsp),%xmm0
  6.58 │    │  movups %xmm0,0x0(%r13)
  4.89 │    │  movups 0x0(%r13),%xmm0
  6.69 │    │  movaps %xmm0,(%rsp)
  0.09 │    │  test   %rcx,%rcx
       │    │↓ jne    100
       │    │  add    $0x20,%rbx
       │    │  mov    $0x1,%r12d
       │    │  mov    %rax,%rbp
       │    ├──add    $0xffffffffffffffe0,%r15
       │    └──jne    80

shepmaster · 2019-04-02T13:29:52Z

@shepmaster ran a profiler

I think you mean @dotdash, but I'm happy you thought of me ❤️

alexcrichton · 2019-04-02T14:32:16Z

Ok thanks! That looks like a pretty reasonable trace, without much to illuminate. I wonder though if you could gist a version that's a profile of what's there today? That code looks relatively optimal (no extraneous function calls at least), but it may be the case that the old version vectorized better or something like that.

pnkfelix · 2019-04-02T14:51:36Z

I wonder though if you could gist a version that's a profile of what's there today?

What are you asking me to gist here; the analogous perf annotate output for the iter::bench_result_from_iter_into_last_old? Or something else?

bors · 2019-04-02T16:28:46Z

☔ The latest upstream changes (presumably #59632) made this pull request unmergeable. Please resolve the merge conflicts.

alexcrichton · 2019-04-02T19:01:25Z

Oh sure yeah, if *_old matches the current implementation in master that'd do it!

I'm basically just curious at the assembly level what the differences are to help understand why the new version is slower than the old

pnkfelix · 2019-04-03T12:26:34Z

Okay here are some more complete transcriptions of the perf annotate output for the three cases of interest.

https://gist.github.com/pnkfelix/1b54b3272201d9f096a2289fd5712b52

Since I've taken the effort to transcribe the full machine code provided by perf annotate, I'll attempt to at least do a cursory comparison of these outputs. (And maybe also peek at the original MIR and/or LLVM IR we generated that led to these machine code sequences, though of course one must remember the machine code is post LTO...)

alexcrichton · 2019-04-03T16:12:04Z

Ok thanks! Unfortunately nothing obviously jumps out at me, so it seems like it's just inherently more branchy in the version in this PR for whatever reason, but without digging into the LLVM IR and such I wouldn't know why

scottmcm · 2019-04-04T03:06:02Z

A possible thought: The general FromIterator for vec uses a while-let loop

rust/src/liballoc/vec.rs

Line 1929 in 9ebf478

while let Some(element) = iterator.next() {

You might try flipping that to a .for_each to hit the specialized implementation in Scan:

rust/src/libcore/iter/adapters/mod.rs

Lines 1668 to 1673 in 9ebf478

    self.iter.try_fold(init, move |acc, x| {
        match f(state, x) {
            None => LoopState::Break(Try::from_ok(acc)),
            Some(x) => LoopState::from_try(fold(acc, x)),
        }
    }).into_try()

And if that doesn't work, scan is only overriding try_fold; it's possible that a custom fold would simplify easier in LLVM and get the old codegen back.
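
To make the suggestion concrete, here is a small sketch contrasting the two ways of draining an iterator into a Vec. Both function names are made up for this example; the real code lives in Vec::extend_desugared. The relevant fact is that the default Iterator::for_each is implemented on top of fold, so an adapter that overrides its folding methods gets to run its specialized loop, while a while-let loop always goes through next.

```rust
/// External iteration: every element goes through `Iterator::next`.
fn extend_with_next<T, I: Iterator<Item = T>>(v: &mut Vec<T>, mut iter: I) {
    while let Some(element) = iter.next() {
        v.push(element);
    }
}

/// Internal iteration: `for_each` hands control to the iterator, which
/// may use a specialized `fold`/`try_fold` if the adapter provides one.
fn extend_with_for_each<T, I: Iterator<Item = T>>(v: &mut Vec<T>, iter: I) {
    iter.for_each(|element| v.push(element));
}
```

The two functions produce the same Vec; only the code path taken inside the iterator differs, which is what can change the generated machine code.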

pnkfelix · 2019-04-04T10:44:07Z

You might try flipping that to a .for_each to hit the specialized implementation in Scan:

[...]

And if that doesn't work, scan is only overriding try_fold; it's possible that a custom fold would simplify easier in LLVM and get the old codegen back.

I went ahead and made changes based on this advice, and it does help iter::bench_result_from_iter_into_vec significantly, bringing the new implementation's performance in line with our expectations.

test iter::bench_result_from_iter_into_last_new                 ... bench:       3,339 ns/iter (+/- 98)
test iter::bench_result_from_iter_into_last_old                 ... bench:         661 ns/iter (+/- 2)
test iter::bench_result_from_iter_into_vec_new                  ... bench:       2,063 ns/iter (+/- 2)
test iter::bench_result_from_iter_into_vec_old                  ... bench:       1,934 ns/iter (+/- 8)

That's enough to convince me that we might be able to land this change to the impl FromIterator for Result (and for Option). I'm willing to throw away the _into_last micro-benchmark as not measuring anything interesting.

Of course, it also requires that someone review my revisions to Vec::extend_desugared and a new specialized <Scan as Iterator>::fold. I'll put them up shortly. (I want to double check whether both revisions are actually necessary to get the desired performance, or if the Vec::extend_desugared change is sufficient on its own.)
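
The revisions themselves are not shown in this thread. As a stand-in, the sketch below defines a scan-like adapter outside the standard library and gives it a dedicated fold; ScanLike and scan_like are hypothetical names, and the body is an assumption about what such a specialization could look like, written against stable APIs only (Result plays the role that LoopState plays inside libcore).

```rust
/// A stand-in for `core::iter::Scan`, defined locally so that `fold`
/// can be overridden. All names here are illustrative.
struct ScanLike<I, St, F> {
    iter: I,
    state: St,
    f: F,
}

fn scan_like<B, I, St, F>(iter: I, state: St, f: F) -> ScanLike<I, St, F>
where
    I: Iterator,
    F: FnMut(&mut St, I::Item) -> Option<B>,
{
    ScanLike { iter, state, f }
}

impl<B, I, St, F> Iterator for ScanLike<I, St, F>
where
    I: Iterator,
    F: FnMut(&mut St, I::Item) -> Option<B>,
{
    type Item = B;

    fn next(&mut self) -> Option<B> {
        let x = self.iter.next()?;
        (self.f)(&mut self.state, x)
    }

    // Dedicated `fold`: drive the inner iterator with `try_fold` and stop
    // as soon as the scan closure returns `None`.
    fn fold<Acc, G>(self, init: Acc, mut g: G) -> Acc
    where
        G: FnMut(Acc, B) -> Acc,
    {
        let ScanLike { mut iter, mut state, mut f } = self;
        let result: Result<Acc, Acc> = iter.try_fold(init, |acc, x| {
            match f(&mut state, x) {
                Some(b) => Ok(g(acc, b)),
                // `Err` is used purely as an early exit carrying the
                // accumulator; it does not represent a failure.
                None => Err(acc),
            }
        });
        match result {
            Ok(acc) | Err(acc) => acc,
        }
    }
}
```

With a closure that keeps a running sum and returns None once the sum exceeds 6, folding over 1, 2, 3, 4 visits the partial sums 1, 3 and 6 and then stops, so summing them gives 10.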

pnkfelix · 2019-04-05T10:16:28Z

(hmm, after a rebase, I am now seeing stack overflows with this PR applied. Marking WIP again.)

rust-highfive · 2019-04-05T11:11:31Z

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

travis_time:end:1a470f26:start=1554457028609868861,finish=1554457136322246540,duration=107712377679
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
$ export GCP_CACHE_BUCKET=rust-lang-ci-cache
Setting environment variables from .travis.yml
---
travis_time:start:test_assembly
Check compiletest suite=assembly mode=assembly (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
[01:19:29] 
[01:19:29] running 9 tests
[01:19:29] iiiiiiiii
[01:19:29] 
[01:19:29]  finished in 0.164
[01:19:29] travis_fold:end:test_assembly

---
travis_time:start:test_debuginfo
Check compiletest suite=debuginfo mode=debuginfo-both (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
[01:19:47] 
[01:19:47] running 121 tests
[01:20:17] .iiiii...i.....i..i...i..i.i.i..i.ii...i.....i..i....i..........iiii..........i...ii...i.......ii.i. 100/121
[01:20:23] i.i......iii.i.....ii
[01:20:23] 
[01:20:23]  finished in 35.531
[01:20:23] travis_fold:end:test_debuginfo

---
[01:32:15] ...............................................................................i.i.................. 400/931
[01:32:15] .................................................................................................... 500/931
[01:32:15] .................................................................................................... 600/931
[01:32:15] .................................................................................................... 700/931
[01:32:15] ......................F........................................F.................................... 800/931
[01:32:17] ...............................
[01:32:17] failures:
[01:32:17] 
[01:32:17] ---- option::test_collect stdout ----
---
[01:32:17] 
[01:32:17] error: test failed, to rerun pass '--test coretests'
[01:32:17] 
[01:32:17] 
[01:32:17] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "test" "--target" "x86_64-unknown-linux-gnu" "-j" "4" "--release" "--locked" "--color" "always" "--features" "panic-unwind backtrace" "--manifest-path" "/checkout/src/libstd/Cargo.toml" "-p" "core" "--" "--quiet"
[01:32:17] 
[01:32:17] 
[01:32:17] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test
[01:32:17] Build completed unsuccessfully in 0:25:57
[01:32:17] Build completed unsuccessfully in 0:25:57
[01:32:17] Makefile:48: recipe for target 'check' failed
[01:32:17] make: *** [check] Error 1
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:0b08d661
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
Fri Apr  5 11:11:25 UTC 2019
---
travis_time:end:11e8dcf2:start=1554462687366630339,finish=1554462687372440308,duration=5809969
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:076c7964
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|obj/cores/core\.[0-9]*\.!checkout!\(.*\)|\1|;y|!|/|'); if [ -f "$EXE" ]; then printf travis_fold":start:crashlog\n\033[31;1m%s\033[0m\n" "$CORE"; gdb --batch -q -c "$CORE" "$EXE" -iex 'set auto-load off' -iex 'dir src/' -iex 'set sysroot .' -ex bt -ex q; echo travis_fold":"end:crashlog; fi; done || true
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:11870736
travis_time:start:11870736
$ cat ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
cat: ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers: No such file or directory
travis_fold:end:after_failure.5
travis_fold:start:after_failure.6
travis_time:start:1968b032
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

Dylan-DPC-zz · 2019-04-29T19:33:41Z

ping from triage @pnkfelix any updates?

pnkfelix · 2019-05-17T10:31:25Z

I have higher priority items to attack in the near term, and there isn't much clear value provided here anyway. Closing PR.

@pnkfelix

…scottmcm Refactoring use common code between option, result and accum `Option` and `Result` have almost exactly the same code that in `accum.rs` that implement `Sum` and `Product`. This PR just move some code to use the same code for all of them. I believe is better to not implement this `Iterator` feature twice. I'm not very familiar with pub visibility hope I didn't make then public. However, maybe these adapters could be useful and we could think to make then pub. rust-lang#59605 rust-lang#11084 r? @pnkfelix

rust-highfive assigned alexcrichton Apr 1, 2019

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 1, 2019

pnkfelix mentioned this pull request Apr 1, 2019

rust doesn't optimize closure in scan iterator #11084

Closed

jonas-schievink added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Apr 1, 2019

pnkfelix added 4 commits April 5, 2019 11:33

Benchmark for impl FromIterator for Result.

941d92c

(This is an attempt to ensure we do not regress performance here, since the performance of this operation varied pretty wildly over the course of rust-lang#11084.)

Rewrite impl FromIterator for Option and Result to use scan.

6bc90aa

Use v.for_each(|elem| ...) instead of for elem in v in Vec::ext…

6054658

…end_desugared`. This makes use of specialized Iterator methods (when available).

Specialize the Scan::fold method in the same manner as `Scan::try_f…

b210413

…old`. (It is easier to subsequently optimize this body, rather than starting from `Scan::try_fold`.)

pnkfelix force-pushed the from-iter-via-scan branch from c3fdf53 to b210413 Compare April 5, 2019 09:35

pnkfelix changed the title ~~[WIP] impl FromIterator for Option/Result via scan~~ impl FromIterator for Option/Result via scan Apr 5, 2019

pnkfelix changed the title ~~impl FromIterator for Option/Result via scan~~ [WIP] impl FromIterator for Option/Result via scan Apr 5, 2019

pnkfelix added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 5, 2019

pnkfelix closed this May 17, 2019

Stargateur mentioned this pull request Jul 22, 2019

Refactoring use common code between option, result and accum #62883

Merged
