CTFE benchmarks need streamlining #280

Closed
nnethercote opened this issue Aug 30, 2018 · 4 comments · Fixed by #282

Comments

@nnethercote
Contributor

It's very useful to have CTFE benchmarked within rustc-perf. But there are some problems with the current benchmarks.

  • There are too many: 7 out of 36 benchmarks.
  • They have high variation: up to ±8% or so. I suspect this is because they make heavy use of hash tables, and hash table iteration is non-deterministic (though I could be wrong). Combined with the previous point, this means the compare page on perf.rust-lang.org now has a lot of entries that usually need to be ignored.
  • They are too long-running. For check builds, they take 50--100 billion instructions each; the only longer-running benchmarks are style-servo and script-servo. This slows down benchmarking and profiling, especially with slow profilers such as Callgrind. Many of the other benchmarks take fewer than 10 billion instructions. Furthermore, the CTFE ones are so repetitive that making them smaller would not lose information.
  • The names are too long. As a result, on perf.rust-lang.org some of them don't fit three graphs across the screen like they're supposed to.
  • There is not much difference between them. First, their source mostly consists of invocations of the expensive_static and const_repeat macros. Second, even though they are nominally stressing different aspects of CTFE, the profiles look pretty similar.

To expand on that last point, here are instruction counts for the hottest four source files for each one:

cgout-Orig-ctfe-stress-cast-Check-Clean
              63,415,123,847 TOTAL
13.5% 13.5%    8,590,340,259 librustc/ty/query/plumbing.rs
12.4% 26.0%    7,887,149,209 librustc_mir/interpret/eval_context.rs
 8.5% 34.5%    5,408,861,948 libcore/cell.rs         
 7.9% 42.4%    4,998,540,292 librustc/ty/layout.rs   

cgout-Orig-ctfe-stress-const-fn-Check-Clean
              49,800,782,339 TOTAL
13.7% 13.7%    6,803,527,890 librustc/ty/query/plumbing.rs
12.5% 26.1%    6,201,217,834 librustc_mir/interpret/eval_context.rs
 8.6% 34.7%    4,293,120,813 libcore/cell.rs         
 5.6% 40.3%    2,775,820,287 libcore/ptr.rs          

cgout-Orig-ctfe-stress-force-alloc-Check-Clean
              56,259,650,108 TOTAL
13.2% 13.2%    7,444,960,046 librustc_mir/interpret/memory.rs
 8.0% 21.3%    4,526,115,767 librustc/ty/query/plumbing.rs
 7.5% 28.8%    4,210,254,628 librustc_mir/interpret/place.rs
 6.0% 34.7%    3,357,339,539 librustc_mir/interpret/eval_context.rs

cgout-Orig-ctfe-stress-index-check-Check-Clean
              54,348,254,144 TOTAL
13.9% 13.9%    7,535,276,480 librustc_mir/interpret/eval_context.rs
12.7% 26.5%    6,887,503,868 librustc/ty/query/plumbing.rs
 8.1% 34.6%    4,393,910,195 libcore/cell.rs         
 6.3% 40.9%    3,427,755,811 librustc/ty/layout.rs   

cgout-Orig-ctfe-stress-ops-Check-Clean
             100,246,255,577 TOTAL
13.4% 13.4%   13,480,923,396 librustc_mir/interpret/eval_context.rs
12.8% 26.3%   12,843,447,695 librustc/ty/query/plumbing.rs
 8.1% 34.3%    8,100,046,914 libcore/cell.rs         
 7.0% 41.3%    7,008,238,377 librustc/ty/layout.rs   

cgout-Orig-ctfe-stress-reloc-Check-Clean
              95,647,321,682 TOTAL
18.7% 18.7%   17,888,413,265 librustc_mir/interpret/eval_context.rs
14.5% 33.2%   13,885,565,400 librustc/ty/query/plumbing.rs
 9.2% 42.4%    8,810,350,240 libcore/cell.rs         
 5.9% 48.4%    5,686,371,066 librustc/ty/layout.rs   

cgout-Orig-ctfe-stress-unsize-slice-Check-Clean
              60,974,886,778 TOTAL
16.9% 16.9%   10,318,564,804 librustc_mir/interpret/eval_context.rs
13.7% 30.6%    8,349,291,251 librustc/ty/query/plumbing.rs
 8.7% 39.3%    5,305,447,236 libcore/cell.rs         
 5.9% 45.2%    3,608,659,105 librustc/ty/layout.rs  

There is not a lot of variation.

I suggest combining all 7 into a single benchmark, called ctfe-stress. It would have 7 invocations of the expensive_static macro. Also, that macro would be changed so that the number of sub-expressions is 5-10x smaller.

This would fix all the above problems except for the high variation. The only downside I can see is that the single benchmark would be measuring multiple things, rather than a single thing, which muddies the waters when doing local profiling. But there is a pretty simple workaround for that: if you are doing local profiling, just comment out whichever macro invocations you aren't interested in.
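
Roughly, the combined benchmark could look something like the sketch below. The macro bodies here are purely illustrative (they are not the real expensive_static/const_repeat definitions from rustc-perf); the point is the structure: one file, one invocation per stressed operation, with the repeat count as the knob for the 5-10x shrink.

```rust
// Illustrative sketch only -- not the real rustc-perf macros.
// Each `@` forces one extra compile-time evaluation of the expression.
macro_rules! const_repeat {
    ($e:expr;) => { $e };
    ($e:expr; @ $($rest:tt)*) => { $e + const_repeat!($e; $($rest)*) };
}

// One static per stressed operation; shrinking the benchmark just means
// passing fewer `@` tokens here.
macro_rules! expensive_static {
    ($name:ident : $t:ty = $e:expr) => {
        pub static $name: $t = const_repeat!($e; @ @ @ @ @ @ @);
    };
}

// One invocation per former benchmark; when profiling a single aspect
// locally, comment out the others.
expensive_static!(OPS: i64 = ((((11 >> 1) + 3) * 7) / 2 - 12) << 4);
expensive_static!(CAST: usize = 42u8 as usize);
expensive_static!(INDEX: u8 = [1u8, 2, 3, 4][2]);
// ...plus const fn calls, forced allocations, relocations, slice unsizing.

fn main() {
    println!("{} {} {}", OPS, CAST, INDEX);
}
```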

Thoughts?

CC @Mark-Simulacrum @RalfJung @oli-obk

@RalfJung
Member

RalfJung commented Aug 31, 2018

I suggest combining all 7 into a single benchmark, called ctfe-stress. It would have 7 invocations of the expensive_static macro.

The funny thing is that this is exactly what my first PR did, but I was told to split them up. ;) But that is fine by me. However, I'd like to include two more kinds of operations (merging this branch). This was blocked on those benchmarks not terminating in reasonable time due to a regression, but somehow that regression got fixed, and I don't even know when...

Myself, I have no experience writing such benchmarks. So I am happy for any advice I can get.

What you are seeing in terms of where the cost is matches my experience debugging performance regressions in some of them: a huge part of the cost is hits in the query cache. I have two questions related to that:

  • Why is the variance so high? Isn't the FxHashMap supposed to be using a stable hash?
  • Is there a way to get cache hit/miss stats out of the query mechanism, so that I can see, for every query, when the number of cache hits/misses changes? That seems like useful information to track on perf.rlo in general, but even more so it is something I'd like to have for my local performance debugging.

More generally it is also rather frustrating that the bottleneck in CTFE is "doing stuff with types" (those queries are monomorphization and layout computation), not the actual CTFE operations. I wonder if there is something we can do about that, but that is a separate topic.

@RalfJung
Member

I opened #282 to hopefully fix this. However, I am still interested in figuring out why these have such high variance -- is caching just so non-deterministic, or do we accidentally have a "real" source of non-determinism in the compiler?

One thing that comes to mind: maybe we are hashing pointers, and thanks to ASLR that can change even the behavior of the FxHashMap.
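
As a rough standalone illustration of that hypothesis (plain std types, nothing taken from the compiler): even with a deterministic hasher, hashing an address gives run-to-run variation, because the address itself varies under ASLR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn main() {
    // A heap allocation whose address depends on ASLR / allocator state.
    let boxed = Box::new(42u64);
    let addr = &*boxed as *const u64 as usize;

    // DefaultHasher::new() uses fixed keys, so the hasher itself is
    // deterministic. Any run-to-run variation therefore comes from the
    // input, i.e. the address.
    let mut hasher = DefaultHasher::new();
    addr.hash(&mut hasher);

    // Typically prints a different pair on each run: if map keys are
    // (derived from) pointers, bucket placement and iteration order can
    // change from run to run even with a "stable" hash function.
    println!("addr = {:#x}, hash = {:#x}", addr, hasher.finish());
}
```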

@nnethercote
Contributor Author

Great! Thank you for doing this.

On the non-determinism front, hash table iteration is another possibility. style-servo used to have high variance but that got fixed at some point -- @Mark-Simulacrum do you remember how?

@Mark-Simulacrum
Member

style-servo had a non-deterministic build script that generated code. If I recall correctly, that was due to hashmap/set use.
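
For reference, a minimal standalone sketch (not the actual build script) of how that kind of non-determinism arises from std's HashMap:

```rust
use std::collections::HashMap;

fn main() {
    // std's HashMap uses a randomly seeded SipHash (RandomState), so the
    // iteration order below usually differs from one run to the next,
    // even though the keys are identical. A build script that emits code
    // while iterating such a map produces non-deterministic output.
    let map: HashMap<&str, u32> =
        [("a", 1), ("b", 2), ("c", 3), ("d", 4)].iter().cloned().collect();
    for (key, value) in &map {
        println!("{} = {}", key, value);
    }
}
```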
