Description
ICU4X has a concept of "baked data", a way of "baking" locale data into the source of a program in the form of consts. This has a bunch of performance benefits: loading data from the binary is essentially free and doesn't involve any sort of deserialization.
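For illustration, the pattern looks roughly like this (a minimal sketch with made-up types, not ICU4X's actual data model):

```rust
// Locale data is emitted into the source as consts, so "loading" it is
// just referencing a static value; no parsing or deserialization happens.
pub struct DisplayName {
    pub pattern: &'static str,
}

pub const DATA_EN: &[DisplayName] = &[
    DisplayName { pattern: "acre" },
    DisplayName { pattern: "hectare" },
];
```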
However, we have been facing issues with cases where a single crate contains a lot of data.
I have a minimal testcase here: https://github.com/Manishearth/icu4x_compile_sample. It removes most of the cruft whilst still having an interesting-enough AST in the const data. `cargo build` in the `demo` folder takes 51s, using almost a gigabyte of RAM. Removing the macro does improve things slightly, but not overly so.
Some interesting snippets of `time-passes`:
...
time: 1.194; rss: 52MB -> 595MB ( +543MB) expand_crate
time: 1.194; rss: 52MB -> 595MB ( +543MB) macro_expand_crate
...
time: 3.720; rss: 682MB -> 837MB ( +155MB) type_check_crate
...
time: 55.505; rss: 837MB -> 1058MB ( +221MB) MIR_borrow_checking
...
time: 0.124; rss: 1080MB -> 624MB ( -456MB) free_global_ctxt
Full `time-passes` output:
time: 0.001; rss: 47MB -> 49MB ( +1MB) parse_crate
time: 0.001; rss: 50MB -> 50MB ( +0MB) incr_comp_prepare_session_directory
time: 0.000; rss: 50MB -> 51MB ( +1MB) setup_global_ctxt
time: 0.000; rss: 52MB -> 52MB ( +0MB) crate_injection
time: 1.194; rss: 52MB -> 595MB ( +543MB) expand_crate
time: 1.194; rss: 52MB -> 595MB ( +543MB) macro_expand_crate
time: 0.013; rss: 595MB -> 595MB ( +0MB) AST_validation
time: 0.008; rss: 595MB -> 597MB ( +1MB) finalize_macro_resolutions
time: 0.285; rss: 597MB -> 642MB ( +45MB) late_resolve_crate
time: 0.012; rss: 642MB -> 642MB ( +0MB) resolve_check_unused
time: 0.020; rss: 642MB -> 642MB ( +0MB) resolve_postprocess
time: 0.326; rss: 595MB -> 642MB ( +46MB) resolve_crate
time: 0.011; rss: 610MB -> 610MB ( +0MB) write_dep_info
time: 0.011; rss: 610MB -> 611MB ( +0MB) complete_gated_feature_checking
time: 0.058; rss: 765MB -> 729MB ( -35MB) drop_ast
time: 1.213; rss: 610MB -> 681MB ( +71MB) looking_for_derive_registrar
time: 1.421; rss: 610MB -> 682MB ( +72MB) misc_checking_1
time: 0.086; rss: 682MB -> 690MB ( +8MB) coherence_checking
time: 3.720; rss: 682MB -> 837MB ( +155MB) type_check_crate
time: 0.000; rss: 837MB -> 837MB ( +0MB) MIR_coroutine_by_move_body
time: 55.505; rss: 837MB -> 1058MB ( +221MB) MIR_borrow_checking
time: 1.571; rss: 1058MB -> 1068MB ( +10MB) MIR_effect_checking
time: 0.217; rss: 1068MB -> 1067MB ( -1MB) module_lints
time: 0.217; rss: 1068MB -> 1067MB ( -1MB) lint_checking
time: 0.311; rss: 1067MB -> 1068MB ( +0MB) privacy_checking_modules
time: 0.607; rss: 1068MB -> 1068MB ( +0MB) misc_checking_3
time: 0.000; rss: 1136MB -> 1137MB ( +1MB) monomorphization_collector_graph_walk
time: 0.778; rss: 1068MB -> 1064MB ( -4MB) generate_crate_metadata
time: 0.005; rss: 1064MB -> 1085MB ( +22MB) codegen_to_LLVM_IR
time: 0.007; rss: 1076MB -> 1085MB ( +10MB) LLVM_passes
time: 0.014; rss: 1064MB -> 1085MB ( +22MB) codegen_crate
time: 0.257; rss: 1084MB -> 1080MB ( -4MB) encode_query_results
time: 0.270; rss: 1084MB -> 1080MB ( -4MB) incr_comp_serialize_result_cache
time: 0.270; rss: 1084MB -> 1080MB ( -4MB) incr_comp_persist_result_cache
time: 0.271; rss: 1084MB -> 1080MB ( -4MB) serialize_dep_graph
time: 0.124; rss: 1080MB -> 624MB ( -456MB) free_global_ctxt
time: 0.000; rss: 624MB -> 624MB ( +0MB) finish_ongoing_codegen
time: 0.127; rss: 624MB -> 653MB ( +29MB) link_rlib
time: 0.135; rss: 624MB -> 653MB ( +29MB) link_binary
time: 0.138; rss: 624MB -> 618MB ( -6MB) link_crate
time: 0.139; rss: 624MB -> 618MB ( -6MB) link
time: 65.803; rss: 32MB -> 187MB ( +155MB) total
Even without the intermediate macro, `expand_crate` still increases RAM significantly, though the increase is halved:
time: 0.715; rss: 52MB -> 254MB ( +201MB) expand_crate
time: 0.715; rss: 52MB -> 254MB ( +201MB) macro_expand_crate
I understand that to some extent, we are simply feeding Rust a file that is megabytes in size and we cannot expect it to be too fast. It's interesting that MIR borrow checking is slowed down so much by this (there's relatively little to borrow-check; I suspect there is MIR construction happening here too). The RAM usage is also somewhat concerning: the problematic source file is 7MB, but compiling it takes almost a gigabyte of RAM, which is quite significant. Pair this with the fact that we have many such data files per crate (some of which are large), and we end up hitting CI limits.
With the actual problem we were facing (unicode-org/icu4x#5230 (comment)), our time-passes numbers were:
...
time: 1.013; rss: 51MB -> 1182MB (+1130MB) expand_crate
time: 1.013; rss: 51MB -> 1182MB (+1131MB) macro_expand_crate
...
time: 6.609; rss: 1308MB -> 1437MB ( +128MB) type_check_crate
time: 36.802; rss: 1437MB -> 2248MB ( +811MB) MIR_borrow_checking
time: 2.214; rss: 2248MB -> 2270MB ( +22MB) MIR_effect_checking
...
I'm hoping there is at least some low-hanging fruit that can be improved here, or advice on how to avoid this problem. So far we've managed to stay within CI limits by reducing the number of tokens, converting stuff like

```rust
icu::experimental::dimension::provider::units::UnitsDisplayNameV1 {
    patterns: icu::experimental::relativetime::provider::PluralPatterns {
        strings: icu::plurals::provider::PluralElementsPackedCow {
            elements: alloc::borrow::Cow::Borrowed(unsafe {
                icu::plurals::provider::PluralElementsPackedULE::from_byte_slice_unchecked(b"\0\x01 acre")
            }),
        },
        _phantom: core::marker::PhantomData,
    },
},
```

into

```rust
icu::experimental::dimension::provider::units::UnitsDisplayNameV1::new_baked(b"\0\x01 acre")
```

This works to some extent, but the problems remain in the same order of magnitude and can recur as we add more data.
Activity
oli-obk commented on Dec 17, 2024
Due to the query system, this can also be const eval being invoked and generating and interning lots of allocations. So #93215 may be related.
I don't remember how to get the diff or single output that's used to generate the table in https://perf.rust-lang.org/detailed-query.html?commit=52f4785f80c1516ebece019ae4b69763ffb9a618&benchmark=ripgrep-13.0.0-opt&scenario=incr-unchanged&base_commit=5afd5ad29c014de69bea61d028a1ce832ed75a75 but that gives per-query timings. Just in case you want to dig some more.
lqd commented on Dec 17, 2024
You can get the raw data with `-Zself-profile` and use the measureme tools on the `.mm_profdata` raw data: `summarize summarize $file` for the query summary, and IIRC `summarize diff` with two files to get the diff oli linked.

lqd commented on Dec 17, 2024
And without the macro:
So most of the time is in mir building and const qualif, not in borrowck per se.
oli-obk commented on Dec 17, 2024
Hmm. There's def opportunity to improve const qualification
cc @RalfJung
Manishearth commented on Dec 17, 2024
Sweet, thanks! If it's something straightforward enough I'm happy to help out too.
Would any of these fixes help with the RAM? The performance is an issue but it's not the blocker, the RAM usage is one that actively limits us at times.
oli-obk commented on Dec 17, 2024
Hmm.. that's harder to debug. The mir const qualif query returns an always-tiny result, so that's not it. We could probably dump size changes of queries without nested queries, similar to the self time. But that's harder
We may be able to trim the borrowck result if that's the issue. Rustc may not need all the output anymore
lqd commented on Dec 17, 2024
You may be in luck as that RAM usage seems to come from const qualif as well 😅. It's understandable, really, it's trying to do analyses on MIR that has 250K locals and 730K statements.
oli-obk commented on Dec 17, 2024
Ah, transient peak memory? It shouldn't generate much persistent data
lqd commented on Dec 17, 2024
Seems like it, yeah, since we're talking about max-rss.
And I wonder, if you remember the recent change in dataflow removing the acyclic MIR fast path (which seems a likely shape for a constant), if the xfer function is now cloning the state (I hope it still moves it if there's a single successor; but at the same time there may be unwinding paths all around) because there are 200K blocks. I'd need to check.
Also we should try the mixed bitsets (but I haven't checked the distribution of values here, I'd believe they should be very compressible tho).
RalfJung commented on Dec 17, 2024
There probably is, but it's hard to say where without knowing which part is the bottleneck. Is there any way to get a profile of where const-qualif is spending its time?
We do have to iterate over that array at least once. But I doubt that alone would be so slow -- how many elements does this array have?
Maybe there's something accidentally quadratic in promotion? Not sure if that is also part of mir_const_qualif.
(16 remaining items omitted)
nnethercote commented on May 21, 2025
I just tried introducing a const fn like this:
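(A sketch of what such a helper could look like, modeled on the `new_baked` constructor described earlier in this issue; the exact name and signature are assumptions, and it would live in the crate that defines `UnitsDisplayNameV1`:)

```rust
impl<'data> UnitsDisplayNameV1<'data> {
    /// Builds the nested struct from raw baked bytes in one call,
    /// replacing a deeply nested struct literal at every use site.
    pub const fn new_baked(bytes: &'data [u8]) -> Self {
        Self {
            patterns: PluralPatterns {
                strings: PluralElementsPackedCow {
                    elements: Cow::Borrowed(
                        // SAFETY: the bytes come from the same data generator
                        // that produced the original nested literals.
                        unsafe { PluralElementsPackedULE::from_byte_slice_unchecked(bytes) },
                    ),
                },
                _phantom: PhantomData,
            },
        }
    }
}
```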
and changed the array entries to be like this:
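(Again a sketch; the second byte string is made up for illustration:)

```rust
const UNITS: &[UnitsDisplayNameV1<'static>] = &[
    UnitsDisplayNameV1::new_baked(b"\0\x01 acre"),
    UnitsDisplayNameV1::new_baked(b"\0\x01 hectare"), // hypothetical entry
    // ...thousands of further entries
];
```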
and the time for a `cargo check` dropped from 31 seconds to 9 seconds!

The quadratic drop behaviour is still present, but everything else is majorly improved. Perhaps not surprising because the size of the `.data` file dropped from 12.0 MB to 2.8 MB.

@Manishearth / @sffc: can you try this change, see how it works for you?
programmerjake commented on May 21, 2025
to avoid quadratic behavior, maybe check if `drops.len() > 20` and switch to a different algorithm? Something like a memoized binary tree instead of a list, where you can handle hashing/comparing exponentially large subtrees by just hashing/comparing the top node. (Inspired by Hashlife, which is a really awesome algorithm.) That should change the overall algorithm to O(N log N).
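A sketch of that idea (a hypothetical data structure, not rustc's actual representation): each node caches a hash derived from its children's cached hashes, and interning makes deep equality a pointer comparison.

```rust
use std::rc::Rc;

struct Node {
    hash: u64,
    children: Vec<Rc<Node>>,
}

impl Node {
    fn new(children: Vec<Rc<Node>>) -> Rc<Node> {
        // FNV-1a-style combination of the children's cached hashes:
        // hashing a subtree never re-walks it, however large it is.
        let mut hash = 0xcbf2_9ce4_8422_2325_u64;
        for child in &children {
            hash ^= child.hash;
            hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
        }
        Rc::new(Node { hash, children })
    }
}

// With interning (deduplicating equal subtrees into a single allocation),
// comparing two subtrees is a pointer comparison instead of a deep walk.
fn fast_eq(a: &Rc<Node>, b: &Rc<Node>) -> bool {
    Rc::ptr_eq(a, b)
}
```

nnethercote commented on May 21, 2025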
That was with Rust 1.84. With a 1.88 nightly it dropped from 18.6s to 8.5s, and peak memory dropped from 1724MB to 422MB. (The 31s vs. 18.6s difference is probably due to #134438 being present in the 1.88 nightly.)
hanna-kruppe commented on May 21, 2025
This is a wild guess, but from issues like #120604 I suspect that it's not just that some data structures used during these parts of MIR building could be more efficient. Rather it seems that very long but relatively shallow expressions, under some conditions, result in a quadratic amount of "this temporary needs to be dropped on this CFG edge" facts. In #120604 the edges in question are early exits from `?`, while in this case they may be from the possibility of `PluralElementsPackedULE::from_bytes_unchecked` unwinding (I haven't verified this, but I expect the MIR building for constants doesn't get special cases to omit unwinding edges).

In this case there's no drop glue to actually call, so the only consequence is spending a lot of time working out the quadratic number of places where no code needs to be generated. In other cases like #120604, a quadratic amount of MIR is generated that triggers worse pathologies in later compiler phases processing all that code (which may be why the MIR building part of it never stood out in a profile before).
Of course, it’s a totally unproven assumption on my part that these two issues have the same root cause. They just seem suspiciously similar to me.
programmerjake commented on May 21, 2025
for drop glue, wouldn't it work to have later cleanup blocks branch back to the cleanup blocks for earlier blocks? That avoids quadratic code generation, e.g.: https://play.rust-lang.org/?version=stable&mode=release&edition=2024&gist=6fc15584a6dd96b0e6e760dc8b21382a
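The idea, sketched in MIR-like pseudocode (hypothetical, not actual rustc output): each unwind path drops only its own temporary and then jumps to the previous cleanup block, so cleanup code stays linear in the number of temporaries.

```rust
// bb_cleanup_3: drop(_tmp3) -> goto bb_cleanup_2;  // unwind after _tmp3 is live
// bb_cleanup_2: drop(_tmp2) -> goto bb_cleanup_1;
// bb_cleanup_1: drop(_tmp1) -> resume;
//
// ...instead of each unwind edge spelling out the full drop list
// (drop(_tmp3); drop(_tmp2); drop(_tmp1); ...), which is what makes the
// generated MIR quadratic in the number of temporaries.
```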
hanna-kruppe commented on May 21, 2025
Yeah I think it’s “just” a missing optimization, since the problem doesn’t affect locals, only temporaries. (Also pointed out in #120604 (comment).) But IIRC everything related to inserting and elaborating drops has historically had a lot of subtle and thorny bugs. So it’s probably relatively hard to make it smarter.
Manishearth commented on May 21, 2025
Yeah this trick is essentially how we avoided this problem in the worst cases! I had documented this at the bottom of the issue:
nnethercote commented on May 21, 2025
I overlooked that and assumed that the code example you provided was representative.
To summarize:
- The switch to `MixedBitSet`s in const qualif (#134438).
- The `const fn` change, which improves compile time a lot and reduces peak memory use even more.

I don't have time to work on this more.
Manishearth commented on May 21, 2025
It's representative of most of our generated code. We've applied this optimization to cases where it's too much; but that optimization still leads to some pretty large numbers that are worrying. Apologies if this wasn't clear.
lqd's improvements did massively improve the status quo I think.
Thanks for the work you've done so far!
Remove `DropNodeKey::kind`.
Rollup merge of rust-lang#141627 - nnethercote:drop-cleanups, r=matth…
Unrolled build for #141627