Large files containing many tokens of const data compile very slowly and use a lot of memory (in MIR_borrow_checking and expand_crate) #134404

Description

@Manishearth
Member

ICU4X has a concept of "baked data", a way of "baking" locale data into the source of a program in the form of consts. This has a bunch of performance benefits: loading data from the binary is essentially free and doesn't involve any sort of deserialization.

However, we have been facing issues with cases where a single crate contains a lot of data.

I have a minimal testcase here: https://github.com/Manishearth/icu4x_compile_sample. It removes most of the cruft whilst still having an interesting-enough AST in the const data. cargo build in the demo folder takes 51s, using almost a gigabyte of RAM. Removing the macro does improve things slightly, but compilation remains very slow.

Some interesting snippets of time-passes:

...
time:   1.194; rss:   52MB ->  595MB ( +543MB)	expand_crate
time:   1.194; rss:   52MB ->  595MB ( +543MB)	macro_expand_crate
...
time:   3.720; rss:  682MB ->  837MB ( +155MB)	type_check_crate
...
time:  55.505; rss:  837MB -> 1058MB ( +221MB)	MIR_borrow_checking
...
time:   0.124; rss: 1080MB ->  624MB ( -456MB)	free_global_ctxt
Full time-passes
time:   0.001; rss:   47MB ->   49MB (   +1MB)	parse_crate
time:   0.001; rss:   50MB ->   50MB (   +0MB)	incr_comp_prepare_session_directory
time:   0.000; rss:   50MB ->   51MB (   +1MB)	setup_global_ctxt
time:   0.000; rss:   52MB ->   52MB (   +0MB)	crate_injection
time:   1.194; rss:   52MB ->  595MB ( +543MB)	expand_crate
time:   1.194; rss:   52MB ->  595MB ( +543MB)	macro_expand_crate
time:   0.013; rss:  595MB ->  595MB (   +0MB)	AST_validation
time:   0.008; rss:  595MB ->  597MB (   +1MB)	finalize_macro_resolutions
time:   0.285; rss:  597MB ->  642MB (  +45MB)	late_resolve_crate
time:   0.012; rss:  642MB ->  642MB (   +0MB)	resolve_check_unused
time:   0.020; rss:  642MB ->  642MB (   +0MB)	resolve_postprocess
time:   0.326; rss:  595MB ->  642MB (  +46MB)	resolve_crate
time:   0.011; rss:  610MB ->  610MB (   +0MB)	write_dep_info
time:   0.011; rss:  610MB ->  611MB (   +0MB)	complete_gated_feature_checking
time:   0.058; rss:  765MB ->  729MB (  -35MB)	drop_ast
time:   1.213; rss:  610MB ->  681MB (  +71MB)	looking_for_derive_registrar
time:   1.421; rss:  610MB ->  682MB (  +72MB)	misc_checking_1
time:   0.086; rss:  682MB ->  690MB (   +8MB)	coherence_checking
time:   3.720; rss:  682MB ->  837MB ( +155MB)	type_check_crate
time:   0.000; rss:  837MB ->  837MB (   +0MB)	MIR_coroutine_by_move_body
time:  55.505; rss:  837MB -> 1058MB ( +221MB)	MIR_borrow_checking
time:   1.571; rss: 1058MB -> 1068MB (  +10MB)	MIR_effect_checking
time:   0.217; rss: 1068MB -> 1067MB (   -1MB)	module_lints
time:   0.217; rss: 1068MB -> 1067MB (   -1MB)	lint_checking
time:   0.311; rss: 1067MB -> 1068MB (   +0MB)	privacy_checking_modules
time:   0.607; rss: 1068MB -> 1068MB (   +0MB)	misc_checking_3
time:   0.000; rss: 1136MB -> 1137MB (   +1MB)	monomorphization_collector_graph_walk
time:   0.778; rss: 1068MB -> 1064MB (   -4MB)	generate_crate_metadata
time:   0.005; rss: 1064MB -> 1085MB (  +22MB)	codegen_to_LLVM_IR
time:   0.007; rss: 1076MB -> 1085MB (  +10MB)	LLVM_passes
time:   0.014; rss: 1064MB -> 1085MB (  +22MB)	codegen_crate
time:   0.257; rss: 1084MB -> 1080MB (   -4MB)	encode_query_results
time:   0.270; rss: 1084MB -> 1080MB (   -4MB)	incr_comp_serialize_result_cache
time:   0.270; rss: 1084MB -> 1080MB (   -4MB)	incr_comp_persist_result_cache
time:   0.271; rss: 1084MB -> 1080MB (   -4MB)	serialize_dep_graph
time:   0.124; rss: 1080MB ->  624MB ( -456MB)	free_global_ctxt
time:   0.000; rss:  624MB ->  624MB (   +0MB)	finish_ongoing_codegen
time:   0.127; rss:  624MB ->  653MB (  +29MB)	link_rlib
time:   0.135; rss:  624MB ->  653MB (  +29MB)	link_binary
time:   0.138; rss:  624MB ->  618MB (   -6MB)	link_crate
time:   0.139; rss:  624MB ->  618MB (   -6MB)	link
time:  65.803; rss:   32MB ->  187MB ( +155MB)	total

Even without the intermediate macro, expand_crate still increases RAM significantly, though the increase is halved:

time:   0.715; rss:   52MB ->  254MB ( +201MB)	expand_crate
time:   0.715; rss:   52MB ->  254MB ( +201MB)	macro_expand_crate

I understand that to some extent, we are simply feeding Rust a file that is megabytes in size and we cannot expect it to be too fast. It's interesting that MIR borrow checking is slowed down so much by this (there's relatively little to borrow check; I suspect there is MIR construction happening here too). The RAM usage is also somewhat concerning: the problematic source file is 7MB, yet compilation takes almost a gigabyte of RAM. Pair this with the fact that we have many such data files per crate (some of which are large), and we end up hitting CI limits.

With the actual problem we were facing (unicode-org/icu4x#5230 (comment)), our time-passes numbers were:

...
time:   1.013; rss:   51MB -> 1182MB (+1130MB)	expand_crate
time:   1.013; rss:   51MB -> 1182MB (+1131MB)	macro_expand_crate
...
time:   6.609; rss: 1308MB -> 1437MB ( +128MB)	type_check_crate
time:  36.802; rss: 1437MB -> 2248MB ( +811MB)	MIR_borrow_checking
time:   2.214; rss: 2248MB -> 2270MB (  +22MB)	MIR_effect_checking
...

I'm hoping there is at least some low hanging fruit that can be improved here, or advice on how to avoid this problem. So far we've managed to stay within CI limits by reducing the number of tokens, converting stuff like icu::experimental::dimension::provider::units::UnitsDisplayNameV1 { patterns: icu::experimental::relativetime::provider::PluralPatterns { strings: icu::plurals::provider::PluralElementsPackedCow { elements: alloc::borrow::Cow::Borrowed(unsafe { icu::plurals::provider::PluralElementsPackedULE::from_byte_slice_unchecked(b"\0\x01 acre") }) }, _phantom: core::marker::PhantomData } }, into icu::experimental::dimension::provider::units::UnitsDisplayNameV1::new_baked(b"\0\x01 acre"). This works to some extent but the problems remain in the same order of magnitude and can recur as we add more data.
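For illustration, the new_baked pattern looks roughly like the following sketch. The type and field names here are simplified stand-ins for the real ICU4X definitions, not the actual API:

```rust
// Simplified stand-ins for the ICU4X types (hypothetical, for illustration).
pub struct PackedStr<'a>(pub &'a [u8]);

pub struct DisplayName<'a> {
    pub pattern: PackedStr<'a>,
}

impl<'a> DisplayName<'a> {
    // One short const fn replaces the deeply nested struct literal,
    // so each baked array entry costs only a handful of tokens.
    pub const fn new_baked(bytes: &'a [u8]) -> Self {
        DisplayName { pattern: PackedStr(bytes) }
    }
}

// Baked data: one call per entry instead of a nested literal per entry.
pub const NAMES: &[DisplayName<'static>] = &[
    DisplayName::new_baked(b"\0\x01 acre"),
    DisplayName::new_baked(b"\0\x01 eka"),
];
```

The token count per entry shrinks dramatically, but as noted above the underlying compile-time and memory costs stay in the same order of magnitude.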

Activity

added I-slow (Issue: Problems and improvements with respect to performance of generated code.) on Dec 17, 2024
added needs-triage (This issue may need triage. Remove it if it has been sufficiently triaged.) on Dec 17, 2024
oli-obk

oli-obk commented on Dec 17, 2024

@oli-obk
Contributor

It's interesting that MIR borrow checking is slowed down so much by this (there's relatively little to borrow check).

Due to the query system, this can also be const eval being invoked, generating and interning lots of allocations, so #93215 may be related.

I don't remember how to get the diff or single output that's used to generate the table in https://perf.rust-lang.org/detailed-query.html?commit=52f4785f80c1516ebece019ae4b69763ffb9a618&benchmark=ripgrep-13.0.0-opt&scenario=incr-unchanged&base_commit=5afd5ad29c014de69bea61d028a1ce832ed75a75, but that gives per-query timings. Just in case you want to dig some more.

lqd

lqd commented on Dec 17, 2024

@lqd
Member

You can get the raw data with -Zself-profile, then use the measureme tools on the resulting .mm_profdata file: summarize summarize $file gives the per-query summary, and IIRC summarize diff with two files produces the diff oli linked.
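Concretely, that workflow looks something like the following. The install command and the profdata file name here are illustrative assumptions; adjust them to your setup:

```shell
# Build the crate with self-profiling enabled (requires a nightly toolchain).
RUSTFLAGS="-Zself-profile" cargo +nightly build

# Install the summarize tool from the measureme repository.
cargo install --git https://github.com/rust-lang/measureme summarize

# Per-query self-time table from the recorded profile.
summarize summarize demo-12345.mm_profdata

# Diff two profiles (e.g. with and without the macro).
summarize diff before.mm_profdata after.mm_profdata
```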

added T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.) on Dec 17, 2024
lqd

lqd commented on Dec 17, 2024

@lqd
Member
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| Item                                                                    | Self time | % of total time | Time     | Item count | Incremental result hashing time |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_const_qualif                                                        | 16.22s    | 45.785          | 16.22s   | 3          | 4.56µs                          |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_built                                                               | 9.98s     | 28.160          | 10.09s   | 4          | 96.73ms                         |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_borrowck                                                            | 2.49s     | 7.030           | 19.02s   | 4          | 5.70µs                          |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| typeck                                                                  | 2.27s     | 6.420           | 2.36s    | 4          | 23.85ms                         |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| expand_crate                                                            | 942.46ms  | 2.660           | 952.89ms | 1          | 0.00ns                          |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+ 
...
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
Total cpu time: 35.430183553s
...

And without the macro:

+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| Item                                                                    | Self time | % of total time | Time     | Item count | Incremental result hashing time |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_const_qualif                                                        | 14.88s    | 45.361          | 14.88s   | 3          | 5.76µs                          |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_built                                                               | 9.86s     | 30.074          | 9.97s    | 4          | 61.53ms                         |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_borrowck                                                            | 2.50s     | 7.612           | 17.67s   | 4          | 7.64µs                          |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| typeck                                                                  | 2.27s     | 6.922           | 2.35s    | 4          | 23.99ms                         |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| expand_crate                                                            | 454.17ms  | 1.385           | 477.20ms | 1          | 0.00ns                          |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
...
Total cpu time: 32.792896135s
...

So most of the time is in mir building and const qualif, not in borrowck per se.

oli-obk

oli-obk commented on Dec 17, 2024

@oli-obk
Contributor

Hmm. There's def opportunity to improve const qualification

cc @RalfJung

Manishearth

Manishearth commented on Dec 17, 2024

@Manishearth
MemberAuthor

Sweet, thanks! If it's something straightforward enough I'm happy to help out too.

Would any of these fixes help with the RAM? The performance is an issue but it's not the blocker, the RAM usage is one that actively limits us at times.

oli-obk

oli-obk commented on Dec 17, 2024

@oli-obk
Contributor

Hmm... that's harder to debug. The mir_const_qualif query returns an always-tiny result, so that's not it. We could probably dump size changes of queries without nested queries, similar to the self time, but that's harder.

We may be able to trim the borrowck result if that's the issue; rustc may not need all of its output anymore.

lqd

lqd commented on Dec 17, 2024

@lqd
Member

Would any of these fixes help with the RAM?

You may be in luck, as that RAM usage seems to come from const qualif as well 😅. It's understandable, really: it's trying to do analyses on MIR that has 250K locals and 730K statements.

oli-obk

oli-obk commented on Dec 17, 2024

@oli-obk
Contributor

Ah, transient peak memory? It shouldn't generate much persistent data

lqd

lqd commented on Dec 17, 2024

@lqd
Member

Seems like it, yeah, since we're talking about max-rss.

And I wonder, if you remember the recent change in dataflow removing the acyclic-MIR fast path (which seems a likely shape for a constant), whether the transfer function is now cloning the state because there are 200K blocks (I hope it still moves it when there's a single successor, but at the same time there may be unwinding paths all around). I'd need to check.

Also, we should try the mixed bitsets. I haven't checked the distribution of values here, but I'd believe they should be very compressible.

added I-compilemem (Issue: Problems and improvements with respect to memory usage during compilation.) on Dec 17, 2024
RalfJung

RalfJung commented on Dec 17, 2024

@RalfJung
Member

There's def opportunity to improve const qualification

There probably is, but it's hard to say where without knowing which part is the bottleneck. Is there any way to get a profile of where const-qualif is spending its time?

We do have to iterate over that array at least once. But I doubt that alone would be so slow -- how many elements does this array have?

Maybe there's something accidentally quadratic in promotion? Not sure if that is also part of mir_const_qualif.

(16 intermediate comments hidden)

nnethercote

nnethercote commented on May 21, 2025

@nnethercote
Contributor

I just tried introducing a const fn like this:

const fn f(b: &[u8]) -> icu::experimental::dimension::provider::units::UnitsDisplayNameV1<'_> {
    icu::experimental::dimension::provider::units::UnitsDisplayNameV1 {
        patterns: icu::experimental::relativetime::provider::PluralPatterns {
            strings: icu::plurals::provider::PluralElementsPackedCow {
                elements: alloc::borrow::Cow::Borrowed(unsafe {
                    icu::plurals::provider::PluralElementsPackedULE::from_byte_slice_unchecked(b)
                }),
            },
            _phantom: core::marker::PhantomData,
        },
    }
}

and changed the array entries to be like this:

                    f(b"\0\x01 acre"),
                    f(b"\0\x01 eka"),
                    f(b"\0\x01 \xE1\x8A\xA4\xE1\x8A\xAD\xE1\x88\xAD"),

and the time for a cargo check dropped from 31 seconds to 9 seconds!

The quadratic drop behaviour is still present, but everything else is majorly improved. That's perhaps not surprising, because the size of the .data file dropped from 12.0 MB to 2.8 MB.

@Manishearth / @sffc: can you try this change, see how it works for you?

programmerjake

programmerjake commented on May 21, 2025

@programmerjake
Member

to avoid quadratic behavior, maybe check if drops.len() > 20 and switch to a different algorithm? something like a memoized binary tree instead of a list, where you can handle hashing/comparing exponentially large subtrees by just hashing/comparing the top node (inspired by Hashlife, which is a really awesome algorithm). That should change the overall algorithm to O(N log N).
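A rough sketch of that idea (hypothetical standalone types, not actual compiler code): interior nodes cache a hash of their whole subtree, so comparing two subtrees starts with an O(1) hash check and only falls back to a structural walk when the hashes match.

```rust
// Memoized binary tree: each Node caches the hash of its entire subtree,
// so comparisons can short-circuit without walking the whole structure.
#[derive(Clone, PartialEq, Eq, Debug)]
enum Tree {
    Leaf(u64),
    Node { hash: u64, left: Box<Tree>, right: Box<Tree> },
}

impl Tree {
    fn subtree_hash(&self) -> u64 {
        match self {
            Tree::Leaf(v) => *v,
            Tree::Node { hash, .. } => *hash, // cached, O(1)
        }
    }

    // Build a balanced tree over `items`, computing each node's hash once.
    fn build(items: &[u64]) -> Tree {
        assert!(!items.is_empty(), "build requires a non-empty slice");
        match items {
            [x] => Tree::Leaf(*x),
            _ => {
                let (l, r) = items.split_at(items.len() / 2);
                let (l, r) = (Tree::build(l), Tree::build(r));
                // Toy hash combiner, for illustration only.
                let hash = l
                    .subtree_hash()
                    .wrapping_mul(31)
                    .wrapping_add(r.subtree_hash());
                Tree::Node { hash, left: Box::new(l), right: Box::new(r) }
            }
        }
    }

    // Equality test that short-circuits on the cached hash before doing
    // a full structural comparison.
    fn fast_eq(&self, other: &Tree) -> bool {
        self.subtree_hash() == other.subtree_hash() && self == other
    }
}
```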

nnethercote

nnethercote commented on May 21, 2025

@nnethercote
Contributor

the time for a cargo check dropped from 31 seconds to 9 seconds!

That was with rust 1.84. With a 1.88 nightly it dropped from 18.6s to 8.5s and peak memory dropped from 1724MB to 422MB. (The 31s vs. 18.6s is probably due to #134438 being present in the 1.88 nightly.)

hanna-kruppe

hanna-kruppe commented on May 21, 2025

@hanna-kruppe
Contributor

This is a wild guess, but from issues like #120604 I suspect that it's not just that some data structures used during these parts of MIR building could be more efficient. Rather, it seems that very long but relatively shallow expressions, under some conditions, result in a quadratic amount of "this temporary needs to be dropped on this CFG edge" facts. In #120604 the edges in question are early exits from ?, while in this case they may come from the possibility of PluralElementsPackedULE::from_byte_slice_unchecked unwinding (I haven't verified this, but I expect MIR building for constants doesn't get special cases to omit unwinding edges).

In this case there’s no drop glue to actually call so the only consequence is spending a lot of time working out the quadratic number of places where no code needs to be generated. In other cases like #120604, a quadratic amount of MIR is generated that triggers worse pathologies in later compiler phases processing all that code (which may be why the MIR building part of it never stood out in a profile before).

Of course, it’s a totally unproven assumption on my part that these two issues have the same root cause. They just seem suspiciously similar to me.

programmerjake

programmerjake commented on May 21, 2025

@programmerjake
Member

to avoid quadratic behavior, maybe check if drops.len() > 20 and switch to a different algorithm? something like a memoized binary tree instead of a list, where you can handle hashing/comparing exponentially large subtrees by just hashing/comparing the top node.

e.g.: https://play.rust-lang.org/?version=stable&mode=release&edition=2024&gist=6fc15584a6dd96b0e6e760dc8b21382a

In this case there’s no drop glue to actually call so the only consequence is spending a lot of time working out the quadratic number of places where no code needs to be generated. In other cases like #120604, a quadratic amount of MIR is generated that triggers worse pathologies in later compiler phases processing all that code (which may be why the MIR building part of it never stood out in a profile before).

for drop glue, wouldn't it work to have later cleanup blocks branch back to the cleanup blocks for earlier blocks? that avoids quadratic code generation. e.g.:

    a = call f() return to block1
block1:
    b = call f() unwind to cleanup1 return to block2
cleanup1:
    drop(a);
    resume unwind
block2:
    c = call f() unwind to cleanup2 return to block3
cleanup2:
    drop(b);
    branch to cleanup1
block3:
    d = call f() unwind to cleanup3 return to block4
cleanup3:
    drop(c);
    branch to cleanup2
block4:
    ...
hanna-kruppe

hanna-kruppe commented on May 21, 2025

@hanna-kruppe
Contributor

Yeah I think it’s “just” a missing optimization, since the problem doesn’t affect locals, only temporaries. (Also pointed out in #120604 (comment).) But IIRC everything related to inserting and elaborating drops has historically had a lot of subtle and thorny bugs. So it’s probably relatively hard to make it smarter.

Manishearth

Manishearth commented on May 21, 2025

@Manishearth
MemberAuthor

I just tried introducing a const fn like this:

Yeah this trick is essentially how we avoided this problem in the worst cases! I had documented this at the bottom of the issue:

I'm hoping there is at least some low hanging fruit that can be improved here, or advice on how to avoid this problem. So far we've managed to stay within CI limits by reducing the number of tokens, converting stuff like icu::experimental::dimension::provider::units::UnitsDisplayNameV1 { patterns: icu::experimental::relativetime::provider::PluralPatterns { strings: icu::plurals::provider::PluralElementsPackedCow { elements: alloc::borrow::Cow::Borrowed(unsafe { icu::plurals::provider::PluralElementsPackedULE::from_byte_slice_unchecked(b"\0\x01 acre") }) }, _phantom: core::marker::PhantomData } }, into icu::experimental::dimension::provider::units::UnitsDisplayNameV1::new_baked(b"\0\x01 acre"). This works to some extent but the problems remain in the same order of magnitude and can recur as we add more data.

nnethercote

nnethercote commented on May 21, 2025

@nnethercote
Contributor

I overlooked that and assumed that the code example you provided was representative.

To summarize:

  • The original complaint was about compile time and peak memory usage.
  • @lqd improved things significantly with Use MixedBitSets in const qualif #134438.
  • I diagnosed the quadratic drop behaviour, which is a big contributor to the compile time, though I don't know how to fix it.
  • I rediscovered the const fn change, which improves compile time a lot and reduces peak memory use even more.

I don't have time to work on this more.

Manishearth

Manishearth commented on May 21, 2025

@Manishearth
MemberAuthor

I overlooked that and assumed that the code example you provided was representative.

It's representative of most of our generated code. We've applied this optimization to the cases where the cost was worst, but even with it the numbers remain pretty large, which is worrying. Apologies if this wasn't clear.

lqd's improvements did massively improve the status quo I think.

Thanks for the work you've done so far!

added a commit that references this issue on May 27, 2025
added a commit that references this issue on May 28, 2025
added a commit that references this issue on May 28, 2025
added a commit that references this issue on May 28, 2025
added a commit that references this issue on May 28, 2025
Metadata

Assignees: no one assigned

Labels:
    C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such)
    I-compilemem (Issue: Problems and improvements with respect to memory usage during compilation.)
    I-slow (Issue: Problems and improvements with respect to performance of generated code.)
    T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Participants: @lqd, @RalfJung, @oli-obk, @sffc, @Manishearth