Optimize llvm-nm #13465


Merged · 9 commits · Feb 13, 2021

Conversation

@juj (Collaborator) commented Feb 10, 2021

Optimize llvm-nm invocations from Emscripten. In Python, creating a multiprocessing pool is extremely slow and can take 500 ms to 1 second. If there are only a few files to llvm-nm, process them sequentially, since a single llvm-nm invocation takes only about 10-20 ms.
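In outline, the approach is something like this (a minimal sketch, not the PR's actual code; `llvm_nm_single`, the threshold value, and the default `llvm-nm` path are illustrative assumptions):

```python
import multiprocessing
import subprocess

def llvm_nm_single(path, llvm_nm='llvm-nm'):
  # Run llvm-nm on one object file and return its symbol listing.
  return subprocess.check_output([llvm_nm, path], universal_newlines=True)

PARALLEL_THRESHOLD = 4  # hypothetical cutoff

def run_llvm_nm(files):
  if len(files) < PARALLEL_THRESHOLD:
    # For a handful of files, running them one after another beats the
    # 0.5-1 s cost of spinning up a multiprocessing pool.
    return [llvm_nm_single(f) for f in files]
  with multiprocessing.Pool() as pool:
    return pool.map(llvm_nm_single, files)
```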

@juj juj force-pushed the optimize_llvm_nm branch 2 times, most recently from b3c1989 to de65035 on February 10, 2021 17:17
@kripken (Member) commented Feb 10, 2021

We have a singleton multiprocessing pool, so if it's created anyway I'm not sure this matters? Other uses are the JS optimizer (in wasm2js, but not wasm) and system library building, but there might be more.

If this is helpful to most builds, then this makes sense. But can we do it as a wrapper around the pool, so it helps the other cases as well, and not just llvm-nm specifically? I assume the same reasoning applies there too.

@juj (Collaborator, Author) commented Feb 10, 2021

> We have a singleton multiprocessing pool, so if it's created anyway I'm not sure this matters?

That's right. Originally we had several multiprocessing pools; at some point I consolidated them into a single shared pool, and later made that singleton create-on-demand. So we now have a singleton pool that is spun up only when first needed.
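The create-on-demand pattern is roughly this (a minimal sketch under my own naming, not Emscripten's actual code):

```python
import multiprocessing

_pool = None  # hypothetical module-level cache

def get_pool():
  # Create the pool lazily on first use; every later call reuses it.
  global _pool
  if _pool is None:
    _pool = multiprocessing.Pool()
  return _pool
```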

Currently the pool is used for exactly three purposes: performing these parallel llvm-nm runs when linking code together, the JS optimizer, and building system libraries. There are no other cases, and the JS optimizer is used only for wasm2js.

Note that this PR doesn't change the pool behavior for those other use cases.

This PR is a big improvement to debug build iteration times: simple builds no longer need to spin up the pool at all, since they run the few llvm-nm invocations on a single thread, which is about 10x faster.

@juj (Collaborator, Author) commented Feb 11, 2021

Actually, I now realize that we don't need the multiprocessing pool at all for this task. It is bad, slow, buggy, clunky, bloated, and all the other FUD adjectives one can associate with a piece of software one doesn't like. A major source of headache. :(

Looking at the problem this morning, I realized that of course the good old subprocess.Popen() is already an asynchronous, parallel API - let's just use that. I rewrote the PR to issue parallel Popen() calls instead. Much simpler and shorter.
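The gist is something like this (a minimal sketch, not the PR's exact code; the function name and default llvm-nm path are illustrative):

```python
import subprocess

def parallel_llvm_nm(files, llvm_nm='llvm-nm'):
  # Popen returns immediately, so all llvm-nm processes start up and run
  # concurrently; we then wait on each one in turn and collect its stdout.
  procs = [subprocess.Popen([llvm_nm, f], stdout=subprocess.PIPE,
                            universal_newlines=True)
           for f in files]
  return [p.communicate()[0] for p in procs]
```

No pool, no worker setup, no pool startup cost - the OS runs the child processes in parallel for free.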

Here are benchmarks in tiny and large contexts:

A tiny-sized project

https://github.com/juj/wasm_webgpu consists of a single .cpp file that builds into a .a library, plus three individual main apps, each with a single .cpp file and each linking in the static library. This generates two llvm-nm calls per app, for a total of 6 llvm-nm invocations.

Before this PR, a debug iteration build took 3.3 seconds, of which the llvm-nm code ate up 2.3 seconds:

http://clb.confined.space/dump/toolchain_profiler.results_20210211_1209.html

[profiler screenshot: tiny_baseline]

After this PR, a debug iteration build takes 650 msecs, of which the llvm-nm code takes 14 msecs: http://clb.confined.space/dump/toolchain_profiler.results_20210211_1214.html

[profiler screenshot: tiny_optimized]

A large-sized project

https://github.com/gabrielcuvillier/d3wasm/ , Gabriel Cuvillier's port of Doom 3 to Wasm. The final link consists of 225 calls to llvm-nm. CC @gabrielcuvillier

Before this PR, a release link step took 31.7 seconds, of which llvm-nm calls took up 3.8 seconds:

http://clb.confined.space/dump/d3wasm_baseline.html

[profiler screenshot: d3wasm_baseline]

After this PR, a release link step takes 28.9 seconds, of which llvm-nm calls take 900 msecs:

http://clb.confined.space/dump/d3wasm_llvm_nm_optimized.html

[profiler screenshot: d3wasm_optimized]

This should be a win across the board.

I dream of a world where Python multiprocessing is not used at all.

@juj (Collaborator, Author) commented Feb 11, 2021

CI failures do not look related to this PR.

@juj (Collaborator, Author) commented Feb 11, 2021

Actually, it looks like llvm-nm supports taking a list of files. Doing that, the d3wasm build looks like this:

http://clb.confined.space/dump/d3wasm_optimized_2.html

[profiler screenshot: d3wasm_optimized_2]

with the llvm-nm processing taking 570 ms - a further 37% time reduction from the above, and an 85% reduction relative to baseline.
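Batching collapses the whole thing to roughly this (a minimal sketch under assumed names, not the PR's exact code):

```python
import subprocess

def batched_llvm_nm(files, llvm_nm='llvm-nm'):
  # One llvm-nm process handles every object file, so the process
  # startup cost is paid exactly once instead of once per file.
  return subprocess.check_output([llvm_nm] + files, universal_newlines=True)
```

One caveat worth noting: a very long file list can exceed OS command-line length limits, which is what later bit Windows users (see the comment from sbc100 below).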

@kripken (Member) left a comment

Nice!

I am slightly surprised that running multiple invocations in parallel is not faster, since llvm-nm presumably processes multiple files sequentially when given them in one invocation. Is the issue that the process overhead is just bigger than the work llvm-nm does?

Assuming that is correct, then lgtm, with one request: also verify that the last iteration still speeds up a tiny program (I'm sure it will, but best to check).

@juj (Collaborator, Author) commented Feb 13, 2021

> I am slightly surprised that running multiple invocations in parallel is not faster, since llvm-nm presumably processes multiple files sequentially when given them in one invocation. Is the issue that the process overhead is just bigger than the work llvm-nm does?

I believe so - Python spawning the process, plus llvm-nm getting to the point where it can start processing files, is probably 90% of the total execution time. Parallelizing amplifies those slow parts, and Python is ridiculously slow at spinning up parallelism in the first place.

Maybe if this were a native C/C++ app that spawned subprocesses, it could be faster in parallel. But even then I doubt it: spawning a load of subprocesses that all hit the disk could just bottleneck disk access, so a single program running through each file in neat sequence can be just as fast.

@juj (Collaborator, Author) commented Feb 13, 2021

> Assuming that is correct, then lgtm, with one request: also verify that the last iteration still speeds up a tiny program (I'm sure it will, but best to check).

Given that the latest iteration shaved off a whopping 37% of the time, and the tiny program's llvm-nm step took on the order of 10 ms to begin with, it is trivially true that it won't regress (and even if it did, it would be on the scale of ~10 ms).

@juj (Collaborator, Author) commented Feb 13, 2021

Ran the profile nevertheless. (Btw, would you have a chance to review #13464 at some point? I split it out from this PR to avoid review churn, but it is awkward to keep octopus-merging working branches back together after having posted a PR.)

The result is the same 11 ms as before. Compiler sanity checks and compiler.js dominate.

[profiler screenshot: tiny_optimized_2]

@juj juj force-pushed the optimize_llvm_nm branch from a948575 to 71aab45 on February 13, 2021 13:28
juj added 7 commits February 13, 2021 18:50
@juj juj force-pushed the optimize_llvm_nm branch from 4d0656a to 34ae9a9 on February 13, 2021 16:50
@juj juj enabled auto-merge (squash) February 13, 2021 16:50
@juj juj force-pushed the optimize_llvm_nm branch from 99b39a6 to 9b2e687 on February 13, 2021 18:50
@sbc100 (Collaborator) commented Mar 15, 2021

Looks like this broke Windows users with many input files: #13661

juj added a commit that referenced this pull request Mar 17, 2021
* Add a test for #13661. See #13664 and #13465.

* flake

* Address review