Optimize llvm-nm #13465
Conversation
(force-pushed from b3c1989 to de65035)
We have a singleton multiprocessing pool, so if it's created anyway, I'm not sure this matters? Other uses are the JS optimizer (in wasm2js, but not wasm) and system library building, but there might be more. If this is helpful to most builds, then this makes sense. But then can we do it as a wrapper around the pool, so it helps the other cases as well, and not just llvm-nm specifically? I assume the same reasoning applies there too.
That's right. Originally we had several multiprocessing pools; at some point I switched them to a common singleton one, and some time later I switched the singleton to be a create-on-demand pool. So we have a singleton pool that is spun up when first needed. Currently the pool is used for exactly three purposes: performing these parallel llvm-nms when linking code together, the JS optimizer, and building system libs. There are no other cases, and the JS optimizer is used only in wasm2js. Note this PR doesn't change the pool behavior in those other use cases. The change is a big improvement to debug build iteration times: simple builds won't need to spin up the pool at all, since they do the few llvm-nms in a single thread, which runs about 10x faster.
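For illustration, a minimal sketch of what a create-on-demand singleton pool wrapper along the lines discussed above might look like. The names `get_pool` and `run_tasks` and the cutoff value are hypothetical, not Emscripten's actual internals:

```python
import multiprocessing

_pool = None  # created lazily; spinning up the pool is itself expensive


def get_pool():
    """Create the singleton pool on first use."""
    global _pool
    if _pool is None:
        _pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    return _pool


def run_tasks(func, items, small_job_cutoff=4):
    # Placing the "small job -> run inline" decision in the wrapper would
    # benefit every pool user (llvm-nm, JS optimizer, system lib builds),
    # not just llvm-nm. The cutoff value here is an illustrative guess.
    if len(items) <= small_job_cutoff:
        return [func(item) for item in items]
    return get_pool().map(func, items)
```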
Actually I now realized that we don't need the multiprocessing pool at all for this task. It is bad, slow, buggy, clunky, bloated, and all the other FUD adjectives one can associate with a piece of software they don't like. A major source of headache. :( But when looking at the problem this morning, I realized that of course good old llvm-nm accepts a list of input files, so we can batch all the files into a single invocation instead of spinning up a pool. Here are benchmarks in tiny and large contexts:

A tiny sized project: https://github.com/juj/wasm_webgpu consists of a single .cpp file that builds to a .a library, plus three individual main apps that each have a single .cpp file and link in the static library. This generates two llvm-nm calls per app, for a total of 6x llvm-nm.

Before this PR, a debug iteration build took 3.3 seconds, of which llvm-nm code ate up 2.3 seconds: http://clb.confined.space/dump/toolchain_profiler.results_20210211_1209.html

After this PR, a debug iteration build takes 650 msecs, of which llvm-nm code takes 14 msecs: http://clb.confined.space/dump/toolchain_profiler.results_20210211_1214.html

A large sized project: https://github.com/gabrielcuvillier/d3wasm/ , Gabriel Cuvillier's port of Doom 3 to Wasm. The final link consists of 225 calls to llvm-nm. CC @gabrielcuvillier

Before this PR, a release link step took 31.7 seconds, of which llvm-nm calls took up 3.8 seconds: http://clb.confined.space/dump/d3wasm_baseline.html

After this PR, a release link step takes 28.9 seconds, of which llvm-nm calls take 900 msecs: http://clb.confined.space/dump/d3wasm_llvm_nm_optimized.html

This should be a win across the board. I dream of a world where python multiprocessing is not used at all.
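A sketch of the batching idea (llvm-nm genuinely accepts multiple input files per invocation; the helper name below is hypothetical):

```python
import subprocess


def llvm_nm_batched(llvm_nm_path, files):
    # One process launch amortizes the ~10-20 ms startup cost over all
    # inputs; llvm-nm prints a "file:" header before each file's symbols,
    # so the combined output can still be split per file afterwards.
    proc = subprocess.run([llvm_nm_path] + files,
                          capture_output=True, text=True, check=False)
    return proc.stdout
```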
CI failures do not look related to this PR.
Actually, it looks like http://clb.confined.space/dump/d3wasm_optimized_2.html shows something still taking up 570ms for the …
Nice!
I am slightly surprised that running multiple invocations is not faster, as llvm-nm, when given multiple files, presumably runs them sequentially. Is the issue that the process overhead is just bigger than the work llvm-nm does?
Assuming that is correct, then lgtm, with also verifying that the latest iteration still speeds up a tiny program (I'm sure it will, but best to check).
I believe so: Python spawning the process, and llvm-nm getting to a state where it can start processing the files, is probably 90% of the total execution time. Parallelizing amplifies the slow parts of the execution, plus Python is ridiculously slow at spinning up parallelization. Maybe if this was a native C/C++ app that spawned subprocesses it could be faster in parallel. But even then I doubt it, because spawning a load of subprocesses that all hit the disk could just bottleneck on disk access, so a single program running through each disk access in neat sequence can be just as fast.
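A rough micro-benchmark sketch of that claim, assuming a local llvm-nm binary and some object files on disk (paths and helper names are placeholders):

```python
import multiprocessing
import subprocess
import time


def time_pool_startup():
    # Measures only pool creation and teardown; this alone can cost
    # hundreds of milliseconds, per the numbers cited in this PR.
    t0 = time.time()
    pool = multiprocessing.Pool()
    pool.close()
    pool.join()
    return time.time() - t0


def time_sequential_nm(llvm_nm_path, files):
    # Each individual llvm-nm run is only ~10-20 ms, so for a handful
    # of files the sequential loop finishes before the pool even starts.
    t0 = time.time()
    for f in files:
        subprocess.run([llvm_nm_path, f], capture_output=True)
    return time.time() - t0
```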
Given that the latest iteration shaved a whopping 37% off the time, and the tiny program took only on the order of 10ms in total, it is trivially true that it won't regress (and even if it did, it would be on the scale of ~10ms).
Ran through the profile nevertheless. (Btw, would you have a chance to review #13464 at some point? I split it out from this PR to avoid review churn, but it is awkward to keep octopus-merging working branches back together after having posted a PR.) The result is the same 11ms as before; compiler sanity checks and compiler.js dominate.
(force-pushed from a948575 to 71aab45)
…ltiprocessing pool is *extremely* slow, and can take 500ms-1 second. If there are only a few files to llvm-nm, do those sequentially, since a single llvm-nm only takes about 10-20ms.
…: it accepts a list of files to process.
(force-pushed from 4d0656a to 34ae9a9)
(force-pushed from 99b39a6 to 9b2e687)
Looks like this broke Windows users with many input files: #13465
Optimize llvm-nm invocations from Emscripten. In Python, creating a multiprocessing pool is extremely slow and can take 500ms-1 second. If there are only a few files to llvm-nm, process those sequentially, since a single llvm-nm invocation only takes about 10-20ms.
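A minimal sketch of the dispatch this description implies; the cutoff value and helper names are hypothetical assumptions, not the PR's actual constants:

```python
import multiprocessing
import subprocess

SEQUENTIAL_CUTOFF = 8  # hypothetical: below this, a pool isn't worth it


def _nm_one(args):
    llvm_nm_path, f = args
    return subprocess.run([llvm_nm_path, f],
                          capture_output=True, text=True).stdout


def run_llvm_nm(llvm_nm_path, files):
    if len(files) <= SEQUENTIAL_CUTOFF:
        # Single batched invocation: one ~10-20 ms process launch total,
        # and no 500ms-1s pool spin-up.
        return subprocess.run([llvm_nm_path] + files,
                              capture_output=True, text=True).stdout
    # Many files: the pool's startup cost is amortized over real work.
    with multiprocessing.Pool() as pool:
        outputs = pool.map(_nm_one, [(llvm_nm_path, f) for f in files])
    return "".join(outputs)
```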