No system lib multiprocessing #13493
Force-pushed 34bc196 to 6928508.
Test suite passes for me when I run locally on Windows and on macOS. Oddly, the CI fails on test-mac on:

but I am unable to reproduce this on my Mac (or Windows). Not sure what to make of such a CI-only failure?
Force-pushed d97bfa1 to 065720c.
This PR is now ready to land.
tools/shared.py (outdated)

    with ToolchainProfiler.profile_block('run_multiple_processes'):
      processes = []
      start = 0
      end = 0
`start` and `end` are not obvious enough to me as names; I need to really read the code below to understand what they do.
Rewrote this code to be less C-like, using dynamic arrays instead of static-style data. Hopefully it is more readable now.
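For illustration, a minimal sketch of that shape (a hypothetical standalone toy, not the actual tools/shared.py code; `run_multiple_processes` and `max_parallel` are made-up names here): the in-flight processes live in a dynamic list instead of being tracked with `start`/`end` indices into a static array.

```python
# Hypothetical sketch: run a list of command lines with plain subprocess.Popen,
# keeping currently running processes in a dynamic list.
import subprocess

def run_multiple_processes(commands, max_parallel=4):
  results = [None] * len(commands)
  processes = []  # (index, Popen) pairs currently in flight
  next_cmd = 0
  while next_cmd < len(commands) or processes:
    if next_cmd < len(commands) and len(processes) < max_parallel:
      proc = subprocess.Popen(commands[next_cmd], stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
      processes.append((next_cmd, proc))
      next_cmd += 1
    else:
      # All parallel slots are busy (or everything has been spawned):
      # block on the oldest process before spawning more.
      idx, proc = processes.pop(0)
      out, err = proc.communicate()
      results[idx] = (proc.returncode, out.decode(), err.decode())
  return results
```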
tools/shared.py (outdated)

    end += 1
    else:
      # Too many commands running in parallel, wait for one to finish.
      out, err = processes[start].communicate()
The worst case is that this blocks on a long-running process while all the other cores are idle, I think? I assume that is something `multiprocessing` avoids. For that reason it's not clear to me why this approach would be faster.

Also, `multiprocessing` keeps up a pool instead of spawning new processes all the time - is that not more efficient as well?
> The worst case is that this blocks on a long-running process while all the other cores are idle, I think? I assume that is something `multiprocessing` avoids.
It is true that we won't have as tight utilization as a 100% OS-driven select() would have if runtimes are lopsided, but that does not stop this PR from winning in benchmarks.
Running `embuilder build ALL` in this PR on Windows PC 1 finishes in:
- main: 342 seconds
- this PR: 307.4 seconds (+11.3% faster)

On Windows PC 2:
- main: DNF, Python `multiprocessing` hangs due to #13785
- this PR: 152 seconds (+∞ faster)
In #13465 another use of this PR was benchmarked to be +58.7% faster using this code instead of Python `multiprocessing`.
> Also, `multiprocessing` keeps up a pool instead of spawning new processes all the time - is that not more efficient as well?
Superficially it might seem like that, but in reality our use of those pooled subprocesses is just to spawn other command line tasks, so we certainly won't gain any performance from having dedicated pooled-up Python subprocesses for that purpose. They'll run as fast as those command line tasks do.
The performance that we seemingly gain from pooling with `multiprocessing` is really just avoiding the slowdowns that bringing up the Python multiprocessing pool itself incurs. Better yet, if we don't use `multiprocessing` at all, we win 100% of the time that its pooling aims to claw back.

I.e. before we had (main python process) -> (multiprocessing.pool subprocess) -> (Popen cmdline). Now we have just (main python process) -> (Popen cmdline). If anything, we win in performance by cutting out the useless middleman Python interpreter, which takes time to spin up and to communicate with across process boundaries.
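To make the two call chains concrete, here is a hypothetical sketch (made-up names `run_tool`, `run_via_pool`, `run_direct`; not the PR's code): with `multiprocessing`, each worker is itself a Python interpreter that then spawns the real tool; with plain `Popen` the main process spawns the tool directly.

```python
# Hypothetical contrast of the two chains described above.
import subprocess
import sys

def run_tool(cmd):
  # What each multiprocessing worker ultimately does anyway:
  # spawn the real command line tool and wait for it.
  return subprocess.run(cmd, capture_output=True).returncode

def run_via_pool(cmds, workers=2):
  # Old chain: main python -> multiprocessing worker -> Popen cmdline.
  import multiprocessing
  with multiprocessing.Pool(workers) as pool:
    return pool.map(run_tool, cmds)

def run_direct(cmds):
  # New chain: main python -> Popen cmdline, no middleman interpreter.
  procs = [subprocess.Popen(c, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
           for c in cmds]
  for p in procs:
    p.communicate()
  return [p.returncode for p in procs]
```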
However, to reiterate, this PR does not exist to optimize performance, but to fix correctness. Even if `multiprocessing` were any amount faster than this, that does not help when it is broken (#8013, #718, #13785, etc.).
Sounds good, thanks for the detailed explanation!
The code looks good as well.
The one question I have before approving this is whether you've profiled it on anything other than a Windows machine? It would be good to confirm on Linux and Mac.
Force-pushed 458a9db to 7d4c63e.
Force-pushed c599dc4 to edb0145.
Looking at the Popen documentation, I noticed it allows passing a wait timeout. Updated to use that, since it can provide better polling responsiveness. Did benchmarks on macOS on a 4-core MacBook Pro from 2016; there I cannot find any meaningful speed difference between the runtimes.
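The timeout mechanism can be sketched as a small standalone example (not the PR's actual loop): `communicate(timeout=...)` raises `subprocess.TimeoutExpired` if the child is still running when the timeout expires, and the call can simply be retried, so a scheduler loop gets regular chances to poll its other processes.

```python
# Minimal sketch of Popen's wait timeout: communicate(timeout=...) raises
# subprocess.TimeoutExpired if the child has not finished yet; retrying
# the call is documented as safe.
import subprocess
import sys

proc = subprocess.Popen([sys.executable, '-c', 'print("done")'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
while True:
  try:
    out, err = proc.communicate(timeout=0.2)  # wait at most 0.2 seconds
    break
  except subprocess.TimeoutExpired:
    pass  # still running; a real loop would poll other processes here
```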
Ok, I tested on a Linux machine. On various workloads I see no regression with this patch (nor improvement either). I did have to change this to make the patch work, adding the middle line:
We can also compare CI times once CI passes here. But it looks like speed is not an issue, nice.
(The above is of course with the Linux disabling-code removed, so it was testing the new code path.) Thinking on this some more, the VM perf difference you saw @juj is a little worrying. I wonder if there are things like WSL that might also be affected. Perhaps we should enable this on all platforms by default (to keep complexity low), but have an env var to fall back to the old path, and document that in the changelog? Then if we get reports of regressions it will be easy to find out if this is the cause.
That works, updated.
Oh thanks, after introducing that I only benchmarked embuilder and nothing with the JS optimizer, so that code path never got exercised. Updated now.
What I found really surprising about this perf difference is that even if I disabled the polling timeout, i.e. dedicated the main thread to a 100% busy-spin looping through all the subprocesses to see which one finishes first, and set EMCC_CORES=3 on the four-core VM so that one thread could be dedicated to such wait-completion spinning, I would still see the performance issue compared to using Python multiprocessing with 3 cores. That is really odd, since the main thread should be able to poll completion of the processes thousands of times per second, and in practice that should be as good as a select(), just wasting one core for the purpose. It would only make sense to be slower if there was a large API call overhead.

On a 64-thread system I'd imagine I'd see the most dramatic effect of not having a select() primitive, but on that system the polling mechanism dominates in performance over Python multiprocessing - maybe because it is Windows and Python multiprocessing is weak there for other reasons.
@@ -38,6 +38,9 @@
EXPECTED_LLVM_VERSION = "13.0"
PYTHON = sys.executable

# Used only on Linux
Suggested change:
- # Used only on Linux
+ # Used only when EM_PYTHON_MULTIPROCESSING is set, see below
I think this comment was missed? No big deal but would be good to update it eventually.
(oh, it was because it had automerge and I approved it, heh...)
    out, err = processes[0][1].communicate(0.2)
    return (0, out.decode('UTF-8') if out else '', err.decode('UTF-8') if err else '')
    except subprocess.TimeoutExpired:
      pass
How about not always checking process 0 (the oldest) in the last part here? Instead, since we have a timeout, we can have a counter `k` that replaces that 0 and cycles around. Then if the oldest process is very, very slow we won't get stuck there. (The problem is probably very rare, but worth avoiding I think.)

Also, I think this can be simplified to avoid the duplication of lines 168-169 and 174-175. Instead of the second `communicate(0.2)` we can call `wait(0.2)`. If the wait succeeds we can set `j` and go back to the top of the loop, where the polling will find it. But not sure if that's simpler overall.
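The cycling-counter idea might look roughly like this (a hypothetical sketch under the assumption that `processes` holds `(index, Popen)` pairs; not the PR's actual code):

```python
# Hypothetical sketch of the cycling counter: instead of always blocking on
# the oldest process, rotate which in-flight process gets the timed wait,
# so one very slow process cannot monopolize the 0.2 second timeout slot.
import subprocess
import sys

processes = [(i, subprocess.Popen([sys.executable, '-c', 'print(%d)' % i],
                                  stdout=subprocess.PIPE))
             for i in range(3)]
results = {}
k = 0
while processes:
  idx, proc = processes[k % len(processes)]
  try:
    out, _ = proc.communicate(timeout=0.2)
    results[idx] = out.decode().strip()
    processes.remove((idx, proc))
  except subprocess.TimeoutExpired:
    k += 1  # this process is slow; cycle to the next one
```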
> Then if the oldest process is very very slow we won't get stuck there.
I don't think that is the right call. If we have an array of processes and at any time assume that the oldest process should finish first, then after having checked that process and found it has not finished, and that no other process has finished either, none of that information changes our assumption that the oldest process should finish first. I.e. it is optimal to wait on the first process again after we have observed that none of the processes in the array have finished.
> Instead of the second `communicate(0.2)` we can call `wait(0.2)`.

I had it like that first, but it would deadlock the harfbuzz build. The Python docs state that `wait` is broken:

> Note: This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.
Why Python then carries that function around instead of fixing their code is beyond me. But it does deadlock in practice, even if the wait timeout is as tiny as 0.2 seconds.
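The deadlock the docs describe comes from the OS pipe buffer (commonly around 64 KiB) filling up: the child blocks writing its output while the parent blocks in `wait()` without reading any of it. A small illustration (the output size is just an illustrative number well past typical buffer sizes):

```python
# Illustration of the pipe-buffer hazard quoted above: this child writes far
# more output than a typical OS pipe buffer holds. Calling proc.wait() here
# could deadlock (child blocked writing, parent blocked waiting), which is
# why communicate() is used instead: it drains the pipe while it waits.
import subprocess
import sys

proc = subprocess.Popen([sys.executable, '-c', 'print("x" * 1000000)'],
                        stdout=subprocess.PIPE)
out, _ = proc.communicate()  # safe: reads stdout concurrently with waiting
```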
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good on both points.
For the first, I missed that it goes back up anyhow in the loop, so yes, this seems optimal (or as optimal as can be without select).
I love this change in general, but why keep EM_PYTHON_MULTIPROCESSING around? Is it just temporary, or for profiling? I would assume we don't actually want users to use it; otherwise we would need to add testing of it.
Good point that we should test it; I opened #13832. I think we can remove it after a release or so if we don't get reports of issues. If we do get a bug like "my build is 50% slower", it will be nice to let the user flip it off to see if this is the cause.
See #13493. Also contains a drive-by fix for a test that used env_modify incorrectly.
I think the tests still use multiprocessing, so testing for multiprocessing is moot.
This PR removes the use of the Python `multiprocessing` library and replaces the remaining uses with `Popen`. The benefits of dropping multiprocessing are:
Using multiprocessing in the test suite is fine by me (although running the test suite still hard-locks my 2990WX system unless I demote to EMCC_CORES=32, but that is another issue). The test runner is not performance critical for end users.