No system lib multiprocessing #13493


Merged

juj merged 26 commits into emscripten-core:main from no_system_lib_multiprocessing on Mar 31, 2021

Conversation


@juj juj commented Feb 13, 2021

This PR removes the use of the Python multiprocessing library and replaces the remaining uses with Popen.

The benefits of dropping multiprocessing are:

Using multiprocessing in the test suite is fine with me (although running the test suite still hard-locks my 2990WX system unless I demote to EMCC_CORES=32, but that is a separate issue). The test runner is not performance-critical for end users.
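For context, a minimal sketch of the Popen-based pattern in question (bounded parallel spawning, blocking on the oldest process when at capacity); the names and structure here are illustrative, not the PR's actual code:

```python
import os
import subprocess

def run_in_parallel(commands, max_workers=os.cpu_count()):
  # Spawn up to max_workers commands at once; once at capacity, block on the
  # oldest process before spawning the next. Results are gathered in
  # completion order as (returncode, stdout, stderr) tuples.
  live = []
  results = []

  def finish(proc):
    out, err = proc.communicate()
    results.append((proc.returncode,
                    out.decode('UTF-8') if out else '',
                    err.decode('UTF-8') if err else ''))

  for cmd in commands:
    if len(live) >= max_workers:
      finish(live.pop(0))  # too many running in parallel; wait for the oldest
    live.append(subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE))
  for proc in live:
    finish(proc)
  return results
```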

@juj juj force-pushed the no_system_lib_multiprocessing branch 2 times, most recently from 34bc196 to 6928508 on February 14, 2021 at 11:23
@juj juj changed the title WIP: No system lib multiprocessing No system lib multiprocessing Feb 14, 2021

juj commented Feb 14, 2021

The test suite passes for me when run locally on Windows and on macOS. Oddly, CI fails on test-mac with:

======================================================================
FAIL: test_fignore_exceptions (test_other.other)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/distiller/emsdk/python/3.7.4-2_64bit/lib/python3.7/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/Users/distiller/emsdk/python/3.7.4-2_64bit/lib/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "/Users/distiller/project/tests/test_other.py", line 8970, in test_fignore_exceptions
    self.assertGreater(enable_size, disable_size)
  File "/Users/distiller/emsdk/python/3.7.4-2_64bit/lib/python3.7/unittest/case.py", line 1251, in assertGreater
    self.fail(self._formatMessage(msg, standardMsg))
  File "/Users/distiller/emsdk/python/3.7.4-2_64bit/lib/python3.7/unittest/case.py", line 693, in fail
    raise self.failureException(msg)
AssertionError: 12045 not greater than 16059

======================================================================
FAIL: test_metadce_cxx_noexcept (test_other.other)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/distiller/emsdk/python/3.7.4-2_64bit/lib/python3.7/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/Users/distiller/emsdk/python/3.7.4-2_64bit/lib/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "/Users/distiller/project/tests/runner.py", line 335, in resulting_test
    return func(self, *args)
  File "/Users/distiller/project/tests/test_other.py", line 6874, in test_metadce_cxx
    self.run_metadce_test('hello_libcxx.cpp', *args, check_funcs=False)
  File "/Users/distiller/project/tests/test_other.py", line 6806, in run_metadce_test
    self.assertLess(ratio, size_slack)
  File "/Users/distiller/emsdk/python/3.7.4-2_64bit/lib/python3.7/unittest/case.py", line 1239, in assertLess
    self.fail(self._formatMessage(msg, standardMsg))
  File "/Users/distiller/emsdk/python/3.7.4-2_64bit/lib/python3.7/unittest/case.py", line 693, in fail
    raise self.failureException(msg)
AssertionError: 0.6131595428213559 not less than 0.05

----------------------------------------------------------------------
Ran 559 tests in 1229.228s

FAILED (failures=2, skipped=20)
====================

but I am unable to reproduce this on my Mac (or on Windows). Not sure what to make of such a CI-only failure.

Base automatically changed from master to main March 8, 2021 23:50
@juj juj force-pushed the no_system_lib_multiprocessing branch from d97bfa1 to 065720c on March 28, 2021 at 09:08
@juj juj enabled auto-merge (squash) March 28, 2021 11:12

juj commented Mar 29, 2021

This PR is now ready to land.

tools/shared.py Outdated
with ToolchainProfiler.profile_block('run_multiple_processes'):
  processes = []
  start = 0
  end = 0
Member

start and end are not obvious enough to me as names; I need to really read the code below to understand what they do.

Collaborator Author

Rewrote this code to be less C-like and utilize dynamic arrays instead of static-like data. Hopefully it is more readable now.

tools/shared.py Outdated
  end += 1
else:
  # Too many commands running in parallel, wait for one to finish.
  out, err = processes[start].communicate()
Member

The worst case is that this blocks on a long-running process while all the other cores are idle, I think? I assume that is something multiprocessing avoids. For that reason it's not clear to me why this approach would be faster.

Also, multiprocessing keeps up a pool instead of spawning new processes all the time - is that not more efficient as well?

@juj juj commented Mar 29, 2021

The worst case is that this blocks on a long-running process while all the other cores are idle, I think? I assume that is something multiprocessing avoids.

It is true that we won't have utilization as tight as a 100% OS-driven select() would if runtimes are lopsided, but that does not stop this PR from winning in benchmarks.

Running embuilder build ALL in this PR on Windows PC 1 finishes in:

main: 342 seconds
this PR: 307.4 seconds (+11.3% faster)

On Windows PC 2:

main: DNF, Python multiprocessing hangs due to #13785
this PR: 152 seconds (+∞ faster)

In #13465, another use of this code was benchmarked to be 58.7% faster than Python multiprocessing.

Also, multiprocessing keeps up a pool instead of spawning new processes all the time - is that not more efficient as well?

Superficially it might seem like that, but in reality our use of those pooled subprocesses is just to spawn other command line tasks, so we won't gain any performance from having dedicated pooled Python subprocesses for that purpose. They'll run only as fast as those command line tasks do.

The performance we seemingly gain from pooling with multiprocessing comes from avoiding the slowdown of bringing up the multiprocessing pool itself. Better yet, if we don't use multiprocessing at all, we win back 100% of the time that its pooling aims to hide.

I.e. before we had (main Python process) -> (multiprocessing.Pool subprocess) -> (Popen cmdline).

Now we have just (main Python process) -> (Popen cmdline). If anything, we win performance by cutting out the useless middleman Python interpreter, which takes time to spin up and to communicate with across process boundaries.

However, to reiterate, this PR does not exist to optimize performance, but to fix correctness. Even if multiprocessing were somewhat faster than this, that does not help when it is broken (#8013, #718, #13785, etc.).

@kripken kripken left a comment

Sounds good, thanks for the detailed explanation!

The code looks good as well.

The one question I have before approving this is whether you've profiled it on anything other than a Windows machine? Would be good to confirm on Linux and Mac.

@juj juj force-pushed the no_system_lib_multiprocessing branch 2 times, most recently from 458a9db to 7d4c63e on March 31, 2021 at 12:46
@juj juj force-pushed the no_system_lib_multiprocessing branch from c599dc4 to edb0145 on March 31, 2021 at 14:05

juj commented Mar 31, 2021

The one question I have before approving this is whether you've profiled it on anything other than a Windows machine? Would be good to confirm on Linux and Mac.

Looking at the Popen documentation, I notice it allows passing a wait timeout. Updated to use that, since it provides better polling responsiveness.

I did benchmarks on macOS on a 4-core MacBook Pro from 2016; there I cannot find any meaningful speed difference between the runtimes in current main vs this branch. I don't have a Linux system set up at the moment, but I installed VirtualBox+Ubuntu with 4 cores, and there the results are very noisy; oddly, calling clang takes 2-4x longer compared to running it natively (already in the main branch). Filtering out the noise, this branch looks ~20% slower, so I restored the use of multiprocessing solely on Linux to avoid that. If you would like to run your own benchmarks of this PR vs main in a non-VM environment, that would be great.


kripken commented Mar 31, 2021

OK, I tested on a Linux machine. On various workloads I see no regression with this patch (nor any improvement either).

I did have to change this to make the patch work, adding the middle line:

def mp_run_process(command_tuple):
  temp_files = configuration.get_temp_files() # this was missing
  cmd, env, route_stdout_to_temp_files_suffix, pipe_stdout, check, cwd = command_tuple

We can also compare CI times once CI passes here. But it looks like speed is not an issue, nice.


kripken commented Mar 31, 2021

(The above is of course with the Linux-disabling code removed, so it was testing the new code path.)

Thinking on this some more, the VM perf difference you saw, @juj, is a little worrying. I wonder if environments like WSL might also be affected. Perhaps we should enable this on all platforms by default (to keep complexity low), but have an env var to fall back to the old path, and document that in the changelog? Then if we get reports of regressions it will be easy to find out whether this change caused them.


juj commented Mar 31, 2021

Perhaps we should enable this on all platforms by default (to keep complexity low), but have an env var to fall back to the old path, and document that in the changelog?

That works, updated.

I did have to change this to make the patch work, adding the middle line:

Oh, thanks. After introducing that, I only benchmarked embuilder and nothing with the JS optimizer, so that code path never got exercised. Updated now.
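For illustration, a sketch of what such an env var gate could look like (EM_PYTHON_MULTIPROCESSING is the variable discussed in this thread; both function names below are hypothetical placeholders):

```python
import os

# EM_PYTHON_MULTIPROCESSING is the variable discussed in this thread;
# both function names below are hypothetical placeholders.
if int(os.environ.get('EM_PYTHON_MULTIPROCESSING', '0')):
  results = run_with_multiprocessing_pool(commands)  # legacy fallback path
else:
  results = run_in_parallel(commands)                # new Popen-based path
```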


juj commented Mar 31, 2021

Thinking on this some more, the VM perf difference you saw @juj is a little worrying.

What I found really surprising about this perf difference is that even when I disabled the polling timeout, i.e. dedicated the main thread to a 100% busy-spin that loops through all the subprocesses waiting for the first one to finish, and set EMCC_CORES=3 on the four-core VM so that one core could be devoted to that completion spinning, I would still see the performance gap compared to using Python multiprocessing with 3 cores.

That is really odd, since the main thread should be able to poll completion of the processes thousands of times per second; in practice that should be as good as a select(), except that it wastes one core for the purpose. It would only make sense for this to be slower if there were a large API call overhead for .poll() or .communicate() on that system, or something along those lines.

On a 64-thread system I would expect to see the most dramatic effect of not having a select() primitive, but on that system the polling mechanism dominates Python multiprocessing in performance, maybe because it is Windows, where Python multiprocessing is weak for other reasons.
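For reference, a minimal sketch of the polling scheme being discussed, assuming a list of live Popen objects (the helper name and details are illustrative, not the PR's actual code):

```python
import subprocess

def wait_for_first_finished(processes):
  # Return (index, stdout, stderr) of the first live process to finish.
  # Scan every process with poll(); if none has exited yet, block briefly on
  # the oldest one via communicate() with a timeout instead of busy-spinning.
  while True:
    for i, proc in enumerate(processes):
      if proc.poll() is not None:  # returncode is set once the process exits
        out, err = proc.communicate()
        return i, out, err
    try:
      # communicate() also drains the stdout/stderr pipes while waiting, so
      # the child cannot block forever on a full OS pipe buffer.
      out, err = processes[0].communicate(timeout=0.2)
      return 0, out, err
    except subprocess.TimeoutExpired:
      pass  # nothing finished within the timeout; rescan
```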

@@ -38,6 +38,9 @@
EXPECTED_LLVM_VERSION = "13.0"
PYTHON = sys.executable

# Used only on Linux
Member

Suggested change:
-# Used only on Linux
+# Used only when EM_PYTHON_MULTIPROCESSING is set, see below

Member

I think this comment was missed? No big deal but would be good to update it eventually.

Member

(oh, it was because it had automerge and I approved it, heh...)

  out, err = processes[0][1].communicate(0.2)
  return (0, out.decode('UTF-8') if out else '', err.decode('UTF-8') if err else '')
except subprocess.TimeoutExpired:
  pass
Member

how about not always checking process 0 (oldest) in the last part here? instead, since we have a timeout, we can have a counter k that replaces that 0, and cycles around. Then if the oldest process is very very slow we won't get stuck there. (The problem is probably very rare, but worth avoiding I think.)

also, I think this can be simplified to avoid the duplication of 168-169 and 174-175. Instead of the second communicate(0.2) we can call wait(0.2). If the wait succeeds we can set j and go back to the top of the loop, where the polling will find it. But not sure if that's simpler overall.

Collaborator Author

Then if the oldest process is very very slow we won't get stuck there.

I don't think that is the right call. If we have an array of processes and assume at any given time that the oldest process should finish first, then after checking that process and finding that neither it nor any other process in the array has finished, nothing in that observation changes the assumption that the oldest process should finish first. I.e. it is optimal to wait on the first process again after observing that none of the processes have finished.

Collaborator Author

instead of the second communicate(0.2) we can call wait(0.2)

I had it like that at first, but it would deadlock the harfbuzz build. The Python docs state that wait can deadlock:

Note

This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.

Why Python carries that function around instead of fixing it is beyond me. It does deadlock in practice, even with a tiny wait timeout like 0.2 seconds.
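To illustrate the difference, a sketch of the hazard and the communicate()-based workaround, per the docs note above (some_tool is a hypothetical command):

```python
import subprocess

proc = subprocess.Popen(['some_tool'], stdout=subprocess.PIPE)  # hypothetical command

# The hazard the docs describe: wait() never drains the pipe, so once the
# child blocks writing to a full OS pipe buffer it can never exit, and a
# loop of wait(timeout=0.2) calls raises TimeoutExpired forever:
#
#   while True:
#     try:
#       proc.wait(timeout=0.2)  # can loop forever with stdout=PIPE
#       break
#     except subprocess.TimeoutExpired:
#       pass

# communicate() reads the pipes while waiting, so the child can always make
# progress; output gathered before a TimeoutExpired is kept for the retry.
while True:
  try:
    out, _ = proc.communicate(timeout=0.2)
    break
  except subprocess.TimeoutExpired:
    pass  # child still running; retry without losing buffered output
```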

Member

Sounds good on both points.

For the first, I missed that it goes back up anyhow in the loop, so yes, this seems optimal (or as optimal as can be without select).


@juj juj merged commit 19957ab into emscripten-core:main Mar 31, 2021

sbc100 commented Apr 6, 2021

I love this change in general, but why keep EM_PYTHON_MULTIPROCESSING around? Is it just temporary, or for profiling? I would assume we don't actually want users to use it; otherwise we would need to add testing of it.


kripken commented Apr 6, 2021

Good point that we should test it; I opened #13832.

I think we can remove it after a release or so if we don't get reports of issues. If we do get a bug report like "my build is 50% slower", it will be nice to let the user flip the flag off to see if this change is the cause.

kripken added a commit that referenced this pull request Apr 8, 2021
See #13493

Also contains a drive-by fix for a test that used env_modify incorrectly.

truboxl commented Apr 21, 2021

I think the tests still use multiprocessing, so testing for multiprocessing is moot.
For example, Android still can't run the tests without EMCC_CORES=1, but it greatly benefited from this Popen approach for normal compilation.
