No system lib multiprocessing #13493
Force-pushed 34bc196 to 6928508.
Test suite passes for me when I run locally on Windows and on macOS. Oddly, the CI fails on test-mac on:

but I am unable to reproduce this on my Mac (or Windows). Not sure what to make of such a CI-only failure?
Force-pushed d97bfa1 to 065720c.
This PR is now ready to land.
tools/shared.py (outdated)

    with ToolchainProfiler.profile_block('run_multiple_processes'):
      processes = []
      start = 0
      end = 0
`start` and `end` are not obvious enough to me as names; I need to really read the code below to understand what they do.
Rewrote this code to be less C-like, using dynamic arrays instead of static-style data. Hopefully it is more readable now.
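For illustration, a minimal sketch of that shape (a hypothetical standalone toy, not the actual tools/shared.py code; `run_multiple_processes` and `max_parallel` are made-up names here): the in-flight processes live in a dynamic list instead of being tracked with `start`/`end` indices into a static array.

```python
# Hypothetical sketch: run a list of command lines with plain subprocess.Popen,
# keeping currently running processes in a dynamic list.
import subprocess

def run_multiple_processes(commands, max_parallel=4):
  results = [None] * len(commands)
  processes = []  # (index, Popen) pairs currently in flight
  next_cmd = 0
  while next_cmd < len(commands) or processes:
    if next_cmd < len(commands) and len(processes) < max_parallel:
      proc = subprocess.Popen(commands[next_cmd], stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
      processes.append((next_cmd, proc))
      next_cmd += 1
    else:
      # All parallel slots are busy (or everything has been spawned):
      # block on the oldest process before spawning more.
      idx, proc = processes.pop(0)
      out, err = proc.communicate()
      results[idx] = (proc.returncode, out.decode(), err.decode())
  return results
```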
tools/shared.py (outdated)

    end += 1
    else:
      # Too many commands running in parallel, wait for one to finish.
      out, err = processes[start].communicate()
The worst case is that this blocks on a long-running process while all the other cores are idle, I think? I assume that is something `multiprocessing` avoids. For that reason it's not clear to me why this approach would be faster.

Also, `multiprocessing` keeps up a pool instead of spawning new processes all the time - is that not more efficient as well?
> The worst case is that this blocks on a long-running process while all the other cores are idle, I think? I assume that is something `multiprocessing` avoids.
It is true that we won't have as tight utilization as a 100% OS-driven select() would have if runtimes are lopsided, but that does not stop this PR from winning in benchmarks.
Running `embuilder build ALL` in this PR on Windows PC 1 finishes in:
- main: 342 seconds
- this PR: 307.4 seconds (+11.3% faster)

On Windows PC 2:
- main: DNF, Python `multiprocessing` hangs due to #13785
- this PR: 152 seconds (+∞ faster)
In #13465 another use of this PR was benchmarked to be +58.7% faster using this code instead of Python `multiprocessing`.
> Also, `multiprocessing` keeps up a pool instead of spawning new processes all the time - is that not more efficient as well?
Superficially it might seem like that, but in reality our use of those pooled subprocesses is just to spawn other command line tasks, so we certainly won't gain any performance from having dedicated pooled-up Python subprocesses for that purpose. They'll run as fast as those command line tasks do.
The performance that we seemingly gain from pooling with `multiprocessing` is really just avoiding the slowdowns that bringing up the Python multiprocessing pool itself incurs. Better yet, if we don't use `multiprocessing` at all, we win 100% of the time that its pooling aims to claw back.

I.e. before we had (main python process) -> (multiprocessing.pool subprocess) -> (Popen cmdline). Now we have just (main python process) -> (Popen cmdline). If anything, we win in performance by cutting out the useless middleman Python interpreter, which takes time to spin up and to communicate with across process boundaries.
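To make the two call chains concrete, here is a hypothetical sketch (made-up names `run_tool`, `run_via_pool`, `run_direct`; not the PR's code): with `multiprocessing`, each worker is itself a Python interpreter that then spawns the real tool; with plain `Popen` the main process spawns the tool directly.

```python
# Hypothetical contrast of the two chains described above.
import subprocess
import sys

def run_tool(cmd):
  # What each multiprocessing worker ultimately does anyway:
  # spawn the real command line tool and wait for it.
  return subprocess.run(cmd, capture_output=True).returncode

def run_via_pool(cmds, workers=2):
  # Old chain: main python -> multiprocessing worker -> Popen cmdline.
  import multiprocessing
  with multiprocessing.Pool(workers) as pool:
    return pool.map(run_tool, cmds)

def run_direct(cmds):
  # New chain: main python -> Popen cmdline, no middleman interpreter.
  procs = [subprocess.Popen(c, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
           for c in cmds]
  for p in procs:
    p.communicate()
  return [p.returncode for p in procs]
```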
However, to reiterate, this PR does not exist to optimize performance, but to fix correctness. Even if `multiprocessing` were any amount faster than this, that does not help when it is broken (#8013, #718, #13785, etc.).
Sounds good, thanks for the detailed explanation!
The code looks good as well.
The one question I have before approving this is whether you've profiled it on anything other than a Windows machine? It would be good to confirm on Linux and Mac.
Force-pushed 458a9db to 7d4c63e.
Force-pushed c599dc4 to edb0145.
Looking at the Popen documentation, I noticed it allows passing a wait timeout. Updated to use that, since it can provide better polling responsiveness. Did benchmarks on macOS on a 4-core MacBook Pro from 2016; there I cannot find any meaningful speed difference between the runtimes.
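The timeout mechanism can be sketched as a small standalone example (not the PR's actual loop): `communicate(timeout=...)` raises `subprocess.TimeoutExpired` if the child is still running when the timeout expires, and the call can simply be retried, so a scheduler loop gets regular chances to poll its other processes.

```python
# Minimal sketch of Popen's wait timeout: communicate(timeout=...) raises
# subprocess.TimeoutExpired if the child has not finished yet; retrying
# the call is documented as safe.
import subprocess
import sys

proc = subprocess.Popen([sys.executable, '-c', 'print("done")'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
while True:
  try:
    out, err = proc.communicate(timeout=0.2)  # wait at most 0.2 seconds
    break
  except subprocess.TimeoutExpired:
    pass  # still running; a real loop would poll other processes here
```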
Ok, I tested on a Linux machine. On various workloads I see no regression with this patch (nor improvement either). I did have to change this to make the patch work, adding the middle line:
We can also compare CI times once CI passes here. But it looks like speed is not an issue, nice.
(The above is of course with the Linux disabling-code removed, so it was testing the new code path.) Thinking on this some more, the VM perf difference you saw @juj is a little worrying. I wonder if there are things like WSL that might also be affected. Perhaps we should enable this on all platforms by default (to keep complexity low), but have an env var to fall back to the old path, and document that in the changelog? Then if we get reports of regressions it will be easy to find out if this is the cause.
That works, updated.
Oh thanks, after introducing that I only benchmarked embuilder and nothing with the JS optimizer, so that code path never got exercised. Updated now.
What I found really surprising about this perf difference is that even if I disabled the polling timeout, i.e. dedicated the main thread to a 100% busy-spin looping through all the subprocesses to see which one finishes first, and set EMCC_CORES=3 on the four-core VM so that one thread could be dedicated to such wait-completion spinning, I would still see the performance issue compared to using Python multiprocessing with 3 cores. That is really odd, since the main thread should be able to poll completion of the processes thousands of times per second, and in practice that should be as good as a select(), just wasting one core for the purpose. It would only make sense to be slower if there was a large API call overhead.

On a 64-thread system I'd imagine I'd see the most dramatic effect of not having a select() primitive, but on that system the polling mechanism dominates in performance over Python multiprocessing - maybe because it is Windows and Python multiprocessing is weak there for other reasons.
@@ -38,6 +38,9 @@
EXPECTED_LLVM_VERSION = "13.0"
PYTHON = sys.executable

# Used only on Linux
Suggested change:
- # Used only on Linux
+ # Used only when EM_PYTHON_MULTIPROCESSING is set, see below
I think this comment was missed? No big deal but would be good to update it eventually.
(oh, it was because it had automerge and I approved it, heh...)
    out, err = processes[0][1].communicate(0.2)
    return (0, out.decode('UTF-8') if out else '', err.decode('UTF-8') if err else '')
    except subprocess.TimeoutExpired:
      pass
How about not always checking process 0 (the oldest) in the last part here? Instead, since we have a timeout, we can have a counter `k` that replaces that 0 and cycles around. Then if the oldest process is very, very slow we won't get stuck there. (The problem is probably very rare, but worth avoiding I think.)

Also, I think this can be simplified to avoid the duplication of lines 168-169 and 174-175. Instead of the second `communicate(0.2)` we can call `wait(0.2)`. If the wait succeeds we can set `j` and go back to the top of the loop, where the polling will find it. But not sure if that's simpler overall.
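The cycling-counter idea might look roughly like this (a hypothetical sketch under the assumption that `processes` holds `(index, Popen)` pairs; not the PR's actual code):

```python
# Hypothetical sketch of the cycling counter: instead of always blocking on
# the oldest process, rotate which in-flight process gets the timed wait,
# so one very slow process cannot monopolize the 0.2 second timeout slot.
import subprocess
import sys

processes = [(i, subprocess.Popen([sys.executable, '-c', 'print(%d)' % i],
                                  stdout=subprocess.PIPE))
             for i in range(3)]
results = {}
k = 0
while processes:
  idx, proc = processes[k % len(processes)]
  try:
    out, _ = proc.communicate(timeout=0.2)
    results[idx] = out.decode().strip()
    processes.remove((idx, proc))
  except subprocess.TimeoutExpired:
    k += 1  # this process is slow; cycle to the next one
```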
> Then if the oldest process is very very slow we won't get stuck there.
I don't think that is the right call. If we have an array of processes and at any time assume that the oldest process should finish first, then after having checked that process and found it has not finished, and that no other process has finished either, none of that information changes our assumption that the oldest process should finish first. I.e. it is optimal to wait on the first process again after we have observed that none of the processes in the array have finished.
> Instead of the second `communicate(0.2)` we can call `wait(0.2)`.

I had it like that first, but it would deadlock the harfbuzz build. The Python docs state that `wait` is broken:

> Note: This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.
Why Python then carries that function around instead of fixing their code is beyond me. But it does deadlock in practice, even if the wait timeout is as tiny as 0.2 seconds.
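The deadlock the docs describe comes from the OS pipe buffer (commonly around 64 KiB) filling up: the child blocks writing its output while the parent blocks in `wait()` without reading any of it. A small illustration (the output size is just an illustrative number well past typical buffer sizes):

```python
# Illustration of the pipe-buffer hazard quoted above: this child writes far
# more output than a typical OS pipe buffer holds. Calling proc.wait() here
# could deadlock (child blocked writing, parent blocked waiting), which is
# why communicate() is used instead: it drains the pipe while it waits.
import subprocess
import sys

proc = subprocess.Popen([sys.executable, '-c', 'print("x" * 1000000)'],
                        stdout=subprocess.PIPE)
out, _ = proc.communicate()  # safe: reads stdout concurrently with waiting
```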
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good on both points.
For the first, I missed that it goes back up anyhow in the loop, so yes, this seems optimal (or as optimal as can be without select).
I love this change in general, but why keep EM_PYTHON_MULTIPROCESSING around? Is it just temporary, or for profiling? I would assume we don't actually want users to use it; otherwise we would need to add testing of it.
Good point that we should test it; I opened #13832. I think we can remove it after a release or so if we don't get reports of issues. If we do get a bug like "my build is 50% slower", it will be nice to let the user flip it off to see if this is the cause.
See #13493. Also contains a drive-by fix for a test that used env_modify incorrectly.
I think the tests still use multiprocessing, so testing for multiprocessing is moot.
This PR removes the use of the Python `multiprocessing` library and replaces the remaining uses with `Popen`. The benefits of dropping multiprocessing are:
Using multiprocessing in the test suite is fine by me (although running the test suite still hard-locks my 2990WX system unless I demote to EMCC_CORES=32, but that is another issue). The test runner is not performance critical for end users.