
Fix race in blas_thread_shutdown. #3111

Merged: 1 commit merged into OpenMathLib:develop on Feb 19, 2021

Conversation

@hawkinsp (Contributor) commented Feb 18, 2021

blas_server_avail was read without holding server_lock. If multiple threads call blas_thread_shutdown simultaneously, for example by calling fork(), they can each attempt to shut down the thread pool. This can lead to a segmentation fault.

This may be the same issue as #2270, although in my case it manifested as a segmentation fault rather than a deadlock.
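For reference, a minimal sketch of the locking pattern described above (the function body is simplified; blas_server_avail, server_lock, and blas_thread_shutdown are the identifiers from driver/others/blas_server.c, everything else is illustrative):

```c
#include <pthread.h>

/* Illustrative globals mirroring driver/others/blas_server.c */
static int blas_server_avail = 0;
static pthread_mutex_t server_lock = PTHREAD_MUTEX_INITIALIZER;

int blas_thread_shutdown(void) {
  /* Take the lock *before* reading blas_server_avail, so that two
     threads forking at the same time cannot both see the pool as
     "available" and both try to tear it down. */
  pthread_mutex_lock(&server_lock);

  if (blas_server_avail) {
    /* ... join worker threads, free queues ... */
    blas_server_avail = 0;
  }

  pthread_mutex_unlock(&server_lock);
  return 0;
}
```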

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request on Feb 19, 2021
…/execvp() on non-Android POSIX platforms.

The goal of this change is to avoid calling pthread_atfork() handlers. Some libraries, in particular the version of OpenBLAS included in NumPy, have buggy pthread_atfork() handlers. See OpenMathLib/OpenBLAS#3111 and jax-ml/jax#5713 for details.

Now, while we can and have fixed the buggy atfork handlers, it will take some time for the fix to be deployed in a NumPy release and for users to update to a new NumPy release. So we also take an additional step: avoid running atfork handlers in Subprocess.

My copy of the glibc documentation says:
"
According  to  POSIX, it unspecified whether fork handlers established with pthread_atfork(3)
are called when posix_spawn() is invoked.  On glibc, fork handlers are  called  only  if  the
child is created using fork(2).
"
It appears glibc 2.24 and newer do not call pthread_atfork() handlers from posix_spawn().

Using posix_spawn() should be at least no worse than an explicit fork()/execvp() pair, and on glibc it should do the right thing.

PiperOrigin-RevId: 358317859
Change-Id: Ic1d95446706efa7c0db4e79bf8281f14b2bd99df
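For context, a minimal sketch of the posix_spawn()-based spawning the commit message describes (a hypothetical helper; the command and error handling are placeholders, not TensorFlow's actual Subprocess code):

```c
#include <spawn.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Spawn a child without an explicit fork()/execvp() pair; on recent
   glibc this avoids running pthread_atfork() handlers in the child. */
int run_child(void) {
  pid_t pid;
  char *argv[] = {"/bin/true", NULL};  /* placeholder command */

  int rc = posix_spawn(&pid, argv[0], NULL, NULL, argv, environ);
  if (rc != 0) {
    fprintf(stderr, "posix_spawn failed: %d\n", rc);
    return rc;
  }

  int status;
  waitpid(pid, &status, 0);
  return status;
}
```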
@martin-frbg (Collaborator)

Ouch, thanks. I don't know how often I (and certainly many others) have looked at that part of the code without noticing this.

@martin-frbg martin-frbg added this to the 0.3.14 milestone Feb 19, 2021
@martin-frbg martin-frbg merged commit 1caa44b into OpenMathLib:develop Feb 19, 2021
@hawkinsp (Contributor, Author)

Thanks!

I should add: I'm pretty sure some of the other code patterns in that file related to blas_server_avail and server_lock are also racy.

For example, suppose another thread calls fork() (and hence blas_thread_shutdown) immediately after the call to blas_thread_init in this line: https://github.com/xianyi/OpenBLAS/blob/1caa44bea999e196918afc095d26ad418db4ba19/driver/others/blas_server.c#L698
If I'm reading the code right, we might easily deadlock, because the thread pool will be shut down right before we enqueue work on it. That said, I don't know this code that well, and it wasn't the bug I was hitting, so I didn't attempt a wider fix.

In general it looks fairly suspicious to me to read blas_server_avail non-atomically without holding a lock (which happens in several functions) and without some sort of memory fence. This is the classic "double-checked locking" pattern and it is unsafe for the usual reasons. It might well work on x86 in practice with the right C compiler, but it is risky.
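To illustrate, a hypothetical sketch of the pattern in question (not the actual OpenBLAS code), followed by one conventional remedy using C11 acquire/release atomics on the flag:

```c
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Racy double-checked locking: the unlocked read of the plain int has
   no ordering guarantee with respect to the initialization it guards. */
static int avail_plain = 0;
void ensure_init_racy(void) {
  if (avail_plain) return;             /* unsynchronized fast path */
  pthread_mutex_lock(&lock);
  if (!avail_plain) {
    /* ... initialize thread pool ... */
    avail_plain = 1;
  }
  pthread_mutex_unlock(&lock);
}

/* One conventional remedy: make the flag atomic and pair a release
   store with an acquire load, so the fast path only sees the flag set
   after the pool is fully initialized. */
static _Atomic int avail_atomic = 0;
void ensure_init_fenced(void) {
  if (atomic_load_explicit(&avail_atomic, memory_order_acquire)) return;
  pthread_mutex_lock(&lock);
  if (!atomic_load_explicit(&avail_atomic, memory_order_relaxed)) {
    /* ... initialize thread pool ... */
    atomic_store_explicit(&avail_atomic, 1, memory_order_release);
  }
  pthread_mutex_unlock(&lock);
}
```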

More broadly, I think it might be a mistake to shut down the parent's thread pool on fork; most likely you only want to ensure that the child starts without a thread pool configured, and I think that could be done with a slightly more sophisticated pthread_atfork() handler. That said, calling fork() in a multithreaded program without following it with exec() is fundamentally unsafe, so these kinds of fixes are band-aids at best. However, I don't know the history of this code, and perhaps it is something users are relying on.
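As a sketch of that alternative (hypothetical handler names; this is not what blas_server.c currently does): register only a child-side handler that marks the pool as unconfigured, so the child lazily rebuilds its own pool while the parent's pool is left untouched.

```c
#include <pthread.h>

static int blas_server_avail = 0;   /* illustrative flag */

/* Child-side handler: after fork() the worker threads do not exist in
   the child, so mark the pool as unconfigured and let the child
   re-initialize lazily on its first BLAS call. */
static void atfork_child(void) {
  blas_server_avail = 0;
  /* ... also reset queues, locks, condition variables as needed ... */
}

static void register_fork_handler(void) {
  /* No prepare/parent handlers: nothing is shut down in the parent. */
  pthread_atfork(NULL, NULL, atfork_child);
}
```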
