
Fix race in blas_thread_shutdown. #3111

Merged: 1 commit merged into OpenMathLib:develop on Feb 19, 2021

Conversation

@hawkinsp (Contributor) commented Feb 18, 2021

blas_server_avail was read without holding server_lock. If multiple threads call blas_thread_shutdown simultaneously, for example by calling fork(), they can each attempt to shut down the thread pool. This can lead to a segmentation fault.

This may be the same issue as #2270, although in my case it manifested as a segmentation fault rather than a deadlock.
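For reference, a minimal sketch of the locking pattern described above (the function body is simplified; blas_server_avail, server_lock, and blas_thread_shutdown are the identifiers from driver/others/blas_server.c, everything else is illustrative):

```c
#include <pthread.h>

/* Illustrative globals mirroring driver/others/blas_server.c */
static int blas_server_avail = 0;
static pthread_mutex_t server_lock = PTHREAD_MUTEX_INITIALIZER;

int blas_thread_shutdown(void) {
  /* Take the lock *before* reading blas_server_avail, so that two
     threads forking at the same time cannot both see the pool as
     "available" and both try to tear it down. */
  pthread_mutex_lock(&server_lock);

  if (blas_server_avail) {
    /* ... join worker threads, free queues ... */
    blas_server_avail = 0;
  }

  pthread_mutex_unlock(&server_lock);
  return 0;
}
```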

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request on Feb 19, 2021
…/execvp() on non-Android POSIX platforms.

The goal of this change is to avoid calling pthread_atfork() handlers. Some libraries, in particular the version of OpenBLAS included in NumPy, have buggy pthread_atfork() handlers. See OpenMathLib/OpenBLAS#3111 and jax-ml/jax#5713 for details.

Now, while we can and have fixed the buggy atfork handlers, it will take some time for the fix to be deployed in a NumPy release and for users to update to a new NumPy release. So we also take an additional step: avoid running atfork handlers in Subprocess.

My copy of the glibc documentation says:
"
According  to  POSIX, it unspecified whether fork handlers established with pthread_atfork(3)
are called when posix_spawn() is invoked.  On glibc, fork handlers are  called  only  if  the
child is created using fork(2).
"
It appears glibc 2.24 and newer do not call pthread_atfork() handlers from posix_spawn().

Using posix_spawn() should be at least no worse than an explicit fork()/execvp() pair, and on glibc it should do the right thing.

PiperOrigin-RevId: 358317859
Change-Id: Ic1d95446706efa7c0db4e79bf8281f14b2bd99df
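For context, a minimal sketch of the posix_spawn()-based spawning the commit message describes (a hypothetical helper; the command and error handling are placeholders, not TensorFlow's actual Subprocess code):

```c
#include <spawn.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Spawn a child without an explicit fork()/execvp() pair; on recent
   glibc this avoids running pthread_atfork() handlers in the child. */
int run_child(void) {
  pid_t pid;
  char *argv[] = {"/bin/true", NULL};  /* placeholder command */

  int rc = posix_spawn(&pid, argv[0], NULL, NULL, argv, environ);
  if (rc != 0) {
    fprintf(stderr, "posix_spawn failed: %d\n", rc);
    return rc;
  }

  int status;
  waitpid(pid, &status, 0);
  return status;
}
```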
@martin-frbg (Collaborator)

Ouch, thanks. I don't know how often I (and certainly many others) have looked at that part of the code without noticing this.

@martin-frbg martin-frbg added this to the 0.3.14 milestone Feb 19, 2021
@martin-frbg martin-frbg merged commit 1caa44b into OpenMathLib:develop Feb 19, 2021
@hawkinsp (Contributor, Author)

Thanks!

I should add: I'm pretty sure some of the other code patterns in that file related to blas_server_avail and server_lock are also racy.

For example, suppose another thread calls fork() (and hence blas_thread_shutdown) immediately after the call to blas_thread_init in this line: https://github.com/xianyi/OpenBLAS/blob/1caa44bea999e196918afc095d26ad418db4ba19/driver/others/blas_server.c#L698
If I'm reading the code right, we might easily deadlock, because the thread pool will be shut down right before we enqueue work on it. That said, I don't know this code that well, and it wasn't the bug I was hitting, so I didn't attempt a wider fix.

In general it looks fairly suspicious to me to read blas_server_avail non-atomically without holding a lock (which happens in several functions) and without some sort of memory fence. This is the classic "double-checked locking" pattern and it is unsafe for the usual reasons. It might well work on x86 in practice with the right C compiler, but it is risky.
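To illustrate, a hypothetical sketch of the pattern in question (not the actual OpenBLAS code), followed by one conventional remedy using C11 acquire/release atomics on the flag:

```c
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Racy double-checked locking: the unlocked read of the plain int has
   no ordering guarantee with respect to the initialization it guards. */
static int avail_plain = 0;
void ensure_init_racy(void) {
  if (avail_plain) return;             /* unsynchronized fast path */
  pthread_mutex_lock(&lock);
  if (!avail_plain) {
    /* ... initialize thread pool ... */
    avail_plain = 1;
  }
  pthread_mutex_unlock(&lock);
}

/* One conventional remedy: make the flag atomic and pair a release
   store with an acquire load, so the fast path only sees the flag set
   after the pool is fully initialized. */
static _Atomic int avail_atomic = 0;
void ensure_init_fenced(void) {
  if (atomic_load_explicit(&avail_atomic, memory_order_acquire)) return;
  pthread_mutex_lock(&lock);
  if (!atomic_load_explicit(&avail_atomic, memory_order_relaxed)) {
    /* ... initialize thread pool ... */
    atomic_store_explicit(&avail_atomic, 1, memory_order_release);
  }
  pthread_mutex_unlock(&lock);
}
```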

More broadly, I think it might be a mistake to shut down the parent's thread pool on fork; most likely you only want to ensure that the child starts without a thread pool configured, and I think that could be done with a slightly more sophisticated pthread_atfork() handler. That said, calling fork() in a multithreaded program without following it with exec() is fundamentally unsafe, so these kinds of fixes are band-aids at best. However, I don't know the history of this code, and perhaps it is something users are relying on.
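As a sketch of that alternative (hypothetical handler names; this is not what blas_server.c currently does): register only a child-side handler that marks the pool as unconfigured, so the child lazily rebuilds its own pool while the parent's pool is left untouched.

```c
#include <pthread.h>

static int blas_server_avail = 0;   /* illustrative flag */

/* Child-side handler: after fork() the worker threads do not exist in
   the child, so mark the pool as unconfigured and let the child
   re-initialize lazily on its first BLAS call. */
static void atfork_child(void) {
  blas_server_avail = 0;
  /* ... also reset queues, locks, condition variables as needed ... */
}

static void register_fork_handler(void) {
  /* No prepare/parent handlers: nothing is shut down in the parent. */
  pthread_atfork(NULL, NULL, atfork_child);
}
```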
