
OpenBLAS threadsafety issues for downstream libraries (NumPy) #1844


Closed
seberg opened this issue Oct 31, 2018 · 45 comments · Fixed by #1865

@seberg
Contributor

seberg commented Oct 31, 2018

Since OpenBLAS is one of the most commonly used BLAS implementations with NumPy, there are currently two points being discussed within NumPy. One is the spawning of too many threads when using multiprocessing/Python threads (that is probably something we need to find a solution for ourselves, but pointers are welcome!).

The main issue is the thread safety of OpenBLAS when multiple threads are used. This may not be problematic for many projects explicitly using OpenBLAS, but NumPy is a bit different:

  • NumPy does not actually know which BLAS is being used.
  • Thus, NumPy has no control over the number of threads (the system default is used).

Together this means NumPy will often use multi-threaded OpenBLAS. Now if someone uses a numpy function within threads, they might get silently incorrect results (numpy/numpy#11046), and this seems to be true for OpenBLAS versions 0.2.18, 0.2.20, and 0.3.0 (probably generally).

The possibility of silently wrong results deeply troubles me. Downstream users of NumPy are unlikely to be aware of such issues, or might not even notice that they are using BLAS at all. The problem is that there seems to be no easy solution.

We could use locking to disable threading, but that would probably be incorrect/unnecessary in many cases as well.
I have been talking with @mattip and others at NumPy, and hoped we could start a small discussion here. Maybe there are fixes/workarounds we can implement in NumPy, but maybe we can also provide help/resources to make OpenBLAS thread safe?

@brada4
Contributor

brada4 commented Oct 31, 2018

Spawning too many threads is not exactly a thread safety issue; it is a configuration choice made by the user.

The accuracy issue indeed reproduces even with OpenBLAS threading disabled (but not with ATLAS or reference BLAS).

btw, numpy knows perfectly well which BLAS is used, and can set threads as needed (at first sight the only problem could be the missing support for OMP nesting in OpenBLAS, or that MKL gets some special treatment, or both):

python -c "import numpy; print(numpy.__config__.show())"
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/lib64']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/lib64']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/lib64']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blis_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/lib64']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
None

@seberg
Contributor Author

seberg commented Oct 31, 2018

No, unless I am very, very confused, that is incorrect. The config does not tell you what BLAS is being used. Systems such as Debian link against a generic library, and which library that actually is can be changed at any time. The configuration gives you an idea of what numpy was compiled against, but it does not tell you what is actually used.

The only way to find out what is actually being picked up at run time is to use ldd. In fact, we (or rather Anaconda) have already had fun bugs where importing something such as Qt changed which BLAS was picked up, from MKL to Accelerate.
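
On Linux one can also inspect the running process itself; a minimal sketch (Linux-only, and the library-name substrings to look for are just guesses):

import numpy as np  # import first, so that the BLAS shared library actually gets loaded

# Linux-only sketch: scan this process's memory map for BLAS-looking libraries.
found = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        fields = line.split()
        path = fields[-1] if len(fields) >= 6 else ""
        if any(name in path.lower() for name in ("openblas", "mkl", "atlas", "blas", "lapack")):
            found.add(path)

print("\n".join(sorted(found)) or "no separate BLAS library found (possibly statically linked)")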

Spawning too many threads is not (EDIT: a thread safety issue). Personally I am more worried about the other issue. I do not see the incorrect results if I use OMP_NUM_THREADS=1, but maybe it is just less likely, or it depends on the version.

EDIT2: To be clear, this is just as much about whether we may be able to help OpenBLAS here as the other way around. If it is important enough, we might be able to find some resources for these issues.

@martin-frbg
Collaborator

There have been several efforts since 0.2.18 to remove races, but there may be some still lurking in the level 3 BLAS code. One option to check this is to compile OpenBLAS with make USE_SIMPLE_THREADED_LEVEL3=1 - having a reproducer for the issue should help track this down.

@brada4
Contributor

brada4 commented Oct 31, 2018

Yup, switching alternatives did not change the config output; I agree it is the build config.

I see incorrect results with the pthreads OpenBLAS, whether single-threaded via OPENBLAS_NUM_THREADS=1 or not. They go away with OMP_NUM_THREADS=1, which has no impact on pthread OpenBLAS but probably serializes Python multiprocessing?

@brada4
Contributor

brada4 commented Oct 31, 2018

@martin-frbg the reproducer is in the linked issue. No problem with netlib or ATLAS (Ubuntu 14); it reproduces well with 0.3.3, including with archaic core types set, have not tried the develop version yet.

@brada4
Contributor

brada4 commented Oct 31, 2018

I am wondering what shared path in the gemm code may read or write something random - could it be *_copy?
The reporter mentions small matrix sizes...

@martin-frbg
Collaborator

Does the problem go away with SIMPLE_THREADED_LEVEL3? (though if this keeps occurring even with OpenBLAS limited to a single thread, it could be multiple numpy threads each invoking OpenBLAS in parallel, in a scenario similar to #1536)

@seberg
Contributor Author

seberg commented Oct 31, 2018

@martin-frbg so this should be a specific issue and not a general one, at least with newer OpenBLAS versions? That would be good news! The FAQ read a bit like it might be a deeper/bigger issue.

About OMP_NUM_THREADS, I am not actually sure about the details (or how the Debian/Arch OpenBLAS is compiled). On my Arch system OPENBLAS_NUM_THREADS=1 also seems to be sufficient.

@bbbbbbbbba

Yup, switching alternatives did not change the config output; I agree it is the build config.

I see incorrect results with the pthreads OpenBLAS, whether single-threaded via OPENBLAS_NUM_THREADS=1 or not. They go away with OMP_NUM_THREADS=1, which has no impact on pthread OpenBLAS but probably serializes Python multiprocessing?

If I'm understanding this correctly, the issue just got even more troublesome. On my computer (1 CPU, 2 cores), OPENBLAS_NUM_THREADS=1 works perfectly as a workaround, and that is why we thought it was a conflict between Python threading and OpenBLAS threading. But if the issue reproduces on some systems even with OPENBLAS_NUM_THREADS=1, then there may be another bug hiding somewhere in the stack...

@martin-frbg
Collaborator

I should hope that the current situation is a lot better than it was at 0.2.18, but there may still be unfixed bugs in the code, and some of them may show up only on fast systems with many cores. If it is true that the problem persists at OPENBLAS_NUM_THREADS=1, that would be quite worrying indeed. @brada4 are you sure of that?

@brada4
Contributor

brada4 commented Oct 31, 2018

Yes, absolutely sure with the openSUSE pthread build:
AMD A10-5750M APU/PILEDRIVER or SANDYBRIDGE or ATOM
Absolutely no problem with the Ubuntu 14 OMP version:
Atom(TM) CPU N450/ATOM
I will try to narrow down the issue tomorrow.

@bbbbbbbbba

@brada4 I don't know whether this is relevant, but I'm not sure about this part:

OMP_NUM_THREADS=1 which has no impact on pthread openblas

It seems from this part of the code that it's the other way round (a small sketch follows after the list):

  • pthread openblas looks at OMP_NUM_THREADS (as a fallback if it doesn't see OPENBLAS_NUM_THREADS or GOTO_NUM_THREADS), but
  • OpenMP openblas disregards OPENBLAS_NUM_THREADS completely.
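
A minimal sketch of how this matters in practice (only illustrative; the variables are read once when the BLAS library initializes, so they have to be set before numpy is imported):

import os
# Must be set before importing numpy, because OpenBLAS reads these at load time.
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # honoured by the pthread build (also GOTO_NUM_THREADS)
os.environ["OMP_NUM_THREADS"] = "1"       # honoured by the OpenMP build, and a fallback for the pthread build

import numpy as np
a = np.ones((1024, 1024))
b = a @ a  # now runs on a single BLAS thread in either build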

@brada4
Contributor

brada4 commented Oct 31, 2018

OPENBLAS_NUM_THREADS=1 alone does not address the issue for me.

@seberg
Contributor Author

seberg commented Oct 31, 2018

@martin-frbg good, that makes me slightly less nervous :). I would still like to say that if there is some way we may be able to help, please ask. Not that we have many resources, but if something is important enough maybe we can help out.

@bbbbbbbbba, @brada4 yeah, then my Arch build at least is just the pthread version, so it does not matter which variable I use. I agree that persistence with OPENBLAS_NUM_THREADS=1 makes it a lot worse, but frankly it is well above my pain threshold even without that. I would be surprised if half the numpy users even know this switch exists, and that includes those using threading, or using a library that uses threading... (EDIT: OK, I am a bit confused, but it doesn't matter much ;))

@seberg
Contributor Author

seberg commented Oct 31, 2018

@martin-frbg Sorry, SIMPLE_THREADED_LEVEL3=1 does not fix it for me, but I guess it might be a compile-time option? In that case I will have to come back later, or maybe Matti might.

@martin-frbg
Collaborator

Sorry I meant the make option I mentioned earlier. There is no corresponding runtime option for this.
I am not really familiar with numpy - is it likely that there are several python threads each calling into OpenBLAS dgemm in parallel (and thereby perhaps unknowingly sharing a memory buffer) ?

@bbbbbbbbba

Yes, the threads are probably calling into OpenBLAS dgemm in parallel. Python has a global interpreter lock, but some numpy routines specifically release it.

@seberg
Contributor Author

seberg commented Oct 31, 2018

Yes, of course the threads are calling into OpenBLAS in parallel. While the threads might be reading the same data, the write buffers are allocated safely by numpy. NumPy itself is safe in this regard (it holds the Python global interpreter lock for all things that are problematic). The problem occurs inside the dgemm call (of course we could have bugs, but then it should not go away with the OpenBLAS options), which is effectively called in parallel (because numpy releases the global interpreter lock for it). The threading itself is implemented by someone using numpy, not by numpy itself.

@mattip
Contributor

mattip commented Oct 31, 2018

Are there internal (not ones in the external interfaces) OpenBLAS memory buffers that are shared across calls to OpenBLAS functions?

@martin-frbg
Collaborator

There is a thread buffer pool that was shown to misbehave in some circumstances - the OpenMP side of it was fixed (or worked around) in #1536 (released with 0.3.0) by introducing another compile-time variable.

@bbbbbbbbba

Right, but does that thread buffer pool even exist in non-OpenMP openblas? It was

static void * blas_thread_buffer[MAX_CPU_NUMBER];

(now static void * blas_thread_buffer[MAX_PARALLEL_NUMBER][MAX_CPU_NUMBER]) in blas_server_omp.c, and I don't see anything like that in blas_server.c.

@brada4
Contributor

brada4 commented Oct 31, 2018

That would help to oversubscribe cpu cores even better.

@bbbbbbbbba

Now that I think of it, I don't know whether my own openblas (which comes with numpy) is built with OPENMP=1. What would be the easiest way to check?
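
One possibility would be to ask the library itself; a rough ctypes sketch (this assumes a recent enough libopenblas that exports openblas_get_parallel, and that the library ctypes finds is the same one numpy actually loaded, which is not guaranteed):

import ctypes, ctypes.util

# Sketch only: locate some libopenblas on the system and query its threading backend.
libname = ctypes.util.find_library("openblas") or "libopenblas.so.0"
openblas = ctypes.CDLL(libname)

# openblas_get_parallel(): 0 = sequential, 1 = platform threads (pthreads), 2 = OpenMP.
backend = openblas.openblas_get_parallel()
print({0: "sequential", 1: "pthreads", 2: "OpenMP"}.get(backend, backend))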

@brada4
Contributor

brada4 commented Nov 1, 2018

You can change BLAS easily on ubuntu:
https://github.com/xianyi/OpenBLAS/wiki/faq#debianlts

@bbbbbbbbba

After some diving into OpenBLAS code, I suspect that this buffer is the culprit.

@brada4
Contributor

brada4 commented Nov 1, 2018

threads are calling dgemv_
according to documentation provided with multiprocessing they are PROCESSES, so thread safety issues should not apply....

@brada4
Contributor

brada4 commented Nov 1, 2018

@bbbbbbbbba it will overflow "under conditions", but it is sized in whole pages and strangely never hits any sort of mprotect().

@ogrisel
Contributor

ogrisel commented Nov 1, 2018

according to documentation provided with multiprocessing they are PROCESSES, so thread safety issues should not apply....

No: multiprocessing.pool.ThreadPool is an actual thread pool (a single Python interpreter in one Python process with several pthreads), while multiprocessing.pool.Pool is a process pool with the same API (Python worker processes are forked from the main process by default).
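
A minimal illustration of that difference, using only the standard library:

import os
import threading
from multiprocessing.pool import Pool, ThreadPool

def where(_):
    # Report which process and which thread ran the task.
    return os.getpid(), threading.current_thread().name

if __name__ == "__main__":
    with ThreadPool(4) as tp:
        print(sorted(set(tp.map(where, range(8)))))  # one PID, several thread names
    with Pool(4) as pp:
        print(sorted(set(pp.map(where, range(8)))))  # several PIDs: separate worker processes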

Switching topic: now about controlling the OpenBLAS / OpenMP / MKL thread pool sizes programmatically, we have some prototype code in this PR: https://github.com/tomMoral/loky/pull/135.

It was inspired by https://github.com/IntelPython/smp but covers more libraries and platforms. Feel free to steal that code for numpy if you wish.
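
The core idea there looks roughly like the following (a sketch only; it assumes the library exports openblas_set_num_threads and that ctypes picks up the same libopenblas numpy is using, neither of which is guaranteed):

import ctypes, ctypes.util

def limit_openblas_threads(n):
    # Sketch: cap the OpenBLAS thread pool size at runtime.
    # Note: the pthread and OpenMP builds treat thread-count settings differently (see above).
    libname = ctypes.util.find_library("openblas") or "libopenblas.so.0"
    openblas = ctypes.CDLL(libname)
    openblas.openblas_set_num_threads(int(n))

limit_openblas_threads(1)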

@bbbbbbbbba

@brada4 The problem isn't overflowing; it's that, as a global buffer, it is being shared by multiple threads in the multiprocessing.pool.ThreadPool.

@brada4
Contributor

brada4 commented Nov 1, 2018

malloc/free should work in this case; I am a bit unsure whether it would damage performance too much (but the random numerical effects should be solved first, before they lead to random discoveries).

Then I wonder why numpy uses gemv for dot; if it is that noticeably better, then probably OpenBLAS could adapt the idea.

@bbbbbbbbba it is just that the first paragraph of the multiprocessing description states the opposite of what is happening here. Probably changing the statement to "processes or threads" would help sort it out.

@martin-frbg steady failures with the pthread versions on anything. OPENBLAS_NUM_THREADS=1 solves the problem. I had foolishly set it to =0 and wondered why it did not do anything.

@bbbbbbbbba

@brada4 Actually multiprocessing.pool.ThreadPool is completely undocumented, so yeah.

@mattip
Contributor

mattip commented Nov 1, 2018

why numpy uses gemv for dot

NumPy uses gemv for matrix @ vector, syrk for matrix @ matrix.T, and gemm for the general matrix @ matrix case
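
For concreteness, a small example of the three cases just described:

import numpy as np

a = np.random.rand(100, 50)
b = np.random.rand(50, 30)
v = np.random.rand(50)

y = a @ v    # matrix @ vector: dispatched to *gemv
s = a @ a.T  # matrix @ its own transpose: dispatched to *syrk
c = a @ b    # general matrix @ matrix: dispatched to *gemm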

@brada4
Contributor

brada4 commented Nov 1, 2018

@mattip got the idea, it is not BLAS _DOT_.

Regarding the oversubscription problem - it would be easy with processes: say, all R parallel modules trigger a "namespace hook" before loading, where you can set all the relevant *_NUM_THREADS=1; that obviously does not work with OMP (and we see MKL instrumenting OMP much better).

@martin-frbg I wonder if this can be tweaked (s/max/num/) to make life better?
#ifdef USE_OPENMP
openmp_nthreads=omp_get_max_threads();

@martin-frbg
Collaborator

@brada4 I am not convinced it would "make life better" - I suspect omp_get_num_threads() would typically return 1 there?

@bbbbbbbbba

bbbbbbbbba commented Nov 1, 2018

OK, I finally got direct access to a computer with more than 2 cores, and managed to reduce the bug that has been bugging me for days into a minimal working example.

import numpy as np
import threading

def do_test(idx):
    for i in range(100 * idx, 100 * idx + 100):
        a = np.ones((1 << 20, 2)) * i
        b = np.eye(2)
        mat_0 = a.dot(b).reshape(-1)
        mat_1 = a.dot(b).reshape(-1)
        if np.any(mat_0 != mat_1):
            indices = np.nonzero(mat_0 != mat_1)[0]
            if len(indices) > 100: print('Error %d %d' % (i, len(indices)))
            else: print('Error %d %s' % (i, [(j, mat_0[j], mat_1[j]) for j in indices]))

for idx in range(2):
    threading.Thread(target=do_test, args=(idx,)).start()

On computers where it doesn't fail (which seems to be any computer with <= 2 cores), it doesn't. On those where it does fail, it fails with high probability.

EDIT: Upon some further experiments, it seems that the bug begins happening at OPENBLAS_NUM_THREADS=4, but only becomes really common when I run it with the full 8 threads (the computer I'm testing this on has 8 cores).

EDIT 2: Change for idx in range(2): to for idx in range(8):, and it fails like crazy beginning at 3 OpenBLAS threads. It seems like 3 OpenBLAS threads is the "hard minimum" for the bug to happen, and having more threads just helps throw off the timing to induce a race condition.

@brada4
Contributor

brada4 commented Nov 1, 2018

omp_get_num_threads will return the number of cores at the top level, and 1 when nested; it seems MKL does exactly that.
The documentation is very elusive in this regard (link).

@brada4
Contributor

brada4 commented Nov 1, 2018

@bbbbbbbbba no need to strategically try to race the same code even more. The only thing I suspect is that gemv is not the only function affected, and with more processing per thread (not gemv/level 2, but more like level 3 BLAS) it will be much harder to race to failure - you would need something like 50 threads squaring matrices to make them crawl into a clash...

@seberg
Contributor Author

seberg commented Nov 2, 2018

It would be good to fix the direct issue in gemv first! For the bigger picture (I am probably ignorant, and maybe just spending effort on making OpenBLAS more thread-safe is the solution):

  1. Might it be helpful, for example, to spend effort on annotating OpenBLAS for the clang analyzer, or to try runtime race condition checkers?
  2. My impression was that, unless the problem size is small, threading the single problem instead of the multiple problems fed in by Python threads is likely faster anyway due to cache locality? In that case it might not be totally bad to put a lock in place, either inside OpenBLAS or at least in NumPy (but it would have to be placed at all interfaces to BLAS); we might simply decide to only allow threaded execution if the BLAS is certainly not multi-threaded. (A rough sketch of this locking idea follows below.)
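
A rough sketch of what such a lock could look like at the Python level (purely illustrative; the wrapper and the lock are hypothetical, not existing NumPy code):

import threading
import numpy as np

# Hypothetical process-wide lock serializing all calls into a multi-threaded BLAS.
_blas_lock = threading.Lock()

def locked_dot(a, b):
    # Illustrative wrapper only; a real fix would need to cover every BLAS entry point.
    with _blas_lock:
        return np.dot(a, b)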

@martin-frbg
Collaborator

I have no current experience with the clang analyzer - how much annotating would it need? Part of the problem with race conditions is that tools like helgrind (valgrind) get really confused by OpenMP.
And I am all for tackling the (now) known problem with gemv first before going after the unknown gremlins. Right now I am a bit behind the curve - is it gemv or dgemm (as suggested in the original numpy ticket, and where rebuilding OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1 might already help)?

@brada4
Contributor

brada4 commented Nov 2, 2018

it is dgemv_ for this particular case (np.dot is not BLAS _DOT):
#1844 (comment)

@seberg
Contributor Author

seberg commented Nov 2, 2018

@martin-frbg after a bit of messing about, I am now pretty sure that I got it right. With OpenBLAS config: NO_AFFINITY HASWELL MAX_THREADS=4 compiled with USE_SIMPLE_THREADED_LEVEL3=1 (no other options) the error still occurs. The specific bug here runs dgemv.

@martin-frbg
Collaborator

Perhaps the good news is that gemv_thread seems to be the only code that has this static local buffer (the result of an attempt to improve performance in issue #532, around version 0.2.15, so not exactly recent but not classic K. Goto code either).

@bbbbbbbbba

Right, but that means my code snippet in #1844 (comment) is almost definitely suffering from another bug.

@brada4
Contributor

brada4 commented Nov 2, 2018

@bbbbbbbbba you are right, your code calls dgemm, though I could not spot it failing with 2 or 4 cores.
Would you mind opening a new issue? It is certainly different from this one (different function).

Samples: 122K of event 'cycles', Event count (approx.): 60959070419, DSO: libopenblas_pthreads.so
Overhead  Command  Symbol
22.40%  python  [.] dgemm_beta_NEHALEM
19.90%  python  [.] dgemm_kernel_NEHALEM
19.54%  python  [.] dgemm_oncopy_NEHALEM
2.72%  python  [.] blas_thread_server
1.49%  python  [.] dgemm_nn
0.07%  python  [.] exec_blas_async
0.05%  python  [.] exec_blas_async_wait
0.05%  python  [.] pthread_mutex_lock@plt
0.04%  python  [.] pthread_mutex_unlock@plt
0.03%  python  [.] sched_yield@plt
0.02%  python  [.] dgemm_itcopy_NEHALEM
0.00%  python  [.] cblas_dgemm
0.00%  python  [.] gemm_thread_mn
0.00%  python  [.] exec_blas

@brada4
Contributor

brada4 commented Nov 14, 2018

@seberg it comes out quite clean from the clang analyzer; scan-build has no complaints whatsoever regarding the threading code. Annotations are way too much work for nothing.
@martin-frbg it is as simple as running scan-build make and following the instructions on the screen.
