
OpenBLAS threadsafety issues for downstream libraries (NumPy) #1844


Closed
seberg opened this issue Oct 31, 2018 · 45 comments · Fixed by #1865

@seberg
Contributor

seberg commented Oct 31, 2018

Since OpenBLAS is one of the most commonly used BLAS implementations with NumPy, there are currently two points being discussed within NumPy. One is the spawning of too many threads when using multiprocessing/Python threads (that is probably something we need to find a solution for ourselves, but pointers are welcome!).

The main issue is the thread safety of OpenBLAS when multiple threads are used. This may not be problematic for many projects explicitly using OpenBLAS, but NumPy is a bit different:

  • NumPy does not actually know which BLAS is being used.
  • Thus, NumPy has no control over the number of threads (the system default is used).

Together this means NumPy will often use multi-threaded OpenBLAS. Now if someone uses a numpy function within threads, they might get silently incorrect results (numpy/numpy#11046), and this seems to be true for OpenBLAS versions 0.2.18, 0.2.20, and 0.3.0 (probably generally).

The possibility of silently wrong results deeply troubles me. Downstream users of NumPy are unlikely to be aware of such issues, or might not even notice that they are using BLAS at all. The problem is that there seems to be no easy solution.

We could use locking to disable threading, but that would probably be incorrect/unnecessary in many cases as well.
I have been talking with @mattip and others at NumPy, and hoped we could start a small discussion here. Maybe there are fixes/workarounds we can implement in NumPy, but maybe we can also provide help/resources to make OpenBLAS thread safe?

@brada4
Contributor

brada4 commented Oct 31, 2018

Spawning too many threads is not exactly a thread safety issue; it is a configuration choice made by the user.

The accuracy issue indeed reproduces even with OpenBLAS threading disabled (but not with ATLAS or reference BLAS).

btw, numpy knows perfectly well which BLAS is used, and can set threads as needed (at first sight the only problem could be the missing support for OMP nesting in OpenBLAS, or that MKL gets some special treatment, or both):

python -c "import numpy; print(numpy.__config__.show())"
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/lib64']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/lib64']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/lib64']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blis_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/lib64']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
None

@seberg
Contributor Author

seberg commented Oct 31, 2018

No, unless I am very, very confused, that is incorrect. The config does not tell you what BLAS is being used. Systems such as Debian link against a generic library, and which library that actually is can be changed at any time. The configuration gives you an idea of what numpy was compiled against, but it does not tell you what is actually used.

The only way to find out what is actually being picked up at run time is to use ldd. In fact, we (or rather Anaconda) have already had fun bugs where importing something such as Qt changed which BLAS was picked up, from MKL to Accelerate.
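
On Linux one can also inspect the running process itself; a minimal sketch (Linux-only, and the library-name substrings to look for are just guesses):

import numpy as np  # import first, so that the BLAS shared library actually gets loaded

# Linux-only sketch: scan this process's memory map for BLAS-looking libraries.
found = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        fields = line.split()
        path = fields[-1] if len(fields) >= 6 else ""
        if any(name in path.lower() for name in ("openblas", "mkl", "atlas", "blas", "lapack")):
            found.add(path)

print("\n".join(sorted(found)) or "no separate BLAS library found (possibly statically linked)")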

Spawning too many threads is not (EDIT: a thread safety issue). Personally I am more worried about the other issue. I do not see the incorrect results if I use OMP_NUM_THREADS=1, but maybe it is just less likely, or it depends on the version.

EDIT2: To be clear, this is just as much about whether we may be able to help OpenBLAS here as the other way around. If it is important enough, we might be able to find some resources for these issues.

@martin-frbg
Collaborator

There have been several efforts since 0.2.18 to remove races, but there may be some still lurking in the level 3 BLAS code. One option to check this is to compile OpenBLAS with make USE_SIMPLE_THREADED_LEVEL3=1 - having a reproducer for the issue should help track this down.

@brada4
Contributor

brada4 commented Oct 31, 2018

Yup, switching alternatives did not change the config output; I agree it is the build config.

I see incorrect results with the pthreads OpenBLAS, whether single-threaded via OPENBLAS_NUM_THREADS=1 or not. They go away with OMP_NUM_THREADS=1, which has no impact on pthread OpenBLAS but probably serializes Python multiprocessing?

@brada4
Contributor

brada4 commented Oct 31, 2018

@martin-frbg the reproducer is in the linked issue. No problem with netlib or ATLAS (Ubuntu 14); it reproduces well with 0.3.3, including with archaic core types set, have not tried the develop version yet.

@brada4
Contributor

brada4 commented Oct 31, 2018

I am wondering what shared path in the gemm code may read or write something random - could it be *_copy?
The reporter mentions small matrix sizes...

@martin-frbg
Collaborator

Does the problem go away with SIMPLE_THREADED_LEVEL3? (though if this keeps occurring even with OpenBLAS limited to a single thread, it could be multiple numpy threads each invoking OpenBLAS in parallel, in a scenario similar to #1536)

@seberg
Contributor Author

seberg commented Oct 31, 2018

@martin-frbg so this should be a specific issue and not a general one, at least with newer OpenBLAS versions? That would be good news! The FAQ read a bit like it might be a deeper/bigger issue.

About OMP_NUM_THREADS, I am not actually sure about the details (or how the Debian/Arch OpenBLAS is compiled). On my Arch system OPENBLAS_NUM_THREADS=1 also seems to be sufficient.

@bbbbbbbbba

Yup, switching alternatives did not change the config output; I agree it is the build config.

I see incorrect results with the pthreads OpenBLAS, whether single-threaded via OPENBLAS_NUM_THREADS=1 or not. They go away with OMP_NUM_THREADS=1, which has no impact on pthread OpenBLAS but probably serializes Python multiprocessing?

If I'm understanding this correctly, the issue just got even more troublesome. On my computer (1 CPU, 2 cores), OPENBLAS_NUM_THREADS=1 works perfectly as a workaround, and that is why we thought it was a conflict between Python threading and OpenBLAS threading. But if the issue reproduces on some systems even with OPENBLAS_NUM_THREADS=1, then there may be another bug hiding somewhere in the stack...

@martin-frbg
Collaborator

I should hope that the current situation is a lot better than it was at 0.2.18, but there may still be unfixed bugs in the code, and some of them may show up only on fast systems with many cores. If it is true that the problem persists at OPENBLAS_NUM_THREADS=1, that would be quite worrying indeed. @brada4 are you sure of that?

@brada4
Contributor

brada4 commented Oct 31, 2018

Yes, absolutely sure with the openSUSE pthread build:
AMD A10-5750M APU/PILEDRIVER or SANDYBRIDGE or ATOM
Absolutely no problem with the Ubuntu 14 OMP version:
Atom(TM) CPU N450/ATOM
I will try to narrow down the issue tomorrow.

@bbbbbbbbba

@brada4 I don't know whether this is relevant, but I'm not sure about this part:

OMP_NUM_THREADS=1 which has no impact on pthread openblas

It seems from this part of the code that it's the other way round (a small sketch follows after the list):

  • pthread openblas looks at OMP_NUM_THREADS (as a fallback if it doesn't see OPENBLAS_NUM_THREADS or GOTO_NUM_THREADS), but
  • OpenMP openblas disregards OPENBLAS_NUM_THREADS completely.
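
A minimal sketch of how this matters in practice (only illustrative; the variables are read once when the BLAS library initializes, so they have to be set before numpy is imported):

import os
# Must be set before importing numpy, because OpenBLAS reads these at load time.
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # honoured by the pthread build (also GOTO_NUM_THREADS)
os.environ["OMP_NUM_THREADS"] = "1"       # honoured by the OpenMP build, and a fallback for the pthread build

import numpy as np
a = np.ones((1024, 1024))
b = a @ a  # now runs on a single BLAS thread in either build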

@brada4
Contributor

brada4 commented Oct 31, 2018

OPENBLAS_NUM_THREADS=1 alone does not address the issue for me.

@seberg
Contributor Author

seberg commented Oct 31, 2018

@martin-frbg good, that makes me slightly less nervous :). I would still like to say that if there is some way we may be able to help, please ask. Not that we have many resources, but if something is important enough maybe we can help out.

@bbbbbbbbba, @brada4 yeah, then my Arch build at least is just the pthread version, so it does not matter which variable I use. I agree that persistence with OPENBLAS_NUM_THREADS=1 makes it a lot worse, but frankly it is well above my pain threshold even without that. I would be surprised if half the numpy users even know this switch exists, and that includes those using threading, or using a library that uses threading... (EDIT: OK, I am a bit confused, but it doesn't matter much ;))

@seberg
Contributor Author

seberg commented Oct 31, 2018

@martin-frbg Sorry, SIMPLE_THREADED_LEVEL3=1 does not fix it for me, but I guess it might be a compile-time option? In that case I will have to come back later, or maybe Matti might.

@martin-frbg
Collaborator

Sorry I meant the make option I mentioned earlier. There is no corresponding runtime option for this.
I am not really familiar with numpy - is it likely that there are several python threads each calling into OpenBLAS dgemm in parallel (and thereby perhaps unknowingly sharing a memory buffer) ?

@bbbbbbbbba

Yes, the threads are probably calling into OpenBLAS dgemm in parallel. Python has a global interpreter lock, but some numpy routines specifically release it.

@seberg
Contributor Author

seberg commented Oct 31, 2018

Yes, of course the threads are calling into OpenBLAS in parallel. While the threads might be reading the same data, the write buffers are allocated safely by numpy. NumPy itself is safe in this regard (it holds the Python global interpreter lock for all things that are problematic). The problem occurs inside the dgemm call (of course we could have bugs, but then it should not go away with the OpenBLAS options), which is effectively called in parallel (because numpy releases the global interpreter lock for it). The threading itself is implemented by someone using numpy, not by numpy itself.

@mattip
Contributor

mattip commented Oct 31, 2018

Are there internal (not ones in the external interfaces) OpenBLAS memory buffers that are shared across calls to OpenBLAS functions?

@martin-frbg
Collaborator

There is a thread buffer pool that was shown to misbehave in some circumstances - the OpenMP side of it was fixed (or worked around) in #1536 (released with 0.3.0) by introducing another compile-time variable.

@bbbbbbbbba

Right, but does that thread buffer pool even exist in non-OpenMP openblas? It was

static void * blas_thread_buffer[MAX_CPU_NUMBER];

(now static void * blas_thread_buffer[MAX_PARALLEL_NUMBER][MAX_CPU_NUMBER]) in blas_server_omp.c, and I don't see anything like that in blas_server.c.

@brada4
Contributor

brada4 commented Oct 31, 2018

That would help to oversubscribe cpu cores even better.

@bbbbbbbbba

Now that I think of it, I don't know whether my own openblas (which comes with numpy) is built with OPENMP=1. What would be the easiest way to check?
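
One possibility would be to ask the library itself; a rough ctypes sketch (this assumes a recent enough libopenblas that exports openblas_get_parallel, and that the library ctypes finds is the same one numpy actually loaded, which is not guaranteed):

import ctypes, ctypes.util

# Sketch only: locate some libopenblas on the system and query its threading backend.
libname = ctypes.util.find_library("openblas") or "libopenblas.so.0"
openblas = ctypes.CDLL(libname)

# openblas_get_parallel(): 0 = sequential, 1 = platform threads (pthreads), 2 = OpenMP.
backend = openblas.openblas_get_parallel()
print({0: "sequential", 1: "pthreads", 2: "OpenMP"}.get(backend, backend))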

@brada4
Contributor

brada4 commented Nov 1, 2018

You can change BLAS easily on ubuntu:
https://github.com/xianyi/OpenBLAS/wiki/faq#debianlts

@bbbbbbbbba

After some diving into OpenBLAS code, I suspect that this buffer is the culprit.

@brada4
Contributor

brada4 commented Nov 1, 2018

threads are calling dgemv_
according to documentation provided with multiprocessing they are PROCESSES, so thread safety issues should not apply....

@brada4
Contributor

brada4 commented Nov 1, 2018

@bbbbbbbbba it will overflow "under conditions", but it is sized in whole pages and strangely never hits any sort of mprotect().

@ogrisel
Contributor

ogrisel commented Nov 1, 2018

according to documentation provided with multiprocessing they are PROCESSES, so thread safety issues should not apply....

No: multiprocessing.pool.ThreadPool is an actual thread pool (a single Python interpreter in one Python process with several pthreads), while multiprocessing.pool.Pool is a process pool with the same API (Python worker processes are forked from the main process by default).
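
A minimal illustration of that difference, using only the standard library:

import os
import threading
from multiprocessing.pool import Pool, ThreadPool

def where(_):
    # Report which process and which thread ran the task.
    return os.getpid(), threading.current_thread().name

if __name__ == "__main__":
    with ThreadPool(4) as tp:
        print(sorted(set(tp.map(where, range(8)))))  # one PID, several thread names
    with Pool(4) as pp:
        print(sorted(set(pp.map(where, range(8)))))  # several PIDs: separate worker processes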

Switching topic: now about controlling the OpenBLAS / OpenMP / MKL thread pool sizes programmatically, we have some prototype code in this PR: https://github.com/tomMoral/loky/pull/135.

It was inspired by https://github.com/IntelPython/smp but covers more libraries and platforms. Feel free to steal that code for numpy if you wish.
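
The core idea there looks roughly like the following (a sketch only; it assumes the library exports openblas_set_num_threads and that ctypes picks up the same libopenblas numpy is using, neither of which is guaranteed):

import ctypes, ctypes.util

def limit_openblas_threads(n):
    # Sketch: cap the OpenBLAS thread pool size at runtime.
    # Note: the pthread and OpenMP builds treat thread-count settings differently (see above).
    libname = ctypes.util.find_library("openblas") or "libopenblas.so.0"
    openblas = ctypes.CDLL(libname)
    openblas.openblas_set_num_threads(int(n))

limit_openblas_threads(1)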

@bbbbbbbbba

@brada4 The problem isn't overflowing; it's that, as a global buffer, it is being shared by multiple threads in the multiprocessing.pool.ThreadPool.

@brada4
Contributor

brada4 commented Nov 1, 2018

malloc/free should work in this case; I am a bit unsure whether it would damage performance too much (but the random numerical effects should be solved first, before they lead to random discoveries).

Then I wonder why numpy uses gemv for dot; if it is that noticeably better, then probably OpenBLAS could adapt the idea.

@bbbbbbbbba it is just that the first paragraph of the multiprocessing description states the opposite of what is happening here. Probably changing the statement to "processes or threads" would help sort it out.

@martin-frbg steady failures with the pthread versions on anything. OPENBLAS_NUM_THREADS=1 solves the problem. I had foolishly set it to =0 and wondered why it did not do anything.

@bbbbbbbbba

@brada4 Actually multiprocessing.pool.ThreadPool is completely undocumented, so yeah.

@mattip
Contributor

mattip commented Nov 1, 2018

why numpy uses gemv for dot

NumPy uses gemv for matrix @ vector, syrk for matrix @ matrix.T, and gemm for the general matrix @ matrix case
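
For concreteness, a small example of the three cases just described:

import numpy as np

a = np.random.rand(100, 50)
b = np.random.rand(50, 30)
v = np.random.rand(50)

y = a @ v    # matrix @ vector: dispatched to *gemv
s = a @ a.T  # matrix @ its own transpose: dispatched to *syrk
c = a @ b    # general matrix @ matrix: dispatched to *gemm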

@brada4
Contributor

brada4 commented Nov 1, 2018

@mattip got the idea, it is not BLAS _DOT_.

Regarding the oversubscription problem - it would be easy with processes: say, all R parallel modules trigger a "namespace hook" before loading, where you can set all the relevant *_NUM_THREADS=1; that obviously does not work with OMP (and we see MKL instrumenting OMP much better).

@martin-frbg I wonder if this can be tweaked (s/max/num/) to make life better?
#ifdef USE_OPENMP
openmp_nthreads=omp_get_max_threads();

@martin-frbg
Collaborator

@brada4 I am not convinced it would "make life better" - I suspect omp_get_num_threads() would typically return 1 there?

@bbbbbbbbba

bbbbbbbbba commented Nov 1, 2018

OK, I finally got direct access to a computer with more than 2 cores, and managed to reduce the bug that has been bugging me for days into a minimal working example.

import numpy as np
import threading

def do_test(idx):
    for i in range(100 * idx, 100 * idx + 100):
        a = np.ones((1 << 20, 2)) * i
        b = np.eye(2)
        mat_0 = a.dot(b).reshape(-1)
        mat_1 = a.dot(b).reshape(-1)
        if np.any(mat_0 != mat_1):
            indices = np.nonzero(mat_0 != mat_1)[0]
            if len(indices) > 100: print('Error %d %d' % (i, len(indices)))
            else: print('Error %d %s' % (i, [(j, mat_0[j], mat_1[j]) for j in indices]))

for idx in range(2):
    threading.Thread(target=do_test, args=(idx,)).start()

On computers where it doesn't fail (which seems to be any computer with <= 2 cores), it doesn't. On those where it does fail, it fails with high probability.

EDIT: Upon some further experiments, it seems that the bug begins happening at OPENBLAS_NUM_THREADS=4, but only becomes really common when I run it with the full 8 threads (the computer I'm testing this on has 8 cores).

EDIT 2: Change for idx in range(2): to for idx in range(8):, and it fails like crazy beginning at 3 OpenBLAS threads. It seems like 3 OpenBLAS threads is the "hard minimum" for the bug to happen, and having more threads just helps throw off the timing to induce a race condition.

@brada4
Contributor

brada4 commented Nov 1, 2018

omp_get_num_threads will return the number of cores at the top level, and 1 when nested; it seems MKL does exactly that.
The documentation is very elusive in this regard (link).

@brada4
Contributor

brada4 commented Nov 1, 2018

@bbbbbbbbba no need to strategically try to race the same code even more. The only thing I suspect is that gemv is not the only function affected, and with more processing per thread (not gemv/level 2, but more like level 3 BLAS) it will be much harder to race to failure - you would need something like 50 threads squaring matrices to make them crawl into a clash...

@seberg
Contributor Author

seberg commented Nov 2, 2018

It would be good to fix the direct issue in gemv first! For the bigger picture (I am probably ignorant, and maybe just spending effort on making OpenBLAS more thread-safe is the solution):

  1. Might it be helpful, for example, to spend effort on annotating OpenBLAS for the clang analyzer, or to try runtime race condition checkers?
  2. My impression was that, unless the problem size is small, threading the single problem instead of the multiple problems fed in by Python threads is likely faster anyway due to cache locality? In that case it might not be totally bad to put a lock in place, either inside OpenBLAS or at least in NumPy (but it would have to be placed at all interfaces to BLAS); we might simply decide to only allow threaded execution if the BLAS is certainly not multi-threaded. (A rough sketch of this locking idea follows below.)
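
A rough sketch of what such a lock could look like at the Python level (purely illustrative; the wrapper and the lock are hypothetical, not existing NumPy code):

import threading
import numpy as np

# Hypothetical process-wide lock serializing all calls into a multi-threaded BLAS.
_blas_lock = threading.Lock()

def locked_dot(a, b):
    # Illustrative wrapper only; a real fix would need to cover every BLAS entry point.
    with _blas_lock:
        return np.dot(a, b)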

@martin-frbg
Collaborator

I have no current experience with the clang analyzer - how much annotating would it need? Part of the problem with race conditions is that tools like helgrind (valgrind) get really confused by OpenMP.
And I am all for tackling the (now) known problem with gemv first before going after the unknown gremlins. Right now I am a bit behind the curve - is it gemv or dgemm (as suggested in the original numpy ticket, and where rebuilding OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1 might already help)?

@brada4
Contributor

brada4 commented Nov 2, 2018

it is dgemv_ for this particular case (np.dot is not BLAS _DOT):
#1844 (comment)

@seberg
Contributor Author

seberg commented Nov 2, 2018

@martin-frbg after a bit of messing about, I am now pretty sure that I got it right. With OpenBLAS config: NO_AFFINITY HASWELL MAX_THREADS=4 compiled with USE_SIMPLE_THREADED_LEVEL3=1 (no other options) the error still occurs. The specific bug here runs dgemv.

@martin-frbg
Collaborator

Perhaps the good news is that gemv_thread seems to be the only code that has this static local buffer (the result of an attempt to improve performance in issue #532, around version 0.2.15, so not exactly recent but not classic K. Goto code either).

@bbbbbbbbba

Right, but that means my code snippet in #1844 (comment) is almost definitely suffering from another bug.

@brada4
Contributor

brada4 commented Nov 2, 2018

@bbbbbbbbba you are right, your code calls dgemm, though I could not spot it failing with 2 or 4 cores.
Would you mind opening a new issue? It is certainly different from this one (different function).

Samples: 122K of event 'cycles', Event count (approx.): 60959070419, DSO: libopenblas_pthreads.so
Overhead  Command  Symbol
22.40%  python  [.] dgemm_beta_NEHALEM
19.90%  python  [.] dgemm_kernel_NEHALEM
19.54%  python  [.] dgemm_oncopy_NEHALEM
2.72%  python  [.] blas_thread_server
1.49%  python  [.] dgemm_nn
0.07%  python  [.] exec_blas_async
0.05%  python  [.] exec_blas_async_wait
0.05%  python  [.] pthread_mutex_lock@plt
0.04%  python  [.] pthread_mutex_unlock@plt
0.03%  python  [.] sched_yield@plt
0.02%  python  [.] dgemm_itcopy_NEHALEM
0.00%  python  [.] cblas_dgemm
0.00%  python  [.] gemm_thread_mn
0.00%  python  [.] exec_blas

@brada4
Contributor

brada4 commented Nov 14, 2018

@seberg it comes out quite clean from the clang analyzer; scan-build has no complaints whatsoever regarding the threading code. Annotations are way too much work for nothing.
@martin-frbg it is as simple as running scan-build make and following the instructions on the screen.
