OpenBLAS threadsafety issues for downstream libraries (NumPy) #1844
Comments
Spawning too many threads is not exactly a thread safety issue; it is a configuration choice made by the user. The accuracy issue indeed reproduces even with OpenBLAS threading disabled (but not with ATLAS or reference BLAS). Btw, numpy knows perfectly well which BLAS is used, and can set threads as needed (at first sight the only problems could be the lack of support for OMP nesting in OpenBLAS, or that MKL gets some special treatment, or both).
No, unless I am very, very confused, that is incorrect. The config does not tell you what BLAS is being used. Systems such as Debian will link to a library somewhere, and which library that actually is can be changed at any time. The configuration gives you an idea of which BLAS numpy was compiled against, but it does not tell you what is actually used. The only way to find out what is actually being picked up at run time is to inspect the running process. (EDIT: agreed that spawning too many threads is not a thread safety issue as such.) Personally I am more worried about the other issue; I do not see the incorrect results with the other BLAS implementations. EDIT2: To be clear, this is just as much about whether we may be able to help OpenBLAS here as the other way around. If it is important enough, we might be able to find some resources for these issues.
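(A minimal sketch, not from the original thread and Linux-only, of inspecting which BLAS shared library the running process actually loaded, by scanning /proc/self/maps after numpy has pulled it in:)

```python
# Hedged sketch, Linux-only: list the BLAS-like shared libraries that the
# current process has actually mapped into memory.
import numpy as np  # importing numpy forces its BLAS library to load

with open("/proc/self/maps") as maps:
    loaded = {
        line.split()[-1]
        for line in maps
        if any(name in line.lower() for name in ("blas", "lapack", "mkl", "atlas"))
    }

for path in sorted(loaded):
    print(path)
```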
There have been several efforts since 0.2.18 to remove races, but there may be some still lurking in the level 3 BLAS code. One option to check this is to compile OpenBLAS with SIMPLE_THREADED_LEVEL3=1.
Yup, alternatives did not change the config output, so I agree it is the build config. I see incorrect results with pthreads OpenBLAS, whether limited to a single thread via OPENBLAS_NUM_THREADS=1 or not. They go away with OMP_NUM_THREADS=1, which has no impact on pthread OpenBLAS but probably serializes Python multiprocessing?
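(A sketch of the usual caveat with these variables, added here for reference: the *_NUM_THREADS settings are read when the BLAS library is first loaded, so from Python they must be set before numpy is imported:)

```python
# Sketch: thread-count environment variables are read at library load time,
# so they must be set before the first `import numpy`.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # read by pthread builds of OpenBLAS
os.environ["OMP_NUM_THREADS"] = "1"       # read by OpenMP builds / OpenMP runtimes

import numpy as np  # OpenBLAS initializes its thread pool during this import
print(np.dot(np.eye(2), np.eye(2)))       # now runs with a single BLAS thread
```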
@martin-frbg the reproducer is in the linked issue. No problem with netlib or ATLAS (Ubuntu 14). It reproduces well with 0.3.3, including when setting archaic core types; I have not tried the develop version yet.
I am wondering what shared code in the gemm path may pick up or write something random - could it be the *_copy routines?
Does the problem go away with SIMPLE_THREADED_LEVEL3? (Though if this keeps occurring even with OpenBLAS limited to a single thread, it could be multiple numpy threads each invoking OpenBLAS in parallel, in a scenario similar to #1536.)
@martin-frbg so this issue should be a specific one and not a general issue, at least with newer OpenBLAS versions? That would be good news! The FAQ reads a bit like it might be a deeper/bigger issue. About OMP_NUM_THREADS, I am not actually sure of the details (or of how the Debian/Arch OpenBLAS is compiled). On my Arch system ...
If I'm understanding this correctly, the issue just got even more troublesome. On my computer (1 CPU, 2 cores), ...
I should hope that the current situation is a lot better than it was at 0.2.18, but there may still be unfixed bugs in the code, and some of them may show up only on fast systems with many cores. If it is true that the problem persists at OPENBLAS_NUM_THREADS=1, that would be quite worrying indeed. @brada4 are you sure of that?
Yes, absolutely sure, with openSUSE pthread: ...
@brada4 I don't know whether this is relevant, but I'm not sure about this part: ...
It seems from this part of the code that it's the other way round: ...
OPENBLAS_NUM_THREADS=1 alone does not address the issue for me.
@martin-frbg good, that makes me slightly less nervous :). I would still like to say that if there is some way we can help, please ask. Not that we have many resources, but if something is important enough maybe we can help out. @bbbbbbbbba, @brada4 yeah, then my Arch at least is just the pthread version, so it does not matter which one I use. I agree that ...
@martin-frbg Sorry, is that a runtime option?
Sorry, I meant the make option I mentioned earlier. There is no corresponding runtime option for this.
Yes, the threads are probably calling into OpenBLAS dgemm in parallel. Python has a global interpreter lock, but some numpy routines specifically release it.
Yes, of course the threads are calling into OpenBLAS in parallel. While the threads might be reading the same data, the write buffers are allocated safely by numpy. NumPy itself is safe in this regard (it holds the Python global interpreter lock for all things that would be problematic). The problem occurs inside the dgemm call, which is effectively called in parallel because numpy releases the global interpreter lock around it. (Of course we could have bugs of our own, but then the problem should not go away with the OpenBLAS options.) The threading itself is implemented by someone using numpy, not by numpy itself.
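(To make the scenario concrete, a minimal sketch of what "calling into OpenBLAS in parallel" looks like from the numpy side; the shapes and thread count here are arbitrary:)

```python
# Sketch: several Python threads call np.dot concurrently. NumPy releases
# the GIL around the BLAS call, so dgemm genuinely runs in parallel here.
import threading
import numpy as np

a = np.random.rand(500, 500)
expected = np.dot(a, a)
results = [None] * 8

def worker(i):
    results[i] = np.dot(a, a)  # the GIL is released inside this call

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print([np.allclose(r, expected) for r in results])  # should be all True
```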
Are there internal OpenBLAS memory buffers (not ones in the external interfaces) that are shared across calls to OpenBLAS functions?
There is a thread buffer pool that was shown to misbehave in some circumstances - the OpenMP side of it was fixed (or worked around) in #1536 (released with 0.3.0) by introducing another compile-time variable.
Right, but does that thread buffer pool even exist in non-OpenMP OpenBLAS? It was ... (now ...)
That would help oversubscribe the CPU cores even better.
Now that I think of it, I don't know whether my own OpenBLAS (which comes with numpy) is built with ...
You can change the system-wide BLAS easily on Ubuntu with update-alternatives (typically sudo update-alternatives --config libblas.so.3).
After some diving into OpenBLAS code, I suspect that this buffer (a static local buffer in the gemv threading code) is the culprit.
The threads are calling dgemv_: ...
@bbbbbbbbba it will overflow "under conditions", but it is sized in entire pages and strangely never hits any sort of mprotect().
No. Switching topic: about controlling the OpenBLAS / OpenMP / MKL thread pool sizes programmatically, we have some prototype code in this PR: https://github.com/tomMoral/loky/pull/135. It was inspired by https://github.com/IntelPython/smp but covers more libraries and platforms. Feel free to steal that code for numpy if you wish.
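(A hindsight note, not part of the original thread: that prototype later grew into the threadpoolctl package. A sketch of the API it ended up with, assuming threadpoolctl is installed:)

```python
# Sketch using threadpoolctl (pip install threadpoolctl), the package that
# grew out of the loky prototype linked above.
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

print(threadpool_info())  # reports the BLAS/OpenMP libraries actually loaded

with threadpool_limits(limits=1, user_api="blas"):
    # BLAS calls inside this block are restricted to a single thread
    a = np.random.rand(500, 500)
    np.dot(a, a)
```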
@brada4 The problem isn't overflowing; it's that, as a global buffer, it is being shared by multiple threads calling into OpenBLAS at the same time.
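(A plain-Python illustration, not OpenBLAS code, of why a single shared scratch buffer fails under threading: each thread stages its intermediate values in the same buffer, so a concurrent caller can overwrite them before they are read back. Depending on scheduling it may take a few runs to trigger:)

```python
# Illustration only: a module-level scratch buffer shared by all threads,
# analogous to the static buffer in the gemv threading code.
import threading

scratch = [0.0] * 1024  # the shared "static" buffer

def scale_and_sum(values, factor):
    n = len(values)
    for i, v in enumerate(values):
        scratch[i] = v * factor      # another thread may overwrite these...
    return sum(scratch[:n])          # ...before this thread reads them back

def check(factor):
    expected = 512 * factor
    got = scale_and_sum([1.0] * 512, factor)
    if abs(got - expected) > 1e-9:
        print("corrupted result: got %r, expected %r" % (got, expected))

threads = [threading.Thread(target=check, args=(float(k + 1),)) for k in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```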
malloc/free should work in this case; I am a bit confused about whether it would damage performance too much (but the random numerical effects are to be solved first, before they lead to random discoveries). Then I wonder why numpy uses gemv for dot; if it is that noticeably better, then probably OpenBLAS could adapt the idea. @bbbbbbbbba it is just that the first paragraph of the multiprocessing description states something opposite to what is happening here; probably changing the statement to "processes or threads" would help sort it out. @martin-frbg steady failures with pthread versions on anything; OPENBLAS_NUM_THREADS=1 solves the problem. I had, lunatically, set it to =0 and wondered why it did not do anything.
@brada4 Actually, NumPy uses the gemm/gemv routines rather than BLAS _dot_ here: ...
@mattip got the idea, it is not BLAS _DOT_. Regarding the oversubscription problem - it would be easy with processes: say, all R parallel modules trigger a "namespace hook" before loading, where you can set all viable _NUM_THREADS=1; this obviously does not work with OMP (and we see MKL instrumenting OMP much better). @martin-frbg I wonder if this can be tweaked s/max/num/1 to make life better?
@brada4 not convinced it would "make life better" - I suspect omp_get_num_threads() would typically return 1 there?
OK, I finally got direct access to a computer with more than 2 cores, and managed to reduce the bug that has been bugging me for days into a minimal working example.
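(The snippet itself did not survive in this extract. Based on the later discussion - threads racing matrix products and comparing against a reference result - a hedged reconstruction of the kind of reproducer described, not the author's original code:)

```python
# Hedged reconstruction, NOT the original snippet: hammer dgemm from
# several threads and flag any result that differs from a reference.
import threading
import numpy as np

a = np.random.rand(100, 100)
expected = np.dot(a, a)

def worker():
    for _ in range(1000):
        if not np.allclose(np.dot(a, a), expected):
            print("wrong result!")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```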
On computers where it doesn't fail (which seems to be any computer with <= 2 cores), it doesn't fail at all. On those where it does fail, it fails with high probability. EDIT: Upon some further experiments, it seems that the bug begins happening at ... EDIT 2: Change ...
omp_get_num_threads will return the number of cores at the top level, and one when nested; it seems MKL does exactly that.
@bbbbbbbbba no need to strategically try to race the same code more. The only thing I suspect is that not only gemv is affected; with more processing per multiprocessing thread (not gemv/level 2, but more like level 3 BLAS) it will just be much harder to race to failure - like you would need to square the 50 threads to make them crawl to a clash...
Would be good to fix the direct issue in gemv first! For the bigger picture (I am probably ignorant, and maybe just spending effort on making OpenBLAS more thread-safe is the solution): ...
I have no current experience with the clang analyzer; how much annotating would it need? Part of the ...
it is dgemv_ for this particular case (np.dot is not BLAS _dot_):
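(The trace that followed is not preserved here. For illustration, my own sketch of the dispatch being discussed, assuming a BLAS-enabled numpy build: np.dot only maps to the BLAS *dot* routines for two 1-D arguments, while a matrix-vector product goes through *gemv*:)

```python
# Sketch: which BLAS routine np.dot lands in depends on the operand shapes.
import numpy as np

A = np.random.rand(100, 100)
x = np.random.rand(100)

np.dot(x, x)  # 1-D . 1-D: BLAS ddot territory
np.dot(A, x)  # 2-D . 1-D: goes through dgemv_ (the code under suspicion)
np.dot(A, A)  # 2-D . 2-D: goes through dgemm_
```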
@martin-frbg after a bit of messing about, I am now pretty sure that I got it right. With OpenBLAS config: ...
Perhaps the good news is that gemv_thread seems to be the only code that has this kind of static local buffer (the result of an attempt to improve performance in issue #532, around version 0.2.15 - so not exactly recent, but not classic K. Goto code either).
Right, but that means my code snippet in #1844 (comment) is almost definitely suffering from another bug.
@bbbbbbbbba you are right, your code calls dgemm, though I could not spot it failing with 2 or 4 cores.
@seberg it has been extensively cleaned up following the clang analyzer; scan-build has no complaints whatsoever regarding the threading code. Annotations are way too much work for nothing.
Since OpenBLAS is one of the most commonly used BLAS implementations with NumPy, there are currently two points being discussed within NumPy. One is the spawning of too many threads when using multiprocessing/Python threads (that is probably something we need to find a solution for ourselves, but pointers are welcome!).
The main issue is the thread safety of OpenBLAS when multiple threads are used. This may not be problematic for many projects explicitly using OpenBLAS, but NumPy is a bit different: ...
Together, this means NumPy will often use multi-threaded OpenBLAS. Now, if someone uses a numpy function within threads, they might get silently incorrect results (numpy/numpy#11046), and this seems true for OpenBLAS versions 0.2.18, 0.2.20, and 0.3.0 (probably generally).
The possibility of silently wrong results deeply troubles me. Downstream users of NumPy are unlikely to be aware of such issues, or might not even notice that they are using BLAS at all. The problem is that there seems to be no easy solution.
We could use locking to disable threading (sketched below), but that would probably be incorrect or unnecessary in many cases as well.
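(An illustrative sketch of that locking idea on the numpy-user side; safe_dot is a hypothetical wrapper, not a numpy API:)

```python
# Sketch of the locking workaround mentioned above: serialize all
# BLAS-backed calls behind one process-wide lock. This restores
# correctness at the cost of all parallelism in those calls.
import threading
import numpy as np

_blas_lock = threading.Lock()  # hypothetical, not part of numpy

def safe_dot(a, b):
    with _blas_lock:
        return np.dot(a, b)
```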
I have been talking with @mattip and others at NumPy, and hoped we could start a small discussion here. Maybe there are fixes/workarounds we can implement in NumPy, but maybe we can also provide help/resources to make OpenBLAS thread safe?