Segmentation fault using serial OpenBLAS with OpenMP on Windows #1847
Does the same code work correctly with other versions of BLAS (or OpenBLAS compiled with different options)? |
Yes, the code works correctly with netlib-lapack/blas. |
Can you get backtrace with e.g. x64dbg? |
Sorry, I am not familiar with backtrace debug. |
See Usage.md, for "historical reasons" NUM_THREADS also sizes an internal buffer. Do you get correct results when you compile OpenBLAS (which version, by the way ?) with USE_OPENMP=1 ? |
@martin-frbg I think USE_OPENMP=1 cannot be used together with USE_THREAD=0. Furthermore, only the serial version of OpenBLAS (0.3.3) is considered here. |
I will try to build threaded OpenBLAS and use openblas_set_num_threads(1) to get serial OpenBLAS. |
Did not see the requirement that OpenBLAS has to be single-threaded. If the error is not reproducible on anything other than Windows it could also be a mingw miscompilation. I will check (later) if valgrind/helgrind on Linux makes any complaints. |
No segfault but incorrect results with OpenBLAS using USE_THREAD=1 USE_OPENMP=1. |
To get a backtrace in a debugger (x64dbg is graphical, understands both Windows and gcc debug symbols, and is free; alternatives would be IDA Pro, Visual Studio, or OllyDbg, each lacking functionality in some way), you run the crashing program inside the debugger and it catches the crash. Then you have various options (which open new MDI windows) such as backtrace, registers, stack content, and disassembly of the code around the (crashed) EIP and probably at other code points mentioned in the traces. Just be courageous and creative along these lines. Nobody expects a perfect picture at the first try. |
@martin-frbg I have updated the test results, including Linux; please see the first post for details. I have read USAGE.md. I still do not understand why a single-threaded OpenBLAS may have problems in a multithreaded program.
@brada4 Many thanks, I will try later. |
#1844 approaches same class of buffers but from different side |
@hjbreg could you retry with a build after setting USE_TLS to 0 in Makefile.rule please ? 0.3.3 was meant to revert to the original thread memory allocation code as there were some problems with the new thread-local storage implementation, but unfortunately I had kept the wrong default (unless you build with cmake, that is). Not sure yet if this is in any way related to #1844. |
Tested the previous build options with USE_TLS=0, still not working. |
Do not set |
@brada4 You are right, USE_TLS was actually not disabled in my last build. I found that the only way to disable USE_TLS is to comment out the line in Makefile.rule, as Makefile.system only checks for |
Yes, sorry, I only fixed the USE_TLS=0 gotcha after the 0.3.3 release. I'd say myself that 0.3.4 is overdue, but I had still hoped for some phase where bugs are not popping up like mushrooms. |
|
@hjbreg the problem is that there are about 10 candidate functions and backtrace is really needed to find which (can be more than one) of them is at fault. Say Linux version of backtrace would look like
|
@brada4 Thanks for the guide. I tested Linux case using USE_THREAD=0 NUM_THREADS=1, and OMP_NUM_THREADS=4. gcc version 4.4.7 20120313 (Red Hat 4.4.7-4)
|
This may be related to the global static memory buffer defined in I think the error message "too many memory regions" can demonstrate that this static variable is accessed from multiple threads. As NUM_BUFFERS is defined to But if the static memory variable is not thread-safe, I cannot explain why the case of memory.c line 2460-2473
|
That is about the non-TLS version. It uses a static buffer configured at build time. It turns out it is unnecessarily used for single-threaded calls. |
I suspect OpenBLAS will "just" work in this context (single-threaded OpenBLAS called by multiple threads in an uplevel program) as long as each thread finds an empty slot in the memory array (i.e. NUM_THREADS is big enough to accommodate them) - but on the whole this looks to be a serious design flaw in the original GotoBLAS. With luck, the miscalculations you are seeing on Windows are caused by the separate (and much younger) bug in xGEMV that is discussed in #1844 and hopefully resolvable by #1852. |
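As a rough illustration of the slot behaviour described in the comment above, the following C++ sketch models a fixed-size table of buffer slots. `SlotTable`, `kNumSlots`, `acquire` and `release` are hypothetical names invented for this example. This is not the code in OpenBLAS's memory.c, only a minimal model, under that assumption, of why a caller fails once more threads ask for a buffer than the build-time limit allows.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a limit fixed at build time.
constexpr std::size_t kNumSlots = 2;

// Minimal model of a static table of buffer slots shared by all threads.
class SlotTable {
public:
    SlotTable() {
        for (auto& u : used_) u.store(false);  // every slot starts free
    }

    // Claims the first free slot and returns its index,
    // or -1 when every slot is already taken.
    int acquire() {
        for (std::size_t i = 0; i < kNumSlots; ++i) {
            bool expected = false;
            // Atomically flip free -> used so two threads cannot share a slot.
            if (used_[i].compare_exchange_strong(expected, true)) {
                return static_cast<int>(i);
            }
        }
        return -1;  // table exhausted
    }

    // Marks a previously acquired slot as free again.
    void release(int slot) {
        assert(slot >= 0 && static_cast<std::size_t>(slot) < kNumSlots);
        used_[static_cast<std::size_t>(slot)].store(false);
    }

private:
    std::array<std::atomic<bool>, kNumSlots> used_;
};
```

With `kNumSlots = 2`, a third `acquire()` made while both slots are held returns -1. In this simplified model that plays the role of the "too many memory regions" failure, and enlarging the table plays the role of building with a larger NUM_THREADS.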
@martin-frbg I have tested on 64-bit Windows with Intel Core i7-4790, libopenblas_haswell-r0.3.3 dgemv works correctly under multiple threads with or without #1852 .
|
Did initial dgels_ code start to work with the patch from PR? |
@hjbreg so it looks to you as if it is a Sandybridge problem (or more specifically the Sandybridge-specific microkernels as compiled by mingw-4.9.3) ? The dgemv issue seems real (and reproducible), but may indeed be unrelated to your case. |
@brada4 dgels still does not work with the patch |
I have tested the functions called by dgels, and found the bug is caused by the precision of Test is performed on 64-bit Windows, Intel Core i7-4790, gcc 8.2.0 (MSYS2 MinGW 64-bit), OpenBLAS 0.3.3 (USE_THREAD=0) Test output (1st column is openblas dnrm2, 2nd is naive code)
Test code
|
Please go to makefile.rule, the place where -frecursive parameter is disabled, and enable it. |
@brada4 I do not understand why this option is related to this issue. I tested dnrm2 with OpenBLAS built with NO_LAPACK=1 |
I also disabled optimized dnrm2 by adding |
it is a blas function. my bad. ARM(v5 32bit) is C-only. |
So "wrong" in this context is "only" about the trailing digits of the SSE result (and which somehow differs between Windows and Linux builds on the same hardware - I count 15 significant digits in either case if I am not mistaken) ? |
@hjbreg it is exactly a one-bit difference in the least significant bit of the result, which is insignificant for all purposes. You can re-calculate with some symbolic math package, then with reference BLAS, and you will see that whatever you use, you lose on average log2(number of float ops) least significant bits of precision using any FPU as compared to full symbolic evaluation. |
@hjbreg there was an old issue that Windows mingw does not copy FPCSR between threads, and that was helped out with CONSISTENT_FPCSR=1; that could be one explanation why the threaded version acts slightly differently. |
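To make the point about trailing digits concrete, here is a small C++ sketch that computes a Euclidean norm two ways and compares the results with a relative tolerance instead of exact equality. The function names are hypothetical, and `scaled_norm` is only written in the style of the classic scaled sum-of-squares loop used by reference implementations; it is not the OpenBLAS SSE kernel discussed in this thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Straightforward norm: square, accumulate, take the square root.
double naive_norm(const std::vector<double>& x) {
    double sum = 0.0;
    for (double v : x) sum += v * v;
    return std::sqrt(sum);
}

// Scaled sum of squares: keeps the running sum relative to the largest
// magnitude seen so far, which changes the order of rounding errors.
double scaled_norm(const std::vector<double>& x) {
    double scale = 0.0;
    double ssq = 1.0;
    for (double v : x) {
        if (v == 0.0) continue;
        const double a = std::fabs(v);
        if (scale < a) {
            const double r = scale / a;
            ssq = 1.0 + ssq * r * r;  // rescale the sum to the new maximum
            scale = a;
        } else {
            const double r = a / scale;
            ssq += r * r;
        }
    }
    return scale * std::sqrt(ssq);
}

// Relative difference between two finite results (0 when both are 0).
double relative_difference(double a, double b) {
    assert(std::isfinite(a) && std::isfinite(b));
    const double m = std::fmax(std::fabs(a), std::fabs(b));
    return m == 0.0 ? 0.0 : std::fabs(a - b) / m;
}
```

A caller would accept two results as equal when `relative_difference(naive_norm(x), scaled_norm(x))` is below a tolerance on the order of a few units of machine epsilon times the vector length, since the two loops round in a different order and may legitimately disagree in the last bits.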
I have to say the real case is much more complex where dgels and dgelsy are used intensively for variable selection. I am sorry that the above test does not cover the full real issue. |
Finally, I think I have found the real issue. I built OpenBLAS with BLAS only (NO_LAPACK=1), and linked my real program to LAPACK (NETLIB) and BLAS (OpenBLAS); surprisingly, this issue disappears. So I guessed this issue may be related to a Fortran compile option only, and I rebuilt the whole of OpenBLAS (LAPACK included) with Although I do not know why Also please see https://gcc.gnu.org/onlinedocs/gfortran/OpenMP.html
|
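The `-frecursive` option mentioned in this thread is documented by gfortran as forcing local arrays onto the stack. As a hedged C++ analogy (the function names are hypothetical and this is not LAPACK code), the sketch below shows why a scratch array with static storage is unsafe whenever more than one call is active at once: every active call shares the same array. Nested calls are used here instead of threads so the effect is deterministic.

```cpp
#include <cassert>

// Scratch array with static storage: one copy shared by every active call.
int sum_static(int depth) {
    assert(depth >= 1);
    static int work[4];
    for (int& w : work) w = depth;          // fill scratch with this call's value
    if (depth > 1) sum_static(depth - 1);   // nested call overwrites the same array
    int s = 0;
    for (int w : work) s += w;
    return s;                               // reflects the innermost call's data
}

// Scratch array with automatic storage: each call gets its own copy.
int sum_automatic(int depth) {
    assert(depth >= 1);
    int work[4];
    for (int& w : work) w = depth;
    if (depth > 1) sum_automatic(depth - 1);  // nested call has a separate array
    int s = 0;
    for (int w : work) s += w;
    return s;                                 // always 4 * depth
}
```

Calling `sum_static(3)` returns 4 because the innermost call left its value in the shared array, while `sum_automatic(3)` returns 12. Two threads entering the static variant at the same time would interfere in the same way, only non-deterministically.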
I think it can even be enabled for all gfortran builds, as thread safety is intended for the single-threaded version. |
@brada4 The initial crash is fixed by specifying NUM_THREADS=16 (or any other value greater than the actual number of threads), which is actually described in Usage.md. |
I hope these two address initial concerns. There are no thread-safety tests in current test suite either, thank you for reporting. |
The two patches work for me, so I close this issue. |
Add gfortran -frecursive option from upstream and #1847
Updated test results (OpenBLAS 0.3.3)
Windows 64-bit
gcc version 4.9.3 20150626 (Fedora MinGW 4.9.3-1.el7)
USE_THREAD=0 NUM_THREADS=1
segmentation fault
USE_THREAD=0 NUM_THREADS=16
incorrect result
USE_THREAD=1 NUM_THREADS=2 USE_OPENMP=1 NUM_PARALEL=16
incorrect result
USE_THREAD=1 NUM_THREADS=2 openblas_set_num_threads(1)
incorrect result
Linux 64-bit
gcc version 4.4.7 20120313 (Red Hat 4.4.7-4)
USE_THREAD=0 NUM_THREADS=1
too many memory regions
USE_THREAD=0 NUM_THREADS=16
OK
Segmentation fault if OMP_NUM_THREADS > 4, but linking to netlib-lapack is OK, so I think this may be an OpenBLAS-side problem. I also tested it on Linux: no segmentation fault.
GCC: x86_64-w64-mingw32-gcc 4.9.3
Single-threaded OpenBLAS is built with:
USE_THREAD=0 DYNAMIC_ARCH=1 DYNAMIC_OLDER=0 NO_CBLAS=1 NO_LAPACKE=1 NO_SHARED=1
Test code: segfault-win.cpp