Segfault when building haskell-opencv with openblas >= 0.3.3 #1923
Comments
No spontaneous idea - 0.3.2 to 0.3.3 included a (partial) revert of the experimental TLS memory allocator that may have made OpenBLAS a bit more likely to run out of thread pointers again, but (a) you would have experienced that problem in 0.3.0 and earlier versions already and (b) from the nixpkgs ticket it appears you already tried 0.3.4 which uses a safer default for the number of threads.
Could you try building with 0.3.4 before reverting to an old release?
Luckily the time period between 0.3.2 and 0.3.3 was just one month, but apart from reverting to the pre-0.3.1 thread memory allocation logic not much happened besides the addition of cpu-specific code for IBM Z and for AVX512-capable Intel processors (Skylake X and recent Xeons).
@martin-frbg I will try to get a backtrace using DEBUG=1. Note that the segfault occurs during compilation of Haskell code, not when running code during a test. @nh2 you tried building with 0.3.4, right? You said the tests hang, but did the segfault disappear?
Yes, it did disappear. I just posted it on NixOS/nixpkgs#52439 (comment). By the way, it may not be an OpenBLAS-only problem, because we have one report for another package where the segfault happens and OpenBLAS is not involved. But it may be that the TLS stuff (or something else) that OpenBLAS did for these releases is the problem in general, and that the other package uses something similar.
@nh2 I see that in the latest nixpkgs
So this issue can be closed. @martin-frbg and @brada4 thanks for your help and sorry for the noise!
@basvandijk I'm not super convinced though: We should probably still figure it out. If it is TLS and OpenBLAS wants to re-enable it in the future, we would probably hit this again immediately, and the other issue I mentioned won't go away by itself either. Though I agree it is not an OpenBLAS bug in the latest release as per current knowledge, we may be able to find the underlying cause given that we could narrow the problem down to being triggered by a small amount of code in OpenBLAS.
Update: Here's a Quoting the important bits (this is for
This could be related to the remaining issue from #1720 (missing pthread_key_delete on thread shutdown, testing the proposed solution is already on the to-do list for the 0.3.5 milestone). Though actually I would be more interested in backtraces from 0.3.4 - given the surprise bugs encountered with the new TLS code it was supposed to be switched off already in 0.3.3, and I expect it to remain a non-default option for at least the next version despite its demonstrated advantages.
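For readers unfamiliar with the POSIX calls named above, here is a minimal sketch of the life cycle of a thread-specific key. It is not OpenBLAS code and does not reproduce #1720; the names `buffer_key`, `free_buffer`, `worker` and `run_tls_key_demo` are made up for illustration. It only shows the general pattern: a key is created once, each thread stores its own value under it, the destructor runs when that thread exits, and the key itself stays allocated until someone calls pthread_key_delete.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names for illustration only; none of this is OpenBLAS code. */
static pthread_key_t buffer_key;
static int destructor_calls = 0;

/* Called by the pthread runtime at thread exit, once for each thread
 * that stored a non-NULL value under buffer_key. */
static void free_buffer(void *buffer) {
    free(buffer);
    destructor_calls++;
}

/* Each worker attaches its own private buffer to the shared key. */
static void *worker(void *unused) {
    (void)unused;
    void *buffer = malloc(64);
    if (buffer != NULL) {
        pthread_setspecific(buffer_key, buffer);
    }
    return NULL;
}

/* Returns the number of destructor calls observed, or -1 on setup failure. */
int run_tls_key_demo(void) {
    pthread_t thread;
    destructor_calls = 0;

    if (pthread_key_create(&buffer_key, free_buffer) != 0) {
        return -1;
    }
    if (pthread_create(&thread, NULL, worker, NULL) != 0) {
        pthread_key_delete(buffer_key);
        return -1;
    }
    /* Joining guarantees the worker's destructor has already run. */
    pthread_join(thread, NULL);

    /* Keys are a limited per-process resource; pthread_key_delete releases
     * the key slot but does not itself invoke any destructors. */
    if (pthread_key_delete(buffer_key) != 0) {
        return -1;
    }
    return destructor_calls;
}
```

Calling `run_tls_key_demo()` from a `main` function and checking that it is non-negative is enough to try it out. A library that creates keys and never deletes them can eventually exhaust the key slots a process is allowed, which is the general kind of leak a missing pthread_key_delete leads to.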
You mean with the TLS feature explicitly enabled?
Here's an improved backtrace after building the same as previous
This one shows Can anybody tell me how I can add source info in this situation with gdb? I get
and |
Thanks. This will probably help to get the TLS code in shape eventually. I am still convinced that it would be a great improvement over the old memory management code from GotoBLAS, but there were just too many unexpected interactions and corner cases in the mostly uncommented OpenBLAS code to make it viable in the few months since its appearance. (Including it early has also taught me how many programs have come to depend either directly or indirectly on OpenBLAS nowadays, and how quickly a new release is picked up by distributors.)
@martin-frbg How do I enable the ? |
You are right, the DEBUG=1 does not get passed through the individual Makefiles. Suggest you
I've enabled it now with
So it seems that even though we got
Building haskell-opencv on nixpkgs with openblas >= 0.3.3 results in a segfault when compiling one of the included executables. The build succeeds with openblas-0.3.2.
This is reported in more detail in: NixOS/nixpkgs#52439.
This has to be caused by something in v0.3.2..v0.3.3. Any idea what it could be?