Segfault with large NUM_THREADS #2839
Comments
In this case, it also segfaults without LD_PRELOAD; that was just a matter of simplifying the example.
Even less of an idea then, I would need to see with gdb where it blows up. Is this with 0.3.10 or with the develop branch?
Both. We detected this with the current release, but in the example above I checked out the current develop branch.
I almost forgot... We collected this backtrace:
[backtrace elided; it pointed into dsyrk_thread_UT]
Hmm. dsyrk_thread_UT again (issue #2821). Not sure yet what that is, as nothing changed in syrk recently.
In the process of changing the system-wide default to openblas-openmp, I had to rebuild every single BLAS/LAPACK consumer in Fedora, and I noticed a couple of random segfaults in tests: one in scipy, and another in clblast. I didn't dig further because they didn't trigger consistently and I had a ton of work to do, but in retrospect they may be related to this and #2821. This segfault with octave is consistent, though.
Reproduced with the docker setup; built OpenBLAS with DEBUG=1 but cannot get gdb to print a meaningful backtrace (?? for anything except libjvm.so).
Mmmh... the backtrace above was collected without docker. I'm also unable to get the same one inside docker.
And unable outside docker too. I get [...]
Bisecting to see if/when/why this ever used to work.
No luck going back as far as 0.2.18 - gcc10 is hitting several long-resolved bugs, but after working around them the symptoms stay the same and the backtrace stays unusable. Trying to use valgrind instead of gdb only results in an immediate segfault on startup. I am a bit suspicious of the role of the Java VM lib - it is the only thing that is visible in the traces, and there have been various collisions with Java stack and memory size limits in the past.
Note that the segfault also happens with openblas-threads, not only with OpenMP (and there are no issues with blis-openmp or blis-threads). Also, the segfault triggers in Fedora 31 (Java 8, gcc 9), Fedora 32 (Java 8, gcc 10) and 33/rawhide (Java 11, gcc 10). The fact that we've seen random segfaults with clblast, which doesn't involve Java, makes me think there may be a pre-existing issue that Java, for some reason, makes blow up consistently.
Mystery solved. It turns out we are hitting this: https://bugzilla.redhat.com/show_bug.cgi?id=1572811. TL;DR: the JVM uses SIGSEGV for stack-overflow detection (in other words, the JVM is giving us a hint that this is a stack overflow), and it is masking whatever happens next under gdb. So just hit "continue" and you'll get a nice clean backtrace. :)
No chance to break out of the Java-induced SIGSEGV, gdb just gets another SIGSEGV in the same location after "continue" - and so ad infinitum (or at least for significantly more than the hypothetical 128 threads that should not even get instantiated on a less well endowed machine).
Seems the docker reproducer segfaults on my system with or without libopenblas preloaded, and never gets far enough to even enter level3_syrk_threaded.c under gdb with "handle 11 nostop noprint nopass".
Try this:
[commands elided]
And if you add [...]
Thanks, that's better. Unfortunately it still does not explain what happens - arguments go from totally sane in interface/syrk.c to unreadable garbage in level3_syrk_threaded.c as far as gdb is concerned, and valgrind is effectively neutralised by the Java segfault. Retrying the bisect in the hope that it uncovers something this time.
Just managed to catch this from a NUM_THREADS=88 build now (which does not make much sense either; the m, n, k are 2, 2, 10 in the caller, i.e. dsyrk_UT). Tried building with gcc's asan and ubsan but they did not find anything (except a known problem in ctest's way of collating the openblas_utest modules).
[backtrace elided]
Makes sense if another thread is overwriting them. The question is how this happens, and why with this NUM_THREADS and not with lower values.
It would be easier to understand if there were actually that many threads running (thinking low-probability race condition), but as the number of threads is capped at the hardware capability, NUM_THREADS should only size the GEMM buffer here.
A race condition should trigger random failures. The consistency of the segfault suggests that threads are somehow being assigned a wrong position in the buffer, maybe? I don't know the internals of this buffer, so I'm just speculating here.
The buffer "only" contains the results of individual submatrix computations; any conflict there should not lead to overwriting of function arguments on the stack (I think).
I found that there's a BLAS3_MEM_ALLOC_THRESHOLD constant in the sources [...]
NUM_THREADS=85 gave no segfault, but I'll try playing with that threshold just in case. Maybe this is more related to gcc10 optimizing some already fragile code, and the bigger buffer from a high NUM_THREADS just changes the memory layout to make the consequences more interesting.
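For context on that threshold, the mechanism is roughly the following. This is a paraphrased sketch rather than the exact OpenBLAS source; the structure layout and constants mentioned in the comments are from memory and should be checked against driver/level3/level3_syrk_threaded.c:

```c
/* Paraphrased sketch of the stack-vs-heap switch around the level-3 job array
 * (not verbatim OpenBLAS code). If the build-time maximum thread count exceeds
 * the threshold, the array is malloc'ed; otherwise it is a local variable on
 * the calling thread's stack. */
#if MAX_CPU_NUMBER > BLAS3_MEM_ALLOC_THRESHOLD
#define USE_ALLOC_HEAP
#endif

/* ... inside the threaded syrk/gemm driver ... */
#ifndef USE_ALLOC_HEAP
  job_t job[MAX_CPU_NUMBER];                                      /* on the stack */
#else
  job_t *job = (job_t *)malloc(MAX_CPU_NUMBER * sizeof(job_t));   /* on the heap */
#endif
```

Since job_t itself contains a working[MAX_CPU_NUMBER][CACHE_LINE_SIZE * DIVIDE_RATE] array of BLASLONG entries, the stack variant grows quadratically with NUM_THREADS; with the usual x86-64 values (8-byte BLASLONG, CACHE_LINE_SIZE of 8, DIVIDE_RATE of 2) a NUM_THREADS=128 build would put roughly 2 MiB on the stack, assuming my recollection of those constants is right.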
Note that my example above is replicable too if you switch from [...]
Same crash after a tenfold increase of BLAS3_MEM_ALLOC_THRESHOLD. The thread sanitizer found some unfriendly things to say about the sanity of level3_thread from a gemm perspective, but did not complain about the code path taken by syrk. (Unfortunately it refuses to run in the java/gdb context, so this was from running the BLAS tests.)
Actually, decreasing the threshold (e.g. to 60) seems to make the crash go away, so it could be that it was the job array itself that is/was trashing the stack. Curiouser and curiouser...
BTW, I incorrectly stated that [...]
Too early to celebrate, this could be a Heisenbug. I can also make it work by (only) increasing NUM_THREADS even further (208)...
Actually, [...]
So this makes sense, right?
Behold:
[elided: output showing the role of the Java thread stack size (_JAVA_OPTIONS / -Xss)]
Ah, ok... this is where Java messes things up for us (again - there are examples in closed issues where the small default Java stack caused crashes). I had already wondered if/how/where to set Xss but did not think of an environment variable. The default stack on most(?) Linux distros is 8192k.
I think so. And the default stack on most(?) Java installations is 1024k AFAIK. I suppose that [...]
Might make sense to adjust behaviour according to a getrlimit call at runtime instead of relying on a fixed threshold, but I guess the whole point is to try to avoid the system call overhead. Alternatively, Octave could adjust their _JAVA_OPTIONS?
That's possible, yes, because octave integrates Java. But I'd bet #2821 is the same issue, and scipy knows nothing about Java. So whenever Java is involved, this could happen. And the same can be said whenever a user changes the stack size.
The trick is that #2821 only crashes when he loads libhdfs.so, which appears to be Java-based... I have just copied our current understanding of the problem there.
Yes, that's my point: octave can ensure that the proper stack size is set, but scipy doesn't use Java at all. But then, if someone uses scipy and another Java-based library in the same script, boom. Hadoop knows nothing about OpenBLAS requirements for the stack either. So in the end, it's the user's responsibility.
Could have sworn I had put it in the FAQ long ago (where few would read it anyway, I guess). Can't Java just go away and die out?
:D Cannot agree more. But here we are, and many distributed technologies, big data tools and other scientific stuff are Java-based. ¯\_(ツ)_/¯
Does Java modify the stack size of non-Java-spawned threads (the ones OpenBLAS creates)?
Yes, it does. It doesn't matter which code path spawns those threads: as soon as the JVM lives in the main process, new threads inherit those parameters.
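One way to check that empirically would be a small diagnostic along these lines (my own illustration, not anything in OpenBLAS): call it from a worker thread in a process that has loaded the JVM and compare with the plain-C case. pthread_getattr_np() is a GNU/glibc extension, which should be fine on the Fedora systems discussed here.

```c
/* Hypothetical diagnostic (illustration only): print the stack size that the
 * calling thread actually received. Running this from an OpenBLAS-spawned
 * worker inside a JVM-hosting process vs. a plain C program would show
 * whether the Java -Xss setting really propagates to those threads. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

static void print_my_stack_size(const char *tag) {
  pthread_attr_t attr;
  size_t stacksize = 0;

  if (pthread_getattr_np(pthread_self(), &attr) == 0) {
    pthread_attr_getstacksize(&attr, &stacksize);
    pthread_attr_destroy(&attr);
    fprintf(stderr, "%s: thread stack = %zu KiB\n", tag, stacksize / 1024);
  }
}

int main(void) {
  print_my_stack_size("main thread");
  return 0;
}
```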
Try [...]
I already tried, and it doesn't help.
I do not quite like the idea of a runtime check (portability and all); perhaps reduce the default THRESHOLD to a value that is safe with the Java default, and at the same time make it configurable so that anybody who is certain not to use Java and wants every little gain in performance can restore the old behaviour (or even go beyond it with a suitably larger value).
With an upstream-developer hat on, I fully understand. With the Fedora hat on (pun intended), OpenBLAS already detects the number of CPUs available to set a proper number of threads, so I don't see how this should be different. :)
Point taken, and at least Windows gets special treatment already, and getrlimit() should be sufficiently portable across anything unixoid.
Hmm. I don't really see a clean way to go from a (compile-time) decision between allocating on either the heap or the stack to a run-time choice on startup that does not lose the advantages of stack allocation and does not clutter up the stack with effectively unused allocations either. Of course, what could be done easily is a runtime warning that the current stack is too small (there is commented-out, and most probably ineffective, code from xianyi's #221 in memory.c to try and raise a "soft" stack limit to the "hard" maximum, so he appears to have been there before).
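A minimal sketch of that runtime-warning idea, assuming the needed size is computed from the real sizeof(job_t) * MAX_CPU_NUMBER (my own illustration, not the commented-out code in memory.c):

```c
/* Hypothetical startup check (illustration only): warn if the soft stack
 * limit looks too small for an on-stack job array of `needed_bytes`.
 * Caveat: RLIMIT_STACK governs the main thread and glibc's default pthread
 * stacks, but not threads whose creator set an explicit stack size (as the
 * JVM does), so this would not catch every problematic configuration. */
#include <stdio.h>
#include <sys/resource.h>

static void warn_if_stack_too_small(size_t needed_bytes) {
  struct rlimit rl;

  if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
    return;                         /* unlimited or unknown: nothing to say */

  if ((size_t)rl.rlim_cur < needed_bytes + (1u << 20))  /* keep 1 MiB headroom */
    fprintf(stderr,
            "OpenBLAS warning: stack limit %zu KiB may be too small for "
            "NUM_THREADS-sized buffers (%zu KiB needed)\n",
            (size_t)rl.rlim_cur / 1024, needed_bytes / 1024);
}

int main(void) {
  /* e.g. the roughly 2 MiB job array from a NUM_THREADS=128 build */
  warn_if_stack_too_small(2u << 20);
  return 0;
}
```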
Slightly off topic, but why does the Java VM/RE feel the need to mess with the stack size to begin with? Also, if Java can barge in and modify the stack size without any regard for other libraries, maybe OpenBLAS could do the same thing and increase the stack size to its liking, instead of conforming to the unreasonably small stack limit set by Java? Although this is somewhat hostile towards Java, and I have no idea whether it would cause mayhem on the Java side.
That is a good point. Increasing the stack to meet the Linux default would be the easiest approach here to avoid performance degradation under certain configurations. But in the mid-term, I think the best approach would be to move towards a heap-based memory pool: the best of both worlds. AFAIK, both BLIS and MKL use memory pools.
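For what it's worth, the pool idea boils down to something like the following minimal illustration (my own sketch, not BLIS, MKL, or OpenBLAS code): keep a lazily grown heap buffer per thread, so neither the stack limit nor repeated allocation cost is an issue.

```c
/* Minimal illustration of a per-thread heap "pool": the scratch buffer is
 * allocated once (and regrown on demand) per thread and then reused.
 * A real implementation would also release it at thread exit. */
#include <stdlib.h>

static __thread void  *scratch      = NULL;  /* thread-local, GCC/Clang extension */
static __thread size_t scratch_size = 0;

void *get_scratch(size_t bytes) {
  if (bytes > scratch_size) {
    free(scratch);
    scratch      = malloc(bytes);
    scratch_size = (scratch != NULL) ? bytes : 0;
  }
  return scratch;
}
```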
Back in the beginning, Java was one of the early mass-market frameworks with zillions of threads. And how do you justify "increasing the stack" for the 200-something threads Java creates in embedded or realtime contexts, where it is quite popular too?
Hmm. The only clean solution seems to be to reduce the default threshold to match the Java stack size (= failsafe for distribution packagers), and make it easily configurable through a build parameter so that anybody certain to never call it from Java can restore the old, slightly more efficient behaviour.
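The configurability part could be as simple as guarding the default, a sketch of the idea only (the concrete default value and the build plumbing are exactly what is being debated here):

```c
/* Sketch: keep a conservative default that stays well inside a 1 MiB
 * (Java-sized) thread stack, but allow a build to override it, e.g. with
 * something like CFLAGS="-DBLAS3_MEM_ALLOC_THRESHOLD=160" for users who are
 * certain never to load a JVM into the same process. The values 32 and 160
 * here are illustrative, not decided. */
#ifndef BLAS3_MEM_ALLOC_THRESHOLD
#define BLAS3_MEM_ALLOC_THRESHOLD 32
#endif
```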
Original issue description:
In Fedora, we set NUM_THREADS=128 for the openmp and threaded versions (see the spec file for reference; cc @susilehtola). Recently, we switched to openblas-openmp as the system-wide default BLAS/LAPACK implementation. Then we found out that a test in the octave-statistics package (canoncorr.m) is segfaulting (octave was previously using openblas-serial), and we have managed to narrow down the issue to this point so far. Here's a reproducible example with the current master branch: [example elided]
Any idea what could be happening here?