Segfault with large NUM_THREADS #2839


Closed
Enchufa2 opened this issue Sep 15, 2020 · 54 comments · Fixed by #2879

Comments

@Enchufa2

Enchufa2 commented Sep 15, 2020

In Fedora, we set NUM_THREADS=128 for the openmp and threaded versions (see spec file for reference; cc @susilehtola). Recently, we switched to openblas-openmp as the system-wide default BLAS/LAPACK implementation. Then, we found out that a test in the octave-statistics package (canoncorr.m) is segfaulting (octave was previously using openblas-serial), and we have managed to narrow down the issue to this point so far. Here's a reproducible example with the current master branch:

$ docker run --rm -it fedora:rawhide
$ dnf install -y octave-statistics make git perl-devel
$ CMD='octave -H -q --no-window-system --no-site-file --eval pkg("load","statistics");test("/usr/share/octave/packages/statistics-1.4.1/canoncorr.m");'
$ git clone https://github.com/xianyi/OpenBLAS && cd OpenBLAS
$ make USE_THREAD=1 USE_OPENMP=1 NUM_THREADS=128
$ LD_PRELOAD=$PWD/libopenblas.so.0 $CMD
Segmentation fault (core dumped)

but

$ make clean
$ make USE_THREAD=1 USE_OPENMP=1 NUM_THREADS=64
$ LD_PRELOAD=$PWD/libopenblas.so.0 $CMD
PASSES 7 out of 7 tests

Any idea what could be happening here?

@martin-frbg
Collaborator

The LD_PRELOAD reminds me of #1936 (and #2030): stacksize requirements of OpenBLAS "somehow" interfering with later thread creation in the program.

@Enchufa2
Author

In this case, it segfaults without LD_PRELOAD too; I only used it to simplify the example.

@martin-frbg
Collaborator

Even less of an idea then; I would need to see with gdb where it blows up. Is this with 0.3.10 or develop?

@Enchufa2
Author

Both. We detected this with the current release, but in the example above I checked out the current develop branch.

@Enchufa2
Author

I almost forgot... We collected this:

Thread 1 "octave-cli-5.2." received signal SIGSEGV, Segmentation fault.
0x00007ffff0f1eb19 in dsyrk_thread_UT (args=0x7fffffff8db0, range_m=0x0, range_n=0x0, sa=0x7fffb4d31000, sb=0x7fffb4e31000, mypos=0) at level3_syrk_threaded.c:508
508     int CNAME(blas_arg_t *args, BLASLONG *range_m, BLASLONG *range_n, FLOAT *sa, FLOAT *sb, BLASLONG mypos){
(gdb) bt
#0  0x00007ffff0f1eb19 in dsyrk_thread_UT (args=0x7fffffff8db0, range_m=0x0, range_n=0x0, sa=0x7fffb4d31000, sb=0x7fffb4e31000, mypos=0) at level3_syrk_threaded.c:508
#1  0x00007ffff0e2152f in dsyrk_ (UPLO=<optimized out>, TRANS=<optimized out>, N=<optimized out>, K=<optimized out>, alpha=<optimized out>, a=<optimized out>, ldA=0x7fffffff8ef0, beta=0x7fffffff8f20, c=0x555555f1faa0,
     ldC=0x7fffffff8ed8) at syrk.c:370
#2  0x00007ffff67ac436 in _Z5xgemmRK6MatrixS1_15blas_trans_typeS2_ (a=..., b=..., transa=blas_trans, transb=blas_no_trans) at liboctave/array/Array.h:582
#3  0x00007ffff75336ee in oct_binop_trans_mul (a1=..., a2=...) at libinterp/octave-value/ov-re-mat.cc:145
#4  0x00007ffff781ab44 in do_binary_op (ti=..., op=octave_value::op_herm_mul, v1=..., v2=...) at libinterp/octave-value/ov.cc:2407


(gdb) list
503     #endif
504
505       return 0;
506     }
507
508     int CNAME(blas_arg_t *args, BLASLONG *range_m, BLASLONG *range_n, FLOAT *sa, FLOAT *sb, BLASLONG mypos){
509
510       blas_arg_t newarg;
511
512     #ifndef USE_ALLOC_HEAP
(gdb) up
#1  0x00007ffff0e2152f in dsyrk_ (UPLO=<optimized out>, TRANS=<optimized out>, N=<optimized out>, K=<optimized out>, alpha=<optimized out>, a=<optimized out>, ldA=0x7fffffff8ef0, beta=0x7fffffff8f20, c=0x555555f1faa0,
     ldC=0x7fffffff8ed8) at syrk.c:370
370         (syrk[4 | (uplo << 1) | trans ])(&args, NULL, NULL, sa, sb, 0);
(gdb) up
#2  0x00007ffff67ac436 in _Z5xgemmRK6MatrixS1_15blas_trans_typeS2_ (a=..., b=..., transa=blas_trans, transb=blas_no_trans) at liboctave/array/Array.h:582
582       const T * data (void) const { return slice_data; }
(gdb) print a
$1 = (const class Matrix &) @0x7fffffff9020: {<NDArray> = {<MArray<double>> = {<Array<double>> = {_vptr.Array = 0x7ffff7f95750 <vtable for Matrix+16>, dimensions = {rep = 0x555555f04180}, rep = 0x555555f45dc0,
         slice_data = 0x555555f07200, slice_len = 20}, <No data fields>}, <No data fields>}, <No data fields>}
(gdb) print b
$2 = (const class Matrix &) @0x7fffffff8ff0: {<NDArray> = {<MArray<double>> = {<Array<double>> = {_vptr.Array = 0x7ffff7f95750 <vtable for Matrix+16>, dimensions = {rep = 0x555555f04180}, rep = 0x555555f45dc0,
         slice_data = 0x555555f07200, slice_len = 20}, <No data fields>}, <No data fields>}, <No data fields>}

@martin-frbg
Collaborator

Hmm. dsyrk_thread_UT again (issue #2821). Not sure yet what that is, as nothing changed in syrk recently.

@Enchufa2
Author

Enchufa2 commented Sep 16, 2020

In this process of changing the system-wide default to openblas-openmp, I had to rebuild every single BLAS/LAPACK consumer in Fedora, and I noticed a couple of random segfaults in tests in scipy as well, and another in clblast. I didn't dig further because they didn't trigger consistently and I had a ton of work to do, but in retrospect, they may be related to this and #2821. This segfault with octave is consistent, though.

@martin-frbg
Collaborator

Reproduced with the docker setup and built OpenBLAS with DEBUG=1, but I cannot get gdb to print a meaningful backtrace (?? for anything except libjvm.so).

@Enchufa2
Author

Mmmh... the backtrace above was collected without docker. I'm also unable to get the same one inside docker.

@Enchufa2
Author

And I'm unable outside docker too. I get "Backtrace stopped: previous frame inner to this frame (corrupt stack?)" and useless garbage. CC'ing @opoplawski, who provided the backtrace shown in my previous comment. We could use some help: is there any octave magic we should invoke?

@martin-frbg
Collaborator

bisecting to see if/when/why this ever used to work

@martin-frbg
Collaborator

No luck going back as far as 0.2.18 - gcc10 is hitting several long-resolved bugs, but after working around them the symptoms stay the same and the backtrace stays unusable. Trying to use valgrind instead of gdb only results in an immediate segfault on startup. I am a bit suspicious of the role of the Java VM lib - it is the only thing that is visible in the traces, and there have been various collisions with Java stack and memory size limits in the past.
The corrupted backtraces that I am getting all claim to start from "VM_Version::get_processor_features() () from /usr/lib/jvm/jre/lib/server/libjvm.so", so maybe there is already an incompatibility between Java and Docker, or between Java and OpenMP.

@Enchufa2
Author

Enchufa2 commented Sep 17, 2020

Note that the segfault also happens with openblas-threads, not only with OpenMP (and there are no issues with blis-openmp or blis-threads). Also, the segfault triggers in Fedora 31 (Java 8, gcc 9), Fedora 32 (Java 8, gcc 10) and 33/rawhide (Java 11, gcc 10). The fact that we've seen random segfaults with clblast, which doesn't involve Java, makes me think that maybe there is an underlying issue that Java, for some reason, makes blow up consistently.

@Enchufa2
Author

Enchufa2 commented Sep 17, 2020

Mystery solved. It turns out we are hitting this: https://bugzilla.redhat.com/show_bug.cgi?id=1572811. TL;DR: JVM uses SIGSEGV for stack overflow detection (in other words, JVM is giving us a hint that this is a stack overflow), and it is masking whatever happens next under gdb. So, just hit "continue" and you'll get a nice clean backtrace. :)

@martin-frbg
Collaborator

No chance to break out of the Java-induced SIGSEGV: gdb just gets another SIGSEGV in the same location after "continue", and so on ad infinitum (or at least for significantly more than the hypothetical 128 threads that should not even get instantiated on a less well endowed machine).

@martin-frbg
Collaborator

Seems the docker reproducer segfaults on my system with or without libopenblas preloaded, and never gets far enough to even enter level3_syrk_threaded.c under gdb with "handle 11 nostop noprint nopass".

@Enchufa2
Author

Enchufa2 commented Sep 17, 2020

Try this:

$ docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined fedora:32
$ dnf update -y && dnf install -y octave-statistics gdb openblas-*
$ CMD='octave -H -q --no-window-system --no-site-file --eval pkg("load","statistics");test("/usr/share/octave/packages/statistics-1.4.1/canoncorr.m");'
$ gdb --args env LD_PRELOAD=/usr/lib64/libopenblaso.so.0 $CMD
(gdb) r
...
(gdb) c
Continuing.
[New Thread 0x7fffbe122700 (LWP 1339)]
...

Thread 1 "octave-cli-5.2." received signal SIGSEGV, Segmentation fault.
0x00007ffff5ddb38d in dsyrk_thread_UT () from /usr/lib64/libopenblaso.so.0
(gdb) bt
#0  0x00007ffff5ddb38d in dsyrk_thread_UT () from /usr/lib64/libopenblaso.so.0
#1  0x00007ffff5cdc79f in dsyrk_ () from /usr/lib64/libopenblaso.so.0
#2  0x00007ffff41a677e in xgemm(Matrix const&, Matrix const&, blas_trans_type, blas_trans_type) ()
   from /usr/lib64/octave/5.2.0/liboctave.so.7
...

@Enchufa2
Author

Enchufa2 commented Sep 17, 2020

And if you add dnf debuginfo-install openblas-openmp at the beginning, the backtrace contains more detail, obviously.

@martin-frbg
Collaborator

Thanks, that's better. Unfortunately it still does not explain what happens - arguments go from totally sane in interface/syrk.c to unreadable garbage in level3_syrk_threaded.c as far as gdb is concerned, and valgrind is effectively neutralised by the java segfault. Retrying the bisect in the hope that it uncovers something this time.

@martin-frbg
Collaborator

Just managed to catch this from a NUM_THREADS=88 build now (which does not make much sense either; the m,n,k are 2,2,10 in the caller, i.e. dsyrk_UT). Tried building with gcc's ASan and UBSan, but they did not find anything (except a known problem in ctest's way of collating the openblas_utest modules).

Thread 1 "octave-cli-5.2." received signal SIGSEGV, Segmentation fault.
0x00007fffebeba154 in dsyrk_kernel_U (m=0, n=0, k=0, alpha_r=0, a=0x0, b=0x0, c=0x0, ldc=2, offset=0) at syrk_kernel.c:60
60                 FLOAT *a, FLOAT *b, FLOAT *c, BLASLONG ldc, BLASLONG offset){
(gdb) bt
#0  0x00007fffebeba154 in dsyrk_kernel_U (m=0, n=0, k=0, alpha_r=0, a=0x0, b=0x0, c=0x0, ldc=2, offset=0) at syrk_kernel.c:60
#1  0x00007fffebea8c75 in dsyrk_UT (args=0x7fffffff9910, range_m=0x0, range_n=0x0, sa=0x7ffe5d222000, sb=0x7ffe5d322000, dummy=0) at level3_syrk.c:231
#2  0x00007fffebf03b04 in dsyrk_thread_UT (args=0x7fffffff9910, range_m=0x0, range_n=0x0, sa=0x7ffe5d222000, sb=0x7ffe5d322000, mypos=0) at level3_syrk_threaded.c:532
#3  0x00007fffeb89f2ce in dsyrk_ (UPLO=0x7fffe4424fed "U", TRANS=0x7fffffff9a6c "T\177", N=0x7fffffff9a58, K=0x7fffffff9a54, alpha=0x7fffffff9ad0, a=0x60e000155600, 
    ldA=0x7fffffff9a70, beta=0x7fffffff9aa0, c=0x603000755dd0, ldC=0x7fffffff9a58) at syrk.c:370

@Enchufa2
Author

Makes sense if another thread is overwriting them. The question is how this happens and why with this NUM_THREADS and not with lower values.

@martin-frbg
Collaborator

It would be easier to understand if there were actually that many threads running (thinking of a low-probability race condition), but as the number of threads is capped at the hardware capability, NUM_THREADS should only size the GEMM buffer here.

@Enchufa2
Author

A race condition should trigger random failures. The consistency of the segfault suggests that threads are somehow being assigned a wrong position in the buffer, maybe? I don't know the internals of this buffer, so I'm just speculating here.

@martin-frbg
Collaborator

The buffer "only" contains the results of individual submatrix computations, any conflict there should not lead to overwriting of function arguments on the stack (I think).

@Enchufa2
Author

Enchufa2 commented Sep 18, 2020

I found that there's a BLAS3_MEM_ALLOC_THRESHOLD of 160 here. Then USE_ALLOC_HEAP is used if MAX_CPU_NUMBER > BLAS3_MEM_ALLOC_THRESHOLD here. This happens (because MAX_CPU_NUMBER = NUM_THREADS * 2) for NUM_THREADS = 128 and also in your case, NUM_THREADS = 88, but not for NUM_THREADS = 64 (which doesn't cause a segfault). Maybe this rings a bell?
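
For illustration, here is a minimal sketch of how such a compile-time switch between stack and heap allocation looks, as I understand it (BLAS3_MEM_ALLOC_THRESHOLD, USE_ALLOC_HEAP and MAX_CPU_NUMBER are the real macro names; the job_t layout and the driver function here are only assumptions for the sketch, not actual OpenBLAS code):

#include <stdlib.h>

#define BLAS3_MEM_ALLOC_THRESHOLD 160
#define MAX_CPU_NUMBER 128                  /* set from NUM_THREADS at build time */

#if MAX_CPU_NUMBER > BLAS3_MEM_ALLOC_THRESHOLD
#define USE_ALLOC_HEAP                      /* very large builds fall back to the heap */
#endif

typedef struct { long working[MAX_CPU_NUMBER][16]; } job_t;   /* illustrative layout only */

int syrk_thread_driver(void) {
#ifndef USE_ALLOC_HEAP
  job_t job[MAX_CPU_NUMBER];                /* lives on the caller's stack */
#else
  job_t *job = malloc(MAX_CPU_NUMBER * sizeof(job_t));
  if (job == NULL) return -1;
#endif
  (void)job;  /* the per-thread work queue would be set up in job[] here */
#ifdef USE_ALLOC_HEAP
  free(job);
#endif
  return 0;
}

int main(void) { return syrk_thread_driver(); }

With MAX_CPU_NUMBER = 128 the threshold is not exceeded, so the whole job array ends up on the stack.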

@martin-frbg
Collaborator

NUM_THREADS=85 gave no segfault, but I'll try playing with that threshold just in case. Maybe this is more related to gcc10 optimizing some already fragile code, and the bigger buffer from a high NUM_THREADS just changes the memory layout to make the consequences more interesting.

@Enchufa2
Author

Note that my example above is also reproducible if you switch from fedora:32 to fedora:31, which ships gcc9.

@martin-frbg
Collaborator

Same crash after a tenfold increase of BLAS3_MEM_ALLOC_THRESHOLD. Thread sanitizer found some unfriendly things to say about the sanity of level3_thread from a gemm perspective, but did not complain about the code path taken by syrk. (Unfortunately it refuses to run in the java/gdb context, so this was from running the BLAS tests.)

@martin-frbg
Collaborator

Actually decreasing the threshold (e.g. to 60) seems to make the crash go away, so it could be that it was the job array itself that is/was trashing the stack. Curiouser and curiouser...

@Enchufa2
Author

BTW, I incorrectly stated that MAX_CPU_NUMBER = NUM_THREADS * 2; it's actually MAX_CPU_NUMBER = NUM_THREADS, and USAGE.md says that NUM_BUFFERS = MAX_CPU_NUMBER * 2. But that's not true either.

@martin-frbg
Collaborator

Too early to celebrate, this could be a Heisenbug. I can also make it work by (only) increasing NUM_THREADS even further (208)...

@Enchufa2
Author

Actually, NUM_THREADS > 160 triggers heap allocation. Sorry for the confusion above.

@Enchufa2
Author

So this makes sense, right? MAX_CPU_NUMBER = NUM_THREADS = 128 seems to overflow the stack. If you decrease the threshold, heap allocation is used instead; if you increase NUM_THREADS even further, heap allocation is used again. If you decrease NUM_THREADS, stack allocation is used, but there is no overflow.
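
A rough back-of-the-envelope check of that reading (the layout and constants below are assumptions based on my reading of level3_thread.c, so treat the numbers as orders of magnitude only):

#include <stdio.h>

/* Hypothetical sizes for illustration only; the real job_t layout and
 * constants in OpenBLAS may differ. */
#define CACHE_LINE_SIZE 8
#define DIVIDE_RATE 2

static long job_array_bytes(long max_cpu) {
  /* one job_t per CPU, each holding a working[max_cpu][...] table of 8-byte entries */
  return max_cpu * max_cpu * CACHE_LINE_SIZE * DIVIDE_RATE * (long)sizeof(long);
}

int main(void) {
  printf("NUM_THREADS=64  -> ~%ld KiB on the stack\n", job_array_bytes(64) / 1024);
  printf("NUM_THREADS=128 -> ~%ld KiB on the stack\n", job_array_bytes(128) / 1024);
  return 0;
}

Under these assumptions the array is roughly 512 KiB at NUM_THREADS=64 and roughly 2 MiB at NUM_THREADS=128; the former fits even a ~1 MiB JVM-sized stack, while the latter only fits the usual 8 MiB Linux default.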

@Enchufa2
Author

Enchufa2 commented Sep 18, 2020

Behold:

$ _JAVA_OPTIONS="-Xss2048k" LD_PRELOAD=/lib64/libopenblaso.so.0 $CMD
Picked up _JAVA_OPTIONS: -Xss2048k
Segmentation fault (core dumped)
$ _JAVA_OPTIONS="-Xss4096k" LD_PRELOAD=/lib64/libopenblaso.so.0 $CMD
Picked up _JAVA_OPTIONS: -Xss4096k
PASSES 7 out of 7 tests

@martin-frbg
Collaborator

Ah, ok... this is where Java messes things up for us (again - there are examples in closed issues where the small default Java stack caused crashes). I had already wondered if/how/where to set Xss but did not think of an environment variable. Default stack on most(?) Linux distros is 8192k.

@Enchufa2
Author

Default stack on most(?) Linux distros is 8192k

I think so. And default stack on most(?) Java installations is 1024k AFAIK. I suppose that BLAS3_MEM_ALLOC_THRESHOLD was calculated for the default stack on Linux, and Java is messing with that. However, I'd say that a proper fix would be to have a threshold based on the actual stack size at runtime.

@martin-frbg
Collaborator

Might make sense to adjust behaviour according to a getrlimit call at runtime instead of relying on a fixed threshold, but I guess the whole point is to try to avoid the system call overhead. Alternatively, Octave could adjust their _JAVA_OPTIONS?
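
A minimal sketch of what such a runtime check could look like, assuming getrlimit(RLIMIT_STACK) is acceptable (the threshold and the 1/4 safety margin here are arbitrary illustrative choices, not OpenBLAS code):

#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: decide at startup whether a level-3 job array of
 * needed_bytes can safely live on the stack, given the current soft limit. */
static int stack_alloc_is_safe(size_t needed_bytes) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0) return 0;   /* be conservative on error */
  if (rl.rlim_cur == RLIM_INFINITY) return 1;
  return needed_bytes < rl.rlim_cur / 4;             /* leave room for everything else */
}

int main(void) {
  size_t job_bytes = 2u * 1024 * 1024;               /* e.g. the NUM_THREADS=128 case */
  printf("stack allocation %s\n",
         stack_alloc_is_safe(job_bytes) ? "looks safe" : "too risky, use the heap");
  return 0;
}

The call itself is cheap, but it would have to happen once at init time rather than on every BLAS call to keep the overhead negligible.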

@Enchufa2
Author

That's possible, yes, because octave integrates Java. But I'd bet #2821 is the same issue, and scipy knows nothing about Java. So whenever Java is involved, this could happen. And the same can be said whenever a user changes the stack size.

@martin-frbg
Collaborator

The trick is that #2821 only crashes when he loads libhdfs.so, which appears to be java-based... I have just copied our current understanding of the problem there.

@Enchufa2
Author

Yes, that's my point: octave can ensure that the proper stack size is set, but scipy doesn't use Java at all. But then, if someone uses scipy and another Java-based library in the same script, boom. Hadoop knows nothing about OpenBLAS requirements for the stack either. So in the end, it's the user's responsibility.

@martin-frbg
Collaborator

Could have sworn I had put it in the FAQ long ago (where few would read it anyway, I guess). Can't Java just go away and die out?

@Enchufa2
Author

Can't Java just go away and die out?

:D Cannot agree more. But here we are, and many distributed technologies, big data tools and other scientific stuff are Java-based. ¯\_(ツ)_/¯

@colinfang

Does Java modify the stack size of non-Java-spawned threads (the ones OpenBLAS creates)?

@Enchufa2
Author

Yes, it does. It doesn't matter which code path spawns those threads: once the JVM lives in the main process, new threads inherit those parameters.
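
One way to check this empirically (a throwaway diagnostic, not OpenBLAS code; pthread_getattr_np is glibc-specific and the program needs -pthread):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

/* Print the stack size of the calling thread; compare the output with and
 * without the JVM loaded into the process. */
static void *report_stack(void *arg) {
  pthread_attr_t attr;
  size_t stacksize = 0;
  (void)arg;
  if (pthread_getattr_np(pthread_self(), &attr) == 0 &&
      pthread_attr_getstacksize(&attr, &stacksize) == 0) {
    printf("thread stack size: %zu KiB\n", stacksize / 1024);
    pthread_attr_destroy(&attr);
  }
  return NULL;
}

int main(void) {
  pthread_t t;
  pthread_create(&t, NULL, report_stack, NULL);
  pthread_join(t, NULL);
  report_stack(NULL);   /* main thread, for comparison */
  return 0;
}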

@brada4
Contributor

brada4 commented Sep 18, 2020

Try MAX_STACK_ALLOC=0; it is something around the stack "one size fits all" code and may not play nicely with a stack that has been reduced by somebody else.

@Enchufa2
Author

I already tried, and it doesn't help.

@martin-frbg
Collaborator

I do not quite like the idea of a runtime check (portability and all). Perhaps reduce the default threshold to a value that is safe with the Java default, and at the same time make it configurable so that anybody who is certain not to use Java and wants every little gain in performance can restore the old behaviour (or even go beyond that with a suitable ulimit -s)? Back in the day everybody used to build the library for their specific needs, but now with binary distributions and indirect dependencies it may be better to err on the side of caution.

@Enchufa2
Author

With an upstream-developer hat on, I fully understand. With the Fedora hat on (pun intended), OpenBLAS already detects the number of CPUs available to set a proper number of threads, so I don't see how this should be different. :)

@martin-frbg
Collaborator

martin-frbg commented Sep 19, 2020

Point taken; at least Windows gets special treatment already, and getrlimit() should be sufficiently portable across anything unixoid.
BTW, the relevant previous commit is 32d2ca3 from seven years ago; before that, it used to be individual NUM_CORES-based limits in each affected source file (and incidentally the "Windows" value of the threshold is specifically for a 1 MB stack).

@martin-frbg
Collaborator

Hmm. Don't really see a clean way to go from a (compile-time) decision between allocating on either heap or stack to a run-time choice on startup (one that does not lose the advantages of stack allocation and does not clutter up the stack with effectively unused allocations either). Of course, what could be done easily is a runtime warning that the current stack is too small - there is commented-out (and most probably ineffective) code from xianyi's #221 in memory.c to try and raise a "soft" stack limit to the "hard" maximum, so he appears to have been there before.
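
For reference, raising the soft stack limit to the hard maximum at init time would look roughly like this (an illustrative sketch of the approach described above, not the actual memory.c code; note that already-created pthreads keep their stack size, and an embedded JVM may still impose its own, smaller stacks):

#include <stdio.h>
#include <sys/resource.h>

static void try_raise_stack_limit(void) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0) return;
  if (rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur < rl.rlim_max) {
    rl.rlim_cur = rl.rlim_max;                     /* raise soft limit to the hard one */
    if (setrlimit(RLIMIT_STACK, &rl) != 0)
      fprintf(stderr, "warning: could not raise the stack limit\n");
  }
}

int main(void) {
  try_raise_stack_limit();
  return 0;
}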

@Diazonium
Contributor

Slightly off topic, but why does the Java VM/RE feel the need to mess with the stack size to begin with? Also, if Java can barge in and modify the stack size without any regard for other libraries, maybe OpenBLAS could do the same thing and increase the stack size to its liking, instead of conforming to the unreasonably small stack limit set by Java? That is somewhat hostile towards Java, though, and I have no idea whether it would cause mayhem there.

@Enchufa2
Author

Enchufa2 commented Sep 21, 2020

That is a good point. Increasing the stack to meet the Linux default would be the easiest approach here to avoid performance degradation under certain configurations. But in the mid-term, I think the best approach would be to move towards a heap-based memory pool: the best of both worlds. AFAIK, both BLIS and MKL use memory pools.
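
A toy sketch of the idea, just to illustrate "allocate once on the heap, reuse across calls" (a fixed number of slots guarded by a mutex; nothing like the real BLIS/MKL pools, and the slot count and size are arbitrary):

#include <pthread.h>
#include <stdlib.h>

#define POOL_SLOTS 4
#define SLOT_BYTES (2u * 1024 * 1024)   /* e.g. room for a large job array */

static void *slots[POOL_SLOTS];
static int in_use[POOL_SLOTS];
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Grab a reusable heap buffer, allocating lazily on first use. */
static void *pool_acquire(void) {
  void *buf = NULL;
  pthread_mutex_lock(&pool_lock);
  for (int i = 0; i < POOL_SLOTS; i++) {
    if (!in_use[i]) {
      if (slots[i] == NULL) slots[i] = malloc(SLOT_BYTES);
      if (slots[i] != NULL) { in_use[i] = 1; buf = slots[i]; }
      break;
    }
  }
  pthread_mutex_unlock(&pool_lock);
  return buf;   /* NULL if the pool is exhausted or malloc failed */
}

static void pool_release(void *buf) {
  pthread_mutex_lock(&pool_lock);
  for (int i = 0; i < POOL_SLOTS; i++)
    if (slots[i] == buf) in_use[i] = 0;
  pthread_mutex_unlock(&pool_lock);
}

int main(void) {
  void *buf = pool_acquire();    /* would be used as scratch space for a level-3 call */
  pool_release(buf);
  return 0;
}

A real pool would also handle growth, NUMA placement and per-thread caching, but even this much removes both the stack-size question and the per-call malloc cost.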

@brada4
Contributor

brada4 commented Sep 21, 2020

Back in the beginning, Java was one of the early mass-market frameworks with zillions of threads. How do you justify "increasing the stack" for the 200-something threads Java creates in embedded or realtime contexts, where it is quite popular too?
The stack is not primary storage; it is an abstraction of the call chain inside primary storage, and Java's default of 32k ... 512k permits quite a bit of recursion.

@martin-frbg
Collaborator

Hmm. The only clean solution seems to be to reduce the default threshold to match the Java stacksize (= failsafe for distribution packagers), and make it easily configurable through a build parameter so that anybody certain to never call it from Java can restore the old, slightly more efficient behaviour.
