OpenBLAS hangs in a trivial DGESV call with USE_THREAD=1 on an 80-CPU box #1497
Can you say where it loops or hangs ? Does the problem go away when you build with USE_OPENMP=1 or USE_SIMPLE_THREADED_LEVEL3=1 (which would mean it is a thread safety problem that is just more likely to occur with more threads) ? |
First check, e.g. with 'top', whether it spins one, all, or 64 threads. The GDB line depends on what we are digging for. |
Let's try to isolate the code paths involved:
Feel free to sanitize hostnames in the typescript file, rename it to whatever.TXT, and upload it here. |
Hi. This is a shared machine that's usually highly utilized, so top isn't very clear on how many cores MY process is spinning on. Based on the backtrace though, I'm guessing it isn't very many at all (all but 2 threads are waiting on a condition). The backtrace is attached:
There's no debugging info apparently. Let me know if line numbers would make things much easier. I'm at the latest "develop" branch: 73c5ca7
If you want a core or something, I can get you that. Thanks! |
Thanks. There is only one occurrence of pthread_cond_wait in the entire code if I am not mistaken, so you could try with dodomorandi's patches from my PR #1486 (based on his suggestions in #660, basically change a handful of occurrences of "volatile" to "_Atomic", in particular one of them in getrf_parallel() that is implicated by your backtrace as well), or just do a |
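For context, the volatile-to-_Atomic idea being discussed boils down to the pattern below: one thread clears a completion flag, another polls it, and C11 atomics give the ordering/atomicity guarantee that plain volatile does not. This is an illustrative, self-contained sketch with made-up names, not the actual OpenBLAS declarations.

/* Illustrative sketch of the volatile -> _Atomic idea from PR #1486.
 * The names are invented; in OpenBLAS the flags live in the job[].working array. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long working = 1;   /* stands in for one job[].working slot */

static void *worker(void *arg) {
  (void)arg;
  /* ... do the assigned slice of work, then signal completion ... */
  atomic_store(&working, 0);
  return NULL;
}

int main(void) {
  pthread_t t;
  pthread_create(&t, NULL, worker, NULL);
  while (atomic_load(&working) != 0)
    ;                              /* busy-wait, much like getrf_parallel does */
  pthread_join(t, NULL);
  puts("flag observed as cleared");
  return 0;                        /* assumes a build like: gcc -pthread sketch.c */
}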
I cherry-picked the two patches in that PR, and I don't see a difference: the code consistently hangs. |
The weird thing is that with such a small problem size, there should not be a need for threads 2 to 80 anyway (which may be why they appear stuck in pthread_cond_wait, there is simply no work for them), so it is not clear to me what thread 1 should be yielding to. Not sure where the N<=65 barrier would come from either as OpenBLAS is supposed to work with 1024 cores (at least with BIGNUMA=1 set). What is the topology of your server ? If it is a NUMA architecture, the limit for default builds is 16 nodes (see start of driver/others/init.c) |
This machine is a fancy Xeon workstation: E5-2698. According to Intel this actually has 20 cores / 40 threads. I'm not entirely sure what that means, but Linux reports this as 80 discrete cores. |
I looked around with a debugger. Exactly two threads are spinning; the rest are waiting in pthread_cond_wait(). The two spinning threads are waiting for stuff too, albeit more proactively.
One is spinning here: https://github.com/xianyi/OpenBLAS/blob/develop/driver/others/blas_server.c#L760
Here queue->assigned = 15 and tsqq ends up as (blas_queue_t *) 0x1. That looks bogus. Is it bogus? Do I need to ask rr where that's coming from?
And the other here: https://github.com/xianyi/OpenBLAS/blob/develop/lapack/getrf/getrf_parallel.c#L340
Here i = 0, xxx = 0, mypos = 15, args->nthreads = 16 and job[mypos].working[i][CACHE_LINE_SIZE * xxx] is some pointer to something. |
OK, I'm at the commit mentioned previously: 73c5ca7, with the patches in the PR above cherry-picked. I added prints to all transitions of the variable we're polling in the busy loop in getrf_parallel.c:

diff --git a/lapack/getrf/getrf_parallel.c b/lapack/getrf/getrf_parallel.c
index 27faea0c..b67e5a1b 100644
--- a/lapack/getrf/getrf_parallel.c
+++ b/lapack/getrf/getrf_parallel.c
@@ -297,7 +297,11 @@ static int inner_advanced_thread(blas_arg_t *args, BLASLONG *range_m, BLASLONG *
MB;
for (i = 0; i < args -> nthreads; i++)
+ {
job[mypos].working[i][CACHE_LINE_SIZE * bufferside] = (BLASLONG)buffer[bufferside];
+ fprintf(stderr, "line %d: %d/%d/%d=0x%lx\n",__LINE__,mypos,i,bufferside,
+ job[mypos].working[i][CACHE_LINE_SIZE * bufferside]);
+ }
}
@@ -306,6 +310,7 @@ static int inner_advanced_thread(blas_arg_t *args, BLASLONG *range_m, BLASLONG *
if (m == 0) {
for (xxx = 0; xxx < DIVIDE_RATE; xxx++) {
job[mypos].working[mypos][CACHE_LINE_SIZE * xxx] = 0;
+ fprintf(stderr, "line %d: %d/%d/%d=0x%lx\n",__LINE__,mypos,mypos,xxx,0);
}
}
@@ -339,6 +344,7 @@ static int inner_advanced_thread(blas_arg_t *args, BLASLONG *range_m, BLASLONG *
MB;
if (is + min_i >= m) {
job[current].working[mypos][CACHE_LINE_SIZE * bufferside] = 0;
+ fprintf(stderr, "line %d: %d/%d/%d=0x%lx\n",__LINE__,current,mypos,bufferside,0);
}
}
@@ -348,9 +354,18 @@ static int inner_advanced_thread(blas_arg_t *args, BLASLONG *range_m, BLASLONG *
} while (current != mypos);
}
+ long count = 0;
for (i = 0; i < args -> nthreads; i++) {
for (xxx = 0; xxx < DIVIDE_RATE; xxx++) {
- while (job[mypos].working[i][CACHE_LINE_SIZE * xxx] ) {};
+ while (job[mypos].working[i][CACHE_LINE_SIZE * xxx] ) {
+ if(count++ > 1000000000L)
+ {
+ fprintf(stderr, "exiting. mypos/i/xxx: %d/%d/%d=0x%lx\n",
+ mypos, i, xxx,
+ job[mypos].working[i][CACHE_LINE_SIZE * xxx]);
+ exit(1);
+ }
+ };
}
}
@@ -563,6 +578,7 @@ blasint CNAME(blas_arg_t *args, BLASLONG *range_m, BLASLONG *range_n, FLOAT *sa,
for (i = 0; i < num_cpu; i++) {
for (k = 0; k < DIVIDE_RATE; k++) {
job[j].working[i][CACHE_LINE_SIZE * k] = 0;
+ fprintf(stderr, "line %d: %d/%d/%d=0x%lx\n",__LINE__,j,i,k,0);
}
}
}

And the resulting log looks like this:
I.e. we busyloop, waiting for the value at 15/0/0 to be 0, but that never happens. Immediately prior we set 15/everything/0 to something !=0, and reset ONLY 15/15/0. I don't know what is intended here. Is this enlightening for anybody? |
Not sure yet what is going on there, but could you try with this modification of lapack/getrf/getrf_parallel.c (had to rename it to .txt so that github accepts it) that at least fixes the helgrind warnings ? |
Nope, still hangs with that version of getrf_parallel.c. |
Pity. Hanging in the same location, probably? (And did you do a full rebuild, i.e. starting from make clean?) |
I did a very clean rebuild, and yes, it hangs in the same place. Anything more I can do? |
Not right now - "your" hang seems to happen in a different place from "my" helgrind error, and I cannot reproduce the problem with my small machine even if I throw 80 threads at it. The fix may still look quite similar, just by putting the pthread lock around the job[].working - but at the moment all my tries seem to introduce hangs rather than fix anything... |
OK. I tried to instrument with rr to get more clarity, but that squashed the bug. This is consistent with what you saw: true parallelism is required to trigger this; simply increasing the number of threads isn't enough. |
According to Intel ARK, your E5-2698 v4 processor has fancy cluster-on-die technology, so you may get 2 NUMA nodes per socket. Can you check with numactl -H? |
Perhaps adding -fsanitize=thread to the build would tell us something? |
The build system isn't cooperating. Is there some "correct" way to add the sanitizer flags to the compiler AND the linker? |
if it does not take the CFLAGS, I guess you could just uncomment the COMMON_OPT line in Makefile.rule and add it after the -O2 |
It takes the CFLAGS, but there's a required link library, and no obvious variable that takes a -ltsan |
Does it work when you use LD_FLAGS? |
No. LD_FLAGS isn't something this build system has. LDLIBS is what the right thing is usually called, but we don't use that either. There's an LDFLAGS, but that applies to C and fortran, and the latter barfs. Do you ever build tsan into openblas? If not, then I can figure it out; just not right this second. |
Got it now. Actually you need to add the -fsanitize=thread to the FFLAGS as well, as it uses gfortran to invoke the linker. And make sure that you actually have /usr/lib64/libtsan.so, your distro may supply this in a separate libtsan or libtsan0 package... |
Yep, I figured it out at the same time: fedora was being a silly distro as usual. I can build with this:
But the sample program now crashes in the libtsan initialization. It doesn't look like this crash is related to OpenBLAS at all. I really need to go do something else now; let me look at this again a bit later. |
Oh, never mind. tsan was being unfriendly too. I can now build and run. The tests were failing because of tsan (maybe indicative of the failure in this issue), so I disabled them. My test application now says this:
I'm here: 73c5ca7
With these patches applied: |
Thank you. This appears to be without my proposed changes to getrf_parallel.c, was that intentional after seeing that they did not appear to fix anything ? |
Not especially intentional. We just had several versions flying around, and it wasn't clear which changes were important, and which were random experiments. I can try with the changes. Are we talking about these: #1497 (comment) ? |
Yes, that version at least seems to have silenced the warnings from both helgrind and thread-sanitizer (on my system that does not show the full symptoms of your problem) |
My only idea at the moment is to add locks around the few remaining code sections that involve the job array in that file. Just to be sure - you build on the same 80-core machine where execution fails, not on something with fewer cores ? And did you try with USE_OPENMP=1, or is there something about your particular use case that would preclude this ? |
getrf_parallel.txt |
I am building on the same machine where I'm running the tests, yes. And I haven't tried with any configurations other than the one mentioned above. The reason is that this is the configuration used by fedora when they build their packages. Originally I hit this bug when using the packages, and I wanted to reproduce that configuration. If it's useful, I can check OPENMP=1, or anything else you'd like. |
I fail to find your configuration among the ones built by fedora. |
I don't exactly remember, but it looks close to the 'threaded64' build, except I'm also passing BUILD64=1, and not passing the extra flags that fedora is. I don't remember exactly why I'm adding BUILD64=1. Is that wrong? |
Oh, and the resulting fedora packages were 100% showing this issue; that's why I started chasing it in the first place. |
Try to build without any extra flags. |
BUILD64 is not used anywhere in the OpenBLAS build system, so it should not matter; BINARY=64, as you mentioned in your first post, should be the default anyway. |
I just tried your latest getrf_parallel, and it fails, as before. Setting USE_OPENMP=1 "fixes" it: the test program successfully exits. |
Maybe all the helgrind and tsan warnings were a red herring, and it is indeed a cpu mask or similar overflowing somewhere in blas_server.c. (The obvious drawback is that it will probably be hard to find and fix this on a lesser machine.) At least with OPENMP, "conventional wisdom" is confirmed and you should have a reliable solution for your actual use case. |
One quick thing to try would be to unconditionally set USE_ALLOC_HEAP at the start of getrf_parallel.c (where it now depends on number of cpus exceeding a fixed value from common.h). |
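For clarity, "unconditionally" here amounts to forcing the define regardless of the cpu-count guard. A minimal sketch (the exact guard and threshold names in common.h may differ between versions):

/* Near the top of lapack/getrf/getrf_parallel.c, force heap allocation of the
 * job array instead of leaving it on the stack (sketch; normally this is only
 * defined when the configured cpu count exceeds a threshold from common.h). */
#ifndef USE_ALLOC_HEAP
#define USE_ALLOC_HEAP
#endif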
I already have a workaround for my actual use case: OPENBLAS_NUM_THREADS=N for N <= 65. I'm here purely to help you debug. Defining USE_ALLOC_HEAP unconditionally doesn't appear to make a difference: the test program hangs. |
For exact tests, could you check which of the fedora packages fail (-lopenblas / -lopenblasp / -lopenblaso after installing openblas-devel, and also -lblas -llapack for the generic versions that should not be affected by CPU arch or count)? |
@brada4 my understanding is that he is tracking this down via direct builds from source, so I do not see what the fedora packages have to do with it (and surely their p and o variants are just builds with pthreads and/or openmp). With the -lblas -llapack, are you suggesting that the test code may fail on his machine even when linked against netlib blas/lapack (which would not even multithread)?? |
There was a statement that the fedora package hangs, and the same parameters keep coming up. It would be nice to know whether it is an omp vs. pthread issue or not... |
All the tests in this thread were from source builds. We already know that the direct pthread-based implementation is at fault somehow, while the openmp one works. |
I'm trying to reproduce this with a fake (qemu) 80-core system now in the hope that there is some subtle difference between building for that many threads and that many cpus. We'll see if valgrind catches anything useful. Trouble is qemu is fairly slow... |
I just tried to reproduce on an AWS box with 36 cores (that's what I had available). This was instance type: c5.9xlarge. Sadly, this box wasn't enough to trigger the bug. |
The fake box does trigger it, but valgrind remains silent (and helgrind as well, if using my latest patched version of getrf_parallel.c). So this is probably a logic bug and not a simple overflow. |
Martin Kroeker <[email protected]> writes:
> The fake box does trigger it, but valgrind remains silent (and helgrind
> as well, if using my latest patched version of getrf_parallel.c). So this
> is probably a logic bug and not a simple overflow.
Oh, so you're able to reproduce the failure then? That's great progress, even if the emulation is slow. |
Hmm, I see. That would explain the behaviour change on the 81st thread... |
Actually behaviour "changes" beyond 64 or 65, but stack allocation of the job array could play a role. Another thing to try is to revert wernsaar's changes made in preparation for non-power of two UNROLL. |
Seems to me that blas_quickdivide from https://github.com/xianyi/OpenBLAS/blob/07ed01e97f1cb03c7f3820b00c601cc634fc1b89/lapack/getrf/getrf_parallel.c#L503 is the problem: its return value is used there without any verification. |
I just tested this on my box, and indeed the issue is gone. Thank you very much! |
Thank you very much for the feedback, I have merged this change now. Similar unverified use of the value returned by blas_quickdivide seems to occur in driver/others/blas_l1_thread.c and threaded level3 GEMM. |
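To make the failure mode concrete, here is a toy program (not OpenBLAS code; the loop and numbers are invented) showing how a quick-divide result of zero, used unchecked as a partition width, turns a work-splitting loop into an infinite loop:

/* Toy illustration only: dividing a small problem size by a large thread
 * count yields 0, and an unchecked width of 0 never advances the loop. */
#include <stdio.h>

int main(void) {
  int n = 63, nthreads = 80;
  int width = n / nthreads;        /* 0 whenever nthreads > n */
  if (width == 0) width = 1;       /* the kind of check the merged change presumably adds */

  for (int start = 0; start < n; start += width)
    ;                              /* without the check above, start never moves */

  printf("partition width = %d\n", width);
  return 0;
}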
Hi. I have a machine with 80 cores, and a trivial invocation of DGESV goes into an infinite loop with OpenBLAS. There's something about the core count that's making it break; maybe some internal buffer is being overrun or something. This is on Fedora 27, but I suspect it doesn't really matter. Recipe:
make USE_THREAD=1 USE_OPENMP=0 INTERFACE64=0 BINARY=64
gcc -o /tmp/inv /tmp/inv.c libopenblas*.so(.) -Wl,-rpath=$PWD && /tmp/inv
On this 80-core machine this command goes into an infinite loop, and never exits. Apparently setting
OPENBLAS_NUM_THREADS=N
for N <= 65 makes the issue go away. I tried to find the issue myself, but it wasn't trivial, and hopefully the devs can do this much faster. This failure is very consistent here, so let me know if you need more debugging data. Thanks!
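For reference, a trivial DGESV caller along the lines of the /tmp/inv.c in the recipe might look like the sketch below (the original source is not shown in this thread; the matrix size and contents here are arbitrary assumptions). It can be built with the gcc line from the recipe.

/* Hypothetical reconstruction of a "trivial DGESV call"; not the actual /tmp/inv.c.
 * Solves A*x = b for a diagonally dominant A via the Fortran symbol dgesv_
 * exported by libopenblas (32-bit integers, matching INTERFACE64=0). */
#include <stdio.h>
#include <stdlib.h>

extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void) {
  int n = 500, nrhs = 1, info = -1;
  double *A = malloc(sizeof(double) * (size_t)n * n);
  double *b = malloc(sizeof(double) * (size_t)n);
  int *ipiv = malloc(sizeof(int) * (size_t)n);

  /* Column-major, diagonally dominant matrix: well conditioned, easy to check. */
  for (int j = 0; j < n; j++) {
    for (int i = 0; i < n; i++) A[i + (size_t)n * j] = (i == j) ? n : 1.0;
    b[j] = 1.0;
  }

  dgesv_(&n, &nrhs, A, &n, ipiv, b, &n, &info);  /* the call reported to hang on the 80-core box */
  printf("dgesv info = %d, x[0] = %g\n", info, b[0]);

  free(A); free(b); free(ipiv);
  return info;
}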