OpenBLAS hangs in a trivial DGESV call with USE_THREAD=1 on an 80-CPU box #1497

Closed
dkogan opened this issue Mar 23, 2018 · 60 comments

@dkogan

dkogan commented Mar 23, 2018

Hi. I have a machine with 80 cores, and a trivial invocation of DGESV goes into an infinite loop with OpenBLAS. Something about the core count is making it break; maybe some internal buffer is being overrun. This is on Fedora 27, but I suspect that doesn't really matter. Recipe:

  1. Check out OpenBLAS from git. I see the failure in the master, develop and release-0.2.21 branches.
  2. Build thusly: make USE_THREAD=1 USE_OPENMP=0 INTERFACE64=0 BINARY=64
  3. Save this into /tmp/inv.c (GitHub is refusing to accept this as an upload):
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int dgesv_(int *n, int *nrhs, double *a,
           int *lda, int *ipiv, double *b, int *ldb, int *info);

int main(int argc, char* argv[])
{
    int N = 25;

    double* m1 = calloc(N,N*sizeof(double));
    double* m2 = calloc(N,N*sizeof(double));
    assert(m1);
    assert(m2);
    for(int i=0; i<N; i++)
        m1[i*(N+1)] = m2[i*(N+1)] = 1.0;

    int ipiv[N];
    int info;

    dgesv_( &N, &N, m1, &N, ipiv, m2, &N, &info);
    printf("info: %d\n", info);
    return 0;
}
  4. Build and run this source against the OpenBLAS we just built. I do this in zsh: gcc -o /tmp/inv /tmp/inv.c libopenblas*.so(.) -Wl,-rpath=$PWD && /tmp/inv

On this 80-core machine this command goes into an infinite loop, and never exits. Apparently setting OPENBLAS_NUM_THREADS=N for N <= 65 makes the issue go away. I tried to find the issue myself, but it wasn't trivial, and hopefully the devs can do this much faster. This failure is very consistent here, so let me know if you need more debugging data. Thanks!
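The workaround I'm using in the meantime looks like this (just an illustration; 64 is one value at or below the threshold that works here):

OPENBLAS_NUM_THREADS=64 /tmp/inv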

@martin-frbg
Collaborator

Can you say where it loops or hangs? Does the problem go away when you build with USE_OPENMP=1 or USE_SIMPLE_THREADED_LEVEL3=1 (which would mean it is a thread-safety problem that is just more likely to occur with more threads)?
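Concretely, I mean something like the following, keeping the rest of your flags unchanged (just a sketch):

make clean && make USE_THREAD=1 USE_OPENMP=1 INTERFACE64=0 BINARY=64
make clean && make USE_THREAD=1 USE_OPENMP=0 USE_SIMPLE_THREADED_LEVEL3=1 INTERFACE64=0 BINARY=64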

@brada4
Contributor

brada4 commented Mar 24, 2018

First check (e.g. with 'top') whether it spins in one thread, all of them, or 64. The GDB procedure depends on what we are digging for.

@brada4
Contributor

brada4 commented Mar 24, 2018

Let's try to isolate the code paths involved:
determine the affected PID (top, ps, etc.)

$ script
$ gdb
gdb> attach PID
gdb> thread apply all backtrace
<interesting stuff here>
gdb> detach
gdb> quit
$ exit
The output is saved in a file named 'typescript'.

Feel free to sanitize hostnames in the typescript file, rename it to whatever.txt, and upload it here.
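If that is more convenient, a non-interactive one-liner should produce roughly the same thing without the script session (PID as above):

gdb -p PID -batch -ex 'thread apply all backtrace' > backtrace.txt 2>&1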

@dkogan
Author

dkogan commented Mar 24, 2018

Hi. This is a shared machine that's usually highly utilized, so top isn't very clear about how many cores MY process is spinning on. Based on the backtrace though, I'm guessing it isn't very many at all (all but 2 threads are waiting on a condition).

The backtrace is attached:

OpenBLAS.hang.backtrace.txt

There's no debugging info apparently. Let me know if line numbers would make things much easier.

I'm at the latest "develop" branch: 73c5ca7

If you want a core or something, I can get you that.

Thanks!

@martin-frbg
Collaborator

Thanks. There is only one occurrence of pthread_cond_wait in the entire code if I am not mistaken, so line numbers are probably not needed (though perhaps a build with -fsanitize=thread would provide additional information).

You could try with dodomorandi's patches from my PR #1486 (based on his suggestions in #660, basically changing a handful of occurrences of "volatile" to "_Atomic", in particular one of them in getrf_parallel() that is implicated by your backtrace as well), or just do a git clone from his fork (mentioned in #660). Best I can muster at the moment is a lowly 6-core i7-8700K, which does not suffice to reproduce the hang (though valgrind/helgrind are not too happy either, at least with an unpatched develop).

@dkogan
Author

dkogan commented Mar 24, 2018

I cherry-picked the two patches in that PR, and I don't see a difference: the code consistently hangs.

@martin-frbg
Collaborator

The weird thing is that with such a small problem size, there should not be a need for threads 2 to 80 anyway (which may be why they appear stuck in pthread_cond_wait: there is simply no work for them), so it is not clear to me what thread 1 should be yielding to. Not sure where the N<=65 barrier would come from either, as OpenBLAS is supposed to work with 1024 cores (at least with BIGNUMA=1 set). What is the topology of your server? If it is a NUMA architecture, the limit for default builds is 16 nodes (see the start of driver/others/init.c).

@dkogan
Author

dkogan commented Mar 24, 2018

This machine is a fancy Xeon workstation: E5-2698. According to Intel this actually has 20 cores with 40 threads each. I'm not entirely sure what that means, but Linux reports this as 80 discrete cores.

@dkogan
Author

dkogan commented Mar 25, 2018

I looked around with a debugger. Exactly two threads are spinning, the rest are waiting in pthread_cond_wait(). The two spinning threads are waiting for stuff too, albeit more proactively:

One is spinning here: https://github.com/xianyi/OpenBLAS/blob/develop/driver/others/blas_server.c#L760

Here queue->assigned = 15 and tsqq ends up as (blas_queue_t *) 0x1. That looks bogus. Is it bogus? Do I need to ask rr where that's coming from?

And the other here: https://github.com/xianyi/OpenBLAS/blob/develop/lapack/getrf/getrf_parallel.c#L340

Here i = 0, xxx = 0, mypos=15, args->nthreads=16 and job[mypos].working[i][CACHE_LINE_SIZE * xxx] is some pointer to something.
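For reference, the getrf_parallel.c loop that second thread is sitting in is the final wait at the end of inner_advanced_thread() (the same loop I instrument in my next comment):

  for (i = 0; i < args -> nthreads; i++) {
    for (xxx = 0; xxx < DIVIDE_RATE; xxx++) {
      while (job[mypos].working[i][CACHE_LINE_SIZE * xxx] ) {};
    }
  }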

@dkogan
Author

dkogan commented Mar 25, 2018

OK, I'm at the commit mentioned previously: 73c5ca7, with the patches in the PR above cherry-picked.

I added prints at every write to the variable we're polling in the busy loop in getrf_parallel.c:

diff --git a/lapack/getrf/getrf_parallel.c b/lapack/getrf/getrf_parallel.c
index 27faea0c..b67e5a1b 100644
--- a/lapack/getrf/getrf_parallel.c
+++ b/lapack/getrf/getrf_parallel.c
@@ -297,7 +297,11 @@ static int inner_advanced_thread(blas_arg_t *args, BLASLONG *range_m, BLASLONG *
 
     MB;
     for (i = 0; i < args -> nthreads; i++)
+    {
       job[mypos].working[i][CACHE_LINE_SIZE * bufferside] = (BLASLONG)buffer[bufferside];
+      fprintf(stderr, "line %d: %d/%d/%d=0x%lx\n",__LINE__,mypos,i,bufferside,
+              job[mypos].working[i][CACHE_LINE_SIZE * bufferside]);
+    }
 
   }
 
@@ -306,6 +310,7 @@ static int inner_advanced_thread(blas_arg_t *args, BLASLONG *range_m, BLASLONG *
   if (m == 0) {
     for (xxx = 0; xxx < DIVIDE_RATE; xxx++) {
       job[mypos].working[mypos][CACHE_LINE_SIZE * xxx] = 0;
+      fprintf(stderr, "line %d: %d/%d/%d=0x%lx\n",__LINE__,mypos,mypos,xxx,0);
     }
   }
 
@@ -339,6 +344,7 @@ static int inner_advanced_thread(blas_arg_t *args, BLASLONG *range_m, BLASLONG *
          MB;
          if (is + min_i >= m) {
            job[current].working[mypos][CACHE_LINE_SIZE * bufferside] = 0;
+            fprintf(stderr, "line %d: %d/%d/%d=0x%lx\n",__LINE__,current,mypos,bufferside,0);
          }
        }
 
@@ -348,9 +354,18 @@ static int inner_advanced_thread(blas_arg_t *args, BLASLONG *range_m, BLASLONG *
       } while (current != mypos);
   }
 
+  long count = 0;
   for (i = 0; i < args -> nthreads; i++) {
     for (xxx = 0; xxx < DIVIDE_RATE; xxx++) {
-      while (job[mypos].working[i][CACHE_LINE_SIZE * xxx] ) {};
+      while (job[mypos].working[i][CACHE_LINE_SIZE * xxx] ) {
+          if(count++ > 1000000000L)
+          {
+              fprintf(stderr, "exiting. mypos/i/xxx: %d/%d/%d=0x%lx\n",
+                      mypos, i, xxx,
+                      job[mypos].working[i][CACHE_LINE_SIZE * xxx]);
+              exit(1);
+          }
+      };
     }
   }
 
@@ -563,6 +578,7 @@ blasint CNAME(blas_arg_t *args, BLASLONG *range_m, BLASLONG *range_n, FLOAT *sa,
        for (i = 0; i < num_cpu; i++) {
          for (k = 0; k < DIVIDE_RATE; k++) {
            job[j].working[i][CACHE_LINE_SIZE * k] = 0;
+            fprintf(stderr, "line %d: %d/%d/%d=0x%lx\n",__LINE__,j,i,k,0);
          }
        }
       }

And the resulting log looks like this:

line 581: 0/0/0=0x0
line 581: 0/0/1=0x0
line 581: 0/1/0=0x0
line 581: 0/1/1=0x0
line 581: 0/2/0=0x0
line 581: 0/2/1=0x0
line 581: 0/3/0=0x0
line 581: 0/3/1=0x0
line 581: 0/4/0=0x0
line 581: 0/4/1=0x0
line 581: 0/5/0=0x0
line 581: 0/5/1=0x0
line 581: 0/6/0=0x0
line 581: 0/6/1=0x0
line 581: 0/7/0=0x0
line 581: 0/7/1=0x0
line 581: 0/8/0=0x0
line 581: 0/8/1=0x0
line 581: 0/9/0=0x0
line 581: 0/9/1=0x0
line 581: 0/10/0=0x0
line 581: 0/10/1=0x0
line 581: 0/11/0=0x0
line 581: 0/11/1=0x0
line 581: 0/12/0=0x0
line 581: 0/12/1=0x0
line 581: 0/13/0=0x0
line 581: 0/13/1=0x0
line 581: 0/14/0=0x0
line 581: 0/14/1=0x0
line 581: 0/15/0=0x0
line 581: 0/15/1=0x0
line 581: 1/0/0=0x0
line 581: 1/0/1=0x0
line 581: 1/1/0=0x0
line 581: 1/1/1=0x0
line 581: 1/2/0=0x0
line 581: 1/2/1=0x0
line 581: 1/3/0=0x0
line 581: 1/3/1=0x0
line 581: 1/4/0=0x0
line 581: 1/4/1=0x0
line 581: 1/5/0=0x0
line 581: 1/5/1=0x0
line 581: 1/6/0=0x0
line 581: 1/6/1=0x0
line 581: 1/7/0=0x0
line 581: 1/7/1=0x0
line 581: 1/8/0=0x0
line 581: 1/8/1=0x0
line 581: 1/9/0=0x0
line 581: 1/9/1=0x0
line 581: 1/10/0=0x0
line 581: 1/10/1=0x0
line 581: 1/11/0=0x0
line 581: 1/11/1=0x0
line 581: 1/12/0=0x0
line 581: 1/12/1=0x0
line 581: 1/13/0=0x0
line 581: 1/13/1=0x0
line 581: 1/14/0=0x0
line 581: 1/14/1=0x0
line 581: 1/15/0=0x0
line 581: 1/15/1=0x0
line 581: 2/0/0=0x0
line 581: 2/0/1=0x0
line 581: 2/1/0=0x0
line 581: 2/1/1=0x0
line 581: 2/2/0=0x0
line 581: 2/2/1=0x0
line 581: 2/3/0=0x0
line 581: 2/3/1=0x0
line 581: 2/4/0=0x0
line 581: 2/4/1=0x0
line 581: 2/5/0=0x0
line 581: 2/5/1=0x0
line 581: 2/6/0=0x0
line 581: 2/6/1=0x0
line 581: 2/7/0=0x0
line 581: 2/7/1=0x0
line 581: 2/8/0=0x0
line 581: 2/8/1=0x0
line 581: 2/9/0=0x0
line 581: 2/9/1=0x0
line 581: 2/10/0=0x0
line 581: 2/10/1=0x0
line 581: 2/11/0=0x0
line 581: 2/11/1=0x0
line 581: 2/12/0=0x0
line 581: 2/12/1=0x0
line 581: 2/13/0=0x0
line 581: 2/13/1=0x0
line 581: 2/14/0=0x0
line 581: 2/14/1=0x0
line 581: 2/15/0=0x0
line 581: 2/15/1=0x0
line 581: 3/0/0=0x0
line 581: 3/0/1=0x0
line 581: 3/1/0=0x0
line 581: 3/1/1=0x0
line 581: 3/2/0=0x0
line 581: 3/2/1=0x0
line 581: 3/3/0=0x0
line 581: 3/3/1=0x0
line 581: 3/4/0=0x0
line 581: 3/4/1=0x0
line 581: 3/5/0=0x0
line 581: 3/5/1=0x0
line 581: 3/6/0=0x0
line 581: 3/6/1=0x0
line 581: 3/7/0=0x0
line 581: 3/7/1=0x0
line 581: 3/8/0=0x0
line 581: 3/8/1=0x0
line 581: 3/9/0=0x0
line 581: 3/9/1=0x0
line 581: 3/10/0=0x0
line 581: 3/10/1=0x0
line 581: 3/11/0=0x0
line 581: 3/11/1=0x0
line 581: 3/12/0=0x0
line 581: 3/12/1=0x0
line 581: 3/13/0=0x0
line 581: 3/13/1=0x0
line 581: 3/14/0=0x0
line 581: 3/14/1=0x0
line 581: 3/15/0=0x0
line 581: 3/15/1=0x0
line 581: 4/0/0=0x0
line 581: 4/0/1=0x0
line 581: 4/1/0=0x0
line 581: 4/1/1=0x0
line 581: 4/2/0=0x0
line 581: 4/2/1=0x0
line 581: 4/3/0=0x0
line 581: 4/3/1=0x0
line 581: 4/4/0=0x0
line 581: 4/4/1=0x0
line 581: 4/5/0=0x0
line 581: 4/5/1=0x0
line 581: 4/6/0=0x0
line 581: 4/6/1=0x0
line 581: 4/7/0=0x0
line 581: 4/7/1=0x0
line 581: 4/8/0=0x0
line 581: 4/8/1=0x0
line 581: 4/9/0=0x0
line 581: 4/9/1=0x0
line 581: 4/10/0=0x0
line 581: 4/10/1=0x0
line 581: 4/11/0=0x0
line 581: 4/11/1=0x0
line 581: 4/12/0=0x0
line 581: 4/12/1=0x0
line 581: 4/13/0=0x0
line 581: 4/13/1=0x0
line 581: 4/14/0=0x0
line 581: 4/14/1=0x0
line 581: 4/15/0=0x0
line 581: 4/15/1=0x0
line 581: 5/0/0=0x0
line 581: 5/0/1=0x0
line 581: 5/1/0=0x0
line 581: 5/1/1=0x0
line 581: 5/2/0=0x0
line 581: 5/2/1=0x0
line 581: 5/3/0=0x0
line 581: 5/3/1=0x0
line 581: 5/4/0=0x0
line 581: 5/4/1=0x0
line 581: 5/5/0=0x0
line 581: 5/5/1=0x0
line 581: 5/6/0=0x0
line 581: 5/6/1=0x0
line 581: 5/7/0=0x0
line 581: 5/7/1=0x0
line 581: 5/8/0=0x0
line 581: 5/8/1=0x0
line 581: 5/9/0=0x0
line 581: 5/9/1=0x0
line 581: 5/10/0=0x0
line 581: 5/10/1=0x0
line 581: 5/11/0=0x0
line 581: 5/11/1=0x0
line 581: 5/12/0=0x0
line 581: 5/12/1=0x0
line 581: 5/13/0=0x0
line 581: 5/13/1=0x0
line 581: 5/14/0=0x0
line 581: 5/14/1=0x0
line 581: 5/15/0=0x0
line 581: 5/15/1=0x0
line 581: 6/0/0=0x0
line 581: 6/0/1=0x0
line 581: 6/1/0=0x0
line 581: 6/1/1=0x0
line 581: 6/2/0=0x0
line 581: 6/2/1=0x0
line 581: 6/3/0=0x0
line 581: 6/3/1=0x0
line 581: 6/4/0=0x0
line 581: 6/4/1=0x0
line 581: 6/5/0=0x0
line 581: 6/5/1=0x0
line 581: 6/6/0=0x0
line 581: 6/6/1=0x0
line 581: 6/7/0=0x0
line 581: 6/7/1=0x0
line 581: 6/8/0=0x0
line 581: 6/8/1=0x0
line 581: 6/9/0=0x0
line 581: 6/9/1=0x0
line 581: 6/10/0=0x0
line 581: 6/10/1=0x0
line 581: 6/11/0=0x0
line 581: 6/11/1=0x0
line 581: 6/12/0=0x0
line 581: 6/12/1=0x0
line 581: 6/13/0=0x0
line 581: 6/13/1=0x0
line 581: 6/14/0=0x0
line 581: 6/14/1=0x0
line 581: 6/15/0=0x0
line 581: 6/15/1=0x0
line 581: 7/0/0=0x0
line 581: 7/0/1=0x0
line 581: 7/1/0=0x0
line 581: 7/1/1=0x0
line 581: 7/2/0=0x0
line 581: 7/2/1=0x0
line 581: 7/3/0=0x0
line 581: 7/3/1=0x0
line 581: 7/4/0=0x0
line 581: 7/4/1=0x0
line 581: 7/5/0=0x0
line 581: 7/5/1=0x0
line 581: 7/6/0=0x0
line 581: 7/6/1=0x0
line 581: 7/7/0=0x0
line 581: 7/7/1=0x0
line 581: 7/8/0=0x0
line 581: 7/8/1=0x0
line 581: 7/9/0=0x0
line 581: 7/9/1=0x0
line 581: 7/10/0=0x0
line 581: 7/10/1=0x0
line 581: 7/11/0=0x0
line 581: 7/11/1=0x0
line 581: 7/12/0=0x0
line 581: 7/12/1=0x0
line 581: 7/13/0=0x0
line 581: 7/13/1=0x0
line 581: 7/14/0=0x0
line 581: 7/14/1=0x0
line 581: 7/15/0=0x0
line 581: 7/15/1=0x0
line 581: 8/0/0=0x0
line 581: 8/0/1=0x0
line 581: 8/1/0=0x0
line 581: 8/1/1=0x0
line 581: 8/2/0=0x0
line 581: 8/2/1=0x0
line 581: 8/3/0=0x0
line 581: 8/3/1=0x0
line 581: 8/4/0=0x0
line 581: 8/4/1=0x0
line 581: 8/5/0=0x0
line 581: 8/5/1=0x0
line 581: 8/6/0=0x0
line 581: 8/6/1=0x0
line 581: 8/7/0=0x0
line 581: 8/7/1=0x0
line 581: 8/8/0=0x0
line 581: 8/8/1=0x0
line 581: 8/9/0=0x0
line 581: 8/9/1=0x0
line 581: 8/10/0=0x0
line 581: 8/10/1=0x0
line 581: 8/11/0=0x0
line 581: 8/11/1=0x0
line 581: 8/12/0=0x0
line 581: 8/12/1=0x0
line 581: 8/13/0=0x0
line 581: 8/13/1=0x0
line 581: 8/14/0=0x0
line 581: 8/14/1=0x0
line 581: 8/15/0=0x0
line 581: 8/15/1=0x0
line 581: 9/0/0=0x0
line 581: 9/0/1=0x0
line 581: 9/1/0=0x0
line 581: 9/1/1=0x0
line 581: 9/2/0=0x0
line 581: 9/2/1=0x0
line 581: 9/3/0=0x0
line 581: 9/3/1=0x0
line 581: 9/4/0=0x0
line 581: 9/4/1=0x0
line 581: 9/5/0=0x0
line 581: 9/5/1=0x0
line 581: 9/6/0=0x0
line 581: 9/6/1=0x0
line 581: 9/7/0=0x0
line 581: 9/7/1=0x0
line 581: 9/8/0=0x0
line 581: 9/8/1=0x0
line 581: 9/9/0=0x0
line 581: 9/9/1=0x0
line 581: 9/10/0=0x0
line 581: 9/10/1=0x0
line 581: 9/11/0=0x0
line 581: 9/11/1=0x0
line 581: 9/12/0=0x0
line 581: 9/12/1=0x0
line 581: 9/13/0=0x0
line 581: 9/13/1=0x0
line 581: 9/14/0=0x0
line 581: 9/14/1=0x0
line 581: 9/15/0=0x0
line 581: 9/15/1=0x0
line 581: 10/0/0=0x0
line 581: 10/0/1=0x0
line 581: 10/1/0=0x0
line 581: 10/1/1=0x0
line 581: 10/2/0=0x0
line 581: 10/2/1=0x0
line 581: 10/3/0=0x0
line 581: 10/3/1=0x0
line 581: 10/4/0=0x0
line 581: 10/4/1=0x0
line 581: 10/5/0=0x0
line 581: 10/5/1=0x0
line 581: 10/6/0=0x0
line 581: 10/6/1=0x0
line 581: 10/7/0=0x0
line 581: 10/7/1=0x0
line 581: 10/8/0=0x0
line 581: 10/8/1=0x0
line 581: 10/9/0=0x0
line 581: 10/9/1=0x0
line 581: 10/10/0=0x0
line 581: 10/10/1=0x0
line 581: 10/11/0=0x0
line 581: 10/11/1=0x0
line 581: 10/12/0=0x0
line 581: 10/12/1=0x0
line 581: 10/13/0=0x0
line 581: 10/13/1=0x0
line 581: 10/14/0=0x0
line 581: 10/14/1=0x0
line 581: 10/15/0=0x0
line 581: 10/15/1=0x0
line 581: 11/0/0=0x0
line 581: 11/0/1=0x0
line 581: 11/1/0=0x0
line 581: 11/1/1=0x0
line 581: 11/2/0=0x0
line 581: 11/2/1=0x0
line 581: 11/3/0=0x0
line 581: 11/3/1=0x0
line 581: 11/4/0=0x0
line 581: 11/4/1=0x0
line 581: 11/5/0=0x0
line 581: 11/5/1=0x0
line 581: 11/6/0=0x0
line 581: 11/6/1=0x0
line 581: 11/7/0=0x0
line 581: 11/7/1=0x0
line 581: 11/8/0=0x0
line 581: 11/8/1=0x0
line 581: 11/9/0=0x0
line 581: 11/9/1=0x0
line 581: 11/10/0=0x0
line 581: 11/10/1=0x0
line 581: 11/11/0=0x0
line 581: 11/11/1=0x0
line 581: 11/12/0=0x0
line 581: 11/12/1=0x0
line 581: 11/13/0=0x0
line 581: 11/13/1=0x0
line 581: 11/14/0=0x0
line 581: 11/14/1=0x0
line 581: 11/15/0=0x0
line 581: 11/15/1=0x0
line 581: 12/0/0=0x0
line 581: 12/0/1=0x0
line 581: 12/1/0=0x0
line 581: 12/1/1=0x0
line 581: 12/2/0=0x0
line 581: 12/2/1=0x0
line 581: 12/3/0=0x0
line 581: 12/3/1=0x0
line 581: 12/4/0=0x0
line 581: 12/4/1=0x0
line 581: 12/5/0=0x0
line 581: 12/5/1=0x0
line 581: 12/6/0=0x0
line 581: 12/6/1=0x0
line 581: 12/7/0=0x0
line 581: 12/7/1=0x0
line 581: 12/8/0=0x0
line 581: 12/8/1=0x0
line 581: 12/9/0=0x0
line 581: 12/9/1=0x0
line 581: 12/10/0=0x0
line 581: 12/10/1=0x0
line 581: 12/11/0=0x0
line 581: 12/11/1=0x0
line 581: 12/12/0=0x0
line 581: 12/12/1=0x0
line 581: 12/13/0=0x0
line 581: 12/13/1=0x0
line 581: 12/14/0=0x0
line 581: 12/14/1=0x0
line 581: 12/15/0=0x0
line 581: 12/15/1=0x0
line 581: 13/0/0=0x0
line 581: 13/0/1=0x0
line 581: 13/1/0=0x0
line 581: 13/1/1=0x0
line 581: 13/2/0=0x0
line 581: 13/2/1=0x0
line 581: 13/3/0=0x0
line 581: 13/3/1=0x0
line 581: 13/4/0=0x0
line 581: 13/4/1=0x0
line 581: 13/5/0=0x0
line 581: 13/5/1=0x0
line 581: 13/6/0=0x0
line 581: 13/6/1=0x0
line 581: 13/7/0=0x0
line 581: 13/7/1=0x0
line 581: 13/8/0=0x0
line 581: 13/8/1=0x0
line 581: 13/9/0=0x0
line 581: 13/9/1=0x0
line 581: 13/10/0=0x0
line 581: 13/10/1=0x0
line 581: 13/11/0=0x0
line 581: 13/11/1=0x0
line 581: 13/12/0=0x0
line 581: 13/12/1=0x0
line 581: 13/13/0=0x0
line 581: 13/13/1=0x0
line 581: 13/14/0=0x0
line 581: 13/14/1=0x0
line 581: 13/15/0=0x0
line 581: 13/15/1=0x0
line 581: 14/0/0=0x0
line 581: 14/0/1=0x0
line 581: 14/1/0=0x0
line 581: 14/1/1=0x0
line 581: 14/2/0=0x0
line 581: 14/2/1=0x0
line 581: 14/3/0=0x0
line 581: 14/3/1=0x0
line 581: 14/4/0=0x0
line 581: 14/4/1=0x0
line 581: 14/5/0=0x0
line 581: 14/5/1=0x0
line 581: 14/6/0=0x0
line 581: 14/6/1=0x0
line 581: 14/7/0=0x0
line 581: 14/7/1=0x0
line 581: 14/8/0=0x0
line 581: 14/8/1=0x0
line 581: 14/9/0=0x0
line 581: 14/9/1=0x0
line 581: 14/10/0=0x0
line 581: 14/10/1=0x0
line 581: 14/11/0=0x0
line 581: 14/11/1=0x0
line 581: 14/12/0=0x0
line 581: 14/12/1=0x0
line 581: 14/13/0=0x0
line 581: 14/13/1=0x0
line 581: 14/14/0=0x0
line 581: 14/14/1=0x0
line 581: 14/15/0=0x0
line 581: 14/15/1=0x0
line 581: 15/0/0=0x0
line 581: 15/0/1=0x0
line 581: 15/1/0=0x0
line 581: 15/1/1=0x0
line 581: 15/2/0=0x0
line 581: 15/2/1=0x0
line 581: 15/3/0=0x0
line 581: 15/3/1=0x0
line 581: 15/4/0=0x0
line 581: 15/4/1=0x0
line 581: 15/5/0=0x0
line 581: 15/5/1=0x0
line 581: 15/6/0=0x0
line 581: 15/6/1=0x0
line 581: 15/7/0=0x0
line 581: 15/7/1=0x0
line 581: 15/8/0=0x0
line 581: 15/8/1=0x0
line 581: 15/9/0=0x0
line 581: 15/9/1=0x0
line 581: 15/10/0=0x0
line 581: 15/10/1=0x0
line 581: 15/11/0=0x0
line 581: 15/11/1=0x0
line 581: 15/12/0=0x0
line 581: 15/12/1=0x0
line 581: 15/13/0=0x0
line 581: 15/13/1=0x0
line 581: 15/14/0=0x0
line 581: 15/14/1=0x0
line 581: 15/15/0=0x0
line 581: 15/15/1=0x0
line 313: 1/1/0=0x0
line 313: 8/8/0=0x0
line 313: 8/8/1=0x0
line 313: 1/1/1=0x0
line 313: 13/13/0=0x0
line 313: 9/9/0=0x0
line 313: 9/9/1=0x0
line 313: 5/5/0=0x0
line 313: 4/4/0=0x0
line 313: 4/4/1=0x0
line 313: 2/2/0=0x0
line 313: 2/2/1=0x0
line 313: 11/11/0=0x0
line 313: 14/14/0=0x0
line 313: 5/5/1=0x0
line 313: 14/14/1=0x0
line 313: 6/6/0=0x0
line 313: 3/3/0=0x0
line 313: 0/0/0=0x0
line 313: 0/0/1=0x0
line 313: 7/7/0=0x0
line 313: 7/7/1=0x0
line 313: 12/12/0=0x0
line 313: 12/12/1=0x0
line 313: 6/6/1=0x0
line 313: 3/3/1=0x0
line 313: 13/13/1=0x0
line 313: 11/11/1=0x0
line 313: 10/10/0=0x0
line 313: 10/10/1=0x0
line 302: 15/0/0=0x7fa14bacd000
line 302: 15/1/0=0x7fa14bacd000
line 302: 15/2/0=0x7fa14bacd000
line 302: 15/3/0=0x7fa14bacd000
line 302: 15/4/0=0x7fa14bacd000
line 302: 15/5/0=0x7fa14bacd000
line 302: 15/6/0=0x7fa14bacd000
line 302: 15/7/0=0x7fa14bacd000
line 302: 15/8/0=0x7fa14bacd000
line 302: 15/9/0=0x7fa14bacd000
line 302: 15/10/0=0x7fa14bacd000
line 302: 15/11/0=0x7fa14bacd000
line 302: 15/12/0=0x7fa14bacd000
line 302: 15/13/0=0x7fa14bacd000
line 302: 15/14/0=0x7fa14bacd000
line 302: 15/15/0=0x7fa14bacd000
line 347: 15/15/0=0x0
exiting. mypos/i/xxx: 15/0/0=0x7fa14bacd000

I.e. we busy-loop, waiting for the value at 15/0/0 to become 0, but that never happens. Immediately prior we set 15/everything/0 to something != 0, and reset ONLY 15/15/0. I don't know what is intended here. Is this enlightening for anybody?
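To make the failure mode concrete, here is a toy single-threaded model of the flag handshake as I understand it (this is NOT OpenBLAS code, just an illustration; the "publish" and "clear" steps correspond to the line 302 and line 347 prints above):

#include <stdio.h>
#define NTHREADS 16

int main(void) {
    /* working[producer][consumer]: nonzero means "producer's buffer is still in use by consumer" */
    long working[NTHREADS][NTHREADS] = {{0}};
    int producer = 15;

    /* producer 15 publishes its buffer to every consumer (the "line 302" prints) */
    for (int i = 0; i < NTHREADS; i++) working[producer][i] = 1;

    /* only consumer 15 ever finishes and clears its own slot (the single "line 347" print) */
    working[producer][15] = 0;

    /* producer 15 then waits for ALL slots to clear; with consumers 0..14 never
       clearing theirs, the real code spins forever right here */
    for (int i = 0; i < NTHREADS; i++)
        if (working[producer][i]) printf("would spin forever on 15/%d/0\n", i);
    return 0;
}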

@martin-frbg
Collaborator

Not sure yet what is going on there, but could you try with this modification of lapack/getrf/getrf_parallel.c (had to rename it to .txt so that GitHub accepts it) that at least fixes the helgrind warnings?
getrf_parallel.txt

@dkogan
Author

dkogan commented Mar 26, 2018

Nope, still hangs with that version of getrf_parallel.c.

@martin-frbg
Collaborator

Pity. Hanging in the same location, probably? (And did you do a full rebuild, i.e. make clean first?)

@dkogan
Author

dkogan commented Mar 26, 2018

I did a very clean rebuild, and yes, it hangs in the same place. Anything more I can do?

@martin-frbg
Collaborator

Not right now - "your" hang seems to happen in a different place from "my" helgrind error, and I cannot reproduce the problem with my small machine even if I throw 80 threads at it. The fix may still look quite similar, just putting a pthread lock around the job[].working accesses - but at the moment all my tries seem to introduce hangs rather than fix anything...

@dkogan
Author

dkogan commented Mar 26, 2018

OK. I tried to instrument with rr to get more clarity, but that squashed the bug. This is consistent with what you saw: true parallelism is required to trigger this; simply increasing the number of threads isn't enough.

@brada4
Contributor

brada4 commented Mar 26, 2018

According to Intel ARK your E5-2698 v4 processor has fancy cluster-on-die technology, so you may get 2 NUMA nodes per socket. Can you check with numactl -H?

@dkogan
Author

dkogan commented Mar 26, 2018

kogan@dirac:~/OpenBLAS$ ~/numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 0 size: 48233 MB
node 0 free: 15973 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 1 size: 48353 MB
node 1 free: 18754 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

@martin-frbg
Collaborator

Perhaps adding -fsanitize=thread to your CFLAGS (assuming you use gcc for the build?) will tell us something that (at least my) valgrind/helgrind did not manage to notice? (So far I do not think the NUMA layout is to blame here.)

@dkogan
Author

dkogan commented Mar 26, 2018

The build system isn't cooperating. Is there some "correct" way to add the sanitizer flags to the compiler AND the linker?

@martin-frbg
Collaborator

If it does not take the CFLAGS, I guess you could just uncomment the COMMON_OPT line in Makefile.rule and add it after the -O2.

@dkogan
Author

dkogan commented Mar 26, 2018

It takes the CFLAGS, but there's a required link library, and no obvious variable that takes a -ltsan.

@martin-frbg
Collaborator

Does it work when you use LD_FLAGS=-fsanitize=thread as well?

@dkogan
Author

dkogan commented Mar 26, 2018

No. LD_FLAGS isn't something this build system has. LDLIBS is what this is usually called, but we don't use that either. There's an LDFLAGS, but that applies to both C and Fortran, and the latter barfs. Do you ever build tsan into OpenBLAS? If not, I can figure it out; just not right this second.

@martin-frbg
Collaborator

martin-frbg commented Mar 26, 2018

Got it now. Actually you need to add the -fsanitize=thread to the FFLAGS as well, as it uses gfortran to invoke the linker. And make sure that you actually have /usr/lib64/libtsan.so; your distro may supply this in a separate libtsan or libtsan0 package...
(And it does not provide any additional output in my case, except that it complains about conflicting accesses to sgemm_p in the utest run at the end of the build, and makes the "fork" utest hang)

@dkogan
Author

dkogan commented Mar 26, 2018

Yep, I figured it out at the same time: Fedora was being a silly distro as usual. I can build with this:

make USE_THREAD=1 USE_OPENMP=0 INTERFACE64=0 BINARY=64 CC='ccache gcc' FC='ccache gfortran' CFLAGS='-fsanitize=thread' LDFLAGS=-ltsan

But the sample program now crashes in the libtsan initialization. It doesn't look like this crash is related to OpenBLAS at all. I really need to go do something else now; let me look at this again a bit later.

@dkogan
Author

dkogan commented Mar 26, 2018

Oh, never mind. tsan was being unfriendly too. I can now build and run. The tests were failing because of tsan (maybe indicative of the failure in this issue), so I disabled them. My test application now says this:
tsan.log.txt

I'm here: 73c5ca7

With these patches applied:
openblas.patch.txt

@martin-frbg
Collaborator

Thank you. This appears to be without my proposed changes to getrf_parallel.c; was that intentional, after seeing that they did not appear to fix anything?

@dkogan
Author

dkogan commented Mar 27, 2018

Not especially intentional. We just had several versions flying around, and it wasn't clear which changes were important and which were random experiments. I can try with the changes. Are we talking about these: #1497 (comment)?

@martin-frbg
Collaborator

Yes, that version at least seems to have silenced the warnings from both helgrind and the thread sanitizer (on my system, which does not show the full symptoms of your problem).

@martin-frbg
Collaborator

martin-frbg commented Mar 28, 2018

My only idea at the moment is to add locks around the few remaining code sections that involve the job array in that file. Just to be sure - you build on the same 80-core machine where execution fails, not on something with fewer cores? And did you try with USE_OPENMP=1, or is there something about your particular use case that would preclude this?

@martin-frbg
Collaborator

getrf_parallel.txt
Looking at the "fork" utest hang now; it happens on thread shutdown, so it may play a role in other cases as well.

@dkogan
Author

dkogan commented Mar 28, 2018

I am building on the same machine where I'm running the tests, yes. And I haven't tried any configurations other than the one mentioned above. The reason is that this is the configuration used by Fedora when they build their packages. Originally I hit this bug when using the packages, and I wanted to reproduce that configuration. If it's useful, I can check USE_OPENMP=1, or anything else you'd like.

@brada4
Contributor

brada4 commented Mar 28, 2018

I fail to find your configuration among those built by Fedora:
https://src.fedoraproject.org/cgit/rpms/openblas.git/tree/openblas.spec

@dkogan
Author

dkogan commented Mar 28, 2018

I don't exactly remember, but it looks close to the 'threaded64' build, except I'm also passing BUILD64=1, and not passing the extra flags that Fedora is. I don't remember exactly why I'm adding BUILD64=1. Is that wrong?

@dkogan
Author

dkogan commented Mar 28, 2018

Oh, and the resulting Fedora packages were 100% showing this issue; that's why I started chasing it in the first place.

@brada4
Contributor

brada4 commented Mar 29, 2018

Try to build without the flags.

@martin-frbg
Collaborator

BUILD64 is not used anywhere in the OpenBLAS build system, so it should not matter; BINARY=64, as you mentioned in your first post, should be the default anyway.
Unfortunately I did not get any further yet - the hang in the "fork" utest seems to happen only with tsan (and not even on a lock/unlock); helgrind finds nothing wrong with the thread usage there. I suspect the latest version of getrf_parallel I posted still did not improve anything for you?
The USE_OPENMP=1 suggestion I made earlier is just because that happens to be considered intrinsically thread-safe (but making plain pthreads behave is still an important goal).

@dkogan
Author

dkogan commented Mar 29, 2018

I just tried your latest getrf_parallel, and it fails, as before.

Setting USE_OPENMP=1 "fixes" it: the test program successfully exits.

@martin-frbg
Collaborator

Maybe all the helgrind and tsan warnings were a red herring, and it is indeed a cpu mask or something similar overflowing somewhere in blas_server.c. (The obvious drawback is that it will probably be hard to find and fix this on a lesser machine.) At least with OpenMP, "conventional wisdom" is confirmed and you should have a reliable solution for your actual use case.

@martin-frbg
Collaborator

One quick thing to try would be to unconditionally define USE_ALLOC_HEAP at the start of getrf_parallel.c (where it now depends on the number of CPUs exceeding a fixed value from common.h).
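I.e. something like this near the top of lapack/getrf/getrf_parallel.c (a sketch of the experiment, not a proposed final patch; the existing guard is quoted in a later comment):

/* force heap allocation of the job_t array regardless of MAX_CPU_NUMBER */
#define USE_ALLOC_HEAP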

@dkogan
Author

dkogan commented Mar 29, 2018

I already have a workaround for my actual use case: OPENBLAS_NUM_THREADS=N for N <= 65. I'm here purely to help you debug.

Defining USE_ALLOC_HEAP unconditionally doesn't appear to make a difference: the test program still hangs.

@brada4
Contributor

brada4 commented Mar 30, 2018

For the tests specifically, could you try which of the Fedora packages fail (-lopenblas / -lopenblasp / -lopenblaso after installing openblas-devel, and also -lblas -llapack for the generic versions, which should not vary with CPU arch or count)?
The potentially suspect code currently exceeds what can be checked for defects on the spot, i.e. it would help if we could reduce the search for the bug to a few small ifdef blocks instead of checking all ifdef combinations across the 10 files involved.

@martin-frbg
Collaborator

@brada4 my understanding is that he is tracking this down via direct builds from source, so I do not see what the Fedora packages have to do with it (and surely their p and o variants are just builds with pthreads and/or OpenMP). With the -lblas -llapack suggestion, are you saying that the test code may fail on his machine even when linked against netlib blas/lapack (which would not even multithread)?

@brada4
Contributor

brada4 commented Mar 30, 2018

There was a statement that the Fedora package hangs, and the same parameters keep coming up. It would be nice to know whether it is an OpenMP vs. pthreads issue or not...

@dkogan
Author

dkogan commented Mar 30, 2018

All the tests in this thread were from source builds. We already know that the direct pthread-based implementation is at fault somehow, while the OpenMP one works.

@martin-frbg
Collaborator

I'm trying to reproduce this with a fake (qemu) 80-core system now, in the hope that there is some subtle difference between building for that many threads and that many CPUs. We'll see if valgrind catches anything useful. Trouble is, qemu is fairly slow...

@dkogan
Author

dkogan commented Apr 1, 2018

I just tried to reproduce on an AWS box with 36 cores (that's what I had available). This was instance type: c5.9xlarge. Sadly, this box wasn't enough to trigger the bug.

@martin-frbg
Collaborator

The fake box does trigger it, but valgrind remains silent (and helgrind as well, if using my latest patched version of getrf_parallel.c). So this is probably a logic bug and not a simple overflow.

@dkogan
Author

dkogan commented Apr 1, 2018 via email

@brada4
Contributor

brada4 commented Apr 1, 2018

Hmm I see
common.h:#define GETRF_MEM_ALLOC_THRESHOLD 80
getrf_parallel.c
//In this case, the recursive getrf_parallel may overflow the stack.
//Instead, use malloc to alloc job_t.
#if MAX_CPU_NUMBER > GETRF_MEM_ALLOC_THRESHOLD
#define USE_ALLOC_HEAP

That would explain the behaviour change at the 81st thread...

@martin-frbg
Collaborator

Actually behaviour "changes" beyond 64 or 65, but stack allocation of the job array could play a role. Another thing to try is to revert wernsaar's changes made in preparation for non-power-of-two UNROLL.
My current thinking is that there could be something wrong with the range_m/range_n setup around line 500; at least I think I saw some strange changes in the value of newarg.nthreads at the transition from working to hanging behaviour. Unfortunately I lost network access to my test system and will not return to it until Tuesday.

@martin-frbg
Collaborator

Seems to me that blas_quickdivide from https://github.com/xianyi/OpenBLAS/blob/07ed01e97f1cb03c7f3820b00c601cc634fc1b89/lapack/getrf/getrf_parallel.c#L503
onwards creates bogus workloads with zero width when there are "too many" CPUs available.
Handling "width == 0" the same as "nn < width" (or "mm < width"), i.e. setting width to either nn or mm as appropriate, seems to fix it in qemu; could you please check whether this is the case on actual hardware as well?
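In code, the guard I tested looks roughly like this (a sketch of the idea, not necessarily the exact committed patch; nn and mm are the remaining column/row counts in the respective loops):

if ((width == 0) || (nn < width)) width = nn;   /* was: if (nn < width) width = nn; */
...
if ((width == 0) || (mm < width)) width = mm;   /* was: if (mm < width) width = mm; */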

@dkogan
Author

dkogan commented Apr 5, 2018

I just tested this on my box, and indeed the issue is gone. Thank you very much!

@dkogan dkogan closed this as completed Apr 5, 2018
@martin-frbg
Collaborator

Thank you very much for the feedback; I have merged this change now. Similar unverified use of the value returned by blas_quickdivide seems to occur in driver/others/blas_l1_thread.c and in the threaded level-3 GEMM.
