
Segmentation fault using serial OpenBLAS with OpenMP on Windows #1847


Closed
hjbreg opened this issue Nov 1, 2018 · 41 comments

hjbreg commented Nov 1, 2018

Updated test results (OpenBLAS 0.3.3)

Windows 64-bit

gcc version 4.9.3 20150626 (Fedora MinGW 4.9.3-1.el7)

  • USE_THREAD=0 NUM_THREADS=1
    segmentation fault

  • USE_THREAD=0 NUM_THREADS=16
    incorrect result

  • USE_THREAD=1 NUM_THREADS=2 USE_OPENMP=1 NUM_PARALLEL=16
    incorrect result

  • USE_THREAD=1 NUM_THREADS=2 openblas_set_num_threads(1)
    incorrect result

Linux 64-bit

gcc version 4.4.7 20120313 (Red Hat 4.4.7-4)

  • USE_THREAD=0 NUM_THREADS=1
    too many memory regions

  • USE_THREAD=0 NUM_THREADS=16
    OK


I get a segmentation fault when OMP_NUM_THREADS > 4, but the same code linked against netlib-lapack works, so I think this is a problem on the OpenBLAS side. I also tested on Linux, where there is no segmentation fault.

GCC: x86_64-w64-mingw32-gcc 4.9.3

Single-threaded OpenBLAS was built with

USE_THREAD=0 DYNAMIC_ARCH=1 DYNAMIC_OLDER=0 NO_CBLAS=1 NO_LAPACKE=1 NO_SHARED=1

Test code: segfault-win.cpp

#include <cstdlib>
#include <vector>
#include <iostream>

// g++ segfault-win.cpp -lopenblas -lgfortran -lquadmath -fopenmp

#define bint int

extern "C"
void dgels_(char *trans, bint *m, bint *n, bint *nrhs, double *a, bint *lda, double *b, bint *ldb,
            double *work, bint *lwork, bint *info);

extern "C" void openblas_set_num_threads(int);
extern "C" int openblas_get_num_threads();
extern "C" int omp_get_max_threads();

int C_dgels(char trans, bint m, bint n, bint nrhs, double *a, bint lda, double *b, bint ldb)
{
    bint info = 0;
    double wkopt;
    bint lwork = -1;
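    // lwork = -1 asks dgels_ for a workspace-size query; the optimal size comes back in wkopt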
    dgels_(&trans, &m, &n, &nrhs, a, &lda, b, &ldb, &wkopt, &lwork, &info);
    if (info == 0) {
        lwork = static_cast<bint>( wkopt );
        double *work = new double[lwork];
        dgels_(&trans, &m, &n, &nrhs, a, &lda, b, &ldb, work, &lwork, &info);
        delete[] work;
    }
    return info;
}

int main()
{
    // USE_THREAD=1 NUM_THREADS=2
    // openblas_set_num_threads(1);

    int n = 300;
    int q = 5;
    int r = 1000;

    std::vector<double> x, y;
    for (int i = 0; i < n; ++i) {
        y.push_back( static_cast<double>(std::rand()) / RAND_MAX );
        for (int j = 0; j < q; ++j)
            x.push_back( static_cast<double>(std::rand()) / RAND_MAX );
    }

    std::cerr << "omp_get_max_threads: " << omp_get_max_threads() << "\n";
    std::cerr << "openblas_get_num_threads: " << openblas_get_num_threads() << "\n";

    std::vector< std::vector<double> > z(r);

    #pragma omp parallel for
    for (int i = 0; i < r; ++i) {
        std::vector<double> xx = x;
        std::vector<double> yy = y;
        C_dgels('N', n, q, 1, xx.data(), n, yy.data(), n);
        z[i] = yy;
    }

    for (int i = 1; i < r; ++i)
        std::cerr << (z[i] == z[0]);
    std::cerr << "\n";

    return 0;
}

brianborchers commented:

Does the same code work correctly with other versions of BLAS (or OpenBLAS compiled with different options)?

hjbreg (Author) commented Nov 1, 2018

> Does the same code work correctly with other versions of BLAS (or OpenBLAS compiled with different options)?

Yes, the code works correctly with netlib-lapack/blas.

brada4 (Contributor) commented Nov 1, 2018

Can you get a backtrace with e.g. x64dbg?
Is OpenBLAS also compiled in an unorthodox way with USE_OPENMP=1, requiring a dozen redist packages from gcc?

hjbreg (Author) commented Nov 1, 2018

Sorry, I am not familiar with backtrace debugging.
However, I rebuilt single-threaded OpenBLAS with NUM_THREADS=24 (it defaulted to 4 in the previous build) in addition to USE_THREAD=0, and interestingly the segmentation fault is gone, but the results are still incorrect for OMP_NUM_THREADS > 1.
I am confused about why the NUM_THREADS option matters at all when USE_THREAD=0 is set.

hjbreg added a commit to njau-sri/rtm-gwas that referenced this issue Nov 1, 2018
martin-frbg (Collaborator) commented:

See USAGE.md: for "historical reasons" NUM_THREADS also sizes an internal buffer pool. Do you get correct results when you compile OpenBLAS (which version, by the way?) with USE_OPENMP=1?

hjbreg (Author) commented Nov 1, 2018

@martin-frbg I think USE_OPENMP=1 cannot be used together with USE_THREAD=0; besides, only the serial version of OpenBLAS (0.3.3) is considered here.
So NUM_THREADS is related to the internal buffer, and I wonder whether that internal buffer is thread-safe.

hjbreg (Author) commented Nov 1, 2018

I will try to build threaded OpenBLAS and use openblas_set_num_threads(1) to get serial OpenBLAS.

martin-frbg (Collaborator) commented:

Did not see the requirement that OpenBLAS has to be single-threaded. If the error is not reproducible on anything other than Windows it could also be a mingw miscompilation. I will check (later) if valgrind/helgrind on Linux makes any complaints.

hjbreg (Author) commented Nov 1, 2018

No segfault, but incorrect results with OpenBLAS built with USE_THREAD=1 USE_OPENMP=1.
So OpenBLAS does not work with MinGW and OpenMP on Windows.
As the same code is fine on Linux, this may also be a MinGW-side problem.

brada4 (Contributor) commented Nov 1, 2018

To get a backtrace in a debugger (x64dbg is graphical, understands both Windows and gcc debug symbols, and is free; alternatives would be IDA Pro, Visual Studio, or OllyDbg, each lacking some functionality), you run the crashing program inside the debugger and it catches the crash.

Then you have various options (which open new MDI windows) like the backtrace, registers, stack contents, and disassembly of the code around the (crashed) EIP and probably around other code points mentioned in the trace. Just be courageous and creative around these lines; nobody expects a perfect picture on the first try.

hjbreg (Author) commented Nov 2, 2018

@martin-frbg I have updated the test results, including Linux; please see the first post for details. I have read USAGE.md, but I still do not understand why a single-threaded OpenBLAS can have problems in a multi-threaded program.

> Despite its name, and due to the use of memory buffers in functions like SGEMM, the setting of NUM_THREADS can be relevant even for a single-threaded build of OpenBLAS, if such functions get called by multiple threads of a program that uses OpenBLAS. In some cases, the affected code may simply crash or throw a segmentation fault without displaying the above warning first.

@brada4 Many thanks, I will try later.

brada4 (Contributor) commented Nov 2, 2018

#1844 approaches the same class of buffers, but from a different side.
The call graph of dgels would probably yield more candidate broken functions for the Windows samples 2-4.

martin-frbg (Collaborator) commented:

@hjbreg could you retry with a build after setting USE_TLS to 0 in Makefile.rule please? 0.3.3 was meant to revert to the original thread memory allocation code, as there were some problems with the new thread-local storage implementation, but unfortunately I had kept the wrong default (unless you build with cmake, that is). Not sure yet if this is in any way related to #1844.

hjbreg (Author) commented Nov 2, 2018

Tested the previous build options with USE_TLS=0; still not working.
As USE_THREAD=0 NUM_THREADS=1 also does not work on Linux while netlib refblas/lapack works, I guess it is an issue on the OpenBLAS side. But it is also strange that USE_THREAD=0 NUM_THREADS=16 (or any other value greater than the runtime thread count) works on Linux but not on Windows.
I think we should check the global static variables used in OpenBLAS.

brada4 (Contributor) commented Nov 2, 2018

Do not set USE_TLS=0; the correct way to disable TLS is to comment out the option in Makefile.rule.
It is a mistake that this experimental option slipped into the release in a hard-to-disable way.
Can you confirm that disabling TLS, i.e. commenting it out, still exhibits the accuracy problems?

hjbreg (Author) commented Nov 2, 2018

@brada4 You are right, USE_TLS was actually not disabled in my last build. I found that the only way to disable USE_TLS is to comment out the line in Makefile.rule, as Makefile.system only checks ifdef USE_TLS rather than its value. However, disabling USE_TLS still does not help. I also checked OpenBLAS versions 0.2.20 and 0.2.10, but unfortunately neither works.

martin-frbg (Collaborator) commented:

Yes, sorry, I only fixed the USE_TLS=0 gotcha after the 0.3.3 release. I'd say myself that 0.3.4 is overdue, but I had still hoped for some phase where bugs are not popping up like mushrooms.

brada4 (Contributor) commented Nov 2, 2018

g++ -g s1847.cpp -llapack -fopenmp -fsanitize=thread complains about a data race on the array z.
Tried with and without OMP on Linux, same result either way.

brada4 (Contributor) commented Nov 2, 2018

@hjbreg the problem is that there are about 10 candidate functions, and a backtrace is really needed to find which of them (it can be more than one) is at fault.

A Linux version of the backtrace would look like the one below:
- the addresses are ASLR-ed, there is nothing personal in them; feel free to edit out non-system paths.
- I artificially set a breakpoint to get some picture.
- you will see the crashing function being called from dgels_; that function name alone would narrow the search for the bug about 10x.

Thread 5 "a.out" hit Breakpoint 1, 0x00007ffff7336fe0 in dgels_ () from /usr/lib64/libopenblas_pthreads.so.0
(gdb) bt
#0  0x00007ffff7336fe0 in dgels_ () from /usr/lib64/libopenblas_pthreads.so.0
#1  0x00000000004010a3 in C_dgels(char, int, int, int, double*, int, double*, int) ()
#2  0x000000000040177c in .omp_outlined. ()
#3  0x00007ffff7f89883 in __kmp_invoke_microtask () from /usr/lib64/libomp.so
#4  0x00007ffff7f2fbe2 in ?? () from /usr/lib64/libomp.so
#5  0x00007ffff7f2eb2b in ?? () from /usr/lib64/libomp.so
#6  0x00007ffff7f7cdc8 in ?? () from /usr/lib64/libomp.so
#7  0x00007ffff54fa559 in start_thread () from /lib64/libpthread.so.0
#8  0x00007ffff523181f in clone () from /lib64/libc.so.6
(gdb) disassemble 
Dump of assembler code for function dgels_:
=> 0x00007ffff7336fe0 <+0>:     push   %r15
   0x00007ffff7336fe2 <+2>:     push   %r14
   0x00007ffff7336fe4 <+4>:     mov    %rdi,%r14
   0x00007ffff7336fe7 <+7>:     push   %r13
   0x00007ffff7336fe9 <+9>:     push   %r12
   0x00007ffff7336feb <+11>:    mov    %rcx,%r12
   0x00007ffff7336fee <+14>:    push   %rbp
   0x00007ffff7336fef <+15>:    push   %rbx
   0x00007ffff7336ff0 <+16>:    mov    %rsi,%rbx
   0x00007ffff7336ff3 <+19>:    mov    %rdx,%rbp
   0x00007ffff7336ff6 <+22>:    mov    $0x1,%ecx

hjbreg (Author) commented Nov 3, 2018

@brada4 Thanks for the guide. I tested the Linux case using USE_THREAD=0 NUM_THREADS=1 and OMP_NUM_THREADS=4.

gcc version 4.4.7 20120313 (Red Hat 4.4.7-4)
libopenblas_sandybridge-r0.3.3.a (Single threaded)

Starting program: /ds3512/home/hjb/OpenBLAS-0.3.3/a.out 
[Thread debugging using libthread_db enabled]
omp_get_max_threads: 4
openblas_get_num_threads: 1
[New Thread 0x7ffff7cef700 (LWP 27985)]
[New Thread 0x7ffff72ee700 (LWP 27986)]
[New Thread 0x7ffff68ed700 (LWP 27987)]
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7cef700 (LWP 27985)]
dgemv_t () at ../kernel/x86_64/dgemv_t.S:252
252		movapd	%xmm0, 0 * SIZE(X1)
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 libgcc-4.4.7-4.el6.x86_64 libgfortran-4.4.7-4.el6.x86_64 libgomp-4.4.7-4.el6.x86_64 libstdc++-4.4.7-4.el6.x86_64
(gdb) bt
#0  dgemv_t () at ../kernel/x86_64/dgemv_t.S:252
#1  0x0000000000000004 in ?? ()
#2  0x00007ffff7cee900 in ?? ()
#3  0x00007ffff7cee830 in ?? ()
#4  0x0000000000000005 in ?? ()
#5  0x000000000067d230 in ?? ()
#6  0x0000000000000003 in ?? ()
#7  0x0000000000403e65 in blas_memory_alloc (procpos=1072693248) at memory.c:2699
#8  0x00000000004755dc in dlarf (side='L', m=1000, n=4, v=..., incv=1, tau=1.021459769892848, c=..., ldc=1000, work=..., _side=4) at dlarf.f:201
#9  0x00000000004753bf in dgeqr2 (m=1000, n=5, a=..., lda=1000, tau=..., work=..., info=0) at dgeqr2.f:184
#10 0x0000000000405f9e in dgeqrf (m=1000, n=5, a=..., lda=1000, tau=..., work=..., lwork=160, info=0) at dgeqrf.f:263
#11 0x000000000040502b in dgels (trans='N', m=1000, n=5, nrhs=1, a=..., lda=1000, b=..., ldb=1000, work=..., lwork=165, info=0, _trans=-8752) at dgels.f:353
#12 0x00000000004017e7 in C_dgels (trans=78 'N', m=1000, n=5, nrhs=1, a=0x7fffe80008c0, lda=1000, b=0x7fffe800a510, ldb=1000) at segfault-win.cpp:31
#13 0x0000000000401ce7 in main.omp_fn.0(void) (.omp_data_i=0x7fffffffde10) at segfault-win.cpp:66
#14 0x00000031f7408502 in ?? () from /usr/lib64/libgomp.so.1
#15 0x00000031f68079d1 in start_thread () from /lib64/libpthread.so.0
#16 0x00000031f5ce8b6d in clone () from /lib64/libc.so.6

hjbreg (Author) commented Nov 3, 2018

This may be related to the global static memory buffer defined in memory.c. I do not know how this static variable is guaranteed to work correctly when the serial version of OpenBLAS is called from multiple threads.

I think the error message "too many memory regions" shows that this static variable is being accessed from multiple threads. As NUM_BUFFERS is defined as #define NUM_BUFFERS (MAX_CPU_NUMBER * 2 * MAX_PARALLEL_NUMBER), I guess that in the case of USE_THREAD=0 NUM_THREADS=1 NUM_BUFFERS will be 2, which is why OMP_NUM_THREADS <= 2 works.

But if the static memory array is not thread-safe, I cannot explain why the case of USE_THREAD=0 NUM_THREADS=16 works on Linux.

memory.c line 2460-2473

static volatile struct {
  BLASULONG lock;
  void *addr;
#if defined(WHEREAMI) && !defined(USE_OPENMP)
  int   pos;
#endif
  int used;
#ifndef __64BIT__
  char dummy[48];
#else
  char dummy[40];
#endif

} memory[NUM_BUFFERS];
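
For illustration only, here is my own simplified sketch (not the actual memory.c code) of how a fixed-size slot table like this breaks down once more threads call in than there are slots; it matches the "too many memory regions" message above:

#include <atomic>
#include <cstdio>
#include <cstdlib>

// Sketch only: the slot count mirrors NUM_BUFFERS; with NUM_THREADS=1 that is 1 * 2 * 1 = 2.
constexpr int kNumBuffers = 2;
static std::atomic<bool> slot_used[kNumBuffers];   // stands in for memory[i].used

void *sketch_memory_alloc() {
    for (int i = 0; i < kNumBuffers; ++i) {
        bool expected = false;
        // claim the first free slot, roughly what blas_memory_alloc does with its lock/used flags
        if (slot_used[i].compare_exchange_strong(expected, true))
            return std::malloc(1 << 20);           // stand-in for the real buffer
    }
    std::fprintf(stderr, "too many memory regions\n");
    std::exit(1);                                  // a third concurrent caller has nowhere to go
}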

brada4 (Contributor) commented Nov 3, 2018

That is about the non-TLS version. It uses a static buffer array sized at build time, and it turns out it is unnecessarily used even for single-threaded calls.
What about first raising the minimum to, say, 64 (anyone could have that many cores at home by the next working day), and then tackling the question of why the single-threaded version and configuration does thread locking at all.

martin-frbg (Collaborator) commented Nov 3, 2018

I suspect OpenBLAS will "just" work in this context (single-threaded OpenBLAS called by multiple threads of a higher-level program) as long as each thread finds an empty slot in the memory array (i.e. NUM_THREADS is big enough to accommodate them) - but on the whole this looks to be a serious design flaw inherited from the original GotoBLAS. With luck, the miscalculations you are seeing on Windows are caused by the separate (and much younger) bug in xGEMV that is discussed in #1844 and hopefully resolvable by #1852.

hjbreg (Author) commented Nov 4, 2018

@martin-frbg I have tested on 64-bit Windows with an Intel Core i7-4790 (libopenblas_haswell-r0.3.3); dgemv works correctly under multiple threads with or without #1852.

#include <cstdlib>
#include <vector>
#include <iostream>
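// assumed build command (file name is hypothetical): g++ dgemv-test.cpp -lopenblas -lgfortran -lquadmath -fopenmp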

extern "C"
void dgemv_(char *trans, int *m, int *n, double *alpha, const double *a, int *lda,
            const double *x, int *incx, double *beta, double *y, int *incy);

extern "C" int openblas_get_num_threads();
extern "C" int omp_get_max_threads();

using namespace std;

int main(int argc, char *argv[])
{
    int r = 500;

    char trans = 'N';
    int m = 1000;
    int n = 5;

    if (argc > 3) {
        trans = argv[1][0];
        m = atoi(argv[2]);
        n = atoi(argv[3]);
    }

    std::cerr << trans << " " << m << " " << n << "\n";

    int na = m * n;
    vector<double> a(na);
    for (int i = 0; i < na; ++i)
        a[i] = (double) rand() / RAND_MAX;

    int nx = trans == 'T' ? m : n;
    vector<double> x(nx);
    for (int i = 0; i < nx; ++i)
        x[i] = (double) rand() / RAND_MAX;

    int lda = m;
    int incx = 1;
    int incy = 1;
    double alpha = 1.0;
    double beta = 0.0;

    int ny = trans == 'T' ? n : m;
    vector< vector<double> > ys(r, vector<double>(ny));

    std::cerr << "omp_get_max_threads: " << omp_get_max_threads() << "\n";
    std::cerr << "openblas_get_num_threads: " << openblas_get_num_threads() << "\n";

    #pragma omp parallel for
    for (int i = 0; i < r; ++i)
        dgemv_(&trans, &m, &n, &alpha, a.data(), &lda, x.data(), &incx, &beta, ys[i].data(), &incy);

    for (int i = 1; i < r; ++i)
        std::cerr << (ys[i] == ys[0]);
    std::cerr << "\n";

    return 0;
}

brada4 (Contributor) commented Nov 4, 2018

Did the initial dgels_ code start working with the patch from the PR?
dgemv_ makes up just a small part of it.

martin-frbg (Collaborator) commented Nov 4, 2018

@hjbreg so it looks to you as if it is a Sandybridge problem (or more specifically the Sandybridge-specific microkernels as compiled by mingw-4.9.3)? The dgemv issue seems real (and reproducible), but may indeed be unrelated to your case.

hjbreg (Author) commented Nov 5, 2018

@brada4 dgels still does not work with the patch.
@martin-frbg I will try to narrow down the problem later.

hjbreg (Author) commented Nov 5, 2018

I have tested the functions called by dgels and found that the discrepancy is caused by the precision of dnrm2. If dnrm2 is called from multiple threads, the results may differ.

Test is performed on 64-bit Windows, Intel Core i7-4790, gcc 8.2.0 (MSYS2 MinGW 64-bit), OpenBLAS 0.3.3 (USE_THREAD=0)

Test output (1st column is OpenBLAS dnrm2, 2nd is the naive code)

0.20518284528683192 0.20518284528683189 2.77556e-017
0.20518284528683192 0.20518284528683189 2.77556e-017
0.20518284528683192 0.20518284528683189 2.77556e-017
0.20518284528683192 0.20518284528683189 2.77556e-017
0.20518284528683189 0.20518284528683189 0
0.20518284528683189 0.20518284528683189 0
0.20518284528683189 0.20518284528683189 0
0.20518284528683189 0.20518284528683189 0

Test code

#include <math.h>
#include <stdio.h>
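/* assumed build command (file name is hypothetical): gcc dnrm2-test.c -lopenblas -fopenmp -lm */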

double dnrm2_(int *n, const double *x, int *incx);

double C_dnrm2(int n, const double *x, int incx)
{
    return dnrm2_(&n, x, &incx);
}

double M_dnrm2(int n, const double *x, int incx)
{
    int i = 0;
    double s = 0.0;
    for (i = 0; i < n; ++i)
        s += x[i]*x[i];
    return sqrt(s);
}

int main()
{
    int i = 0;
    double x[2] = {0.14, 0.15};
    double y[8], z[8];

    #pragma omp parallel for
    for (i = 0; i < 8; ++i) {
        y[i] = C_dnrm2(2, x, 1);
        z[i] = M_dnrm2(2, x, 1);
    }

    for (i = 0; i < 8; ++i)
        printf("%.17f %.17f %g\n", y[i], z[i], y[i]-z[i]);

    return 0;
}
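
For comparison, the reference (netlib) dnrm2 does not form the plain sum of squares; it uses a scaled accumulation to avoid overflow and underflow, so its last bit can legitimately differ from the naive sqrt of the sum. A rough sketch of that scaled approach (my own translation for illustration, not OpenBLAS's assembly kernel):

#include <math.h>

/* Sketch of the classic scale/ssq accumulation used by reference dnrm2,
 * shown only to illustrate why its last bit may differ from M_dnrm2 above. */
double S_dnrm2(int n, const double *x, int incx)
{
    double scale = 0.0, ssq = 1.0;
    for (int i = 0; i < n; ++i) {
        double ax = fabs(x[i * incx]);
        if (ax != 0.0) {
            if (scale < ax) {
                ssq = 1.0 + ssq * (scale / ax) * (scale / ax);
                scale = ax;
            } else {
                ssq += (ax / scale) * (ax / scale);
            }
        }
    }
    return scale * sqrt(ssq);
}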

brada4 (Contributor) commented Nov 5, 2018

Please go to Makefile.rule, find the place where the -frecursive parameter is disabled, and enable it.
That is reference LAPACK code that does not call anything in (Open)BLAS.

hjbreg (Author) commented Nov 5, 2018

@brada4 I do not understand how this option is related to this issue; I tested dnrm2 with an OpenBLAS build configured with NO_LAPACK=1.

hjbreg (Author) commented Nov 5, 2018

I also disabled the optimized dnrm2 by adding
DNRM2KERNEL = ../arm/nrm2.c
to the kernel file kernel/x86_64/KERNEL.HASWELL.
The new OpenBLAS build passed the DGELS test in the first post.

brada4 (Contributor) commented Nov 5, 2018

It is a BLAS function, my bad. The ARM (v5, 32-bit) kernel is plain C.

martin-frbg (Collaborator) commented:

So "wrong" in this context is "only" about the trailing digits of the SSE result (and which somehow differs between Windows and Linux builds on the same hardware - I count 15 significant digits in either case if I am not mistaken) ?

brada4 (Contributor) commented Nov 5, 2018

@hjbreg it is exactly a one-bit difference in the last place of the result, which is insignificant for all practical purposes. You can recalculate with a symbolic math package and then with reference BLAS, and you will see that whatever you use, you lose on average log2(number of floating-point ops) of the lowest bits of precision on any FPU compared to full symbolic evaluation.
Let me give an extreme roundoff example: sum a million random floats, then sort them ascending and descending and repeat the summation - the totals differ.
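
A minimal self-contained version of that experiment (single-precision accumulation to make the effect obvious):

#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <vector>

// Summing the same million random floats in three different orders gives
// three slightly different totals - pure rounding, no bug anywhere.
int main()
{
    std::vector<float> v(1000000);
    for (float &f : v)
        f = static_cast<float>(std::rand()) / RAND_MAX;

    float forward = 0.0f, ascending = 0.0f, descending = 0.0f;
    for (float f : v) forward += f;

    std::sort(v.begin(), v.end());
    for (float f : v) ascending += f;

    std::sort(v.begin(), v.end(), std::greater<float>());
    for (float f : v) descending += f;

    std::printf("%.2f %.2f %.2f\n", forward, ascending, descending);
    return 0;
}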

brada4 (Contributor) commented Nov 5, 2018

@hjbreg there was an old issue where Windows MinGW did not copy the FPCSR between threads; that was helped by building with CONSISTENT_FPCSR=1, which could be one explanation why the threaded version behaves slightly differently.

hjbreg (Author) commented Nov 6, 2018

I have to say the real case is much more complex: dgels and dgelsy are used intensively for variable selection. I am sorry that the above test does not cover the full real issue.
I have tested the real case with DNRM2KERNEL = ../arm/nrm2.c, and the result (the selected variables) is still quite different from a serial call of OpenBLAS.
So I am not sure whether this is related to bit precision. I will keep trying to narrow down the issue.

hjbreg (Author) commented Nov 6, 2018

Finally, I think I have found the real issue. I built OpenBLAS with BLAS only (NO_LAPACK=1) and linked my real program against netlib LAPACK and OpenBLAS; surprisingly, the issue disappears.

Then I guessed the issue might be related only to the Fortran compile options, so I rebuilt the whole of OpenBLAS (LAPACK included) with -frecursive enabled as suggested by @brada4, and my real program works correctly as expected.

Although I do not know why -frecursive is needed on 64-bit Windows with OpenMP, it really does fix this issue. If there is no other side effect of -frecursive, I think we can enable this option by default, as in lapack-netlib/make.inc.example.

Also please see https://gcc.gnu.org/onlinedocs/gfortran/OpenMP.html

> -fopenmp implies -frecursive, i.e., all local arrays will be allocated on the stack. When porting existing code to OpenMP, this may lead to surprising results, especially to segmentation faults if the stacksize is limited.

brada4 (Contributor) commented Nov 6, 2018

I think it can even be enabled for all gfortran builds, as thread-safety is intended even for the single-threaded version.
Btw, is the initial crash gone with the gemv patch?

hjbreg (Author) commented Nov 6, 2018

@brada4 The initial crash is fixed by specifying NUM_THREADS=16 (or any other value greater than the actual number of threads), which is actually described in USAGE.md.

brada4 (Contributor) commented Nov 6, 2018

I hope these two address the initial concerns. There are no thread-safety tests in the current test suite either; thank you for reporting.

hjbreg (Author) commented Nov 7, 2018

The two patches work for me, so I am closing this issue.

hjbreg closed this as completed Nov 7, 2018
martin-frbg added a commit that referenced this issue Nov 7, 2018
Add gfortran -frecursive option from upstream and #1847