Issue with LAPACKE_sgesvd() on custom compiled v0.3.7 for Win64 (new x86_64-w64-mingw32 might be the cause) #2297

Closed
Arech opened this issue Oct 30, 2019 · 66 comments

Comments

@Arech

Arech commented Oct 30, 2019

Hi there.

I have a wrapper over LAPACKE_sgesvd() that works well with the supplied binaries v0.2.19/20 and v0.3.7 and with a custom-compiled v0.2.20. However, the code doesn't work with v0.3.7 compiled (with the same options I used for v0.2.20) on a fresh Debian 10 with fresh compilers CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran (I describe the compilation process in detail below).

Testing environment: Windows 7 x64 with latest updates on AMD Phenom II X6 1090.

The code in question does the same thing as the following Python script:

    def sample(self, shape):
        if len(shape) < 2:
            raise RuntimeError("Only shapes of length 2 or more are "
                               "supported.")

        flat_shape = (shape[0], np.prod(shape[1:]))
        a = get_rng().normal(0.0, 1.0, flat_shape)
        u, _, v = np.linalg.svd(a, full_matrices=False)
        # pick the one with the correct shape
        q = u if u.shape == flat_shape else v
        q = q.reshape(shape)
        return floatX(self.gain * q)

It takes a random N(0,1) matrix and performs SVD on it. Here are the first few floats of a sample input (column-major 64×785 matrix):

0x000000000D4C00C0      -0.909242570     -0.741646349     -0.169360474     -0.177789196  
0x000000000D4C00D0      -0.341049701     -0.345561802     -0.421100467     -0.359291792  
0x000000000D4C00E0     -0.0570527203       1.12855971      -1.45928419      0.384212315  
...

All tested versions of LAPACKE_sgesvd() work great, except the custom-compiled v0.3.7, which, despite returning success (0), outputs the following junk:

0x000000000D4C00C0   -3.40282347e+38  -3.40282347e+38      0.000000000      0.000000000  
0x000000000D4C00D0    3.40282347e+38   3.40282347e+38  -3.40282347e+38   3.40282347e+38  
0x000000000D4C00E0   -3.40282347e+38      0.000000000      0.000000000  -3.40282347e+38  
...

Or in DWORDS

0x000000000D4C00C0  ff7fffff ff7fffff 00000000 00000000  
0x000000000D4C00D0  7f7fffff 7f7fffff ff7fffff 7f7fffff  
0x000000000D4C00E0  ff7fffff 00000000 00000000 ff7fffff  
...
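
The DWORD dump above can be decoded by hand: 7f7fffff is the largest finite single-precision value (FLT_MAX) and ff7fffff is its negative, so the output is saturated rather than arbitrary. A minimal Python sketch of that decoding follows; the helper name `dword_to_float` is only illustrative and not part of the original report.

```python
import struct


def dword_to_float(hex_word):
    """Interpret an 8-digit hex DWORD, as printed in the dump, as an IEEE 754 binary32 value."""
    # The debugger prints each DWORD as a number (most significant digit first),
    # so the hex string is unpacked as big-endian here.
    return struct.unpack(">f", bytes.fromhex(hex_word))[0]


# The three distinct DWORDs that appear in the dump.
for word in ("ff7fffff", "7f7fffff", "00000000"):
    print(word, dword_to_float(word))
```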

Note that I had to do a custom compilation because the supplied binary still isn't built with the CONSISTENT_FPCSR=1 switch, and I eventually get a lot of NaNs that seriously slow the computation down.


The compilation process is basically the same as described in the issue linked above (#1237). I installed a fresh Debian 10 on a virtual machine, did the usual boilerplate

apt-get update
apt-get upgrade
apt-get install make cmake gcc mingw-w64 gfortran-mingw-w64

and then ran

make clean
make DYNAMIC_ARCH=0 CONSISTENT_FPCSR=1 CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran HOSTCC=gcc NUM_THREADS=6 TARGET=BARCELONA PREFIX=/opt/OpenBLAS
make install

to obtain a kind of distribution in /opt/OpenBLAS. Then I copy the contents of /opt/OpenBLAS to the Windows system, build my project against it, copy libopenblas.dll to my exe's folder, and hit the issue with LAPACKE_sgesvd() when I run my code.

Note that I haven't found any issues with the CBLAS routines I use (mainly gemm, syrk and symm). Moreover, I'm glad to see some performance improvement over the older v0.2.20.

I've noticed one suspicious difference between the supplied binary libopenblas.dll and my compiled version. The supplied binary depends on libgfortran-3.dll and works great with a very old version of this lib dated 21.10.2014 (AFAIR I got it from some .zip on the project's SourceForge page long ago). However, the custom-compiled version depends on libgfortran-5.dll, which I had to take (with all the other necessary .dll dependencies) from Debian's installation folder /lib/gcc/x86_64-w64-mingw32/8.3-win32.
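
One quick way to see which gfortran runtime a given libopenblas.dll expects is to look for the import name inside the file. The sketch below is a crude heuristic, not a PE import-table parser: it assumes the dependency name is stored as plain ASCII in the DLL (which is how import names are normally stored), and the function names and the example path are made up for illustration.

```python
import re
from pathlib import Path

# Matches names such as libgfortran-3.dll or libgfortran-5.dll inside raw bytes.
GFORTRAN_RE = re.compile(rb"libgfortran-(\d+)\.dll", re.IGNORECASE)


def gfortran_runtime_versions(data):
    """Return the sorted list of libgfortran ABI numbers mentioned in a binary blob."""
    return sorted({int(match.group(1)) for match in GFORTRAN_RE.finditer(data)})


def check_dll(path):
    """Read a DLL from disk and report which libgfortran versions it references."""
    return gfortran_runtime_versions(Path(path).read_bytes())
```

For example, `check_dll("libopenblas.dll")` (hypothetical path) would be expected to return `[3]` for a binary linked against the older runtime and `[5]` for one built with the Debian 10 toolchain described above.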

Any ideas how to fix the issue?

It is probably worth trying an older compiler version; however, I don't know how to do that (I'm a foreigner in the Linux world). Could someone please explain it a little if the idea is worth trying?

@Arech Arech changed the title Problems with LAPACKE_sgesvd() on custom compiled v0.3.7 for Win64 (new x86_64-w64-mingw32 might be the cause) Issue with LAPACKE_sgesvd() on custom compiled v0.3.7 for Win64 (new x86_64-w64-mingw32 might be the cause) Oct 30, 2019
@brada4
Contributor

brada4 commented Oct 30, 2019

You are correct - one has to copy the gfortran redistributable from the cross-build system; the -3 and -5 are internal ABI versions and they are not interchangeable.

The _FPCSR flag could be planted into single-arch builds where it improves things a lot - that's essentially trading strict IEEE float conformance for significant performance on old to very old gear. ??? @martin-frbg ???

Does anything change if you use TARGET=OPTERON or OPTERON_SSE3 ?
The other setting of the _FPCSR build flag?
Setting OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1 before run?

Between the mentioned releases of OpenBLAS and GCC (You got 8.2.0?) - the default gfortran ABI changed slightly, and fresh GCC got more picky about assembly registers.

@Arech
Author

Arech commented Oct 30, 2019

You got 8.2.0?

No, it's 8.3

root@debian:~/my_dev/OpenBLAS-0.3.7# x86_64-w64-mingw32-gcc --version
x86_64-w64-mingw32-gcc (GCC) 8.3-win32 20190406
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@debian:~/my_dev/OpenBLAS-0.3.7# x86_64-w64-mingw32-gfortran --version
GNU Fortran (GCC) 8.3-win32 20190406
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@debian:~/my_dev/OpenBLAS-0.3.7# gcc --version
gcc (Debian 8.3.0-6) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I'll try other things you've mentioned and post updates later.

@Arech
Author

Arech commented Oct 30, 2019

Setting OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1 before run?

No change

TARGET=OPTERON

No change

/*
BTW, the TARGET=OPTERON build does have

#define OPENBLAS_HAVE_3DNOW 
#define OPENBLAS_HAVE_3DNOWEX

defined in \include\openblas_config.h, while TARGET=BARCELONA doesn't, though it seems that these processors do support the 3DNow & 3DNowExt instruction sets. Shouldn't TARGET=BARCELONA have them defined too?
*/

TARGET=OPTERON_SSE3

No change

TARGET=BARCELONA CONSISTENT_FPCSR=0

Kind of no change. For the same input (the RNG is seeded with the same constant) it outputs

0x000000000D5700C0         -nan(ind)        -nan(ind)        -nan(ind)        -nan(ind)  
0x000000000D5700D0         -nan(ind)        -nan(ind)        -nan(ind)        -nan(ind)  
0x000000000D5700E0         -nan(ind)        -nan(ind)        -nan(ind)        -nan(ind)  
...

or in bytes

0x000000000D5700C0  00 00 c0 ff 00 00 c0 ff 00 00 c0 ff 00 00 c0 ff  
0x000000000D5700D0  00 00 c0 ff 00 00 c0 ff 00 00 c0 ff 00 00 c0 ff  
0x000000000D5700E0  00 00 c0 ff 00 00 c0 ff 00 00 c0 ff 00 00 c0 ff  
...

@brada4
Contributor

brada4 commented Oct 30, 2019

If you compile 0.2.20 with Debian 10 - does it work?
Do you build numpy with same CC/FC options as OpenBLAS?

@Arech
Author

Arech commented Oct 30, 2019

Do you build numpy with same CC/FC options as OpenBLAS?

I use neither numpy nor Python at all; it's a purely C++ project. I mentioned the Python script just to give an idea of what kind of work gesvd() does in the project (Python is easier to comprehend than the C++ code I linked to).

If you compile 0.2.20 with Debian 10 - does it work?

I have just downloaded the code and tried to compile it: compilation breaks in the middle (and I don't know how to fix it - is it possible to make the error messages more verbose? There's not even a line number here...)

x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_G77 -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=6 -DASMNAME=cgemm_tr -DASMFNAME=cgemm_tr_ -DNAME=cgemm_tr_ -DCNAME=cgemm_tr -DCHAR_NAME=\"cgemm_tr_\" -DCHAR_CNAME=\"cgemm_tr\" -DNO_AFFINITY -I../.. -UDOUBLE  -DCOMPLEX  -c -UDOUBLE -DCOMPLEX -DTR gemm.c -o cgemm_tr.obj
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_G77 -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=6 -DASMNAME=cgemm_cr -DASMFNAME=cgemm_cr_ -DNAME=cgemm_cr_ -DCNAME=cgemm_cr -DCHAR_NAME=\"cgemm_cr_\" -DCHAR_CNAME=\"cgemm_cr\" -DNO_AFFINITY -I../.. -UDOUBLE  -DCOMPLEX  -c -UDOUBLE -DCOMPLEX -DCR gemm.c -o cgemm_cr.obj
<command-line>: error: expected identifier or ‘(’ before numeric constant
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_G77 -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=6 -DASMNAME=cgemm_rn -DASMFNAME=cgemm_rn_ -DNAME=cgemm_rn_ -DCNAME=cgemm_rn -DCHAR_NAME=\"cgemm_rn_\" -DCHAR_CNAME=\"cgemm_rn\" -DNO_AFFINITY -I../.. -UDOUBLE  -DCOMPLEX  -c -UDOUBLE -DCOMPLEX -DRN gemm.c -o cgemm_rn.obj
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_G77 -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=6 -DASMNAME=cgemm_rt -DASMFNAME=cgemm_rt_ -DNAME=cgemm_rt_ -DCNAME=cgemm_rt -DCHAR_NAME=\"cgemm_rt_\" -DCHAR_CNAME=\"cgemm_rt\" -DNO_AFFINITY -I../.. -UDOUBLE  -DCOMPLEX  -c -UDOUBLE -DCOMPLEX -DRT gemm.c -o cgemm_rt.obj
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_G77 -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=6 -DASMNAME=cgemm_rc -DASMFNAME=cgemm_rc_ -DNAME=cgemm_rc_ -DCNAME=cgemm_rc -DCHAR_NAME=\"cgemm_rc_\" -DCHAR_CNAME=\"cgemm_rc\" -DNO_AFFINITY -I../.. -UDOUBLE  -DCOMPLEX  -c -UDOUBLE -DCOMPLEX -DRC gemm.c -o cgemm_rc.obj
make[1]: *** [Makefile:365: cgemm_cr.obj] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/root/my_dev/OpenBLAS-0.2.20/driver/level3'
make: *** [Makefile:139: libs] Error 1
root@debian:~/my_dev/OpenBLAS-0.2.20#

@martin-frbg
Collaborator

IIRC, the cgemm_cr problem with 0.2.20 was a "funny" quirk of mingw where it misinterprets the CR define from the command line as a literal "carriage return" or something like that. Later versions have the define (and corresponding ifdef in the code) changed to "XCR" to work around this.
No idea about the missing HAVE_3DNOW on Barcelona, could have been a simple oversight years ago but as far as I can tell it does not play a role in any of the BLAS kernels actually used by the BARCELONA target (except a prefetch instruction in the zgemm copy-out helper, but I doubt it would have much impact on performance). Most likely this reflects the transition from 3dnow to SSE instructions.
Newer gcc/gfortran has an ABI issue where it expects the C code to pass length arguments even to single-character string arguments in calls to FORTRAN routines, but this should be taken care of in 0.3.7 by adding a compiler option (-fno-optimize-sibling-calls) automatically.
The recent windows binaries were built with a version of the MXE cross-compiler environment that appears to be based on (mingw) gcc-5.5.0

@Arech
Author

Arech commented Oct 30, 2019

but this should be taken care of in 0.3.7 by adding a compiler option (-fno-optimize-sibling-calls) automatically.

Seems so. Tried to compile with CFLAGS="-fno-optimize-sibling-calls" CXXFLAGS="-fno-optimize-sibling-calls". It produced a .dll of the same size in which only some 6 bytes differ (2 WORDs in the PE header and 1 WORD somewhere in the middle). No change...

MXE env looks intimidating 😁

@brada4
Contributor

brada4 commented Oct 30, 2019

0.3.7 already integrates that flag, so there will be no change. What is important is that the numpy build has to use that flag too.

@Arech
Author

Arech commented Oct 30, 2019

IIRC, the cgemm_cr problem with 0.2.20 was a "funny" quirk of mingw where it misinterprets the CR define from the command line as a literal "carriage return" or something like that. Later versions have the define (and corresponding ifdef in the code) changed to "XCR" to work around this.

v0.3.0 also fails at the same point. v0.3.1 compiled successfully, but no change for the issue when I run my code.

@brada4 Did you read my previous comment about python? Numpy is completely irrelevant to the issue.

And btw, I just tried to compute on doubles instead of floats using my base v0.3.7. LAPACKE_dgesvd() works excellently, no problems here. I'd say the fact that LAPACKE_dgesvd() works as expected may hint that the root cause of the issue lies not only in compiler quirks, but in OpenBLAS's code too.

@Arech
Author

Arech commented Oct 30, 2019

Oh... If only I had known ahead that it'd take almost two days to switch to a newer version... But finally I made it work. I've just managed to install mingw-w64 from the previous Debian release. The compiler from there has version 6.3.0 and links to the same -3 Fortran ABI .dll... Now both functions - single precision LAPACKE_sgesvd() and double precision LAPACKE_dgesvd() - work as expected.

@brada4
Contributor

brada4 commented Oct 30, 2019

It is unlikely to be a Fortran issue.
It is the new gcc using unmarked registers through assembly code, at least in v9.
Before, it tried to scan the assembly section and detect which registers are used inside.
EDIT
Numpy SVD uses [sd]gesdd - does that fail just as blatantly as _gesvd ?
It is just one function losing one register somewhere inside, so far kernels differing between BARCELONA and OPTERON are "cleared" , still like 80 left, too much to look through manually.

@martin-frbg
Collaborator

Have now learned that it is possible to rebuild MXE with a more recent gcc, will provide an updated windows package (or possibly two, with and without the FPCSR thingy) when I find the time.
The fortran ABI issue is indeed mostly conjecture - there have been posts in other projects reporting mysterious hard to reproduce crashes that went away when a or the workaround was employed. Not sure I understand the comment about "unmarked registers", hopefully all the earlier cases of wrong constraints and writes to input-only registers have been addressed in recent releases.

@Arech
Author

Arech commented Oct 31, 2019

@martin-frbg

Have now learned that it is possible to rebuild MXE with a more recent gcc, will provide an updated windows package (or possibly two, with and without the FPCSR thingy) when I find the time.

That would be a good thing, if it's not too burdensome for you.

Regarding the issue... I personally think it's very important that the issue is easily reproducible, because that makes it much easier to find the root cause. If it's actually an issue in OpenBLAS, it could easily be fixed. If the compiler is to blame, it would be very beneficial for the whole community if a reproducible compiler bug report were created.

Should I make a short isolated program to reproduce the issue exactly as I saw it, so some of you (who are sufficiently familiar with the OpenBLAS code) could debug the library and find the real cause?

@martin-frbg
Collaborator

If it is not too much trouble for you, an isolated test code would be great - that way it could hopefully be established if this is a bug in OpenBLAS or in recent mingw ports of gcc.

@Arech
Author

Arech commented Oct 31, 2019

No problem, I'll post it soon

@Arech
Author

Arech commented Oct 31, 2019

@martin-frbg please take a look at the repo https://github.com/Arech/sgesvd_tester

Note that in a conventional FP mode sgesvd() from the buggy binary is actually able to catch and return an error (however, it still produces NaNs in the output). It silently returns success with junk in the output when the FP rounding mode is set to "round towards zero". The proper binary works great in both modes.

Feel free to ping me if you need some more info/help.

@brada4
Contributor

brada4 commented Oct 31, 2019

@Arech and how would reference BLAS and MKL react to your FPU compliance diversions?

@Arech
Author

Arech commented Oct 31, 2019

@brada4 Andrew, I'd like you to answer me the following two questions first before I answer yours:

  1. What source do you have to support your claim about "compliance diversions"?
  2. Did you notice that the issue appears in absolutely standard FPU state using compiler v8.3 and doesn't appear for any other compiler?

@brada4
Contributor

brada4 commented Oct 31, 2019

Your code changes FPU flags only on calling thread, others remain in default state, you need to change FPCSR code in blas_server_win32.c to propagate your setting to all threads, that is a limitation of mingw32. Maybe then it works properly with strange rounding modes too; they certainly produce different results from standard conditions in all cases.

@martin-frbg
Collaborator

martin-frbg commented Oct 31, 2019

you need to change FPCSR code in blas_server_win32.c to propagate your setting to all threads, that is a limitation of mingw32.

Makes me wonder if (the effect of) setting CONSISTENT_FPCSR=1 could/should be done automatically at compile time if defined(__MINGW32__) (Still have to reread #1237, but back then I was probably more respectful of what I assumed to be conscious coding decisions taken by earlier authors)

@brada4
Contributor

brada4 commented Nov 1, 2019

MKL propagates user-set FPU config to all threads. I am puzzled why here even single-threaded case was failing.
Slowness is a concern only on older CPUs:
https://www.cs.uaf.edu/~olawlor/papers/2005/denormal/lawlor_denormal_2005.pdf , the absent FPCSR (and respective other SSE register) propagation is a pain on Windows.
What I found strange is the reproducer, which does not match the problem statement, and the obviously flawed RNG up there.

@Arech
Author

Arech commented Nov 1, 2019

@brada4

Your code changes FPU flags only on calling thread, others remain in default state, you need to change FPCSR code in blas_server_win32.c to propagate your setting to all threads...

Now I think I got your idea. If I understand it correctly, there is a possibility that:

  • in case of non-standard FPU config (rounding to zero) sgesvd() fails to catch convergence error because of different FP handling in worker threads (and it has a right to do so), and that is the reason why it returns success while actually it's an error with all accompanying junk in output vars
  • in case of standard FPU config sgesvd() is able to catch the error, which is why it returns it. The different sgesvd() behaviour with different compiler versions (no error on 6.3, error on 8.3) may be due not to some issue with OpenBLAS or with the new compiler, but to the new compiler legitimately reordering some math instructions (if compiled with a non-strict -fmath switch /*is it so in reality?*/), so the code may produce a slightly different result, and that is enough to generate a convergence error in single precision while double is still OK.

Well... that is fair point. I was going to test it, but

I am puzzled why here even single-threaded case was failing.

Indeed. Setting the env variables OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1 changes nothing... I even pushed an update to the test that confirms the state of these variables, and double-checked in the debugger that OpenBLAS is indeed single-threaded.


Now, regarding the non-standard FPU state and whether it's necessary to propagate it to worker threads...

Unfortunately, I don't remember exactly, why I chose to use this setting in my main codebase. According to a comment, it seems that it was intended to prevent NaNs from occurring in vectorized form of ::std::exp() function (that is possibly a quirk of either the compiler or my hardware, because denormals were already disabled at that moment, but NaNs kept occurring from ::std::exp() anyway)..

Therefore, as long as OpenBLAS functions do not produce NaNs (and it seems that setting CONSISTENT_FPCSR=1 is enough to achieve that), there's no need at all to push that non-conventional FPU config to it, at least for my task. It is actually my bug that I forgot to restore the FPU state back to normal before calling OpenBLAS, and I'm going to fix it.

So, to reiterate, for me personally (I don't know about other use cases) there's no need for OpenBLAS to support non-conventional FPU configs as long as it doesn't produce denormal numbers.

What really bothers me is that even in totally standard FPU config sgesvd() always converges when it was compiled with v6.3 and always fails when it was compiled with v8.3.

Any ideas why it happens?

@brada4
Contributor

brada4 commented Nov 1, 2019

This injustice is not fixable. Just like netlib blas here IEEE754 conformant FPU is expected, with all NaN signalling etc. For those caring less denormals can be disabled.
You know, rounding here or there changes things away from standard behaviour and gives different results.

@Arech
Author

Arech commented Nov 2, 2019

@brada4 What exactly is "this injustice" you're talking about?

@martin-frbg Did you understand Andrew's point? Do you agree with him?

@martin-frbg
Collaborator

martin-frbg commented Nov 2, 2019

Probably something lost in translation... @brada4 could you rephrase your comment ?
But part of the problem would likely be that netlib LAPACK/LAPACKE is not yet fully prepared to deal with NaN, so perhaps deficiency or incapability was meant ?

@brada4
Contributor

brada4 commented Nov 2, 2019

Things like -ffast-float were mentioned, that's much worse than having just denormals wiped away. Then example of random input - obvious increments. Then code programming FPU rounding modes.... Not sure which breaks LAPACK and which OpenBLAS....

There is a shortcoming that FPU modes are not distributed on each call to threads like MKL does.

@Dr-Desty-Nova

Have same problem, can I somehow obtain the configuration/flags of the MXE environment that @martin-frbg uses?

@brada4
Contributor

brada4 commented Feb 20, 2021

Same problem to which?

@Dr-Desty-Nova

Same problem as in OP post. Custom compiled 0.3.7 produces weird results in _sgesvd(). Tried 0.3.12, same issue. MinGW version 9.3.

Flag set:

TARGET=NEHALEM
BINARY=64
USE_THREAD=0
USE_LOCKING=1
NUM_THREADS=200
HOSTCC=gcc 
CC=x86_64-w64-mingw32-gcc
FC=x86_64-w64-mingw32-gfortran

@brada4
Contributor

brada4 commented Feb 23, 2021

What is your CPU? Like off CPU-Z screenshot will do.
I think Skylake-X was fresh then at 0.3.7, and some threading changes were found unstable later. It is really important to test the latest release, where both are fixed. 0.2.20 has no AVX-512, and old but stable heavy thread locking that sometimes adds 50% to the execution time, but the result is accurate. FreeBSD still had that version for some architectures until very recently.
EDIT: sure, not today or urgently, just when you are well at work.

@Dr-Desty-Nova

What is your CPU? Like off CPU-Z screenshot will do.

Here it is (Coffee Lake, I think?):

processor       : 11
vendor_id       : GenuineIntel
cpu family      : 6
model           : 158
model name      : Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
stepping        : 10
microcode       : 0xde
cpu MHz         : 4299.998
cache size      : 12288 KB
physical id     : 0
siblings        : 12
core id         : 5
cpu cores       : 6
apicid          : 11
initial apicid  : 11
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds
bogomips        : 6399.96
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

@Dr-Desty-Nova

Tried with 0.3.13, flag set

DYNAMIC_ARCH=1
DYNAMIC_OLDER=1
CONSISTENT_FPCSR=1
TARGET=CORE2
BINARY=64
USE_THREAD=0
USE_LOCKING=1
NUM_THREADS=200
HOSTCC=gcc 
CC=x86_64-w64-mingw32-gcc
FC=x86_64-w64-mingw32-gfortran

same freeze:

0	ntoskrnl.exe!KeSynchronizeExecution+0x5b66
1	ntoskrnl.exe!KeWaitForMutexObject+0x1460
2	ntoskrnl.exe!KeWaitForMutexObject+0x98f
3	ntoskrnl.exe!KeWaitForMutexObject+0x233
4	ntoskrnl.exe!ExWaitForRundownProtectionRelease+0x7dd
5	ntoskrnl.exe!KeWaitForMutexObject+0x3a29
6	ntoskrnl.exe!KeSynchronizeExecution+0x3140
7	libopenblas.dll!SLARTG+0x22b
8	libopenblas.dll!SBDSQR+0xebf
9	libopenblas.dll!SGESVD+0x2118

I see 12 threads (I have 12 logical cores) started by openblas.dll, which is very weird, as I have USE_THREAD=0 in my flag set

Each of them has this stack trace:

0	ntoskrnl.exe!KeSynchronizeExecution+0x5b66
1	ntoskrnl.exe!KeWaitForMutexObject+0x1460
2	ntoskrnl.exe!KeWaitForMutexObject+0x98f
3	ntoskrnl.exe!KeWaitForMutexObject+0x233
4	ntoskrnl.exe!ExWaitForRundownProtectionRelease+0x7dd
5	ntoskrnl.exe!KeWaitForMutexObject+0x3a29
6	ntoskrnl.exe!KeWaitForMutexObject+0x1787
7	ntoskrnl.exe!KeWaitForMutexObject+0x98f
8	ntoskrnl.exe!KeWaitForMultipleObjects+0x2be
9	ntoskrnl.exe!ObWaitForMultipleObjects+0x2f0
10	ntoskrnl.exe!FsRtlCancellableWaitForMultipleObjects+0x229
11	ntoskrnl.exe!setjmpex+0x7cc5
12	ntdll.dll!ZwWaitForMultipleObjects+0x14
13	KERNELBASE.dll!WaitForMultipleObjectsEx+0xf0
14	KERNELBASE.dll!WaitForMultipleObjects+0xe
15	libopenblas.QVLO2T66WEPI7JZ63PS3HMOHFEY472BC.gfortran-win_amd64.dll!openblas_read_env+0x390

@brada4
Contributor

brada4 commented Feb 25, 2021

Strange behaviour from fortran.
Try setting:
OPENBLAS_NUM_THREADS=1
and
OMP_NUM_THREADS=1
Just speculation that fortran reads environment somehow, that one of those may apply overriding your wish at build time.
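
Since the library is used through Python bindings later in this thread, the suggestion above can be sketched in Python. The assumption here is that OpenBLAS reads these variables when the library initialises, so they have to be in the environment before anything that loads libopenblas is imported; the helper name `force_single_thread` is made up for illustration.

```python
import os

THREAD_VARS = ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS")


def force_single_thread(env=None):
    """Set both thread-count variables to "1" in the given mapping (os.environ by default)."""
    if env is None:
        env = os.environ
    for name in THREAD_VARS:
        env[name] = "1"
    # Return the effective settings so the caller can log or verify them.
    return {name: env[name] for name in THREAD_VARS}


if __name__ == "__main__":
    print(force_single_thread())
    # Only after this point import or load whatever pulls in libopenblas.
```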

@Dr-Desty-Nova

Dr-Desty-Nova commented Feb 25, 2021

@brada4 this helped, I no longer see libopenblas.*.gfortran*!openblas_read_env threads, now there's only one thread for the whole process. But it is still frozen in this:

ntoskrnl.exe!KeSynchronizeExecution+0x5b66
ntoskrnl.exe!KeWaitForMutexObject+0x1460
ntoskrnl.exe!KeWaitForMutexObject+0x98f
ntoskrnl.exe!KeWaitForMutexObject+0x233
ntoskrnl.exe!ExWaitForRundownProtectionRelease+0x7dd
ntoskrnl.exe!KeWaitForMutexObject+0x3a29
ntoskrnl.exe!KeSynchronizeExecution+0x3140
libopenblas.dll!SBDSQR+0x796
libopenblas.dll!SGESVD+0x2118

So there's nothing in other threads it could wait for, that's for sure. Some weird issue with USE_LOCKING?

@brada4
Contributor

brada4 commented Feb 25, 2021

Threads should not have happened at all in the first place. KeWhatever are driver functions; SBDSQR will just make use of a few library calls down the call chain and would not even read a file or display a pixel to touch OS drivers, let alone kernel mode.

@martin-frbg
Collaborator

Very strange. All code that queries the environment for the number of threads at runtime should already be guarded by #ifdef SMP, and SMP is unset in Makefile.system when USE_THREAD=0 was specified. Did you run make clean between build attempts with varied options ?

@Dr-Desty-Nova

I ran git reset --hard && git clean -fdx and don't use any ccache; I hope that's sufficient.

@Dr-Desty-Nova

Note: I didn't use make clean because in MSYS2 Windows environment calls to find were rerouted to C:\Windows\system32\find.exe which has completely different syntax.

@martin-frbg
Collaborator

martin-frbg commented Feb 25, 2021

Is your own test code multithreaded by any chance ? (Still trying to understand where the ntoskrnl calls come from - SGESVD/SBDSQR is plain old single-threaded fortran from the netlib reference implementation, it will call the OpenBLAS SROT kernel which on x86_64 will try to run parallel tasks - but again only #ifdef SMP at compile time.)

@brada4
Contributor

brada4 commented Feb 25, 2021

Thread bt would wildly diverge if outer threads would call into library.
I (linux-mingw-)built 0.2.20 and 0.3.13 and devel - actually lapack included it imports gomp, no mention of any omp in make's output.

@Dr-Desty-Nova

Is your own test code multithreaded by any chance ?

I use OpenBLAS through Python bindings and Kaldi ASR but no, this particular example is one thread only.

@martin-frbg
Collaborator

hmm. can you upload your Makefile.conf and config.h please ? Maybe one of these generated files contains a clue. And if you are building the 0.3.13 tag and not current develop, you will be missing the fix from #3111 if the hang happens during thread shutdown. (Though it looks more like thread startup, and there should not be any threads starting anyway).

@Dr-Desty-Nova

Sure.

Yes, I've been building from v0.3.13 tag. Will try develop tomorrow.

@Dr-Desty-Nova

Dr-Desty-Nova commented Feb 26, 2021

Please disregard my last comments, looks like I was still copying from old location with 0.3.7. I'll get proper tests shortly.

@Dr-Desty-Nova

I don't see freezes now, but sgesvd output still doesn't match what was in binaries from MXE. Rogue threads problem also remains, and make clean in MSYS2 still invokes find.exe from C:\Windows\system32.

I'll try more tests with different flag sets and building from MSYS2 MinGW 5.x/6.x/8.x/10.x and Linux MinGW 5.x/6.x/8.x/10.x.
Will report here how it went.

@brada4
Contributor

brada4 commented Feb 26, 2021

probably you need to install findutils on msys2 to override windows tool

@brada4
Contributor

brada4 commented Feb 26, 2021

If you built multi-processor version and constrain it with environment variables? Those fortran-borne threads are not expected.

@Dr-Desty-Nova

Ok, so here's the thing: in my case it boils down to USE_THREAD=0. When I have USE_THREAD=1, everything works perfectly. When I have USE_THREAD=0, there are SEH: Unhandled exception 0xC0000005, hangs/freezes and strange _sgesvd() results. I've tried all MinGW versions starting from 4.9.x up to 10.x.

Now, I always have NUM_THREADS=200, regardless of USE_THREAD, because I've read a lot of issues in this repository and I see mentions that this needs to be set because that's how GotoBLAS worked, even in single-threaded mode. There's a possibility that these options (and USE_LOCKING=1) interact in a weird way.

Should I create another ticket for this issue? Don't want piggybacking here as in the end my issue seems to be unrelated to the OP's post.

@Dr-Desty-Nova

probably you need to install findutils on msys2 to override windows tool

I checked and it's installed. There are both /usr/bin/find and C:\Windows\system32\find; it's just that the system path precedes the MSYS2 one. I don't know if that's the default, my MSYS2 installation is pretty old, just letting you know this can happen.
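
The precedence problem described above is easy to check: whichever directory comes first in PATH wins. A minimal sketch that lists every candidate for a command in PATH order follows; the function name `resolve_all` and the fixed extension list are assumptions made for the example, not how MSYS2 or make actually resolves commands.

```python
import os


def resolve_all(name, path_dirs, exts=("", ".exe")):
    """Return every file matching `name` in PATH order; the first entry is the one that gets run."""
    hits = []
    for directory in path_dirs:
        for ext in exts:
            candidate = os.path.join(directory, name + ext)
            if os.path.isfile(candidate):
                hits.append(candidate)
    return hits


if __name__ == "__main__":
    # Print all `find` candidates visible from the current environment.
    for hit in resolve_all("find", os.environ.get("PATH", "").split(os.pathsep)):
        print(hit)
```

If the first line printed is the one under C:\Windows\system32, putting the MSYS2 /usr/bin directory earlier in PATH for the build shell should make make clean pick up the GNU find instead.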

@brada4
Contributor

brada4 commented Feb 27, 2021

0xC0000005 is an access violation, often caused by a null pointer: either the result of a failed malloc, or uninitialized RAM, or something else being passed on as a pointer to a supposedly existing memory area.
Try running the crashing sample under x64dbg; with luck it will decode all arguments to their names. Look for pointers (A, B, C) that are zero somewhere in the call chain. Every call in the chain from the entry into OpenBLAS up to that point is a likely test case; the closer to the crash it is while still being repeatable, the better.
The next step is slightly more complicated, thus optional: with the working threaded sample, set a breakpoint at the crashing function and compare the entry values. With ASLR they shift between runs, but zeroes should be perfectly distinguishable.

PS: Keep this ticket. USE_THREAD triggering the crash already narrows the search for the bug to the ifdefs that depend on it.
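
The comparison of entry values between a working and a crashing run can be sketched as below. This is only an illustration in Python with hypothetical argument values (nothing here comes from a real debugging session). It ignores absolute pointer values, since ASLR changes them between runs, and reports only the positions where one run passes zero and the other does not:

```python
def null_positions(args):
    """Indices of arguments that are exactly zero (candidate NULL pointers or zero sizes)."""
    return [index for index, value in enumerate(args) if value == 0]


def compare_entry_values(working, crashing):
    """Report positions where one run passes zero and the other does not.

    Absolute pointer values are ignored on purpose: with ASLR they differ
    between runs even when nothing is wrong.
    """
    if len(working) != len(crashing):
        raise ValueError("both runs must record the same number of arguments")
    return [
        index
        for index, (good, bad) in enumerate(zip(working, crashing))
        if (good == 0) != (bad == 0)
    ]


# Hypothetical entry values (not taken from a real session).
working_run = [0x40, 0x311, 0x0000021A7C3E0040, 0x40, 0x0000021A7C3F0040]
crashing_run = [0x40, 0x311, 0x000001F2B55E0040, 0x40, 0x0]
suspects = compare_entry_values(working_run, crashing_run)
```

An empty result would suggest that the crash is not a plain NULL argument and that sizes or buffer bounds deserve a closer look.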

@Dr-Desty-Nova

Looks like AVX disassembly to me. Intermediate details:

Exception info
EXCEPTION_DEBUG_INFO:
           dwFirstChance: 1
           ExceptionCode: C0000005 (EXCEPTION_ACCESS_VIOLATION)
          ExceptionFlags: 00000000
        ExceptionAddress: 000000006D945324 libopenblas.000000006D945324
        NumberParameters: 2
ExceptionInformation[00]: 0000000000000001 Write
ExceptionInformation[01]: 00000000B10E0000 Inaccessible Address
First chance exception on 000000006D945324 (C0000005, EXCEPTION_ACCESS_VIOLATION)!
No functions
CPU view: [image]
Memory map: [image]
Thread stack: [image]
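
For reference, the two `ExceptionInformation` entries in the record above can be decoded mechanically. This is a minimal sketch in Python; the function name and the `looks_like_null` heuristic are illustrative assumptions. The meaning of the first entry (0 = read, 1 = write, 8 = DEP violation) and of the second entry (the inaccessible address) follows the Windows `EXCEPTION_RECORD` documentation:

```python
EXCEPTION_ACCESS_VIOLATION = 0xC0000005

# Meaning of ExceptionInformation[0] for an access violation (Windows EXCEPTION_RECORD).
ACCESS_KINDS = {0: "read", 1: "write", 8: "execute (DEP)"}


def describe_access_violation(code, information):
    """Summarise an access violation from its code and ExceptionInformation array."""
    if code != EXCEPTION_ACCESS_VIOLATION:
        raise ValueError(f"not an access violation: {code:#010x}")
    if len(information) < 2:
        raise ValueError("expected two ExceptionInformation entries")
    address = information[1]
    return {
        "kind": ACCESS_KINDS.get(information[0], "unknown"),
        "address": address,
        # Heuristic (assumption): a fault in the lowest 64 KB usually means a NULL
        # pointer, possibly with a small offset added.
        "looks_like_null": address < 0x10000,
    }


# Values copied from the exception record above.
record = describe_access_violation(0xC0000005, [0x1, 0x00000000B10E0000])
```

Here the faulting access is a write to 0xB10E0000, which is nowhere near zero, so this particular record looks less like a plain NULL dereference and more like a write outside a mapped region.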

@brada4
Contributor

brada4 commented Feb 27, 2021

Is it possible to expand the libopenblas.dll .text section to see which function the instruction pointer is in? For example by double-clicking the section.

7	libopenblas.dll!SLARTG+0x22b
8	libopenblas.dll!SBDSQR+0xebf
9	libopenblas.dll!SGESVD+0x2118

Also, for the 3 calls before the crash: do you see the args parsed, so that they can be matched to these three functions?

I think the write hits a guard page before the allocation? @martin-frbg?

@martin-frbg
Collaborator

No idea so far, perhaps need a build with DEBUG=1 (or -g) to see where it crashes (interesting that getrf2 appears to be implicated now)

@Dr-Desty-Nova

Dr-Desty-Nova commented Feb 28, 2021

Tried building with DEBUG=1. Couldn't use x64dbg as it lacks support for DWARF symbols. Here's what Dr. Mingw shows me:

diarization_api_demo.exe caused an Access Violation at location 000000006D98A09F in module libopenblas.dll Writing to location 0000000094B40000.

AddrPC           Params
000000006D98A09F 0000000000000020 0000000000000008 0000029B943E6070  libopenblas.dll!dgemm_oncopy  [..\kernel\x86_64\..\generic\gemm_ncopy_8.c @ 159]
000000006DAD451C 000000BB7197E470 0000000000000000 000000BB7197E360  libopenblas.dll!dgetrf_single  [getrf_single.c @ 130]
000000006DAD4345 000000BB7197E470 0000000000000000 0000000000000000  libopenblas.dll!dgetrf_single  [getrf_single.c @ 107]
000000006D7DB5AD 000000BB7197E594 000000BB7197E5B4 0000029B943DE070  libopenblas.dll!dgetrf_  [lapack\getrf.c @ 103]
00007FF94381DA2A 000000BB7197E594 000000BB7197E5B4 0000029B943DE070  xxxxxxxx.dll!kaldi::clapack_Xgetrf2  [c:\xxxxxxx\libs\kaldi-5.5.519\src\matrix\cblas-wrappers.h @ 400]
   398:                             double *Mdata, KaldiBlasInt *stride, KaldiBlasInt *pivot, 
   399:                             KaldiBlasInt *result) {
>  400:   dgetrf_(num_rows, num_cols, Mdata, stride, pivot, result);
   401: }
   402: 
00007FF943804049 000000BB7197E948 0000000000000000 0000000000000000  xxxxxxx.dll!kaldi::MatrixBase<double>::Invert  [c:\xxxxxxx\libs\kaldi-5.5.519\src\matrix\kaldi-matrix.cc @ 61]
    59:   }
    60: 
>   61:   clapack_Xgetrf2(&M, &N, data_, &LDA, pivot, &result);
    62:   const int pivot_offset = 1;
    63: #else
00007FF94374C03F 000000BB7197F190 000000BB7197F580 000000BB0000006F  xxxxxxxxx.dll!kaldi::Plda::ApplyTransform  [c:\xxxxxx\libs\kaldi-5.5.519\src\ivector\plda.cc @ 240]
   238:   // prior to diagonalization.
   239:   psi_mat.AddDiagVec(1.0, psi_);
>  240:   transform_invert.Invert();
   241:   within_var.AddMat2(1.0, transform_invert, kNoTrans, 0.0);
   242:   between_var.AddMat2Sp(1.0, transform_invert, kNoTrans, psi_mat, 0.0);

(It shows a different exe, not Python, as I'm trying to narrow it down and build a reproducible example; the calls are all the same.)
Investigating further...
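
The frame lines in the trace above have a regular layout, so they can be picked apart with a small parser when comparing several crash reports. This is a sketch in Python, assuming the layout shown above (program counter, three parameter words, `module!function`, optional `[source @ line]`); the regular expression and function name are illustrative and not part of Dr. Mingw. On x64 the Params column does not necessarily hold the real call arguments, so the values are hints at best:

```python
import re

# One frame line: PC, three parameter words, module!function, optional [source @ line].
FRAME = re.compile(
    r"^(?P<pc>[0-9A-Fa-f]{16})\s+"
    r"(?P<p0>[0-9A-Fa-f]{16})\s+(?P<p1>[0-9A-Fa-f]{16})\s+(?P<p2>[0-9A-Fa-f]{16})\s+"
    r"(?P<module>\S+?)!(?P<function>\S+)"
    r"(?:\s+\[(?P<source>.+) @ (?P<line>\d+)\])?\s*$"
)


def parse_frame(text):
    """Parse one frame line; return None for source-listing lines and anything else."""
    match = FRAME.match(text.strip())
    if match is None:
        return None
    return {
        "pc": int(match["pc"], 16),
        "params": [int(match[name], 16) for name in ("p0", "p1", "p2")],
        "module": match["module"],
        "function": match["function"],
        "source": match["source"],
        "line": int(match["line"]) if match["line"] else None,
    }


# Top frame copied from the trace above.
top_frame = parse_frame(
    r"000000006D98A09F 0000000000000020 0000000000000008 0000029B943E6070"
    r"  libopenblas.dll!dgemm_oncopy  [..\kernel\x86_64\..\generic\gemm_ncopy_8.c @ 159]"
)
```

Filtering the parsed frames by `module == "libopenblas.dll"` gives the OpenBLAS part of the call chain without the application frames.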

@martin-frbg
Collaborator

Looking more and more like another instance of "BUFFERSIZE too small for the desired GEMM_P/Q/R so we write past the end of the GEMM buffer"

@martin-frbg
Collaborator

Wild guess - could you try changing the #define REAL_GEMM_R (GEMM_R - GEMM_PQ) at the start of lapack\getrf_single.c
to #define REAL_GEMM_R (GEMM_R - 2*GEMM_PQ) please ?
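
To illustrate the kind of overrun suspected here, the following Python sketch uses a deliberately simplified model: one shared buffer that has to hold a packed A panel of `GEMM_P x GEMM_Q` elements plus a packed B panel of `GEMM_Q x GEMM_R` elements. The names mirror the macros mentioned above, but the arithmetic and all numbers are assumptions for illustration; the real OpenBLAS layout also involves alignment and offsets that are ignored here:

```python
def packed_bytes(gemm_p, gemm_q, gemm_r, element_size):
    """Bytes needed for one packed A panel (P x Q) plus one packed B panel (Q x R)."""
    return (gemm_p * gemm_q + gemm_q * gemm_r) * element_size


def largest_safe_r(buffer_size, gemm_p, gemm_q, element_size):
    """Largest R for which both panels still fit into the buffer (simplified model)."""
    remaining = buffer_size - gemm_p * gemm_q * element_size
    if remaining <= 0:
        return 0
    return remaining // (gemm_q * element_size)


def overrun_bytes(buffer_size, gemm_p, gemm_q, gemm_r, element_size):
    """How many bytes would be written past the end of the buffer (0 if it fits)."""
    return max(0, packed_bytes(gemm_p, gemm_q, gemm_r, element_size) - buffer_size)


# Hypothetical parameters: 32 MiB buffer, double precision (8 bytes per element).
BUFFER_SIZE = 32 << 20
GEMM_P, GEMM_Q, ELEMENT = 512, 256, 8

safe_r = largest_safe_r(BUFFER_SIZE, GEMM_P, GEMM_Q, ELEMENT)
fits = overrun_bytes(BUFFER_SIZE, GEMM_P, GEMM_Q, safe_r, ELEMENT)
too_big = overrun_bytes(BUFFER_SIZE, GEMM_P, GEMM_Q, safe_r + GEMM_Q, ELEMENT)

# Subtracting a margin from R, as in the suggested change, leaves headroom instead.
margin_r = safe_r - 2 * max(GEMM_P, GEMM_Q)
headroom = BUFFER_SIZE - packed_bytes(GEMM_P, GEMM_Q, margin_r, ELEMENT)
```

In this model, any R above the safe value writes past the end of the buffer, and a larger subtracted margin only helps if the overrun is smaller than the headroom it creates.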

@Dr-Desty-Nova

Wild guess - could you try changing the #define REAL_GEMM_R (GEMM_R - GEMM_PQ) at the start of lapack\getrf_single.c
to #define REAL_GEMM_R (GEMM_R - 2*GEMM_PQ) please ?

Didn't help; I'm now trying to build a more or less reproducible example and attach it here along with a debug build of the library.

@brada4
Contributor

brada4 commented Mar 5, 2021

It is probably worth setting a breakpoint at the s/dgemm entry and just reading out the size arguments; the code path does not depend on the rest of the matrices' contents.
