segmentation fault in dgemm_otcopy #1694

Closed
blgolden opened this issue Jul 21, 2018 · 15 comments
@blgolden

blgolden commented Jul 21, 2018

I've been using OpenBLAS on my Ubuntu 16.04 LTS systems for a few years with no issues. However, over the last month I have been getting a segmentation fault from dgemm_otcopy in one of my problems. The analysis runs every week, and the matrix I am factoring gets a little bigger every week. The fault only occurs occasionally (twice in the last 5 weeks) and only on some of my computers (2 out of 4, all virtually identically configured). When it does fail on a system, it is easily reproducible. The matrix being factored (using cholmod) is very large but very sparse. This week it is 12544654 by 12544654 with 71272674 nnz. The really strange thing is that it only segfaults when the executable is called from a shell script (bash). I can even make it fail with a 1-line script. But it never faults when I run the executable from the command line, and the answer is sensible.

The fault occurs regardless of which version of OpenBLAS I use (currently 0.2.20). At first I suspected it overran the stack, so I ran ulimit -s unlimited, and that actually changed the location where it faulted from a free memory call (classic stack overrun) to dgemm_otcopy.

Here's the backtrace,

(gdb) bt
#0  0x00007fc0cd6cdf3c in dgemm_otcopy_HASWELL () from /usr/local/lib/libopenblas.so.0
#1  0x00007fc0cc1e6488 in ?? () from /usr/local/lib/libopenblas.so.0
#2  0x00007fc0cc302349 in exec_blas () from /usr/local/lib/libopenblas.so.0
#3  0x00007fc0cc1e6cd0 in dsyrk_thread_LN () from /usr/local/lib/libopenblas.so.0
#4  0x00007fc0cc31a354 in dpotrf_L_parallel () from /usr/local/lib/libopenblas.so.0
#5  0x00007fc0d339b4f2 in dpotrf_ () from /usr/lib/liblapack.so.3
#6  0x0000000000406d40 in r_cholmod_super_numeric.isra ()
#7  0x0000000000432d13 in cholmod_l_super_numeric ()
#8  0x0000000000418647 in cholmod_l_factorize_p ()
#9  0x00000000004188b7 in cholmod_l_factorize ()
#10 0x000000000040a12b in main ()

Any thoughts, suggestions, etc where to look next? I have seen a few older reports of a similar failure with no resolutions.

Kind regards,
B

Edit: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz - 6 cores plus hyper-threading
I'll try both 0.3.1 and current development branch and get back....
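
Since the fault appears only when the executable is launched from a bash script, one thing that can be compared is the environment each context hands to the program. The following is a minimal sketch, not something anyone in this thread ran: it assumes you have captured the output of `env` once in the interactive shell and once inside the failing script, and it lists the variables that differ. The variable names in the demo (`OMP_NUM_THREADS`, `LD_LIBRARY_PATH`) and their values are purely illustrative, and nothing here establishes that the environment is the cause.

```python
def parse_env(dump: str) -> dict:
    """Parse the output of `env` into a dict (one NAME=value per line).

    Values that span several lines are not handled; this is only a sketch.
    """
    result = {}
    for line in dump.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            result[name] = value
    return result


def diff_env(interactive: str, scripted: str) -> dict:
    """Return {name: (interactive_value, scripted_value)} for variables that differ."""
    a, b = parse_env(interactive), parse_env(scripted)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Hypothetical dumps, e.g. captured with `env > some_file` in each context.
    interactive = "OMP_NUM_THREADS=6\nLD_LIBRARY_PATH=/usr/local/lib\n"
    scripted = "OMP_NUM_THREADS=6\n"
    for name, (inter, script) in diff_env(interactive, scripted).items():
        print(f"{name}: interactive={inter!r} script={script!r}")
```

A variable that is missing on one side shows up with `None` for that side. Resource limits (`ulimit -a`) are not environment variables and would have to be compared separately.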

@martin-frbg
Collaborator

#1137 could be related (misaligned input) - in that case changing the two instances of movaps in kernel/x86_64/copy_sse2.S to movups should make the problem go away. (This fix was never implemented due to my possibly unfounded concerns over performance impact on older/AMD systems and complete lack of feedback from others.)
Does the segfault still occur with 0.3.1 or the current develop branch? Perhaps running your code under valgrind would provide additional information (if your computer is fast enough and has enough memory to do this with the given matrix size).
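
For context on the movaps/movups suggestion: movaps requires its memory operand to sit on a 16-byte boundary and faults otherwise, while movups accepts any address. The sketch below only illustrates what "misaligned" means for a buffer of doubles. The helper `is_aligned` is a hypothetical name, and whether misalignment is actually involved in this segfault is not established anywhere in this thread.

```python
import ctypes


def is_aligned(address: int, boundary: int = 16) -> bool:
    """True if `address` is a multiple of `boundary` (16 bytes for movaps)."""
    return address % boundary == 0


if __name__ == "__main__":
    # A raw 64-byte block and a view of 4 doubles that starts 8 bytes into it.
    raw = (ctypes.c_char * 64)()
    shifted = (ctypes.c_double * 4).from_buffer(raw, 8)

    base = ctypes.addressof(raw)
    print("base address aligned to 16:   ", is_aligned(base))
    print("base + 8 bytes aligned to 16: ", is_aligned(ctypes.addressof(shifted)))
```

At most one of the two addresses can be 16-byte aligned, because they are 8 bytes apart. A double array that is valid for ordinary loads (8-byte aligned) can therefore still be an invalid operand for an aligned SSE move.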

@brada4
Contributor

brada4 commented Jul 21, 2018

HASWELL was added to openblas with 0.2.15 only
Ubuntu 16 ships 0.2.18
Are you sure it is 0.2.2, which is like five years old?

https://packages.ubuntu.com/source/xenial/openblas
You should really use this method:
https://github.com/xianyi/OpenBLAS/wiki/faq#debianlts

What the backtrace says is that liblapack.so.3 is a redirection to the apt-provided OpenBLAS (0.2.18), which calls an internal function from your own OpenBLAS build (lines 4 and 5 in your backtrace).

Can you provide a result from a consistent build, e.g. by adapting the instructions to your build of a known version of OpenBLAS, or by reverting completely to the apt package?
In principle, using netlib LAPACK with OpenBLAS as the BLAS should also work via update-alternatives. That will not call OpenBLAS internals, which are not a stable API, though it probably gives away a lot of potrf performance.
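
The mixed-library reading of the backtrace can be made explicit by grouping frames by the shared object gdb reports for them. This is a rough sketch that assumes the `#N 0x... in func () from /path` line layout seen in the backtrace above; the function name `frames_by_library` and the `<executable>` label for frames without a `from` part are made up for illustration.

```python
import re

# Matches gdb lines such as:
#   #5  0x00007fc0d339b4f2 in dpotrf_ () from /usr/lib/liblapack.so.3
FRAME = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+) \(.*?\)(?: from (\S+))?")


def frames_by_library(backtrace: str) -> dict:
    """Group (frame number, function) pairs by the library gdb reports."""
    groups = {}
    for line in backtrace.splitlines():
        match = FRAME.match(line.strip())
        if not match:
            continue  # not a frame line
        number, function, library = int(match.group(1)), match.group(2), match.group(3)
        groups.setdefault(library or "<executable>", []).append((number, function))
    return groups


if __name__ == "__main__":
    # Three frames copied from the backtrace in the original report.
    backtrace = """\
#4  0x00007fc0cc31a354 in dpotrf_L_parallel () from /usr/local/lib/libopenblas.so.0
#5  0x00007fc0d339b4f2 in dpotrf_ () from /usr/lib/liblapack.so.3
#6  0x0000000000406d40 in r_cholmod_super_numeric.isra ()
"""
    for library, frames in frames_by_library(backtrace).items():
        print(library, frames)
```

Seeing `dpotrf_` under `/usr/lib/liblapack.so.3` and `dpotrf_L_parallel` under `/usr/local/lib/libopenblas.so.0` is the cross-library call described above.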

@martin-frbg
Copy link
Collaborator

Fairly certain that 0.2.20 was meant here, as that was the last update that was linked on the openblas.net webpage before xianyi became mostly unavailable.

@blgolden
Author

blgolden commented Jul 21, 2018 via email

@brada4
Contributor

brada4 commented Jul 21, 2018

It is a mixed-versions problem.
With your system software set up as it is at present, you cannot deviate too far from 0.2.18 unless you read the FAQ.

@brada4
Contributor

brada4 commented Jul 25, 2018

Any success stories?

@blgolden
Author

blgolden commented Jul 25, 2018 via email

@brada4
Contributor

brada4 commented Jul 25, 2018

Can you show the output of update-alternatives --list, to make sure a consistent BLAS and LAPACK are used?

Installing the debug package (apt install libopenblas0-dbg) would help to decode source line numbers and function parameters.

@blgolden
Author

blgolden commented Jul 25, 2018 via email

@blgolden
Author

blgolden commented Jul 25, 2018 via email

@martin-frbg
Collaborator

Probably best to continue with the current develop branch. Did you try changing the movaps to movups yet?

@brada4
Contributor

brada4 commented Jul 26, 2018

On my freshly installed ubuntu:
libblas.so and liblapack.so point to files in /usr/lib/lapack (those are used by gcc). They should never ever be OpenBLAS (or ATLAS or MKL or whatever).
The *.3 files point to their /usr/lib/openblas/ counterparts (those are for ld.so at runtime, and this is the place for an accelerated BLAS).
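
To see where these names end up on a given machine, the symlinks can be followed one hop at a time. The sketch below is only an illustration: `symlink_chain` is a hypothetical helper, and apart from /usr/lib/liblapack.so.3 (which appears in the backtrace) the paths in the demo are assumptions about where the files live and may differ on your system.

```python
import os


def symlink_chain(path: str, limit: int = 16) -> list:
    """Follow symlinks one hop at a time and return every path visited."""
    chain = [path]
    while os.path.islink(path) and len(chain) <= limit:
        target = os.readlink(path)
        # A relative target is resolved against the directory holding the link.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        chain.append(path)
    return chain


if __name__ == "__main__":
    # Assumed locations; adjust to your system. Missing paths print as-is.
    for name in (
        "/usr/lib/libblas.so",
        "/usr/lib/liblapack.so",
        "/usr/lib/libblas.so.3",
        "/usr/lib/liblapack.so.3",
    ):
        print(" -> ".join(symlink_chain(name)))
```

The last entry of each chain is the file that is actually used, which makes it easy to spot a build-time name and a runtime name that resolve to different BLAS or LAPACK implementations.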

There is no /usr/lib/libopenblasp-r0.2.18.so in my Ubuntu 16.04 (xenial).
Since it was used by GCC when compiling your module, you will probably need to recompile it (maybe even both cholmod and the executable).

Could you please uninstall all OpenBLAS apt packages and remove any spurious files until the alternatives point to /usr/lib/lapack/*? (It probably takes too long to run tests at this point.)
Then install the Ubuntu OpenBLAS package (libopenblas-dev) and test again.
Then install 0.3.1 or the development version in accordance with the FAQ, and test for real.

@martin-frbg
Collaborator

Any progress?

@blgolden
Author

blgolden commented Aug 16, 2018 via email

@brada4
Contributor

brada4 commented Aug 16, 2018

Ubuntu suitesparse should link to supplanted blas without problem, if old version is acceptable.
