
segfault in dsyrk_thread_UT #2821


Closed
colinfang opened this issue Sep 4, 2020 · 10 comments

@colinfang

I hit a segfault when using scipy linked against OpenBLAS 0.3.10.

It only seems to happen when OPENBLAS_NUM_THREADS is greater than 1.

Any advice on what could be going wrong here? I'm not very familiar with GDB; how can I extract more useful information to help with debugging?

#0  0x00007ffff4d7eb98 in dsyrk_thread_UT ()
   from /home/jupyterhub/miniconda3/envs/colin/lib/python3.7/site-packages/numpy/core/../../../../libopenblas.so.0
#1  0x00007ffff4ebf03c in dpotrf_U_parallel ()
   from /home/jupyterhub/miniconda3/envs/colin/lib/python3.7/site-packages/numpy/core/../../../../libopenblas.so.0
#2  0x00007ffff4c87060 in dpotrf_ ()
   from /home/jupyterhub/miniconda3/envs/colin/lib/python3.7/site-packages/numpy/core/../../../../libopenblas.so.0
#3  0x00007fffedb1522a in f2py_rout.flapack_dpotrf ()
   from /home/jupyterhub/miniconda3/envs/colin/lib/python3.7/site-packages/scipy/linalg/_flapack.cpython-37m-x86_64-linux-gnu.so
#4  0x00005555556c100b in _PyObject_FastCallKeywords ()
    at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:199
#5  0x0000555555725d78 in call_function (kwnames=0x7fffeda35c30, oparg=<optimised out>,
    pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4619
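
For context, `f2py_rout.flapack_dpotrf` is scipy's wrapper around LAPACK's Cholesky factorization `dpotrf`. A minimal sketch of the kind of call that walks this exact path (illustrative only: the matrix below is arbitrary, and on its own this does not reproduce the crash):

```python
import numpy as np
from scipy.linalg import cholesky

# Factorizing a symmetric positive definite matrix calls LAPACK dpotrf,
# which OpenBLAS parallelizes through dsyrk_thread_UT when
# OPENBLAS_NUM_THREADS > 1, matching the top frames of the backtrace.
a = np.random.rand(1000, 1000)
spd = a @ a.T + 1000 * np.eye(1000)
cholesky(spd)  # dpotrf_ -> dpotrf_U_parallel -> dsyrk_thread_UT
```
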
@martin-frbg
Collaborator

You'd need a debug build of OpenBLAS to see the actual C source line where the segfault occurred. How many threads are you running, or is two already sufficient to blow it up? And would you happen to have a simple recipe for reproducing this?

@martin-frbg
Collaborator

Could be something similar to #1929

@colinfang
Author

I have made a Docker image and a test script, available at https://gist.github.com/colinfang/02b45e6751264b044e02cb7edd209c09

The error is not easy to trigger: libhdfs.so has to be loaded, and numpy has to be the conda-installed build. The rough sequence is sketched below.
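
This is a hedged outline, not the gist verbatim: the host and port are placeholders, it needs a reachable HDFS service, and `pyarrow.hdfs.connect` is the 2020-era pyarrow API (since deprecated).

```python
import os
os.environ["OPENBLAS_NUM_THREADS"] = "2"  # read when libopenblas loads

# Step 1: the pyarrow/HDFS step. Connecting loads libhdfs.so and starts
# an in-process JVM; both stay resident afterwards.
import pyarrow.hdfs
fs = pyarrow.hdfs.connect("namenode-host", 8020)  # placeholder host/port

# Step 2: a threaded OpenBLAS call from the conda numpy/scipy stack,
# e.g. the Cholesky factorization from the backtrace above.
import numpy as np
from scipy.linalg import cholesky

a = np.random.rand(2000, 2000)
cholesky(a @ a.T + 2000 * np.eye(2000))  # segfaults with 2+ threads
```
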

@martin-frbg
Collaborator

Hmm. What would the Hadoop filesystem library have to do with this, or is it only required by the nature of your test script?

@colinfang
Author

I have no idea why libhdfs.so matters. I cannot reproduce the segfault without that pyarrow step, and I even suspect it is an error on the pyarrow side. But since the traceback ends in OpenBLAS, I filed it here.

@Enchufa2

@colinfang Do you know what NUM_THREADS numpy's libopenblas was compiled with? You could try using a lower number to see whether the segfault disappears, as in #2839.
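
One way to check without rebuilding anything is the `threadpoolctl` package, if it is available (a sketch):

```python
from pprint import pprint
from threadpoolctl import threadpool_info

import numpy as np  # loads numpy's bundled libopenblas

# Lists each loaded BLAS/OpenMP runtime with its file path, version and
# current num_threads limit, which identifies the conda OpenBLAS build.
pprint(threadpool_info())
```
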

@colinfang
Author

colinfang commented Sep 16, 2020

OPENBLAS_NUM_THREADS=2 triggers the segfault at runtime; OPENBLAS_NUM_THREADS=1 doesn't.

I don't know what effect the compile-time NUM_THREADS setting would have.
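
The runtime limit can also be changed after the libraries are loaded, e.g. via `threadpoolctl` (a sketch, assuming that package is installed; for OpenBLAS it calls `openblas_set_num_threads` under the hood):

```python
import numpy as np
from scipy.linalg import cholesky
from threadpoolctl import threadpool_limits

a = np.random.rand(1000, 1000)
spd = a @ a.T + 1000 * np.eye(1000)

with threadpool_limits(limits=1, user_api="blas"):
    cholesky(spd)  # forced single-threaded: no segfault
cholesky(spd)      # back to the multi-threaded default: segfaults here
```
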

@colinfang
Author

I made a repo, https://github.com/colinfang/openblas2821, so that I can publish it on Docker Hub at https://hub.docker.com/repository/docker/colinfang/openblas2821

I can reproduce the error using the image built on Docker Hub; it segfaults every time.

@martin-frbg
Collaborator

As you have Java in the mix once you load libhdfs, could you try setting _JAVA_OPTIONS="-Xss4096k" in your environment? The current understanding of #2839 is that the Java runtime imposes a much smaller stack limit than the usual OS defaults, which throws off OpenBLAS' expectations of how much it can simply put on the stack before switching to malloc.
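
For reference, a sketch of applying that from inside the script rather than the shell. `_JAVA_OPTIONS` is read when the JVM starts, so it has to be set before the pyarrow/HDFS step; `-Xss4096k` raises the JVM thread stack size to 4 MB.

```python
import os

# Must be set before anything launches the in-process JVM, i.e. before
# pyarrow loads libhdfs.so; otherwise the option is never seen.
os.environ["_JAVA_OPTIONS"] = "-Xss4096k"

import pyarrow.hdfs
fs = pyarrow.hdfs.connect("namenode-host", 8020)  # placeholder host/port

# Subsequent threaded OpenBLAS calls now get the stack room they expect.
```
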

@colinfang
Author

Yes, that works!
