
Segfault on dgemm_oncopy_HASWELL triggered by numpy.matmul inside a docker container (v0.3.13.dev) #3135


Closed
chokkyvista opened this issue Mar 10, 2021 · 19 comments

Comments

@chokkyvista

Hey, I'm still seeing segfaults when doing numpy.matmul with two big matrices (numpy v1.20.1, OpenBLAS v0.3.13.dev).

This looks potentially related to #2728?

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd34c490fa3, pid=1, tid=0x00007fd34f134740
#
# JRE version: OpenJDK Runtime Environment (8.0_242-b08) (build 1.8.0_242-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.242-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libopenblasp-r0-5bebc122.3.13.dev.so+0xe77fa3]  dgemm_oncopy_HASWELL+0x193

The stack trace points to a line that does something like np.matmul(samples, samples.T).

I was running the code in a docker container (enterprise environment), where NumPy was installed using pip.
Here's the spec of the compute cluster, from which 6 CPUs were allocated to the container.
[screenshot: hardware spec of the compute cluster]

threadpool_info via threadpoolctl shows the following, which confirms it was OpenBLAS v0.3.13.dev and that num_threads was correctly recognized to be 6.

[{'filepath': '/job/.local/lib/python3.7/site-packages/numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so',
  'internal_api': 'openblas',
  'num_threads': 6,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.13.dev'},
 {'filepath': '/job/.local/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1',
  'internal_api': 'openmp',
  'num_threads': 6,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/job/.local/lib/python3.7/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so',
  'internal_api': 'openblas',
  'num_threads': 6,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.9'}]
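
For reference, the output above was produced with something along these lines (a minimal sketch, assuming the threadpoolctl package is installed):

# Minimal sketch: list the BLAS/OpenMP libraries loaded into the process.
# Each import below loads that package's bundled library, which is why
# numpy's and scipy's OpenBLAS builds (and torch's libgomp) show up above.
from pprint import pprint

import numpy          # loads numpy's bundled OpenBLAS
import scipy.linalg   # loads scipy's bundled OpenBLAS
from threadpoolctl import threadpool_info

pprint(threadpool_info())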

Let me know if you need any other information!

@martin-frbg
Collaborator

Java runtime being involved suggests to me that you might simply be exceeding the default stack size it imposes. Please try whether setting _JAVA_OPTIONS="-Xss4096k" in the environment helps.

@chokkyvista
Author

Interesting! Thanks for the prompt reply! I'll give it a try and post what I find.

@chokkyvista
Author

chokkyvista commented Mar 11, 2021

As it turns out, it's not straightforward for me to change the JRE settings, as the docker container is set up and maintained by others (and then provided to users like me as a compute platform).

Just for my understanding though - why should JRE settings (like stacksize) even be relevant in this case?

To clarify, the code in question doesn't involve Java at all, and after I enabled faulthandler, this is what I saw instead:

Fatal Python error: Segmentation fault

Current thread 0x00007f38d45fe740 (most recent call first):
  ...                           # Followed by the normal Python stack trace

and JRE no longer showed up in the log.
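
(For reference, enabling it is just the following; setting PYTHONFAULTHANDLER=1 in the environment has the same effect.)

# Enable the Python fault handler as early as possible so a segfault dumps
# the Python-level stack instead of only the JRE crash report.
import faulthandler

faulthandler.enable()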

Maybe the JRE is a red herring? It seems to me the segfault might well have originated on the Python side, and with the Python faulthandler disabled, the JRE was just the first to "catch" and log it.

@martin-frbg
Collaborator

Maybe, maybe not. Why does the JRE feature in your context at all? Are you perhaps loading Java-based libraries like libhdfs? (#2821)

@chokkyvista
Author

Ah, I see what you mean. We do use fastparquet to load Parquet files from HDFS! As a temporary workaround, I've set OPENBLAS_NUM_THREADS to 1 for now.
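
(Concretely, the workaround is roughly the following; the variable has to be set before NumPy initialises OpenBLAS, e.g. before the first numpy import or in the container environment.)

# Temporary workaround: force OpenBLAS into single-threaded mode.
# Must be set before numpy (and thus OpenBLAS) is imported; exporting
# OPENBLAS_NUM_THREADS=1 in the environment works as well.
import os

os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # OpenBLAS now starts with a single thread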

I'll look into this a bit further and post updates.

@brada4
Contributor

brada4 commented Mar 11, 2021

Can you confirm that OPENBLAS_NUM_THREADS=1 fixes the problem 100%?

@chokkyvista
Author

chokkyvista commented Mar 11, 2021

It seems to be working so far, although I haven't had the time to isolate the issue into a minimal reproducible case and test it rigorously.

@chokkyvista
Author

Hi, I've managed to reproduce this issue, as shown in this gist.
I ran it inside a docker container with the same spec as above (i.e. 6 CPUs and 16 GB memory) and can consistently reproduce the results there: NO SEGFAULT only when OPENBLAS_NUM_THREADS is set to 1, and a SEGFAULT otherwise.
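
(The full script is in the gist; the snippet below is only a rough sketch of the kind of call involved, with placeholder dimensions rather than the real ones.)

# Rough sketch of the reproduction (placeholder sizes, not the exact gist values).
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal((40_000, 10_000))  # float64, roughly 3.2 GB
gram = np.matmul(samples, samples.T)             # ~12.8 GB result, multi-threaded dgemm
print(gram.shape)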

@brada4
Contributor

brada4 commented Mar 13, 2021

The computation takes a little more than 16 GB with multiple threads. Docker thin-provisions memory, while the DGEMM will certainly use all of it.

@chokkyvista
Author

Good point, although I increased memory to 64 GB to no avail.
I noticed memory peaked at just above 16 GB with the default Java stack size, and increased to 22 GB when I bumped the stack size to 8 MB.

Another thing I noticed: the SEGFAULT actually only happens about 80-90% of the time, and increasing the stack size seems to reduce its frequency a little, though it doesn't eliminate it.

@chokkyvista
Author

Btw, when OPENBLAS_NUM_THREADS is set to 1 and the computation goes out of memory on a larger input matrix, it's a normal OOM error and I don't see a SEGFAULT.

@brada4
Contributor

brada4 commented Mar 13, 2021

I cannot reproduce this with pure 0.3.13 at all, apart from a clear OOM with 16 GB RAM.
Could you update numpy (pip install --upgrade --user numpy) so that the OpenBLAS library versions match? The internals are not portable between versions; one library may call a locking function in the other and do something unexpected.
EDIT: nope, it is the docker image that installs numpy and scipy versions that are far apart; get them from around the same time, preferably the latest or previous release with all patches on (numpy=1.2.3 scipy=1.1.1 or so).

@chokkyvista
Author

But in all my cases, NumPy is already at its latest version (i.e. v1.20.1) installed from scratch using pip.

@martin-frbg
Collaborator

I cannot reproduce this problem (without docker/jre in the mix) so far. I do note, however, that the numpy-provided OpenBLAS binary gets built with a smaller GEMM buffer size than the usual default to limit memory requirements on big multi-core hardware, which may cause similar segfaults when the matrix size gets "too" big.
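
As a rough back-of-the-envelope check of the memory involved (the shape below is hypothetical; substitute the real one from the gist):

# Footprint of C = A @ A.T in float64 (8 bytes per element); shape is hypothetical.
n, k = 40_000, 10_000
a_bytes = n * k * 8   # A:      ~3.2 GB
c_bytes = n * n * 8   # result: ~12.8 GB
print(f"A: {a_bytes / 1e9:.1f} GB, C: {c_bytes / 1e9:.1f} GB, "
      f"plus OpenBLAS per-thread copy buffers on top")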

@brada4
Contributor

brada4 commented Mar 13, 2021

NumPy is already at its latest version (i.e. v1.20.1)

This looks dated: {'filepath': '/job/.local/lib/python3.7/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so'

@chokkyvista
Author

@brada4 That's SciPy, not NumPy. That said, it did look fishy to me at first, which is why I enabled faulthandler to capture exactly where the segfault originated; as it turned out, it was numpy.matmul (hence nothing to do with SciPy).
If you check the minimal example I shared in the gist above, you'll see that SciPy is no longer involved, and yet the segfault still happens.

@chokkyvista
Author

@martin-frbg Thanks that's interesting. Let me test the example without JRE or docker on my side and see what happens. Will update.

@chokkyvista chokkyvista changed the title Segfault on dgemm_oncopy_HASWELL triggered by numpy.matmul (v0.3.13.dev) Segfault on dgemm_oncopy_HASWELL triggered by numpy.matmul inside a docker container (v0.3.13.dev) Mar 15, 2021
@chokkyvista
Author

Indeed, I can only reproduce this issue when running it inside the docker container provided to me, which has things like the JDK and Hadoop installed by default (and I cannot work around that).
On my laptop, the minimal example in the gist I shared above works just fine.

To add to the trickiness, the stack traceback from faulthandler isn't very helpful either, as it only shows the frames up to the point where execution leaves Python.

So I don't know if there's more info I can provide at this stage to help with the debugging.
Feel free to close the issue. And thank you both for taking a look!

@brada4
Contributor

brada4 commented Mar 16, 2021

The fault handler might be right that the crash is in the "new" library, but the call could still enter through sgemm_ or cblas_sgemm from the other library.
While numpy.dual is not explicitly selectable anymore, it would be worth checking. If you have an interactive console, just rename the 0.3.9 library to something not ending in .so and copy 0.3.13 over it; maybe that cures it magically.
EDIT: i.e. cblas_sgemm (cblas.so) -A-> dgemm_ (0.3.9) -X-> interface -X-> driver -X-> copy / kernel / copy (0.3.13) ---- A is the standard, stable interface; the X-es are internals that may and will change with new releases and even slight compiler deviations.
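
A rough sketch of that experiment (paths taken from the threadpool_info output above; this patches installed packages in place, so only try it in a disposable container):

# Park scipy's bundled 0.3.9 OpenBLAS and put a copy of numpy's 0.3.13.dev
# build in its place, so only one OpenBLAS version is loaded at runtime.
# Paths are copied from the threadpool_info output above; adjust to your install.
import shutil

site = "/job/.local/lib/python3.7/site-packages"
scipy_blas = f"{site}/scipy.libs/libopenblasp-r0-085ca80a.3.9.so"
numpy_blas = f"{site}/numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so"

shutil.move(scipy_blas, scipy_blas + ".bak")  # rename so it no longer ends in .so
shutil.copy2(numpy_blas, scipy_blas)          # drop the 0.3.13.dev build in its place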
