
Segfault on dgemm_oncopy_HASWELL triggered by numpy.matmul inside a docker container (v0.3.13.dev) #3135


Closed
chokkyvista opened this issue Mar 10, 2021 · 19 comments

Comments

@chokkyvista

Hey, I'm still seeing segfaults when doing numpy.matmul with two big matrices (numpy v1.20.1, OpenBLAS v0.3.13.dev).

This looks potentially related to #2728?

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd34c490fa3, pid=1, tid=0x00007fd34f134740
#
# JRE version: OpenJDK Runtime Environment (8.0_242-b08) (build 1.8.0_242-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.242-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libopenblasp-r0-5bebc122.3.13.dev.so+0xe77fa3]  dgemm_oncopy_HASWELL+0x193

The stack trace points to a line that does something like np.matmul(samples, samples.T).

I was running the code in a docker container (enterprise environment), where NumPy was installed using pip.
Here's the spec of the compute cluster, from which 6 CPUs were allocated to the container.
[screenshot: hardware spec of the compute cluster]

threadpool_info via threadpoolctl shows the following, which confirms it was OpenBLAS v0.3.13.dev and that num_threads was correctly recognized to be 6.

[{'filepath': '/job/.local/lib/python3.7/site-packages/numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so',
  'internal_api': 'openblas',
  'num_threads': 6,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.13.dev'},
 {'filepath': '/job/.local/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1',
  'internal_api': 'openmp',
  'num_threads': 6,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/job/.local/lib/python3.7/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so',
  'internal_api': 'openblas',
  'num_threads': 6,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.9'}]
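
For reference, the output above was produced with something along these lines (a minimal sketch, assuming the threadpoolctl package is installed):

# Minimal sketch: list the BLAS/OpenMP libraries loaded into the process.
# Each import below loads that package's bundled library, which is why
# numpy's and scipy's OpenBLAS builds (and torch's libgomp) show up above.
from pprint import pprint

import numpy          # loads numpy's bundled OpenBLAS
import scipy.linalg   # loads scipy's bundled OpenBLAS
from threadpoolctl import threadpool_info

pprint(threadpool_info())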

Let me know if you need any other information!

@martin-frbg
Collaborator

Java runtime being involved suggests to me that you might simply be exceeding the default stack size it imposes. Please try whether setting _JAVA_OPTIONS="-Xss4096k" in the environment helps.

@chokkyvista
Author

Interesting! Thanks for the prompt reply! I'll give it a try and post what I find.

@chokkyvista
Author

chokkyvista commented Mar 11, 2021

As it turns out, it's not straightforward for me to change the JRE settings, as the docker container is set up and maintained by others (and then provided to users like me as a compute platform).

Just for my understanding though - why should JRE settings (like stacksize) even be relevant in this case?

To clarify, the code in question doesn't involve Java at all, and after I enabled faulthandler, this is what I saw instead:

Fatal Python error: Segmentation fault

Current thread 0x00007f38d45fe740 (most recent call first):
  ...                           # Followed by the normal Python stack trace

and JRE no longer showed up in the log.
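
(For reference, enabling it is just the following; setting PYTHONFAULTHANDLER=1 in the environment has the same effect.)

# Enable the Python fault handler as early as possible so a segfault dumps
# the Python-level stack instead of only the JRE crash report.
import faulthandler

faulthandler.enable()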

Maybe the JRE is a red herring? It seems to me the segfault might well have originated on the Python side, and with the Python faulthandler disabled, the JRE was just the first to "catch" and log it.

@martin-frbg
Collaborator

Maybe, maybe not. Why does the JRE feature in your context at all? Are you perhaps loading Java-based libraries like libhdfs? (#2821)

@chokkyvista
Author

Ah, I see what you mean. We do use fastparquet to load Parquet files from HDFS! As a temporary workaround, I've set OPENBLAS_NUM_THREADS to 1 for now.
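
(Concretely, the workaround is roughly the following; the variable has to be set before NumPy initialises OpenBLAS, e.g. before the first numpy import or in the container environment.)

# Temporary workaround: force OpenBLAS into single-threaded mode.
# Must be set before numpy (and thus OpenBLAS) is imported; exporting
# OPENBLAS_NUM_THREADS=1 in the environment works as well.
import os

os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # OpenBLAS now starts with a single thread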

I'll look into this a bit further and post updates.

@brada4
Contributor

brada4 commented Mar 11, 2021

Can you confirm that OPENBLAS_NUM_THREADS=1 fixes the problem 100%?

@chokkyvista
Author

chokkyvista commented Mar 11, 2021

It seems to be working so far, although I haven't had the time to isolate the issue into a minimal reproducible case and test it rigorously.

@chokkyvista
Author

Hi, I've managed to reproduce this issue, as shown in this gist.
I ran it inside a docker container with the same spec as above (i.e. 6 CPUs and 16 GB memory) and can consistently reproduce the results there: NO SEGFAULT only when OPENBLAS_NUM_THREADS is set to 1, and a SEGFAULT otherwise.
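
(The full script is in the gist; the snippet below is only a rough sketch of the kind of call involved, with placeholder dimensions rather than the real ones.)

# Rough sketch of the reproduction (placeholder sizes, not the exact gist values).
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal((40_000, 10_000))  # float64, roughly 3.2 GB
gram = np.matmul(samples, samples.T)             # ~12.8 GB result, multi-threaded dgemm
print(gram.shape)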

@brada4
Contributor

brada4 commented Mar 13, 2021

The computation takes a little more than 16 GB with multiple threads. Docker thin-provisions memory, while the DGEMM will certainly use all of it.

@chokkyvista
Author

Good point, although I increased memory to 64 GB to no avail.
I noticed memory peaked at just above 16 GB with the default Java stack size, and increased to 22 GB when I bumped the stack size to 8 MB.

Another thing I noticed: the SEGFAULT actually only happens about 80-90% of the time, and increasing the stack size seems to reduce its frequency a little, though it doesn't eliminate it.

@chokkyvista
Author

Btw, when OPENBLAS_NUM_THREADS is set to 1 and the computation goes out of memory on a larger input matrix, it's a normal OOM error and I don't see a SEGFAULT.

@brada4
Contributor

brada4 commented Mar 13, 2021

I cannot reproduce this with pure 0.3.13 at all, apart from a clear OOM with 16 GB RAM.
Could you update numpy (pip install --upgrade --user numpy) so that the OpenBLAS library versions match? The internals are not portable between versions; one library may call a locking function in the other and do something unexpected.
EDIT: nope, it is the docker image that installs numpy and scipy versions that are far apart; get them from around the same time, preferably the latest or previous release with all patches on (numpy=1.2.3 scipy=1.1.1 or so).

@chokkyvista
Author

But in all my cases, NumPy is already at its latest version (i.e. v1.20.1) installed from scratch using pip.

@martin-frbg
Collaborator

I cannot reproduce this problem (without docker/jre in the mix) so far. I do note, however, that the numpy-provided OpenBLAS binary gets built with a smaller GEMM buffer size than the usual default to limit memory requirements on big multi-core hardware, which may cause similar segfaults when the matrix size gets "too" big.
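
As a rough back-of-the-envelope check of the memory involved (the shape below is hypothetical; substitute the real one from the gist):

# Footprint of C = A @ A.T in float64 (8 bytes per element); shape is hypothetical.
n, k = 40_000, 10_000
a_bytes = n * k * 8   # A:      ~3.2 GB
c_bytes = n * n * 8   # result: ~12.8 GB
print(f"A: {a_bytes / 1e9:.1f} GB, C: {c_bytes / 1e9:.1f} GB, "
      f"plus OpenBLAS per-thread copy buffers on top")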

@brada4
Contributor

brada4 commented Mar 13, 2021

NumPy is already at its latest version (i.e. v1.20.1)

This looks dated: {'filepath': '/job/.local/lib/python3.7/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so'

@chokkyvista
Author

@brada4 That's SciPy, not NumPy. That said, it did look fishy to me at first, which is why I enabled faulthandler to capture exactly where the segfault originated; as it turned out, it was numpy.matmul (hence nothing to do with SciPy).
If you check the minimal example I shared in the gist above, you'll see that SciPy is no longer involved, and yet the segfault still happens.

@chokkyvista
Author

@martin-frbg Thanks that's interesting. Let me test the example without JRE or docker on my side and see what happens. Will update.

@chokkyvista chokkyvista changed the title Segfault on dgemm_oncopy_HASWELL triggered by numpy.matmul (v0.3.13.dev) Segfault on dgemm_oncopy_HASWELL triggered by numpy.matmul inside a docker container (v0.3.13.dev) Mar 15, 2021
@chokkyvista
Author

Indeed, I can only reproduce this issue when running it inside the docker container provided to me, which has things like the JDK and Hadoop installed by default (and I cannot work around that).
On my laptop, the minimal example in the gist I shared above works just fine.

To add to the trickiness, the stack traceback from faulthandler isn't very helpful either, as it only shows the frames up to the point where execution leaves Python.

So I don't know if there's more info I can provide at this stage to help with the debugging.
Feel free to close the issue. And thank you both for taking a look!

@brada4
Contributor

brada4 commented Mar 16, 2021

The fault handler might be right that the crash is in the "new" library, but the call could still enter through sgemm_ or cblas_sgemm from the other library.
While numpy.dual is not explicitly selectable anymore, it would be worth checking. If you have an interactive console, just rename the 0.3.9 library to something not ending in .so and copy 0.3.13 over it; maybe that cures it magically.
EDIT: i.e. cblas_sgemm (cblas.so) -A-> dgemm_ (0.3.9) -X-> interface -X-> driver -X-> copy / kernel / copy (0.3.13) ---- A is the standard, stable interface; the X-es are internals that may and will change with new releases and even slight compiler deviations.
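
A rough sketch of that experiment (paths taken from the threadpool_info output above; this patches installed packages in place, so only try it in a disposable container):

# Park scipy's bundled 0.3.9 OpenBLAS and put a copy of numpy's 0.3.13.dev
# build in its place, so only one OpenBLAS version is loaded at runtime.
# Paths are copied from the threadpool_info output above; adjust to your install.
import shutil

site = "/job/.local/lib/python3.7/site-packages"
scipy_blas = f"{site}/scipy.libs/libopenblasp-r0-085ca80a.3.9.so"
numpy_blas = f"{site}/numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so"

shutil.move(scipy_blas, scipy_blas + ".bak")  # rename so it no longer ends in .so
shutil.copy2(numpy_blas, scipy_blas)          # drop the 0.3.13.dev build in its place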
