
OpenBLAS segfaults at high thread counts #14857


Closed
kpamnany opened this issue Jan 30, 2016 · 35 comments
Labels
building (Build system, or building Julia or its dependencies) · multithreading (Base.Threads and related functionality) · performance (Must go faster)

Comments

@kpamnany
Contributor

With threading enabled, on a 2-socket, 18-cores-per-socket HSW-EP (72 hardware threads with hyper-threading on):

$ export OPENBLAS_NUM_THREADS=1
$ export OMP_NUM_THREADS=1
$ export JULIA_NUM_THREADS=36
$ ./julia test/perf/threads/stockcorr/pstockcorr.jl
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.

signal (11): Segmentation fault
while loading /home/kpamnany/julia/julia/test/perf/threads/stockcorr/pstockcorr.jl, in expression starting on line 155
Segmentation fault

OpenMathLib/OpenBLAS#539 points out that this means more buffers are needed, i.e. NUM_THREADS needs to be increased.

OpenBLAS appears to figure the machine out correctly. In deps/build/openblas/Makefile.conf:

...
CORE=HASWELL
LIBCORE=haswell
NUM_CORES=72
HAVE_MMX=1
...

And in deps/build/openblas/Makefile.system:

...
ifndef NUM_THREADS
NUM_THREADS = $(NUM_CORES)
endif
...
CCOMMON_OPT     += -DMAX_CPU_NUMBER=$(NUM_THREADS)
...

The problem is in deps/Makefile:1092:

# On linux, try to provision for the largest possible machine currently
OPENBLAS_BUILD_OPTS += NUM_THREADS=16

This might be a bit out of date. :-)

As OpenBLAS appears to do a fine job of figuring out NUM_CORES in Makefile.conf (checked on both Linux and OS X), I suggest not setting NUM_THREADS in OPENBLAS_BUILD_OPTS at all (remove deps/Makefile:1080 through deps/Makefile:1094).

@kpamnany added the building and multithreading labels Jan 30, 2016
@vtjnash
Member

vtjnash commented Jan 30, 2016

that source-file comment isn't quite accurate anymore. we used to have a very large value there, but it caused performance and memory issues due to openblas creating too many static buffers and trying to use too many cores for the problem size. it is now set to the largest value that seemed to provide improved performance.

@ViralBShah
Member

I hope that as openblas evolves, this is one of those things that goes away.

@kpamnany
Contributor Author

kpamnany commented Feb 2, 2016

I'm suggesting not setting a value for NUM_THREADS in deps/Makefile at all. Let OpenBLAS use its NUM_CORES as it seems to be correctly figuring out the right value to use.

@vtjnash
Member

vtjnash commented Feb 2, 2016

true, if the user disables OPENBLAS_DYNAMIC_ARCH, it might make sense to disable NUM_CORES also

@tkelman
Contributor

tkelman commented Feb 2, 2016

If dynamic arch is disabled, then this could make sense. Otherwise wouldn't you be limiting the number of threads supported by openblas to the number the build machine (or VM) happened to have?

@kpamnany
Contributor Author

kpamnany commented Feb 2, 2016

Yes. Isn't using the build machine's specs a better default than a hard-coded value? There's no reasonable hard-coded default that will work across a 4-core Core i5 laptop, a 36-core Xeon server, and a 70-core Xeon Phi blade.

As a guess at a solution -- proper cross-compilation should be made easy by defining a set of cross-compilation options that need to be specified; the number of threads a library should be able to handle would be one of these.

@vtjnash
Member

vtjnash commented Feb 2, 2016

there's not much reason to believe the build machine is similar to the production machine, so we opt for predictability over guessing. but to expand on my point above, iirc the value in the makefile was set after observing that openblas was performing worse on small to medium problems beyond about 4-16 threads (presumably due to scheduling overhead) with testing performed on an 80-core Intel(R) Xeon(R) CPU E7-8850 (716151d)

@tkelman
Contributor

tkelman commented Feb 2, 2016

The build machines from which we produce binaries (which most users probably start with) are small VMs, so no, I don't think we should let their particular configuration influence the build artifacts. That build configuration for creating redistributable binaries could be put behind a non-default makefile flag though, and the conventional source build could default to specializing for the local machine - #9372

@kpamnany
Contributor Author

kpamnany commented Feb 2, 2016

@vtjnash: The buffer count that is causing this segfault has nothing to do with OpenBLAS threads; a bigger value is required to allow 72 Julia threads to make concurrent calls into OpenBLAS. This actually argues for @xianyi to separate the buffer count from the thread count.

@tkelman: I like this idea -- the conventional source build defaulting to the local machine specs.

@yuyichao
Contributor

yuyichao commented Feb 2, 2016

If it is hard to let openblas automatically/dynamically figure this out, can this at least be set at runtime?

@tkelman
Contributor

tkelman commented Feb 2, 2016

It absolutely can be set at runtime. The build-time value is a maximum. It may be worth trying to separate the maximum allowed vs the default value without having to set an environment variable to modify the default.
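For reference, the runtime knob (in current Julia it lives under LinearAlgebra.BLAS) looks roughly like this sketch; it can only lower the thread count, since the compile-time NUM_THREADS stays the ceiling:

using LinearAlgebra

# Reduce the number of threads OpenBLAS uses for its own parallel kernels.
# This cannot raise the compile-time NUM_THREADS / buffer limit.
LinearAlgebra.BLAS.set_num_threads(4)

# Equivalently, start Julia with OPENBLAS_NUM_THREADS=4 set in the environment.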

@kpamnany
Contributor Author

kpamnany commented Feb 2, 2016

Unfortunately, while you can set OPENBLAS_NUM_THREADS or OMP_NUM_THREADS to reduce the number of threads that OpenBLAS uses at runtime, the buffer count that is computed from the specified number of threads appears to be static, i.e. it can't be increased at runtime.

@ViralBShah
Member

Discussed in JuliaLang/LinearAlgebra.jl#323 and closing as there isn't much we can do about openblas NUM_THREADS. Maybe what we need is real multi-threading and a Julia BLAS that uses that. Just saying. ;-)

@scottstanie

Is there anything to be done about this for now? Julia 1.3 seems to do a little better at avoiding the problem, but I still see weird things like crashing when I have OPENBLAS_NUM_THREADS=1 JULIA_NUM_THREADS=56 but running perfectly okay when I do OPENBLAS_NUM_THREADS=4 JULIA_NUM_THREADS=36 (which is at least unexpected to me, since I would assume OpenBLAS would roughly multiply those two numbers for its total allocations). I've had to tinker with random combinations of the two env variables to get something that comes close to using all 56 CPUs but also doesn't crash.

@ViralBShah
Member

This probably needs to be reported upstream.

@scottstanie

Sure, as a separate issue for Julia with an example? or to the OpenBLAS repo?

@ViralBShah
Member

I am reopening this one - so please post the repro here, and then file to the openblas repo linking here.

@ViralBShah reopened this Aug 17, 2019
@scottstanie

Okay, the example I'm running is at the bottom (an iteratively reweighted least squares algorithm, threaded across pixels in a 3D stack).

My machine has 56 cores, and the default for OPENBLAS_NUM_THREADS on my machine seems to be 8. I don't remember if I had any say in that or whether I should control OpenBLAS more closely; I wasn't really aware of it until this bug.

Running with half the julia threads works for me:

$ OPENBLAS_NUM_THREADS=8 JULIA_NUM_THREADS=28 julia13 --start=no
julia> versioninfo()
Julia Version 1.3.0-alpha.0
Commit 6c11e7c2c4 (2019-07-23 01:46 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)
Environment:
  JULIA_NUM_THREADS = 28

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ()), Threads.nthreads()
(8, 28)

julia> include("./blas_thread_test.jl")
190.196333 seconds (15.18 M allocations: 384.406 GiB, 9.51% gc time)

But other settings segfault with the same error `BLAS : Program is Terminated. Because you tried to allocate too many memory regions.`:

$ OPENBLAS_NUM_THREADS=8 JULIA_NUM_THREADS=56 julia13 --start=no

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ()), Threads.nthreads()
(8, 56)

julia> include("./blas_thread_test.jl")
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
....
signal (11): Segmentation fault
in expression starting at /home/scott/repos/blas_thread_test.jl:23
....
(let me know if you want the full long stack trace)
$ OPENBLAS_NUM_THREADS=1 JULIA_NUM_THREADS=56 julia13 --start=no

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ()), Threads.nthreads()
(1, 56)

julia> include("./blas_thread_test.jl")
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.

signal (11): Segmentation fault
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
in expression starting at /home/scott/repos/blas_thread_test.jl:23

Here's the test program:

$ cat blas_thread_test.jl
using LinearAlgebra
function irls(A, b; iters=100)
    M, N = size(A)
    x = zeros(eltype(b), N)
    ep = sqrt(eps(eltype(A)))
    p = 1

    W = diagm(0 => (abs.(b-A*x) .+ ep).^(p-2))

    for ii in 1:iters
        x .= (A' * W * A) \ (A' * W * b)
        W .= diagm(0 => (abs.(b-A*x) .+ ep).^(p-2))
    end
    return x
end


M, N = 500, 30
A = randn(Float32, M, N);
bstack = rand(Float32, 60, 60, M);
xs = zeros(Float32, 60, 60, N);

@time Threads.@threads for j=1:size(bstack, 2)
    for i = 1:size(bstack, 1)
        xs[i, j, :] .= irls(A, bstack[i, j, :], iters=100)
    end
end

@scottstanie

Might as well note that with Julia 1.1, the crashes happen with different options (as in, the one that ran successfully for 1.3 failed here, and I had to lower the OPENBLAS_NUM_THREADS to 2):

$ OPENBLAS_NUM_THREADS=8 JULIA_NUM_THREADS=28 julia11 --start=no

julia> versioninfo()
Julia Version 1.1.1
Commit 55e36cc308 (2019-05-16 04:10 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)
Environment:
  JULIA_NUM_THREADS = 28

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ()), Threads.nthreads()
(8, 28)

julia> include("./blas_thread_test.jl")
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
$ OPENBLAS_NUM_THREADS=1 JULIA_NUM_THREADS=56 julia11 --start=no

julia> include("./blas_thread_test.jl")
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
$ OPENBLAS_NUM_THREADS=2 JULIA_NUM_THREADS=28 julia11 --start=no

julia> include("./blas_thread_test.jl")
(no crash)

@ViralBShah
Member

ViralBShah commented Aug 18, 2019

The default in Julia on Linux is 16, and in order to use more than that many, you have to recompile OpenBLAS with a higher desired number. I was assuming you had already done that, but I thought it was worth asking to make sure.

@scottstanie

Ah ok, thanks - once I do this, will I have to adjust Julia afterwards in any way to acknowledge the new OpenBLAS? Or will following the OpenBLAS compilation instructions be sufficient?

@ViralBShah
Member

That won't suffice. We mangle the names for 64 bit and rewrite RPATH and such. The easiest way may be to change the number of threads in blas.mk, disable BinaryBuilder with USE_BINARYBUILDER = 0 in Make.user and build Julia from source.
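Concretely, the source-build route is roughly the following sketch (56 is just an example value to match the machine above):

# Make.user at the top of the Julia source tree: build OpenBLAS locally
# instead of downloading the prebuilt binary
USE_BINARYBUILDER = 0

# deps/blas.mk: raise OpenBLAS's compile-time thread/buffer ceiling
OPENBLAS_BUILD_OPTS += NUM_THREADS=56

Then rebuild Julia with make as usual.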

@ViralBShah
Member

We should also start exploring the USE_TLS=1 option in OpenBLAS as commented upstream.
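For anyone experimenting with a standalone OpenBLAS build, that flag is passed at build time, e.g. (a sketch; the thread count here is an arbitrary example, not what Julia's own build uses):

make USE_TLS=1 NUM_THREADS=128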

@ViralBShah added the performance label Aug 19, 2019
@scottstanie

Thanks @ViralBShah, just confirming that building Julia with OPENBLAS_BUILD_OPTS += NUM_THREADS=56 in blas.mk and USE_BINARYBUILDER = 0 worked: my same test above did not crash even with JULIA_NUM_THREADS=56 OPENBLAS_NUM_THREADS=8 👍

@oxinabox
Contributor

We are getting hit by this right now with:

julia> versioninfo(verbose=true)
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  uname: Linux 4.14.165-133.209.amzn2.x86_64 #1 SMP Sun Feb 9 00:21:30 UTC 2020 x86_64 x86_64
  CPU: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz: 
                 speed         user         nice          sys         idle          irq
       #1-48  1282 MHz    2145269 s        524 s     737028 s  3280901444 s          0 s
  Memory: 92.31517028808594 GB (70440.73828125 MB free)
  Uptime: 684217.0 sec
  Load Avg:  0.080078125  7.17138671875  28.93994140625
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PATH = /usr/local/julia
  JULIA_VERSION = 1.3.1
  JULIA_NUM_THREADS = 48
  JULIA_PKGDIR = /root/.julia
  JULIA_PATH = /usr/local/julia
  TERM = xterm
  PATH = /usr/local/julia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
  HOME = /root

Left OPENBLAS_NUM_THREADS unset, so should be 48 I think.

This was when fitting a bunch of clusters using Clustering.jl's k-means
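For the record, the effective OpenBLAS thread count can be checked the same way as in the repro above:

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ()), Threads.nthreads()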

@staticfloat
Member

staticfloat commented Mar 27, 2020

Interestingly enough, 1.4.0 does not seem to segfault at the same point as 1.3.1. In my experiments (using the official 1.3.1 and 1.4.0 binaries on an m5a.16xlarge EC2 instance running standard Amazon Linux 2), I found that segfaults began at the threshold of JULIA_NUM_THREADS=44. OPENBLAS_NUM_THREADS doesn't seem to have much effect.

Unfortunately I don't have a user-code way of switching the backing BLAS provider in 1.3.1. Chances are very good that we will have one in 1.5.0 (I already have this prototyped out in my JLL stdlib branch) but for the time being the easiest way to work around this is to manually replace your libopenblas64_.so with one built to have a much higher core count. I'm building a new JLL that has an upper limit of 128 cores instead of 32, which is our current default.

The installation procedure will be fairly simple once the JLL trickles through the Yggdrasil building and registration process. You'll install it just like any other JLL, then you'll run the following code snippet (I'm typing this out without being able to test, will update once the JLL is registered in case I mistyped anything):

using Libdl

# First, backup "true" openblas name (it's usually a symlink)
curr_openblas_path = first(filter(l -> occursin("libopenblas", l), Libdl.dllist()))
curr_openblas_realpath = realpath(curr_openblas_path)
mv(curr_openblas_realpath, "$(curr_openblas_realpath).backup")

# Copy new library into julia's libdir, then create a symlink that points to it
# Ignore errors here about being unable to dlopen libopenblas; that's expected.
using OpenBLASHighCoreCount_jll
new_realpath = realpath(OpenBLASHighCoreCount_jll.libopenblas_path)
@info("Copying $(new_realpath) into Julia's libdir...")
libdir = dirname(curr_openblas_path)
cp(new_realpath, joinpath(libdir, basename(new_realpath)); force=true)

# Update any symlinks that pointed to the old path, to instead point to the new path:
for f in readdir(libdir)
    f = joinpath(libdir, f)
    if islink(f) && readlink(f) == basename(curr_openblas_realpath)
        @info("Updating symlink $(f)")
        rm(f; force=true)
        symlink(basename(new_realpath), f)
    end
end

Then, quit and re-open Julia. I've built OpenBLAS 0.3.5 and 0.3.7 in a high-core configuration and it did not segfault when running the test program @scottstanie posted in OpenMathLib/OpenBLAS#2225 (comment). Please do feel free to test this out (once the JLL is registered) and let me know if anything is broken.

@staticfloat
Member

Now that the OpenBLASHighCoreCount packages have been merged, I've updated my script above to actually work (tested on x86_64-linux-gnu), although I will note that while using 56 threads works, 96 does not. Despite my setting the maximum number of threads to be 128, it appears that there's something internal to OpenBLAS that breaks right after 56.

You can choose the OpenBLAS version you want to use by add'ing the appropriate OpenBLASHighCoreCount_jll; Julia 1.3.1 ships with 0.3.5 so that's the safest bet, although anecdotally I didn't run into any problems with supplanting it with 0.3.7 instead.
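For example (a sketch; the version pin is optional):

using Pkg
Pkg.add(PackageSpec(name="OpenBLASHighCoreCount_jll", version="0.3.5"))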

@ViralBShah
Member

Seems like this is fixed. But please reopen if still an issue.

@Red-Portal

Hi @staticfloat, is there currently any workaround for more than 56 threads? Our NUMA system currently has 120 threads and it's really a pain that we're capped at 56.

@ViralBShah
Member

Use the OpenBLASHighCoreCount_jll package. But you'll need to then replace Julia's libopenblas with the one that is in this artifact.

@Red-Portal

Red-Portal commented Jan 17, 2021

@ViralBShah Does it now work with a thread count > 56? As @staticfloat said in March, I'm still experiencing the memory region error with the new artifact with 120 threads.

@ViralBShah
Member

ViralBShah commented Jan 17, 2021

We are building with 128 threads:

https://github.com/JuliaPackaging/Yggdrasil/blob/master/O/OpenBLAS/OpenBLASHighCoreCount%400.3.12/build_tarballs.jl#L10

Perhaps it needs to be reported upstream?

@staticfloat
Member

Yeah it might be something that should be reported upstream.

@Red-Portal

Okay, thanks for the guidance. For now I've worked around the issue by switching to MKL; if anyone is experiencing similar issues, try MKL.
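For anyone following that route: on Julia 1.7+ (where the active BLAS can be swapped at runtime via libblastrampoline) the switch is roughly this sketch:

using Pkg
Pkg.add("MKL")

using MKL                        # loading MKL.jl makes MKL the active BLAS backend
using LinearAlgebra
LinearAlgebra.BLAS.get_config()  # should now list MKL among the loaded libraries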

@oschulz
Contributor

oschulz commented May 9, 2021

Just ran into this one with 64 threads (on 64 phys cores) on Julia v1.6.1 - I guess it depends on the use case?
