OpenBLAS segfaults at high thread counts #14857
that source-file comment isn't quite accurate anymore. we used to have a very large value there, but it caused performance and memory issues due to openblas creating too many static buffers and trying to use too many cores for the problem size. it is now set to the largest value that seemed to provide improved performance. |
I hope that as openblas evolves, this is one of those things that goes away. |
I'm suggesting not setting a value for |
true, if the user disables OPENBLAS_DYNAMIC_ARCH, it might make sense to disable NUM_CORES also |
If dynamic arch is disabled, then this could make sense. Otherwise wouldn't you be limiting the number of threads supported by openblas to the number the build machine (or VM) happened to have? |
Yes. Isn't using the build machine specs a better default than a hard-coded value? There's no reasonable default that can be hard-coded that will work across a 4-core Core i5 laptop, a 36 core Xeon server, and a 70 core Xeon Phi blade. As a guess at a solution -- proper cross-compilation should be made easy by defining a set of cross-compilation options that need to be specified; the number of threads a library should be able to handle would be one of these. |
there's not much reason to believe the build machine is similar to the production machine, so we opt for predictability over guessing. but to expand on my point above, iirc the value in the makefile was set after observing that openblas was performing worse on small to medium problems beyond about 4-16 threads (presumably due to scheduling overhead) with testing performed on an 80-core Intel(R) Xeon(R) CPU E7-8850 (716151d) |
The build machines from which we produce binaries (which most users probably start with) are small VMs, so no, I don't think we should let their particular configuration influence the build artifacts. That build configuration for creating redistributable binaries could be put behind a non-default makefile flag though, with the conventional source build defaulting to specializing for the local machine - #9372 |
@vtjnash: The buffer count that is causing this segfault has nothing to do with OpenBLAS threads; a bigger value is required to allow 72 Julia threads to make concurrent calls into OpenBLAS. This actually argues for @xianyi to separate the buffer count from the thread count. @tkelman: I like this idea -- the conventional source build defaulting to the local machine specs. |
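The call pattern described above (many Julia threads each entering OpenBLAS at the same time) can be sketched roughly as follows. This is a minimal illustration, not the test program from this thread; `concurrent_gemm` is a hypothetical helper name, and whether it triggers the buffer error depends on the OpenBLAS build and on how many Julia threads are running.

```julia
using LinearAlgebra

# Hypothetical helper (not from the thread): every loop iteration makes its
# own matrix multiply, so several calls into BLAS can be in flight at once
# when Julia is started with more than one thread.
function concurrent_gemm(ntasks::Integer, n::Integer)
    results = Vector{Float64}(undef, ntasks)
    Threads.@threads for i in 1:ntasks
        A = fill(1.0, n, n)
        B = fill(2.0, n, n)
        C = A * B              # Float64 matrix product, handled by BLAS gemm
        results[i] = C[1, 1]   # sum of n products 1.0 * 2.0
    end
    return results
end

# Keep BLAS itself single-threaded so that the Julia threads are the only
# source of concurrency in this sketch.
BLAS.set_num_threads(1)
res = concurrent_gemm(2 * Threads.nthreads(), 64)
```

Started with a single Julia thread this simply runs the multiplies one after another; the concurrency only appears when Julia is launched with several threads.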
If it is hard to let openblas figure this out automatically/dynamically, can this at least be set at runtime? |
It absolutely can be set at runtime. The build-time value is a maximum. It may be worth trying to separate the maximum allowed vs the default value without having to set an environment variable to modify the default. |
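As a rough sketch of the runtime setting mentioned above, the `LinearAlgebra` standard library exposes `BLAS.set_num_threads`, and `BLAS.get_num_threads` (Julia 1.6 and later) reports the value in effect. `request_blas_threads` is a hypothetical wrapper used only for illustration; per the comment above, the build-time value remains the upper limit.

```julia
using LinearAlgebra

# Hypothetical helper: ask BLAS for `n` threads and return what is in effect.
function request_blas_threads(n::Integer)
    BLAS.set_num_threads(n)
    return BLAS.get_num_threads()   # requires Julia 1.6 or later
end

granted = request_blas_threads(1)
```

Comparing the returned value with the requested one is a simple way to see whether a request was honored.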
Unfortunately, while you can set |
Discussed in JuliaLang/LinearAlgebra.jl#323 and closing as there isn't much we can do about openblas NUM_THREADS. Maybe what we need is real multi-threading and a Julia BLAS that uses that. Just saying. ;-) |
Is there anything to be done for this for now? It seems Julia 1.3 works a little better to avoid this problem, but I still get weird things like crashing when I have |
This probably needs to be reported upstream. |
Sure, as a separate issue for Julia with an example? or to the OpenBLAS repo? |
I am reopening this one - so please post the repro here, and then file to the openblas repo linking here. |
Okay, the example I'm running is at the bottom (an iteratively reweighted least squares algorithm, threaded across pixels in a 3D stack). My machine has 56 cores, and the default for OPENBLAS_NUM_THREADS on my machine seems to be 8. I don't remember if I had any say in that, or if I should control OpenBLAS more; I wasn't really aware of it until this bug. Running with half the Julia threads works for me:
But other settings segfault with the same error: `BLAS : Program is Terminated. Because you tried to allocate too many memory regions.`
Here's the test program:
|
Might as well note that with Julia 1.1, the crashes happen with different options (as in, the one that ran successfully for 1.3 failed here, and I had to lower the OPENBLAS_NUM_THREADS to 2):
|
The default in Julia on Linux is 16, and in order to use more than that many, you have to recompile OpenBLAS with a higher desired number. I was assuming you had already done that, but I thought it was worth asking to make sure. |
ah ok thanks- once I do this, will I have to adjust Julia afterwards in any way to acknowledge the new OpenBLAS? or will following the OpenBLAS compiling instructions be sufficient? |
That won't suffice. We mangle the names for 64 bit and rewrite RPATH and such. The easiest way may be to change the number of threads in blas.mk, disable BinaryBuilder with |
We should also start exploring the |
Thanks @ViralBShah , just confirming that building Julia with |
We are getting hit by this right now with:
This was when fitting a bunch of clusters using Clustering.jl's k-means |
Interestingly enough, 1.4.0 does not seem to segfault at the same point as 1.3.1. In my experiments (using the official 1.3.1 and 1.4.0 binaries on an
Unfortunately I don't have a user-code way of switching the backing BLAS provider in 1.3.1. Chances are very good that we will have one in 1.5.0 (I already have this prototyped out in my JLL stdlib branch) but for the time being the easiest way to work around this is to manually replace your
The installation procedure will be fairly simple once the JLL trickles through the Yggdrasil building and registration process. You'll install it just like any other JLL, then you'll run the following code snippet (I'm typing this out without being able to test, will update once the JLL is registered in case I mistyped anything):
using Libdl
# First, backup "true" openblas name (it's usually a symlink)
curr_openblas_path = first(filter(l -> occursin("libopenblas", l), Libdl.dllist()))
curr_openblas_realpath = realpath(curr_openblas_path)
mv(curr_openblas_realpath, "$(curr_openblas_realpath).backup")
# Copy new library into julia's libdir, then create a symlink that points to it
# Ignore errors here about being unable to dlopen libopenblas; that's expected.
using OpenBLASHighCoreCount_jll
new_realpath = realpath(OpenBLASHighCoreCount_jll.libopenblas_path)
@info("Copying $(new_realpath) into Julia's libdir...")
libdir = dirname(curr_openblas_path)
cp(new_realpath, joinpath(libdir, basename(new_realpath)); force=true)
# Update any symlinks that pointed to the old path, to instead point to the new path:
for f in readdir(libdir)
    f = joinpath(libdir, f)
    if islink(f) && readlink(f) == basename(curr_openblas_realpath)
        @info("Updating symlink $(f)")
        rm(f; force=true)
        symlink(basename(new_realpath), f)
    end
end
Then, quit and re-open Julia. I've built OpenBLAS 0.3.5 and 0.3.7 in a high-core configuration and it did not segfault when running the test program @scottstanie posted in OpenMathLib/OpenBLAS#2225 (comment). Please do feel free to test this out (once the JLL is registered) and let me know if anything is broken. |
Now that the OpenBLASHighCoreCount packages have been merged, I've updated my script above to actually work (tested on You can choose the OpenBLAS version you want to use by |
Seems like this is fixed. But please reopen if still an issue. |
Hi @staticfloat is there currently any workaround for more threads than 56? Our NUMA system currently has 120 threads and it's really a pain that we're capped at 56. |
Use the |
@ViralBShah Does it now work with a thread count > 56? As @staticfloat said in March, I'm still experiencing the memory region error with the new artifact with 120 threads. |
We are building with 128 threads: Perhaps it may need reporting upstream? |
Yeah it might be something that should be reported upstream. |
Okay, thanks for the guidance. For now I have worked around the issue by switching to MKL. If anyone is experiencing similar issues, try MKL. |
Just ran into this one with 64 threads (on 64 phys cores) on Julia v1.6.1 - I guess it depends on the use case? |
With threading enabled, on a 2-socket, 18-core HSW-EP (72 threads, with hyper-threading on):
OpenMathLib/OpenBLAS#539 points out that this means more buffers are needed, i.e. `NUM_THREADS` needs to be increased.
OpenBLAS appears to figure the machine out correctly. In `deps/build/openblas/Makefile.conf`:
And in `deps/build/openblas/Makefile.system`:
The problem is in `deps/Makefile:1092`:
This might be a bit out of date. :-)
As OpenBLAS appears to do a fine job of figuring out `NUM_CORES` in `Makefile.conf` (checked on both Linux and OS X), I suggest not setting `NUM_THREADS` in `OPENBLAS_BUILD_OPTS` at all (remove `deps/Makefile:1080` through `deps/Makefile:1094`).