OpenBLAS crashing for Julia with different threading options #2225

Closed
scottstanie opened this issue Aug 17, 2019 · 9 comments · Fixed by #3352

@scottstanie

Hi, I've posted this to the Julia issues but figured I would link here as well

JuliaLang/julia#14857 (comment)

My machine has 56 cores, and OPENBLAS_NUM_THREADS seems to default to 8 on it. I don't remember having any control over that, as I don't believe I built OpenBLAS from source.

Running with half the julia threads works for me:

$ OPENBLAS_NUM_THREADS=8 JULIA_NUM_THREADS=28 julia13 --start=no
julia> versioninfo()
Julia Version 1.3.0-alpha.0
Commit 6c11e7c2c4 (2019-07-23 01:46 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)
Environment:
  JULIA_NUM_THREADS = 28

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ()), Threads.nthreads()
(8, 28)

julia> include("./blas_thread_test.jl")
190.196333 seconds (15.18 M allocations: 384.406 GiB, 9.51% gc time)

But other settings segfault, all with the same error (`BLAS : Program is Terminated. Because you tried to allocate too many memory regions.`):

$ OPENBLAS_NUM_THREADS=8 JULIA_NUM_THREADS=56 julia13 --start=no

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ()), Threads.nthreads()
(8, 56)

julia> include("./blas_thread_test.jl")
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
....
signal (11): Segmentation fault
in expression starting at /home/scott/repos/blas_thread_test.jl:23
....
(Let me know if you want the full stack trace.)

$ OPENBLAS_NUM_THREADS=1 JULIA_NUM_THREADS=56 julia13 --start=no

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ()), Threads.nthreads()
(1, 56)

julia> include("./blas_thread_test.jl")
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.

signal (11): Segmentation fault
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
in expression starting at /home/sn expression starting at /home/scott/repos/blas_thread_test.jl:23

Here's the test program I'm running:

$ cat blas_thread_test.jl
using LinearAlgebra
function irls(A, b; iters=100)
    M, N = size(A)
    x = zeros(eltype(b), N)
    ep = sqrt(eps(eltype(A)))
    p = 1

    W = diagm(0 => (abs.(b-A*x) .+ ep).^(p-2))

    for ii in 1:iters
        x .= (A' * W * A) \ (A' * W * b)
        W .= diagm(0 => (abs.(b-A*x) .+ ep).^(p-2))
    end
    return x
end


M, N = 500, 30
A = randn(Float32, M, N);
bstack = rand(Float32, 60, 60, M);
xs = zeros(Float32, 60, 60, N);

@time Threads.@threads for j=1:size(bstack, 2)
    for i = 1:size(bstack, 1)
        xs[i, j, :] .= irls(A, bstack[i, j, :], iters=100)
    end
end

@martin-frbg
Collaborator

The problem is probably that Julia is trying to invoke (at least close to) 56 instances of OpenBLAS, and your libopenblas.so was built to preallocate only 28 (or thereabouts) memory buffers. In such a situation, even limiting OPENBLAS_NUM_THREADS is not likely to help as the number of concurrent calls is imposed from outside. (NUM_THREADS is a compile-time parameter that takes its default from the number of cores - including hyperthreading ones - on the build host)
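
As a rough illustration of the failure mode described above, here is a toy Python model of a fixed-size buffer pool. It is not OpenBLAS code: the class name `FixedBufferPool`, the helper `max_concurrent_callers`, and the error text are stand-ins, and the pool size of 28 is just the estimate quoted above. It only shows that a pool sized at build time cannot serve more simultaneous callers than it has buffers, however many threads each call uses internally.

```python
class FixedBufferPool:
    """Toy model of a pool whose size is fixed at build time (hypothetical, not OpenBLAS code)."""

    def __init__(self, num_buffers):
        self.num_buffers = num_buffers
        self.in_use = 0

    def acquire(self):
        # A real library would hand out a memory region here; this model only counts.
        if self.in_use >= self.num_buffers:
            raise RuntimeError("too many memory regions requested")
        self.in_use += 1

    def release(self):
        if self.in_use == 0:
            raise RuntimeError("release without matching acquire")
        self.in_use -= 1


def max_concurrent_callers(pool_size, callers):
    """Return how many of `callers` simultaneous calls get a buffer before the pool runs dry."""
    pool = FixedBufferPool(pool_size)
    served = 0
    for _ in range(callers):
        try:
            pool.acquire()
        except RuntimeError:
            break
        served += 1
    return served


# Pool size 28 is the estimate given above ("28 or thereabouts"), not a measured value.
print(max_concurrent_callers(28, 28))  # every caller gets a buffer
print(max_concurrent_callers(28, 56))  # the pool is exhausted after 28 callers
```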

@ViralBShah
Contributor

ViralBShah commented Aug 18, 2019

We do indeed build Julia on Linux with NUM_THREADS of 16. Is there a way to make this a run-time setting rather than a compile-time one? Because of the high level of allocation for a large number of threads, we try to keep this setting on the lower side. However, people on larger machines are then unable to use all the cores without recompiling OpenBLAS.

@ViralBShah
Contributor

ViralBShah commented Aug 18, 2019

Also, it would be nice if OpenBLAS could print an appropriate error message if it detects inconsistent compile time and run time settings. That way we can avoid a crash, and openblas can simply refuse to compute, or compute with fewer threads.

@martin-frbg
Collaborator

There is no easy way to turn this into a runtime setting, unfortunately. This is one of the legacies from GotoBLAS, from a time when both precompiled binaries and high core counts were much less common.
OTOH, if you know that you supply an OpenBLAS that is limited to 16 threads, perhaps you can adjust the Julia defaults to match?
(The still somewhat experimental thread memory management code that one can select by compiling with USE_TLS=1 should have a much smaller memory footprint, but it is unclear whether all of its bugs have been found and fixed.)

@ViralBShah
Contributor

We can certainly provide a higher default, but then people on smaller machines don't like the extra memory allocated - which can be substantial.

@staticfloat @vtjnash Should we try USE_TLS=1? I guess if it passes all of Julia's tests and perhaps even package tests - it is worth a shot.

@brada4
Contributor

brada4 commented Aug 18, 2019

The default is MAX(50, NCPU*2) regions allowed at compile time, i.e. with more than 50 real CPUs, whether threaded within OpenBLAS or called from multithreaded programs, you must set the CPU number higher. The rationale back then was to make the region-holding structure bigger to silence most bug reports of this kind, while still keeping the structure under one memory page so as not to raise TLB misses and the like.
You can probably examine the OpenBLAS included in your distribution; odds are high that it is built with 32 or 64 threads and double that number of regions, and will fit the bill.

@TiborGY
Contributor

TiborGY commented Aug 18, 2019

The problem in a nutshell is that there is a limit to how many threads of OpenBLAS can be running at any given moment. If you set OpenBLAS to use N threads, and then you call a BLAS/LAPACK function from M threads, you use up N*M slots. If you run out of slots, you either get a crash or incorrect results.

Please see the threading related remarks in Makefile.rule. They are not perfect or exhaustive, but generally OK.
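
To make the rule of thumb above concrete, the following Python sketch multiplies the two thread counts for the three configurations from the original report. The function name `slots_demanded` is made up for illustration, and the product should be read as a worst case under this rule, assuming every calling thread is inside a BLAS call at the same moment.

```python
def slots_demanded(openblas_threads, calling_threads):
    """Slots used per the N*M rule of thumb (hypothetical helper, not an OpenBLAS API)."""
    return openblas_threads * calling_threads


# (OPENBLAS_NUM_THREADS, JULIA_NUM_THREADS) pairs taken from the report above
for n, m in [(8, 28), (8, 56), (1, 56)]:
    print(f"N={n}, M={m}: up to {slots_demanded(n, m)} slots")
```

Whether the limit is actually hit depends on the number of slots compiled into the library, which this sketch does not model.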

@TiborGY
Contributor

TiborGY commented Aug 18, 2019

As Andrew noted previously, the number of slots is a compile time constant for OpenBLAS.
In the current code this is handled by this line:
https://github.com/xianyi/OpenBLAS/blob/89b60dab8ad21a0cc6320cbd9fcd603c4c4bfc81/common.h#L186

So you have at least 50 slots. If the product 2 * NUM_THREADS * NUM_PARALLEL is larger than 50, then that will be the number of slots in the library. (NUM_THREADS and NUM_PARALLEL are compile-time constants.)
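
The slot count described above can be written out as a small Python function. This is a sketch of the arithmetic only, not the C macro itself: the function name `num_slots` is hypothetical, `num_threads=16` is the Julia build setting mentioned earlier in this thread, and `num_parallel=1` is an assumption.

```python
def num_slots(num_threads, num_parallel=1):
    """max(50, 2 * NUM_THREADS * NUM_PARALLEL), as described in the comment above."""
    return max(50, 2 * num_threads * num_parallel)


print(num_slots(16))  # 2*16*1 = 32, so the floor of 50 applies
print(num_slots(32))  # 2*32*1 = 64
print(num_slots(56))  # 2*56*1 = 112
```

Under these assumptions a 16-thread build has 50 slots, which is more than the 28 Julia threads in the working run and fewer than the 56 in the failing ones. That would be consistent with the report, though nothing in this thread confirms it as the exact cause.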

@brada4
Contributor

brada4 commented Aug 18, 2019

:-) I actually added that 50 in that line
