
openblas runs on a single thread with OMP_PROC_BIND=TRUE on fedora #3435


Closed
KoykL opened this issue Nov 2, 2021 · 5 comments · Fixed by #3437

Comments

@KoykL

KoykL commented Nov 2, 2021

OS: fedora 35
CPU: epyc 7502p

With numpy linked against the OpenMP build of OpenBLAS 0.3.18, OpenBLAS runs on a single thread when OMP_PROC_BIND=TRUE is set. This is similar to #2238.

gdb shows that blas_cpu_number is indeed set to 32 (the physical core count). However, taskset shows that the affinity mask of the OpenBLAS process is set to "1". Manually overriding the affinity mask with taskset -pc 0-63 changes the mask, but OpenBLAS still runs on a single thread.

An OpenMP hello world program (https://www.geeksforgeeks.org/openmp-hello-world-program/) run with OMP_PROC_BIND=TRUE has the expected (0-63) affinity mask on this system.
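The taskset check described above can also be done from inside the process. Here is a minimal sketch, assuming Linux and Python 3. The helper name `thread_affinities` is made up for illustration and is not part of numpy or OpenBLAS. It lists the allowed CPUs of every thread in the current process, which is roughly what taskset reports for a single PID:

```python
import os


def thread_affinities(pid="self"):
    """Map each thread ID of a process to its sorted list of allowed CPUs.

    Assumes Linux: thread IDs are read from /proc/<pid>/task, and
    os.sched_getaffinity() accepts a thread ID as well as a process ID.
    """
    result = {}
    for name in os.listdir(f"/proc/{pid}/task"):
        tid = int(name)
        try:
            result[tid] = sorted(os.sched_getaffinity(tid))
        except ProcessLookupError:
            # The thread exited between listing and querying; skip it.
            continue
    return result


if __name__ == "__main__":
    for tid, cpus in thread_affinities().items():
        print(f"tid {tid}: {len(cpus)} allowed CPU(s): {cpus}")
```

Calling this after importing numpy with OMP_PROC_BIND=TRUE set would presumably show the narrowed mask on the main thread, although that has not been verified here.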

@brada4
Contributor

brada4 commented Nov 2, 2021

Could you try different scripts from https://github.com/xianyi/OpenBLAS/tree/develop/benchmark/scripts/NUMPY ?
And without any OMP overrides?

@KoykL
Author

KoykL commented Nov 2, 2021

Without OMP overrides, there is no issue. It only shows up when OMP_* variables are set.

The scripts you pointed to show the same behavior (fine without OMP_*, single thread with OMP_PROC_BIND).

@martin-frbg
Collaborator

Probably another data point in the saga of how to count the number of cpus - get_num_procs() in driver/others/memory.c uses sched_getaffinity(), which will probably always report a single cpu when OMP_PROC_BIND is anything other than FALSE. I'll experiment a bit and produce a PR.
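To illustrate why an affinity-based count can collapse to 1, here is a small sketch, assuming Linux and Python 3. It pins the calling process to one CPU by hand, as a stand-in for what an OpenMP runtime may do to the initial thread when binding is enabled, and compares the affinity-based count with the configured processor count. The function names are hypothetical, and this is not the code in driver/others/memory.c.

```python
import os


def count_cpus_from_affinity() -> int:
    """Count CPUs in the caller's affinity mask (the sched_getaffinity approach)."""
    return len(os.sched_getaffinity(0))


def count_cpus_configured() -> int:
    """Count CPUs configured in the system, regardless of the affinity mask."""
    return os.sysconf("SC_NPROCESSORS_CONF")


def simulate_binding():
    """Pin the process to one CPU, take both counts, then restore the mask.

    The manual pinning is only an assumption-laden stand-in for the effect
    of OMP_PROC_BIND on the initial thread.
    """
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {min(original)})
    try:
        pinned = count_cpus_from_affinity()
        configured = count_cpus_configured()
    finally:
        # Always restore the original mask, even if a count fails.
        os.sched_setaffinity(0, original)
    return pinned, configured


if __name__ == "__main__":
    pinned, configured = simulate_binding()
    print(f"affinity-based count while pinned: {pinned}")
    print(f"configured processors: {configured}")
```

While pinned, the affinity-based count is 1 no matter how many processors the machine has, which matches the single-thread behavior reported above.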

@martin-frbg
Collaborator

The tentative fix has now been refined to use omp_get_num_places() (which should be better than just going with _SC_NPROCESSORS_CONF).

@marioroy

marioroy commented Apr 12, 2022

I ran into the same issue on Ubuntu 20.04 using the libopenblas-openmp-dev package. For folks experiencing this issue, the following utilizes all physical cores on an AMD Threadripper 3970X box.

sudo apt update
sudo apt install build-essential libopenblas-openmp-dev numactl wget

cd ~/Downloads
wget https://www.lanl.gov/projects/crossroads/_assets/docs/micro/mtdgemm-crossroads-v1.0.0.tgz

tar xzf mtdgemm-crossroads-v1.0.0.tgz
cd mt-dgemm/src

gcc -o mt-dgemm-openblas mt-dgemm.c -mtune=znver2 -march=znver2 -mavx2 -lm -fopenmp -Ofast -ffp-contract=fast -funroll-loops -I/usr/include/x86_64-linux-gnu/openblas-openmp /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblas.a -lpthread -lm -DUSE_CBLAS

OMP_NUM_THREADS=32 OMP_PROC_BIND=close OMP_PLACES=cores ./mt-dgemm-openblas 8192 4
GFLOP/s rate:  1381.483761 GF/s

OMP_NUM_THREADS=32 numactl -C 0-31 ./mt-dgemm-openblas 8192 4  # performed similarly to
OMP_NUM_THREADS=32 numactl --physcpubind=0-31 ./mt-dgemm-openblas 8192 4  # ditto
OMP_NUM_THREADS=32 ./mt-dgemm-openblas 8192 4
GFLOP/s rate:  1257.241689 GF/s

Using OpenBLAS, the dgemm example benefits from setting OMP_PROC_BIND=close and OMP_PLACES=cores.

For reference, each of the following invocations utilizes only one core.

OMP_NUM_THREADS=32 OMP_PROC_BIND=true OMP_PLACES="$( seq -s },{ 0 1 31 | sed -e 's/\(.*\)/\{\1\}/' )" ./mt-dgemm-openblas 8192 4

OMP_NUM_THREADS=32 GOMP_CPU_AFFINITY=0-31:1 ./mt-dgemm-openblas 8192 4
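The OMP_PLACES value built with seq and sed above is a list of single-CPU places of the form {0},{1},...,{31}. As a rough sketch, the same string can be produced in Python. The helper name `build_places` is hypothetical, and this only reproduces the setting from the single-core example; it is not a recommended configuration.

```python
def build_places(num_cpus: int) -> str:
    """Return an OMP_PLACES string with one single-CPU place per CPU.

    For num_cpus=3 this yields "{0},{1},{2}", the same shape as the
    output of the seq/sed pipeline in the command above.
    """
    return ",".join(f"{{{cpu}}}" for cpu in range(num_cpus))


if __name__ == "__main__":
    # 32 matches the OMP_NUM_THREADS value used in the comment above.
    print(build_places(32))
```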
