Description
Hello,
I benchmarked the following simple dgemm call using 4096x4096 matrices (so n=4096, and a, b and c are matrices) on an IBM LC922 machine with 2 POWER9 processors (each with 22 cores and 88 hardware threads):
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, a, n, b, n, 1.0, c, n);
Performance is great when using exactly 1 thread per core (with thread places and binding specified). With GCC, however, performance drops sharply to the sequential level when 2 or 4 threads per core are used, and we can see that only one thread is actually computing. With Clang there is also a drop, but it is clearly less significant and more than one thread is running.
With GCC 8.3.0:
$ OMP_NUM_THREADS=44 OMP_PLACES="cores(44)" OMP_PROC_BIND=close ./a.out
462.493 Gflops (time: 0.29717 s)
$ OMP_NUM_THREADS=176 OMP_PLACES="threads(176)" OMP_PROC_BIND=close ./a.out
22.1915 Gflops (time: 6.1933 s)
$ OMP_NUM_THREADS=1 OMP_PLACES="cores(1)" OMP_PROC_BIND=close ./a.out
22.6448 Gflops (time: 6.06934 s)
With Clang 9.0.0-2:
$ OMP_NUM_THREADS=176 OMP_PLACES="threads(176)" OMP_PROC_BIND=close ./a.out
219.556 Gflops (time: 0.625986 s)
$ OMP_NUM_THREADS=176 OMP_PLACES="threads(176)" OMP_PROC_BIND=close ./a.out
221.271 Gflops (time: 0.621134 s)
$ OMP_NUM_THREADS=88 OMP_PLACES="threads(88)" OMP_PROC_BIND=close ./a.out
138.701 Gflops (time: 0.990901 s)
$ OMP_NUM_THREADS=88 OMP_PLACES="threads(88)" OMP_PROC_BIND=spread ./a.out
135.868 Gflops (time: 1.01156 s)
$ OMP_NUM_THREADS=44 OMP_PLACES="threads(44)" OMP_PROC_BIND=spread ./a.out
160.299 Gflops (time: 0.857392 s)
$ OMP_NUM_THREADS=44 OMP_PLACES="cores(44)" OMP_PROC_BIND=spread ./a.out
381.88 Gflops (time: 0.359901 s)
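For reference, the Gflops figures above are consistent with the conventional 2*n^3 operation count for a square dgemm. The following is a minimal sketch of that arithmetic only; the helper names `dgemm_flops` and `dgemm_gflops` are hypothetical and are not taken from the actual main.cpp, which is not shown here.

```cpp
#include <cassert>

// Conventional operation count for C = alpha*A*B + beta*C with an
// m x k matrix A and a k x n matrix B: one multiply and one add per
// inner-product term, i.e. 2*m*n*k. (Assumed convention, see lead-in.)
double dgemm_flops(double m, double n, double k) {
    return 2.0 * m * n * k;
}

// Rate in Gflops for one square n x n x n dgemm call that took
// `seconds` of wall-clock time.
double dgemm_gflops(int n, double seconds) {
    assert(seconds > 0.0);  // a zero or negative time makes no sense
    return dgemm_flops(n, n, n) / seconds / 1e9;
}
```

With n=4096, `dgemm_gflops(4096, 0.29717)` gives about 462.49 and `dgemm_gflops(4096, 6.1933)` gives about 22.19, matching the first two GCC runs.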
All tests were run on an Ubuntu 18.04.1 system.
Here is the command used to compile the basic example code:
g++ -O3 -mcpu=native -ffast-math main.cpp -I./OpenBLAS -L./OpenBLAS -lopenblas -fopenmp
The OpenBLAS commit used is quite recent: 8d2a796 (on origin/develop).
Note that this problem could also be related to possible issues in the OpenMP runtime implementation.
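One way to help separate the two possibilities is to check, independently of OpenBLAS, how many threads the OpenMP runtime actually starts and where it places them under the same environment variables. The following is a minimal sketch and not part of the benchmark above; the function name `report_thread_places` is hypothetical. It only uses standard OpenMP 4.5 runtime calls, and falls back to a single thread when compiled without -fopenmp.

```cpp
#include <cassert>
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Prints, for every thread of a parallel region, the place it is bound to,
// and returns the number of threads that actually entered the region.
// Compiled without -fopenmp, it reports nothing and returns 1.
int report_thread_places() {
    int active = 0;
#ifdef _OPENMP
    #pragma omp parallel reduction(+:active)
    {
        active += 1;  // each thread contributes 1 to the reduction
        #pragma omp critical
        std::printf("thread %d of %d -> place %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads(),
                    omp_get_place_num(), omp_get_num_places());
    }
#else
    active = 1;  // no OpenMP: only the calling thread exists
#endif
    assert(active >= 1);
    return active;
}
```

Calling `report_thread_places()` from a small main built with the same compiler and -fopenmp, and run with the same OMP_NUM_THREADS, OMP_PLACES and OMP_PROC_BIND settings as above, shows whether the runtime itself honours the requested places. It says nothing about how OpenBLAS then distributes work over those threads.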