
Poor performance on Power-9 hardware with GCC and SMT enabled #2380


Open
zephyr111 opened this issue Jan 31, 2020 · 8 comments

@zephyr111

Hello,

I benchmarked the following simple dgemm call using 4096x4096 matrices (thus n=4096, and a, b and c are n x n matrices) on an IBM LC922 machine with 2 POWER9 processors (22 cores and 88 hardware threads each):
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, a, n, b, n, 1.0, c, n);

Performance is great when using exactly 1 thread per core (with thread places and binding specified), but with GCC it drops to sequential levels as soon as 2 or 4 threads per core are used, and we can see that only one thread is actually computing. With Clang there is also a drop, but it is clearly less significant and more than one thread is running.

With GCC 8.3.0:

$ OMP_NUM_THREADS=44 OMP_PLACES="cores(44)" OMP_PROC_BIND=close ./a.out
462.493 Gflops (time: 0.29717 s)
$ OMP_NUM_THREADS=176 OMP_PLACES="threads(176)" OMP_PROC_BIND=close ./a.out
22.1915 Gflops (time: 6.1933 s)
$ OMP_NUM_THREADS=1 OMP_PLACES="cores(1)" OMP_PROC_BIND=close ./a.out
22.6448 Gflops (time: 6.06934 s)

With Clang 9.0.0-2:

$ OMP_NUM_THREADS=176 OMP_PLACES="threads(176)" OMP_PROC_BIND=close ./a.out
219.556 Gflops (time: 0.625986 s)
$ OMP_NUM_THREADS=176 OMP_PLACES="threads(176)" OMP_PROC_BIND=close ./a.out
221.271 Gflops (time: 0.621134 s)
$ OMP_NUM_THREADS=88 OMP_PLACES="threads(88)" OMP_PROC_BIND=close ./a.out
138.701 Gflops (time: 0.990901 s)
$ OMP_NUM_THREADS=88 OMP_PLACES="threads(88)" OMP_PROC_BIND=spread ./a.out
135.868 Gflops (time: 1.01156 s)
$ OMP_NUM_THREADS=44 OMP_PLACES="threads(44)" OMP_PROC_BIND=spread ./a.out
160.299 Gflops (time: 0.857392 s)
$ OMP_NUM_THREADS=44 OMP_PLACES="cores(44)" OMP_PROC_BIND=spread ./a.out
381.88 Gflops (time: 0.359901 s)

All tests are run on an Ubuntu 18.04.1 system.

Here is the command used to compile the basic example code:
g++ -O3 -mcpu=native -ffast-math main.cpp -I./OpenBLAS -L./OpenBLAS -lopenblas -fopenmp

The OpenBLAS commit used is quite recent: 8d2a796 (on origin/develop).

Note that this problem could also be related to possible issues in the OpenMP runtime implementation.

main.txt
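
For reference, a minimal harness along these lines (a sketch only, not the attached file; the matrix initialization, the single-call timing and the usual 2*n^3 flop count used to report Gflops are assumptions) would be:

#include <cstdio>
#include <vector>
#include <chrono>
#include <cblas.h>

int main() {
    const int n = 4096;
    // Column-major n x n matrices with arbitrary contents.
    std::vector<double> a(n * (size_t)n, 1.0), b(n * (size_t)n, 2.0), c(n * (size_t)n, 0.0);

    const auto start = std::chrono::steady_clock::now();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a.data(), n, b.data(), n, 1.0, c.data(), n);
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;

    // dgemm performs roughly 2*n^3 floating-point operations.
    const double gflops = 2.0 * n * n * (double)n / elapsed.count() / 1e9;
    std::printf("%g Gflops (time: %g s)\n", gflops, elapsed.count());
    return 0;
}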

@martin-frbg
Collaborator

lib(g)omp and/or general compiler capabilities would be my guess as well; perhaps a more "fair" comparison would be against gcc 9.2? Does the behaviour change with matrix size (assuming overhead from idling threads here)?

@RajalakshmiSR

For smaller n values (matrix sizes), performance is better with OMP_PLACES=threads(160) than with an explicit place list in OMP_PLACES, while for larger n values specifying explicit places gives better numbers.

n = 512:
~$ OMP_NUM_THREADS=160 OMP_PLACES="threads(160)" OMP_PROC_BIND=close ./a.out
6.44982 Gflops (time: 0.041619 s)
~$ OMP_NUM_THREADS=160 OMP_PLACES="{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}" OMP_PROC_BIND=close ./a.out
0.916901 Gflops (time: 0.292764 s)

n = 1024:
~$ OMP_NUM_THREADS=160 OMP_PLACES="threads(160)" OMP_PROC_BIND=close ./a.out
6.83539 Gflops (time: 0.314171 s)
~$ OMP_NUM_THREADS=160 OMP_PLACES="{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}" OMP_PROC_BIND=close ./a.out
5.63736 Gflops (time: 0.380938 s)

n = 2048
~$ OMP_NUM_THREADS=160 OMP_PLACES="threads(160)" OMP_PROC_BIND=close ./a.out
6.52525 Gflops (time: 2.63283 s)
~$ OMP_NUM_THREADS=160 OMP_PLACES="{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}" OMP_PROC_BIND=close ./a.out
17.802 Gflops (time: 0.965051 s)

@brada4
Contributor

brada4 commented Feb 1, 2020

It could be that SMT is microcoded so that you get worse performance when employing both (or is it four nowadays) false CPUs at once.
GOMP is quite NUMA-aware; try experimenting with OMP_PLACES to find the sweet spot.

@martin-frbg
Collaborator

@brada4 I doubt that hardware threads on power9 have restrictions similar to x86 HT that would warrant calling them "false" cores. Apart from the thread placement issue, I wonder if defining a GEMM_PREFERRED_SIZE to guide workload distribution, as introduced in 5b708e5, would improve performance.

@brada4
Contributor

brada4 commented Feb 2, 2020

What would be interesting to know is whether the default thread placement policy shows a regression, or no change, once it starts using the 2nd thread of a physical core.

@zephyr111
Author

Using GCC

The number of threads actually created is always one for GCC-8.3 when OMP_PLACES is not set to "cores(...)". This is not the case with Clang-9.0 (the number of threads created seems correct).
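
A small diagnostic of the following kind (a sketch, assuming an OpenMP 4.5 runtime for omp_get_place_num() and glibc's sched_getcpu(), which g++ exposes by default on Linux) is enough to confirm how many threads the runtime actually spawns and which hardware threads they land on:

#include <cstdio>
#include <omp.h>
#include <sched.h>  // sched_getcpu() (glibc)

int main() {
    #pragma omp parallel
    {
        // One line per spawned thread: its id, the team size, its OpenMP place
        // and the hardware thread it is currently running on.
        #pragma omp critical
        std::printf("thread %d/%d on place %d (cpu %d)\n",
                    omp_get_thread_num(), omp_get_num_threads(),
                    omp_get_place_num(), sched_getcpu());
    }
    return 0;
}

Compiled with g++ -fopenmp and run under the same OMP_* environment variables as the benchmark, it shows directly whether only one thread is created.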

Looking at the OMP_DISPLAY_ENV output when OMP_PLACES="cores(8)" (a configuration that gives 160 Gflops), we can see that:

OMP_PLACES = '{0:4},{4:4},{8:4},{12:4},{16:4},{20:4},{24:4},{28:4}'

This configuration gives good performance, while the following does not (it gives 22 Gflops):

OMP_PLACES = '{0},{4},{8},{12},{16},{20},{24},{28}'

And surprisingly, this one is good (it gives 160 Gflops):

OMP_PLACES = '{0:2},{4},{8},{12},{16},{20},{24},{28}'

Thus, the place of the first thread matters to GCC/libgomp and triggers the issue, which is most probably in libgomp itself. Note that this can also be reproduced with GCC-9.2, so I will contact the libgomp community about it.

Using Clang

On Clang-9, there is still a non-negligible performance drop when SMT is enabled (using manual thread placement).
While this is not as critical as the GCC/libgomp issue, it seems strange to me.
Indeed, for 4096x4096 matrices, the performance difference is 1.2x~1.6x (SMT-1 vs SMT-4). For huge matrices like 16Kx16K, the difference is around 1.1x between SMT-1 and SMT-4, but SMT-2 is strangely around 1.45x slower than SMT-1. Does this seem normal to you?
I tried to build commit 5b708e5 as @martin-frbg proposed, but the build fails with the following errors:

getarch_2nd.c: In function ‘main’:
getarch_2nd.c:12:35: error: ‘SGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
     printf("SGEMM_UNROLL_M=%d\n", SGEMM_DEFAULT_UNROLL_M);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:12:35: note: each undeclared identifier is reported only once for each function it appears in
getarch_2nd.c:13:35: error: ‘SGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_N’?
     printf("SGEMM_UNROLL_N=%d\n", SGEMM_DEFAULT_UNROLL_N);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:14:35: error: ‘DGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
     printf("DGEMM_UNROLL_M=%d\n", DGEMM_DEFAULT_UNROLL_M);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:15:35: error: ‘DGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_N’?
     printf("DGEMM_UNROLL_N=%d\n", DGEMM_DEFAULT_UNROLL_N);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:19:35: error: ‘CGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
     printf("CGEMM_UNROLL_M=%d\n", CGEMM_DEFAULT_UNROLL_M);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:20:35: error: ‘CGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_N’?
     printf("CGEMM_UNROLL_N=%d\n", CGEMM_DEFAULT_UNROLL_N);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:21:35: error: ‘ZGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
     printf("ZGEMM_UNROLL_M=%d\n", ZGEMM_DEFAULT_UNROLL_M);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:22:35: error: ‘ZGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_N’?
     printf("ZGEMM_UNROLL_N=%d\n", ZGEMM_DEFAULT_UNROLL_N);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_N
make: *** [getarch_2nd] Error 1
In file included from ../common.h:536,
                 from lapack/zpotf2.c:40:
lapack/zpotf2.c: In function ‘cpotf2_’:
../common_param.h:971:23: error: ‘GEMM_DEFAULT_OFFSET_A’ undeclared (first use in this function); did you mean ‘GEMM_DEFAULT_UNROLL_M’?
 #define GEMM_OFFSET_A GEMM_DEFAULT_OFFSET_A
                       ^~~~~~~~~~~~~~~~~~~~~
lapack/zpotf2.c:110:37: note: in expansion of macro ‘GEMM_OFFSET_A’
   sa = (FLOAT *)((BLASLONG)buffer + GEMM_OFFSET_A);
                                     ^~~~~~~~~~~~~
../common_param.h:971:23: note: each undeclared identifier is reported only once for each function it appears in
 #define GEMM_OFFSET_A GEMM_DEFAULT_OFFSET_A
                       ^~~~~~~~~~~~~~~~~~~~~
lapack/zpotf2.c:110:37: note: in expansion of macro ‘GEMM_OFFSET_A’
   sa = (FLOAT *)((BLASLONG)buffer + GEMM_OFFSET_A);
                                     ^~~~~~~~~~~~~
../common_param.h:1010:18: error: ‘CGEMM_DEFAULT_P’ undeclared (first use in this function); did you mean ‘CGEMM_DEFAULT_R’?
 #define CGEMM_P  CGEMM_DEFAULT_P
                  ^~~~~~~~~~~~~~~
[...]

Is there anything special to do in order to build this branch? (Note that the first 8 errors also appear when building the master branch, but not the develop branch.)

@martin-frbg
Collaborator

Sorry, that's a misunderstanding - the develop branch at that stage (more than a year ago) probably did not recognize Power9 at all (and master has been gathering dust a good while longer). What I meant was that it might be worthwhile to copy the idea behind that commit (#1846) and add a GEMM_PREFERRED_SIZE for POWER8/POWER9 in param.h as well. (Could be that PPC is much less disadvantaged by non-power-of-two vector lengths than x86, though.)
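
A sketch of what that could look like in param.h, inside the existing POWER9 section (GEMM_PREFERRED_SIZE is the macro name used by that commit; the value below is only a hypothetical starting point, not a tuned number):

#if defined(POWER9)
/* ... existing unroll/blocking parameters left as they are ... */
/* Workload-distribution hint introduced by 5b708e5 / #1846;
   the value here is a guess and would need benchmarking on POWER9. */
#define GEMM_PREFERRED_SIZE 16
#endif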

@RajalakshmiSR

Specifying GEMM_PREFERRED_SIZE in param.h for ppc does not make any difference for the above test case. However, I will check GEMM_PREFERRED_SIZE in general for common use cases and add it for POWER if it improves performance.
