
Poor performance on Power-9 hardware with GCC and SMT enabled #2380


Open
zephyr111 opened this issue Jan 31, 2020 · 8 comments

@zephyr111

Hello,

I benchmarked the following simple dgemm call using 4096x4096 matrices (thus n=4096, and a, b and c are n x n matrices) on an IBM LC922 machine with 2 POWER9 processors (22 cores and 88 hardware threads each):
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, a, n, b, n, 1.0, c, n);

Performance is great when using exactly 1 thread per core (with thread places and binding specified), but with GCC it drops to sequential levels as soon as 2 or 4 threads per core are used, and we can see that only one thread is actually computing. With Clang there is also a drop, but it is clearly less significant and more than one thread is running.

With GCC 8.3.0:

$ OMP_NUM_THREADS=44 OMP_PLACES="cores(44)" OMP_PROC_BIND=close ./a.out
462.493 Gflops (time: 0.29717 s)
$ OMP_NUM_THREADS=176 OMP_PLACES="threads(176)" OMP_PROC_BIND=close ./a.out
22.1915 Gflops (time: 6.1933 s)
$ OMP_NUM_THREADS=1 OMP_PLACES="cores(1)" OMP_PROC_BIND=close ./a.out
22.6448 Gflops (time: 6.06934 s)

With Clang 9.0.0-2:

$ OMP_NUM_THREADS=176 OMP_PLACES="threads(176)" OMP_PROC_BIND=close ./a.out
219.556 Gflops (time: 0.625986 s)
$ OMP_NUM_THREADS=176 OMP_PLACES="threads(176)" OMP_PROC_BIND=close ./a.out
221.271 Gflops (time: 0.621134 s)
$ OMP_NUM_THREADS=88 OMP_PLACES="threads(88)" OMP_PROC_BIND=close ./a.out
138.701 Gflops (time: 0.990901 s)
$ OMP_NUM_THREADS=88 OMP_PLACES="threads(88)" OMP_PROC_BIND=spread ./a.out
135.868 Gflops (time: 1.01156 s)
$ OMP_NUM_THREADS=44 OMP_PLACES="threads(44)" OMP_PROC_BIND=spread ./a.out
160.299 Gflops (time: 0.857392 s)
$ OMP_NUM_THREADS=44 OMP_PLACES="cores(44)" OMP_PROC_BIND=spread ./a.out
381.88 Gflops (time: 0.359901 s)

All tests are run on an Ubuntu 18.04.1 system.

Here is the command used to compile the basic example code:
g++ -O3 -mcpu=native -ffast-math main.cpp -I./OpenBLAS -L./OpenBLAS -lopenblas -fopenmp

The OpenBLAS commit used is quite recent: 8d2a796 (on origin/develop).

Note that this problem could also be related to possible issues in the OpenMP runtime implementation.

main.txt
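
For reference, a minimal harness along these lines (a sketch only, not the attached file; the matrix initialization, the single-call timing and the usual 2*n^3 flop count used to report Gflops are assumptions) would be:

#include <cstdio>
#include <vector>
#include <chrono>
#include <cblas.h>

int main() {
    const int n = 4096;
    // Column-major n x n matrices with arbitrary contents.
    std::vector<double> a(n * (size_t)n, 1.0), b(n * (size_t)n, 2.0), c(n * (size_t)n, 0.0);

    const auto start = std::chrono::steady_clock::now();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a.data(), n, b.data(), n, 1.0, c.data(), n);
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;

    // dgemm performs roughly 2*n^3 floating-point operations.
    const double gflops = 2.0 * n * n * (double)n / elapsed.count() / 1e9;
    std::printf("%g Gflops (time: %g s)\n", gflops, elapsed.count());
    return 0;
}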

@martin-frbg
Collaborator

lib(g)omp and/or general compiler capabilities would be my guess as well; perhaps a more "fair" comparison would be against gcc 9.2? Does the behaviour change with matrix size (assuming overhead from idling threads here)?

@RajalakshmiSR

For smaller n values (matrix sizes), performance is better with OMP_PLACES=threads(160) than with an explicit place list in OMP_PLACES, while for larger n values specifying explicit places gives better numbers.

n = 512:
~$ OMP_NUM_THREADS=160 OMP_PLACES="threads(160)" OMP_PROC_BIND=close ./a.out
6.44982 Gflops (time: 0.041619 s)
~$ OMP_NUM_THREADS=160 OMP_PLACES="{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}" OMP_PROC_BIND=close ./a.out
0.916901 Gflops (time: 0.292764 s)

n = 1024:
~$ OMP_NUM_THREADS=160 OMP_PLACES="threads(160)" OMP_PROC_BIND=close ./a.out
6.83539 Gflops (time: 0.314171 s)
~$ OMP_NUM_THREADS=160 OMP_PLACES="{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}" OMP_PROC_BIND=close ./a.out
5.63736 Gflops (time: 0.380938 s)

n = 2048
~$ OMP_NUM_THREADS=160 OMP_PLACES="threads(160)" OMP_PROC_BIND=close ./a.out
6.52525 Gflops (time: 2.63283 s)
~$ OMP_NUM_THREADS=160 OMP_PLACES="{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}" OMP_PROC_BIND=close ./a.out
17.802 Gflops (time: 0.965051 s)

@brada4
Contributor

brada4 commented Feb 1, 2020

It could be that SMT is microcoded so that you get worse performance when employing both (or is it four nowadays) false CPUs at once.
GOMP is quite NUMA-aware; try experimenting with OMP_PLACES to find the sweet spot.

@martin-frbg
Collaborator

@brada4 I doubt that hardware threads on power9 have restrictions similar to x86 HT that would warrant calling them "false" cores. Apart from the thread placement issue, I wonder if defining a GEMM_PREFERRED_SIZE to guide workload distribution, as introduced in 5b708e5, would improve performance.

@brada4
Contributor

brada4 commented Feb 2, 2020

What would be interesting to know is whether the default thread placement policy shows a regression, or no change, once it starts using the 2nd thread of a physical core.

@zephyr111
Author

Using GCC

The number of threads actually created is always one for GCC-8.3 when OMP_PLACES is not set to "cores(...)". This is not the case with Clang-9.0 (the number of threads created seems correct).
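
A small diagnostic of the following kind (a sketch, assuming an OpenMP 4.5 runtime for omp_get_place_num() and glibc's sched_getcpu(), which g++ exposes by default on Linux) is enough to confirm how many threads the runtime actually spawns and which hardware threads they land on:

#include <cstdio>
#include <omp.h>
#include <sched.h>  // sched_getcpu() (glibc)

int main() {
    #pragma omp parallel
    {
        // One line per spawned thread: its id, the team size, its OpenMP place
        // and the hardware thread it is currently running on.
        #pragma omp critical
        std::printf("thread %d/%d on place %d (cpu %d)\n",
                    omp_get_thread_num(), omp_get_num_threads(),
                    omp_get_place_num(), sched_getcpu());
    }
    return 0;
}

Compiled with g++ -fopenmp and run under the same OMP_* environment variables as the benchmark, it shows directly whether only one thread is created.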

Looking at the OMP_DISPLAY_ENV output when OMP_PLACES="cores(8)" (a configuration that gives 160 Gflops), we can see that:

OMP_PLACES = '{0:4},{4:4},{8:4},{12:4},{16:4},{20:4},{24:4},{28:4}'

This configuration gives good performance, while the following does not (it gives 22 Gflops):

OMP_PLACES = '{0},{4},{8},{12},{16},{20},{24},{28}'

And surprisingly, this one is good (it gives 160 Gflops):

OMP_PLACES = '{0:2},{4},{8},{12},{16},{20},{24},{28}'

Thus, the place of the first thread matters to GCC/libgomp and triggers the issue, which is most probably in libgomp itself. Note that this can also be reproduced with GCC-9.2, so I will contact the libgomp community about it.

Using Clang

On Clang-9, there is still a non-negligible performance drop when SMT is enabled (using manual thread placement).
While this is not as critical as the GCC/libgomp issue, it seems strange to me.
Indeed, for 4096x4096 matrices, the performance difference is 1.2x~1.6x (SMT-1 vs SMT-4). For huge matrices like 16Kx16K, the difference is around 1.1x between SMT-1 and SMT-4, but SMT-2 is strangely around 1.45x slower than SMT-1. Does this seem normal to you?
I tried to build commit 5b708e5 as @martin-frbg proposed, but the build fails with the following errors:

getarch_2nd.c: In function ‘main’:
getarch_2nd.c:12:35: error: ‘SGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
     printf("SGEMM_UNROLL_M=%d\n", SGEMM_DEFAULT_UNROLL_M);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:12:35: note: each undeclared identifier is reported only once for each function it appears in
getarch_2nd.c:13:35: error: ‘SGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_N’?
     printf("SGEMM_UNROLL_N=%d\n", SGEMM_DEFAULT_UNROLL_N);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:14:35: error: ‘DGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
     printf("DGEMM_UNROLL_M=%d\n", DGEMM_DEFAULT_UNROLL_M);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:15:35: error: ‘DGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_N’?
     printf("DGEMM_UNROLL_N=%d\n", DGEMM_DEFAULT_UNROLL_N);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:19:35: error: ‘CGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
     printf("CGEMM_UNROLL_M=%d\n", CGEMM_DEFAULT_UNROLL_M);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:20:35: error: ‘CGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_N’?
     printf("CGEMM_UNROLL_N=%d\n", CGEMM_DEFAULT_UNROLL_N);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:21:35: error: ‘ZGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
     printf("ZGEMM_UNROLL_M=%d\n", ZGEMM_DEFAULT_UNROLL_M);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:22:35: error: ‘ZGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_N’?
     printf("ZGEMM_UNROLL_N=%d\n", ZGEMM_DEFAULT_UNROLL_N);
                                   ^~~~~~~~~~~~~~~~~~~~~~
                                   XGEMM_DEFAULT_UNROLL_N
make: *** [getarch_2nd] Error 1
In file included from ../common.h:536,
                 from lapack/zpotf2.c:40:
lapack/zpotf2.c: In function ‘cpotf2_’:
../common_param.h:971:23: error: ‘GEMM_DEFAULT_OFFSET_A’ undeclared (first use in this function); did you mean ‘GEMM_DEFAULT_UNROLL_M’?
 #define GEMM_OFFSET_A GEMM_DEFAULT_OFFSET_A
                       ^~~~~~~~~~~~~~~~~~~~~
lapack/zpotf2.c:110:37: note: in expansion of macro ‘GEMM_OFFSET_A’
   sa = (FLOAT *)((BLASLONG)buffer + GEMM_OFFSET_A);
                                     ^~~~~~~~~~~~~
../common_param.h:971:23: note: each undeclared identifier is reported only once for each function it appears in
 #define GEMM_OFFSET_A GEMM_DEFAULT_OFFSET_A
                       ^~~~~~~~~~~~~~~~~~~~~
lapack/zpotf2.c:110:37: note: in expansion of macro ‘GEMM_OFFSET_A’
   sa = (FLOAT *)((BLASLONG)buffer + GEMM_OFFSET_A);
                                     ^~~~~~~~~~~~~
../common_param.h:1010:18: error: ‘CGEMM_DEFAULT_P’ undeclared (first use in this function); did you mean ‘CGEMM_DEFAULT_R’?
 #define CGEMM_P  CGEMM_DEFAULT_P
                  ^~~~~~~~~~~~~~~
[...]

Is there anything special to do in order to build this branch? (Note that the first 8 errors also appear when building the master branch, but not the develop branch.)

@martin-frbg
Collaborator

Sorry, that's a misunderstanding - the develop branch at that stage (more than a year ago) probably did not recognize Power9 at all (and master has been gathering dust a good while longer). What I meant was that it might be worthwhile to copy the idea behind that commit (#1846) and add a GEMM_PREFERRED_SIZE for POWER8/POWER9 in param.h as well. (Could be that PPC is much less disadvantaged by non-power-of-two vector lengths than x86, though.)
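
A sketch of what that could look like in param.h, inside the existing POWER9 section (GEMM_PREFERRED_SIZE is the macro name used by that commit; the value below is only a hypothetical starting point, not a tuned number):

#if defined(POWER9)
/* ... existing unroll/blocking parameters left as they are ... */
/* Workload-distribution hint introduced by 5b708e5 / #1846;
   the value here is a guess and would need benchmarking on POWER9. */
#define GEMM_PREFERRED_SIZE 16
#endif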

@RajalakshmiSR

Specifying GEMM_PREFERRED_SIZE in param.h for ppc does not make any difference for the above test case. However, I will check GEMM_PREFERRED_SIZE in general for common use cases and add it for POWER if it improves performance.
