Poor performance on Power-9 hardware with GCC and SMT enabled #2380
Comments
lib(g)omp and/or general compiler capabilities would be my guess as well; perhaps a more "fair" comparison would be against gcc 9.2? Does the behaviour change with matrix size (assuming overhead from idling threads here)?
For smaller matrix sizes n, performance is better with OMP_PLACES=threads(160) than with explicit places in OMP_PLACES, while for larger n, specifying explicit places gives better numbers. n = 512: n = 1024: n = 2048
It could be that SMT is microcoded so that you get worse performance when employing both (or is it four nowadays?) false CPUs at once.
What would be interesting to know is whether the default thread placement policy shows any regression, or no change, when it starts using the 2nd thread in a physical core.
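One way to observe this directly is a tiny program that prints which logical CPU each OpenMP thread ends up on under the default policy. This is only a minimal sketch, assuming a Linux system; sched_getcpu() is glibc-specific, and the file name and run commands are illustrative.

// smt_check.cpp - print which logical CPU each OpenMP thread lands on,
// to see when the default placement starts using the 2nd thread of a core.
// Build: g++ -O3 -fopenmp smt_check.cpp -o smt_check
// Run e.g.: OMP_NUM_THREADS=44 ./smt_check, then compare with 88 and 176.
#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu(), Linux/glibc specific

int main() {
    #pragma omp parallel
    {
        #pragma omp critical
        std::printf("OpenMP thread %3d of %3d runs on logical CPU %3d\n",
                    omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}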
Using GCC
The number of threads actually created is always one for GCC-8.3 when [...]. By looking [...]:
This configuration gives good performance while the following does not (it gives 22 Gflops):
And surprisingly this one is good (it gives 160 Gflops):
Thus, the place of the first thread is important in GCC/libGOMP and causes the issue.
Using Clang
On Clang-9, there is still a non-negligible drop in terms of performance when SMT is enabled (using a manual thread placement).
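To check this behaviour independently, a small sketch using the standard OpenMP places API (OpenMP 4.5) can report how many threads libgomp actually creates and which place each one was assigned. The OMP_PLACES value in the run example below is purely illustrative.

// places_check.cpp - report the place assigned to each OpenMP thread,
// to verify the "only one thread is created" behaviour described above.
// Build: g++ -O3 -fopenmp places_check.cpp -o places_check
// Run e.g.: OMP_DISPLAY_ENV=true OMP_NUM_THREADS=4 OMP_PLACES="{0},{4},{8},{12}" ./places_check
#include <cstdio>
#include <omp.h>

int main() {
    std::printf("omp_get_num_places() = %d\n", omp_get_num_places());
    #pragma omp parallel
    {
        #pragma omp critical
        std::printf("thread %3d / %3d -> place %3d\n",
                    omp_get_thread_num(), omp_get_num_threads(),
                    omp_get_place_num());
    }
    return 0;
}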
Is there anything special to do in order to build this branch? (Note that the first 8 errors also appear when building the master branch, but not the develop branch.)
Sorry, that's a misunderstanding - the develop branch at that stage (more than a year ago) probably did not recognize Power9 at all (and master has been gathering dust a good while longer). What I meant was that it might be worthwhile to copy the idea behind commit #1846 and add a GEMM_PREFERRED_SIZE for POWER8/POWER9 in param.h as well. (It could be that PPC is much less disadvantaged by non-power-of-two vector lengths than x86, though.)
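For concreteness, such an addition to param.h might look roughly like the sketch below; the value 16 is purely illustrative (not a tuned choice) and the surrounding conditional structure is simplified.

/* Hypothetical addition to param.h, modelled on the x86_64 entries
   introduced by #1846; the value 16 is illustrative only. */
#if defined(POWER8) || defined(POWER9)
#define GEMM_PREFERRED_SIZE 16
#endif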
Specifying GEMM_PREFERRED_SIZE in param.h for ppc does not make any difference for the above testcase. However, I will check using GEMM_PREFERRED_SIZE in general for common use cases and add it for POWER if it improves performance.
Hello,
I benchmarked the following simple dgemm call using 4096x4096 matrices (thus n=4096, and a, b and c are the matrices) on an IBM LC922 machine with 2 POWER9 processors (each with 22 cores and 88 hardware threads):
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, a, n, b, n, 1.0, c, n);
The performance is great when using exactly 1 thread per core (and specifying thread places and binding), but it drops strongly, down to the sequential performance, if 2 or 4 threads per core are used with gcc, and we can see that only one thread is actually computing. Note that with clang there is also a drop, but it is clearly less significant and more than one thread is running.
With GCC 8.3.0:
With Clang 9.0.0-2:
All tests are run on an Ubuntu 18.04.1 system.
Here is the command used to compile the basic example code:
g++ -O3 -mcpu=native -ffast-math main.cpp -I./OpenBLAS -L./OpenBLAS -lopenblas -fopenmp
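For readers without the attachment, a minimal self-contained benchmark along these lines could look like the sketch below. This is not the attached main.txt; the warm-up call, timing loop and GFLOPS computation are assumptions.

// main.cpp - time a single 4096x4096 dgemm call and report GFLOPS.
// Build: g++ -O3 -mcpu=native -ffast-math main.cpp -I./OpenBLAS -L./OpenBLAS -lopenblas -fopenmp
#include <chrono>
#include <cstdio>
#include <vector>
#include <cblas.h>

int main() {
    const int n = 4096;
    std::vector<double> a(n * n, 1.0), b(n * n, 1.0), c(n * n, 0.0);

    // Warm-up call so thread creation and binding are not timed.
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a.data(), n, b.data(), n, 1.0, c.data(), n);

    auto t0 = std::chrono::steady_clock::now();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a.data(), n, b.data(), n, 1.0, c.data(), n);
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double gflops  = 2.0 * n * n * n / seconds / 1e9;
    std::printf("dgemm: %.3f s, %.1f GFLOPS\n", seconds, gflops);
    return 0;
}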
The OpenBLAS git commit used is quite up to date: 8d2a796 (on origin/develop).
Note that this problem could also be related to possible issues in the OpenMP runtime implementation.
Attachment: main.txt