Slowdown when using openblas-pthreads alongside openmp based parallel code #3187
Most likely you get too many threads running in parallel in the non-OpenMP case.
For what it's worth, the number of running threads shown by htop only increases by 4 when I run this.
Profiling with Linux perf shows that most of the time comes from …
Would be worth checking this: …

Is it really …
My main concern is that the issue happens with OpenBLAS built with pthreads, not when it is built with OpenMP! When I execute my snippet above, libopenblas is loaded because numpy is imported, and libgomp is loaded because of the prange loop. It can be confirmed with threadpoolctl:

```python
from threadpoolctl import threadpool_info
threadpool_info()
```

```
[{'filepath': '/home/jeremie/miniconda/envs/tmp/lib/libopenblasp-r0.3.12.so',
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'internal_api': 'openblas',
  'version': '0.3.12',
  'num_threads': 4,
  'threading_layer': 'pthreads'},
 {'filepath': '/home/jeremie/miniconda/envs/tmp/lib/libgomp.so.1.0.0',
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'internal_api': 'openmp',
  'version': None,
  'num_threads': 4}]
```

If I replace prange by range, libgomp disappears from the list, and if I don't import numpy, libopenblas disappears. It also confirms that I have only one OpenMP runtime loaded (threadpoolctl takes symlinks into account).
Thanks for the info. That's to be expected, though. When running OpenMP, you create a few threads, and then each thread calls BLAS, which in turn creates more threads. In a multi-threaded environment, it's safest to just …
It's not an oversubscription issue here. They are not nested: I first call gemm and then call a function which executes a parallel loop (with no BLAS inside).
Also, I ran …
I see the same issue: a 30x slowdown with libgomp and a 2x slowdown with libomp.
OpenBLAS assumes each thread has a CPU to itself, with all of its outermost caches. A 30x slowdown means multiple threads are stuck on the same CPU and actually spill to main memory instead of staying in cache.
Sorry, I'm not sure I understand your answer. I agree that HT is useless in HPC most of the time, but it does not seem to be the only issue here, since the program is fast when OpenBLAS is built with pthreads and I run the loop in sequential mode. The issue only appears when I also run the loop in parallel with OpenMP. To repeat: they are not nested, but run one after the other. I'm posting a pure C reproducer here; hope it makes my concerns clearer.

```c
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "cblas.h"

// OpenMP parallel reduction over an array, with no BLAS inside.
double f2(double *X, int n){
    double v = 0.0;
    #pragma omp parallel for reduction(+:v)
    for(int i=0; i<n; i++){
        v += X[i];
    }
    return v;
}

int main(int argc, char **argv){
    int m = 10000,
        n = 10,
        k = 100;
    double *A = (double*) malloc(m * k * sizeof(double)),
           *B = (double*) malloc(n * k * sizeof(double)),
           *C = (double*) malloc(m * n * sizeof(double));
    for(int i=0; i<m*k; i++){
        A[i] = 0.1;
    }
    for(int i=0; i<n*k; i++){
        B[i] = 1.2;
    }
    double v = 0.0;
    for(int i=0; i<1000; i++){
        // BLAS call
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    m, n, k, 1.0, A, k, B, k, 0.0, C, n); // C <- A @ B.T
        // Followed by parallel loop
        v += f2(C, m * n);
    }
    // Print the result so the reduction is not optimized away.
    printf("v = %f\n", v);
    free(A), free(B), free(C);
    return 0;
}
```
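For reference, assuming OpenBLAS and its cblas.h are installed in standard locations, a build command along the lines of `gcc -O2 -fopenmp repro.c -lopenblas -o repro` should work, linked against whichever OpenBLAS build (pthreads or OpenMP) is under test.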
Here are the timings on a 40-core machine (20 physical + HT): …

Focusing on the pthreads + parallel loop case: …
@jeremiedbb, can you also try with …
@isuruf it reduces the time from 21s to 2.6s. It's better, but still much slower than expected.
That just forces each new unnecessary OpenBLAS pthread swarm onto a single core, slightly better at data locality than completely unbound, but still bad.
OpenBLAS pthread behaviour should not be affected by what OpenMP does, since the OpenBLAS calls are made from the main thread and unrelated work happens in the OpenMP threads.
The OMP placement policy actually sets CPU affinity for OMP threads, so all pthreads started afterwards cannot escape it. There is no hidden side-step API that nobody else uses.
@jeremiedbb maybe there is a way to introspect the CPU affinity of new pthreads started before and after the OMP loop: https://man7.org/linux/man-pages/man3/pthread_getaffinity_np.3.html
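A minimal sketch of that kind of check (an illustration assuming glibc's `pthread_getaffinity_np` and a GOMP-based toolchain, not code from this thread; it should build with something like `gcc -fopenmp affinity.c`):

```c
// Print each thread's CPU affinity mask before, inside, and after an
// OpenMP parallel region. Output lines from different threads may interleave.
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <pthread.h>
#include <omp.h>

static void print_affinity(const char *label)
{
    cpu_set_t set;
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return;
    printf("%s (omp thread %d): ", label, omp_get_thread_num());
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            printf("%d ", cpu);
    printf("\n");
}

int main(void)
{
    print_affinity("before OMP region");
    #pragma omp parallel
    {
        print_affinity("inside OMP region");
    }
    print_affinity("after OMP region");  // mask any later pthread would inherit
    return 0;
}
```

If the mask printed after the region is narrower than before (for instance with OMP_PROC_BIND set), OpenBLAS pthreads spawned later would be confined to it.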
Here's the output of …

Sorry, but I've no idea how to interpret this :/
Put a one-minute sleep between OMP loops and check thread binding to CPU cores. It is actually documented in the GOMP manual pages on any Linux system.
I introspected the affinities for both the OpenMP and OpenBLAS threadpools, and it turns out that no affinity constraint is set (OpenBLAS is built with NO_AFFINITY). Here's the output of …
So affinity does not seem to be the reason for the bad interaction between openblas-pthreads and OpenMP.

However, I found that when the OpenMP loop ends, the threads keep waiting for work in an active way (OMP_WAIT_POLICY), which consumes resources and prevents OpenBLAS from starting its computations right away. By default, OpenMP makes waiting threads spin for a while. Unfortunately, setting OMP_WAIT_POLICY=passive does not really improve performance on a machine with many cores, for some reason that I don't understand yet.

The best solution I've found so far, besides building OpenBLAS with OpenMP of course, is to set the number of threads for both threadpools to half the number of cores. I guess this is a won't-fix from the OpenBLAS side: OpenMP programs do not interact well with other libraries managing their own threadpool. Feel free to close the issue if you think there's nothing more to add. Still, I wonder if there is the same kind of wait policy in OpenBLAS.
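For illustration (my sketch, not code from the thread), that half-and-half mitigation can be set programmatically, assuming OpenBLAS's `openblas_set_num_threads` runtime API; `OMP_NUM_THREADS` and `OPENBLAS_NUM_THREADS` are the equivalent environment knobs:

```c
#include <omp.h>

// Runtime API exported by OpenBLAS.
extern void openblas_set_num_threads(int num_threads);

// Give half of the cores to each threadpool so the spinning OpenMP workers
// and the OpenBLAS pthreads stop competing for the same CPUs. The 50/50
// split is the workaround found in this thread, not an official
// recommendation; n_cores is assumed to be the physical core count.
void split_threadpools(int n_cores)
{
    omp_set_num_threads(n_cores / 2);       // OpenMP loop threads
    openblas_set_num_threads(n_cores / 2);  // OpenBLAS pthreads
}
```

Setting OMP_WAIT_POLICY=passive in the environment targets the spinning itself, but as noted above it did not fully recover performance here.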
@isuruf I think this issue is a good reason to always try to use an OpenBLAS built with OpenMP for the scikit-learn builds on conda-forge (I noticed it was not always the case).
@jeremiedbb OpenBLAS does have a similar wait policy for its threads, governed by the value of THREAD_TIMEOUT at build time (or the environment variable OPENBLAS_THREAD_TIMEOUT at runtime), which defines the number of clock cycles to wait as …
Closing after copying the relevant information to the FAQ in the wiki.
Hi,

I have code which mixes BLAS calls (gemm) and OpenMP-based parallel loops (they are not nested). When OpenBLAS is built using OpenMP everything is fine, but when OpenBLAS is built with pthreads there's a huge slowdown. Below is a reproducible example (sorry, it's from Python/Cython): …

On my laptop (2 physical cores), when I use a sequential loop in (*), it runs in 0.26s. When I use a parallel loop, it runs in 2.6s (10x slower). This is with OpenBLAS 0.3.12 built with pthreads. This conda env reproduces it:

```
conda create -n tmp -c conda-forge python numpy cython ipython
```

However, if I use OpenBLAS built with OpenMP, it runs in 0.26s with and without prange. This is with OpenBLAS 0.3.9 built with OpenMP. This conda env reproduces it:

```
conda create -n tmp -c conda-forge python numpy cython ipython blas[build=openblas] libopenblas=0.3.9
```