Build OpenBLAS with NUM_THREADS=128 #64
Conversation
Indeed, I have access to such a machine:
it's a problem... Maybe we should even shoot for 512 threads directly. |
512 threads?
Are there really cases where 512 threads of OpenBLAS are faster than 64? Don't people hit memory bandwidth limits before they saturate 512 cores? I find it hard to believe that people with this kind of computing power are using vanilla NumPy and not recompiling for HPC using specialized compiler options. On the other hand, this change will bite the casual user by causing oversubscription whenever they run more than one NumPy process on their machine. |
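(Aside: one way the casual user can protect against that oversubscription today is to cap the per-process OpenBLAS thread count before NumPy loads the library. A minimal sketch using the standard OPENBLAS_NUM_THREADS environment variable; the value 8 is an arbitrary example.)

```python
# Cap OpenBLAS threads for this process to avoid oversubscription when
# several NumPy processes share the same machine. The environment variable
# must be set before NumPy (and therefore OpenBLAS) is first imported.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "8"  # arbitrary example value

import numpy as np

a = np.random.randn(2048, 2048)
_ = a @ a  # this matmul now uses at most 8 OpenBLAS threads
```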
Maybe not, but unfortunately, if the number of threads set at build time is smaller than the number of CPUs detected at runtime, then we get a crash. |
You are right that OpenBLAS seems to have trouble using that many threads:
In [1]: import numpy as np
In [2]: data = np.random.randn(4096, 4096)
In [3]: %timeit _ = data @ data
326 ms ± 7.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: import threadpoolctl
In [5]: threadpoolctl.threadpool_info()
Out[5]:
[{'filepath': '/scratch/ogrisel/miniforge3/lib/libopenblasp-r0.3.10.so',
'prefix': 'libopenblas',
'user_api': 'blas',
'internal_api': 'openblas',
'version': '0.3.10',
'num_threads': 128,
'threading_layer': 'pthreads'}]
In [6]: threadpoolctl.threadpool_limits(64)
Out[6]: <threadpoolctl.threadpool_limits at 0x7fee99429460>
In [7]: threadpoolctl.threadpool_info()
Out[7]:
[{'filepath': '/scratch/ogrisel/miniforge3/lib/libopenblasp-r0.3.10.so',
'prefix': 'libopenblas',
'user_api': 'blas',
'internal_api': 'openblas',
'version': '0.3.10',
'num_threads': 64,
'threading_layer': 'pthreads'}]
In [8]: %timeit _ = data @ data
198 ms ± 6.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: threadpoolctl.threadpool_limits(32)
Out[9]: <threadpoolctl.threadpool_limits at 0x7fea5ab08d60>
In [10]: %timeit _ = data @ data
188 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [11]: threadpoolctl.threadpool_limits(128)
Out[11]: <threadpoolctl.threadpool_limits at 0x7fea51d8a580>
In [12]: %timeit _ = data @ data
328 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here I used OpenBLAS from conda-forge, which was built with NUM_THREADS=128 (as reported by threadpool_info above). |
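(Aside: the manual timings above can be reproduced in a loop with threadpool_limits used as a context manager. A minimal sketch, assuming threadpoolctl is installed; the thread counts and matrix size are illustrative, and requesting more threads than the build-time NUM_THREADS simply has no effect.)

```python
# Time a square matmul under several OpenBLAS thread limits.
from time import perf_counter

import numpy as np
from threadpoolctl import threadpool_limits

data = np.random.randn(4096, 4096)

for n_threads in (128, 64, 32, 16):
    with threadpool_limits(limits=n_threads, user_api="blas"):
        tic = perf_counter()
        _ = data @ data
        print(f"{n_threads:>3} BLAS threads: {perf_counter() - tic:.3f} s")
```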
@ogrisel - is that right - that OpenBLAS will by default always crash with a segfault when the number of CPUs is greater than the compiled number of threads? That seems like a bug - surely? I mean - shouldn't OpenBLAS just use the number of compiled threads in that situation, rather than segfaulting? |
That sounds like a better behavior indeed. Feel free to comment on the linked issue on the openblas repo. |
@ogrisel - thanks - and sorry for being lazy - but is that true - that OpenBLAS will always ask for as many threads as there are CPUs, and segfaults if it doesn't get them? And would limiting the number of threads to the maximum compiled number be a reasonable fix, do you think? |
I am confused: I tried again to run a sample OpenBLAS program on the 256-thread machine and OpenBLAS (from conda-forge) automatically limits itself to 128 threads without the crash. I am sure I observed this crash previously in an interactive test session but I don't remember what I did. Let me experiment a bit more to try to find a simple reproducer. |
Ok I re-read the original report and discussion and here is the minimal reproducer (using NumPy from PyPI with OpenBLAS built with NUM_THREADS=64):
In [1]: import numpy as np
In [2]: a = np.random.randn(1024, 1024)
In [3]: import threadpoolctl
In [4]: threadpoolctl.threadpool_info()
Out[4]:
[{'filepath': '/scratch/ogrisel/miniforge3/envs/pypi/lib/python3.9/site-packages/numpy.libs/libopenblasp-r0-2d23e62b.3.17.so',
'prefix': 'libopenblas',
'user_api': 'blas',
'internal_api': 'openblas',
'version': '0.3.17',
'num_threads': 64,
'threading_layer': 'pthreads',
'architecture': 'Zen'}]
In [5]: import os
In [6]: os.cpu_count()
Out[6]: 256
In [7]: _ = a @ a  # this runs fine with 64 threads
To trigger the crash, one needs to create more than 64 threads (here using a ThreadPoolExecutor):
In [11]: from concurrent.futures import ThreadPoolExecutor
In [12]: tpe = ThreadPoolExecutor(max_workers=256)
In [14]: with threadpoolctl.threadpool_limits(limits=1, user_api="blas"):
...: list(tpe.map(lambda _: a @ a, range(256)))
...:
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
Segmentation fault (core dumped)
But increasing NUM_THREADS too much when building OpenBLAS is not really a good solution, because it can often degrade performance on machines with many cores for typical numpy/scipy workloads. We could have a workaround / fix for the cases we observe in scikit-learn, where we introspect OpenBLAS prior to choosing the number of externally managed threads to use, to avoid setting it to a value that triggers this buffer exhaustion. I am not sure what is best. Ideally OpenBLAS should be able to decouple the number of allocated memory regions from the default number of threads of its own threadpool, but apparently this is not easy to do. |
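(Aside: a minimal sketch of the introspection idea mentioned above, assuming threadpoolctl is installed. The helper name max_outer_threads is hypothetical, and it treats the num_threads value reported by threadpool_info as a proxy for the build-time limit, which is what an unconstrained process reports, as seen earlier in this thread.)

```python
# Pick a number of externally managed (outer) threads that stays within the
# OpenBLAS thread/buffer limit reported by threadpoolctl, so that the
# "too many memory regions" error is not triggered.
import os

import threadpoolctl


def max_outer_threads(fallback=1):
    """Hypothetical helper: min(CPU count, OpenBLAS num_threads limit)."""
    limit = os.cpu_count() or fallback
    for info in threadpoolctl.threadpool_info():
        if info.get("internal_api") == "openblas":
            limit = min(limit, info["num_threads"])
    return max(limit, fallback)


print("number of externally managed threads to use:", max_outer_threads())
```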
FYI, I reopened the original issue in scikit-learn with a possible mitigation strategy: scikit-learn/scikit-learn#20539 (comment). But it sounds complex to implement and maintain. Ideally this problem should be solved in OpenBLAS directly... I am not sure what to do. |
Can we start some discussion with the OpenBLAS developers about how to make it possible to do this, maybe with some consulting hours? |
There is some discussion of this at OpenMathLib/OpenBLAS#3321. It apparently will not be a simple fix. |
A workaround has been implemented in OpenMathLib/OpenBLAS#3352 and released as part of OpenBLAS 0.3.18. I have not yet had the opportunity to check whether it fixes the upstream issue scikit-learn/scikit-learn#20539, but I think we can close this PR since increasing the max num_threads of OpenBLAS is generally a bad idea from a performance point of view. |
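(Aside: to check whether the OpenBLAS actually loaded by NumPy already contains that 0.3.18 workaround, the version string reported by threadpoolctl can be inspected. A minimal sketch; the "X.Y.Z" parsing is an assumption and may need adjusting for development builds.)

```python
# Report the OpenBLAS version seen by threadpoolctl and whether it is at
# least 0.3.18, the first release with the buffer-allocation workaround.
import threadpoolctl

for info in threadpoolctl.threadpool_info():
    if info.get("internal_api") != "openblas":
        continue
    version = info.get("version")  # may be None for very old OpenBLAS builds
    print("OpenBLAS version:", version)
    if version is not None:
        parts = tuple(int(p) for p in version.split(".")[:3])
        print("includes the 0.3.18 workaround:", parts >= (0, 3, 18))
```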
Motivations:
- the number of threads can still be reduced if OPENBLAS_NUM_THREADS is set to a lower value at runtime
- a low number of buffers can cause "Program is Terminated. Because you tried to allocate too many memory regions" on otherwise multi-threaded problems

Open question: