Build OpenBLAS with NUM_THREADS=128 #64


Closed
wants to merge 2 commits into from

Conversation

@ogrisel (Contributor) commented Jul 28, 2021

Motivations:

Open questions:

  • The Windows build previously had NUM_THREADS=24, which is really low. I wonder why this was not reported as a problem earlier.
  • Shall we shoot directly to NUM_THREADS=256 to avoid similar bug reports a few years from now?

@ogrisel (Contributor, Author) commented Aug 19, 2021

shall we shoot directly to NUM_THREADS=256 to avoid similar bug reports a few years from now?

I indeed have access to such a machine:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7742 64-Core Processor
Stepping:            0
CPU MHz:             3402.367
BogoMIPS:            4491.25
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-63,128-191
NUMA node1 CPU(s):   64-127,192-255
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

So such machines already exist, and it's a problem... Maybe we should even shoot for 512 threads directly.

@ogrisel (Contributor, Author) left a comment:

512 threads?

@mattip (Collaborator) commented Aug 19, 2021

Are there really cases where 512 threads of OpenBLAS are faster than 64? Don't people hit memory bandwidth limits before they saturate 512 cores? I find it hard to believe that people with this kind of computing power are using vanilla NumPy and not recompiling for HPC using specialized compiler options. On the other hand, this change will bite the casual user by causing oversubscription whenever they run more than one NumPy process on their machine.

@ogrisel (Contributor, Author) commented Aug 19, 2021

Maybe not, but unfortunately, if the number of threads set at build time is smaller than the number of CPUs detected at runtime, then we get "BLAS : Program is Terminated. Because you tried to allocate too many memory regions." errors.
See OpenMathLib/OpenBLAS#3321 for instance.
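
For reference, a quick way to check whether the effective OpenBLAS thread limit of the loaded library is below the number of CPUs the OS exposes is to compare the threadpoolctl report with os.cpu_count(). This is just a minimal sketch using the same introspection as the sessions below; the num_threads it reports is the effective limit, which in the examples here matches the build-time NUM_THREADS cap:

import os

import numpy  # importing NumPy loads its bundled OpenBLAS
import threadpoolctl

# Compare the effective OpenBLAS thread limit with the CPU count seen by the OS.
for info in threadpoolctl.threadpool_info():
    if info["internal_api"] == "openblas":
        limit, n_cpus = info["num_threads"], os.cpu_count()
        if limit < n_cpus:
            print(f"OpenBLAS capped at {limit} threads on a {n_cpus}-CPU machine")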

@ogrisel (Contributor, Author) commented Aug 19, 2021

You are right that OpenBLAS seems to have trouble using that many threads:

In [1]: import numpy as np

In [2]: data = np.random.randn(4096, 4096)

In [3]: %timeit _ = data @ data
326 ms ± 7.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: import threadpoolctl

In [5]: threadpoolctl.threadpool_info()
Out[5]: 
[{'filepath': '/scratch/ogrisel/miniforge3/lib/libopenblasp-r0.3.10.so',
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'internal_api': 'openblas',
  'version': '0.3.10',
  'num_threads': 128,
  'threading_layer': 'pthreads'}]

In [6]: threadpoolctl.threadpool_limits(64)
Out[6]: <threadpoolctl.threadpool_limits at 0x7fee99429460>

In [7]: threadpoolctl.threadpool_info()
Out[7]: 
[{'filepath': '/scratch/ogrisel/miniforge3/lib/libopenblasp-r0.3.10.so',
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'internal_api': 'openblas',
  'version': '0.3.10',
  'num_threads': 64,
  'threading_layer': 'pthreads'}]

In [8]: %timeit _ = data @ data
198 ms ± 6.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: threadpoolctl.threadpool_limits(32)
Out[9]: <threadpoolctl.threadpool_limits at 0x7fea5ab08d60>

In [10]: %timeit _ = data @ data
188 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: threadpoolctl.threadpool_limits(128)
Out[11]: <threadpoolctl.threadpool_limits at 0x7fea51d8a580>

In [12]: %timeit _ = data @ data
328 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here I used OpenBLAS from conda-forge which was built with NUM_THREADS=128. The machine has 128 physical cores on 2 sockets (see lscpu above).
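
As a side note, for users hitting this on many-core machines without rebuilding anything, the thread count can also be capped before NumPy is imported via OpenBLAS's OPENBLAS_NUM_THREADS environment variable (or with threadpoolctl, as above). A minimal sketch; the value 64 is just an example matching the timings above:

import os

# Must be set before NumPy (and therefore OpenBLAS) is loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "64"

import numpy as np

data = np.random.randn(4096, 4096)
_ = data @ data  # runs with at most 64 OpenBLAS threads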

@matthew-brett (Contributor) commented:

@ogrisel - is that right - that OpenBLAS will by default always crash with a segfault when the number of CPUs is greater than the compiled number of threads? That seems like a bug - surely? I mean - shouldn't OpenBLAS just use the number of compiled threads in that situation, rather than segfaulting?

@ogrisel (Contributor, Author) commented Aug 19, 2021

I mean - shouldn't OpenBLAS just use the number of compiled threads in that situation, rather than segfaulting?

That sounds like a better behavior indeed. Feel free to comment on the linked issue on the openblas repo.

@matthew-brett (Contributor) commented:

@ogrisel - thanks - and sorry for being lazy - but is that true - that OpenBLAS will always ask for as many threads as there are CPUs, and segfaults if it doesn't get them? And would limiting the number of threads to the maximum compiled number be a reasonable fix, do you think?

@ogrisel (Contributor, Author) commented Aug 20, 2021

I am confused: I tried again to run a sample OpenBLAS program on the 256-thread machine, and OpenBLAS (from conda-forge) automatically limits itself to 128 threads without crashing.

I am sure I observed this crash previously in an interactive test session, but I don't remember what I did. Let me experiment a bit more to try to find a simple reproducer.

@ogrisel (Contributor, Author) commented Aug 20, 2021

OK, I re-read the original report and discussion, and here is a minimal reproducer (using NumPy from PyPI, whose bundled OpenBLAS is built with NUM_THREADS=64):

In [1]: import numpy as np

In [2]: a = np.random.randn(1024, 1024)

In [3]: import threadpoolctl

In [4]: threadpoolctl.threadpool_info()
Out[4]: 
[{'filepath': '/scratch/ogrisel/miniforge3/envs/pypi/lib/python3.9/site-packages/numpy.libs/libopenblasp-r0-2d23e62b.3.17.so',
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'internal_api': 'openblas',
  'version': '0.3.17',
  'num_threads': 64,
  'threading_layer': 'pthreads',
  'architecture': 'Zen'}]

In [5]: import os

In [6]: os.cpu_count()
Out[6]: 256

In [7]: _ = a @ a  # this runs fine with 64 threads

To trigger the crash, one needs to create more than 64 threads (here using ThreadPoolExecutor, but it could be with OpenMP via Cython's prange) and call OpenBLAS with 1 thread in each of those externally managed threads:

In [11]: from concurrent.futures import ThreadPoolExecutor

In [12]: tpe = ThreadPoolExecutor(max_workers=256)

In [14]: with threadpoolctl.threadpool_limits(limits=1, user_api="blas"):
    ...:     list(tpe.map(lambda _: a @ a, range(256)))
    ...: 
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
Segmentation fault (core dumped)

But increasing NUM_THREADS too much when building OpenBLAS is not really a good solution, because it can often degrade the performance on machines with many cores for typical numpy/scipy workloads.

We could implement a workaround / fix for the cases we observe in scikit-learn by introspecting OpenBLAS before choosing the number of externally managed threads, so as to avoid setting it to os.cpu_count(). But this is hackish and would probably be brittle.
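
A rough sketch of what that introspection could look like (hypothetical helper, not actual scikit-learn code), assuming threadpoolctl is available to read the effective OpenBLAS limit before sizing an externally managed thread pool:

import os
from concurrent.futures import ThreadPoolExecutor

import threadpoolctl


def max_external_threads():
    # Hypothetical helper: never create more external threads than the
    # effective OpenBLAS thread limit, so that per-thread BLAS calls do
    # not exhaust OpenBLAS's preallocated memory regions.
    n = os.cpu_count()
    for info in threadpoolctl.threadpool_info():
        if info.get("internal_api") == "openblas":
            n = min(n, info["num_threads"])
    return n


tpe = ThreadPoolExecutor(max_workers=max_external_threads())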

I am not sure what is best. Ideally, OpenBLAS should be able to decouple the number of allocated memory regions from the default number of threads of its own threadpool, but apparently this is not easy to do.

@ogrisel (Contributor, Author) commented Aug 20, 2021

FYI, I reopened the original issue in scikit-learn with a possible mitigation strategy: scikit-learn/scikit-learn#20539 (comment).

But it sounds complex to implement and maintain. Ideally this problem should be solved in OpenBLAS directly... I am not sure what to do.

@matthew-brett (Contributor) commented:

Can we start some discussion with the OpenBLAS developers about how to make it possible to do this, maybe with some consulting hours?

@mattip (Collaborator) commented Aug 20, 2021

There is some discussion of this at OpenMathLib/OpenBLAS#3321. It apparently will not be a simple fix.

@ogrisel (Contributor, Author) commented Nov 17, 2021

A workaround has been implemented in OpenMathLib/OpenBLAS#3352 and released as part of OpenBLAS 0.3.18. I have not yet had the opportunity to check whether it fixes the upstream issue scikit-learn/scikit-learn#20539, but I think we can close this PR, as increasing the max num_threads of OpenBLAS is generally a bad idea from a performance point of view.
