Build OpenBLAS with NUM_THREADS=128 #64


Closed
wants to merge 2 commits into from

Conversation

@ogrisel (Contributor) commented Jul 28, 2021

Motivations:

Open questions:

  • The Windows build previously had NUM_THREADS=24, which is really low. I wonder why this was not reported as a problem earlier.
  • Shall we shoot directly to NUM_THREADS=256 to avoid similar bug reports a few years from now?

@ogrisel (Contributor, Author) commented Aug 19, 2021

shall we shoot directly to NUM_THREADS=256 to avoid similar bug reports a few years from now?

I indeed have access to such a machine:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7742 64-Core Processor
Stepping:            0
CPU MHz:             3402.367
BogoMIPS:            4491.25
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-63,128-191
NUMA node1 CPU(s):   64-127,192-255
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

So such machines already exist, and it's a problem... Maybe we should even shoot for 512 threads directly.

@ogrisel (Contributor, Author) left a comment:

512 threads?

@mattip (Collaborator) commented Aug 19, 2021

Are there really cases where 512 threads of OpenBLAS are faster than 64? Don't people hit memory bandwidth limits before they saturate 512 cores? I find it hard to believe that people with this kind of computing power are using vanilla NumPy and not recompiling for HPC using specialized compiler options. On the other hand, this change will bite the casual user by causing oversubscription whenever they run more than one NumPy process on their machine.

@ogrisel (Contributor, Author) commented Aug 19, 2021

Maybe not, but unfortunately, if the number of threads set at build time is smaller than the number of CPUs detected at runtime, then we get "BLAS : Program is Terminated. Because you tried to allocate too many memory regions." errors.
See OpenMathLib/OpenBLAS#3321 for instance.
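
For reference, a quick way to check whether the effective OpenBLAS thread limit of the loaded library is below the number of CPUs the OS exposes is to compare the threadpoolctl report with os.cpu_count(). This is just a minimal sketch using the same introspection as the sessions below; the num_threads it reports is the effective limit, which in the examples here matches the build-time NUM_THREADS cap:

import os

import numpy  # importing NumPy loads its bundled OpenBLAS
import threadpoolctl

# Compare the effective OpenBLAS thread limit with the CPU count seen by the OS.
for info in threadpoolctl.threadpool_info():
    if info["internal_api"] == "openblas":
        limit, n_cpus = info["num_threads"], os.cpu_count()
        if limit < n_cpus:
            print(f"OpenBLAS capped at {limit} threads on a {n_cpus}-CPU machine")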

@ogrisel (Contributor, Author) commented Aug 19, 2021

You are right that OpenBLAS seems to have trouble using that many threads:

In [1]: import numpy as np

In [2]: data = np.random.randn(4096, 4096)

In [3]: %timeit _ = data @ data
326 ms ± 7.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: import threadpoolctl

In [5]: threadpoolctl.threadpool_info()
Out[5]: 
[{'filepath': '/scratch/ogrisel/miniforge3/lib/libopenblasp-r0.3.10.so',
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'internal_api': 'openblas',
  'version': '0.3.10',
  'num_threads': 128,
  'threading_layer': 'pthreads'}]

In [6]: threadpoolctl.threadpool_limits(64)
Out[6]: <threadpoolctl.threadpool_limits at 0x7fee99429460>

In [7]: threadpoolctl.threadpool_info()
Out[7]: 
[{'filepath': '/scratch/ogrisel/miniforge3/lib/libopenblasp-r0.3.10.so',
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'internal_api': 'openblas',
  'version': '0.3.10',
  'num_threads': 64,
  'threading_layer': 'pthreads'}]

In [8]: %timeit _ = data @ data
198 ms ± 6.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: threadpoolctl.threadpool_limits(32)
Out[9]: <threadpoolctl.threadpool_limits at 0x7fea5ab08d60>

In [10]: %timeit _ = data @ data
188 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: threadpoolctl.threadpool_limits(128)
Out[11]: <threadpoolctl.threadpool_limits at 0x7fea51d8a580>

In [12]: %timeit _ = data @ data
328 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here I used OpenBLAS from conda-forge which was built with NUM_THREADS=128. The machine has 128 physical cores on 2 sockets (see lscpu above).
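
As a side note, for users hitting this on many-core machines without rebuilding anything, the thread count can also be capped before NumPy is imported via OpenBLAS's OPENBLAS_NUM_THREADS environment variable (or with threadpoolctl, as above). A minimal sketch; the value 64 is just an example matching the timings above:

import os

# Must be set before NumPy (and therefore OpenBLAS) is loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "64"

import numpy as np

data = np.random.randn(4096, 4096)
_ = data @ data  # runs with at most 64 OpenBLAS threads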

@matthew-brett (Contributor) commented:

@ogrisel - is that right - that OpenBLAS will by default always crash with a segfault when the number of CPUs is greater than the compiled number of threads? That seems like a bug - surely? I mean - shouldn't OpenBLAS just use the number of compiled threads in that situation, rather than segfaulting?

@ogrisel (Contributor, Author) commented Aug 19, 2021

I mean - shouldn't OpenBLAS just use the number of compiled threads in that situation, rather than segfaulting?

That sounds like a better behavior indeed. Feel free to comment on the linked issue on the openblas repo.

@matthew-brett (Contributor) commented:

@ogrisel - thanks - and sorry for being lazy - but is that true - that OpenBLAS will always ask for as many threads as there are CPUs, and segfaults if it doesn't get them? And would limiting the number of threads to the maximum compiled number be a reasonable fix, do you think?

@ogrisel (Contributor, Author) commented Aug 20, 2021

I am confused: I tried again to run a sample OpenBLAS program on the 256-thread machine, and OpenBLAS (from conda-forge) automatically limits itself to 128 threads without crashing.

I am sure I observed this crash previously in an interactive test session, but I don't remember what I did. Let me experiment a bit more to try to find a simple reproducer.

@ogrisel (Contributor, Author) commented Aug 20, 2021

OK, I re-read the original report and discussion, and here is a minimal reproducer (using NumPy from PyPI, whose bundled OpenBLAS is built with NUM_THREADS=64):

In [1]: import numpy as np

In [2]: a = np.random.randn(1024, 1024)

In [3]: import threadpoolctl

In [4]: threadpoolctl.threadpool_info()
Out[4]: 
[{'filepath': '/scratch/ogrisel/miniforge3/envs/pypi/lib/python3.9/site-packages/numpy.libs/libopenblasp-r0-2d23e62b.3.17.so',
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'internal_api': 'openblas',
  'version': '0.3.17',
  'num_threads': 64,
  'threading_layer': 'pthreads',
  'architecture': 'Zen'}]

In [5]: import os

In [6]: os.cpu_count()
Out[6]: 256

In [7]: _ = a @ a  # this runs fine with 64 threads

To trigger the crash, one needs to create more than 64 threads (here using ThreadPoolExecutor, but it could be with OpenMP via Cython's prange) and call OpenBLAS with 1 thread in each of those externally managed threads:

In [11]: from concurrent.futures import ThreadPoolExecutor

In [12]: tpe = ThreadPoolExecutor(max_workers=256)

In [14]: with threadpoolctl.threadpool_limits(limits=1, user_api="blas"):
    ...:     list(tpe.map(lambda _: a @ a, range(256)))
    ...: 
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
Segmentation fault (core dumped)

But increasing NUM_THREADS too much when building OpenBLAS is not really a good solution, because it can often degrade the performance on machines with many cores for typical numpy/scipy workloads.

We could implement a workaround / fix for the cases we observe in scikit-learn by introspecting OpenBLAS before choosing the number of externally managed threads, so as to avoid setting it to os.cpu_count(). But this is hackish and would probably be brittle.
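
A rough sketch of what that introspection could look like (hypothetical helper, not actual scikit-learn code), assuming threadpoolctl is available to read the effective OpenBLAS limit before sizing an externally managed thread pool:

import os
from concurrent.futures import ThreadPoolExecutor

import threadpoolctl


def max_external_threads():
    # Hypothetical helper: never create more external threads than the
    # effective OpenBLAS thread limit, so that per-thread BLAS calls do
    # not exhaust OpenBLAS's preallocated memory regions.
    n = os.cpu_count()
    for info in threadpoolctl.threadpool_info():
        if info.get("internal_api") == "openblas":
            n = min(n, info["num_threads"])
    return n


tpe = ThreadPoolExecutor(max_workers=max_external_threads())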

I am not sure what is best. Ideally, OpenBLAS should be able to decouple the number of allocated memory regions from the default number of threads of its own threadpool, but apparently this is not easy to do.

@ogrisel (Contributor, Author) commented Aug 20, 2021

FYI, I reopened the original issue in scikit-learn with a possible mitigation strategy: scikit-learn/scikit-learn#20539 (comment).

But it sounds complex to implement and maintain. Ideally this problem should be solved in OpenBLAS directly... I am not sure what to do.

@matthew-brett (Contributor) commented:

Can we start some discussion with the OpenBLAS developers about how to make it possible to do this, maybe with some consulting hours?

@mattip (Collaborator) commented Aug 20, 2021

There is some discussion of this at OpenMathLib/OpenBLAS#3321. It apparently will not be a simple fix.

@ogrisel (Contributor, Author) commented Nov 17, 2021

A workaround has been implemented in OpenMathLib/OpenBLAS#3352 and released as part of OpenBLAS 0.3.18. I have not yet had the opportunity to check whether it fixes the upstream issue scikit-learn/scikit-learn#20539, but I think we can close this PR, as increasing the max num_threads of OpenBLAS is generally a bad idea from a performance point of view.
