Performance issue with many cores #1881
Comments
Will need to see what BLAS calls np.linalg.svd makes...
Sorry, I said master but it was against the develop branch. I didn't notice that the main branch name was not master :)
It's strange because MKL never uses those :)
It seems that svd calls
Thanks. Strictly speaking this is LAPACK (which for OpenBLAS means the non-parallelized, non-optimized "netlib" reference implementation); the task now is to follow their call tree and see which BLAS functions get involved.
From http://www.netlib.org/lapack/explore-html/d1/d7e/group__double_g_esing_ga84fdf22a62b12ff364621e4713ce02f2.html
Thanks. I guess #678 is addressing the same concern, but it's still at the thinking stage :( About hyperthreading, I'd strongly argue against using the pseudo-cores. On every machine I tested, there was no improvement from using them. Limiting the number of threads to physical cores only was always better or even. And for the same problems, MKL never used the pseudo-cores. Obviously it requires benchmarks, but I think it could be worth it.
Before L1TF one could adjust x86perf to get more out of a single pseudo-core while the other is idle; now they are best left disabled altogether, for both security and performance.
I will quickly recheck the threading.
(0.3.3, pthread build) there is the lock race around malloc taking 0-0.2-2-6% of time at 1-2-4-HT8 threads (which should be gone in the develop branch after 0.3.3), but otherwise no visible spills.
@brada4 Thanks for looking into it! The CPU is an Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz and there are 2 NUMA nodes. Limiting threads to a single NUMA node definitely helps. If I limit to 1 NUMA node, OpenBLAS will use 44 threads on the 22 physical cores of that node.
I think I did try it, because it was merged 1 month+ ago and I cloned the repo 2 days ago for this test.
I edited my previous comment because at first I limited the cores the code could use, but I didn't limit the number of threads for OpenBLAS (I only used taskset). Apparently, OpenBLAS spawns as many threads as the machine has cores, even in a program restricted to run on a specific subset of cores by taskset.
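For what it's worth, the mismatch is easy to see from the Python side; a quick check using generic Linux APIs, nothing OpenBLAS-specific:

```python
import os

# os.cpu_count() reports all configured CPUs regardless of taskset,
# while the scheduler affinity mask reflects the restriction (Linux only).
print("configured CPUs:        ", os.cpu_count())
print("CPUs allowed by taskset:", len(os.sched_getaffinity(0)))
```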
There is another way: set OPENBLAS_NUM_THREADS=22 (or 11 if the CoD configuration is enabled in the BIOS; check with numactl).
numactl is not installed and I don't have root access. However, lscpu is enough here and shows 2 NUMA nodes (with their assigned core IDs). When setting OPENBLAS_NUM_THREADS=44, the OS does not group all threads on the same NUMA node. Neither does setting OPENBLAS_NUM_THREADS=22. However, it makes OpenBLAS not use hyperthreading, and gives a huge improvement in performance. When setting OPENBLAS_NUM_THREADS=44 and manually setting the affinity using taskset to only use 1 NUMA node, performance is basically the same as when only setting OPENBLAS_NUM_THREADS. Setting OPENBLAS_NUM_THREADS=22 with the same affinity produces slightly better performance.
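A sketch of that experiment done from inside Python rather than via taskset (the core IDs 0-21 for the first NUMA node are an assumption; take the real layout from lscpu). The environment variable has to be set before numpy loads OpenBLAS:

```python
import os

# Must be set before the BLAS library is loaded, i.e. before `import numpy`.
os.environ["OPENBLAS_NUM_THREADS"] = "22"

# Pin the whole process to one NUMA node; the core IDs 0-21 are an
# assumption -- check lscpu or /sys/devices/system/node/ for the real layout.
os.sched_setaffinity(0, range(22))

import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(2000, 2000)
u, s, vt = np.linalg.svd(a)   # now runs on 22 threads, all on one node
```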
I think you're right. With bigger inputs, performance is more in line with what I'd expect. In summary:
Meanwhile, a simple solution is setting OPENBLAS_NUM_THREADS according to input size and environment conditions. Thanks for your answers @brada4 @martin-frbg!
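Since OPENBLAS_NUM_THREADS is only read when the library is loaded, picking the thread count per call from Python needs something like threadpoolctl; a rough sketch, with made-up thresholds:

```python
import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

def svd_limited(a, max_threads=4):
    # The threshold and thread counts are illustrative only and would
    # need tuning per machine (ideally per NUMA layout as well).
    limit = 1 if min(a.shape) < 256 else max_threads
    with threadpool_limits(limits=limit, user_api="blas"):
        return np.linalg.svd(a)
```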
I thought this should be solved already ( #1155 ) and init.c should only enumerate CPUs in the current taskset, but I never tried my supposed fix on systems with multiple nodes. (Also the added code has various exit points for older libc or other restrictions on CPU enumeration, maybe I messed up there.)
Syscalls will be wildly reduced in 0.3.4; check #1785. MKL does that only from 2018; older versions take all hyperthreads. Do you have any authoritative documentation that hyperthreads are a waste? It is common sense, and a security problem for certain, but there is no clear statement that they should be left out for performance.
Absolutely not. As I said, it's just the result of my own experiments. It should not be taken for granted, but I think it's enough to at least open the discussion.
There is no statement past "go and try this trick", e.g.
Might be worthwhile to experiment with OMP_PROC_BIND and OMP_PLACES as well.
Actually there are no timings with regard to system time after the lock patch. Strange how
Also, 22 is... unfortunate. #1846, which got merged, is what I did on Skylake to avoid some really bad behavior; one of these days I need to port a set of these fixes to Haswell/Broadwell. Basically, what happens is that without the patch OpenBLAS just divides, so 2048 / 22 = 93, and 93 is a really awful number from a SIMD perspective... the patch in that commit gives much better behavior.
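To make the arithmetic concrete, a small sketch of the naive split versus rounding the width up to a preferred size (the granularity of 16 doubles is an illustrative choice, not necessarily the exact value #1846 uses):

```python
def naive_width(m, nthreads):
    # What the unpatched code effectively does: 2048 // 22 == 93,
    # an awkward width for SIMD kernels.
    return m // nthreads

def rounded_width(m, nthreads, granularity=16):
    # Round the per-thread width up to a multiple of `granularity`
    # elements (16 doubles here; the real preferred size in #1846
    # may differ per architecture).
    width = -(-m // nthreads)                      # ceiling division
    return -(-width // granularity) * granularity

print(naive_width(2048, 22))    # 93 -> remainder handling in every kernel call
print(rounded_width(2048, 22))  # 96 -> clean multiples; the last chunk is smaller
```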
Not only SIMD; I spotted the same with level-1 BLAS, like messing up basic (supposedly page, hopefully cache line) alignment for subsequent threads.
Rereading #1846, it would seem to be sufficient to just copy the
Maybe. One might need to do the equivalent of dcc5d62 to avoid the nasty "we did not think a work chunk could be 0-sized" behavior.
Good question whether blas_quickdivide can legitimately return zero (unless the work size is zero, in which case I wonder why we would get to that code at all).
Before the rounding to preferred sizes it wouldn't. Before, you would get 21 (one of those really nasty sizes for performance, so this case is a LOT faster with the rounding up), but due to the rounding up you can now get 0 work.
Not sure I get this: as the (rounded) width gets deducted from the initial m (or n), shouldn't the divider loop exit after assigning two workloads of 32 each?
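A worked example of how the rounding can now produce an empty chunk, with sizes matching the 21-versus-32 case above (this only mimics the splitting logic; it is not the actual level3 threading code):

```python
def split(m, nthreads, granularity=16):
    # Mimics "compute one width, round it up to a preferred size,
    # then hand it out and deduct it from m for each thread in turn".
    width = m // nthreads                             # 64 // 3 == 21, the "nasty" size
    width = -(-width // granularity) * granularity    # rounded up to 32
    chunks = []
    remaining = m
    for _ in range(nthreads):
        w = min(width, remaining)
        chunks.append(w)
        remaining -= w
    return chunks

print(split(64, 3))   # [32, 32, 0] -- the third thread is handed an empty chunk
```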
I will do it tomorrow, regarding my alignment theory.
@jeremiedbb could you try in common.h
(strace -o strace.out benchmark/dgemm.goto 1024 1024 1024) Modern thread libraries de-schedule idle threads implicitly; there is no need to force it. That's the syscall that was done 20000 times in a 5 s DGEMM run. @martin-frbg I think some small sleep should go here, as it is effectively polling the thread status... That yielding syscall consumes CPU cycles non-productively, when a trivial sleep would just set a timer in the future.
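The cost difference is easy to illustrate from userspace; a rough timing sketch of the two primitives (it does not reproduce OpenBLAS's actual spin loop):

```python
import os
import time

N = 20000  # roughly the number of sched_yield calls seen in the 5 s DGEMM trace

t0 = time.perf_counter()
for _ in range(N):
    os.sched_yield()          # one syscall per iteration, the core stays busy
t1 = time.perf_counter()

t2 = time.perf_counter()
time.sleep(0.0001)            # a single short sleep actually de-schedules the thread
t3 = time.perf_counter()

print(f"{N} x sched_yield: {t1 - t0:.4f} s of busy CPU")
print(f"one 0.1 ms sleep : {t3 - t2:.4f} s, CPU idle in the meantime")
```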
@brada4 remember this was tried before and the results were inconclusive at best.
I think it is around the places where the now-retired sched_compat_yield sysctl used to operate; now we are stuck in the world with the non-compat one.
I just scanned this thread. In my experience on AWS, hyperthreading causes performance issues. I use scripts to turn that off. Btw, I wrote a wrapper for Octave to control the number of threads that would fit nicely under OpenBLAS/benchmark/scripts/OCTAVE/ but I'm not sure how best to submit it.
I've measured, and for sgemm/sgemm it was a small gain, but not a large loss or anything.
I will retract my statement based on (new) data. I have now encountered cases where hyperthreading with OpenBLAS is a severe hindrance. HT is fine if both threads are doing basic math operations: both do productive work, and while the performance increase is limited, they also don't get in the way of each other. HOWEVER, OpenBLAS at times puts a thread in a polling mode doing yield() in a tight loop. This tight loop does a system call (with all the nasty TLB and cache effects in addition), takes many kernel locks as part of this, etc. The result is that this polling thread utterly destroys the math performance of the other thread on the same core... sometimes in the 2x range.
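A crude way to reproduce this effect outside of OpenBLAS, as a sanity check: pin the process to one hyperthread pair and let a sched_yield spinner share the core with a single-threaded matmul. The CPU IDs 0 and 44 are assumed to be siblings on the machine discussed here; verify via /sys/devices/system/cpu/cpu0/topology/thread_siblings_list.

```python
import os

# Keep the BLAS call itself single-threaded so the measurement isolates
# the effect of the spinning sibling (must be set before importing numpy).
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import threading
import time
import numpy as np

# Assumed hyperthread sibling pair on this box -- verify with
# /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
SIBLINGS = {0, 44}

stop = threading.Event()

def yield_spinner():
    # Stand-in for a worker stuck polling sched_yield in a tight loop.
    while not stop.is_set():
        os.sched_yield()

def timed_matmul(n=1500, reps=5):
    a = np.random.rand(n, n)
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ a
    return time.perf_counter() - t0

os.sched_setaffinity(0, SIBLINGS)   # confine the process to one physical core

print("matmul alone        : %.3f s" % timed_matmul())

t = threading.Thread(target=yield_spinner)
t.start()
print("matmul with spinner : %.3f s" % timed_matmul())
stop.set()
t.join()
```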
On x86 (both Intel and AMD) one would use "pause", not "nop" (pause is also a hyperthreading / virtualization / power management hint); I'll test. My initial fix was to count cores, not hyperthreads. There is also a root cause behind this that I am not sure of yet; the symptom is lots of time spent in yield(), but the cause is likely some other bug. I've observed in this specific scenario that a lot of "M = 0" dgemm calls got made by the level 3 syrk code... maybe that causes some threads to finish early and just spin.
The nop (pause) does not help much at all (within measurement noise); the HT fix is still almost 2x.
The next experiment was to make blas_thread_server just not yield, but always use condition variables (but I have a modern glibc; I suspect it might be doing a spinaphore internally).
The visible problem is that sched_yield makes a syscall while nop does nothing. Some delay like 0.0001 s could do the same without grinding CPU cycles and pulling the core out of turbo mode.
That was in the tree very, very briefly... but got reverted since it hurt as well. In the cases that seem to matter the most, just using the condition variable, rather than polling first, seems to work just as well as polling. There's also a somewhat different problem, in that if I use taskset to reduce the number of CPUs, performance goes up (I'll do a sweep tomorrow to see if there's a sweet spot, and then see if we can compute where the sweet spot is based on matrix size or something).
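A simple sweep of that kind can be done with threadpoolctl instead of taskset, so it runs in one process; the matrix sizes and thread counts below are arbitrary choices for the 44-core machine discussed above:

```python
import time
import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

def bench(n, nthreads, reps=3):
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    with threadpool_limits(limits=nthreads, user_api="blas"):
        t0 = time.perf_counter()
        for _ in range(reps):
            a @ b
        return (time.perf_counter() - t0) / reps

for n in (256, 1024, 4096):
    times = {t: bench(n, t) for t in (1, 2, 4, 8, 11, 22, 44)}
    best = min(times, key=times.get)
    print(f"n={n}: best at {best} threads "
          f"({times[best]:.4f} s vs {times[44]:.4f} s with all 44)")
```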
On a 10-core/20-thread system, data below. All numbers are relative cost to the unthreaded (1) case; lower is better.
I think that besides pure performance, dynamic selection of the number of threads has other advantages:
I also think it would be interesting to look at the performance for rectangular matrices. The QZ algorithm contains many (k x k x n) multiplications with k << n, and I've found those to be particularly problematic when the threads aren't limited.
Running the following code on a machine with many cores gives worse performance than when limiting the number of threads.
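(The snippet itself did not survive into this page extract; a minimal reproduction consistent with the discussion above, with an assumed matrix size and repetition count, would look roughly like this, run once with OPENBLAS_NUM_THREADS unset and once with it set to 4.)

```python
import time
import numpy as np

rng = np.random.RandomState(42)
a = rng.rand(2000, 2000)       # the size is an assumption, not the original one

t0 = time.perf_counter()
for _ in range(10):
    np.linalg.svd(a)
print("wall time: %.2f s" % (time.perf_counter() - t0))
```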
I built numpy against OpenBLAS master (I observe the same thing with the OpenBLAS shipped with numpy from conda-forge).
Here are the timings I get for the previous code. The machine has 88 logical cores (44 physical + hyperthreading).
OPENBLAS_NUM_THREADS unset
CPU times: user 2min 53s, sys: 11min 38s, total: 14min 31s
Wall time: 11.9 s
OPENBLAS_NUM_THREADS=4
CPU times: user 9.45 s, sys: 3.26 s, total: 12.7 s
Wall time: 3.29 s
Below is a comparison with MKL.
MKL_NUM_THREADS unset
CPU times: user 57.8 s, sys: 2.46 s, total: 1min
Wall time: 1.59 s
MKL_NUM_THREADS=4
CPU times: user 8.27 s, sys: 192 ms, total: 8.46 s
Wall time: 2.16 s
This brings up another issue: on CPUs with hyperthreading, OpenBLAS will use the maximum number of threads possible, which is twice the number of physical cores. But it should only use as many threads as there are physical cores, because BLAS operations don't really benefit from hyperthreading (hyperthreading is meant to parallelize tasks of a different nature, which is the opposite of BLAS workloads).
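For reference, the physical-core count can be derived on Linux by de-duplicating (package, core) pairs from sysfs; a sketch of that check (standard Linux sysfs paths, not something OpenBLAS itself currently does):

```python
import glob

def physical_core_count():
    # Count distinct (package, core) pairs among the CPUs listed in sysfs.
    cores = set()
    for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology"):
        with open(cpu_dir + "/physical_package_id") as f:
            pkg = f.read().strip()
        with open(cpu_dir + "/core_id") as f:
            core = f.read().strip()
        cores.add((pkg, core))
    return len(cores)

print(physical_core_count())   # 44 on the 88-hyperthread machine above
```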