Performance issue with many cores #1881
Comments
Will need to see what BLAS calls np.linalg.svd makes...
Sorry, I said master but it was against the develop branch. I didn't notice that the main branch name was not master :)
It's strange because MKL never uses those :)
It seems that svd calls
Thanks. Strictly speaking this is LAPACK (which for OpenBLAS means the non-parallelized, non-optimized "netlib" reference implementation); the task now is to follow their call tree and see which BLAS functions get involved.
From http://www.netlib.org/lapack/explore-html/d1/d7e/group__double_g_esing_ga84fdf22a62b12ff364621e4713ce02f2.html
Thanks. I guess #678 is addressing the same concern, but it's still at the thinking stage :( About hyperthreading, I'd strongly argue against using the pseudo-cores. On every machine I tested, there was no improvement from using them. Limiting the number of threads to physical cores only was always better or even. And for the same problems, MKL never used the pseudo-cores. Obviously it requires benchmarks, but I think it could be worth it.
Before L1TF one could adjust x86perf to get more out of a single pseudo-core while the other is idle; now they are best left disabled altogether, for both security and performance.
I will quickly recheck the threading.
(0.3.3, pthread build) there is the lock race around malloc taking 0-0.2-2-6% of time at 1-2-4-HT8 threads (which should be gone in the develop branch after 0.3.3), but otherwise no visible spills.
@brada4 Thanks for looking into it! The CPU is an Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz and there are 2 NUMA nodes. Limiting threads to a single NUMA node definitely helps. If I limit to 1 NUMA node, OpenBLAS will use 44 threads on the 22 physical cores of that node.
I think I did try it, because it was merged 1 month+ ago and I cloned the repo 2 days ago for this test.
I edited my previous comment because at first I limited the cores the code could use, but I didn't limit the number of threads for OpenBLAS (I only used taskset). Apparently, OpenBLAS spawns as many threads as the machine has cores, even in a program restricted to run on a specific subset of cores by taskset.
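For what it's worth, the mismatch is easy to see from the Python side; a quick check using generic Linux APIs, nothing OpenBLAS-specific:

```python
import os

# os.cpu_count() reports all configured CPUs regardless of taskset,
# while the scheduler affinity mask reflects the restriction (Linux only).
print("configured CPUs:        ", os.cpu_count())
print("CPUs allowed by taskset:", len(os.sched_getaffinity(0)))
```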
There is another way: set OPENBLAS_NUM_THREADS=22 (or 11 if the CoD configuration is enabled in the BIOS; check with numactl).
numactl is not installed and I don't have root access. However, lscpu is enough here and shows 2 NUMA nodes (with their assigned core IDs). When setting OPENBLAS_NUM_THREADS=44, the OS does not group all threads on the same NUMA node. Neither does setting OPENBLAS_NUM_THREADS=22. However, it makes OpenBLAS not use hyperthreading, and gives a huge improvement in performance. When setting OPENBLAS_NUM_THREADS=44 and manually setting the affinity using taskset to only use 1 NUMA node, performance is basically the same as when only setting OPENBLAS_NUM_THREADS. Setting OPENBLAS_NUM_THREADS=22 with the same affinity produces slightly better performance.
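A sketch of that experiment done from inside Python rather than via taskset (the core IDs 0-21 for the first NUMA node are an assumption; take the real layout from lscpu). The environment variable has to be set before numpy loads OpenBLAS:

```python
import os

# Must be set before the BLAS library is loaded, i.e. before `import numpy`.
os.environ["OPENBLAS_NUM_THREADS"] = "22"

# Pin the whole process to one NUMA node; the core IDs 0-21 are an
# assumption -- check lscpu or /sys/devices/system/node/ for the real layout.
os.sched_setaffinity(0, range(22))

import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(2000, 2000)
u, s, vt = np.linalg.svd(a)   # now runs on 22 threads, all on one node
```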
I think you're right. With bigger inputs, performance is more in line with what I'd expect. In summary:
Meanwhile, a simple solution is setting OPENBLAS_NUM_THREADS according to input size and environment conditions. Thanks for your answers @brada4 @martin-frbg!
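Since OPENBLAS_NUM_THREADS is only read when the library is loaded, picking the thread count per call from Python needs something like threadpoolctl; a rough sketch, with made-up thresholds:

```python
import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

def svd_limited(a, max_threads=4):
    # The threshold and thread counts are illustrative only and would
    # need tuning per machine (ideally per NUMA layout as well).
    limit = 1 if min(a.shape) < 256 else max_threads
    with threadpool_limits(limits=limit, user_api="blas"):
        return np.linalg.svd(a)
```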
I thought this should be solved already ( #1155 ) and init.c should only enumerate CPUs in the current taskset, but I never tried my supposed fix on systems with multiple nodes. (Also the added code has various exit points for older libc or other restrictions on CPU enumeration, maybe I messed up there.)
Syscalls will be wildly reduced in 0.3.4; check #1785. MKL does that only from 2018; older versions take all hyperthreads. Do you have any authoritative documentation that hyperthreads are a waste? It is common sense, and a security problem for certain, but there is no clear statement that they should be left out for performance.
Absolutely not. As I said, it's just the result of my own experiments. It should not be taken for granted, but I think it's enough to at least open the discussion.
There is no statement past "go and try this trick", e.g.
Might be worthwhile to experiment with OMP_PROC_BIND and OMP_PLACES as well.
Actually there are no timings with regard to system time after the lock patch. Strange how
Also, 22 is... unfortunate. #1846, which got merged, is what I did on Skylake to avoid some really bad behavior; one of these days I need to port a set of these fixes to Haswell/Broadwell. Basically, what happens is that without the patch OpenBLAS just divides, so 2048 / 22 = 93, and 93 is a really awful number from a SIMD perspective... the patch in that commit gives much better behavior.
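To make the arithmetic concrete, a small sketch of the naive split versus rounding the width up to a preferred size (the granularity of 16 doubles is an illustrative choice, not necessarily the exact value #1846 uses):

```python
def naive_width(m, nthreads):
    # What the unpatched code effectively does: 2048 // 22 == 93,
    # an awkward width for SIMD kernels.
    return m // nthreads

def rounded_width(m, nthreads, granularity=16):
    # Round the per-thread width up to a multiple of `granularity`
    # elements (16 doubles here; the real preferred size in #1846
    # may differ per architecture).
    width = -(-m // nthreads)                      # ceiling division
    return -(-width // granularity) * granularity

print(naive_width(2048, 22))    # 93 -> remainder handling in every kernel call
print(rounded_width(2048, 22))  # 96 -> clean multiples; the last chunk is smaller
```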
Not only SIMD; I spotted the same with level-1 BLAS, like messing up basic (supposedly page, hopefully cache line) alignment for subsequent threads.
Rereading #1846, it would seem to be sufficient to just copy the
Maybe. One might need to do the equivalent of dcc5d62 to avoid the nasty "we did not think a work chunk could be 0-sized" behavior.
Good question whether blas_quickdivide can legitimately return zero (unless the work size is zero, in which case I wonder why we would get to that code at all).
Before the rounding to preferred sizes it wouldn't. Before, you would get 21 (one of those really nasty sizes for performance, so this case is a LOT faster with the rounding up), but due to the rounding up you can now get 0 work.
Not sure I get this: as the (rounded) width gets deducted from the initial m (or n), shouldn't the divider loop exit after assigning two workloads of 32 each?
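A worked example of how the rounding can now produce an empty chunk, with sizes matching the 21-versus-32 case above (this only mimics the splitting logic; it is not the actual level3 threading code):

```python
def split(m, nthreads, granularity=16):
    # Mimics "compute one width, round it up to a preferred size,
    # then hand it out and deduct it from m for each thread in turn".
    width = m // nthreads                             # 64 // 3 == 21, the "nasty" size
    width = -(-width // granularity) * granularity    # rounded up to 32
    chunks = []
    remaining = m
    for _ in range(nthreads):
        w = min(width, remaining)
        chunks.append(w)
        remaining -= w
    return chunks

print(split(64, 3))   # [32, 32, 0] -- the third thread is handed an empty chunk
```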
I will do it tomorrow, regarding my alignment theory.
@jeremiedbb could you try in common.h
(strace -o strace.out benchmark/dgemm.goto 1024 1024 1024) Modern thread libraries de-schedule idle threads implicitly; there is no need to force it. That's the syscall that was done 20000 times in a 5 s DGEMM run. @martin-frbg I think some small sleep should go here, as it is effectively polling the thread status... That yielding syscall consumes CPU cycles non-productively, when a trivial sleep would just set a timer in the future.
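The cost difference is easy to illustrate from userspace; a rough timing sketch of the two primitives (it does not reproduce OpenBLAS's actual spin loop):

```python
import os
import time

N = 20000  # roughly the number of sched_yield calls seen in the 5 s DGEMM trace

t0 = time.perf_counter()
for _ in range(N):
    os.sched_yield()          # one syscall per iteration, the core stays busy
t1 = time.perf_counter()

t2 = time.perf_counter()
time.sleep(0.0001)            # a single short sleep actually de-schedules the thread
t3 = time.perf_counter()

print(f"{N} x sched_yield: {t1 - t0:.4f} s of busy CPU")
print(f"one 0.1 ms sleep : {t3 - t2:.4f} s, CPU idle in the meantime")
```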
@brada4 remember this was tried before and the results were inconclusive at best.
I think it is around the places where the now-retired sched_compat_yield sysctl used to operate; now we are stuck in the world with the non-compat one.
I just scanned this thread. In my experience on AWS, hyperthreading causes performance issues. I use scripts to turn that off. Btw, I wrote a wrapper for Octave to control the number of threads that would fit nicely under OpenBLAS/benchmark/scripts/OCTAVE/ but I'm not sure how best to submit it.
I've measured, and for sgemm/sgemm it was a small gain, but not a large loss or anything.
I will retract my statement based on (new) data. I have now encountered cases where hyperthreading with OpenBLAS is a severe hindrance. HT is fine if both threads are doing basic math operations: both do productive work, and while the performance increase is limited, they also don't get in the way of each other. HOWEVER, OpenBLAS at times puts a thread in a polling mode doing yield() in a tight loop. This tight loop does a system call (with all the nasty TLB and cache effects in addition), takes many kernel locks as part of this, etc. The result is that this polling thread utterly destroys the math performance of the other thread on the same core... sometimes in the 2x range.
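A crude way to reproduce this effect outside of OpenBLAS, as a sanity check: pin the process to one hyperthread pair and let a sched_yield spinner share the core with a single-threaded matmul. The CPU IDs 0 and 44 are assumed to be siblings on the machine discussed here; verify via /sys/devices/system/cpu/cpu0/topology/thread_siblings_list.

```python
import os

# Keep the BLAS call itself single-threaded so the measurement isolates
# the effect of the spinning sibling (must be set before importing numpy).
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import threading
import time
import numpy as np

# Assumed hyperthread sibling pair on this box -- verify with
# /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
SIBLINGS = {0, 44}

stop = threading.Event()

def yield_spinner():
    # Stand-in for a worker stuck polling sched_yield in a tight loop.
    while not stop.is_set():
        os.sched_yield()

def timed_matmul(n=1500, reps=5):
    a = np.random.rand(n, n)
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ a
    return time.perf_counter() - t0

os.sched_setaffinity(0, SIBLINGS)   # confine the process to one physical core

print("matmul alone        : %.3f s" % timed_matmul())

t = threading.Thread(target=yield_spinner)
t.start()
print("matmul with spinner : %.3f s" % timed_matmul())
stop.set()
t.join()
```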
On x86 (both Intel and AMD) one would use "pause", not "nop" (pause is also a hyperthreading / virtualization / power management hint); I'll test. My initial fix was to count cores, not hyperthreads. There is also a root cause behind this that I am not sure of yet; the symptom is lots of time spent in yield(), but the cause is likely some other bug. I've observed in this specific scenario that a lot of "M = 0" dgemm calls got made by the level 3 syrk code... maybe that causes some threads to finish early and just spin.
The nop (pause) does not help much at all (within measurement noise); the HT fix is still almost 2x.
The next experiment was to make blas_thread_server just not yield, but always use condition variables (but I have a modern glibc; I suspect it might be doing a spinaphore internally).
The visible problem is that sched_yield makes a syscall while nop does nothing. Some delay like 0.0001 s could do the same without grinding CPU cycles and pulling the core out of turbo mode.
That was in the tree very, very briefly... but got reverted since it hurt as well. In the cases that seem to matter the most, just using the condition variable, rather than polling first, seems to work just as well as polling. There's also a somewhat different problem, in that if I use taskset to reduce the number of CPUs, performance goes up (I'll do a sweep tomorrow to see if there's a sweet spot, and then see if we can compute where the sweet spot is based on matrix size or something).
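A simple sweep of that kind can be done with threadpoolctl instead of taskset, so it runs in one process; the matrix sizes and thread counts below are arbitrary choices for the 44-core machine discussed above:

```python
import time
import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

def bench(n, nthreads, reps=3):
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    with threadpool_limits(limits=nthreads, user_api="blas"):
        t0 = time.perf_counter()
        for _ in range(reps):
            a @ b
        return (time.perf_counter() - t0) / reps

for n in (256, 1024, 4096):
    times = {t: bench(n, t) for t in (1, 2, 4, 8, 11, 22, 44)}
    best = min(times, key=times.get)
    print(f"n={n}: best at {best} threads "
          f"({times[best]:.4f} s vs {times[44]:.4f} s with all 44)")
```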
On a 10-core/20-thread system, data below. All numbers are relative cost to the unthreaded (1) case; lower is better.
I think that besides pure performance, dynamic selection of the number of threads has other advantages:
I also think it would be interesting to look at the performance for rectangular matrices. The QZ algorithm contains many (k x k x n) multiplications with k << n, and I've found those to be particularly problematic when the threads aren't limited.
Running the following code on a machine with many cores gives worse performance than when limiting the number of threads.
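(The snippet itself did not survive into this page extract; a minimal reproduction consistent with the discussion above, with an assumed matrix size and repetition count, would look roughly like this, run once with OPENBLAS_NUM_THREADS unset and once with it set to 4.)

```python
import time
import numpy as np

rng = np.random.RandomState(42)
a = rng.rand(2000, 2000)       # the size is an assumption, not the original one

t0 = time.perf_counter()
for _ in range(10):
    np.linalg.svd(a)
print("wall time: %.2f s" % (time.perf_counter() - t0))
```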
I built numpy against OpenBLAS master (I observe the same thing with the OpenBLAS shipped with numpy from conda-forge).
Here are the timings I get for the previous code. The machine has 88 logical cores (44 physical + hyperthreading).
OPENBLAS_NUM_THREADS unset
CPU times: user 2min 53s, sys: 11min 38s, total: 14min 31s
Wall time: 11.9 s
OPENBLAS_NUM_THREADS=4
CPU times: user 9.45 s, sys: 3.26 s, total: 12.7 s
Wall time: 3.29 s
Below is a comparison with MKL.
MKL_NUM_THREADS unset
CPU times: user 57.8 s, sys: 2.46 s, total: 1min
Wall time: 1.59 s
MKL_NUM_THREADS=4
CPU times: user 8.27 s, sys: 192 ms, total: 8.46 s
Wall time: 2.16 s
This brings up another issue: on CPUs with hyperthreading, OpenBLAS will use the maximum number of threads possible, which is twice the number of physical cores. But it should only use as many threads as there are physical cores, because BLAS operations don't really benefit from hyperthreading (hyperthreading is meant to parallelize tasks of a different nature, which is the opposite of BLAS workloads).
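For reference, the physical-core count can be derived on Linux by de-duplicating (package, core) pairs from sysfs; a sketch of that check (standard Linux sysfs paths, not something OpenBLAS itself currently does):

```python
import glob

def physical_core_count():
    # Count distinct (package, core) pairs among the CPUs listed in sysfs.
    cores = set()
    for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology"):
        with open(cpu_dir + "/physical_package_id") as f:
            pkg = f.read().strip()
        with open(cpu_dir + "/core_id") as f:
            core = f.read().strip()
        cores.add((pkg, core))
    return len(cores)

print(physical_core_count())   # 44 on the 88-hyperthread machine above
```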