multithreaded eigen decomposition slower than single-threaded version, with NumPy #873

Closed
OliverQin opened this issue May 13, 2016 · 10 comments

Comments

@OliverQin

I am using NumPy 1.11.0, linked against libopenblas_haswellp-r0.2.18 built by myself.
My CPU is an Intel(R) Core(TM) i7-4710MQ @ 2.50GHz.
I found that OpenBLAS gets slower when using multiple threads for eigendecomposition. Here is my Python code:

import numpy as np
import time

n = 1000

A = np.random.randn(n,n)

t = time.time()
w, v = np.linalg.eig(A)
td = time.time() - t

print("(%d, %d) randn EigenDecomp, secs: %0.3f" % (n, n, td))

Result:

$> OPENBLAS_NUM_THREADS=1 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 3.616
$> OPENBLAS_NUM_THREADS=2 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 3.573
$> OPENBLAS_NUM_THREADS=4 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 3.904
$> OPENBLAS_NUM_THREADS=5 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 5.801
$> OPENBLAS_NUM_THREADS=8 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 9.269

More threads bring no benefit, and performance drops significantly once the number of threads exceeds 4.
I'm sorry that I don't know exactly which routine NumPy calls, but I believe this is the right place to report the issue. How can this be solved?
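In case it helps, np.show_config() at least shows which BLAS/LAPACK build NumPy actually picked up:

import numpy as np
np.show_config()   # prints the BLAS/LAPACK libraries NumPy was built against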

@brada4
Contributor

brada4 commented May 13, 2016

Your hyperthreads are just a facelift; there is no real performance in them.
On top of that there is a thread pool sizing issue. Here is n=5000 on a 12-core CPU with hyperthreading off:

$ for a in 12 6 4 3 2 1 ; do OPENBLAS_NUM_THREADS=$a time python 873.py; done
(5000, 5000) randn EigenDecomp, secs: 102.124
586.74user 591.35system 1:43.76elapsed 1135%CPU (0avgtext+0avgdata 1401664maxresident)k
0inputs+0outputs (0major+6132581minor)pagefaults 0swaps
(5000, 5000) randn EigenDecomp, secs: 88.160
326.73user 186.28system 1:29.64elapsed 572%CPU (0avgtext+0avgdata 1399048maxresident)k
0inputs+0outputs (0major+4812354minor)pagefaults 0swaps
(5000, 5000) randn EigenDecomp, secs: 90.964
242.11user 113.69system 1:32.42elapsed 384%CPU (0avgtext+0avgdata 1396948maxresident)k
0inputs+0outputs (0major+3292859minor)pagefaults 0swaps
(5000, 5000) randn EigenDecomp, secs: 102.464
227.28user 74.45system 1:44.08elapsed 289%CPU (0avgtext+0avgdata 1395868maxresident)k
0inputs+8outputs (0major+1123119minor)pagefaults 0swaps
(5000, 5000) randn EigenDecomp, secs: 130.448
208.62user 50.39system 2:11.93elapsed 196%CPU (0avgtext+0avgdata 1394672maxresident)k
0inputs+8outputs (0major+5804352minor)pagefaults 0swaps
(5000, 5000) randn EigenDecomp, secs: 228.030
227.98user 1.52system 3:49.67elapsed 99%CPU (0avgtext+0avgdata 1391808maxresident)k
0inputs+8outputs (0major+944643minor)pagefaults 0swaps

@jeromerobert
Contributor

I can reproduce this issue. np.linalg.eig mostly calls gemm and gemv (with sizes ranging from 5 up to n). Using perf I can see that during the gemm steps the kernel scheduling overhead is very high, so I suspect the gemm threading threshold is still not right.
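A rough way to cross-check that: time square gemm calls at the small sizes eig reaches internally, rerunning the script with different OPENBLAS_NUM_THREADS values. If the small sizes slow down as threads are added, the threading threshold is too low.

import time
import numpy as np

for n in (8, 32, 128, 512):
    a = np.random.randn(n, n)
    b = np.random.randn(n, n)
    reps = max(10, 2000000 // (n * n))
    t = time.time()
    for _ in range(reps):
        a.dot(b)               # dispatches to BLAS dgemm for float64 matrices
    td = time.time() - t
    print("gemm n=%3d: %10.2f us/call" % (n, td / reps * 1e6))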

@jeromerobert
Contributor

The main problem is swap. When I fixed #731 I already wondered whether removing multi-threading there would be the better option. I don't see the point of multi-threading swap: I cannot think of a case where the memory bus would not be the bottleneck.

Is anyone against removing multi-threading in swap?

A threading threshold is also missing in trmv, but its effect is much smaller.

@brada4
Contributor

brada4 commented May 13, 2016

Back then my hypothesis was that threaded swap is optimal while the data is roughly one cache's worth per thread; if the data is (how much?) smaller, it just bounces between caches at a lower transfer speed, and if it is larger than all caches combined, two CPUs can already saturate the remaining memory channel.
I don't think it will be easy to keep the result warm in the exact L2 cache where the next call will use it...

@brada4
Contributor

brada4 commented May 15, 2016

The missing trmv threshold is definitely a cause of a problem exactly like #731:
signalling all the other threads just to chew a few bytes is plain waste.
Better to limit the entirely memory-bound swap to 2 threads (trmv looks just as memory-bound).
The axpy threshold is also quite low, with all the same considerations, but it does not hit us in this case.
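A rough sketch of the same effect, assuming SciPy is linked against the same OpenBLAS (scipy.linalg.blas.daxpy is SciPy's thin wrapper around BLAS daxpy): time a small level-1 call once with OPENBLAS_NUM_THREADS=1 and once with all threads. If the per-call time goes up, the cost is the thread signalling rather than the arithmetic.

import time
import numpy as np
from scipy.linalg.blas import daxpy

n = 1000                           # a vector length in the range eig uses here
x = np.random.randn(n)
y = np.random.randn(n)

reps = 100000
t = time.time()
for _ in range(reps):
    y = daxpy(x, y, a=0.5)         # y <- 0.5*x + y
print("daxpy n=%d: %.2f us/call" % (n, (time.time() - t) / reps * 1e6))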

@jeromerobert
Contributor

Benchmarks without threading in swap:

+ OPENBLAS_NUM_THREADS=1 python ./testnumpy.py
(1200, 1200) randn EigenDecomp, secs: 3.482
+ OPENBLAS_NUM_THREADS=2 python ./testnumpy.py
(1200, 1200) randn EigenDecomp, secs: 2.942
+ OPENBLAS_NUM_THREADS=4 python ./testnumpy.py
(1200, 1200) randn EigenDecomp, secs: 2.603
+ OPENBLAS_NUM_THREADS=5 python ./testnumpy.py
(1200, 1200) randn EigenDecomp, secs: 2.557
+ OPENBLAS_NUM_THREADS=8 python ./testnumpy.py
(1200, 1200) randn EigenDecomp, secs: 2.465

+ OPENBLAS_NUM_THREADS=1 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 2.273
+ OPENBLAS_NUM_THREADS=2 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 1.964
+ OPENBLAS_NUM_THREADS=4 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 1.885
+ OPENBLAS_NUM_THREADS=5 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 1.651
+ OPENBLAS_NUM_THREADS=8 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 1.631

A swap benchmark showing no benefit from threading, whatever the cache size and the buffer size:
https://gist.github.com/jeromerobert/af2a5b6ceec1a4858ef3d191bcb440a6

jeromerobert added a commit to jeromerobert/OpenBLAS that referenced this issue May 16, 2016
@OliverQin
Author

I'll try to build this patch myself and test it. Thanks!

@OliverQin
Author

Better than before. As the matrix gets larger, more threads bring more benefit, but for my 1000x1000 case there is no significant improvement on my machine. At least it no longer gets slower. That is probably because I only modified swap.c, without merging the other recent changes. Thank you again!

@jeromerobert
Copy link
Contributor

Yes, you won't get a good speedup. Internally this algorithm works on sub-blocks that are too small to be multi-threaded, and part of the algorithm is iterative, which does not help either.

You'll get a better speedup on large matrices because the number of multi-threaded operations will be higher. But since some fraction of the work (I have no idea of the ratio, but that fraction is the key) remains single-threaded because it is too small, it will never be fully efficient.
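A back-of-the-envelope version of that argument, with made-up parallel fractions just to show the shape: if only a fraction p of the decomposition runs in multi-threaded BLAS calls, Amdahl's law caps the overall speedup at 1 / ((1 - p) + p / N).

def amdahl(p, n_threads):
    # Maximum speedup when a fraction p of the work parallelises perfectly
    # and the remaining (1 - p) stays single-threaded.
    return 1.0 / ((1.0 - p) + p / n_threads)

for p in (0.3, 0.6, 0.9):          # assumed parallel fractions, for illustration only
    print("p=%.1f -> at most %.2fx on 8 threads" % (p, amdahl(p, 8)))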
