multithreaded eigen decomposition slower than single-threaded version, with NumPy #873

Closed
OliverQin opened this issue May 13, 2016 · 10 comments

Comments

@OliverQin

I am using NumPy 1.11.0, linked against libopenblas_haswellp-r0.2.18 built by myself.
My CPU is an Intel(R) Core(TM) i7-4710MQ @ 2.50GHz.
I found that OpenBLAS gets slower when using multiple threads for eigendecomposition. Here is my Python code:

import numpy as np
import time

n = 1000

A = np.random.randn(n,n)

t = time.time()
w, v = np.linalg.eig(A)
td = time.time() - t

print("(%d, %d) randn EigenDecomp, secs: %0.3f" % (n, n, td))

Result:

$> OPENBLAS_NUM_THREADS=1 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 3.616
$> OPENBLAS_NUM_THREADS=2 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 3.573
$> OPENBLAS_NUM_THREADS=4 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 3.904
$> OPENBLAS_NUM_THREADS=5 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 5.801
$> OPENBLAS_NUM_THREADS=8 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 9.269

More threads bring no benefit, and performance drops significantly once the number of threads exceeds 4.
I'm sorry that I don't know exactly which routine NumPy calls, but I believe this is the right place to report the issue. How can this be solved?
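In case it helps, np.show_config() at least shows which BLAS/LAPACK build NumPy actually picked up:

import numpy as np
np.show_config()   # prints the BLAS/LAPACK libraries NumPy was built against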

@brada4
Contributor

brada4 commented May 13, 2016

Your hyperthreads are just a facelift; there is no real performance in them.
On top of that there is a thread pool sizing issue. Here is n=5000 on a 12-core CPU with hyperthreading off:

$ for a in 12 6 4 3 2 1 ; do OPENBLAS_NUM_THREADS=$a time python 873.py; done
(5000, 5000) randn EigenDecomp, secs: 102.124
586.74user 591.35system 1:43.76elapsed 1135%CPU (0avgtext+0avgdata 1401664maxresident)k
0inputs+0outputs (0major+6132581minor)pagefaults 0swaps
(5000, 5000) randn EigenDecomp, secs: 88.160
326.73user 186.28system 1:29.64elapsed 572%CPU (0avgtext+0avgdata 1399048maxresident)k
0inputs+0outputs (0major+4812354minor)pagefaults 0swaps
(5000, 5000) randn EigenDecomp, secs: 90.964
242.11user 113.69system 1:32.42elapsed 384%CPU (0avgtext+0avgdata 1396948maxresident)k
0inputs+0outputs (0major+3292859minor)pagefaults 0swaps
(5000, 5000) randn EigenDecomp, secs: 102.464
227.28user 74.45system 1:44.08elapsed 289%CPU (0avgtext+0avgdata 1395868maxresident)k
0inputs+8outputs (0major+1123119minor)pagefaults 0swaps
(5000, 5000) randn EigenDecomp, secs: 130.448
208.62user 50.39system 2:11.93elapsed 196%CPU (0avgtext+0avgdata 1394672maxresident)k
0inputs+8outputs (0major+5804352minor)pagefaults 0swaps
(5000, 5000) randn EigenDecomp, secs: 228.030
227.98user 1.52system 3:49.67elapsed 99%CPU (0avgtext+0avgdata 1391808maxresident)k
0inputs+8outputs (0major+944643minor)pagefaults 0swaps

@jeromerobert
Contributor

I can reproduce this issue. np.linalg.eig mostly calls gemm and gemv (with sizes ranging from 5 up to n). Using perf I can see that during the gemm steps the kernel scheduling overhead is very high, so I suspect the gemm threading threshold is still not right.
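A rough way to cross-check that: time square gemm calls at the small sizes eig reaches internally, rerunning the script with different OPENBLAS_NUM_THREADS values. If the small sizes slow down as threads are added, the threading threshold is too low.

import time
import numpy as np

for n in (8, 32, 128, 512):
    a = np.random.randn(n, n)
    b = np.random.randn(n, n)
    reps = max(10, 2000000 // (n * n))
    t = time.time()
    for _ in range(reps):
        a.dot(b)               # dispatches to BLAS dgemm for float64 matrices
    td = time.time() - t
    print("gemm n=%3d: %10.2f us/call" % (n, td / reps * 1e6))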

@jeromerobert
Contributor

The main problem is swap. When I fixed #731 I already wondered whether removing multi-threading there would be the better option. I don't see the point of multi-threading swap: I cannot think of a case where the memory bus would not be the bottleneck.

Is anyone against removing multi-threading in swap?

A threading threshold is also missing in trmv, but its effect is much smaller.

@brada4
Contributor

brada4 commented May 13, 2016

Back then my hypothesis was that threaded swap is optimal while the data is roughly one cache's worth per thread; if the data is (how much?) smaller, it just bounces between caches at a lower transfer speed, and if it is larger than all caches combined, two CPUs can already saturate the remaining memory channel.
I don't think it will be easy to keep the result warm in the exact L2 cache where the next call will use it...

@brada4
Contributor

brada4 commented May 15, 2016

The missing trmv threshold is definitely a cause of a problem exactly like #731:
signalling all the other threads just to chew a few bytes is plain waste.
Better to limit the entirely memory-bound swap to 2 threads (trmv looks just as memory-bound).
The axpy threshold is also quite low, with all the same considerations, but it does not hit us in this case.
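A rough sketch of the same effect, assuming SciPy is linked against the same OpenBLAS (scipy.linalg.blas.daxpy is SciPy's thin wrapper around BLAS daxpy): time a small level-1 call once with OPENBLAS_NUM_THREADS=1 and once with all threads. If the per-call time goes up, the cost is the thread signalling rather than the arithmetic.

import time
import numpy as np
from scipy.linalg.blas import daxpy

n = 1000                           # a vector length in the range eig uses here
x = np.random.randn(n)
y = np.random.randn(n)

reps = 100000
t = time.time()
for _ in range(reps):
    y = daxpy(x, y, a=0.5)         # y <- 0.5*x + y
print("daxpy n=%d: %.2f us/call" % (n, (time.time() - t) / reps * 1e6))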

@jeromerobert
Contributor

Benchmarks without threading in swap:

+ OPENBLAS_NUM_THREADS=1 python ./testnumpy.py
(1200, 1200) randn EigenDecomp, secs: 3.482
+ OPENBLAS_NUM_THREADS=2 python ./testnumpy.py
(1200, 1200) randn EigenDecomp, secs: 2.942
+ OPENBLAS_NUM_THREADS=4 python ./testnumpy.py
(1200, 1200) randn EigenDecomp, secs: 2.603
+ OPENBLAS_NUM_THREADS=5 python ./testnumpy.py
(1200, 1200) randn EigenDecomp, secs: 2.557
+ OPENBLAS_NUM_THREADS=8 python ./testnumpy.py
(1200, 1200) randn EigenDecomp, secs: 2.465

+ OPENBLAS_NUM_THREADS=1 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 2.273
+ OPENBLAS_NUM_THREADS=2 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 1.964
+ OPENBLAS_NUM_THREADS=4 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 1.885
+ OPENBLAS_NUM_THREADS=5 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 1.651
+ OPENBLAS_NUM_THREADS=8 python ./testnumpy.py
(1000, 1000) randn EigenDecomp, secs: 1.631

A swap benchmark showing no benefit from threading, whatever the cache size and the buffer size:
https://gist.github.com/jeromerobert/af2a5b6ceec1a4858ef3d191bcb440a6

jeromerobert added a commit to jeromerobert/OpenBLAS that referenced this issue May 16, 2016
@OliverQin
Author

I'll try to build this patch myself and test it. Thanks!

@OliverQin
Author

Better than before. As the matrix gets larger, more threads bring more benefit, but for my 1000x1000 case there is no significant improvement on my machine. At least it no longer gets slower. That is probably because I only modified swap.c, without merging the other recent changes. Thank you again!

@jeromerobert
Copy link
Contributor

Yes, you won't get a good speedup. Internally this algorithm works on sub-blocks that are too small to be multi-threaded, and part of the algorithm is iterative, which does not help either.

You'll get a better speedup on large matrices because the number of multi-threaded operations will be higher. But since some fraction of the work (I have no idea of the ratio, but that fraction is the key) remains single-threaded because it is too small, it will never be fully efficient.
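A back-of-the-envelope version of that argument, with made-up parallel fractions just to show the shape: if only a fraction p of the decomposition runs in multi-threaded BLAS calls, Amdahl's law caps the overall speedup at 1 / ((1 - p) + p / N).

def amdahl(p, n_threads):
    # Maximum speedup when a fraction p of the work parallelises perfectly
    # and the remaining (1 - p) stays single-threaded.
    return 1.0 / ((1.0 - p) + p / n_threads)

for p in (0.3, 0.6, 0.9):          # assumed parallel fractions, for illustration only
    print("p=%.1f -> at most %.2fx on 8 threads" % (p, amdahl(p, 8)))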
