multithreaded eigen decomposition slower than single-threaded version, with NumPy #873
Comments
Your hyperthreads are just a facelift; there is no real performance in them.
I can reproduce this issue.
The main problem is the threading in swap. Is anyone against removing multi-threading in swap? A threading threshold is also missing in trmv.
Back then my hypothesis was that threaded swap is optimal while the data is roughly one cache's worth per thread; if the data is (how much?) smaller, it bounces between caches over a slower transfer path, and if the data is larger than all caches combined, two CPUs can already saturate the remaining memory channel.
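For a rough sense of the sizes involved, here is a back-of-the-envelope sketch of that hypothesis. The cache sizes are assumptions for a Haswell-class core (32 KB L1d / 256 KB L2 per core, 6 MB shared L3), not figures taken from this thread:

```python
# Back-of-the-envelope check of the "roughly one cache per thread" hypothesis
# for a threaded swap of two double-precision vectors. Cache sizes below are
# assumed Haswell-class values, not numbers from the thread.
L2_PER_CORE = 256 * 1024        # bytes (assumed)
L3_SHARED = 6 * 1024 * 1024     # bytes (assumed)
THREADS = 4

def bytes_touched_by_swap(n):
    # dswap reads and writes two length-n double vectors: 2 * n * 8 bytes live.
    return 2 * n * 8

for n in (1_000, 100_000, 1_000_000):
    per_thread = bytes_touched_by_swap(n) / THREADS
    print(f"n={n:>9}: {per_thread / 1024:8.1f} KiB per thread "
          f"(L2 share {L2_PER_CORE / 1024:.0f} KiB, "
          f"L3 share {L3_SHARED / THREADS / 1024:.0f} KiB)")
```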
The missing trmv threshold is definitely the cause of a problem exactly like #731.
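For context, the kind of size threshold being discussed is just a guard that keeps small problems single-threaded. A minimal sketch of the idea in Python (the actual OpenBLAS interface code is C, and the cut-off constant below is made up, not an OpenBLAS value):

```python
# Illustrative only: choose a thread count from the problem size, falling back
# to one thread below a cut-off. The threshold value is invented for the sketch;
# OpenBLAS hard-codes its own per-routine constants.
MIN_ELEMENTS_PER_THREAD = 100_000  # assumed, not an OpenBLAS value

def choose_threads(n_elements, max_threads):
    if n_elements < MIN_ELEMENTS_PER_THREAD:
        return 1  # too small: threading overhead would dominate
    return min(max_threads, max(1, n_elements // MIN_ELEMENTS_PER_THREAD))

print(choose_threads(1_000, 8))       # -> 1, a small trmv stays single-threaded
print(choose_threads(4_000_000, 8))   # -> 8
```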
Benchmarks without threading in swap:
A swap benchmark that shows no benefit from threading, whatever the cache size and the buffer size:
I'll try to build this patch on my own and test it. Thanks!
Better than before. As the matrix gets larger, more threads bring more benefit, but for my 1000x1000 case there is no significant improvement on my machine. At least now it does not get slower. That's probably because I only modified the …
Yes, you won't get a good speed-up. Internally this algorithm works on sub-blocks which are too small to be multi-threaded. Part of the algorithm is also iterative, which does not help. You'll get a better speed-up on large matrices because the number of multi-threaded operations will be higher. But since some of them (no idea of the ratio, but this is the key) will remain single-threaded because they are too small, it won't be that efficient.
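That argument is essentially Amdahl's law: whatever fraction of the work stays single-threaded bounds the overall speed-up. A quick illustration, with made-up parallel fractions since the real ratio is unknown, as noted above:

```python
# Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), where p is the fraction of
# the work that actually runs multi-threaded. The values of p are invented for
# illustration; the comment above says the true ratio is not known.
def amdahl(p, n_threads):
    return 1.0 / ((1.0 - p) + p / n_threads)

for p in (0.5, 0.8, 0.95):
    print(f"p={p:.2f}: " + ", ".join(
        f"{n} threads -> {amdahl(p, n):.2f}x" for n in (2, 4, 8)))
```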
I am using NumPy 0.11.0, linked against libopenblas_haswellp-r0.2.18, which I built myself. My CPU is an Intel(R) Core(TM) i7-4710MQ @ 2.50GHz.
I found that OpenBLAS is slower when using multiple threads to do an eigen decomposition. Here is my Python code:
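The original snippet is not preserved in this copy of the thread. A minimal sketch of the kind of benchmark described, assuming a 1000x1000 random matrix passed to numpy.linalg.eig with the thread count set through the OPENBLAS_NUM_THREADS environment variable, would look like this:

```python
# Sketch, not the original code: time an eigen decomposition of a 1000x1000
# random matrix. Run with e.g. OPENBLAS_NUM_THREADS=1 (then 2, 4, 8) set in the
# environment before Python starts, and compare the timings.
import time
import numpy as np

a = np.random.rand(1000, 1000)

start = time.time()
w, v = np.linalg.eig(a)
print("np.linalg.eig on 1000x1000 took %.3f s" % (time.time() - start))
```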
Result: more threads bring no benefit, and performance drops significantly when the number of threads is greater than 4.
I am sorry that I don't know exactly which routine NumPy calls, but I guess this is the right place to post the issue. How can this be solved?
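For what it's worth, the NumPy documentation states that numpy.linalg.eig is implemented on top of the LAPACK _geev routines (dgeev for a real double-precision matrix), and numpy.show_config() prints which BLAS/LAPACK build NumPy is linked against:

```python
# Check how the installed NumPy is linked. numpy.linalg.eig documents that it
# uses the LAPACK _geev routines (dgeev for real double-precision input).
import numpy as np

print(np.__version__)
np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built against
```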