combined threading #3913
Comments
An OpenBLAS build made with USE_OPENMP=1 will behave exactly like that.
It seems that with USE_OPENMP=1 the BLAS routines are threaded inside my already-threaded omp parallel for region, which slows everything down. Should it behave any differently?
On 17 Feb 2023, at 18:25, Andrew wrote:
Built with USE_OPENMP=1 will behave exactly like that.
Another question: do you mean the build of OpenBLAS itself, or the build of my own program that uses OpenBLAS?
You need to use the same OpenMP implementation (in practice, the same compiler) for both the library and your software, so that OpenBLAS can detect that it is being called inside a parallel section.
I rebuilt OpenBLAS with USE_OPENMP=1 and checked with otool (on macOS) that it uses libgomp, just like my main program does. The result is awful: performance drops to 50% of what the previous build achieved. The best performance is achieved with the macOS Accelerate framework, which seems to reduce the compute time by about 20% compared to the original OpenBLAS build.
I managed to locate the difference between the Accelerate framework and OpenBLAS. It happens in repeated diagonalizations of relatively small matrices (size 144x144). For these matrices OpenBLAS needs 5 times as much CPU time as the Accelerate framework, probably because these diagonalizations perform better when not threaded. I verified this by adding an omp_set_num_threads(1) call just before the diagonalization and resetting the number of threads to its maximum afterwards (because it is needed in the rest of the code). The Accelerate framework seems to handle the diagonalization of these small matrices better when threading is on.
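The workaround described in this comment can be sketched roughly as follows. This is only an illustration under assumptions: `small_kernel_fn`, `scale_in_place`, and `run_single_threaded` are hypothetical names, and `scale_in_place` is a placeholder standing in for the real diagonalization call, which is not shown in the thread. The serial fallbacks exist only so the sketch also compiles without OpenMP enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so this sketch also compiles without OpenMP. */
static int omp_get_max_threads(void) { return 1; }
static void omp_set_num_threads(int n) { (void)n; }
#endif

/* Hypothetical signature for a small-matrix routine (n x n, row-major). */
typedef void (*small_kernel_fn)(double *a, int n);

/* Placeholder for the real diagonalization: doubles every entry. */
static void scale_in_place(double *a, int n)
{
    for (int i = 0; i < n * n; ++i)
        a[i] *= 2.0;
}

/* Run `kernel` with the OpenMP thread limit set to 1, then restore the
 * limit that was in effect before. Returns the restored limit. */
static int run_single_threaded(small_kernel_fn kernel, double *a, int n)
{
    int saved = omp_get_max_threads();  /* remember the current limit */
    omp_set_num_threads(1);             /* small problem: stay serial */
    kernel(a, n);
    omp_set_num_threads(saved);         /* the rest of the code needs it */
    return saved;
}
```

Calling `run_single_threaded(scale_in_place, a, n)` from a serial part of the program leaves the thread limit unchanged afterwards. Whether the BLAS library honours the OpenMP limit depends on it being built against the same OpenMP runtime as the program.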
Yes, switching to multithreading most likely occurs too early for both GEMM and GEMV. The factors in their respective interface files were last tuned in 2016 (for the hardware of the time, and when the library was still not thread-safe, i.e. lacked many locks that will have affected performance since). I started looking into this for #2846 but got derailed by family emergencies lately; I hope to pick it up again by next weekend.
Thanks for this very interesting comment!
When called from an OpenMP parallel region, OpenBLAS will use only a single thread by default. Possibly the OMP_WAIT_POLICY setting caused idle threads to hang around unnecessarily, but there is too little information to go by. If the hardware on which the Accelerate library provides a significant speed advantage happens to be an Apple M system, the explanation lies simply in its use of the proprietary AMX2 matrix coprocessor.
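For illustration, the kind of check a library can make to fall back to one thread inside a parallel region might look like the sketch below. It is not taken from the OpenBLAS sources; `threads_for_blas_call` is a hypothetical helper, and the only real API used is OpenMP's `omp_in_parallel()`. The fallback definition is there only so the sketch compiles without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Without OpenMP there is never an active parallel region. */
static int omp_in_parallel(void) { return 0; }
#endif

/* Hypothetical helper: decide how many threads a BLAS-like call should
 * use. Inside an active OpenMP parallel region it returns 1, so nested
 * calls do not oversubscribe the cores; otherwise it returns the
 * requested count (clamped to at least 1). */
static int threads_for_blas_call(int requested)
{
    if (requested < 1)
        requested = 1;
    return omp_in_parallel() ? 1 : requested;
}
```

Such a check only works when the library and the calling program share one OpenMP runtime, which matches the earlier advice to build both with the same compiler.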
Closing, on the assumption that the remaining (multithreading threshold) issue is fixed by #4441.
I have a program in which one part contains an omp parallel for loop that calls OpenBLAS routines. There, the OpenBLAS routines should run single-threaded. Other, serial parts of my program also call OpenBLAS routines, and those calls should run multithreaded. How can I change the number of threads allocated to OpenBLAS dynamically in my code? I guess that I could use openblas_set_num_threads(); however, that function is not available if someone decides to use some other BLAS implementation. Is there any other way to set the number of OpenBLAS threads dynamically?
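One possible way to keep such code portable is to hide the library-specific call behind a small wrapper, sketched below. The macro `BLAS_BACKEND_OPENBLAS` and the function `blas_set_threads` are hypothetical names for illustration; the macro is assumed to be defined by the build system when linking against OpenBLAS. Only `openblas_set_num_threads(int)` is an actual OpenBLAS function.

```c
#if defined(BLAS_BACKEND_OPENBLAS)
/* Provided by OpenBLAS (also declared in its cblas.h). */
void openblas_set_num_threads(int num_threads);
#endif

/* Hypothetical wrapper: request `n` BLAS threads where the backend
 * supports it, and do nothing otherwise. Returns the clamped request
 * so callers can log what was asked for. */
static int blas_set_threads(int n)
{
    if (n < 1)
        n = 1;
#if defined(BLAS_BACKEND_OPENBLAS)
    openblas_set_num_threads(n);
#else
    /* Unknown backend: no portable BLAS call exists, so this is a no-op. */
#endif
    return n;
}
```

The program would then call `blas_set_threads(1)` before the omp parallel for loop and `blas_set_threads(max_threads)` before the serial sections, with `max_threads` chosen by the application.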