combined threading #3913

Closed
jcrbloch opened this issue Feb 17, 2023 · 10 comments

Comments

@jcrbloch

I have a program in which one part contains an OpenMP parallel for loop that calls OpenBLAS routines; there the OpenBLAS routines should run single-threaded. Other, serial parts of my program also call OpenBLAS routines, and those calls should run multi-threaded. How can I change the number of threads allocated to OpenBLAS dynamically in my code? I guess I could use openblas_set_num_threads(), however that function is not available if someone decides to link against some other BLAS implementation. Is there any other way to set the OpenBLAS thread count dynamically?
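One portable way to do this (a minimal sketch, not taken from this thread) is to resolve openblas_set_num_threads() at run time, so the call simply degrades to a no-op when some other BLAS is linked:

```c
/* Minimal sketch: call openblas_set_num_threads() only if the linked BLAS
 * actually provides it. Assumes a POSIX system; on Linux link with -ldl. */
#define _GNU_SOURCE            /* for RTLD_DEFAULT on glibc */
#include <dlfcn.h>

static void blas_set_num_threads(int n)
{
    /* Look the symbol up among the already-loaded libraries. */
    void (*fn)(int) = (void (*)(int))dlsym(RTLD_DEFAULT, "openblas_set_num_threads");
    if (fn)
        fn(n);   /* OpenBLAS is linked: adjust its thread count */
    /* otherwise some other BLAS is in use and we silently do nothing */
}
```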

@brada4
Contributor

brada4 commented Feb 17, 2023

An OpenBLAS built with USE_OPENMP=1 will behave exactly like that.
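To illustrate what that means in practice (a sketch, assuming OpenBLAS was built with make USE_OPENMP=1 and the program is compiled against the same OpenMP runtime): BLAS calls made inside an OpenMP parallel region run single-threaded, while calls from serial code use the full thread pool.

```c
/* Sketch of the expected behaviour with an OpenMP-enabled OpenBLAS
 * (built with: make USE_OPENMP=1) and a program compiled with -fopenmp
 * against the same OpenMP runtime. */
#include <cblas.h>
#include <omp.h>

static void gemm_nn(const double *a, const double *b, double *c, int n)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
}

void run(const double *a, const double *b, double *c, int n, int batches)
{
    /* Inside the parallel region each OpenBLAS call stays on one thread. */
    #pragma omp parallel for
    for (int i = 0; i < batches; ++i)
        gemm_nn(a, b, c + (size_t)i * n * n, n);

    /* From serial code OpenBLAS is free to use all OpenMP threads again. */
    gemm_nn(a, b, c, n);
}
```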

@jcrbloch
Author

jcrbloch commented Feb 17, 2023 via email

@jcrbloch
Author

jcrbloch commented Feb 17, 2023 via email

@brada4
Contributor

brada4 commented Feb 17, 2023

You need to use the same OpenMP implementation (i.e. the same compiler family) for both the library and your software, so that OpenBLAS can detect that it is being called inside a parallel section.
The implementations include clang's libomp, GCC's libgomp, Intel's libiomp and Microsoft's OpenMP runtime; use ldd or Dependency Walker to inspect your final artifact.

@jcrbloch
Author

I rebuilt OpenBLAS with USE_OPENMP=1 and checked with otool (on macOS) that it uses libgomp just like my main program does. The result is awful: performance drops to 50% compared to the previous build. The best performance is achieved with the macOS Accelerate framework, which seems to reduce the compute time by about 20% compared to the original OpenBLAS build.

@jcrbloch
Author

I managed to locate the difference between the Accelerate framework and OpenBLAS. It happens in repeated diagonalizations of relatively small matrices (size 144x144). For these matrices OpenBLAS needs 5 times as much CPU time as the Accelerate framework, probably because these diagonalizations perform better when not threaded. I verified this by adding an omp_set_num_threads(1) just before the diagonalization and resetting the number of threads to its maximum afterwards (because it is needed in the rest of the code). The Accelerate framework seems to handle the diagonalization of these small matrices better when threading is on.
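A sketch of that workaround (the LAPACKE routine shown is only illustrative; the thread does not say which diagonalization routine is used):

```c
/* Sketch: run the diagonalization of a small symmetric matrix single-threaded,
 * then restore the thread count for the rest of the code. Assumes an
 * OpenMP-enabled OpenBLAS and its LAPACKE interface. */
#include <lapacke.h>
#include <omp.h>

int diagonalize_small(double *a, double *w, int n)   /* a: n*n matrix, w: n eigenvalues */
{
    int max_threads = omp_get_max_threads();

    omp_set_num_threads(1);            /* 144x144 is too small to benefit from threads */
    int info = LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'V', 'U', n, a, n, w);
    omp_set_num_threads(max_threads);  /* restore for the large problems elsewhere */

    return info;
}
```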

@martin-frbg
Collaborator

Yes, switching to multithreading most likely occurs too early for both GEMM and GEMV; the factors in their respective interface files were last tuned in 2016 (for the hardware of the time, and when the library was still not thread-safe, i.e. lacked many of the locks that will have affected performance since). I started looking into this for #2846 but got derailed by family emergencies lately; I hope to pick it up again by next weekend.
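For anyone who wants to see where the crossover falls on their own hardware, a rough timing sketch (the sizes and the choice of dgemm are illustrative assumptions, not something discussed in this issue):

```c
/* Sketch: compare dgemm wall time with 1 thread vs. all threads over a range
 * of small sizes, to estimate where multithreading starts to pay off.
 * Assumes an OpenMP-enabled OpenBLAS, so omp_set_num_threads() also limits BLAS. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
#include <omp.h>

static double time_dgemm(int n, int reps)
{
    double *a = calloc((size_t)n * n, sizeof *a);
    double *b = calloc((size_t)n * n, sizeof *b);
    double *c = calloc((size_t)n * n, sizeof *c);
    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; ++r)
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    double elapsed = omp_get_wtime() - t0;
    free(a); free(b); free(c);
    return elapsed;
}

int main(void)
{
    int max_threads = omp_get_max_threads();   /* remember the default */
    for (int n = 32; n <= 512; n *= 2) {
        omp_set_num_threads(1);
        double t1 = time_dgemm(n, 100);
        omp_set_num_threads(max_threads);
        double tN = time_dgemm(n, 100);
        printf("n=%4d   1 thread: %.4fs   %d threads: %.4fs\n", n, t1, max_threads, tN);
    }
    return 0;
}
```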

@jcrbloch
Author

Thanks for this very interesting comment!

@martin-frbg
Collaborator

When called from an OpenMP parallel region, OpenBLAS will use only a single thread by default. Possibly the OMP_WAIT_POLICY setting caused idle threads to hang around unnecessarily, but there is too little information to go by. If the hardware on which the Accelerate library provides a significant speed advantage happens to be an Apple M system, the explanation for that lies simply in its use of the proprietary AMX2 matrix coprocessor.

@martin-frbg
Collaborator

Closing, as the remaining (multithreading threshold) issue is assumed to be fixed by #4441.
