Description
This is more of a note about an undocumented feature, in case other users run into a similar problem. I have an implementation of the OpenMP runtime that supports multiple copies of the runtime in the same process. Threads bound to different OpenMP runtimes were calling into OpenBLAS simultaneously, but OpenBLAS was serializing their execution, which caused poor performance. The relevant bit of code is here:
https://github.com/xianyi/OpenBLAS/blob/develop/driver/others/blas_server_omp.c#L321-L336
Effectively, there is a fixed number of buffers available for managing parallel OpenMP calls, and the default is 1. So if multiple OpenMP runtimes call into OpenBLAS at the same time, only one of them can make progress while the rest spin-wait for the single available buffer. It seems the right way to fix this is to set NUM_PARALLEL at build time to the upper bound on the number of OpenMP runtimes that you can have in a process:
https://github.com/xianyi/OpenBLAS/blob/develop/Makefile.system#L197-L199
This will then set the max parallel number:
https://github.com/xianyi/OpenBLAS/blob/develop/Makefile.system#L1015
and then that will fill in extra buffers for OpenMP usage:
https://github.com/xianyi/OpenBLAS/blob/develop/driver/others/blas_server_omp.c#L57-L62
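For anyone who wants to see the effect without reading the OpenBLAS source, here is a minimal sketch of the slot-claiming pattern described above. It is not the OpenBLAS code: `SLOT_COUNT`, `slot_in_use`, `try_acquire_slot`, `acquire_slot`, and `release_slot` are hypothetical names I made up for illustration, and `SLOT_COUNT` only stands in for the compile-time limit that NUM_PARALLEL controls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the compile-time slot limit.
   1 mirrors the default described above. */
#ifndef SLOT_COUNT
#define SLOT_COUNT 1
#endif

/* One flag per buffer slot; static storage starts out as "free". */
static atomic_bool slot_in_use[SLOT_COUNT];

/* Scan the slots once and claim the first free one.
   Returns the slot index, or -1 if every slot is taken. */
static int try_acquire_slot(void) {
    for (int i = 0; i < SLOT_COUNT; i++) {
        bool expected = false;
        if (atomic_compare_exchange_strong(&slot_in_use[i], &expected, true))
            return i;
    }
    return -1;
}

/* Spin until a slot becomes free. With SLOT_COUNT == 1, every
   concurrent caller except one ends up busy-waiting here. */
static int acquire_slot(void) {
    int idx;
    while ((idx = try_acquire_slot()) < 0) {
        /* busy-wait */
    }
    return idx;
}

/* Hand the slot back so a waiting caller can claim it. */
static void release_slot(int idx) {
    assert(idx >= 0 && idx < SLOT_COUNT);
    atomic_store(&slot_in_use[idx], false);
}
```

With a single slot, a second caller cannot get past `acquire_slot` until the first calls `release_slot`, which is the serialization I was seeing. Compiling the sketch with a larger `SLOT_COUNT` lets that many callers proceed at once.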
As far as I can tell, this isn't documented anywhere, though I may have just missed it. Please feel free to point me to the proper documentation if I overlooked it.