BLAS memory allocation error in Scikit-learn KMeans & kNN & DBSCAN #3321
The recommendation given there was on the right track; I wonder why it did not work. Multithreaded OpenBLAS requires a memory buffer per thread, and the maximum number of buffers is set at compile time. So there is an (ideally invisible) limitation caused by what the OpenBLAS that came with either numpy or your operating system was configured for. Does it work when you set OPENBLAS_NUM_THREADS to a smaller value, like 16 or 32? (The OpenBLAS that comes with numpy 1.21 is built for 64 threads, as recently established in #3318 (comment), but maybe you have some other version imported elsewhere in your combination of programs.)
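To test this suggestion, the variable has to be in the environment before OpenBLAS is first loaded, because the thread pool is sized at library initialization. A minimal sketch (assuming numpy is linked against OpenBLAS, as the pip wheels are):

```python
import os

# Must be set before numpy (and hence OpenBLAS) is first imported;
# once the library is loaded, the thread/buffer count is fixed.
os.environ["OPENBLAS_NUM_THREADS"] = "16"

import numpy as np

# Any BLAS-backed call now uses at most 16 OpenBLAS threads.
x = np.random.rand(100, 100)
y = x @ x
print(y.shape)
```

Equivalently, export the variable in the shell before launching Python.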
I installed numpy:
Sounds like that package was built with NUM_THREADS=16, although the conda-forge "recipe" for building openblas packages sets this number to 128. Not sure about the who/how/where for the packages though.
Just to clarify: I installed scikit-learn from the pip channel (PyPI). Numpy was installed as a dependency, bundled with openblas.
Please install official stable versions with
I have no errors when I download numpy from
So, the problem is gone?
No, the problem remains. I do not download any nightly builds, I just type this:
Please check the configuration of
If you do not want to limit the number of OpenBLAS threads, the only solution would seem to be to build OpenBLAS from source with a high enough NUM_THREADS. I cannot match the hash in the library names you mentioned to a specific build (and consequently its build options), but I would expect the maximum supported thread count to be at least 64. Maybe what is happening
There are multiple installations you are talking about here. This one is a nightly build with INTERFACE64=1. I'm not sure where you got this from; it doesn't seem to be from PyPI. Can you remove numpy and make sure that there's nothing left in
As you recommended, I installed numpy using this:
The build command for OpenBLAS is here, i.e. DYNAMIC_ARCH and 128 threads.
I'm not going to manually build openblas. As a user, I want to be sure that with the most ordinary install of numpy, it will start and work on any machine. When installing from a conda channel this is the case, but the problem is in the pip channel. Have you been able to reproduce the problem yourself?
Can you run the script with
Is there any control over the parallelism in scikit-learn? Assuming you had 8 threads running in parallel, each calling an OpenBLAS function that uses 16 threads, there would be no free slots in the buffer list for a 9th thread doing the same.
Actually, I do not know. Gotta ask the guys from the scikit-learn team.
This parallelism is used. Please set OPENBLAS_NUM_THREADS=1
In KMeans we call OpenBLAS gemm inside a parallel (OpenMP) loop, but we set the OpenBLAS thread count to one to avoid nested parallelism. In KNN and DBSCAN we call OpenBLAS in a multi-process setup, and as mentioned above we set the number of OpenBLAS threads such that the total number of threads does not exceed the number of CPUs. Setting OPENBLAS_NUM_THREADS=1 means all OpenBLAS calls will be sequential in non-nested regions, which is unfortunately not optimal. @OnlyDeniko do you set the n_jobs parameter for these estimators?
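The mechanism described here (limiting BLAS threads around a parallel region) can also be applied from user code with threadpoolctl, which scikit-learn uses internally for this purpose. A sketch, assuming threadpoolctl is installed:

```python
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(500, 500)

# Force every BLAS call inside this block to run sequentially,
# useful when the caller already parallelizes at a higher level
# and nested BLAS threads would oversubscribe the CPUs.
with threadpool_limits(limits=1, user_api="blas"):
    b = a @ a

print(b.shape)
```

Outside the `with` block, the BLAS thread count returns to its previous value.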
@jeremiedbb I set
Could you try setting it to a lower value like 16, 32, or 64 and see when it breaks?
@jeremiedbb it is
KMeans works with
It is a timing race until you reach 256 allocations. Oversubscription damages performance worse than linearly. You somehow need to arrange for one OpenBLAS thread per CPU for the fastest result, say n_jobs=48 with OPENBLAS_NUM_THREADS=2, or some other combination that multiplies out to your 96 cores and returns the result in the shortest time.
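The rule of thumb above (outer jobs times BLAS threads equals the number of physical cores) can be enumerated with a small helper; `thread_splits` is a hypothetical name used only for illustration:

```python
def thread_splits(n_cores):
    """Return (n_jobs, blas_threads) pairs whose product is n_cores."""
    return [(j, n_cores // j) for j in range(1, n_cores + 1)
            if n_cores % j == 0]

# For the 96-core machine discussed above, candidate combinations
# of n_jobs and OPENBLAS_NUM_THREADS that avoid oversubscription:
print(thread_splits(96))
```

Which split is fastest still depends on the workload; benchmarking a few pairs, as suggested above, is the practical approach.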
For KMeans, we deal with the number of openblas threads internally, so setting OMP_NUM_THREADS=64 or n_jobs=64 should be enough. For KNN, I'd suggest setting n_jobs=1 and maybe OPENBLAS_NUM_THREADS=64, since I don't think multiprocessing brings anything for this estimator. We are currently reworking it to have much better scalability in multicore settings, but it's still WIP.
Since it looks like this is just a matter of increasing the parameter at build time from 64 to 128, can you open an issue in https://github.com/MacPython/openblas-libs/ ?
This one was the official release with the openblas binary pulled from conda: 128 threads, and wild, well documented CPU oversubscription.
Thanks for the details. So we get the confirmation that your code relies on the OpenBLAS shipped in the numpy and scipy wheels, and each wheel brings a different version. Usually this is not a problem. But I am still not sure why this crashes in scikit-learn: in either case we should not get oversubscription-related performance problems, since OpenBLAS should always run in sequential mode in the end.

The error you observe could still be resolved by increasing the NUM_THREADS build option. @OnlyDeniko do you confirm that you do not reproduce the problem if you install everything from conda-forge, which sets NUM_THREADS=128?

```
conda create -n sklearn-cf -c conda-forge scikit-learn
conda activate sklearn-cf
python -m threadpoolctl --import sklearn  # just to check
python your_reproducer_script.py
```

Python, numpy, scipy, openblas, joblib and threadpoolctl are all dependencies of scikit-learn, so conda will install them all from conda-forge automatically.

Edit: from the reference linked above, so indeed for KMeans, even if OpenBLAS is called with 1 thread at runtime, it is called by 96 OpenMP threads, so this might be the problem.
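The `python -m threadpoolctl --import sklearn` check above can also be done from a script: threadpoolctl reports which BLAS library is actually loaded and its current thread count, which helps identify whether the pip or conda-forge OpenBLAS build is in use. A sketch:

```python
import numpy as np  # loads the BLAS bundled with the numpy wheel
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    # 'num_threads' is the current runtime setting; 'filepath' and
    # 'version' identify which BLAS build is actually loaded.
    print(pool["user_api"], pool.get("internal_api"),
          pool["num_threads"], pool["filepath"])
```

Note that this shows the runtime thread count, not the compile-time NUM_THREADS ceiling that the crash message refers to.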
In both cases we try to create 96 memory regions, which is more than 64. For KMeans, setting OMP_NUM_THREADS=64 or n_jobs=64 should be ok. For KNN, setting n_jobs=64 and OPENBLAS_NUM_THREADS=1 should be ok (alternatively n_jobs=1 and OPENBLAS_NUM_THREADS=64).
I don't understand why this breaks when we use joblib sub-processes for KNN: each worker process manages its memory independently of the others. There should be no shared buffers.
In KNN, parallelism (assuming brute force) comes from pairwise distance computations, which use joblib with the threading backend.
Alright, that makes sense then. And DBSCAN does the same to precompute the neighborhood graph.
Yes, I confirm
I think we understand the root cause of the problem and the solution now (and workarounds). I think we can close the issue on this repo in favor of MacPython/openblas-libs#64, which I just created.
@ogrisel thank you very much for looking into this. This buffer remains the major design flaw in OpenBLAS, but I suspect the only thing I can do short-term to mitigate its effect is to add more information to the error message, in particular the number of threads the library was built for.
That would be great. You could also link to a dedicated markdown document on GitHub that gives users more details on how to introspect how many CPUs their machine has and where their OpenBLAS was installed from (I am pretty sure that most OpenBLAS users do not know that they use OpenBLAS, because they use it via numpy, scipy, pytorch, R or something similar).
Actually, this is not the only problem: using more than 64 threads seems to degrade the performance of a 4096x4096 DGEMM, see MacPython/openblas-libs#64 (comment).
It is certainly possible to throw so many threads at a "small" problem that performance degrades again, but I believe it would need an unmanageable (and itself costly) set of rules to tailor the number of threads to each problem size, where OpenBLAS currently switches between 1 and all threads only. Hardware layout (cache locality, multi-die CPU interconnects, etc.) will also play a role.
I am not sure how to move forward with this. Increasing NUM_THREADS at build time has its own drawbacks (see the DGEMM performance degradation mentioned above). Implementing an ad-hoc mitigation in scikit-learn for estimators that call OpenBLAS routines in sequential mode from a large number of externally managed threads is possible, but complex, hard to maintain and brittle. See MacPython/openblas-libs#64 (comment) for a minimal reproducer and some details on how to technically implement this. But such an ad-hoc mitigation would not solve the problem for other libraries (apparently it might impact PyTorch users as well). Ideally the problem should be solved in OpenBLAS by making it possible to allocate extra buffers on demand when OpenBLAS is called by a large number of externally managed threads.
It would be nice if there were a mechanism to report the error and return a sentinel or set some errno without crashing the process.
You are not the first to come up with that suggestion. Unfortunately, when we reach this situation there is nowhere left to go, and there is no universally agreed error code or mechanism to return "BLAS just died on you" anyway.
@ogrisel @mattip can you give
Changing \site-packages\joblib\externals\loky\backend\context.py can do it:

```python
os_cpu_count = min(os.cpu_count() or 1, 12)
cpu_count_user = min(_cpu_count_user(os_cpu_count), 12)
```
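Rather than patching files under site-packages, loky exposes an environment variable that caps the CPU count it reports. A sketch, assuming a reasonably recent joblib/loky that honors `LOKY_MAX_CPU_COUNT`:

```python
import os

# Set before joblib/loky computes its effective CPU count.
os.environ["LOKY_MAX_CPU_COUNT"] = "12"

import joblib

# joblib delegates to loky's cpu_count, which respects the cap
# (on machines with fewer cores, the smaller value is returned).
print(joblib.cpu_count())
```

This survives package upgrades, which an edit inside site-packages would not.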
scikit-learn/scikit-learn#20539
Do you have any ideas?