Sklearn Performance on ARM NeoverseN1 #3925
Difficult to address without knowing which version of OpenBLAS you used, what workload (is there a simple test case that reproduces the slowness?), and what constitutes "poor performance" in your eyes — is it some percent slower, or orders of magnitude slower?

@martin-frbg OpenBLAS version is 0.3.20. I am running a simple linear regression model using the diabetes data from sklearn. I see that core scaling doesn't happen, meaning as I increase the number of CPUs the latency actually goes up or flattens instead of decreasing. I am setting n_jobs and OMP_NUM_THREADS. I am running this on an ARM Neoverse N1 CPU. It is about 30% slower than x86.
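The reporter's exact script is not shown; a minimal sketch of the benchmark as described (sklearn `LinearRegression` fit on the diabetes dataset, timed, with `n_jobs` set) might look like this:

```python
import time

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Load the small built-in diabetes dataset (442 samples, 10 features).
X, y = load_diabetes(return_X_y=True)

# n_jobs=-1 asks scikit-learn to use all cores; OMP_NUM_THREADS /
# OPENBLAS_NUM_THREADS in the environment control the BLAS thread pool.
model = LinearRegression(n_jobs=-1)

start = time.perf_counter()
model.fit(X, y)
elapsed = time.perf_counter() - start
print(f"fit took {elapsed:.4f}s on {X.shape[0]} samples x {X.shape[1]} features")
```

Note that at this data size (442x10) the underlying BLAS calls are tiny, which matters for the thread-scaling discussion that follows in the thread.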
The small parameter change from pull request #3855 would probably help; also, the algorithm may be switching to using all available cores instead of just one too early, i.e. for too-small matrix sizes. On the other hand, it could depend on which generation of x86 CPU you compare the N1 to...
Obviously MKL does not run on the N1. Please report back if you find the ARM Performance Libraries faster than OpenBLAS.
@martin-frbg It doesn't use all cores; there is no scaling with the number of cores.
Do you see the job use all cores at all? Maybe your libopenblas was not even built multithreaded, or not for as many CPUs? (There is a build-time parameter NUM_THREADS that defaults to the number of cores detected on the build host.) OpenBLAS should typically run a task on a single core if the matrix size is small (say around 100 rows/columns), and switch to using all available cores otherwise — there are no intermediate steps where only some of the CPUs would be busy.
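The small-matrix threshold described above can be observed directly by timing matrix multiplies (which dispatch to the BLAS GEMM kernel) on either side of it. This is a sketch assuming NumPy is linked against OpenBLAS; the specific sizes 64 and 1024 are illustrative:

```python
import time

import numpy as np

rng = np.random.default_rng(0)

def time_gemm(n):
    # Time an n x n matrix multiply; np.matmul dispatches to the BLAS
    # dgemm kernel for float64 arrays.
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    start = time.perf_counter()
    c = a @ b
    return time.perf_counter() - start, c

# Below OpenBLAS's multithreading threshold (~100 rows/columns) the call
# stays on one core, so adding CPUs cannot speed it up; well above it,
# all available cores should be busy.
small_t, small_c = time_gemm(64)
large_t, large_c = time_gemm(1024)
print(f"64x64: {small_t:.6f}s, 1024x1024: {large_t:.6f}s")
```

Running this under `numactl` with different core bindings (as done later in the thread) shows whether the large case actually scales while the small one stays flat.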
@martin-frbg Yes, I do see it uses however many cores I specify through numactl binding. It is multithreaded; NUM_THREADS was specified at build time. I am benchmarking a linear regression model with the diabetes dataset and comparing against an Intel M6i AWS instance.
Well, M6i uses "Ice Lake" Xeons, AVX-512 and all; it would not surprise me if they run rings around the N1 (basically a 2018 smartphone core adapted for multicore servers, with only NEON and no advanced vector instructions) in terms of raw performance.
Scikit-learn does not bundle OpenBLAS directly; what you're seeing is scikit-learn calling either NumPy or SciPy, which are themselves linked against OpenBLAS. To see how OpenBLAS is exactly built (OpenMP vs. pthreads, max number of threads, CPU architecture built for, etc.), please use https://github.com/joblib/threadpoolctl (its README will explain how). If you want actual feedback on the performance you're seeing, please consider posting a standalone example of the code you're running in addition to the threadpoolctl output.
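The threadpoolctl inspection suggested above can be done in a couple of lines; this is a sketch assuming `threadpoolctl` is installed (`pip install threadpoolctl`):

```python
# threadpool_info() lists every loaded BLAS/OpenMP library along with its
# filepath, threading layer, max thread count, and (for OpenBLAS) the
# CPU architecture it was built for.
from threadpoolctl import threadpool_info

import numpy  # importing numpy loads its BLAS, so it appears in the report

info = threadpool_info()
for lib in info:
    print(lib)
```

Equivalently, the README describes a CLI form, `python -m threadpoolctl -i numpy`, which prints the same report as JSON.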
Bad performance of sklearn on ARM Neoverse N1 compared to Intel MKL. Any suggestions or ideas here?