Sklearn Performance on ARM NeoverseN1 #3925


Closed
smamindl opened this issue Feb 27, 2023 · 9 comments
Comments

@smamindl

Bad performance of sklearn on ARM Neoverse N1 compared to Intel MKL. Any suggestions or ideas here?

@martin-frbg
Collaborator

Difficult to address without knowing which version of OpenBLAS you used, what workload (is there a simple testcase that reproduces the slowness?), and what constitutes "poor performance" in your eyes - is it some percent slower, or orders of magnitude slower?

@smamindl
Author

smamindl commented Feb 27, 2023

> Difficult to address without knowing which version of OpenBLAS you used, what workload (is there a simple testcase that reproduces the slowness?), and what constitutes "poor performance" in your eyes - is it some percent slower, or orders of magnitude slower?

@martin-frbg OpenBLAS version is 0.3.20. I am running a simple linear regression model using the diabetes data from sklearn. I see that core scaling doesn't happen, meaning as I increase the number of CPUs, the latency actually goes up or flattens instead of decreasing. I am setting n_jobs and OMP_NUM_THREADS. I am running this on an ARM Neoverse N1 CPU. It is about 30% slower than x86.
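The report above does not include code, but a minimal sketch of what such a benchmark might look like is below. The dataset and model are as described in the comment; the repeat count and timing approach are assumptions for illustration.

```python
# Hypothetical reproduction sketch (the original poster did not share code):
# time repeated LinearRegression fits on the sklearn diabetes dataset.
# Run it with different OMP_NUM_THREADS settings (set before Python starts)
# to see whether fit time scales with the BLAS thread count.
import time

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Diabetes dataset: 442 samples, 10 features
X, y = load_diabetes(return_X_y=True)

model = LinearRegression()
start = time.perf_counter()
for _ in range(100):  # repeat to get a measurable wall-clock time
    model.fit(X, y)
elapsed = time.perf_counter() - start
print(f"100 fits took {elapsed:.3f} s")
```

Note that at this matrix size (442x10) the BLAS work per fit is tiny, which matters for the threading discussion that follows.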

@martin-frbg
Collaborator

The small parameter change from pull request #3855 would probably help. Also, the algorithm may be switching to using all available cores instead of just one too early, i.e. for too-small matrix sizes. On the other hand, it could depend on which generation of x86 CPU you compare the N1 to...

@brada4
Contributor

brada4 commented Feb 28, 2023

Obviously MKL does not run on the N1. Please post a reproducer if you find ARM Performance Libraries faster than OpenBLAS.

@smamindl
Author

#3855

@martin-frbg It doesn't use all cores; there is no scaling with the number of cores.

@martin-frbg
Collaborator

Do you see the job use all cores at all? Maybe your libopenblas was not even built multithreaded, or not for as many CPUs? (There is a build-time parameter NUM_THREADS that defaults to the number of cores detected on the build host.) OpenBLAS should typically run a task on a single core if the matrix size is small (say, around 100 rows/columns), and switch to using all available cores otherwise; there are no intermediate steps where only some of the CPUs would be busy.
You still have not told us much about what you are doing, or what you are comparing it to.
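The size-based switchover described above can be observed by timing NumPy matrix multiplies (which dispatch to the bundled BLAS) at a small and a large size. The sizes here are illustrative guesses, not OpenBLAS's actual thresholds.

```python
# Sketch: time a square matrix multiply at a small size (likely handled on a
# single core) and a large size (likely spread across all configured threads).
# Watch per-core CPU usage (e.g. with htop) while this runs to see the switch.
import time

import numpy as np

def gemm_seconds(n, repeats=3):
    """Best-of-N wall-clock time for one n x n matrix multiply."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - start)
    return best

small = gemm_seconds(100)   # around the single-core regime described above
large = gemm_seconds(2000)  # large enough to engage all configured threads
print(f"100x100: {small:.6f} s, 2000x2000: {large:.6f} s")
```

Running this with OPENBLAS_NUM_THREADS set to 1 and then to the core count should change only the large-size timing if threading is working as described.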

@smamindl
Author

> Do you see the job use all cores at all? Maybe your libopenblas was not even built multithreaded, or not for as many CPUs? (There is a build-time parameter NUM_THREADS that defaults to the number of cores detected on the build host.) OpenBLAS should typically run a task on a single core if the matrix size is small (say, around 100 rows/columns), and switch to using all available cores otherwise; there are no intermediate steps where only some of the CPUs would be busy. You still have not told us much about what you are doing, or what you are comparing it to.

@martin-frbg Yes, I do see it using however many cores I specify through numactl binding. It is multithreaded; NUM_THREADS was specified at build time. I am benchmarking a linear regression model with the diabetes dataset and comparing against an Intel M6i AWS instance.

@martin-frbg
Collaborator

Well, M6i uses "Ice Lake" Xeons, AVX-512 and all; it would not surprise me if they run rings around the N1 (basically a 2018 smartphone core adapted for multicore servers, with NEON only and no advanced vector instructions) in terms of raw performance.

@rgommers
Contributor

Scikit-learn does not bundle OpenBLAS directly; what you're seeing is scikit-learn calling either numpy.linalg or scipy.linalg routines. NumPy and SciPy vendor OpenBLAS into their wheels.

To see how OpenBLAS is exactly built (OpenMP vs. pthreads, max number of threads, CPU architecture built for, etc.), please use https://github.com/joblib/threadpoolctl (its README will explain how).

If you want actual feedback on performance you're seeing, please consider posting a standalone example of the code you're running in addition to the threadpoolctl output.
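For reference, a minimal sketch of the threadpoolctl inspection suggested above (assuming threadpoolctl has been installed, e.g. via pip):

```python
# Print details of every BLAS/OpenMP threadpool loaded in the process:
# internal API (openblas, mkl, ...), library path, version, thread count.
from pprint import pprint

import numpy as np  # importing numpy loads its vendored BLAS
from threadpoolctl import threadpool_info

info = threadpool_info()
pprint(info)
```

The `num_threads`, `version`, and `threading_layer` fields in the output answer exactly the build questions raised in this thread (how many threads, which OpenBLAS version, OpenMP vs. pthreads).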
