Unexpected slowdown on aarch64 using more than two cores/threads #2230
Comments
Strange - do you see all four cores busy at all? |
Yes, the cores are busy, matching N. From the beginning I struggled a bit, as an unset OMP_NUM_THREADS variable means OpenBLAS uses all available cores, while BLIS does not do that. So for MPI you need OMP_NUM_THREADS=1. |
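To make the difference in threading defaults concrete, here is a minimal shell sketch; the xhpl invocations are assumptions based on a typical HPL setup and are shown commented out:

```shell
# OpenBLAS defaults to using all cores when OMP_NUM_THREADS is unset,
# while BLIS defaults to a single thread. For MPI runs, pin the BLAS
# to one thread per rank so the ranks do not oversubscribe the cores:
export OMP_NUM_THREADS=1
# mpirun -np 4 ./xhpl        # parallelism comes from the 4 MPI ranks
# For the pure OpenMP test, let the library use all 4 cores instead:
# OMP_NUM_THREADS=4 ./xhpl
echo "BLAS threads per process: $OMP_NUM_THREADS"
```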
Could be thermal throttling too. Observable with powertop or cpupower. |
No, it is not throttling; I installed an extra cooler and fan. The temperatures and frequencies can be read with bcmstat. Also, it would not explain the normal performance with OpenBLAS under arm32 and with BLIS under aarch64. |
What is weird is that with one core OpenBLAS is slightly faster, but performance goes down with more. |
It could be that: the RPi4B has only 1MB of L2, and it is not listed in /proc/cpuinfo. I can also reproduce the issue on an RPi3B+, which is a Cortex-A53 with only 0.5MB of L2, and again nothing in /proc/cpuinfo. And there is no /sys/devices/system/cpu/cpu0/cache directory. Can I easily set the L2 cache size during compilation? Or should I change the value in getarch.c? RPi4:
RPi3:
|
Change 2MB in that file to 1MB, and see if it does any good ( |
Can you please check what value for L2_SIZE gets written to config.h? The line that brada4 quoted has the default that is used when autodetection is disabled (by supplying a TARGET to make) |
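One way to see what getarch decided is to inspect the generated config.h in the top-level OpenBLAS source tree after running make. The sketch below fakes a minimal config.h purely for illustration; on a real build the file is generated by getarch:

```shell
# Illustration only: fake a minimal config.h so the grep below has input.
# On a real build, run the grep against the config.h that make produced.
cat > /tmp/config.h.example <<'EOF'
#define L2_SIZE 1048576
#define DTB_DEFAULT_ENTRIES 64
EOF
# The value getarch wrote (1048576 bytes = 1 MiB, the Pi 4's shared L2):
grep "L2_SIZE" /tmp/config.h.example
```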
Oh, I see. I changed the values in getarch.c, but the result is the same: still a slowdown from 3 threads, and 4 threads are only as fast as 1. Actually, config.h says:
I also compiled OpenBLAS-0.3.7 on the RPi3 (Cortex-A53), which has just 256KB L2 in getarch.c and the same amount in config.h (in reality it should be 512KB). I've launched it on the RPi4 (Cortex-A72) and I still see the same slowdown, so the problem is probably not L2-cache related. BTW, neither lscpu nor lstopo shows any cache information. |
512k is the minimal cache for all cores on that CPU, while OpenBLAS treats the value as a per-core cache.
Playing with config.h will not work, as it will get overwritten by the rebuild.
|
My HPL.dat for my first post had N=8000, NB=200 and P=1, Q=1 for the OpenMP tests. Obviously, you need to change P and Q to 2 etc. when using MPI. If you want to measure peak performance, you can use N=20000 with 4GB RAM, but the test takes longer and you'll probably hit thermal throttling without at least some airflow. You can monitor the temperature and frequencies with bcmstat, which also uses vcgencmd. If the temperature gets close to 80C, the frequency is limited to 1000MHz instead of 1500MHz. And yes, my tests were also run with the graphical desktop. It would have been better without it, but I had the same desktop running during both the 32bit and 64bit tests, and I only observe the strange slowdown in 64bit and only with OpenBLAS. I always run 10-20 tests and take the highest score. I have also checked older OpenBLAS versions, and I see the same slowdown for all versions 0.3.0 - 0.3.7. Version 0.2.20 gives a compilation error, probably needing an older gcc. HPL.dat:
|
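The full HPL.dat pasted in the comment above did not survive; for reference, the relevant lines with the values quoted there would look roughly like this (a fragment following the standard HPL.dat template, not a complete file):

```
8000         Ns
1            # of NBs
200          NBs
1            Ps
1            Qs
```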
Looks as if this may have gone wrong as early as 0.3.0 (judging by the dgemm.goto benchmark - 0.2.20 had 3080/5880/7980/8850 MFlops for 1/2/3/4 cores at M,N,K=1000), but I have not had time to properly |
Hmm. Forcing a build with TARGET=ARMV8, the xhpl numbers are more in line with the BLIS results:
which is a bit strange as generic ARMV8 uses the generic 2x2 DGEMM kernel, while the Cortex targets have an optimized 8x4 one. |
Just transplanting the ARMV8 generic dgemm (and trmm) files and unroll values into KERNEL.CORTEXA72 does not provide any significant improvement. |
Seems the GEMM_P and GEMM_Q are way off the mark (at least for this problem size and hardware); if I just put in the defaults for "other/undetected ARMV8 cores", the Pi4 reaches
(defaults for "other" are DGEMM_P 160, DGEMM_Q 128 while the entire CortexAxx family currently gets 256/512) |
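The impact of those parameters can be seen from a back-of-the-envelope footprint calculation (a sketch: the factor of 8 is sizeof(double), and treating P*Q*8 as the dominant blocked-panel footprint is a simplification):

```shell
# Rough DGEMM blocking footprint, P * Q * sizeof(double):
cortex_bytes=$((256 * 512 * 8))   # current CortexAxx defaults
other_bytes=$((160 * 128 * 8))    # "other/undetected ARMV8" defaults
echo "CortexAxx P*Q block: $cortex_bytes bytes"   # 1048576 = 1 MiB
echo "generic   P*Q block: $other_bytes bytes"    # 163840 = 160 KiB
# The Pi 4's four cores share a single 1 MiB L2, so with the Cortex
# defaults each of the four threads wants a block the size of the whole
# cache, while the generic 160 KiB blocks fit comfortably.
```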
My mobile has the same problem: linear improvement up to 4 cores, then no gain from the remaining two. Cut from cpuinfo: |
From what I could find, the snapdragon660 has a big.LITTLE configuration of 4x 2.2GHz CortexA73 plus 4x 1.8GHz CortexA53 (or rather the respective derivatives, Kryo260 "Gold" and "Silver") so not quite the same situation. (And I think there is currently no explicit support for such beasts in getarch, so on a host with two different TARGETs you will get a library built for whichever |
Thanks for the tip, android |
I've again tried 0.2.20; it compiles fine as well, and the same is true for 0.3.7. CORTEXA53 and CORTEXA72 (the autodetected target) show the slowdown, while ARMV8 does not. I think we now have the answer for the 32bit behavior: in 32bit the autodetect uses ARMV7, and there is no slowdown. So it seems to be a CORTEX problem in OpenBLAS 0.2.20-0.3.7 (CORTEXA57 in BLIS seems ok).

I looked at the better performance you got with Ns=5040, NBs=128, and now I think it may be a cache-related problem. For smaller problem sizes Ns the slowdown of OpenBLAS in 64bit is smaller; for example, for Ns=1000 OpenBLAS on 4 cores is already faster than BLIS. To see the problem I use at least Ns=8000. The NBs also has a big impact on OpenBLAS in 64bit: a smaller NBs is usually faster on more cores. If I understand it correctly, it cuts the problem into smaller pieces, which fit the cache better. For BLIS and 32bit OpenBLAS, NBs can stay almost the same regardless of the number of cores.

So I'm running the test for NBs=32...256 in steps of 16, which makes 15 different NBs. I run each task ten times and take the fastest result. Unfortunately, it takes a long time. The other values in HPL.dat do not seem to influence the speed much. We now have another way to detect the problem: if the best NBs does not decrease with the number of cores, there is no problem; if it decreases with the number of cores, there is.
I also stopped the desktop environment (
Results for Ns=8000 OpenBLAS 0.3.7, 64bit, CORTEXA72 = problem
OpenBLAS 0.3.7, 64bit, ARMV8 = ok
BLIS 0.6.0, 64bit, CORTEXA57 = ok
Also, there is now an easier way to run 64bit Gentoo, with the kernel and everything in a single image file. Moreover, if you use the Lite image, there is no desktop environment, which is better for benchmarking. |
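The NBs sweep described above can be scripted; a minimal sketch (the HPL.dat edit and the xhpl invocation are assumptions, shown commented out):

```shell
# Sweep NBs = 32..256 in steps of 16 (15 values), as described above.
for nb in $(seq 32 16 256); do
    echo "would run HPL with NBs=$nb"
    # sed -i "s/^[0-9]* *NBs/$nb          NBs/" HPL.dat  # hypothetical edit
    # mpirun -np 4 ./xhpl                                # then benchmark
done | wc -l
```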
I suspect the CortexA57 change from #696 four years ago was made with a different class of hardware in mind (@ashwinyes ?). The much more recent rewrite then applied it to all supported Cortex cpus and the Pi is probably just too small to cope with it. (Makes me wonder if it will become necessary to take other hardware properties into consideration when picking such |
From recent discussions in #1976 it is clear that the current choice of parameters was driven by the requirements of server-class systems, while the original set was more appropriate for small systems. |
Hello,
I was trying to compile hpl-2.3 on a Raspberry Pi 4 with a four-core Cortex-A72 at 1.5 GHz, using a recent aarch64 Debian Buster. It doesn't matter whether I use OpenMP or MPI; the results are the same. For more than two cores/threads (N) there is an unexpected slowdown in the obtained GFlops:
This is for OpenBLAS-0.3.5 available through Debian and also for compiled versions 0.3.6 and 0.3.7.
While the same test using BLIS-0.5.1 available through Debian gives the expected increase with more cores/threads:
Moreover, I've run the test on the same machine on the arm32 architecture using a recent Raspbian. Here OpenBLAS-0.3.6 compiled from source also behaves as expected (the armv7 version in Raspbian is much slower, using neither vfpv4 nor neon):