Unexpected slowdown on aarch64 using more than two cores/threads #2230

Closed
jfikar opened this issue Aug 21, 2019 · 24 comments · Fixed by #2267

jfikar commented Aug 21, 2019

Hello,

I was trying to compile hpl-2.3 on a Raspberry Pi 4 with a four-core Cortex-A72 at 1.5 GHz, running recent aarch64 Debian Buster. It doesn't matter whether I use OpenMP or MPI; the results are the same. For more than two cores or threads (N) there is an unexpected slowdown in the obtained GFlops:

N    MPI    OpenMP
1    4.8    4.8
2    8.7    8.5
3    6.2    6.4
4    4.2    4.7

This is with OpenBLAS 0.3.5 available through Debian, and also with versions 0.3.6 and 0.3.7 compiled from source.

The same test using BLIS 0.5.1, available through Debian, gives the expected increase with more cores/threads:

N    MPI    OpenMP
1    4.6    4.6
2    8.3    8.6
3    11.1   12.1
4    12.4   14.2

Moreover, I've run the test on the same machine on the arm32 architecture using recent Raspbian. Here OpenBLAS 0.3.6 compiled from source also behaves as expected (the armv7 package in Raspbian is much slower, using neither vfpv4 nor neon):

N    MPI    OpenMP
1    3.9    3.9
2    7.2    5.9
3    9.8    7.5
4    11.2   10.9
@martin-frbg (Collaborator)

Strange - do you see all four cores busy at all?

@jfikar (Author) commented Aug 21, 2019

Yes, the number of busy cores matches N.

From the beginning I struggled a bit: with OMP_NUM_THREADS unset, OpenBLAS uses all available cores, while BLIS does not. So for MPI you need OMP_NUM_THREADS=1.
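
For reference, a minimal sketch of how the two modes are run (the mpirun invocation is an assumption; HPL.dat must have P*Q equal to the number of ranks):

# OpenMP: a single process, P=Q=1 in HPL.dat, threading done inside the BLAS
OMP_NUM_THREADS=4 ./xhpl

# MPI: one rank per core, P*Q=4 in HPL.dat, each rank with a single-threaded BLAS
OMP_NUM_THREADS=1 mpirun -np 4 ./xhpl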

@brada4 (Contributor) commented Aug 23, 2019

Could be thermal throttling too. Observable with powertop or cpupower.

@jfikar (Author) commented Aug 23, 2019

No, it is not throttling; I installed an extra cooler and fan. The temperatures and frequencies can be read with bcmstat.

Also, throttling would not explain the normal performance with OpenBLAS under arm32 and with BLIS under aarch64.
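
For what it's worth, the firmware can also be queried directly (a quick sketch; bcmstat reads the same vcgencmd interface):

vcgencmd measure_temp        # current SoC temperature
vcgencmd measure_clock arm   # current ARM core clock in Hz
vcgencmd get_throttled       # non-zero bits mean throttling or undervoltage occurred at some point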

@brada4 (Contributor) commented Aug 23, 2019

What is weird is that on a single core OpenBLAS is slightly faster, but it goes down with more cores.
https://github.com/xianyi/OpenBLAS/blob/5fdf9ad24fe4fcdd0f5fa8b25d783d0c826f4eff/getarch.c#L1007
What is the cache size according to /proc/cpuinfo? The 2MB default might be too big and spill over into main-memory reads instead of keeping everything turning over inside the cache.

@jfikar (Author) commented Aug 23, 2019

It could be that: the RPi4B has only 1MB of L2 and it is not listed in /proc/cpuinfo. I can also reproduce the issue on an RPi3B+, which is a Cortex-A53 with only 0.5MB of L2, and again nothing in /proc/cpuinfo.

And there is no directory /sys/devices/system/cpu/cpu0/cache.

Can I easily set the L2 cache size during compilation? Or should I change the value in getarch.c?

RPi4:

processor       : 0
BogoMIPS        : 108.00
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3
...

RPi3:

processor       : 0
BogoMIPS        : 38.40
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4
...

@brada4 (Contributor) commented Aug 23, 2019

Change the 2MB in that file to 1MB and see if it does any good (a make clean is needed afterwards; getarch emits header files that are included by almost every C file in the tree).
Some tool like lscpu or lstopo may detect the caches better, but don't count on that. I only have qemu here, where I can set any value for the caches without any visible impact whatsoever...
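
A minimal sketch of the edit-and-rebuild cycle being suggested here (assuming the OpenBLAS source tree):

# after lowering the 2MB default in getarch.c to 1MB, rebuild from scratch;
# getarch regenerates the headers that nearly every source file includes
make clean
make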

@martin-frbg (Collaborator)

Can you please check what value for L2_SIZE gets written to config.h? The line that brada4 quoted has the default that is used when autodetection is disabled (by supplying a TARGET to make).
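
For example, from the top of the OpenBLAS source tree after a build:

grep -E "L2_SIZE|CORENAME" config.h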

@jfikar (Author) commented Aug 24, 2019

Oh, I see. I changed the values in getarch.c, but the result is the same: still a slowdown from 3 threads, and 4 threads are only as fast as 1. Actually, config.h says:

#define OS_LINUX        1
#define ARCH_ARM64      1
#define C_GCC   1
#define __64BIT__       1
#define PTHREAD_CREATE_FUNC     pthread_create
#define BUNDERSCORE     _
#define NEEDBUNDERSCORE 1
#define ARMV8
#define HAVE_NEON
#define HAVE_VFPV4
#define CORTEXA72
#define L1_CODE_SIZE 49152
#define L1_CODE_LINESIZE 64
#define L1_CODE_ASSOCIATIVE 3
#define L1_DATA_SIZE 32768
#define L1_DATA_LINESIZE 64
#define L1_DATA_ASSOCIATIVE 2
#define L2_SIZE 524288
#define L2_LINESIZE 64
#define L2_ASSOCIATIVE 16
#define DTB_DEFAULT_ENTRIES 64
#define DTB_SIZE 4096
#define CHAR_CORENAME "CORTEXA72"
#define GEMM_MULTITHREAD_THRESHOLD      4

I also compiled OpenBLAS 0.3.7 on the RPi3 (Cortex-A53), which has just 256KB of L2 in getarch.c and the same amount in config.h (in reality it should have 512KB). I then launched it on the RPi4 (Cortex-A72) and I still see the same slowdown. So the problem is probably not L2-cache related.

BTW, neither lscpu nor lstopo shows any cache information.

@brada4 (Contributor) commented Aug 26, 2019

512k is the minimum cache configuration for all cores on that CPU. OpenBLAS treats the value as a per-core cache.
@martin-frbg - is it sane to change config.h and rebuild?

@martin-frbg (Collaborator)

Playing with config.h will not work, as it will get overwritten by the rebuild.
@jfikar what is your HPL.dat? With the example from the "HPL Benchmarking Raspberry PIs" tutorial on howtoforge (N=5040, NB=128, P=1, Q=1) my numbers are not quite as bad as yours but still surprising - I get 4.5, 8.3, 10.4, 7.9 for 1 to 4 cores.
(With passive cooling, CPU temperature as reported by vcgencmd measure_temp reaches 70-72 degrees Celsius. Also, this Pi is currently set up with a graphical desktop, so the 4th core will have other things to do.)

@martin-frbg (Collaborator)

perf for the four-thread case shows the vast majority of time spent in the dgemm kernel, and the hottest instructions there appear to be the prefetches into L1.
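
A sketch of how such a profile can be collected (assuming perf is installed and xhpl was built with symbols):

OMP_NUM_THREADS=4 perf record -g ./xhpl   # sample the four-thread run
perf report                               # the dgemm kernel dominates the profile
perf annotate                             # per-instruction view; the L1 prefetches show up as hottest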

@jfikar (Author) commented Aug 30, 2019

My HPL.dat for my first post used N=8000, NB=200 and P=1, Q=1 for the OpenMP tests. Obviously, you need to change P and Q to 2 etc. when using MPI. If you want to measure peak performance, you can go up to N=20000 with 4GB of RAM, but the test takes longer and you'll probably get thermal throttling without at least a slight air flow.

You can monitor the temperature and frequencies with bcmstat, which also uses vcgencmd. If the temperature gets close to 80°C, the frequency is limited to 1000MHz instead of 1500MHz.

And yes, my tests were also run with a graphical desktop. It would have been better without, but I had the same desktop running during both the 32-bit and 64-bit tests, and I only observe the strange slowdown in 64-bit and only with OpenBLAS. I always run 10-20 tests and take the highest score.

I have also checked the older OpenBLAS versions and I see the same slowdown for all versions 0.3.0 - 0.3.7. Version 0.2.20 gives a compilation error; it probably needs an older gcc.

HPL.dat:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out     output file name (if any)
6           device out (6=stdout,7=stderr,file)
1           # of problems sizes (N)
8000        Ns
1           # of NBs
200         NBs
0           PMAP process mapping (0=Row-,1=Column-major)
1           # of process grids (P x Q)
1           Ps
1           Qs
16.0        threshold
1           # of panel fact
2           PFACTs (0=left, 1=Crout, 2=Right)
1           # of recursive stopping criterium
4           NBMINs (>= 1)
1           # of panels in recursion
2           NDIVs
1           # of recursive panel fact.
2           RFACTs (0=left, 1=Crout, 2=Right)
1           # of broadcast
0           BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1           # of lookahead depth
1           DEPTHs (>=0)
2           SWAP (0=bin-exch,1=long,2=mix)
8          swapping threshold
0           L1 in (0=transposed,1=no-transposed) form
0           U  in (0=transposed,1=no-transposed) form
1           Equilibration (0=no,1=yes)
8           memory alignment in double (> 0)

@martin-frbg (Collaborator)

Looks as if this may have gone wrong as early as 0.3.0 (judging by the dgemm.goto benchmark - 0.2.20 had 3080/5880/7980/8850 MFlops for 1/2/3/4 cores at M,N,K=1000), but I have not had time to properly git bisect it.
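
For reference, a rough sketch of how the dgemm.goto numbers can be reproduced from the OpenBLAS tree (arguments from memory and may need adjusting):

cd benchmark && make dgemm.goto
# square DGEMM around M=N=K=1000, with 1 to 4 threads
for t in 1 2 3 4; do OMP_NUM_THREADS=$t ./dgemm.goto 1000 1000 1; done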

@martin-frbg (Collaborator)

Hmm. Forcing a build for TARGET=ARMV8, the xhpl numbers are more in line with BLIS results:

N GFlops
1 4.8
2 8.4
3 11.2
4 12.3

which is a bit strange as generic ARMV8 uses the generic 2x2 DGEMM kernel, while the Cortex targets have an optimized 8x4 one.
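
That is, starting from a clean tree:

make clean
make TARGET=ARMV8   # use the generic ARMv8 kernels/parameters instead of the autodetected CORTEXA72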

@martin-frbg (Collaborator)

Just transplanting the ARMV8 generic dgemm (and trmm) files and unroll values into KERNEL.CORTEXA72 does not provide any significant improvement.

@martin-frbg (Collaborator)

Seems the GEMM_P and GEMM_Q values are way off the mark (at least for this problem size and hardware); if I just put in the defaults for "other/undetected ARMV8 cores", the Pi4 reaches

N GFlops
1 4.55
2 8.34
3 10.99
4 11.33

(the defaults for "other" are DGEMM_P 160 and DGEMM_Q 128, while the entire CortexAxx family currently gets 256/512)
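
A quick way to compare the two sets of values, assuming they are the DGEMM_DEFAULT_P/Q macros in param.h:

grep -n "DGEMM_DEFAULT_[PQ]" param.h   # compare the generic ARMV8 block with the CORTEXA57/A72 one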

@martin-frbg (Collaborator)

This may be an unintended spillover of A57-specific changes from #696 (in particular the "change BUFFER_SIZE for Cortex A57 to 20MB" bit) to the other supported CortexAxx chips through #1876.

@brada4 (Contributor) commented Sep 18, 2019

My mobile phone has the same problem: linear improvement up to 4 cores, then no gain from the remaining two. Cut from cpuinfo:
Hardware : Qualcomm Technologies, Inc SDM660
Autodetected as armv8

@martin-frbg (Collaborator)

From what I could find, the Snapdragon 660 has a big.LITTLE configuration of 4x 2.2GHz Cortex-A73 plus 4x 1.8GHz Cortex-A53 (or rather the respective derivatives, Kryo 260 "Gold" and "Silver"), so it is not quite the same situation. (And I think there is currently no explicit support for such beasts in getarch, so on a host with two different TARGETs you will get a library built for whichever was active first - normally the slower one if the build started on an idle system, but at least that will keep the library from failing when it encounters the other CPUs at runtime.)

@brada4 (Contributor) commented Sep 19, 2019

Thanks for the tip; Android's /proc/cpuinfo does not show the difference, and it certainly shows six cores. I will check the 'official' API, but that was scanning the same 'file' last time I looked.

@jfikar (Author) commented Sep 20, 2019

I've tried 0.2.20 again; it compiles fine with make -j4 TARGET=CORTEXA57, it seems only the autodetection does not work. Anyway, I see exactly the same slowdown as with 0.3.7. However, TARGET=ARMV8 does not have the slowdown!

The same is true for 0.3.7: CORTEXA53 and CORTEXA72 (the autodetected target) show the slowdown, while ARMV8 does not. I think we now have the answer for the 32-bit behavior, as in 32-bit the autodetection uses ARMV7 and there is no slowdown. It seems to be a CORTEX-target problem in OpenBLAS 0.2.20-0.3.7 (CORTEXA57 in BLIS seems fine).
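
In build terms, the combinations compared were roughly (0.3.7 tree shown; the same commands were used for 0.2.20 with CORTEXA57):

make clean; make TARGET=CORTEXA72   # the autodetected target on the Pi4 - slowdown
make clean; make TARGET=CORTEXA53   # slowdown
make clean; make TARGET=ARMV8       # no slowdown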

I looked at the better performance you got with Ns=5040, NBs=128 and now I think it may be a cache-related problem. For a smaller problem size Ns, the slowdown of OpenBLAS in 64-bit is smaller. For example, for Ns=1000 OpenBLAS on 4 cores is already faster than BLIS. To see the problem I use at least Ns=8000.

Also, the NBs value seems to have a big impact on OpenBLAS in 64-bit: a smaller NBs is usually faster on more cores. If I understand it correctly, it cuts the problem into smaller pieces, which fit the cache better. For BLIS and for 32-bit OpenBLAS, NBs can stay almost the same regardless of the number of cores.

So I'm running the test for NBs=32...256 in steps of 16, which makes 15 different NBs. I run each task ten times and take the fastest run. Unfortunately, it takes a long time. It seems the other values in HPL.dat do not influence the speed much.

This gives us another way to detect the problem: if the best NBs does not decrease with the number of cores, there is no problem; if the best NBs decreases with the number of cores, there is a problem.

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out     output file name (if any)
6           device out (6=stdout,7=stderr,file)
1           # of problems sizes (N)
8000        Ns
15          # of NBs
32 48 64 80 96 112 128 144 160 176 192 208 224 240 256 NBs
0           PMAP process mapping (0=Row-,1=Column-major)
1           # of process grids (P x Q)
1           Ps
1           Qs
16.0        threshold
1           # of panel fact
1           PFACTs (0=left, 1=Crout, 2=Right)
1           # of recursive stopping criterium
4           NBMINs (>= 1)
1           # of panels in recursion
2           NDIVs
1           # of recursive panel fact.
2           RFACTs (0=left, 1=Crout, 2=Right)
1           # of broadcast
0           BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1           # of lookahead depth
0           DEPTHs (>=0)
2           SWAP (0=bin-exch,1=long,2=mix)
8           swapping threshold
0           L1 in (0=transposed,1=no-transposed) form
0           U  in (0=transposed,1=no-transposed) form
1           Equilibration (0=no,1=yes)
8           memory alignment in double (> 0)

I also stopped the desktop environment (systemctl set-default multi-user.target + restart) and cron, and switched the CPU governor to performance. My bash script:

#!/bin/bash

if [ -f /usr/bin/systemctl ];
then
        sudo systemctl stop cron.service
fi

if [ -f /etc/init.d/dcron ];
then
        sudo /etc/init.d/dcron stop
fi

if [ -f /etc/init.d/cronie ];
then
        sudo /etc/init.d/cronie stop
fi


# flush dirty pages and drop the page cache so every run starts from a comparable state
sync
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

# switch all cores to the performance governor for the duration of the benchmark
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance | sudo tee ${i} > /dev/null
done

TEMP_FILE=`mktemp`
unset OMP_NUM_THREADS
unset BLIS_NUM_THREADS

for j in {1..4}; do
        echo -n $j "threads: "
        rm $TEMP_FILE
        for i in {1..10}; do
                # keep only the result lines (they start with WR); note that $? below
                # reflects grep in the pipeline, so a run producing no WR line also aborts
                OMP_NUM_THREADS=$j BLIS_NUM_THREADS=$j ./xhpl | grep WR >> $TEMP_FILE
                if [[ $? -ne 0 ]]; then
                        echo "Executable failed, abort test loop!"
                        exit 1
                fi
                echo -n "."
        done
        echo " "
        # field 6 of the WR line is the wall time, so the smallest value is the fastest run
        sort -n -k 6 $TEMP_FILE | head -1
done
rm $TEMP_FILE

# restore the default governor
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo ondemand | sudo tee ${i} > /dev/null
done

if [ -f /usr/bin/systemctl ];
then
        sudo systemctl start cron.service
fi

if [ -f /etc/init.d/dcron ];
then
        sudo /etc/init.d/dcron start
fi

if [ -f /etc/init.d/cronie ];
then
        sudo /etc/init.d/cronie start
fi

Results for Ns=8000

OpenBLAS 0.3.7, 64bit, CORTEXA72 = problem

cores NBs GFlops
1 256 4.9
2 176 9.0
3 96 11.5
4 64 12.5

OpenBLAS 0.3.7, 64bit, ARMV8 = ok

cores NBs GFlops
1 256 4.7
2 256 8.7
3 256 11.9
4 224 13.1

BLIS 0.6.0, 64bit, CORTEXA57 = ok

cores NBs GFlops
1 240 4.7
2 240 8.7
3 240 12.1
4 208 14.3

Also, there is now an easier way to run 64-bit Gentoo, with the kernel and everything in a single image file. Moreover, if you use the Lite image, there is no desktop environment, which is better for benchmarking.

@martin-frbg (Collaborator)

I suspect the CortexA57 change from #696 four years ago was made with a different class of hardware in mind (@ashwinyes?). The much more recent rewrite then applied it to all supported Cortex CPUs, and the Pi is probably just too small to cope with it. (Makes me wonder if it will become necessary to take other hardware properties into consideration when picking such compile-time parameters.)

@martin-frbg (Collaborator)

From recent discussions in #1976 it is clear that the current choice of parameters was driven by the requirements of server-class systems, while the original set was more appropriate for small systems.
Unfortunately it appears to be next to impossible to query an ARM system for cache sizes, so unless the kernel already knows how to associate the SoC ID with a particular configuration to report through /proc/cpuinfo, there is little one can do to identify appropriate values except by trial and error. My current thinking is to use the readily available number of cores in a system as a proxy for the likely system configuration, and to use the "old" parameters for everything reporting fewer than about 10 cores.
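
Sketched very roughly, and purely to illustrate the core-count proxy (hypothetical, not the actual implementation):

# pick the blocking parameters at build time based on the core count of the build host
if [ "$(nproc)" -lt 10 ]; then
        echo "small system: use the old DGEMM_P=160 / DGEMM_Q=128 style parameters"
else
        echo "server-class system: keep the current DGEMM_P=256 / DGEMM_Q=512 parameters"
fi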
