Simplifying ARMv8 build parameters #1876

Merged (1 commit, Nov 25, 2018)

Conversation

rengolin (Contributor)

ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode
(which is not right because TX2 is ARMv8.1) as well as requiring a few
redundancies in the defines, making it harder to maintain and understand
what core has what. A few other minor issues were also fixed.

Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX,
ThunderX2, and XGene.

Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.

A summary:

  • Removed TX2 code from ARMv8 build, to make sure it is compatible with
    all ARMv8 cores, not just v8.1. Also, the TX2 code has actually
    harmed performance on big cores.
  • Commoned up ARMv8 architectures' defines in params.h, to make sure
    that all will benefit from ARMv8 settings, in addition to their own.
  • Added a few more cores, using ARMv8's include strategy, so they
    benefit from compiler optimisations via mtune. Also updated cache
    information from the manuals, making sure we set good conservative
    values by default. Removed Vulcan, as it's an alias for TX2.
  • Auto-detected most of those cores, and also updated the forced
    compilation in getarch.c, to make sure the parameters are the same
    whether compiled natively or with a forced arch.

Benefits:

  • ARMv8 build is now guaranteed to work on all ARMv8 cores
  • Improved performance for ARMv8 builds on some cores (A72, Falkor,
    ThunderX1 and 2: up to 11%) over current develop
  • Improved performance for all cores compared to the develop branch
    before TX2's patch (9% ~ 36%)
  • ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than
    the current develop branch, and 8% faster than develop before the
    TX2 patches

Issues:

  • Regression from current develop branch for A53 (-12%) and A57 (-3%)
    with ARMv8 builds, but still faster than before TX2's commit (+15%
    and +24% respectively). This can be improved with a simplification of
    TX2's code, to be done in future patches. At least the code is
    guaranteed to be ARMv8.0 now.

Comments:

  • CortexA57 builds are unchanged on A57 hardware from the develop
    branch, which makes sense, as that code path is untouched.
  • CortexA72 builds improve over A57 builds on A72 hardware, even
    though they use the same includes, thanks to the new compiler
    tuning in the Makefile.

@brada4 (Contributor) commented Nov 19, 2018

CI just timed out, you can close and re-open the request to trigger build again.

@martin-frbg (Collaborator)

Pointless to retrigger CI - Appveyor does not build ARM code anyway, and we already know that the qemu-system-arm builds time out.

@brada4 (Contributor) commented Nov 19, 2018

Also.

@martin-frbg (Collaborator)

@rengolin, what hardware did you use to test A72 performance back then? Using the A57 GEMM_P and GEMM_Q parameters for all Cortex chips seems to have caused a significant drop in multithreading performance, at least on small systems like the new Raspi4.

@rengolin (Contributor, Author)

I used a Huawei server with 16 cores and a MacchiatoBin with 4 cores.

Reading the original bug report, it sounds like a cache-thrashing issue, not a core-specific problem. I guess the Pi4 has a smaller cache than desktop- and server-class hardware. Perhaps more conservative cache sizes would benefit more common platforms like the Pi4.

Note that both the Huawei server and the MacchiatoBin are deprecated hardware, so it wouldn't be a massive problem if they got slightly slower in order to improve the Pi4.

@martin-frbg (Collaborator)

Thanks, that is about what I assumed (and I bet Ashwin was using server-class hardware as well when he increased the parameters for A57 - only back then this change affected just that one target).
Guess we really need some means of detecting the actual cache available, and picking parameters accordingly.

@rengolin (Contributor, Author)

That'd be great, actually. We have been working to get all that info into /proc/cpuinfo, so for newer hardware it should be as easy as for Intel and others. Alas, I just checked on the Pi4 with kernel 4.19 and it's not there. :(

Here's some more info on the Pi4 chip:
https://www.raspberrypi.org/forums/viewtopic.php?p=1499285

But you still need some sensible default so that people don't get surprised by things like a performance drop at high core counts. Also, distros need to build a package that will "work" and not "surprise" people. For that, the minimum sensible options are best. This is what I tried to do, but apparently the Pi4 is worse than those chips.

The main difference is:

  • L1 Associativity is 2, not 3
  • L1 size is 16k, not 32k
  • L2 size is 64k, not 512k

Perhaps changing those values would make scaling better on the Pi4, but it would also make things slower across the board (A57, A72, A73...). If that works for the bug report, then we need to decide how to get that in without killing performance for all the other chips.

I personally don't think the default being slow is a huge problem, as long as it works and is predictable. People that care about performance will always recompile anyway. But for that to work, you really need a way to feed in the cache info, which is not trivial.

Even on Intel, you only have "some" info, like cache sizes. You need to dig through the web, obscure documents, and forum posts to get any real info. But in the Arm world it's worse, because of the multitude of vendors.

The RaspberryPi is even worse, because the Broadcom chip was essentially discontinued (that's why it was dirt cheap), and to keep it cheap they still have to take the refuse of the lowest standard, where gathering information in a sane way would make the chip cost $10 more. :)

Jokes aside, keeping a "database" of configurations would be a mess in OpenBLAS, and you can't always rely on cpuinfo. Perhaps a simple set of small headers (i.e. one header per chip, not per architecture) and a way to include them via a build option (separate from march) would be quick and dirty, but it would work.

We could also have a "custom.h" where people put on their own options and that gets inlined in the right headers. I know it's not pretty, but the options are not great to begin with.

If neither that option nor march is chosen, then the standard architecture detection kicks in and we get the minimum defaults.

@rengolin (Contributor, Author)

Also, Ashwin was using a ThunderX2 to derive parameters for all the others, and that's why too many things broke. In the same way, I assumed some defaults based on the hardware I had; although I tested on each actual architecture, the micro-architecture changes a lot from vendor to vendor, which is why this is such a pain to fix.

@martin-frbg (Collaborator)

Hmm. Actually it seems to me that things worked "better" (from a small, common devices perspective) when almost everything was ARMV8 with mostly minimum defaults and only A57 and the ThunderX chips assumed a "big" system. The easiest solution would be to drop all Cortex GEMM_P, GEMM_Q to the old, small values, but I still hope to find some middle ground that does not penalize servers. If there is some effort to get it into /proc/cpuinfo or /sys/devices/system/cpu, I assume there must be some assembly instructions to query the cache size (which could be added to cpuid_arm64.c)?
(And I do not think making things fully configurable would help much - a good number of issues appear to come from people who have little idea of - and little interest in knowing - what BLAS actually is; they just want to have it set up for some image or speech recognition package that depends on it.)

@rengolin (Contributor, Author)

The motivation for my contribution was driven by companies and labs using Arm HPC clusters, not small devices. I understand there is a big market of devices running embedded ML workloads, but Arm is gaining traction in the data centre and HPC worlds, so perhaps getting some input from them would help.

Unfortunately, I don't work with Arm hardware anymore, so I don't have a way to test it again at the same level of detail as I did before. But if they don't participate in the community, then we have to take the decisions that make sense for those who do. I wish I could help more, but the hardware I have access to now is only development boards, no longer servers.

On the automatic detection side, unlike on Intel, there are no Arm assembly instructions to get that information from the chip, because those parameters depend on the micro-architecture, not the architecture. The effort to put that into cpuinfo lives in the kernel: hard-coded headers, device-tree files, etc.

Seems Pi4 is not there yet:
https://github.com/torvalds/linux/tree/master/arch/arm64/boot/dts/broadcom

@martin-frbg (Collaborator)

Quite frustrating for someone more used to other architectures (and sorry, I did not mean to criticise you or your work). Perhaps some kludge could be based on a detectable quantity like the number of cores - assuming that Arm HPC clusters are unlikely to sport fewer than 4, or maybe even 8, CPUs?

@rengolin (Contributor, Author)

Don't worry, I didn't take it as criticism. :)

It's very frustrating, and it's the bad side of the competition between the different Arm vendors. The good side (innovation, differentiation) outweighs the bad, but it doesn't make it go away.

You are quite right that Arm servers are extremely unlikely to have fewer than 8 cores, or even 16. I wager none will have fewer than 32 real cores, with some form of hardware threading. Some of them already support 256 threads per system.

But I think looking at the number of cores would make the heuristics brittle, because this is likely a cache-thrashing issue. Even on a big server, if the threads run on the same core cluster, they'll behave as if they were on 4 cores. And as you increase the number of software threads, performance will drop in the same way regardless of the number of cores.

I think assuming Pi4's cache config as the base for A72 (i.e. lowering the current sizes) makes sense for two reasons: first, it will be slower on servers, but it will scale correctly everywhere, so no one is surprised by scalability, mobile or server; and second, we can't cover all cases anyway, so it's much easier for people to carry patches for their own special servers than for us to cater for every little case.

Automatic detection, using table lookup, device-tree or magic, is nice to have, but isn't sustainable, because we don't have a CI that runs on all the hardware out there for every commit.

@brada4 (Contributor) commented Sep 21, 2019

Is the per-cpu share of outer cache detectable from Linux userland on Rpi4 or on current/upcoming 2^N-core servers?

@rengolin (Contributor, Author)

Without kernel support, no. cpuinfo is perhaps the only way (I'm not an expert), so if it's not implemented, you'll have to guess. One could craft a program that multiplies matrices of varying sizes with varying numbers of threads and guesses the L1/L2/L3 caches from drops in performance, but that's a whole other story.

Pi4's cache config is detailed in the link I shared in this thread, and other chips sometimes do the same. So it's not totally unreasonable to expect people to fill in those numbers into some config/header file.

The recommendation is to patch the kernel before shipping a product, but not all vendors do that. Big server vendors are better in that respect than small development-board ones.

@brada4 (Contributor) commented Sep 21, 2019

Completely agree with you. Even with kernel support present, OpenBLAS lacks the machinery to dynamically adapt to different cache configurations. Where I see the prospect for improvement is that the 'just type "make"' style of build could pick those cache configs from cpuinfo and, in their absence, fall back to the smallest values?

@rengolin (Contributor, Author)

Yup, that'd be a good simple way.

@brada4 (Contributor) commented Sep 21, 2019

That essentially fixes the popular consumer product in its present state. Thanks.

3 participants