Simplifying ARMv8 build parameters #1876

Merged (1 commit, Nov 25, 2018)

Conversation

rengolin (Contributor)

ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode
(which is not right because TX2 is ARMv8.1) as well as requiring a few
redundancies in the defines, making it harder to maintain and understand
what core has what. A few other minor issues were also fixed.

Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX,
ThunderX2, and XGene.

Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.

A summary:

  • Removed TX2 code from ARMv8 build, to make sure it is compatible with
    all ARMv8 cores, not just v8.1. Also, the TX2 code has actually
    harmed performance on big cores.
  • Commoned up ARMv8 architectures' defines in params.h, to make sure
    that all will benefit from ARMv8 settings, in addition to their own.
  • Added a few more cores, using ARMv8's include strategy, so they
    benefit from compiler optimisations via mtune. Also updated cache
    information from the manuals, making sure we set good conservative
    values by default. Removed Vulcan, as it's an alias for TX2.
  • Auto-detected most of those cores, and also updated the forced
    compilation in getarch.c, to make sure the parameters are the same
    whether compiled natively or with a forced arch.

Benefits:

  • ARMv8 build is now guaranteed to work on all ARMv8 cores
  • Improved performance for ARMv8 builds on some cores (A72, Falkor,
    ThunderX1 and 2: up to 11%) over current develop
  • Improved performance for all cores compared to the develop branch
    before TX2's patch (9% ~ 36%)
  • ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than
    the current develop branch, and 8% faster than develop before the
    TX2 patches

Issues:

  • Regression from current develop branch for A53 (-12%) and A57 (-3%)
    with ARMv8 builds, but still faster than before TX2's commit (+15%
    and +24% respectively). This can be improved with a simplification of
    TX2's code, to be done in future patches. At least the code is
    guaranteed to be ARMv8.0 now.

Comments:

  • CortexA57 builds are unchanged on A57 hardware from the develop
    branch, which makes sense, as that code path is untouched.
  • CortexA72 builds improve over A57 builds on A72 hardware, even
    though they use the same includes, thanks to the new compiler
    tuning in the Makefile.

@brada4 (Contributor) commented Nov 19, 2018

CI just timed out, you can close and re-open the request to trigger build again.

@martin-frbg (Collaborator)

Pointless to retrigger CI - Appveyor does not build ARM code anyway, and we already know that the qemu-system-arm builds time out.

@brada4 (Contributor) commented Nov 19, 2018

Also.

@martin-frbg (Collaborator)

@rengolin, what hardware did you use to test A72 performance back then? Using the A57 GEMM_P and GEMM_Q parameters for all Cortex chips seems to have caused a significant drop in multithreading performance, at least on small systems like the new Raspi4.

@rengolin (Contributor, Author)

I used a Huawei server with 16 cores and a MacchiatoBin with 4 cores.

Reading the original bug report, it sounds like a cache-thrashing issue, not a core-specific problem. I guess the Pi4 has a smaller cache than desktop- and server-class hardware. Perhaps more conservative cache sizes would benefit more common platforms like the Pi4.

Note that both the Huawei server and the MacchiatoBin are deprecated hardware, so it wouldn't be a massive problem if they got slightly slower in order to improve the Pi4.

@martin-frbg (Collaborator)

Thanks, that is about what I assumed (and I bet Ashwin was using server-class hardware as well when he increased the parameters for A57 - only back then this change affected just that one target).
Guess we really need some means of detecting the actual cache available, and picking parameters accordingly.

@rengolin (Contributor, Author)

That'd be great, actually. We have been working to get all that info into /proc/cpuinfo, so for newer hardware it should be as easy as for Intel and others. Alas, I just checked on the Pi4 with kernel 4.19 and it's not there. :(

Here's some more info on the Pi4 chip:
https://www.raspberrypi.org/forums/viewtopic.php?p=1499285

But you still need some sensible default so that people don't get surprised by things like a performance drop at high core counts. Also, distros need to build a package that will "work" and not "surprise" people. For that, the minimum sensible options are best. This is what I tried to do, but apparently the Pi4 is worse than those chips.

The main difference is:

  • L1 Associativity is 2, not 3
  • L1 size is 16k, not 32k
  • L2 size is 64k, not 512k

Perhaps changing those values would make scaling better on the Pi4, but it would also make things slower across the board (A57, A72, A73...). If that works for the bug report, then we need to decide how to get that in without killing performance for all the other chips.

I personally don't think the default being slow is a huge problem, as long as it works and is predictable. People that care about performance will always recompile anyway. But for that to work, you really need a way to feed in the cache info, which is not trivial.

Even on Intel, you only have "some" info, like cache sizes. You need to dig through the web, obscure documents, and forum posts to get any real info. But in the Arm world it's worse, because of the multitude of vendors.

The RaspberryPi is even worse, because the Broadcom chip was essentially discontinued (that's why it was dirt cheap), and to keep it cheap they still have to take the refuse of the lowest standard, where gathering information in a sane way would make the chip cost $10 more. :)

Jokes aside, keeping a "database" of configurations would be a mess in OpenBLAS, and you can't always rely on cpuinfo. Perhaps a simple set of small headers (i.e. one header per chip, not per architecture) and a way to include them via a build option (separate from march) would be quick and dirty, but it would work.

We could also have a "custom.h" where people put on their own options and that gets inlined in the right headers. I know it's not pretty, but the options are not great to begin with.

If neither that option nor march is chosen, then the standard architecture detection kicks in and we get the minimum defaults.

@rengolin (Contributor, Author)

Also, Ashwin was using a ThunderX2 to derive parameters for all the others, and that's why too many things broke. In the same way, I assumed some defaults based on the hardware I had; although I tested on each actual architecture, the micro-architecture changes a lot from vendor to vendor, which is why this is such a pain to fix.

@martin-frbg (Collaborator)

Hmm. Actually it seems to me that things worked "better" (from a small, common devices perspective) when almost everything was ARMV8 with mostly minimum defaults and only A57 and the ThunderX chips assumed a "big" system. The easiest solution would be to drop all Cortex GEMM_P, GEMM_Q to the old, small values, but I still hope to find some middle ground that does not penalize servers. If there is some effort to get it into /proc/cpuinfo or /sys/devices/system/cpu, I assume there must be some assembly instructions to query the cache size (which could be added to cpuid_arm64.c)?
(And I do not think making things fully configurable would help much - a good number of issues appear to come from people who have little idea of - and little interest in knowing - what BLAS actually is; they just want to have it set up for some image or speech recognition package that depends on it.)

@rengolin (Contributor, Author)

The motivation for my contribution was driven by companies and labs using Arm HPC clusters, not small devices. I understand there is a big market of devices running embedded ML workloads, but Arm is gaining traction in the data centre and HPC worlds, so perhaps getting some input from them would help.

Unfortunately, I don't work with Arm hardware anymore, so I don't have a way to test it again at the same level of detail as I did before. But if they don't participate in the community, then we have to take the decisions that make sense for those who do. I wish I could help more, but the hardware I have access to now is only development boards, no longer servers.

On the automatic detection side, unlike on Intel, there are no Arm assembly instructions to get that information from the chip, because those parameters depend on the micro-architecture, not the architecture. The effort to put that into cpuinfo lives in the kernel: hard-coded headers, device-tree files, etc.

Seems Pi4 is not there yet:
https://github.com/torvalds/linux/tree/master/arch/arm64/boot/dts/broadcom

@martin-frbg (Collaborator)

Quite frustrating for someone more used to other architectures (and sorry, I did not mean to criticise you or your work). Perhaps some kludge could be based on a detectable quantity like the number of cores - assuming that Arm HPC clusters are unlikely to sport fewer than 4, or maybe even 8, CPUs?

@rengolin (Contributor, Author)

Don't worry, I didn't take it as criticism. :)

It's very frustrating, and it's the bad side of the competition between the different Arm vendors. The good side (innovation, differentiation) outweighs the bad, but it doesn't make it go away.

You are quite right that Arm servers are extremely unlikely to have fewer than 8 cores, or even 16. I wager none will have fewer than 32 real cores, with some form of hardware threading. Some of them already support 256 threads per system.

But I think looking at the number of cores would make the heuristics brittle, because this is likely a cache-thrashing issue. Even on a big server, if the threads run on the same core cluster, they'll behave as if they were on 4 cores. And as you increase the number of software threads, performance will drop in the same way regardless of the number of cores.

I think assuming Pi4's cache config as the base for A72 (i.e. lowering the current sizes) makes sense for two reasons: first, it will be slower on servers, but it will scale correctly everywhere, so no one is surprised by scalability, mobile or server; and second, we can't cover all cases anyway, so it's much easier for people to carry patches for their own special servers than for us to cater for every little case.

Automatic detection, using table lookup, device-tree or magic, is nice to have, but isn't sustainable, because we don't have a CI that runs on all the hardware out there for every commit.

@brada4 (Contributor) commented Sep 21, 2019

Is the per-cpu share of outer cache detectable from Linux userland on Rpi4 or on current/upcoming 2^N-core servers?

@rengolin (Contributor, Author)

Without kernel support, no. cpuinfo is perhaps the only way (I'm not an expert), so if it's not implemented, you'll have to guess. One could craft a program that multiplies matrices of varying sizes with varying numbers of threads and guesses the L1/L2/L3 caches from drops in performance, but that's a whole other story.

Pi4's cache config is detailed in the link I shared in this thread, and other chips sometimes do the same. So it's not totally unreasonable to expect people to fill in those numbers into some config/header file.

The recommendation is to patch the kernel before shipping a product, but not all vendors do that. Big server vendors are better in that respect than small development-board ones.

@brada4 (Contributor) commented Sep 21, 2019

Completely agree with you. Even with kernel support present, OpenBLAS lacks the machinery to dynamically adapt to different cache configurations. Where I see the prospect for improvement is that the 'just type "make"' style of build could pick those cache configs from cpuinfo and, in their absence, fall back to the smallest values?

@rengolin (Contributor, Author)

Yup, that'd be a good simple way.

@brada4 (Contributor) commented Sep 21, 2019

That essentially fixes the popular consumer product in its present state. Thanks.

3 participants