Simplifying ARMv8 build parameters #1876
Conversation
ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right, because TX2 is ARMv8.1), as well as requiring a few redundancies in the defines, making it harder to maintain and understand which core has what. A few other minor issues were also fixed.

Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.

A summary:

* Removed TX2 code from the ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code had actually harmed performance on big cores.
* Commoned up the ARMv8 architectures' defines in params.h, to make sure that all of them benefit from the ARMv8 settings, in addition to their own.
* Added a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good, conservative values by default. Removed Vulcan, as it's an alias for TX2.
* Auto-detecting most of those cores, but also updated the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or with a forced arch.

Benefits:

* The ARMv8 build is now guaranteed to work on all ARMv8 cores.
* Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop.
* Improved performance for *all* cores compared to the develop branch before TX2's patch (9% ~ 36%).
* ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than the current develop branch, and 8% faster than develop before the TX2 patches.

Issues:

* Regression from the current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now.

Comments:

* CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched.
* CortexA72 builds improve over A57 on A72 hardware, even though they use the same includes, due to the new compiler tuning in the Makefile.
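To illustrate the "commoned up" defines idea, here is a minimal params.h-style sketch. The layout (a shared ARMv8 baseline that individual cores override) reflects the description above, but the core list and the numeric values are placeholders, not the ones in this PR:

```c
/* Sketch of the commoned-up params.h layout; the numbers are
 * illustrative placeholders, not the values used in this PR. */
#if defined(ARMV8) || defined(CORTEXA53) || defined(CORTEXA57) || \
    defined(CORTEXA72) || defined(CORTEXA73) || defined(FALKOR)

/* Conservative ARMv8.0 baseline that every core inherits. */
#define SGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 240

/* Core-specific overrides only where the micro-architecture is
 * known to benefit; everything else keeps the baseline. */
#if defined(CORTEXA57) || defined(CORTEXA72)
#undef  SGEMM_DEFAULT_P
#undef  SGEMM_DEFAULT_Q
#define SGEMM_DEFAULT_P 512   /* larger blocks for larger caches */
#define SGEMM_DEFAULT_Q 1024
#endif

#endif /* ARMv8 family */
```

With this shape, a forced TARGET build stays in sync with native detection, because both paths end up selecting the same macro block.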
CI just timed out; you can close and re-open the request to trigger the build again.
Pointless to retrigger CI - Appveyor does not test-build ARM code anyway, and we already know that the qemu-system-arm builds time out.
@rengolin what hardware did you use to test A72 performance back then? Using the A57 GEMM_P and GEMM_Q parameters for all Cortex chips seems to have caused a significant drop in multithreading performance, at least on small systems like the new Raspi4.
I used a Huawei server with 16 cores and a MacchiatoBin with 4 cores. Reading the original bug report, it sounds like a cache-thrashing issue, not a core-specific problem. I guess the Pi4 has a smaller cache than desktop and server class hardware. Perhaps a more conservative cache size would be beneficial to more common platforms like the Pi4. Note that both the Huawei server and the MacchiatoBin are deprecated hardware, so it wouldn't be a massive problem if they got slightly slower in order to improve the Pi4.
Thanks, that is about what I assumed (and I bet Ashwin was using server-class hardware as well when he increased the parameters for A57 - only back then that change affected just that one target).
That'd be great, actually. We have been working to get all that info into /proc/cpuinfo, so for newer hardware it should be as easy as for Intel and others. Alas, I just checked on the Pi4 with kernel 4.19 and it's not there. :( Here's some more info on the Pi4 chip:

But you still need some sensible default, so that people don't get surprised by things like a performance drop at high core counts. Also, distros need to build a package that will "work" and not "surprise" people. For that, the minimum sensible options are best. This is what I tried to do, but apparently the Pi4 is worse than those chips. The main difference is:
Perhaps changing those values would make scaling better on the Pi4, but it would also make it slower across the board (57, 72, 73...). If that works for the bug report, then we need to decide how to get that in without killing performance for all the other chips. I personally don't think the default being slow is a huge problem, as long as it works and is predictable. People who care about performance will always recompile anyway.

But for that to work, you really need a way to feed in the cache info, which is not trivial. Even on Intel, you only have "some" info, like cache sizes. You need to dig through the web, obscure documents and forum posts to get any real info. But in the Arm world it's worse, because of the multitude of vendors. The RaspberryPi is even worse, because the Broadcom chip was essentially discontinued (that's why it was dirt cheap), and to keep it cheap, they still need to get the refuse of the lowest standard, where gathering information in a sane way would make the chip cost $10 more. :)

Jokes aside, keeping a "database" of configurations would be a mess in OpenBLAS, and you can't always rely on cpuinfo. Perhaps a simple set of small headers (i.e. one header per chip, not per architecture) and a way to include them via a build option (separate from march) would be quick and dirty, but it would work. We could also have a "custom.h" where people put in their own options and that gets inlined in the right headers. I know it's not pretty, but the options are not great to begin with. If neither that option nor march is chosen, then the standard architecture detection kicks in and we get the minimum defaults.
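As a rough sketch of that per-chip header idea: the build flag, file names, and numbers below are all hypothetical (nothing here is an existing OpenBLAS option), but it shows how a user-supplied header could get the first word while the minimum defaults remain the fallback:

```c
/* Hypothetical wiring in params.h. USE_CUSTOM_PARAMS is an invented
 * build flag; "custom.h" is the user-filled per-chip header proposed
 * above. The fallback numbers are illustrative only. */
#ifdef USE_CUSTOM_PARAMS
#include "custom.h"              /* user fills in their chip's numbers */
#endif

/* Anything custom.h didn't set falls back to the minimum defaults. */
#ifndef L1_DATA_SIZE
#define L1_DATA_SIZE (32 * 1024)
#endif
#ifndef L2_SIZE
#define L2_SIZE (512 * 1024)
#endif
```

A per-chip header (say, one for the Pi4's chip) would then just be a handful of such defines, copied from the vendor's documentation.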
Also, Ashwin was using a ThunderX2 to assume parameters for all the others, and that's why too many things broke. In the same way, I assumed some defaults based on the hardware I had; albeit I tested on each actual architecture, the micro-architecture changes a lot from vendor to vendor, which is why this is such a pain to fix.
Hmm. Actually, it seems to me that things worked "better" (from a small, common-devices perspective) when almost everything was ARMV8 with mostly minimum defaults, and only A57 and the ThunderX chips assumed a "big" system. The easiest solution would be to drop all Cortex GEMM_P, GEMM_Q values back to the old, small ones, but I still hope to find some middle ground that does
The motivation for my contribution was driven by companies and labs using Arm HPC clusters, not small devices. I understand that there is a big market of devices running some embedded ML stuff, but Arm is getting traction in the data centre and HPC worlds, so perhaps getting some input from them would help. Unfortunately, I don't work with Arm hardware anymore, so I don't have a way to test it again with the same level of detail as I did before. But if they don't participate in the community, then we have to take the decisions that make sense for those that do. I wish I could help more, but the hardware I have access to now is only development boards, not servers any more.

On the automatic detection side, unlike Intel, there are no Arm assembly instructions to get the information from the chip, because those parameters depend on the micro-architecture, not the architecture. The effort to put that into cpuinfo is in the kernel: hard-coded headers, device-tree files, etc. Seems the Pi4 is not there yet:
Quite frustrating for someone more used to other architectures (and sorry, I did not mean to criticise you or your work). Perhaps some kludge could be used based on some detectable quantity like the number of cores - assuming that Arm HPC clusters are unlikely to sport fewer than 4 or maybe even 8 CPUs?
Don't worry, I didn't see it as criticism. :) It's very frustrating, and it's the bad side of the competition between the different Arm vendors. The good side (innovation, differentiation) outweighs the bad, but it doesn't make it go away.

You are quite right that Arm servers are extremely unlikely to have less than 8 cores, or even 16. I wager none will have less than 32 real cores, with some form of hardware threading. Some of them can already support 256 threads per system. But I think looking at the number of cores will make the heuristics brittle, because this is likely a cache-thrashing issue. Even on a big server, if the threads run on the same core cluster, they'll have the same effect as if they were on 4 cores. And as you increase the number of software threads, performance will drop in the same way for whatever number of cores.

I think assuming the Pi4's cache config as the base for A72 (i.e. lowering the current sizes) makes sense for two reasons: first, it will be slower on servers, but it will scale correctly everywhere, so no one is surprised by scalability, mobile or server; and second, we can't cover all cases anyway, so it's much easier for people to carry patches for their own special servers than for us to try to cater for every little case. Automatic detection, using table lookup, device-tree or magic, is nice to have, but isn't sustainable, because we don't have a CI that runs on all the hardware out there for every commit.
Is the per-cpu share of the outer cache detectable from Linux userland on the RPi4 or on current/upcoming 2^N-core servers?
Without kernel support, no. cpuinfo is perhaps the only way (I'm not an expert), so if it's not implemented, you will have to guess. One could craft a program that multiplies matrices of varying sizes with varying numbers of threads to guess the L1/L2/L3 cache sizes from drops in performance, but that's a whole other story. The Pi4's cache config is detailed in the link I shared in this thread, and other chips sometimes do the same. So it's not totally unreasonable to expect people to fill in those numbers in some config/header file. The recommendation is to have patches in the kernel before one ships a product, but not all vendors do that. Big server vendors are better in that respect than little development board ones.
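A minimal sketch of that guessing program, using pointer-chasing rather than matrix multiplication: walk a random cycle (so the hardware prefetcher can't hide miss latency) over working sets of increasing size, and look for jumps in per-load latency as each cache level is exceeded. Timings will be noisy; frequency scaling blurs the steps:

```c
/* Rough cache-size probe: time random pointer-chasing over growing
 * working sets; latency jumps hint at the L1/L2/L3 boundaries. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    for (size_t kb = 4; kb <= 32 * 1024; kb *= 2) {
        size_t n = kb * 1024 / sizeof(size_t);
        size_t *buf = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++) buf[i] = i;
        /* Sattolo's shuffle: one big cycle, defeats the prefetcher. */
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
        }
        const size_t steps = 10 * 1000 * 1000;
        size_t idx = 0;
        double t0 = now_ns();
        for (size_t s = 0; s < steps; s++) idx = buf[idx];
        double t1 = now_ns();
        /* Printing idx keeps the chase loop from being optimised away. */
        printf("%6zu KB: %5.2f ns/load (%zu)\n", kb, (t1 - t0) / steps, idx);
        free(buf);
    }
    return 0;
}
```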
Completely agree with you. Even with kernel support present, OpenBLAS lacks the machinery to dynamically adapt to different cache configurations. Where I see the prospect for improvement is that a 'just type "make"' style build could pick those cache configs from cpuinfo and, in their absence, fall back to the smallest values?
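A sketch of that fallback logic, reading the Linux sysfs cacheinfo files, which exist only when the kernel exposes them (as noted above, the Pi4's 4.19 kernel does not), and otherwise dropping to conservative defaults. The default values here are illustrative, not OpenBLAS's:

```c
/* Read cache sizes from sysfs when the kernel exposes them,
 * otherwise fall back to the smallest sensible defaults. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEFAULT_L1D (32 * 1024)    /* illustrative fallbacks */
#define DEFAULT_L2  (512 * 1024)

static long cache_size(int index, long fallback) {
    char path[128], buf[32];
    /* index0 is typically L1d, index2 typically L2. */
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu0/cache/index%d/size", index);
    FILE *f = fopen(path, "r");
    if (!f || !fgets(buf, sizeof(buf), f)) {
        if (f) fclose(f);
        return fallback;           /* kernel doesn't expose cacheinfo */
    }
    fclose(f);
    long val = atol(buf);          /* sysfs prints e.g. "32K" or "1M" */
    if (strchr(buf, 'M')) val *= 1024 * 1024;
    else if (strchr(buf, 'K')) val *= 1024;
    return val;
}

int main(void) {
    printf("L1d: %ld bytes\nL2:  %ld bytes\n",
           cache_size(0, DEFAULT_L1D), cache_size(2, DEFAULT_L2));
    return 0;
}
```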
Yup, that'd be a good simple way.
That essentially fixes the popular consumer product in its present state. Thanks.