-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Use ThunderX2 Neon Kernels for ARMV8 Target #1821
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ThunderX2 Neon Kernels for ARMV8 Target #1821
Conversation
Currently the generic ARMV8 target uses C implementations for many routines. Replace these with the neon implementations written for THUNDERX2T99 target which are upto 6x faster for certain routines.
Am I misreading the diff, or does this now make CORTEX A57 an "inferior" target compared to generic V8 ? |
The need for this change is that packages like openhpc builds the OpenBLAS package with the ARMV8 target and then complains of poor performance. With this change ARMV8 target can take advantage of all the optimizations that I had done for THUNDERX2T99. The idea is to keep all other targets the same and change the ARMV8 target to use THUNDERX2T99 After this change, all other targets performs the same, while ARMV8 target performance improves. The next step is to look at implementing DYNAMIC_ARCH for arm64 like x86. |
So there is no fundamental limitation in A57 hardware that would keep it from using the newer code ? Would it make sense to retire A57 (at least in the cpuid_arm64 autodetection code), or are the performance differences between "old" optimized A57 assembly and "new" THUNDERX2T99 code expected to be negligible ? |
Right now, the THUNDERX2T99 code can be used by any generic ARMv8 target like CortexA57. To be more precise, THUNDERX2T99 is ARMv8.1 compliant while cortexa57 is ARMv8.0 compliant (https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores). But the THUNDERX2T99 code in openblas uses noene of the 8.1 features. And the difference in performance between CORTEXA57 and THUNDERX2 targets on a A57 machine should be negligible. It doesn't make sense to retire CortexA57. There are subtle differences between different ARM64 cores' microarchitectures which makes it necessary to keep separate targets. And thanks for merging. |
Thanks a lot for this information. Just trying to learn more about the platform. |
Coming late, but I did have a look at the code and I'm intrigued by some of the choices. First, Cortex-A57 is a generic core design that implementations can diverge. So, choosing specific cache sizes for it makes no sense. If those values are required for specific optimisations in OpenBLAS (ex. strip mining), then the most conservative (or common?) values should be used by default, and applications/libraries like Second, VFP/NEON are mandatory for all ARMv8, so those defines, if relevant, should be present in all ARMv8 cases in [1]. VFP2/3/4 are an ARMv7 thing, so irrelevant to ARMv8. NEON was the ARMv7 name, now called "AdvSIMD" in ARMv8 and both are mandatory. So, a simple "#define ARMV8" would mean both FP and SIMD available by definition. If I may propose an implementation, I think regardless of what the core is, "ARMV8" should always be defined, so that you can easily just check for "ARMv8" for generic features like FP and vector instruction support. We only need the switch for specific cores when setting cache sizes/associativity and DTB details. I'd recommend against checking for core names, or the code will end up as a long string of random names that will not be relevant in a few years (we have enough of them in compilers already). Third, the compilation flags on ARMV8 vs A57 [2] are redundant. Compilers default to "march=armv8" on AArch64 and both FP and SIMD extensions are mandatory. I'm assuming you don't use CRC nor CRYPTO. The only flag you really need is "mtune". Lastly, this is more of a question of me trying to add compilation flags comparing GCC and LLVM. For the two compilers to have comparable output (we use this for benchmarks), these are the flags that we pass:
So, I imagine fast-math would be contentious, but the other two (unroll, fp-contract) are harmless (mathematically speaking) and tend to improve code-generation and vectorisation. Hope this helps. Thanks! [1] https://github.com/xianyi/OpenBLAS/blob/develop/cpuid_arm64.c#L200 |
Hi Renato, Thanks for the detailed review. The primary idea of the PR is to make the the default ARMv8 target use the neon kernels to improve performance. This would, regardless of the fixed blocking sizes chosen, improve the performance (I would be surprised if this is proven wrong) on all machines which were running the base C kernels previously. The #defines for cache sizes and the defines like VFPv2/v3/v4 has no effect on the performance on ARmv8 targets. In fact I dont think they are used anywhere (or has any effect) in the ARMv8 code. These defines crept into ARMV8 code in the initial days as it was adopted initially from the ARMv7 code. In fact, there is a lot of cleanup required in the ARMV8 code in addition to the excellent suggestions you have proposed. If we have people interested to work on these changes, I think they can be made on top of this commit and #1829 as these just aims at improving the ARMV8 performance staying within the limitations of the current code structure. Thanks |
I agree these are on top to the two existing PRs. I'll try to submit a few clean up PRs in the following days. |
Currently the generic ARMV8 target is using C implementations for many routines. Replace them with the neon implementations for ThunderX2. These neon versions are up to 6x faster for particular routines.