Skip to content

Use ThunderX2 Neon Kernels for ARMV8 Target #1821

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ashwinyes
Copy link
Contributor

Currently the generic ARMV8 target is using C implementations for many routines. Replace them with the neon implementations for ThunderX2. These neon versions are up to 6x faster for particular routines.

Ashwin Sekhar T K added 5 commits October 17, 2018 08:01
@martin-frbg
Copy link
Collaborator

Am I misreading the diff, or does this now make CORTEX A57 an "inferior" target compared to generic V8 ?

@ashwinyes
Copy link
Contributor Author

The need for this change is that packages like openhpc builds the OpenBLAS package with the ARMV8 target and then complains of poor performance. With this change ARMV8 target can take advantage of all the optimizations that I had done for THUNDERX2T99.

The idea is to keep all other targets the same and change the ARMV8 target to use THUNDERX2T99
kernels. So in a way, now with this change ARMV8 will be having some optimizations that are not there for CORTEXA57 target.

After this change, all other targets performs the same, while ARMV8 target performance improves.

The next step is to look at implementing DYNAMIC_ARCH for arm64 like x86.

@martin-frbg
Copy link
Collaborator

So there is no fundamental limitation in A57 hardware that would keep it from using the newer code ? Would it make sense to retire A57 (at least in the cpuid_arm64 autodetection code), or are the performance differences between "old" optimized A57 assembly and "new" THUNDERX2T99 code expected to be negligible ?

@martin-frbg martin-frbg merged commit 71c6dee into OpenMathLib:develop Oct 18, 2018
@ashwinyes
Copy link
Contributor Author

ashwinyes commented Oct 18, 2018

Right now, the THUNDERX2T99 code can be used by any generic ARMv8 target like CortexA57. To be more precise, THUNDERX2T99 is ARMv8.1 compliant while cortexa57 is ARMv8.0 compliant (https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores). But the THUNDERX2T99 code in openblas uses noene of the 8.1 features.

And the difference in performance between CORTEXA57 and THUNDERX2 targets on a A57 machine should be negligible.

It doesn't make sense to retire CortexA57. There are subtle differences between different ARM64 cores' microarchitectures which makes it necessary to keep separate targets.

And thanks for merging.

@martin-frbg martin-frbg added this to the 0.3.4 milestone Oct 18, 2018
@martin-frbg
Copy link
Collaborator

Thanks a lot for this information. Just trying to learn more about the platform.

@rengolin
Copy link
Contributor

Coming late, but I did have a look at the code and I'm intrigued by some of the choices.

First, Cortex-A57 is a generic core design that implementations can diverge. So, choosing specific cache sizes for it makes no sense. If those values are required for specific optimisations in OpenBLAS (ex. strip mining), then the most conservative (or common?) values should be used by default, and applications/libraries like lscpu should be used to enquire about the actual values.

Second, VFP/NEON are mandatory for all ARMv8, so those defines, if relevant, should be present in all ARMv8 cases in [1]. VFP2/3/4 are an ARMv7 thing, so irrelevant to ARMv8. NEON was the ARMv7 name, now called "AdvSIMD" in ARMv8 and both are mandatory. So, a simple "#define ARMV8" would mean both FP and SIMD available by definition.

If I may propose an implementation, I think regardless of what the core is, "ARMV8" should always be defined, so that you can easily just check for "ARMv8" for generic features like FP and vector instruction support. We only need the switch for specific cores when setting cache sizes/associativity and DTB details. I'd recommend against checking for core names, or the code will end up as a long string of random names that will not be relevant in a few years (we have enough of them in compilers already).

Third, the compilation flags on ARMV8 vs A57 [2] are redundant. Compilers default to "march=armv8" on AArch64 and both FP and SIMD extensions are mandatory. I'm assuming you don't use CRC nor CRYPTO. The only flag you really need is "mtune".

Lastly, this is more of a question of me trying to add compilation flags comparing GCC and LLVM. For the two compilers to have comparable output (we use this for benchmarks), these are the flags that we pass:

  • GCC: -O3 -ffast-math -funroll-loops (LLVM does this by default)
  • LLVM: -O3 -ffast-math -ffp-contract=on (GCC does this by default)

So, I imagine fast-math would be contentious, but the other two (unroll, fp-contract) are harmless (mathematically speaking) and tend to improve code-generation and vectorisation.

Hope this helps. Thanks!
--renato

[1] https://github.com/xianyi/OpenBLAS/blob/develop/cpuid_arm64.c#L200
[2] https://github.com/xianyi/OpenBLAS/blob/develop/Makefile.arm64

@ashwinyes
Copy link
Contributor Author

Hi Renato,

Thanks for the detailed review.

The primary idea of the PR is to make the the default ARMv8 target use the neon kernels to improve performance. This would, regardless of the fixed blocking sizes chosen, improve the performance (I would be surprised if this is proven wrong) on all machines which were running the base C kernels previously.

The #defines for cache sizes and the defines like VFPv2/v3/v4 has no effect on the performance on ARmv8 targets. In fact I dont think they are used anywhere (or has any effect) in the ARMv8 code. These defines crept into ARMV8 code in the initial days as it was adopted initially from the ARMv7 code.

In fact, there is a lot of cleanup required in the ARMV8 code in addition to the excellent suggestions you have proposed. If we have people interested to work on these changes, I think they can be made on top of this commit and #1829 as these just aims at improving the ARMV8 performance staying within the limitations of the current code structure.

Thanks
Ashwin

@rengolin
Copy link
Contributor

I agree these are on top to the two existing PRs.

I'll try to submit a few clean up PRs in the following days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants