Use ThunderX2 Neon Kernels for ARMV8 Target #1821

ashwinyes · 2018-10-17T17:49:08Z

Currently the generic ARMV8 target is using C implementations for many routines. Replace them with the neon implementations for ThunderX2. These neon versions are up to 6x faster for particular routines.

Currently the generic ARMV8 target uses C implementations for many routines. Replace these with the neon implementations written for THUNDERX2T99 target which are upto 6x faster for certain routines.

martin-frbg · 2018-10-17T20:39:04Z

Am I misreading the diff, or does this now make CORTEX A57 an "inferior" target compared to generic V8 ?

ashwinyes · 2018-10-18T03:19:35Z

The need for this change is that packages like openhpc builds the OpenBLAS package with the ARMV8 target and then complains of poor performance. With this change ARMV8 target can take advantage of all the optimizations that I had done for THUNDERX2T99.

The idea is to keep all other targets the same and change the ARMV8 target to use THUNDERX2T99
kernels. So in a way, now with this change ARMV8 will be having some optimizations that are not there for CORTEXA57 target.

After this change, all other targets performs the same, while ARMV8 target performance improves.

The next step is to look at implementing DYNAMIC_ARCH for arm64 like x86.

martin-frbg · 2018-10-18T06:12:57Z

So there is no fundamental limitation in A57 hardware that would keep it from using the newer code ? Would it make sense to retire A57 (at least in the cpuid_arm64 autodetection code), or are the performance differences between "old" optimized A57 assembly and "new" THUNDERX2T99 code expected to be negligible ?

ashwinyes · 2018-10-18T10:42:58Z

Right now, the THUNDERX2T99 code can be used by any generic ARMv8 target like CortexA57. To be more precise, THUNDERX2T99 is ARMv8.1 compliant while cortexa57 is ARMv8.0 compliant (https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores). But the THUNDERX2T99 code in openblas uses noene of the 8.1 features.

And the difference in performance between CORTEXA57 and THUNDERX2 targets on a A57 machine should be negligible.

It doesn't make sense to retire CortexA57. There are subtle differences between different ARM64 cores' microarchitectures which makes it necessary to keep separate targets.

And thanks for merging.

martin-frbg · 2018-10-18T10:52:49Z

Thanks a lot for this information. Just trying to learn more about the platform.

rengolin · 2018-10-22T14:15:42Z

Coming late, but I did have a look at the code and I'm intrigued by some of the choices.

First, Cortex-A57 is a generic core design that implementations can diverge. So, choosing specific cache sizes for it makes no sense. If those values are required for specific optimisations in OpenBLAS (ex. strip mining), then the most conservative (or common?) values should be used by default, and applications/libraries like lscpu should be used to enquire about the actual values.

Second, VFP/NEON are mandatory for all ARMv8, so those defines, if relevant, should be present in all ARMv8 cases in [1]. VFP2/3/4 are an ARMv7 thing, so irrelevant to ARMv8. NEON was the ARMv7 name, now called "AdvSIMD" in ARMv8 and both are mandatory. So, a simple "#define ARMV8" would mean both FP and SIMD available by definition.

If I may propose an implementation, I think regardless of what the core is, "ARMV8" should always be defined, so that you can easily just check for "ARMv8" for generic features like FP and vector instruction support. We only need the switch for specific cores when setting cache sizes/associativity and DTB details. I'd recommend against checking for core names, or the code will end up as a long string of random names that will not be relevant in a few years (we have enough of them in compilers already).

Third, the compilation flags on ARMV8 vs A57 [2] are redundant. Compilers default to "march=armv8" on AArch64 and both FP and SIMD extensions are mandatory. I'm assuming you don't use CRC nor CRYPTO. The only flag you really need is "mtune".

Lastly, this is more of a question of me trying to add compilation flags comparing GCC and LLVM. For the two compilers to have comparable output (we use this for benchmarks), these are the flags that we pass:

GCC: -O3 -ffast-math -funroll-loops (LLVM does this by default)
LLVM: -O3 -ffast-math -ffp-contract=on (GCC does this by default)

So, I imagine fast-math would be contentious, but the other two (unroll, fp-contract) are harmless (mathematically speaking) and tend to improve code-generation and vectorisation.

Hope this helps. Thanks!
--renato

[1] https://github.com/xianyi/OpenBLAS/blob/develop/cpuid_arm64.c#L200
[2] https://github.com/xianyi/OpenBLAS/blob/develop/Makefile.arm64

ashwinyes · 2018-10-22T14:47:16Z

Hi Renato,

Thanks for the detailed review.

The primary idea of the PR is to make the the default ARMv8 target use the neon kernels to improve performance. This would, regardless of the fixed blocking sizes chosen, improve the performance (I would be surprised if this is proven wrong) on all machines which were running the base C kernels previously.

The #defines for cache sizes and the defines like VFPv2/v3/v4 has no effect on the performance on ARmv8 targets. In fact I dont think they are used anywhere (or has any effect) in the ARMv8 code. These defines crept into ARMV8 code in the initial days as it was adopted initially from the ARMv7 code.

In fact, there is a lot of cleanup required in the ARMV8 code in addition to the excellent suggestions you have proposed. If we have people interested to work on these changes, I think they can be made on top of this commit and #1829 as these just aims at improving the ARMV8 performance staying within the limitations of the current code structure.

Thanks
Ashwin

rengolin · 2018-10-22T14:50:47Z

I agree these are on top to the two existing PRs.

I'll try to submit a few clean up PRs in the following days.

Ashwin Sekhar T K added 5 commits October 17, 2018 08:01

ARM64: Remove dependency of XGENE1 Makefile on ARMV8 Makefile

c3d93ca

ARM64: Remove dependency of CORTEXA57 Makefile on ARMV8 Makefile

162e312

ARM64: Remove dependency of THUNDERX Makefile on ARMV8 Makefile

8001fdc

ARM64: Remove dependency of THUNDERX2T99 Makefile on CORTEXA57 Makefile

caf3394

ARM64: Use THUNDERX2T99 Neon Kernels for ARMV8

21f46a1

Currently the generic ARMV8 target uses C implementations for many routines. Replace these with the neon implementations written for THUNDERX2T99 target which are upto 6x faster for certain routines.

martin-frbg merged commit 71c6dee into OpenMathLib:develop Oct 18, 2018

martin-frbg added this to the 0.3.4 milestone Oct 18, 2018

rengolin mentioned this pull request Oct 22, 2018

Add DYNAMIC_ARCH support for ARM64 #1829

Merged

martin-frbg mentioned this pull request Nov 5, 2018

Wrong asm file included in ARMv8 build #1854

Closed

martin-frbg mentioned this pull request Nov 14, 2018

Incorrect GEMM results with armv8 #1870

Closed

martin-frbg mentioned this pull request Sep 30, 2019

Build OpenBLAS 0.3.6 for iOS #2275

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use ThunderX2 Neon Kernels for ARMV8 Target #1821

Use ThunderX2 Neon Kernels for ARMV8 Target #1821

Uh oh!

ashwinyes commented Oct 17, 2018

Uh oh!

martin-frbg commented Oct 17, 2018

Uh oh!

ashwinyes commented Oct 18, 2018

Uh oh!

martin-frbg commented Oct 18, 2018

Uh oh!

ashwinyes commented Oct 18, 2018 •

edited

Loading

Uh oh!

martin-frbg commented Oct 18, 2018

Uh oh!

rengolin commented Oct 22, 2018

Uh oh!

ashwinyes commented Oct 22, 2018

Uh oh!

rengolin commented Oct 22, 2018

Uh oh!

Uh oh!

Use ThunderX2 Neon Kernels for ARMV8 Target #1821

Use ThunderX2 Neon Kernels for ARMV8 Target #1821

Uh oh!

Conversation

ashwinyes commented Oct 17, 2018

Uh oh!

martin-frbg commented Oct 17, 2018

Uh oh!

ashwinyes commented Oct 18, 2018

Uh oh!

martin-frbg commented Oct 18, 2018

Uh oh!

ashwinyes commented Oct 18, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martin-frbg commented Oct 18, 2018

Uh oh!

rengolin commented Oct 22, 2018

Uh oh!

ashwinyes commented Oct 22, 2018

Uh oh!

rengolin commented Oct 22, 2018

Uh oh!

Uh oh!

ashwinyes commented Oct 18, 2018 •

edited

Loading