Skip to content

[RFC] Adding softfp support for ARM by a inline assembly wrapper #853

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
erlv opened this issue Apr 26, 2016 · 19 comments
Closed

[RFC] Adding softfp support for ARM by a inline assembly wrapper #853

erlv opened this issue Apr 26, 2016 · 19 comments

Comments

@erlv
Copy link

erlv commented Apr 26, 2016

Dear OpenBLAS community,
We recently suffers from the Android default SoftFP link with hardfp OpenBLAS correctness issue while linking several libraries with OpenBLAS as #777 .

Within our team, we have successfully make the compiler and linker happy on Android by linking softfp library and hardfp OpenBLAS together with a inline assembly wrapper and a compiler wrapper. Meanwhile there is no correctness issue both in theory and by many existing tests.

I am wondering whether the upstream like such idea, and if yes, I'd like to find a way to contribute it back to OpenBLAS community.

The inline assembly wrapper is a simulator, which tries to prepare the hardfp OpenBLAS function call's register and stack parameter passing using embedd assembly when compiler compile the code with -mfloat-abi=softfp.

The reason of using inline assembly wrapper instead of direct softfp implementation are:

  1. The OpenBLAS of hardfp is written in assembly, implement a brand new softfp version requires lots of work (as @xianyi has already mentioned).
  2. Comparing to hardfp, softfp uses less register to pass the parameters, therefore even we implement the equivalent version of OpenBLAS in softfp, the performance will be a bit lower than hardfp.
  3. OpenBLAS does not depend on other libraries, therefore in theory there will be no correctness issue if we use softfp to call the interface while the OpenBLAS code is implemented as hardfp.

The compiler wrapper is basically used to get rid of the float-abi attribute in each object file within the libblas.a/libblas.so file, so that the linker does not complain the error uses VFP register arguments, output does not while link Android toolchain compiled object files with OpenBLAS on ARM.

@erlv erlv changed the title [RFC] Adding softfp support for ARM by a wrapper with inline assembly code for hardfp ABI [RFC] Adding softfp support for ARM by a inline assembly wrapper Apr 26, 2016
@erlv
Copy link
Author

erlv commented Apr 26, 2016

To give a brief idea about what we have done, following is an example of our implementation of cblas_saxpy

#ifdef __ARM_PCS_VFP
#define cblas_sgemm cblas_sgemm
#define cblas_sgemv cblas_sgemv
#define cblas_saxpy cblas_saxpy
#define cblas_saxpby cblas_saxpby
#define cblas_sasum cblas_sasum
#define cblas_sscal cblas_sscal
#define cblas_sdot cblas_sdot
#define cblas_scopy cblas_scopy
#else
#define cblas_sgemm "cblas_sgemm"
#define cblas_sgemv "cblas_sgemv"
#define cblas_saxpy "cblas_saxpy"
#define cblas_saxpby "cblas_saxpby"
#define cblas_sasum "cblas_sasum"
#define cblas_sscal "cblas_sscal"
#define cblas_sdot "cblas_sdot"
#endif
void cblas_saxpy_wrapper(const int N, const float a, const float *x, const int incX,
                 float *y, const int incY) {
#ifdef __ARM_PCS_VFP
  // if compiled by -mfloat-abi=hard, directly call cblas_saxpy
  return cblas_saxpy(N, a, x, incX, y, incY); 
#else
#ifdef __SOFTFP
#error "ERROR:please build use softfp or hard ABI\n"
#else
  // if compiled by -mfloat-abi=softfp, run the assembly to prepare for the hardfp ABI call.
  __asm__ __volatile__("sub sp, sp, #24 \n\t"
                       "mov r0, %0 \n\t"
                       "fmsr s0, %1 \n\t"
                       "mov r1, %2 \n\t"
                       "mov r2, %3 \n\t"
                       "mov r3, %4 \n\t"
                       "str %5, [sp] \n\t"
                       "bl " cblas_saxpy "(PLT)\n\t"
                       "add sp, sp, #24 \n\t"
                       :
                       : "r"(N), "r"(a), "r"(x), "r"(incX), "r"(y), "r"(incY)
                       : "cc", "memory", "r0", "s0", "r1", "r2", "r3", "sp");
  return;
#endif
#endif
}

For functions with many parameters, such as cblas_sgemm, it is also doable, but need more assembly code.

@xianyi
Copy link
Collaborator

xianyi commented Apr 26, 2016

@erlv , thank you for the suggestion.

Actually, I prefer to modify the assembly kernel directly. I think it only need edit the beginning part of the source codes.

@buffer51
Copy link
Contributor

Hi @erlv

so that the linker does not complain the error uses VFP register arguments, output does not while link Android toolchain compiled object files with OpenBLAS on ARM.

Have you tried setting the ABI to armeabi-v7a-hard? If you're not using ndk-build, you can find relevant flags here.

This allows to link with OpenBLAS (hardfp), hardfp Android libraries when available, and still some remaining softfp Android libraries, and will run on any ARMv7 device.

If your goal is to build for general Android softfp (armeabi) with OpenBLAS in hardfp, know that this will fail on older devices because OpenBLAS uses vfpv3-d32 which they won't have. It will only work for devices that do have a floating point register, i.e. that support armeabi-v7a-hard.

@erlv
Copy link
Author

erlv commented Apr 27, 2016

Hi @xianyi ,
Thanks for your quick reply.

@erlv
Copy link
Author

erlv commented Apr 27, 2016

Hi @buffer51 ,
Thanks for your reminding, this is a problem we did not notice before.

@brada4
Copy link
Contributor

brada4 commented Apr 28, 2016

Android NDK recommends correct procedure to detect CPU features. Part of the problem is #844 that assemblies are not marked with co-processor they use.
Either way you need to define multiple code paths in your code (can be dlopen()) depending on architecture it is running on. It is not easy, OpenBLAS weights like 5-10MB per target. And there is no NO_AVX2-like flags on ARM.

@xianyi
Copy link
Collaborator

xianyi commented Apr 29, 2016

@brada4 , do you know how to detect CPU correctly on Android?

@brada4
Copy link
Contributor

brada4 commented May 1, 2016

It is very similar to x86 cpuid, they even have library functions to detect it:
https://developer.android.com/ndk/guides/abis.html
And those coprocessors are very alike between all platforms, not like X86 where some CPUs have SSE2 in firmware just to fill minimum requirement for some random OS.
Easy win is in tweaking compiler (march etc) parameters for C kernels to minimum function of ABI
But then again - android+arm is also about power saving, and not waking co-processors might be double of time but quarter of electricity....

@Mikaso
Copy link

Mikaso commented May 20, 2016

Hey erlv! That's a great idea and I really appreciate that you want to contribute back to the community. I did test your example wrapper and it works like a charm. If you want to give the other wrappers you wrote to the openblas users, I would be very grateful!

@andreas-eberle
Copy link

@buffer51: With ndk-12, the armeabi-v7a-hard has been removed from the ndk. Therefore you cannot build with that anymore. See https://android.googlesource.com/platform/ndk/+/master/docs/HardFloatAbi.md

@xianyi
Copy link
Collaborator

xianyi commented Aug 23, 2016

@andreas-eberle , thank you for the information.

@buffer51
Copy link
Contributor

@andreas-eberle Agreed. I think this is a good news, this was very painful in practice. Also, all new phones are now AArch64, which I guess is why they stopped bothering.

If you really need ARMv7a, you can still use NDK r11c. Otherwise building for AArch64 already works, and you can use simple armeabi for older devices (slow though..).

@xianyi
Copy link
Collaborator

xianyi commented Aug 31, 2016

@buffer51 , I didn't try our ARM Cortex-A57 codes on Android AArch64 phone. I think it should work. Do you already try it?

@buffer51
Copy link
Contributor

@xianyi I have not tried the target CORTEXA57 specifically, just ARMV8. It worked without issues. I have a phone with big.LITTLE A53 / A57, I could try CORTEXA57 if you want.

@brada4
Copy link
Contributor

brada4 commented Sep 2, 2016

It is A53 unless #844 cpuid mess is addressed by kernel patch and there is a chance for openblas to support asymmetric multiprocessing. Especially freeze in place of invalid instruction trap is impossible to work around.

@xianyi
Copy link
Collaborator

xianyi commented Sep 2, 2016

@brada4 , I didn't familiar with A53. Are there any differences between A53 and A57?

@pathscale
Copy link

Both are ARMv8 ISA, but A53 is in-order core and A57 is Out-of-Order.

@gwangsc
Copy link

gwangsc commented Mar 1, 2017

@erlv, could you please post a wrapper for cblas_sgemm? Thanks a lot. That will be extremely helpful.

@martin-frbg
Copy link
Collaborator

softfp support was added through #1221 and related changes, so closing here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants