performance is very bad when porting openblas #1693

Closed
TianBinyu opened this issue Jul 21, 2018 · 16 comments
@TianBinyu

Dear Sir~
We are porting OpenBLAS to the Tizen platform, but the performance is very bad ... :(
For example, we tested only this API:
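float a[2] = {1.0f, 2.0f};
float b[2] = {1.0f, 1.0f};
int dim = 2; /* not shown in the original snippet; assumed to be the array length */
clock_t START = clock();
printf("%f\n", cblas_sdot(dim, a, 1, b, 0));
printf("time %ld\n", (long)(clock() - START));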
The result on the Tizen platform is:
3.000000
time 945

But when we test the same code on Linux, the result is:
3.000000
time 52
We have tested many times, with little change.

I don't think it is related to the hardware, because I tested this code, too:
clock_t START = clock();
int i = 0;
while (i < 99999999) {
    i++;
}
printf("time %ld\n", (long)(clock() - START));
On the Tizen platform, the result is:
time 292073
while on the Linux platform it is:
time 157073
We build OpenBLAS on Tizen with this command:
make TARGET=ARMV7

Could you help me check this issue?
Thank you very much~
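
The timings quoted above are raw clock() tick differences, and the first measurement also includes the cost of the printf that wraps the cblas_sdot call. A minimal sketch of a slightly more careful measurement follows, assuming a hosted C99 environment. The helper names ticks_to_ms and busy_loop are hypothetical and not part of OpenBLAS.

```c
#include <assert.h>
#include <time.h>

/* Convert a clock() tick difference to milliseconds of CPU time. */
static double ticks_to_ms(clock_t start, clock_t end)
{
    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC;
}

/* Count up to n with a volatile counter, so an optimizing compiler
   cannot remove the loop and report a misleadingly small time. */
static long busy_loop(long n)
{
    assert(n >= 0);
    volatile long i = 0;
    while (i < n) {
        i++;
    }
    return i;
}
```

Calling clock() immediately before and after busy_loop(99999999), and printing ticks_to_ms(start, end) with %f, reports the loop time in the same unit on both machines and keeps the printf outside the timed region.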

@TianBinyu
Author

TianBinyu commented Jul 21, 2018

How do I turn on ARM optimizations (NEON) when building with TARGET=ARMV7?
Because our platform is arm32, I think NEON may give higher performance~
:)

@martin-frbg
Collaborator

From kernel/arm/KERNEL.ARMV7 (which includes KERNEL.ARMV6) the SDOT kernel is sdot_vfp.S (so no NEON instructions, see #1483 for a related discussion). The performance difference between Tizen and Linux is strange, perhaps Tizen needs different alignment (or the assembler on Tizen misunderstands the .align instructions in the code - we had this problem with OSX until recently)

@TianBinyu
Author

@martin-frbg
If I use make TARGET=ARMV8,
how can I choose NEON when building for ARMv8?

@martin-frbg
Collaborator

As I understand it, it is not necessary to choose it. (In the past, ARMV8 used to be a generic, plain C implementation, and assembly was only used in specific TARGETs like CORTEXA57 - nowadays the two are almost identical.) I am not sure about the extent and quality of NEON support in the arm64 assembly code however. (Most recent work focused on the thunderx2t99 target, but I am not sure if modifying the KERNEL.ARMV8 to pick up the files specified in KERNEL.THUNDERX2T99 would make sense on a less powerful platform.)

@martin-frbg
Collaborator

BTW was your comparison to Linux running on comparable hardware, or to x86_64 (an entirely different codebase)?

@brada4
Contributor

brada4 commented Jul 22, 2018

Tizen is Linux as far as openblas.so is concerned. If you possess multiple evaluation boards, maybe get one running standard Debian to get a reference for comparisons. Note that some early patches in the Spectre/Meltdown series did great performance damage...

@TianBinyu
Author

TianBinyu commented Jul 24, 2018

My comparison is to x86_64 (an entirely different codebase),
so it seems not very valuable...
@martin-frbg @brada4
When I try to build the OpenBLAS ARMv8 lib, this error happens:
make[1]: Entering directory '/home/abuild/rpmbuild/BUILD/OpenBLAS/kernel'
make[1]: *** No rule to make target '../kernel/arm/amax.S', needed by 'samax_k.o'. Stop.
make[1]: Leaving directory '/home/abuild/rpmbuild/BUILD/OpenBLAS/kernel'
Makefile:145: recipe for target 'libs' failed
make: *** [libs] Error 1
I think that is because our platform is 32 bit, not 64.
And this is the hardware info:
ARM Cortex-A73 & ARM Cortex-A53
How can I build an ARMv8 32bit lib?
Thank you very much~

(I think this hardware needs ARMv8 and NEON; after enabling it, the performance will improve a lot~)

@martin-frbg
Collaborator

I am actually not sure if building ARMV8 for 32bit (something like TARGET=ARMV8 BINARY=32) is currently supported. From the error message, it seems the build system is already trying to search files in the 32bit ARMV7 tree so autodetection appears to have worked to some extent.
Maybe this is a regression caused by making KERNEL.ARMV8 use almost all the optimizations from KERNEL.CORTEXA57 instead of the slower but more portable C kernels (#1439). Not sure if "32bit arm64" even makes sense with the current kernel files, or if you would effectively need to build for ARMV7 (and "somebody" should update the ARMV7 assembly to use vfp4 FMA where available).
Unfortunately I still know very little about the ARM platforms @ashwinyes ?

@ashwinyes
Contributor

I am not familiar with the Tizen platform, so unless some concrete information about the platform is given, I can't comment on it.

ARM Cortex-A73&ARM Cortex-A53 - Both these are ARMv8 CPUs (64bit).

ARMv8 CPUs in general (not all) support 32bit execution, i.e. binaries built for ARMv7 will run on ARMv8. So there is no such thing as a "separate ARMV8 32bit build". 32bit binaries on ARMv8 cannot take advantage of a lot of ARMv8 features for obvious reasons.

Is there any specific reason that you want to create 32bit binaries?

And can you run your tests with a larger problem size? Arrays with 2 elements are not good enough for performance testing. The setup overhead (which involves a lot of things) will clearly dominate the time spent. Run for array size of multiples of 1k and see.
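
A minimal sketch of the kind of measurement suggested above: time many repeated calls on a vector of a given length, so that setup overhead no longer dominates. To keep it self-contained, it times a plain-C stand-in named ref_sdot (a hypothetical helper, not an OpenBLAS function) that takes the same arguments as cblas_sdot. To time OpenBLAS itself, the call inside time_sdot would be replaced with cblas_sdot and the program linked against the library. The helper name time_sdot is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Plain-C stand-in with the same argument order as cblas_sdot.
   Assumes positive strides. */
static float ref_sdot(int n, const float *x, int incx, const float *y, int incy)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += x[(size_t)i * (size_t)incx] * y[(size_t)i * (size_t)incy];
    }
    return sum;
}

/* Average CPU seconds per call for vectors of length n, over reps calls.
   Stores the last dot product in *result. Returns -1.0 if allocation fails. */
static double time_sdot(int n, int reps, float *result)
{
    assert(n > 0 && reps > 0 && result != NULL);
    float *x = malloc((size_t)n * sizeof *x);
    float *y = malloc((size_t)n * sizeof *y);
    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return -1.0;
    }
    for (int i = 0; i < n; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    volatile float sink = 0.0f; /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        sink = ref_sdot(n, x, 1, y, 1); /* swap in cblas_sdot here */
    }
    clock_t end = clock();
    *result = sink;
    free(x);
    free(y);
    return (double)(end - start) / (double)CLOCKS_PER_SEC / (double)reps;
}
```

Running time_sdot for lengths such as 1024, 4096 and 16384 on both systems, with enough repetitions that each run lasts well above the clock() resolution, gives per-call times that can be compared meaningfully.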

And is the same compiler used on Linux and Tizen?

@ashwinyes
Contributor

And one more point.

When comparing Tizen to Linux, I hope both are running on the same processor.

@TianBinyu
Author

@martin-frbg @ashwinyes
Thank you very much for your help~
For Tizen, the compiler is armv7l-tizen-linux-gnueabi-gcc & armv7l-tizen-linux-gnueabi-g++.
For Linux, the compiler is gcc & g++.
Actually, I want to port Kaldi to the Tizen platform, and Kaldi needs OpenBLAS.
I have built OpenBLAS for ARMv5 and ARMv7 and tested it on Tizen (CPU: Exynos 7885).
Processing a 1-second wav file with the ARMv5 OpenBLAS binary takes 26s.
Processing the same wav file with the ARMv7 OpenBLAS binary takes 16s.
I think ARMv7 is a big improvement.
But processing the same wav file on Ubuntu Linux takes only 1s-2s.
I am very sorry that my comparison is not valuable...
About the "separate ARMV8 32bit build": that is because our CPUs are Cortex-A73 and Cortex-A53, which support NEON, but the Tizen platform is 32 bit... :(
So if I want to enable NEON for a 32 bit platform, what is the OpenBLAS build setting?

@ashwinyes
Contributor

Could you please give the build command that you are issuing ?

Also please give output of "gcc -dumpmachine" for linux and Tizen compilers.

@TianBinyu
Author

@ashwinyes
linux:
gcc -dumpmachine
x86_64-linux-gnu
tizen:
gcc -dumpmachine
armv7l-tizen-linux-gnueabi

build command for armv5:
make TARGET=ARMV5

build command for armv7:
make TARGET=ARMV7

build command for armv8:
make TARGET=ARMV8
but it fails with a build error like this:
make[1]: Entering directory '/home/abuild/rpmbuild/BUILD/OpenBLAS/kernel'
make[1]: *** No rule to make target '../kernel/arm/amax.S', needed by 'samax_k.o'. Stop.

@ashwinyes
Contributor

All the NEON optimizations that are in OpenBLAS to date for ARMV7 should be enabled if you build with TARGET=ARMV7. No extra flags are needed.

Linux - x86_64-linux-gnu - Intel
Tizen - armv7l-tizen-linux-gnueabi -ARM

Intel will always be better as it has a better floating point unit, assuming it is not a very old Intel CPU. So there is no point in comparing like this. Moreover, you are using the 32bit mode of Cortex-A73/A53, so you will not be able to use the advanced SIMD features of ARMv8 NEON, which further limits the performance. So I don't think there is much we can do here.

Also, you cannot build TARGET=ARMV8 with the tizen 32bit compiler. It requires a separate compiler for Armv8.
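
One way to confirm which architecture a given compiler is targeting, beyond "gcc -dumpmachine", is to inspect its predefined macros. The sketch below assumes a GCC- or Clang-compatible compiler that defines __aarch64__, __arm__, __x86_64__ and the ACLE macro __ARM_NEON. The helper names target_arch, neon_enabled and arch_is are hypothetical. It reports what the compiler was told to generate, not what the CPU can run.

```c
#include <assert.h>
#include <string.h>

/* Name of the architecture the compiler is generating code for. */
static const char *target_arch(void)
{
#if defined(__aarch64__)
    return "aarch64";
#elif defined(__arm__)
    return "arm32";
#elif defined(__x86_64__)
    return "x86_64";
#else
    return "other";
#endif
}

/* 1 if this translation unit is compiled with NEON (Advanced SIMD) enabled. */
static int neon_enabled(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* 1 if the compile-time target matches the given name, 0 otherwise. */
static int arch_is(const char *name)
{
    assert(name != NULL);
    return strcmp(target_arch(), name) == 0;
}
```

With the compilers quoted in this thread, target_arch() would be expected to return "arm32" for armv7l-tizen-linux-gnueabi-gcc and "x86_64" for the desktop gcc. Whether neon_enabled() returns 1 on the Tizen side depends on the -mfpu and -march options in effect.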

@TianBinyu
Author

@ashwinyes
When I choose the OpenBLAS ARMv7 binary, can it use the advanced SIMD features of ARMv8 NEON?

@brada4
Contributor

brada4 commented Jul 24, 2018

You need aarch64 to access the NEON register file, just like x86_64 for AVX.
