performance is very bad when porting openblas #1693
Comments
How do I turn on ARM optimizations (NEON) when building with TARGET=ARMV7?
From kernel/arm/KERNEL.ARMV7 (which includes KERNEL.ARMV6), the SDOT kernel is sdot_vfp.S, so it contains no NEON instructions (see #1483 for a related discussion). The performance difference between Tizen and Linux is strange; perhaps Tizen needs different alignment, or the assembler on Tizen misinterprets the .align directives in the code (we had this problem with OSX until recently).
@martin-frbg
As I understand it, it is not necessary to choose it. (In the past, ARMV8 used to be a generic, plain C implementation, and assembly was only used in specific TARGETs like CORTEXA57; nowadays the two are almost identical.) I am not sure about the extent and quality of NEON support in the arm64 assembly code, however. (Most recent work focused on the thunderx2t99 target, but I am not sure if
BTW, was your comparison to Linux running on comparable hardware, or to x86_64 (an entirely different codebase)?
Tizen is Linux as far as openblas.so is concerned. If you possess multiple evaluation boards, maybe get one running standard Debian as a reference for comparisons. Note that some early patches from the Spectre/Meltdown series did great damage to performance.
My comparison is to x86_64 (an entirely different codebase). I think this hardware needs ARMv8 and NEON; after enabling them, the performance should improve a lot.
I am actually not sure if building ARMV8 for 32bit (something like
I am not familiar with the Tizen platform, so unless some concrete information about it is given, I can't comment on it. ARM Cortex-A73 and ARM Cortex-A53 are both ARMv8 CPUs (64-bit). ARMv8 CPUs in general (not all) support 32-bit execution, i.e. binaries built for ARMv7 will run on ARMv8. So there is no such thing as a "separate ARMV8 32bit build". 32-bit binaries on ARMv8 cannot take advantage of many ARMv8 features, for obvious reasons. Is there any specific reason that you want to create 32-bit binaries? And can you run your tests with a larger problem size? Arrays with 2 elements are not good enough for performance testing: the setup overhead (which involves a lot of things) will clearly dominate the time spent. Run with array sizes that are multiples of 1k and see. Also, is the same compiler used on Linux and Tizen?
One more point: when comparing Tizen to Linux, I hope both are running on the same processor.
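The advice above about larger problem sizes could be sketched roughly as follows. This is only an illustration, not code from the thread: `ref_sdot` and `time_sdot_seconds` are hypothetical helper names, and `ref_sdot` is a plain-C stand-in for `cblas_sdot` so that the sketch compiles without OpenBLAS. To measure the library itself, one would include `cblas.h`, link against OpenBLAS, and call `cblas_sdot(n, x, 1, y, 1)` in the timed loop instead.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical plain-C stand-in for cblas_sdot with unit strides. */
float ref_sdot(size_t n, const float *x, const float *y)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* Average CPU seconds per dot product of length n, taken over reps calls.
   Stores the last dot product in *result_out when it is non-NULL.
   Returns -1.0 on invalid arguments or allocation failure. */
double time_sdot_seconds(size_t n, int reps, float *result_out)
{
    if (n == 0 || reps <= 0)
        return -1.0;

    float *x = malloc(n * sizeof *x);
    float *y = malloc(n * sizeof *y);
    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return -1.0;
    }
    for (size_t i = 0; i < n; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    /* volatile keeps the stores of the result from being discarded;
       an aggressive optimizer may still simplify the loop body. */
    volatile float sink = 0.0f;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink = ref_sdot(n, x, y);
    clock_t stop = clock();

    if (result_out != NULL)
        *result_out = sink;
    free(x);
    free(y);

    /* clock() counts ticks; CLOCKS_PER_SEC converts them to seconds. */
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

Calling, for example, `time_sdot_seconds(1024 * 1024, 100, &result)` on both systems gives a per-call figure in which the one-time setup cost is amortized over many calls, unlike a single two-element call.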
@martin-frbg @ashwinyes
Could you please give the build command that you are issuing? Also, please give the output of "gcc -dumpmachine" for the Linux and Tizen compilers.
@ashwinyes build command for armv5: build command for armv7: build command for armv8:
All the NEON optimizations that are in OpenBLAS to date for ARMV7 should be enabled if you build with TARGET=ARMV7. No extra flags are needed. Linux - x86_64-linux-gnu - Intel: Intel will always be better, as it has a better floating point unit, assuming it is not a very old Intel CPU. So there is no point in comparing like this. Moreover, you are using the 32-bit mode of the Cortex-A73/A53, so you will not be able to use the advanced SIMD features of ARMv8 NEON, which further limits the performance. So I don't think there is much we can do here. Also, you cannot build TARGET=ARMV8 with the Tizen 32-bit compiler; it requires a separate compiler for ARMv8.
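To see which ARM mode a given compiler is actually targeting, a small compile-time check along these lines may help. It is a sketch under assumptions: `arm_simd_target` is a hypothetical helper name, and the check relies on the predefined macros `__aarch64__`, `__arm__`, and `__ARM_NEON` (or the older `__ARM_NEON__`) that GCC and Clang set for ARM targets. It reports only what the compiler was configured to emit, not which kernels OpenBLAS selects.

```c
#include <stddef.h>

/* Describes the ARM target the compiler is building for, based on
   predefined macros. Hypothetical helper, for illustration only. */
const char *arm_simd_target(void)
{
#if defined(__aarch64__)
    /* 64-bit ARMv8 state: Advanced SIMD is part of the base target. */
    return "aarch64 (64-bit, Advanced SIMD available)";
#elif defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    /* 32-bit ARM with NEON code generation enabled by the compiler flags. */
    return "32-bit ARM with NEON enabled";
#elif defined(__arm__)
    return "32-bit ARM without NEON enabled";
#else
    return "not an ARM target";
#endif
}
```

Printing the returned string from a test program built with the same toolchain and flags as the library shows at a glance whether a build is 32-bit or 64-bit, which complements the "gcc -dumpmachine" output requested earlier in the thread.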
@ashwinyes
You need aarch64 to access the full NEON register file, just as you need x86_64 for AVX.
Dear Sir,
We want to port OpenBLAS to the Tizen platform, but the performance is very bad.
For example, we tested only this API:
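/* requires <stdio.h>, <time.h> and <cblas.h> */
int dim = 2;
float a[2] = {1.0f, 2.0f};
float b[2] = {1.0f, 1.0f};
clock_t START = clock();
/* unit stride for both vectors */
printf("%f\n", cblas_sdot(dim, a, 1, b, 1));
/* clock_t is not an int, so cast before printing the tick count */
printf("time %ld\n", (long)(clock() - START));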
The result on the Tizen platform is:
3.000000
time 945
But when we test the same code on Linux, the result is:
3.000000
time 52
We have tested this many times, and the results change very little.
I do not think it is related to the hardware, because I also tested this code:
clock_t START = clock();
/* volatile keeps the compiler from removing the empty loop */
volatile int i = 0;
while (i < 99999999) {
    i++;
}
printf("time %ld\n", (long)(clock() - START));
On the Tizen platform, the result is:
time 292073
while on the Linux platform it is:
time 157073
We build OpenBLAS on Tizen with this command:
make TARGET=ARMV7
Could you help me check this issue?
Thank you very much.