DPOTRF deadlocks on arm cortex A15 #844

Closed
emilienkofman opened this issue Apr 20, 2016 · 16 comments

Comments

@emilienkofman

emilienkofman commented Apr 20, 2016

Using the benchmarks I noticed that the potrf kernel would deadlock on my A15 arm board (odroid):

$ OPBLAS_NUM_THREADS=2 ./dpotrf.goto 
From :   1  To : 200 Step =   1 Uplo = U
       1 :       0.01 MFlops :      0.000 Sec : Test=F
       2 :       0.07 MFlops :      0.000 Sec : Test=F
       3 :       0.25 MFlops :      0.000 Sec : Test=F
       4 :       0.59 MFlops :      0.000 Sec : Test=F
       5 :       0.90 MFlops :      0.000 Sec : Test=F
       6 :       1.57 MFlops :      0.000 Sec : Test=F
       7 :       2.59 MFlops :      0.000 Sec : Test=F
       8 :       3.92 MFlops :      0.000 Sec : Test=F

and then it stalls. By contrast, running it with OPENBLAS_NUM_THREADS=1 works fine.

I used the latest version of the code forked from here, gcc 5.2.1, the board runs Ubuntu 15.10 and here is some more information:

$ uname -a
Linux odroid 3.10.96+ #1 SMP PREEMPT Wed Mar 30 11:47:52 UTC 2016 armv7l armv7l armv7l GNU/Linux

Please tell me if I can add any other information that would make this reproducible. I don't know which difference in the setup causes the deadlock, but on my laptop (Intel, 64-bit, Debian) the multithreaded dpotrf runs just fine.

I observed the same with the dcholesky benchmark, but I guess they just use the same routine.

@brada4
Contributor

brada4 commented Apr 20, 2016

Please post the completion report from the OpenBLAS build and the contents of /proc/cpuinfo.

Can you run the following diagnostics:
OPENBLAS_NUM_THREADS=2 gdb ./dpotrf.goto
gdb> r
gdb> t a a bt
gdb> t a a info reg
gdb> q
y/n> y
Please upload the full report to https://gist.github.com/ and don't paste it in the text box.

@xianyi
Collaborator

xianyi commented Apr 21, 2016

@emilienkofman, I don't have an odroid. I tested dpotrf.goto on an NVIDIA TK1 (A15) and could not reproduce this error.

@emilienkofman
Author

@brada4
Contributor

brada4 commented Apr 21, 2016

Looks like a corrupt stack. Two threads are waiting on the main thread, which is not visible in the backtrace.
Can you quickly recheck with CC=clang (if odroid has a package for that)?

@emilienkofman
Author

Ok, for some reason that won't build with clang (https://gist.github.com/emilienkofman/5c707c6af1eaddcf5587a0429de27ae2#file-build-clang). Am I missing something?

@martin-frbg
Collaborator

You found a small bug in cpuid_arm.c - or actually clang is more picky than gcc by default.
That line 77 of cpuid_arm.c should read "if( p == NULL ) return(0);"
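
The fix suggested above comes down to returning a value on every path of a function that is declared to return one. Below is a minimal sketch of that pattern, assuming the offending line was a bare `return;` inside an int-returning lookup. The function name `find_feature` and its signature are hypothetical and are not the actual code in cpuid_arm.c.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a feature lookup. `features` is the text that
 * follows "Features :" in /proc/cpuinfo, or NULL if no such line was found.
 * Returns 1 if `search` appears as a whole space-separated token, else 0.
 */
int find_feature(const char *features, const char *search)
{
    /* Assumption: the original code had a bare `return;` at this point.
     * clang rejects that in a function returning int, so return 0. */
    if (features == NULL) return 0;

    size_t len = strlen(search);
    if (len == 0) return 0;            /* nothing to look for */

    const char *p = features;
    while ((p = strstr(p, search)) != NULL) {
        /* accept the match only if it is delimited on both sides */
        int starts = (p == features) || (p[-1] == ' ');
        int ends = (p[len] == '\0') || (p[len] == ' ') || (p[len] == '\n');
        if (starts && ends) return 1;
        p += len;                      /* keep scanning after a partial hit */
    }
    return 0;
}
```

Returning 0 for the NULL case treats "no Features line" the same as "feature absent", which matches the replacement line quoted above.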

@emilienkofman
Author

Thank you, so I fixed the bug (and I will commit it if you want), but then the build fails again (see the updated gist https://gist.github.com/emilienkofman/5c707c6af1eaddcf5587a0429de27ae2#file-build-clang)

@brada4
Contributor

brada4 commented Apr 21, 2016

That makes two bugs already. The second one is a clang wontfix: https://llvm.org/bugs/show_bug.cgi?id=20424
Keep going...

@martin-frbg
Collaborator

That second one seems to live in common_arm.h line 108 (where the PROLOGUE is defined), so hopefully needs only changing that one place...

@brada4
Contributor

brada4 commented Apr 21, 2016

gcc -O0 might give usable traces

@emilienkofman
Author

Ok, so I've tried to understand how this PROLOGUE thing and the assembly all fit together. If I get rid of it, clang then fails to compile some of the kernels in kernel/arm/ because of some labels (for instance in nrm2_vfpv3.S: beq KERNEL_F1_NEXT_\@). I'm not sure I'll be able to investigate much further, as this is becoming a bit complex! Is clang+arm supposed to be a valid setup for building OpenBLAS, or only gcc+arm?

@martin-frbg
Collaborator

Guess I should have been more verbose, sorry. You need to delete only the line with the ".func" on it from the PROLOGUE section definition in common_arm.h, not the entire definition. Seems nobody tried to build with clang on arm before, or everybody who ran into these traps kept quiet about it.

@emilienkofman
Author

Ah ok, that was actually what I was thinking (that nobody had tried, or nobody had reported anything about it). I understood that I needed to remove the .func line when compiling with clang, but then other errors pop up, which is why I said it might take a lot of fixing that I might not be able to do! Anyway, I'll try to investigate a bit more later and report back if I manage to build with clang. Thanks for your help.

@brada4
Contributor

brada4 commented Apr 21, 2016

gcc at least builds something. Try it with COMMON_OPT=-O0.
Maybe that makes the output good enough for gdb to see all threads.

@martin-frbg
Collaborator

I do not think any of us wants to trick you into debugging the clang build. :-)
You could also try brada4's other suggestion and retry the gcc build with the
COMMON_OPT in Makefile.rule uncommented and set to "-O0" in the hope that
avoiding compiler optimizations leads to "better" traces

@brada4
Contributor

brada4 commented Apr 22, 2016

Your kernel is reporting a bad CPUID.
Your CPU uses this technology:
https://en.wikipedia.org/wiki/ARM_big.LITTLE
It runs in configuration No. 3 and falsely claims that the 4 A7 cores support vfp4-d32, when they actually support only -d16, which OpenBLAS does not currently handle.
To fix your system, either switch to configuration 1 or configuration 2 (see the pictures on Wikipedia), or move to a system that reports the correct, differing CPUIDs for each core type, as the Linux kernel on expensive Samsung Galaxy phones does.
As further motivation to switch to a proper kernel: your sample did not fail with invalid-instruction traps, although it should have.
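
One way to see what the kernel reports for each core is to compare the "CPU part" fields in /proc/cpuinfo. The sketch below is only an illustration of that check and is not part of OpenBLAS; the function name `distinct_cpu_parts` is made up here. It relies on the ARM part numbers 0xc07 (Cortex-A7) and 0xc0f (Cortex-A15), and it assumes the kernel prints a "CPU part" line per processor. If the kernel reports the same value for every core, as described above, this check cannot tell the clusters apart.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* ARM "CPU part" values as printed in /proc/cpuinfo */
#define PART_CORTEX_A7  0xc07
#define PART_CORTEX_A15 0xc0f

/*
 * Scan /proc/cpuinfo-style text and collect the distinct "CPU part" values
 * in order of first appearance. At most max_parts values are stored in
 * `parts`. Returns the number of values stored.
 */
int distinct_cpu_parts(const char *text, unsigned long *parts, int max_parts)
{
    int count = 0;
    const char *line = text;

    while (line != NULL && *line != '\0') {
        const char *eol = strchr(line, '\n');

        if (strncmp(line, "CPU part", 8) == 0) {
            const char *colon = strchr(line, ':');
            /* only accept a colon that belongs to this line */
            if (colon != NULL && (eol == NULL || colon < eol)) {
                /* base 0 lets strtoul parse the "0x..." hex form */
                unsigned long part = strtoul(colon + 1, NULL, 0);
                int seen = 0;
                for (int i = 0; i < count; i++)
                    if (parts[i] == part) seen = 1;
                if (!seen && count < max_parts)
                    parts[count++] = part;
            }
        }
        line = (eol != NULL) ? eol + 1 : NULL;   /* advance to next line */
    }
    return count;
}
```

To use it, read /proc/cpuinfo into a NUL-terminated buffer and pass that buffer in. A result greater than 1 means the kernel exposes more than one core type, and the stored values can be compared against `PART_CORTEX_A7` and `PART_CORTEX_A15`.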
