0.3.5 regression #1954

Closed
thrasibule opened this issue Jan 8, 2019 · 14 comments · Fixed by #1957

@thrasibule
Contributor

I'm getting segfaults with the newly released 0.3.5 for code that was running fine on 0.3.4. I've bisected it to this commit: bba1e67. I can try to debug further and get a small reproducing example but I wanted to start a bug report in case someone has an idea about what's causing it.

My code consists of a C library that I call from Python using ctypes. The C code calls some OpenBLAS functions inside OpenMP loops. OpenBLAS is compiled with USE_TLS=1, USE_OPENMP=0.
I've added a printf here: https://github.com/xianyi/OpenBLAS/blob/develop/driver/others/memory.c#L1079 to see the error code. What I observe is that it will print a bunch of 22 (I assume one per thread), and then the program crashes a couple seconds later.

I can also reproduce the crash if I run the SciPy test suite.

@martin-frbg
Collaborator

That commit was made in response to #1720 (comment) and tested against amurzeau's example (gist.github.com link) from that post. Probably the difference in your code is that it (somehow) tells OpenBLAS to shut down all threads but does not dlclose the library, and the next call goes through some code path that expects the TLS key to have been set up already (which used to work when the key was never deleted).
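
To make the failure mode above concrete, here is a minimal sketch of what happens when a pthread TLS key is used after it has been deleted. It is not OpenBLAS code: the helper name `use_key_after_delete` is hypothetical, and the sketch assumes Linux with glibc, where `pthread_key_t` is an unsigned int. POSIX leaves the use of a deleted key undefined; glibc reports EINVAL, which is 22 on Linux and would match the value printed by the reporter's added printf.

```python
import ctypes
import errno

# Assumption: Linux with glibc, where pthread_key_t is an unsigned int and the
# pthread functions can be resolved from the already-loaded process image.
libc = ctypes.CDLL(None)
libc.pthread_key_create.argtypes = [ctypes.POINTER(ctypes.c_uint), ctypes.c_void_p]
libc.pthread_key_delete.argtypes = [ctypes.c_uint]
libc.pthread_setspecific.argtypes = [ctypes.c_uint, ctypes.c_void_p]


def use_key_after_delete():
    """Create a TLS key, delete it, then keep using the stale key.

    Returns the four return codes in call order.
    """
    key = ctypes.c_uint()
    created = libc.pthread_key_create(ctypes.byref(key), None)
    deleted = libc.pthread_key_delete(key.value)
    # POSIX leaves the next two calls undefined; glibc reports EINVAL.
    stale_set = libc.pthread_setspecific(key.value, None)
    stale_delete = libc.pthread_key_delete(key.value)
    return created, deleted, stale_set, stale_delete


if __name__ == "__main__":
    print(use_key_after_delete(), "EINVAL =", errno.EINVAL)
```

Any code path that still holds the old key after a shutdown ends up in the second half of this function, which is why the key needs to stay valid for as long as the library can be called again.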

@thrasibule
Contributor Author

That makes sense. So should I try to dlclose the library manually on my end, or could the logic inside OpenBLAS itself be improved?

@martin-frbg
Collaborator

Probably the logic inside OpenBLAS is wrong; the whole TLS stuff is still a bit fragile, and what I did in my attempts to fix it may have made it worse. It would be good to know where the code crashes and what happened before (e.g. a fork() would also trigger a temporary shutdown that would release the TLS key).
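
The fork() scenario can be illustrated by analogy with Python's own fork hooks. This is only a sketch of the ordering, assuming the library's shutdown is hooked in to run just before fork() as the comment suggests; the event strings and the helper `fork_and_record` are hypothetical and nothing here touches OpenBLAS.

```python
import os

events = []

# Analogy only: "before" plays the role of a pre-fork hook that shuts worker
# threads down, "after_in_parent" the point where the parent keeps running.
os.register_at_fork(
    before=lambda: events.append("before fork: worker threads shut down"),
    after_in_parent=lambda: events.append("parent: library is used again"),
)


def fork_and_record():
    """Fork once, reap the child, and return the parent-side event log."""
    events.clear()
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # the child exits immediately
    os.waitpid(pid, 0)
    return list(events)


if __name__ == "__main__":
    for line in fork_and_record():
        print(line)
```

The parent process continues to call into the library after the pre-fork hook has run, so anything released in that hook (such as a TLS key) must be recreated before the next use.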

@thrasibule
Contributor Author

This is the error I get if I run it in gdb:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7ffff4361700 (LWP 2919)]
[New Thread 0x7fffebfff700 (LWP 2920)]
[New Thread 0x7ffff1b5f700 (LWP 2921)]
[New Thread 0x7ffff135e700 (LWP 2922)]
[New Thread 0x7ffff0b5d700 (LWP 2923)]
[New Thread 0x7fffe97fd700 (LWP 2924)]
[New Thread 0x7fffe8ffc700 (LWP 2925)]
[Thread 0x7ffff135e700 (LWP 2922) exited]
[Thread 0x7ffff0b5d700 (LWP 2923) exited]
[Thread 0x7ffff4361700 (LWP 2919) exited]
[Thread 0x7fffe97fd700 (LWP 2924) exited]
[Thread 0x7fffe8ffc700 (LWP 2925) exited]
[Thread 0x7ffff1b5f700 (LWP 2921) exited]
[Thread 0x7fffebfff700 (LWP 2920) exited]
[Detaching after fork from child process 2926]
[Detaching after fork from child process 2928]
[Detaching after fork from child process 2929]
[Detaching after fork from child process 2930]
[Detaching after fork from child process 2931]
[New Thread 0x7fffe8ffc700 (LWP 2932)]
[New Thread 0x7fffe97fd700 (LWP 2933)]
[New Thread 0x7ffff0b5d700 (LWP 2934)]
[Thread 0x7ffff0b5d700 (LWP 2934) exited]
[Thread 0x7fffe97fd700 (LWP 2933) exited]
[Thread 0x7fffe8ffc700 (LWP 2932) exited]
[New Thread 0x7fffe97fd700 (LWP 2935)]
[New Thread 0x7ffff0b5d700 (LWP 2936)]
[New Thread 0x7fffe8ffc700 (LWP 2937)]
[New Thread 0x7ffff135e700 (LWP 2938)]
[New Thread 0x7fffb79bf700 (LWP 2939)]
[New Thread 0x7fffb71be700 (LWP 2940)]
[New Thread 0x7fffb69bd700 (LWP 2941)]
[New Thread 0x7fffb617c700 (LWP 2942)]
[New Thread 0x7fffad97b700 (LWP 2943)]
[New Thread 0x7fffb597b700 (LWP 2944)]
[New Thread 0x7fffb517a700 (LWP 2945)]
[New Thread 0x7fffb4979700 (LWP 2946)]
[New Thread 0x7fffad17a700 (LWP 2947)]
[New Thread 0x7fffac979700 (LWP 2948)]
Fatal Python error: PyThreadState_Delete: NULL interp

Thread 0x00007ffff78c1600 (most recent call first):
  File "/usr/lib/python3.7/multiprocessing/popen_fork.py", line 70 in _launch
  File "/usr/lib/python3.7/multiprocessing/popen_fork.py", line 20 in __init__
  File "/usr/lib/python3.7/multiprocessing/context.py", line 277 in _Popen
  File "/usr/lib/python3.7/multiprocessing/process.py", line 112 in start
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 241 in _repopulate_pool
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 176 in __init__
  File "/usr/lib/python3.7/multiprocessing/context.py", line 119 in Pool
  File "/home/guillaume/projects/code/python/analytics/index_data.py", line 181 in build_curves_dist
  File "/home/guillaume/projects/code/python/anal
Thread 23 "python" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffb4979700 (LWP 2946)]
0x00007ffff7dffd7f in raise () from /usr/lib/libc.so.6

and the backtrace:

(gdb) bt
#0  0x00007ffff7dffd7f in raise () from /usr/lib/libc.so.6
#1  0x00007ffff7dea672 in abort () from /usr/lib/libc.so.6
#2  0x00007ffff7aba246 in ?? ()
   from /usr/lib/libpython3.7m.so.1.0
#3  0x00007ffff7aba2a5 in Py_FatalError ()
   from /usr/lib/libpython3.7m.so.1.0
#4  0x00007ffff7abc465 in ?? ()
   from /usr/lib/libpython3.7m.so.1.0
#5  0x00007ffff1ef86f2 in ?? ()
   from /usr/lib/python3.7/site-packages/_cffi_backend.cpython-37m-x86_64-linux-gnu.so
#6  0x00007ffff7f92c51 in __nptl_deallocate_tsd.part.8 ()
   from /usr/lib/libpthread.so.0
#7  0x00007ffff7f93abf in start_thread ()
   from /usr/lib/libpthread.so.0
#8  0x00007ffff7ec3b23 in clone () from /usr/lib/libc.so.6

Not sure how informative it is. It looks like a bad interaction with PyThreadState_Delete from Python. I'd appreciate any comments.
For my particular case, I can export OPENBLAS_NUM_THREADS=1 as a workaround and it works fine.
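
For callers that load the library from Python, the same workaround can be applied in-process, as long as the variable is set before OpenBLAS is loaded. A minimal sketch, where `limit_openblas_threads` is a hypothetical helper name:

```python
import os


def limit_openblas_threads(n=1):
    """Set OPENBLAS_NUM_THREADS for libraries loaded after this call."""
    os.environ["OPENBLAS_NUM_THREADS"] = str(n)
    return os.environ["OPENBLAS_NUM_THREADS"]


if __name__ == "__main__":
    # Call this before importing numpy/scipy or loading the C library via
    # ctypes; OpenBLAS is assumed to read the variable when it initializes.
    print(limit_openblas_threads(1))
```

This only sidesteps the crash by keeping OpenBLAS single-threaded; it does not address the TLS key handling discussed in this issue.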

@martin-frbg
Collaborator

Found https://bitbucket.org/cffi/cffi/issues/362/crash-on-thread-destruction-in which looks related, but not sure what to make of this...

@thrasibule
Contributor Author

thrasibule commented Jan 9, 2019

Would it make sense to move pthread_key_delete inside gotoblas_quit instead? That would be closer to the fix suggested on the cups issue https://bugzilla.redhat.com/show_bug.cgi?id=1065695. I've tried that, and my code doesn't segfault anymore. I've compiled @amurzeau's test case, but it runs fine for me even on 0.3.4, so I can't tell whether that would still fix the original issue.

@martin-frbg
Collaborator

Hmm. Where exactly did you place it? gotoblas_quit() calls blas_memory_cleanup() via blas_shutdown(), so I would naively assume that the behaviour stays the same, but it seems I overlooked some code path where it would get called prematurely?

@thrasibule
Contributor Author

I've put it right after blas_shutdown, like so:

diff --git a/driver/others/memory.c b/driver/others/memory.c
index 6f7a7db8..5dc5f3d1 100644
--- a/driver/others/memory.c
+++ b/driver/others/memory.c
@@ -1073,11 +1073,6 @@ static volatile int memory_initialized = 0;
     }
     free(table);
   }
-#if defined(OS_WINDOWS)
-  TlsFree(local_storage_key);
-#else
-  pthread_key_delete(local_storage_key);
-#endif		
 }
 
 static void blas_memory_init(){
@@ -1490,7 +1485,10 @@ void DESTRUCTOR gotoblas_quit(void) {
   if (gotoblas_initialized == 0) return;
 
   blas_shutdown();
-
+  int test = pthread_key_delete(local_storage_key);
+  if(test != 0 ) {
+      printf("%d\n", test);
+  }
 #ifdef PROFILE
    moncontrol (0);
 #endif

The behaviour is the same if dlclose is called, but if threads get closed for some other reason, wouldn't blas_memory_cleanup be called as well? I'm not familiar with C thread programming, so I admit I'm shooting in the dark here... All I can say is that I can reliably reproduce the crash without this patch, and this fixes it for me. Does amurzeau's test case fail for you on 0.3.4? I can't reproduce the failure, so I can't tell whether this would still fix the original issue.

@martin-frbg
Collaborator

Thanks. I am not that familiar with thread programming either, and the TLS stuff in particular is new to me. You are probably right about the "threads getting closed for other reasons" scenario, though the only one I can think of is in preparation for a fork(). (And I am fairly sure that I tried to check via printfs that pthread_key_delete does not get called too early; quite obviously this did not work, for whatever reason.) At first glance your patch does solve the problem; I just need to double-check that I still get the crash when I remove the pthread_key_delete completely...

@embray
Contributor

embray commented Feb 1, 2019

@martin-frbg Would you welcome some refactoring/cleanup of driver/others/memory.c? I think I may have been hit by this bug as well (still working on confirming), and looking at that file it's quite a mess (which I don't say as a judgment, it just is what it is). I might be able to help with some of the TLS stuff too--though I don't consider myself by any means an expert I do have experience with it (as coauthor of PEP-0539).

@embray
Contributor

embray commented Feb 1, 2019

Well, I don't think this was my problem after all since I did not actually compile with USE_TLS=1 so I guess it's something else. But my question still stands :)

@martin-frbg
Collaborator

I do certainly welcome any help, in particular with the TLS stuff as it seems that never got a chance to live up to its promises (and the original contributor appears to have dropped out).
The current mess in memory.c is a combination of my attempts to make #1739 work, followed by my resignedly putting back the original, imperfect but somewhat more stable code from before 0.3.1. So memory.c is actually two different versions of the same file rolled into one with a big #ifdef USE_TLS around it.

@embray
Contributor

embray commented Feb 1, 2019

> So memory.c is actually two different versions of the same file rolled into one with a big #ifdef USE_TLS around it.

Yeah, I was getting really confused jumping around it until I realized that. That was the main thing I want to fix :) I'll see what I can do with it when I have a chance. Right now I'm working on one other bit of refactoring I've been meaning to do for a while...

@martin-frbg
Collaborator

Of course this could/should be handled at the (c)makefile level, I had not expected this to be more than a short-term workaround. (It is probably better to open a new issue if/when you want to discuss refactoring)
