-
Notifications
You must be signed in to change notification settings - Fork 1.6k
0.3.5 regression #1954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
That commit was made in response to #1720 (comment) (and tested against amurzeau's example (gist.github.com link) from that post. Probably the difference in your code is that it (somehow) tells openblas to shutdown all threads but does not dlclose the library, and the next call goes through some code path that expects the TLS key to have been set up already (which used to work previously when that was never deleted) |
That makes sense. So should I try to dlclose the library on my end manually, or the logic inside OpenBlas itself could be improved? |
Probably the logic inside OpenBLAS is wrong, the whole TLS stuff is still a bit fragile and what I did in my attempts to fix it may have made it worse. It would be good to know where the code crashes and what happened before (e.g. a fork() would also trigger a temporary shutdown that would release the TLS key) |
This is the error I get if I run it in gdb:
and the backtrace:
Not sure how informative it is. Looks like it's a bad interaction with PyThreadState_Delete from python. Appreciate any comments. |
Found https://bitbucket.org/cffi/cffi/issues/362/crash-on-thread-destruction-in which looks related, but not sure what to make of this... |
Would it make sense to move pthread_key_delete inside goto_blas_quit instead? That would be closer to the fix suggested on the cups issue https://bugzilla.redhat.com/show_bug.cgi?id=1065695. I've tried that, and my code doesn't segfault anymore. I've compiled @amurzeau test_case, but it runs fine for me even on 0.3.4, so I can't tell if that would still fix the original issue. |
Hmm. Where exactly did you place it ? gotoblas_quit() calls blas_memory_cleanup() via blas_shutdown() so I would naively assume that the behaviour stays the same, but it seems I overlooked some code path where it would get called prematurely ? |
I've put it right after blas_shutdown, as so: diff --git a/driver/others/memory.c b/driver/others/memory.c
index 6f7a7db8..5dc5f3d1 100644
--- a/driver/others/memory.c
+++ b/driver/others/memory.c
@@ -1073,11 +1073,6 @@ static volatile int memory_initialized = 0;
}
free(table);
}
-#if defined(OS_WINDOWS)
- TlsFree(local_storage_key);
-#else
- pthread_key_delete(local_storage_key);
-#endif
}
static void blas_memory_init(){
@@ -1490,7 +1485,10 @@ void DESTRUCTOR gotoblas_quit(void) {
if (gotoblas_initialized == 0) return;
blas_shutdown();
-
+ int test = pthread_key_delete(local_storage_key);
+ if(test != 0 ) {
+ printf("%d\n", test);
+ }
#ifdef PROFILE
moncontrol (0);
#endif The behaviour is the same if dlclose is called, but if threads get closed for some other reason, wouldn't blas_memory_cleanup be called as well? I'm not familiar with C thread programming, so I admit I'm shooting in the dark here... All I can say is that I can reliably reproduce the crash without this patch, and this fixes it for me. Does amurzeau test case fails for you on 0.3.4? I can't reproduce the failure, so I can't tell if this still would fix the original issue. |
Thanks. I am not that familiar with thread programming either, and in particular the tls stuff is new to me. You are probably right about the "threads getting closed for other reasons" scenario, though the only one that I can think of is in preparation of a fork(). (And I am fairly sure that I tried to check via printf's that the pthread_key_delete does not get called too early - quite obviously this did not work for whatever reason). At first glance your patch does solve the problem, I just need to countercheck that I still get the crash when I remove the pthread_key_delete completely... |
@martin-frbg Would you welcome some refactoring/cleanup of |
Well, I don't think this was my problem after all since I did not actually compile with |
I do certainly welcome any help, in particular with the TLS stuff as it seems that never got a chance to live up to its promises (and the original contributor appears to have dropped out). |
Yeah, I was getting really confused jumping around it until I realized that. That was the main thing I want to fix :) I'll see what I can do with it when I have a chance. Right now I'm working on one other bit of refactoring I've been meaning to do for a while... |
Of course this could/should be handled at the (c)makefile level, I had not expected this to be more than a short-term workaround. (It is probably better to open a new issue if/when you want to discuss refactoring) |
I'm getting segfaults with the newly released 0.3.5 for code that was running fine on 0.3.4. I've bisected it to this commit: bba1e67. I can try to debug further and get a small reproducing example but I wanted to start a bug report in case someone has an idea about what's causing it.
My code consists in a C library that I call using ctypes in python. The C code calls some openblas functions inside some openmp loops. Openblas is compiled with USE_TLS=1, USE_OPENMP=0.
I've added a printf here: https://github.com/xianyi/OpenBLAS/blob/develop/driver/others/memory.c#L1079 to see the error code. What I observe is that it will print a bunch of 22 (I assume one per thread), and then the program crashes a couple seconds later.
I can also reproduce the crash if I run scipy test suite.
The text was updated successfully, but these errors were encountered: