- Sponsor
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
Description
This may have been fixed in #41441, but I don't have a reproducible and figured another hang example with some call stacks may add information. My issue is spurious, but occasionally I get hangs running (multiple independent) Flux fitting tasks over multiple threads. I don't believe it's directly Flux related, just that this use case puts a lot of stress on the GC. I'm running 1.6.2.
From LLDB attached after hang, threads overview:
* thread #1, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d14cb0c20 libjulia-internal.so.1`jl_safepoint_wait_gc at safepoint.c:162
thread #2, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15a3edb2 libc.so.6`__sigtimedwait + 162
libc.so.6`__sigtimedwait:
-> 0x7f4d15a3edb2 <+162>: testq %rbx, %rbx
0x7f4d15a3edb5 <+165>: setne %dl
0x7f4d15a3edb8 <+168>: cmpq $-0x1000, %rax ; imm = 0xF000
0x7f4d15a3edbe <+174>: ja 0x7f4d15a3ee00 ; <+240>
thread #3, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #4, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #5, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #6, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #7, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #8, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #9, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #10, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #11, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #12, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #13, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #14, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #15, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d14cb0c20 libjulia-internal.so.1`jl_safepoint_wait_gc at safepoint.c:162
thread #16, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d14ca70e0 libjulia-internal.so.1`jl_gc_collect at gc.c:205
thread #17, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #18, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #19, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d14cb0c20 libjulia-internal.so.1`jl_safepoint_wait_gc at safepoint.c:162
thread #20, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #21, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #22, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #23, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #24, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #25, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #26, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #27, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d14cb0c26 libjulia-internal.so.1`jl_safepoint_wait_gc at safepoint.c:162
thread #28, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #29, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d14cb0c36 libjulia-internal.so.1`jl_safepoint_wait_gc [inlined] _mm_pause at xmmintrin.h:1267
thread #30, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d14cb0c36 libjulia-internal.so.1`jl_safepoint_wait_gc [inlined] _mm_pause at xmmintrin.h:1267
thread #31, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #32, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #33, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d14cb0c20 libjulia-internal.so.1`jl_safepoint_wait_gc at safepoint.c:162
thread #34, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #35, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #36, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #37, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #38, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #39, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #40, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #41, name = 'ZMQbg/Reaper', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15b1509e libc.so.6`epoll_wait + 94
libc.so.6`epoll_wait:
-> 0x7f4d15b1509e <+94>: cmpq $-0x1000, %rax ; imm = 0xF000
0x7f4d15b150a4 <+100>: ja 0x7f4d15b150d8 ; <+152>
0x7f4d15b150a6 <+102>: movl %r8d, %edi
0x7f4d15b150a9 <+105>: movl %eax, 0xc(%rsp)
thread #42, name = 'ZMQbg/IO/0', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15b1509e libc.so.6`epoll_wait + 94
libc.so.6`epoll_wait:
-> 0x7f4d15b1509e <+94>: cmpq $-0x1000, %rax ; imm = 0xF000
0x7f4d15b150a4 <+100>: ja 0x7f4d15b150d8 ; <+152>
0x7f4d15b150a6 <+102>: movl %r8d, %edi
0x7f4d15b150a9 <+105>: movl %eax, 0xc(%rsp)
thread #43, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15b0922f libc.so.6`__poll + 79
libc.so.6`__poll:
-> 0x7f4d15b0922f <+79>: cmpq $-0x1000, %rax ; imm = 0xF000
0x7f4d15b09235 <+85>: ja 0x7f4d15b09268 ; <+136>
0x7f4d15b09237 <+87>: movl %r8d, %edi
0x7f4d15b0923a <+90>: movl %eax, 0x8(%rsp)
thread #44, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #45, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #46, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #47, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #48, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
thread #49, name = 'julia', stop reason = signal SIGSTOP
frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
Threads that are in safepoint wait or GC I assume are playing nice.
Thread 2 looks like a signal watcher I think:
* thread #2, name = 'julia', stop reason = signal SIGSTOP
* frame #0: 0x00007f4d15a3edb2 libc.so.6`__sigtimedwait + 162
frame #1: 0x00007f4d14cb03e3 libjulia-internal.so.1`signal_listener(arg=<unavailable>) at signals-unix.c:615
frame #2: 0x00007f4d15bf2450 libpthread.so.0`start_thread(arg=0x00007f4d0f838640) at pthread_create.c:473:8
frame #3: 0x00007f4d15b14d53 libc.so.6`__clone + 67
A bunch of threads are in a futex wait, which appears to be in the task switching code. Example stack:
* thread #3, name = 'julia', stop reason = signal SIGSTOP
* frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
frame #1: 0x00007f4d15bff2d0 libpthread.so.0`__GI___futex_abstimed_wait_cancelable64(futex_word=0x00007f4cfe161000, expected=0, clockid=<unavailable>, abstime=0x0000000000000000, private=<unavailable>) at futex-internal.c:123
frame #2: 0x00007f4d15bf8540 libpthread.so.0`__pthread_cond_wait at pthread_cond_wait.c:504:10
frame #3: 0x00007f4d15bf8460 libpthread.so.0`__pthread_cond_wait(cond=0x00007f4cfe160fd8, mutex=0x00007f4cfe160fb0) at pthread_cond_wait.c:628
frame #4: 0x00007f4d14d86618 libjulia-internal.so.1`uv_cond_wait(cond=0x00007f4cfe160fd8, mutex=0x00007f4cfe160fb0) at thread.c:847
frame #5: 0x00007f4d14c9c52c libjulia-internal.so.1`jl_task_get_next(trypoptask=0x00007f4d05e00690, q=0x00007f45e08b8bb0) at partr.c:508
frame #6: 0x00007f4d03a08d2e sys.so`julia_poptask_30855 at task.jl:760
frame #7: 0x00007f4d03b246b5 sys.so`julia_wait_44039 at task.jl:768
frame #8: 0x00007f4c04091c2a
frame #9: 0x00007f4c04092893
frame #10: 0x00007f4c04092d16
frame #11: 0x00007f4d14c73c9b libjulia-internal.so.1`start_task [inlined] jl_apply(nargs=1, args=0x00007f4ab8ea02e0) at julia.h:1703
frame #12: 0x00007f4d14c73c87 libjulia-internal.so.1`start_task at task.c:839
Perhaps these are the threads that are failing to reach a safepoint? If that's the case, may well be fixed by #41441, and happy to close as such if others can confirm the likelihood of that.
Activity
vtjnash commentedon Sep 23, 2021
Are any of the threads unblocked? Seems all are stopped on something, so #41441 would not be likely to fix it.
ancapdev commentedon Sep 23, 2021
The background threads are running their usual loops I think, but the Julia worker threads seem to be all stuck, yes. I still have the process up if there's any particular lldb commands you suggest I run to examine something.
vtjnash commentedon Sep 23, 2021
Which thread was running (not sleeping or waiting for GC to finish)? I see thread 16 is doing GC, but what thread is it waiting for?
ancapdev commentedon Sep 23, 2021
Also for reference, the top level concurrent construct is a ThreadPools.jl
qforeach()
ancapdev commentedon Sep 23, 2021
The main thread spawns that parallel for each, and the main thread is unresponsive waiting for this execution to finish. I'm not sure which specific thread is failing to reach the safepoint, I thought all of the ones spinning on the task scheduler code.
ancapdev commentedon Sep 23, 2021
Actually, those are all after the
jl_gc_safe_enter()
statement https://github.com/JuliaLang/julia/blob/v1.6.2/src/partr.c#L505, so tat means GC shouldn't be waiting for them, correct?
ancapdev commentedon Sep 23, 2021
I think I've spotted it, and it's not a Julia runtime issue. Sorry for the wild goose chase.
I'm using Mongoc and have a client pool with connections, capped at a lower number than the number of concurrent threads / tasks. If GC happens while the pool is exhausted and a task is waiting for a connection (internal to libmongoc), it can deadlock since a thread is now stuck in the C code and can't reach a safepoint.
Is there a common pattern or way to insert a GC safe region around a ccall? This kind of resource management issue is probably a more general problem.
vtjnash commentedon Sep 23, 2021
There's some support for it, but I am not sure if it is enabled in any way
tkf commentedon Sep 23, 2021
I wonder if you can use
@threadcall
https://docs.julialang.org/en/v1/manual/multi-threading/#@threadcall
https://docs.julialang.org/en/v1/base/multi-threading/#Base.@threadcall
?
ancapdev commentedon Sep 24, 2021
@threadcall
was on my list of solutions. Can't do it just on its own though, I think you need to limit the number of concurrent@threadcall
operations since they can block and hog the thread pool threads. If there was some public API for what I presume the Julia runtime does internally to enter and leave GC safe regions that would be ideal.ancapdev commentedon Sep 24, 2021
I created a simple reproducible for
Mongoc
, and tried using@threadcall
for the resource acquisition. My test program producesconcurrency violation
, and I'm not sure why. Any pointers appreciated. Details here: felipenoris/Mongoc.jl#84 (comment)tkf commentedon Sep 24, 2021
By the way, if
ccall
ofmongoc_client_pool_pop
is the cause of the deadlock, does it mean that it needs other Julia task to set some conditional variables or something in libmongoc? Ifmongoc_client_pool_pop
can finish its thing without the help of other threads, wouldn't theccall
eventually ends and hit a safepoint?If a
ccall
itself won't deadlock, for it to cause a deadlock with interaction with GC, I think it needs some thing likeIf there's a pattern like that, you can just insert
GC.safepoint
The manual says the pool for
@threadcall
is 4 by default and you can set it viaUV_THREADPOOL_SIZE
. So, if you rely on@threadcall
to it's not really a good solution since other packages could be using@threadcall
and you don't know if the user setsUV_THREADPOOL_SIZE
to be smaller than you need. I think Go's scheduler has an unbounded thread pool specifically for external "ccall
" for this reason.ancapdev commentedon Sep 27, 2021
It requires a resource to be released by another part of the program, so if this part is stuck in GC it will deadlock. You can see my reproducible in felipenoris/Mongoc.jl#84
tkf commentedon Sep 27, 2021
Ah, I see. Makes sense.