Skip to content

GC hang waiting for threads #42364

@ancapdev

Description

@ancapdev
Contributor

This may have been fixed in #41441, but I don't have a reproducible and figured another hang example with some call stacks may add information. My issue is spurious, but occasionally I get hangs running (multiple independent) Flux fitting tasks over multiple threads. I don't believe it's directly Flux related, just that this use case puts a lot of stress on the GC. I'm running 1.6.2.

From LLDB attached after hang, threads overview:

* thread #1, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d14cb0c20 libjulia-internal.so.1`jl_safepoint_wait_gc at safepoint.c:162
  thread #2, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15a3edb2 libc.so.6`__sigtimedwait + 162
libc.so.6`__sigtimedwait:
->  0x7f4d15a3edb2 <+162>: testq  %rbx, %rbx
    0x7f4d15a3edb5 <+165>: setne  %dl
    0x7f4d15a3edb8 <+168>: cmpq   $-0x1000, %rax            ; imm = 0xF000 
    0x7f4d15a3edbe <+174>: ja     0x7f4d15a3ee00            ; <+240>
  thread #3, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #4, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #5, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #6, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #7, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #8, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #9, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #10, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #11, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #12, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #13, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #14, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #15, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d14cb0c20 libjulia-internal.so.1`jl_safepoint_wait_gc at safepoint.c:162
  thread #16, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d14ca70e0 libjulia-internal.so.1`jl_gc_collect at gc.c:205
  thread #17, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #18, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #19, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d14cb0c20 libjulia-internal.so.1`jl_safepoint_wait_gc at safepoint.c:162
  thread #20, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #21, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #22, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #23, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #24, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #25, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #26, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #27, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d14cb0c26 libjulia-internal.so.1`jl_safepoint_wait_gc at safepoint.c:162
  thread #28, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #29, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d14cb0c36 libjulia-internal.so.1`jl_safepoint_wait_gc [inlined] _mm_pause at xmmintrin.h:1267
  thread #30, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d14cb0c36 libjulia-internal.so.1`jl_safepoint_wait_gc [inlined] _mm_pause at xmmintrin.h:1267
  thread #31, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #32, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #33, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d14cb0c20 libjulia-internal.so.1`jl_safepoint_wait_gc at safepoint.c:162
  thread #34, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #35, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #36, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #37, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #38, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #39, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #40, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #41, name = 'ZMQbg/Reaper', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15b1509e libc.so.6`epoll_wait + 94
libc.so.6`epoll_wait:
->  0x7f4d15b1509e <+94>:  cmpq   $-0x1000, %rax            ; imm = 0xF000 
    0x7f4d15b150a4 <+100>: ja     0x7f4d15b150d8            ; <+152>
    0x7f4d15b150a6 <+102>: movl   %r8d, %edi
    0x7f4d15b150a9 <+105>: movl   %eax, 0xc(%rsp)
  thread #42, name = 'ZMQbg/IO/0', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15b1509e libc.so.6`epoll_wait + 94
libc.so.6`epoll_wait:
->  0x7f4d15b1509e <+94>:  cmpq   $-0x1000, %rax            ; imm = 0xF000 
    0x7f4d15b150a4 <+100>: ja     0x7f4d15b150d8            ; <+152>
    0x7f4d15b150a6 <+102>: movl   %r8d, %edi
    0x7f4d15b150a9 <+105>: movl   %eax, 0xc(%rsp)
  thread #43, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15b0922f libc.so.6`__poll + 79
libc.so.6`__poll:
->  0x7f4d15b0922f <+79>: cmpq   $-0x1000, %rax            ; imm = 0xF000 
    0x7f4d15b09235 <+85>: ja     0x7f4d15b09268            ; <+136>
    0x7f4d15b09237 <+87>: movl   %r8d, %edi
    0x7f4d15b0923a <+90>: movl   %eax, 0x8(%rsp)
  thread #44, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #45, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #46, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #47, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #48, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
  thread #49, name = 'julia', stop reason = signal SIGSTOP
    frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11

Threads that are in safepoint wait or GC I assume are playing nice.

Thread 2 looks like a signal watcher I think:

* thread #2, name = 'julia', stop reason = signal SIGSTOP
  * frame #0: 0x00007f4d15a3edb2 libc.so.6`__sigtimedwait + 162
    frame #1: 0x00007f4d14cb03e3 libjulia-internal.so.1`signal_listener(arg=<unavailable>) at signals-unix.c:615
    frame #2: 0x00007f4d15bf2450 libpthread.so.0`start_thread(arg=0x00007f4d0f838640) at pthread_create.c:473:8
    frame #3: 0x00007f4d15b14d53 libc.so.6`__clone + 67

A bunch of threads are in a futex wait, which appears to be in the task switching code. Example stack:

* thread #3, name = 'julia', stop reason = signal SIGSTOP
  * frame #0: 0x00007f4d15bff31e libpthread.so.0`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:74:11
    frame #1: 0x00007f4d15bff2d0 libpthread.so.0`__GI___futex_abstimed_wait_cancelable64(futex_word=0x00007f4cfe161000, expected=0, clockid=<unavailable>, abstime=0x0000000000000000, private=<unavailable>) at futex-internal.c:123
    frame #2: 0x00007f4d15bf8540 libpthread.so.0`__pthread_cond_wait at pthread_cond_wait.c:504:10
    frame #3: 0x00007f4d15bf8460 libpthread.so.0`__pthread_cond_wait(cond=0x00007f4cfe160fd8, mutex=0x00007f4cfe160fb0) at pthread_cond_wait.c:628
    frame #4: 0x00007f4d14d86618 libjulia-internal.so.1`uv_cond_wait(cond=0x00007f4cfe160fd8, mutex=0x00007f4cfe160fb0) at thread.c:847
    frame #5: 0x00007f4d14c9c52c libjulia-internal.so.1`jl_task_get_next(trypoptask=0x00007f4d05e00690, q=0x00007f45e08b8bb0) at partr.c:508
    frame #6: 0x00007f4d03a08d2e sys.so`julia_poptask_30855 at task.jl:760
    frame #7: 0x00007f4d03b246b5 sys.so`julia_wait_44039 at task.jl:768
    frame #8: 0x00007f4c04091c2a
    frame #9: 0x00007f4c04092893
    frame #10: 0x00007f4c04092d16
    frame #11: 0x00007f4d14c73c9b libjulia-internal.so.1`start_task [inlined] jl_apply(nargs=1, args=0x00007f4ab8ea02e0) at julia.h:1703
    frame #12: 0x00007f4d14c73c87 libjulia-internal.so.1`start_task at task.c:839

Perhaps these are the threads that are failing to reach a safepoint? If that's the case, may well be fixed by #41441, and happy to close as such if others can confirm the likelihood of that.

Activity

vtjnash

vtjnash commented on Sep 23, 2021

@vtjnash
SponsorMember

Are any of the threads unblocked? Seems all are stopped on something, so #41441 would not be likely to fix it.

ancapdev

ancapdev commented on Sep 23, 2021

@ancapdev
ContributorAuthor

The background threads are running their usual loops I think, but the Julia worker threads seem to be all stuck, yes. I still have the process up if there's any particular lldb commands you suggest I run to examine something.

vtjnash

vtjnash commented on Sep 23, 2021

@vtjnash
SponsorMember

Which thread was running (not sleeping or waiting for GC to finish)? I see thread 16 is doing GC, but what thread is it waiting for?

ancapdev

ancapdev commented on Sep 23, 2021

@ancapdev
ContributorAuthor

Also for reference, the top level concurrent construct is a ThreadPools.jl qforeach()

ancapdev

ancapdev commented on Sep 23, 2021

@ancapdev
ContributorAuthor

The main thread spawns that parallel for each, and the main thread is unresponsive waiting for this execution to finish. I'm not sure which specific thread is failing to reach the safepoint, I thought all of the ones spinning on the task scheduler code.

ancapdev

ancapdev commented on Sep 23, 2021

@ancapdev
ContributorAuthor

Actually, those are all after the jl_gc_safe_enter() statement https://github.com/JuliaLang/julia/blob/v1.6.2/src/partr.c#L505, so t
at means GC shouldn't be waiting for them, correct?

ancapdev

ancapdev commented on Sep 23, 2021

@ancapdev
ContributorAuthor

I think I've spotted it, and it's not a Julia runtime issue. Sorry for the wild goose chase.

I'm using Mongoc and have a client pool with connections, capped at a lower number than the number of concurrent threads / tasks. If GC happens while the pool is exhausted and a task is waiting for a connection (internal to libmongoc), it can deadlock since a thread is now stuck in the C code and can't reach a safepoint.

Is there a common pattern or way to insert a GC safe region around a ccall? This kind of resource management issue is probably a more general problem.

vtjnash

vtjnash commented on Sep 23, 2021

@vtjnash
SponsorMember

There's some support for it, but I am not sure if it is enabled in any way

ancapdev

ancapdev commented on Sep 24, 2021

@ancapdev
ContributorAuthor

@threadcall was on my list of solutions. Can't do it just on its own though, I think you need to limit the number of concurrent @threadcall operations since they can block and hog the thread pool threads. If there was some public API for what I presume the Julia runtime does internally to enter and leave GC safe regions that would be ideal.

ancapdev

ancapdev commented on Sep 24, 2021

@ancapdev
ContributorAuthor

I created a simple reproducible for Mongoc, and tried using @threadcall for the resource acquisition. My test program produces concurrency violation, and I'm not sure why. Any pointers appreciated. Details here: felipenoris/Mongoc.jl#84 (comment)

tkf

tkf commented on Sep 24, 2021

@tkf
Member

By the way, if ccall of mongoc_client_pool_pop is the cause of the deadlock, does it mean that it needs other Julia task to set some conditional variables or something in libmongoc? If mongoc_client_pool_pop can finish its thing without the help of other threads, wouldn't the ccall eventually ends and hit a safepoint?

If a ccall itself won't deadlock, for it to cause a deadlock with interaction with GC, I think it needs some thing like

while ccall(...) != 0
end

If there's a pattern like that, you can just insert GC.safepoint

while ccall(...) != 0
    GC.safepoint()
end

I think you need to limit the number of concurrent @threadcall operations since they can block and hog the thread pool threads

The manual says the pool for @threadcall is 4 by default and you can set it via UV_THREADPOOL_SIZE. So, if you rely on @threadcall to it's not really a good solution since other packages could be using @threadcall and you don't know if the user sets UV_THREADPOOL_SIZE to be smaller than you need. I think Go's scheduler has an unbounded thread pool specifically for external "ccall" for this reason.

ancapdev

ancapdev commented on Sep 27, 2021

@ancapdev
ContributorAuthor

By the way, if ccall of mongoc_client_pool_pop is the cause of the deadlock, does it mean that it needs other Julia task to set some conditional variables or something in libmongoc? If mongoc_client_pool_pop can finish its thing without the help of other threads, wouldn't the ccall eventually ends and hit a safepoint?

It requires a resource to be released by another part of the program, so if this part is stuck in GC it will deadlock. You can see my reproducible in felipenoris/Mongoc.jl#84

tkf

tkf commented on Sep 27, 2021

@tkf
Member

It requires a resource to be released by another part of the program, so if this part is stuck in GC it will deadlock.

Ah, I see. Makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @tkf@vtjnash@ancapdev

        Issue actions

          GC hang waiting for threads · Issue #42364 · JuliaLang/julia