
gc state transition support in codegen #33097


Description

@NHDaly (Member)

I'm opening this issue to discuss the results from our "one-thread allocating garbage" experiment, detailed here:
RelationalAI-oss/MultithreadingBenchmarks.jl#4

That experiment showcased a potential performance hazard: having one thread run a long, tight loop with no allocations can essentially deadlock the rest of the program, because GC prevents any thread from scheduling tasks until every thread has reached a GC safepoint and the collection has completed.

This problem shows up if you have some long-running tasks that never allocate and GC is triggered on another thread. In that case, the entire program waits until those non-allocating tasks have completed. This is mostly a problem for an "unbalanced" workload, where some tasks are allocation-free but others allocate enough memory to trigger GC. (It could even be triggered by allocating the tasks themselves in an otherwise almost allocation-free program.)

Since, currently, GC requires all threads to be in a gc-safepoint before it will proceed, and since tasks cannot be preempted, once GC is triggered it will pause any thread that enters a GC-safepoint until all threads have entered a GC-safepoint.

Switching tasks is a gc-safepoint, so in the above benchmark workload, once GC is triggered, no new queries are scheduled to execute until all currently executing queries have completed.
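Here is a minimal sketch of the pattern (illustrative function names and sizes, not the benchmark code; run with JULIA_NUM_THREADS ≥ 2). Once `churn` triggers a collection, it stalls at its next safepoint until `spin` finishes, because the tight loop never reaches one:

```julia
# Tight, allocation-free loop: contains no GC safepoints.
function spin(n)
    s = UInt(0)
    for i in 1:n
        s = hash(i, s)   # cheap, non-allocating work
    end
    return s
end

# Allocation-heavy loop: will eventually trigger a collection.
function churn(m)
    for _ in 1:m
        zeros(100_000)
    end
end

t1 = Threads.@spawn spin(10^9)     # never reaches a safepoint until it returns
t2 = Threads.@spawn churn(10_000)  # stalls at its next safepoint once GC is requested
wait(t2); wait(t1)
```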


Note that this problem is a consequence of having non-preemptible tasks and a stop-the-world GC. Golang also suffers from this problem, as discussed in golang/go#10958 (a long thread).

In the situation I outlined above, the program would simply run slowly, as everything else pauses while waiting on the cpu-only thread. But one could easily imagine a true deadlock if the cpu-bound task were waiting on, e.g., an Atomic variable to be updated by one of the other paused tasks.
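A hypothetical sketch of that deadlock (again, illustrative names only): the spinner never reaches a safepoint, so if the setter's allocation triggers a collection, the setter is paused before it can flip the flag, and neither task can make progress:

```julia
done = Threads.Atomic{Bool}(false)

# CPU-bound task: spins on the atomic with no allocations, hence no safepoints.
spinner = Threads.@spawn begin
    while !done[]
    end
end

# Allocating task: if its allocation triggers GC, it blocks at a safepoint
# waiting for `spinner`, which in turn is waiting for `done` to be set.
setter = Threads.@spawn begin
    zeros(10^7)    # large allocation that can request a collection
    done[] = true
end

wait(spinner)
```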


We discussed this result in-person today with @JeffBezanson and @raileywild, and I wanted to record some of our thoughts:

  1. A user could avoid this situation by manually adding yield() points in their tight loop, but of course that would make their tight code slower.
  2. Instead, the user could manually add gc-safepoints, which allow GC to proceed if a collection is pending but are a no-op otherwise. This would be significantly faster than yielding (though still not free). We can do this via ccall(:jl_gc_safepoint); see the sketch after this list.
    • TODO: Can we add a Base function to trigger a gc safepoint to make it seem safer / more normal? (add GC.safepoint() for compute-bound threads #33092)
  3. We could consider adding a mechanism to allow users to mark a region of code as "GC safe", so that GC can proceed while the code is executing... but this seems dangerous and is unlikely to happen.
    • Relatedly, consider allowing the compiler to figure this out, but that also seems hard.
  4. Currently, we're pretty sure the gc_time reported by @time and similar tools doesn't include the time spent waiting for all threads to reach a safepoint. It probably should.
    • TODO: Either add that time to gc_time, or add a separate metric, e.g. "gc synchronization time".
  5. Currently both "mark" and "sweep" happen during the stopped world. We could consider allowing the "sweep" phase to run in parallel with user code (i.e. resume the world before sweeping), but the sweep phase is much shorter than the mark phase, so it wouldn't buy much.
  6. We could also speed up GC itself by multithreading the mark phase, splitting the work across all the threads. Right now, only one thread does the GC work while the others are all paused, waiting.
    • This would certainly improve performance on our multithreaded benchmarks, but it doesn't address the main contention/synchronization problem: forcing all threads to synchronize every so often before proceeding.
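As a reference for option 2, here is a minimal sketch assuming the GC.safepoint() export proposed in #33092 (on releases without it, the ccall form mentioned in item 2 is equivalent). The safepoint check is very cheap when no collection is pending, so doing it every few thousand iterations keeps the overhead low:

```julia
# Same tight loop as above, but with an explicit safepoint every so often so
# that a pending collection can proceed without waiting for the loop to finish.
function spin_with_safepoints(n)
    s = UInt(0)
    for i in 1:n
        s = hash(i, s)
        if i % 10_000 == 0
            GC.safepoint()   # no-op unless a collection has been requested
            # on releases without GC.safepoint(): ccall(:jl_gc_safepoint, Cvoid, ())
        end
    end
    return s
end
```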

Activity

JeffBezanson (Member) commented on Aug 28, 2019

See #33092 for manual safepoints.

vtjnash (Member) commented on Aug 29, 2019

Relatedly, consider allowing the compiler to [automatically insert gc-safe transitions]

I think this is the right long-term approach, and I think it can be done well. But since it's not going to be ready immediately, #33092 seems like the way to go right now (export what we already have, and backport it to the 1.3 branch).

vtjnash (Member) commented on Aug 29, 2019

We could also speed up GC itself by multithreading the mark phase

IIRC, the mark code (nearly?) supports this already, so we should do this.

concurrent sweep with allocations

This seems difficult (or slow?), but (if we aren't already) we may be able to have most threads sweep their own heaps in parallel.

Title changed from "A long-running, tight, cpu-only loop in a thread can deadlock the rest of a program by preventing GC ("One-thread allocating garbage" Multithreading Benchmark results)" to "gc state transition support in codegen" on May 1, 2020.
Labels

GC (Garbage collector), compiler:codegen (Generation of LLVM IR and native code), multithreading (Base.Threads and related functionality)
