
gc state transition support in codegen #33097

@NHDaly

Description

I'm opening this issue to discuss the results from our "one-thread allocating garbage" experiment, detailed here:
RelationalAI-oss/MultithreadingBenchmarks.jl#4

That experiment showcased a potential performance hazard: having one thread run a long, tight loop with no allocations can essentially deadlock the rest of the program, because GC cannot proceed until every thread has reached a GC safepoint, and no other thread can schedule tasks until GC completes.

This problem shows up if you have long-running tasks that never allocate and GC is triggered on another thread. In that case, the entire program waits until all such tasks have completed (or otherwise reached a safepoint). This is mostly a problem for "unbalanced" workloads, where some tasks are allocation-free while others allocate enough memory to trigger GC. (It could even be triggered by allocating the tasks themselves in an almost allocation-free program.)

Since GC currently requires all threads to be at a GC safepoint before it proceeds, and since tasks cannot be preempted, once GC is triggered it pauses each thread as it enters a safepoint, until every thread has entered one.

Switching tasks is a GC safepoint, so in the above benchmark workload, once GC is triggered, no new queries are scheduled until all currently executing queries have completed.
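The hazard can be sketched as follows. This is an illustrative reproduction, not code from the benchmark; `spin` and the iteration count are made up, and on a real system the compiler may optimize the loop, so treat it as a sketch of the shape of the problem:

```julia
using Base.Threads

# Allocation-free tight loop: it contains no GC safepoints, so once a
# collection is requested elsewhere, other threads stall until it returns.
function spin(n)
    s = 0
    for i in 1:n
        s += i
    end
    return s
end

t = @spawn spin(10^10)   # long-running, CPU-bound, never reaches a safepoint
sleep(0.1)               # give the spinner time to start
GC.gc()                  # blocks here until spin() finally returns
fetch(t)
```

Any thread that allocates (and thereby triggers GC) hits the same wall as the explicit `GC.gc()` call above.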


Note that this problem is a consequence of having non-preemptable tasks and stop-the-world GC. Golang also suffers from this problem, as discussed here: golang/go#10958 (long thread)

In the situation outlined above, the program would simply run slowly, as the rest of the program pauses while the CPU-only thread finishes. But one could easily imagine a true deadlock if the CPU-bound task were waiting on, e.g., an Atomic variable to be updated by one of the other paused tasks.
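The deadlock variant can be sketched like this (a hypothetical illustration; the `flag` protocol is invented for the example, and running it as-is would hang):

```julia
using Base.Threads

flag = Atomic{Int}(0)

# Busy-wait with no allocations and no yields: no safepoints in this loop.
spinner = @spawn while flag[] == 0
end

@spawn begin
    x = zeros(10^7)   # heavy allocation; may trigger a collection
    flag[] = 1        # if GC stopped the world first, this line never runs,
end                   # and the spinner never exits: deadlock
```

The allocating task parks at a safepoint waiting for GC; GC waits for the spinner; the spinner waits for the flag that the parked task will never set.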


We discussed this result in person today with @JeffBezanson and @raileywild, and I wanted to record some of our thoughts:

  1. A user could avoid this situation by manually adding yield() points in their tight loop, but of course that would make their tight code slower.
  2. Instead, the user could manually add GC safepoints, which allow GC to proceed if a collection is pending but are a near-noop otherwise. This would be significantly faster than yielding (though still not free). We can do this via ccall(:jl_gc_safepoint, Cvoid, ()).
  3. We could consider adding a mechanism to let users mark a region of code as "GC safe", so that GC can proceed while that code executes... but this seems dangerous and is unlikely to happen.
    • Relatedly, the compiler could try to infer such regions automatically, but that also seems hard.
  4. Currently, we're pretty sure the gc_time reported by @time and similar tools doesn't include the time spent waiting for all threads to reach a safepoint. It probably should.
    • TODO: Either add that waiting time to gc_time, or add a separate metric such as "GC synchronization time".
  5. Currently both "mark" and "sweep" happen while the world is stopped. We could consider running the "sweep" phase in parallel with user code (i.e., resuming the world first), but the sweep phase is much shorter than the mark phase, so it wouldn't buy much.
  6. We could also speed up GC itself by multithreading the mark phase, splitting the work across all threads. Right now, only one thread does the GC work while the others sit paused, waiting.
    • This would certainly improve performance on our multithreaded benchmarks, but it doesn't address the main contention/synchronization problem: all threads are still forced to synchronize every so often before proceeding.
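Option 2 above can be sketched as follows (an illustrative example; the function name is made up):

```julia
# Variant of a tight, allocation-free loop that periodically polls the
# GC safepoint, so a pending collection can proceed without a task switch.
function spin_with_safepoints(n)
    s = 0
    for i in 1:n
        s += i
        # A near-free load when no GC is pending; if a collection has been
        # requested, this blocks here so GC can run, then continues.
        ccall(:jl_gc_safepoint, Cvoid, ())
    end
    return s
end

spin_with_safepoints(100)
```

In practice one would likely poll every few thousand iterations rather than every iteration, trading GC latency against loop overhead.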

Metadata

Labels: GC (Garbage collector), compiler:codegen (Generation of LLVM IR and native code), multithreading (Base.Threads and related functionality)
