Description
I'm opening this issue to discuss the results from our "one-thread allocating garbage" experiment, detailed here:
RelationalAI-oss/MultithreadingBenchmarks.jl#4
That experiment showcased a potential performance hazard: Having one thread running a long-running, tight loop with no allocations can essentially deadlock the rest of the program, as GC prevents any threads from scheduling tasks until every task has reached a GC-safepoint, and GC has completed.
This problem will show up if you have some long-running tasks that never allocate, and then GC is triggered on another thread. In that case, the entire program will wait until all tasks have completed. This is mostly a problem if you have an "unbalanced" workload, where some tasks are alloc-free, but other tasks allocate enough memory to trigger GC. (It could even be triggered by allocating the tasks themselves in an almost allocation-free program.)
Since, currently, GC requires all threads to be in a gc-safepoint before it will proceed, and since tasks cannot be preempted, once GC is triggered it will pause any thread that enters a GC-safepoint until all threads have entered a GC-safepoint.
Switching tasks is a gc-safepoint, so in the above benchmark workload, once GC is triggered, no new queries are scheduled to execute until all currently executing queries have completed.
Note that this problem is a consequence of having non-preemptable tasks and stop-the-world GC. Golang also suffers from this problem, as discussed here: golang/go#10958 (long thread)
In the situation I outlined above, the program simply runs slowly, as the rest of the program pauses waiting on the cpu-only thread. But one could easily imagine a deadlock if the cpu-bound task was waiting on, e.g., an Atomic variable to be updated by one of the other paused tasks.
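The unbalanced workload described above might be sketched like this toy example (names like `cpu_task` and the bounded iteration counts are ours, chosen so the demo terminates; in the real hazard the allocation-free loop runs long enough to stall a triggered GC for its entire duration):

```julia
using Base.Threads

flag = Atomic{Bool}(false)

cpu_task = Threads.@spawn begin
    x = 0.0
    for i in 1:50_000_000        # tight loop: no allocations, no safepoints
        x += i
    end
    flag[] = true
    x
end

alloc_task = Threads.@spawn begin
    n = 0
    while !flag[]
        n += length(zeros(256))  # allocations; may trigger GC, which then
                                 # must wait for cpu_task's loop to finish
        yield()
    end
    n
end

fetch(cpu_task)
fetch(alloc_task)
```

Once `alloc_task` triggers a collection, every thread that hits a safepoint parks, but `cpu_task` contains no safepoints, so GC (and everything waiting on it) stalls until its loop finishes.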
We discussed this result in-person today with @JeffBezanson and @raileywild, and I wanted to record some of our thoughts:
- A user could avoid this situation by manually adding `yield()` points in their tight loop, but of course that would make their tight code slower.
- Instead, the user could manually add gc-safepoints, which would allow GC to proceed if it was in progress, but be a no-op otherwise. This would be significantly faster (though still a bit slow). We can do this via `ccall(:jl_gc_safepoint, Cvoid, ())`.
  - TODO: Can we add a Base function to trigger a gc safepoint to make it seem safer / more normal? (add `GC.safepoint()` for compute-bound threads, #33092)
- We could consider adding a mechanism to allow users to mark a region of code as "GC safe", so that GC can proceed while the code is executing... but this seems dangerous and is unlikely to happen.
- Relatedly, we could consider having the compiler insert such safepoints automatically, but that also seems hard.
- Currently, we're pretty sure the `gc_time` reported by `@time` and similar tools doesn't include the time spent waiting for all threads to reach a safepoint. It probably should.
  - TODO: Probably we should add that time to `gc_time`, or add a separate metric for something like "gc synchronization time".
- Currently both "mark" and "sweep" happen during the stopped world. We could consider allowing the "sweep" phase to run in parallel with user code (i.e., resume the world), but the sweep phase is much shorter than the mark phase, so it doesn't buy much.
- We could also speed-up the time to do GC by multithreading the mark-phase to split the work up across all the threads. Right now, only one thread does the GC work while the others are all paused, waiting.
- This would certainly improve performance on our multithreaded benchmarks, but it doesn't address the main contention/synchronization problem: all threads are still forced to synchronize every so often before proceeding.
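The manual-safepoint idea from the list above can be sketched as follows (the loop body and the safepoint interval are illustrative choices of ours; `GC.safepoint()` is the Base wrapper proposed in #33092, and on versions without it the raw `ccall` shown in the comment works the same way):

```julia
function tight_sum(n::Integer)
    s = 0.0
    for i in 1:n
        s += sqrt(i)              # allocation-free work
        if i % 65_536 == 0
            # No-op unless another thread is waiting to run GC; in that
            # case this lets the collection proceed instead of blocking
            # on this loop until it finishes.
            GC.safepoint()        # or: ccall(:jl_gc_safepoint, Cvoid, ())
        end
    end
    return s
end

tight_sum(1_000_000)
```

The periodic check costs a single load of the safepoint page in the common case, so it is far cheaper than `yield()` while still bounding how long GC can be stalled by this loop.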
Activity
JeffBezanson commented on Aug 28, 2019
See #33092 for manual safepoints.
vtjnash commented on Aug 29, 2019
I think this is the way to go, and I think it can be done well. But since it's not going to be immediately ready, #33092 seems like the way to go right now (export what we already have, and backport it to the 1.3 branch).
vtjnash commented on Aug 29, 2019
IIRC, the mark code (nearly?) supports this already, so we should do this.
This seems difficult (or slow?), but (if we’re not already), we may be able to get most threads to sweep their own heap in parallel.
[Title changed from "A long-running, tight, cpu-only loop in a thread can deadlock the rest of a program by preventing GC ('One-thread allocating garbage' Multithreading Benchmark results)" to "gc state transition support in codegen"]