Description
I'm opening this issue to discuss the results from our "one-thread allocating garbage" experiment, detailed here:
RelationalAI-oss/MultithreadingBenchmarks.jl#4
That experiment showcased a potential performance hazard: Having one thread running a long-running, tight loop with no allocations can essentially deadlock the rest of the program, as GC prevents any threads from scheduling tasks until every task has reached a GC-safepoint, and GC has completed.
This problem will show up if you have some long-running tasks that never allocate, and then GC is triggered on another thread. In that case, the entire program will wait until all tasks have completed. This is mostly a problem if you have an "unbalanced" workload, where some tasks are alloc-free, but other tasks allocate enough memory to trigger GC. (It could even be triggered by allocating the tasks themselves in an almost allocation-free program.)
Since, currently, GC requires all threads to be in a gc-safepoint before it will proceed, and since tasks cannot be preempted, once GC is triggered it will pause any thread that enters a GC-safepoint until all threads have entered a GC-safepoint.
Switching tasks is a gc-safepoint, so in the above benchmark workload, once GC is triggered, no new queries are scheduled to execute until all currently executing queries have completed.
Note that this problem is a consequence of having non-preemptable tasks and stop-the-world GC. Golang also suffers from this problem, as discussed here: golang/go#10958 (long thread)
In the situation I outlined above, the program simply runs slowly, as the rest of the program pauses waiting on the cpu-only thread. But one could easily imagine a deadlock if the cpu-bound task was waiting on, e.g., an Atomic variable to be updated by one of the other paused tasks.
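The unbalanced workload described above might be sketched like this toy example (names like `cpu_task` and the bounded iteration counts are ours, chosen so the demo terminates; in the real hazard the allocation-free loop runs long enough to stall a triggered GC for its entire duration):

```julia
using Base.Threads

flag = Atomic{Bool}(false)

cpu_task = Threads.@spawn begin
    x = 0.0
    for i in 1:50_000_000        # tight loop: no allocations, no safepoints
        x += i
    end
    flag[] = true
    x
end

alloc_task = Threads.@spawn begin
    n = 0
    while !flag[]
        n += length(zeros(256))  # allocations; may trigger GC, which then
                                 # must wait for cpu_task's loop to finish
        yield()
    end
    n
end

fetch(cpu_task)
fetch(alloc_task)
```

Once `alloc_task` triggers a collection, every thread that hits a safepoint parks, but `cpu_task` contains no safepoints, so GC (and everything waiting on it) stalls until its loop finishes.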
We discussed this result in-person today with @JeffBezanson and @raileywild, and I wanted to record some of our thoughts:
- A user could avoid this situation by manually adding `yield()` points in their tight loop, but of course that would make their tight code slower.
- Instead, the user could manually add gc-safepoints, which would allow GC to proceed if it was in progress, but be a no-op otherwise. This would be significantly faster (though still a bit slow). We can do this via `ccall(:jl_gc_safepoint, Cvoid, ())`.
  - TODO: Can we add a Base function to trigger a gc safepoint to make it seem safer / more normal? (add `GC.safepoint()` for compute-bound threads, #33092)
- We could consider adding a mechanism to allow users to mark a region of code as "GC safe", so that GC can proceed while the code is executing... but this seems dangerous and is unlikely to happen.
- Relatedly, we could consider having the compiler insert such safepoints automatically, but that also seems hard.
- Currently, we're pretty sure the `gc_time` reported by `@time` and similar tools doesn't include the time spent waiting for all threads to reach a safepoint. It probably should.
  - TODO: Probably we should add that time to `gc_time`, or add a separate metric for something like "gc synchronization time".
- Currently both "mark" and "sweep" happen during the stopped world. We could consider allowing the "sweep" phase to run in parallel with user code (i.e., resume the world), but the sweep phase is much shorter than the mark phase, so it doesn't buy much.
- We could also speed-up the time to do GC by multithreading the mark-phase to split the work up across all the threads. Right now, only one thread does the GC work while the others are all paused, waiting.
- This would certainly improve performance on our multithreaded benchmarks, but it doesn't address the main contention/synchronization problem: all threads are still forced to synchronize every so often before proceeding.
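The manual-safepoint idea from the list above can be sketched as follows (the loop body and the safepoint interval are illustrative choices of ours; `GC.safepoint()` is the Base wrapper proposed in #33092, and on versions without it the raw `ccall` shown in the comment works the same way):

```julia
function tight_sum(n::Integer)
    s = 0.0
    for i in 1:n
        s += sqrt(i)              # allocation-free work
        if i % 65_536 == 0
            # No-op unless another thread is waiting to run GC; in that
            # case this lets the collection proceed instead of blocking
            # on this loop until it finishes.
            GC.safepoint()        # or: ccall(:jl_gc_safepoint, Cvoid, ())
        end
    end
    return s
end

tight_sum(1_000_000)
```

The periodic check costs a single load of the safepoint page in the common case, so it is far cheaper than `yield()` while still bounding how long GC can be stalled by this loop.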
Activity
JeffBezanson commented on Aug 28, 2019
See #33092 for manual safepoints.
vtjnash commented on Aug 29, 2019
I think this is the way to go, and I think it can be done well. But since it's not going to be immediately ready, #33092 seems like the way to go right now (export what we already have, and backport it to the 1.3 branch).
vtjnash commented on Aug 29, 2019
IIRC, the mark code (nearly?) supports this already, so we should do this.
This seems difficult (or slow?), but (if we’re not already), we may be able to get most threads to sweep their own heap in parallel.
[Title changed from "A long-running, tight, cpu-only loop in a thread can deadlock the rest of a program by preventing GC ('One-thread allocating garbage' Multithreading Benchmark results)" to "gc state transition support in codegen"]