runtime: contention in runtime.gcFlushBgCredit on work.assistQueue.lock #61426
Labels: compiler/runtime, NeedsInvestigation, Performance
This is the first in what I think will be a series of three issue reports about contention I've seen on various runtime-internal singleton locks in an app that typically runs on dual-socket linux/amd64 machines with 96 (or sometimes 64) hardware threads.
This issue is about contention in `runtime.gcFlushBgCredit` on `work.assistQueue.lock`. I hope it'll be of interest to @mknyszek. Also CC @golang/runtime.
Sorry about the somewhat outdated version. I haven't seen related issues go by, so I hope the info in this one is still of some use.
What version of Go are you using (`go version`)?

Go 1.19; the source links below reference go1.19.6.

Does this issue reproduce with the latest release?

I don't know yet whether the issue is present in the Go 1.20 series or in the release candidates for Go 1.21.
What operating system and processor architecture are you using (`go env`)?

The app runs on Linux / x86_64, typically on machines with two processor sockets (NUMA). I don't have the output of `go env` from the app's runtime environment.

What did you do?
A data-processing app received a higher volume of data than usual. Here are some more of its dimensions:
The average allocation size is about 128 bytes. Across the whole app, it averages around 4 GiB per second of allocations, or 16 kiB of allocations every 4 µs. Split across the 96 threads the runtime uses for the app (GOMAXPROCS == num hardware threads), that's 40 MiB per second per thread, an average of 3 µs between allocations, or 16 kiB of allocations every 400 µs.
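For context on how a figure like "4 GiB per second of allocations" can be measured, here's a minimal sketch (not from the app; the one-second sampling window and the use of `runtime/metrics` are my own choices for illustration) that estimates the app-wide allocation rate:

```go
// Hypothetical helper: estimate the process-wide allocation rate by sampling
// the cumulative allocation counter exposed by runtime/metrics.
package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

func allocBytes() uint64 {
	s := []metrics.Sample{{Name: "/gc/heap/allocs:bytes"}}
	metrics.Read(s)
	return s[0].Value.Uint64()
}

func main() {
	const window = time.Second
	before := allocBytes()
	time.Sleep(window) // the real app would be doing its normal work here
	after := allocBytes()
	rate := float64(after-before) / window.Seconds()
	fmt.Printf("allocation rate: %.1f MiB/s\n", rate/(1<<20))
}
```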
What did you expect to see?
I expected to see very little contention on runtime-internal singleton locks.
What did you see instead?
Contention in `runtime.gcFlushBgCredit` makes it hard for the `runtime.gcBgMarkWorker` goroutines to share their scan credit with mutator goroutines. Mutator goroutines pile up in `runtime.gcParkAssist`, probably for longer than necessary, because the scan credit that would allow them to proceed hasn't been able to get to the global shared / stealable pool.

https://github.com/golang/go/blob/go1.19.6/src/runtime/mgcmark.go#L678
https://github.com/golang/go/blob/go1.19.6/src/runtime/mgcmark.go#L713
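To make the fast-path / slow-path distinction concrete, here's a small self-contained model of the pattern described above. This is not the runtime's actual code; the names (`flushBgCredit`, `bgScanCredit`, `queue`, the 1 KiB per-waiter figure) are stand-ins for illustration only.

```go
// Toy model of the credit-flush pattern: credit is added atomically while no
// assists are queued (fast path), but once assists are parked the flusher must
// take a global lock to hand credit to them (slow path).
package main

import (
	"sync"
	"sync/atomic"
)

type assistQueue struct {
	mu      sync.Mutex      // stands in for work.assistQueue.lock
	waiters []chan struct{} // parked assists waiting for credit
}

var (
	bgScanCredit atomic.Int64 // shared, stealable credit pool
	queue        assistQueue
	queueLen     atomic.Int32 // lets the fast path skip the lock
)

// flushBgCredit mirrors the shape of the mechanism: an atomic add when nobody
// is parked, and a lock-protected hand-off to parked assists when they are.
func flushBgCredit(credit int64) {
	if queueLen.Load() == 0 {
		bgScanCredit.Add(credit) // fast path: no lock, no contention
		return
	}
	queue.mu.Lock() // slow path: every flusher serializes here
	for credit > 0 && len(queue.waiters) > 0 {
		w := queue.waiters[0]
		queue.waiters = queue.waiters[1:]
		queueLen.Add(-1)
		close(w)       // wake one parked assist
		credit -= 1024 // pretend each waiter needed 1 KiB of credit
	}
	queue.mu.Unlock()
	if credit > 0 {
		bgScanCredit.Add(credit) // leftover credit goes to the shared pool
	}
}

func main() {
	flushBgCredit(4096)
}
```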
Data set 5e4664552a1df41ccb8effb833c0442a2a17f33365e909600dfc317a2dd2c086 is from an instance of the app running on a 96-thread machine. The CPU profile ran for 5.14 seconds, during which it collected samples corresponding to a total of 352 seconds of on-CPU time (an average of 70 on-CPU threads), of which 77.61 + 86.91 = 164.52 seconds (an average of 32 on-CPU threads) were in calls from `runtime.gcFlushBgCredit` to lock and unlock `work.assistQueue.lock`.

A goroutine profile taken a few moments before the start of the CPU profile shows a few goroutines in the GCWaiting state.
Data set 82da472df6d99764df5636f26da9f15ad61dfe4f4b75a4d3da1dccc833bd2c8b includes an execution trace and a concurrently collected CPU profile. It's from a slightly different environment, where the machines have 64 hardware threads. The profile covers 6.23 seconds of wall-clock time and shows an average of 54 on-CPU threads. It shows an average of 5.8 on-CPU threads in calls from `runtime.gcFlushBgCredit` to lock and unlock `work.assistQueue.lock`.

The execution trace shows many goroutines in the GCWaiting state, and that goroutines both enter and leave this state. It shows some magenta "MARK ASSIST (unfinished)" regions, and some chartreuse "MARK ASSIST" regions. Because this execution trace is large and `go tool trace` split it into several pieces, the Threads ribbon didn't get an initial update in this view and so shows up as absent, when in fact it's stuck at the max (64 for this app instance) for most of the displayed time window. Procs 0, 1, and 2 are running the background mark worker; its display is broken in a similar way.

It seems that the pacer started the GC Mark phase too late here, but that's a separate issue with separate solutions (including moving away from Go 1.19).
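For reference, the execution trace and CPU profile for a window like this can be collected together from within the process. A minimal sketch follows (the file names and the six-second window are placeholders, not what the app actually used):

```go
// Collect an execution trace and a CPU profile over the same time window,
// then inspect them with `go tool trace` and `go tool pprof`.
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"runtime/trace"
	"time"
)

func main() {
	tf, err := os.Create("trace.out") // placeholder file name
	if err != nil {
		log.Fatal(err)
	}
	defer tf.Close()
	pf, err := os.Create("cpu.pprof") // placeholder file name
	if err != nil {
		log.Fatal(err)
	}
	defer pf.Close()

	if err := trace.Start(tf); err != nil {
		log.Fatal(err)
	}
	if err := pprof.StartCPUProfile(pf); err != nil {
		log.Fatal(err)
	}

	time.Sleep(6 * time.Second) // the real app keeps doing its work here

	pprof.StopCPUProfile()
	trace.Stop()
	// Inspect with: go tool trace trace.out   /   go tool pprof cpu.pprof
}
```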
Here's another view from earlier in the GC Mark phase. (Procs 0–11, 17, and 18 appear to be dedicated GC workers.) There's a change around 3700 ms -- the mutator goroutines stop showing chartreuse "MARK ASSIST" regions, and the number of goroutines in the GCWaiting state grows from 0 to several hundred.
Here's my rough analysis: It looks to me like the design of `runtime.gcFlushBgCredit` results in metastability. When assist credit is available to steal, the background mark worker goroutines take the fast path when adding additional credit to the pool. But when that pool runs dry, the background mark worker goroutines start needing to acquire a lock in order to replenish it, and because of the contention on that lock, they spend their CPU cycles in `runtime.futex` instead of doing useful scan work. If the assist queue lock isn't immediately available, maybe the worker could add its credit directly to the pool and rely on some subsequent call releasing it?

Furthermore, if so many user goroutines are in GCWaiting, the app's Ps will end up doing GC work via `gcBgMarkWorker` instead of via `gcAssistAlloc`, which means they'll want to share any excess scan credit instead of keeping it for themselves, leading to more Ps contending for the assist queue lock.
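In terms of the toy model in the earlier snippet (again, invented names, not a runtime patch), the mitigation floated above might look roughly like this: if the assist queue lock is contended, deposit the credit into the shared pool instead of queueing behind the lock, and let a later lock holder drain it to parked assists.

```go
// flushBgCreditTry is the same flush as in the earlier sketch (it reuses that
// snippet's queue, queueLen, and bgScanCredit declarations), but it backs off
// instead of waiting on a contended lock. Whether parked assists would then be
// woken promptly enough is exactly the open question raised above.
func flushBgCreditTry(credit int64) {
	if queueLen.Load() == 0 || !queue.mu.TryLock() {
		// Either nobody is parked, or the lock is busy: publish the credit
		// where parked assists (or the next lock holder) can find it, and
		// get back to scanning.
		bgScanCredit.Add(credit)
		return
	}
	defer queue.mu.Unlock()
	for credit > 0 && len(queue.waiters) > 0 {
		w := queue.waiters[0]
		queue.waiters = queue.waiters[1:]
		queueLen.Add(-1)
		close(w)       // wake one parked assist
		credit -= 1024 // pretend each waiter needed 1 KiB of credit
	}
	if credit > 0 {
		bgScanCredit.Add(credit)
	}
}
```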
Ideally the GC pacer would start earlier (and Go 1.20 should help), but it looks like once the runtime starts experiencing contention on this lock it has a hard time recovering until the GC Mark phase completes.