runtime: contention in runtime.gcFlushBgCredit on work.assistQueue.lock #61426
Labels: compiler/runtime, NeedsInvestigation, Performance
This is the first in what I think will be a series of three issue reports about contention I've seen on various runtime-internal singleton locks in an app that typically runs on dual-socket linux/amd64 machines with 96 (or sometimes 64) hardware threads.
This issue is about contention in `runtime.gcFlushBgCredit` on `work.assistQueue.lock`. I hope it'll be of interest to @mknyszek. Also CC @golang/runtime.
Sorry about the somewhat outdated version. I haven't seen related issues go by, so I hope the info in this one is still of some use.
What version of Go are you using (`go version`)?

Go 1.19; the source links below reference go1.19.6.

Does this issue reproduce with the latest release?

I don't know yet whether the issue is present in the Go 1.20 series or in the release candidates for Go 1.21.
What operating system and processor architecture are you using (`go env`)?

The app runs on Linux / x86_64, typically on machines with two processor sockets (NUMA). I don't have the output of `go env` from the app's runtime environment.

What did you do?
A data-processing app received a higher volume of data than usual. Here are some more of its dimensions:
The average allocation size is about 128 bytes. Across the whole app, it averages around 4 GiB per second of allocations, or 16 kiB of allocations every 4 µs. Split across the 96 threads the runtime uses for the app (GOMAXPROCS == num hardware threads), that's 40 MiB per second per thread, an average of 3 µs between allocations, or 16 kiB of allocations every 400 µs.
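For context on how a figure like "4 GiB per second of allocations" can be measured, here's a minimal sketch (not from the app; the one-second sampling window and the use of `runtime/metrics` are my own choices for illustration) that estimates the app-wide allocation rate:

```go
// Hypothetical helper: estimate the process-wide allocation rate by sampling
// the cumulative allocation counter exposed by runtime/metrics.
package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

func allocBytes() uint64 {
	s := []metrics.Sample{{Name: "/gc/heap/allocs:bytes"}}
	metrics.Read(s)
	return s[0].Value.Uint64()
}

func main() {
	const window = time.Second
	before := allocBytes()
	time.Sleep(window) // the real app would be doing its normal work here
	after := allocBytes()
	rate := float64(after-before) / window.Seconds()
	fmt.Printf("allocation rate: %.1f MiB/s\n", rate/(1<<20))
}
```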
What did you expect to see?
I expected to see very little contention on runtime-internal singleton locks.
What did you see instead?
Contention in `runtime.gcFlushBgCredit` makes it hard for the `runtime.gcBgMarkWorker` goroutines to share their scan credit with mutator goroutines. Mutator goroutines pile up in `runtime.gcParkAssist`, probably for longer than necessary, because the scan credit that would allow them to proceed hasn't been able to get to the global shared / stealable pool.

https://github.com/golang/go/blob/go1.19.6/src/runtime/mgcmark.go#L678
https://github.com/golang/go/blob/go1.19.6/src/runtime/mgcmark.go#L713
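To make the fast-path / slow-path distinction concrete, here's a small self-contained model of the pattern described above. This is not the runtime's actual code; the names (`flushBgCredit`, `bgScanCredit`, `queue`, the 1 KiB per-waiter figure) are stand-ins for illustration only.

```go
// Toy model of the credit-flush pattern: credit is added atomically while no
// assists are queued (fast path), but once assists are parked the flusher must
// take a global lock to hand credit to them (slow path).
package main

import (
	"sync"
	"sync/atomic"
)

type assistQueue struct {
	mu      sync.Mutex      // stands in for work.assistQueue.lock
	waiters []chan struct{} // parked assists waiting for credit
}

var (
	bgScanCredit atomic.Int64 // shared, stealable credit pool
	queue        assistQueue
	queueLen     atomic.Int32 // lets the fast path skip the lock
)

// flushBgCredit mirrors the shape of the mechanism: an atomic add when nobody
// is parked, and a lock-protected hand-off to parked assists when they are.
func flushBgCredit(credit int64) {
	if queueLen.Load() == 0 {
		bgScanCredit.Add(credit) // fast path: no lock, no contention
		return
	}
	queue.mu.Lock() // slow path: every flusher serializes here
	for credit > 0 && len(queue.waiters) > 0 {
		w := queue.waiters[0]
		queue.waiters = queue.waiters[1:]
		queueLen.Add(-1)
		close(w)       // wake one parked assist
		credit -= 1024 // pretend each waiter needed 1 KiB of credit
	}
	queue.mu.Unlock()
	if credit > 0 {
		bgScanCredit.Add(credit) // leftover credit goes to the shared pool
	}
}

func main() {
	flushBgCredit(4096)
}
```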
Data set 5e4664552a1df41ccb8effb833c0442a2a17f33365e909600dfc317a2dd2c086 is from an instance of the app running on a 96-thread machine. The CPU profile ran for 5.14 seconds, during which it collected samples corresponding to a total of 352 seconds of on-CPU time (an average of 70 on-CPU threads), of which 77.61 + 86.91 = 164.52 seconds (an average of 32 on-CPU threads) were in calls from `runtime.gcFlushBgCredit` to lock and unlock `work.assistQueue.lock`.

A goroutine profile taken a few moments before the start of the CPU profile shows a few goroutines in the GCWaiting state.
Data set 82da472df6d99764df5636f26da9f15ad61dfe4f4b75a4d3da1dccc833bd2c8b includes an execution trace and a concurrently collected CPU profile. It's from a slightly different environment, where the machines have 64 hardware threads. The profile covers 6.23 seconds of wall-clock time and shows an average of 54 on-CPU threads. It shows an average of 5.8 on-CPU threads in calls from `runtime.gcFlushBgCredit` to lock and unlock `work.assistQueue.lock`.

The execution trace shows many goroutines in the GCWaiting state, and that goroutines both enter and leave this state. It shows some magenta "MARK ASSIST (unfinished)" regions, and some chartreuse "MARK ASSIST" regions. Because this execution trace is large and `go tool trace` split it into several pieces, the Threads ribbon didn't get an initial update in this view and so shows up as absent, when in fact it's stuck at the max (64 for this app instance) for most of the displayed time window. Procs 0, 1, and 2 are running the background mark worker; its display is broken in a similar way.

It seems that the pacer started the GC Mark phase too late here, but that's a separate issue with separate solutions (including moving away from Go 1.19).
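For reference, the execution trace and CPU profile for a window like this can be collected together from within the process. A minimal sketch follows (the file names and the six-second window are placeholders, not what the app actually used):

```go
// Collect an execution trace and a CPU profile over the same time window,
// then inspect them with `go tool trace` and `go tool pprof`.
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"runtime/trace"
	"time"
)

func main() {
	tf, err := os.Create("trace.out") // placeholder file name
	if err != nil {
		log.Fatal(err)
	}
	defer tf.Close()
	pf, err := os.Create("cpu.pprof") // placeholder file name
	if err != nil {
		log.Fatal(err)
	}
	defer pf.Close()

	if err := trace.Start(tf); err != nil {
		log.Fatal(err)
	}
	if err := pprof.StartCPUProfile(pf); err != nil {
		log.Fatal(err)
	}

	time.Sleep(6 * time.Second) // the real app keeps doing its work here

	pprof.StopCPUProfile()
	trace.Stop()
	// Inspect with: go tool trace trace.out   /   go tool pprof cpu.pprof
}
```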
Here's another view from earlier in the GC Mark phase. (Procs 0–11, 17, and 18 appear to be dedicated GC workers.) There's a change around 3700 ms -- the mutator goroutines stop showing chartreuse "MARK ASSIST" regions, and the number of goroutines in the GCWaiting state grows from 0 to several hundred.
Here's my rough analysis: It looks to me like the design of `runtime.gcFlushBgCredit` results in metastability. When assist credit is available to steal, the background mark worker goroutines take the fast path when adding additional credit to the pool. But when that pool runs dry, the background mark worker goroutines start needing to acquire a lock in order to replenish it, and because of the contention on that lock, they spend their CPU cycles in `runtime.futex` instead of doing useful scan work. If the assist queue lock isn't immediately available, maybe the worker could add its credit directly to the pool and rely on some subsequent call releasing it?

Furthermore, if so many user goroutines are in GCWaiting, the app's Ps will end up doing GC work via `gcBgMarkWorker` instead of via `gcAssistAlloc`, which means they'll want to share any excess scan credit instead of keeping it for themselves, leading to more Ps contending for the assist queue lock.
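In terms of the toy model in the earlier snippet (again, invented names, not a runtime patch), the mitigation floated above might look roughly like this: if the assist queue lock is contended, deposit the credit into the shared pool instead of queueing behind the lock, and let a later lock holder drain it to parked assists.

```go
// flushBgCreditTry is the same flush as in the earlier sketch (it reuses that
// snippet's queue, queueLen, and bgScanCredit declarations), but it backs off
// instead of waiting on a contended lock. Whether parked assists would then be
// woken promptly enough is exactly the open question raised above.
func flushBgCreditTry(credit int64) {
	if queueLen.Load() == 0 || !queue.mu.TryLock() {
		// Either nobody is parked, or the lock is busy: publish the credit
		// where parked assists (or the next lock holder) can find it, and
		// get back to scanning.
		bgScanCredit.Add(credit)
		return
	}
	defer queue.mu.Unlock()
	for credit > 0 && len(queue.waiters) > 0 {
		w := queue.waiters[0]
		queue.waiters = queue.waiters[1:]
		queueLen.Add(-1)
		close(w)       // wake one parked assist
		credit -= 1024 // pretend each waiter needed 1 KiB of credit
	}
	if credit > 0 {
		bgScanCredit.Add(credit)
	}
}
```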
Ideally the GC pacer would start earlier (and Go 1.20 should help), but it looks like once the runtime starts experiencing contention on this lock it has a hard time recovering until the GC Mark phase completes.