runtime: 10ms-26ms latency from GC in go1.14rc1, possibly due to 'GC (idle)' work

### What version of Go are you using (`go version`)?

<pre>
$ go1.14rc1 version

go version go1.14rc1 linux/amd64
</pre>

### Does this issue reproduce with the latest release?

Yes, go1.14rc1.

### What operating system and processor architecture are you using (`go env`)?

Ubuntu 18.04.2 LTS (stock GCE image). 4 vCPUs, 15 GB.

<details><summary><code>go env</code> Output</summary><br><pre>
$ go1.14rc1 env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/user/.cache/go-build"
GOENV="/home/user/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/user/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/user/sdk/go1.14rc1"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/user/sdk/go1.14rc1/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build656951356=/tmp/go-build -gno-record-gcc-switches"

</pre></details>

### Background

A recent [blog post](https://blog.discordapp.com/why-discord-is-switching-from-go-to-rust-a190bbca2b1f) reported an issue they saw in the >2 year old Go 1.10 when there are "tens of millions" of entries in an LRU, along with CPU spikes and response time spikes every two minutes, with a p95 response time reportedly hitting 50ms-300ms.  

The two minute periodicity was explained within the blog by the following, which seems plausible:

> we had written the Go code very efficiently and had very few allocations. We were not creating a lot of garbage.
> After digging through the Go source code, we learned that Go will force a garbage collection run every 2 minutes at minimum. In other words, if garbage collection has not run for 2 minutes, regardless of heap growth, go will still force a garbage collection.

A separate recent [post](https://old.reddit.com/r/programming/comments/eyuebc/why_discord_is_switching_from_go_to_rust/fgk56y4/) from one of the engineers theorized the problem was due to a very long linked list, but I am less sure if this comment is a precise description of the actual root cause:

> we used a large free-list like structure, a very long linked list. The head of the list is maintained as a variable, which means that Go's mark phase must start scanning from the head and then pointer chase its way through the list. For whatever reason, Go does (did?) this section in a single-threaded manner with a global lock held. As a result, everything must wait until this extremely long pointer chase occurs.

An engineer from the same team might have also posted this roughly 2.5 years ago in https://github.com/golang/go/issues/14812#issuecomment-411589415

> I've also got a service that's encountering long request latencies due to CPU starvation when the GC runs. Like a couple others have mentioned, I think the GC may act poorly on long linked lists - I have a 1M-long list/map used as an LRU cache.

### What did you do?

Write a simple GC benchmark intended to emulate a LRU with a separate free list.

It's not 100% clear what their exact data structures were, including whether or not it was just an LRU that linked elements within the LRU to help with expiring the older items, vs. maybe there was an additional free list that was separate from the LRU, vs. something else entirely.

Last night, without too much data, I put together a simple gc benchmark of a LRU cache along with a separate free list.

The benchmark is here: 

https://play.golang.org/p/-Vo1hqIAoNz

Here is a sample run with 50M items in a LRU cache (a map + the items within the map also are on a linked list), as well as 1M items on separate free list, and a background allocation rate of about 10k/sec:

```
$ go1.14rc1 build 
$ GODEBUG=gctrace=1 ./lru-gc-benchmark -free-list-items=1000000 -lru-items=50000000 -duration=20m -bg-alloc-sleep=10ms
```

### What did you expect to see?

Ideally, sub-ms worst-case latency impact of GC.

### What did you see instead?

This was fairly quick and dirty, and might not be at all related to the problem reported in the blog, but seemed to observe occasionally higher latency values.

It is important to note that the issue here does not seem to be anywhere near the scale reported in the blog:
  * The amplitude of the latency impact here (10ms-26ms) seems substantially smaller than what was reported on the blog (50ms-300ms).
  * The issue here is reporting worst-case latency, rather than p95 latency on the blog.

Here is an example where a single 26ms worst-case latency was observed when looking up an element in the LRU during a forced GC.

```
$ go1.14rc1 build  && GODEBUG=gctrace=1 ./lru-gc-benchmark -free-list-items=1000000 -lru-items=50000000 -duration=20m -bg-alloc-sleep=10ms 

gc 1 @0.011s 5%: 0.066+4.1+0.023 ms clock, 0.26+0.30/3.0/6.9+0.095 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc 2 @0.022s 10%: 0.008+9.0+0.021 ms clock, 0.033+0.33/8.7/10+0.087 ms cpu, 7->7->6 MB, 8 MB goal, 4 P
gc 3 @0.044s 10%: 0.012+13+0.020 ms clock, 0.049+0/12/23+0.083 ms cpu, 13->13->12 MB, 14 MB goal, 4 P
gc 4 @0.087s 10%: 0.009+26+0.030 ms clock, 0.038+0/24/46+0.12 ms cpu, 27->27->25 MB, 28 MB goal, 4 P
gc 5 @0.177s 10%: 0.010+49+0.023 ms clock, 0.042+0/47/89+0.095 ms cpu, 53->54->50 MB, 54 MB goal, 4 P
gc 6 @0.369s 10%: 0.015+98+0.022 ms clock, 0.060+0/98/179+0.089 ms cpu, 107->108->100 MB, 108 MB goal, 4 P
gc 7 @0.767s 10%: 0.012+195+0.021 ms clock, 0.049+0/194/380+0.086 ms cpu, 213->216->200 MB, 214 MB goal, 4 P
gc 8 @1.586s 11%: 7.4+377+0.028 ms clock, 29+71/376/742+0.11 ms cpu, 426->430->398 MB, 427 MB goal, 4 P
gc 9 @3.343s 11%: 0.024+787+0.023 ms clock, 0.099+236/785/1501+0.094 ms cpu, 853->859->795 MB, 854 MB goal, 4 P
gc 10 @7.071s 11%: 0.023+1806+0.023 ms clock, 0.092+497/1803/3513+0.095 ms cpu, 1706->1719->1591 MB, 1707 MB goal, 4 P
gc 11 @14.905s 12%: 49+3796+0.023 ms clock, 196+1084/3796/7271+0.093 ms cpu, 3411->3439->3182 MB, 3412 MB goal, 4 P
gc 12 @31.292s 14%: 226+8162+0.043 ms clock, 904+5028/8161/14256+0.17 ms cpu, 6823->6866->6353 MB, 6824 MB goal, 4 P
LRU count:       50.00M
Free list count: 1.00M
BG allocation:   10000 / sec (approx.)
BG retain count: 10000
Printing slowest lookup times:
i=0          bgAllocs=540000   at=64.026s  lookup=0.009ms
i=4096       bgAllocs=540100   at=64.028s  lookup=0.016ms
i=8944       bgAllocs=540100   at=64.030s  lookup=0.118ms
i=9410       bgAllocs=540100   at=64.031s  lookup=0.123ms
i=52120      bgAllocs=540300   at=64.053s  lookup=0.141ms
i=54295      bgAllocs=540300   at=64.054s  lookup=0.148ms
i=81167      bgAllocs=540400   at=64.068s  lookup=0.185ms
i=141525     bgAllocs=540700   at=64.099s  lookup=0.193ms
i=473297     bgAllocs=542400   at=64.271s  lookup=0.205ms
i=820934     bgAllocs=544200   at=64.451s  lookup=0.233ms
i=1045129    bgAllocs=545300   at=64.572s  lookup=0.323ms
i=49675599   bgAllocs=810200   at=91.600s  lookup=0.651ms
i=59934276   bgAllocs=863000   at=97.000s  lookup=0.661ms
i=63432689   bgAllocs=881700   at=98.900s  lookup=0.696ms
GC forced
i=173370432  bgAllocs=1476800   at=159.688s  lookup=3.407ms
i=173381145  bgAllocs=1476800   at=159.727s  lookup=26.235ms
gc 13 @159.687s 5%: 0.096+17356+0.020 ms clock, 0.38+0.65/17348/34668+0.081 ms cpu, 9642->9659->8463 MB, 12706 MB goal, 4 P
GC forced
gc 14 @297.047s 4%: 0.12+17250+0.021 ms clock, 0.49+30/17250/34376+0.086 ms cpu, 8696->8713->8463 MB, 16926 MB goal, 4 P
GC forced
gc 15 @434.306s 4%: 0.10+16952+0.024 ms clock, 0.41+0.20/16949/33820+0.096 ms cpu, 8696->8712->8463 MB, 16927 MB goal, 4 P
```

I would guess that 26ms lookup latency is likely related to this 26ms 'GC (idle)' period where the main goroutine did not seem to be on a P:

![image](https://user-images.githubusercontent.com/20628140/74051363-c9113100-49a5-11ea-98dc-10c4c3000c13.png)

Another sample run:
```

$ go1.14rc1 build  && GODEBUG=gctrace=1 ./lru-gc-benchmark -free-list-items=1000000 -lru-items=50000000 -duration=20m -bg-alloc-sleep=10ms 

gc 1 @0.011s 6%: 0.014+4.3+0.025 ms clock, 0.056+0.023/3.7/6.6+0.10 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc 2 @0.021s 10%: 0.009+8.4+0.026 ms clock, 0.038+0.34/7.8/11+0.10 ms cpu, 7->7->6 MB, 8 MB goal, 4 P
gc 3 @0.043s 10%: 0.009+12+0.023 ms clock, 0.036+0/11/23+0.092 ms cpu, 13->13->12 MB, 14 MB goal, 4 P
gc 4 @0.084s 11%: 0.010+24+0.023 ms clock, 0.041+0/24/42+0.093 ms cpu, 27->27->25 MB, 28 MB goal, 4 P
gc 5 @0.171s 10%: 0.014+48+0.024 ms clock, 0.058+0/47/93+0.097 ms cpu, 53->54->50 MB, 54 MB goal, 4 P
gc 6 @0.364s 10%: 0.017+97+0.022 ms clock, 0.070+0/96/177+0.088 ms cpu, 107->108->100 MB, 108 MB goal, 4 P
gc 7 @0.758s 10%: 5.6+203+0.022 ms clock, 22+0.49/201/386+0.091 ms cpu, 213->216->200 MB, 214 MB goal, 4 P
gc 8 @1.605s 10%: 0.019+411+0.023 ms clock, 0.078+0/409/796+0.095 ms cpu, 426->432->399 MB, 427 MB goal, 4 P
gc 9 @3.312s 11%: 28+790+0.031 ms clock, 114+194/787/1520+0.12 ms cpu, 853->860->795 MB, 854 MB goal, 4 P
gc 10 @6.964s 12%: 46+1586+0.033 ms clock, 185+730/1585/2988+0.13 ms cpu, 1706->1716->1587 MB, 1707 MB goal, 4 P
gc 11 @14.638s 13%: 79+3760+0.022 ms clock, 316+1164/3758/7168+0.088 ms cpu, 3411->3438->3182 MB, 3412 MB goal, 4 P
gc 12 @31.187s 18%: 196+6723+0.007 ms clock, 785+11516/6720/7379+0.028 ms cpu, 6823->6824->6311 MB, 6824 MB goal, 4 P
LRU count:       50.00M
Free list count: 1.00M
BG allocation:   10000 / sec (approx.)
BG retain count: 10000
Printing slowest lookup times:
i=0          bgAllocs=530900   at=63.844s  lookup=0.006ms
i=100        bgAllocs=530900   at=63.844s  lookup=0.006ms
i=671        bgAllocs=530900   at=63.844s  lookup=0.028ms
i=2407       bgAllocs=530900   at=63.845s  lookup=0.030ms
i=4258       bgAllocs=530900   at=63.847s  lookup=0.120ms
i=80012      bgAllocs=531300   at=63.886s  lookup=0.634ms
i=134006913  bgAllocs=1258500   at=138.086s  lookup=0.679ms
i=145128912  bgAllocs=1321200   at=144.486s  lookup=0.703ms
GC forced
i=170136450  bgAllocs=1454900   at=158.133s  lookup=16.008ms
gc 13 @158.108s 6%: 0.11+17483+0.022 ms clock, 0.44+0.23/17480/34884+0.089 ms cpu, 9639->9656->8463 MB, 12622 MB goal, 4 P
GC forced
gc 14 @295.596s 5%: 0.13+16977+0.024 ms clock, 0.52+28/16977/33847+0.099 ms cpu, 8695->8712->8463 MB, 16927 MB goal, 4 P
GC forced
i=632749868  bgAllocs=3978800   at=432.596s  lookup=11.794ms
gc 15 @432.576s 4%: 0.073+17356+0.021 ms clock, 0.29+0.23/17355/34654+0.084 ms cpu, 8697->8714->8463 MB, 16927 MB goal, 4 P
GC forced
gc 16 @569.938s 4%: 0.17+16911+0.021 ms clock, 0.71+4.9/16908/33760+0.086 ms cpu, 8696->8713->8463 MB, 16926 MB goal, 4 P
GC forced
gc 17 @706.859s 3%: 0.19+17107+0.023 ms clock, 0.76+0.41/17106/34143+0.093 ms cpu, 8697->8714->8463 MB, 16927 MB goal, 4 P
GC forced
i=1327474340 bgAllocs=7763400   at=843.987s  lookup=8.791ms
gc 18 @843.969s 3%: 0.28+17297+0.022 ms clock, 1.1+0.82/17295/34532+0.088 ms cpu, 8696->8713->8463 MB, 16927 MB goal, 4 P
i=1344317800 bgAllocs=7849300   at=861.294s  lookup=13.015ms
...
^C
```

I suspect this trace section corresponds to the `at=432.596s  lookup=11.794ms` observed latency:

![image](https://user-images.githubusercontent.com/20628140/74051285-a717ae80-49a5-11ea-89c4-c05189e5db61.png)

In both cases, the traces started roughly 100 seconds after the start of the process, so the timestamps between the log message vs. images are offset roughly by 100 seconds.

### Related issues 

This issue might very well be a duplicate. For example, there is a recent list of 5 known issues in https://github.com/golang/go/issues/14812#issuecomment-574203014. That said, if this issue is explained by one of the 5 items there, it is not immediately obvious to me which one, including some of the issues in that list are resolved now, but this happens in go1.14rc1. Also possibly related to #14179, #18155, or others.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

runtime: 10ms-26ms latency from GC in go1.14rc1, possibly due to 'GC (idle)' work #37116

What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (`go env`)?

Background

What did you do?

What did you expect to see?

What did you see instead?

Related issues

Comparing 1.9.2 to 1.14rc1

Go 1.9.2 counts of delays

Go 1.14rc1 counts of delays

5 remaining items

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

runtime: 10ms-26ms latency from GC in go1.14rc1, possibly due to 'GC (idle)' work #37116

Description

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

Background

What did you do?

What did you expect to see?

What did you see instead?

Related issues

Activity

cagedmantis commented on Feb 7, 2020

heschi commented on Feb 9, 2020

RLH commented on Feb 9, 2020

randall77 commented on Feb 9, 2020

cagedmantis commented on Feb 10, 2020

aclements commented on Feb 10, 2020

thepudds commented on Feb 11, 2020

Comparing 1.9.2 to 1.14rc1

Go 1.9.2 counts of delays

Go 1.14rc1 counts of delays

thepudds commented on Feb 11, 2020

ianlancetaylor commented on Jun 16, 2020

5 remaining items

ianlancetaylor commented on Apr 23, 2021

mknyszek commented on Apr 23, 2021

gopherbot commented on Mar 19, 2022

mknyszek commented on Apr 26, 2022

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?