Description
What version of Go are you using (go version)?
$ go1.14rc1 version
go version go1.14rc1 linux/amd64
Does this issue reproduce with the latest release?
Yes, go1.14rc1.
What operating system and processor architecture are you using (go env)?
Ubuntu 18.04.2 LTS (stock GCE image). 4 vCPUs, 15 GB.
go env Output
$ go1.14rc1 env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/user/.cache/go-build"
GOENV="/home/user/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/user/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/user/sdk/go1.14rc1"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/user/sdk/go1.14rc1/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build656951356=/tmp/go-build -gno-record-gcc-switches"
Background
A recent blog post reported an issue seen with the now >2-year-old Go 1.10 when an LRU held "tens of millions" of entries: CPU spikes and response-time spikes every two minutes, with a p95 response time reportedly hitting 50ms-300ms.
The two-minute periodicity was explained in the blog by the following, which seems plausible:
we had written the Go code very efficiently and had very few allocations. We were not creating a lot of garbage.
After digging through the Go source code, we learned that Go will force a garbage collection run every 2 minutes at minimum. In other words, if garbage collection has not run for 2 minutes, regardless of heap growth, go will still force a garbage collection.
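(As an aside, that two-minute forced GC is easy to observe directly. Here is a small standalone program of my own, not from the blog, that polls the GC count of a mostly idle process; with essentially no allocation, NumGC should still tick up roughly every two minutes.)

package main

import (
	"fmt"
	"runtime"
	"time"
)

// Polls the GC count of a mostly idle process. With essentially no
// allocation, NumGC should still increase roughly every two minutes
// because of the runtime's periodic forced GC.
func main() {
	var ms runtime.MemStats
	for {
		runtime.ReadMemStats(&ms)
		fmt.Printf("t=%s numGC=%d\n", time.Now().Format("15:04:05"), ms.NumGC)
		time.Sleep(30 * time.Second)
	}
}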
A separate recent post from one of the engineers theorized that the problem was due to a very long linked list, but I am less sure whether this comment is a precise description of the actual root cause:
we used a large free-list like structure, a very long linked list. The head of the list is maintained as a variable, which means that Go's mark phase must start scanning from the head and then pointer chase its way through the list. For whatever reason, Go does (did?) this section in a single-threaded manner with a global lock held. As a result, everything must wait until this extremely long pointer chase occurs.
An engineer from the same team may have also posted this roughly 2.5 years ago in #14812 (comment):
I've also got a service that's encountering long request latencies due to CPU starvation when the GC runs. Like a couple others have mentioned, I think the GC may act poorly on long linked lists - I have a 1M-long list/map used as an LRU cache.
What did you do?
I wrote a simple GC benchmark intended to emulate an LRU cache with a separate free list.
It's not entirely clear what their exact data structures were: it may have been just an LRU whose elements are linked together to help expire older items, or there may have been an additional free list separate from the LRU, or something else entirely.
Last night, without too much to go on, I put together a simple GC benchmark of an LRU cache along with a separate free list.
The benchmark is here:
https://play.golang.org/p/-Vo1hqIAoNz
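For readers who don't want to open the link, here is a rough, simplified sketch of the shape of the data structures the benchmark exercises (my own shorthand, not the actual benchmark code; the benchmark at the link above differs in details such as flags, background allocation, and timing of lookups):

package main

import "container/list"

// entry is a value stored in the LRU cache.
type entry struct {
	key   int64
	value [16]byte
}

// lru is a large map plus a linked list threading through its items
// (roughly the shape of github.com/golang/groupcache/lru), so that the
// oldest items can be evicted in order.
type lru struct {
	items map[int64]*list.Element // key -> list element holding an *entry
	order *list.List              // front = most recently used
}

func newLRU() *lru {
	return &lru{items: make(map[int64]*list.Element), order: list.New()}
}

func (c *lru) add(k int64) {
	c.items[k] = c.order.PushFront(&entry{key: k})
}

func (c *lru) lookup(k int64) *entry {
	if e, ok := c.items[k]; ok {
		c.order.MoveToFront(e)
		return e.Value.(*entry)
	}
	return nil
}

// freeNode is one node of a separate free list, retained only via its
// head pointer, which the GC must pointer-chase sequentially.
type freeNode struct {
	next *freeNode
}

func buildFreeList(n int) *freeNode {
	var head *freeNode
	for i := 0; i < n; i++ {
		head = &freeNode{next: head}
	}
	return head
}

func main() {
	cache := newLRU()
	for i := int64(0); i < 50_000_000; i++ { // -lru-items
		cache.add(i)
	}
	freeList := buildFreeList(1_000_000) // -free-list-items
	_ = freeList
	_ = cache.lookup(12345)
	// The real benchmark also allocates in the background (-bg-alloc-sleep)
	// and records the slowest lookup times.
}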
Here is a sample run with 50M items in an LRU cache (a map, where the items in the map are also on a linked list), 1M items on a separate free list, and a background allocation rate of about 10k allocations/sec:
$ go1.14rc1 build
$ GODEBUG=gctrace=1 ./lru-gc-benchmark -free-list-items=1000000 -lru-items=50000000 -duration=20m -bg-alloc-sleep=10ms
What did you expect to see?
Ideally, a sub-ms worst-case latency impact from GC.
What did you see instead?
This was fairly quick and dirty and might not be related at all to the problem reported in the blog, but it did seem to show occasionally higher latency values.
It is important to note that the issue here does not seem to be anywhere near the scale reported in the blog:
- The amplitude of the latency impact here (10ms-26ms) seems substantially smaller than what was reported on the blog (50ms-300ms).
- The issue here reports worst-case latency, rather than the p95 latency reported on the blog.
Here is an example where a single 26ms worst-case latency was observed when looking up an element in the LRU during a forced GC.
$ go1.14rc1 build && GODEBUG=gctrace=1 ./lru-gc-benchmark -free-list-items=1000000 -lru-items=50000000 -duration=20m -bg-alloc-sleep=10ms
gc 1 @0.011s 5%: 0.066+4.1+0.023 ms clock, 0.26+0.30/3.0/6.9+0.095 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc 2 @0.022s 10%: 0.008+9.0+0.021 ms clock, 0.033+0.33/8.7/10+0.087 ms cpu, 7->7->6 MB, 8 MB goal, 4 P
gc 3 @0.044s 10%: 0.012+13+0.020 ms clock, 0.049+0/12/23+0.083 ms cpu, 13->13->12 MB, 14 MB goal, 4 P
gc 4 @0.087s 10%: 0.009+26+0.030 ms clock, 0.038+0/24/46+0.12 ms cpu, 27->27->25 MB, 28 MB goal, 4 P
gc 5 @0.177s 10%: 0.010+49+0.023 ms clock, 0.042+0/47/89+0.095 ms cpu, 53->54->50 MB, 54 MB goal, 4 P
gc 6 @0.369s 10%: 0.015+98+0.022 ms clock, 0.060+0/98/179+0.089 ms cpu, 107->108->100 MB, 108 MB goal, 4 P
gc 7 @0.767s 10%: 0.012+195+0.021 ms clock, 0.049+0/194/380+0.086 ms cpu, 213->216->200 MB, 214 MB goal, 4 P
gc 8 @1.586s 11%: 7.4+377+0.028 ms clock, 29+71/376/742+0.11 ms cpu, 426->430->398 MB, 427 MB goal, 4 P
gc 9 @3.343s 11%: 0.024+787+0.023 ms clock, 0.099+236/785/1501+0.094 ms cpu, 853->859->795 MB, 854 MB goal, 4 P
gc 10 @7.071s 11%: 0.023+1806+0.023 ms clock, 0.092+497/1803/3513+0.095 ms cpu, 1706->1719->1591 MB, 1707 MB goal, 4 P
gc 11 @14.905s 12%: 49+3796+0.023 ms clock, 196+1084/3796/7271+0.093 ms cpu, 3411->3439->3182 MB, 3412 MB goal, 4 P
gc 12 @31.292s 14%: 226+8162+0.043 ms clock, 904+5028/8161/14256+0.17 ms cpu, 6823->6866->6353 MB, 6824 MB goal, 4 P
LRU count: 50.00M
Free list count: 1.00M
BG allocation: 10000 / sec (approx.)
BG retain count: 10000
Printing slowest lookup times:
i=0 bgAllocs=540000 at=64.026s lookup=0.009ms
i=4096 bgAllocs=540100 at=64.028s lookup=0.016ms
i=8944 bgAllocs=540100 at=64.030s lookup=0.118ms
i=9410 bgAllocs=540100 at=64.031s lookup=0.123ms
i=52120 bgAllocs=540300 at=64.053s lookup=0.141ms
i=54295 bgAllocs=540300 at=64.054s lookup=0.148ms
i=81167 bgAllocs=540400 at=64.068s lookup=0.185ms
i=141525 bgAllocs=540700 at=64.099s lookup=0.193ms
i=473297 bgAllocs=542400 at=64.271s lookup=0.205ms
i=820934 bgAllocs=544200 at=64.451s lookup=0.233ms
i=1045129 bgAllocs=545300 at=64.572s lookup=0.323ms
i=49675599 bgAllocs=810200 at=91.600s lookup=0.651ms
i=59934276 bgAllocs=863000 at=97.000s lookup=0.661ms
i=63432689 bgAllocs=881700 at=98.900s lookup=0.696ms
GC forced
i=173370432 bgAllocs=1476800 at=159.688s lookup=3.407ms
i=173381145 bgAllocs=1476800 at=159.727s lookup=26.235ms
gc 13 @159.687s 5%: 0.096+17356+0.020 ms clock, 0.38+0.65/17348/34668+0.081 ms cpu, 9642->9659->8463 MB, 12706 MB goal, 4 P
GC forced
gc 14 @297.047s 4%: 0.12+17250+0.021 ms clock, 0.49+30/17250/34376+0.086 ms cpu, 8696->8713->8463 MB, 16926 MB goal, 4 P
GC forced
gc 15 @434.306s 4%: 0.10+16952+0.024 ms clock, 0.41+0.20/16949/33820+0.096 ms cpu, 8696->8712->8463 MB, 16927 MB goal, 4 P
I would guess that the 26ms lookup latency is likely related to a 26ms 'GC (idle)' period visible in the execution trace, where the main goroutine did not seem to be on a P.
Another sample run:
$ go1.14rc1 build && GODEBUG=gctrace=1 ./lru-gc-benchmark -free-list-items=1000000 -lru-items=50000000 -duration=20m -bg-alloc-sleep=10ms
gc 1 @0.011s 6%: 0.014+4.3+0.025 ms clock, 0.056+0.023/3.7/6.6+0.10 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc 2 @0.021s 10%: 0.009+8.4+0.026 ms clock, 0.038+0.34/7.8/11+0.10 ms cpu, 7->7->6 MB, 8 MB goal, 4 P
gc 3 @0.043s 10%: 0.009+12+0.023 ms clock, 0.036+0/11/23+0.092 ms cpu, 13->13->12 MB, 14 MB goal, 4 P
gc 4 @0.084s 11%: 0.010+24+0.023 ms clock, 0.041+0/24/42+0.093 ms cpu, 27->27->25 MB, 28 MB goal, 4 P
gc 5 @0.171s 10%: 0.014+48+0.024 ms clock, 0.058+0/47/93+0.097 ms cpu, 53->54->50 MB, 54 MB goal, 4 P
gc 6 @0.364s 10%: 0.017+97+0.022 ms clock, 0.070+0/96/177+0.088 ms cpu, 107->108->100 MB, 108 MB goal, 4 P
gc 7 @0.758s 10%: 5.6+203+0.022 ms clock, 22+0.49/201/386+0.091 ms cpu, 213->216->200 MB, 214 MB goal, 4 P
gc 8 @1.605s 10%: 0.019+411+0.023 ms clock, 0.078+0/409/796+0.095 ms cpu, 426->432->399 MB, 427 MB goal, 4 P
gc 9 @3.312s 11%: 28+790+0.031 ms clock, 114+194/787/1520+0.12 ms cpu, 853->860->795 MB, 854 MB goal, 4 P
gc 10 @6.964s 12%: 46+1586+0.033 ms clock, 185+730/1585/2988+0.13 ms cpu, 1706->1716->1587 MB, 1707 MB goal, 4 P
gc 11 @14.638s 13%: 79+3760+0.022 ms clock, 316+1164/3758/7168+0.088 ms cpu, 3411->3438->3182 MB, 3412 MB goal, 4 P
gc 12 @31.187s 18%: 196+6723+0.007 ms clock, 785+11516/6720/7379+0.028 ms cpu, 6823->6824->6311 MB, 6824 MB goal, 4 P
LRU count: 50.00M
Free list count: 1.00M
BG allocation: 10000 / sec (approx.)
BG retain count: 10000
Printing slowest lookup times:
i=0 bgAllocs=530900 at=63.844s lookup=0.006ms
i=100 bgAllocs=530900 at=63.844s lookup=0.006ms
i=671 bgAllocs=530900 at=63.844s lookup=0.028ms
i=2407 bgAllocs=530900 at=63.845s lookup=0.030ms
i=4258 bgAllocs=530900 at=63.847s lookup=0.120ms
i=80012 bgAllocs=531300 at=63.886s lookup=0.634ms
i=134006913 bgAllocs=1258500 at=138.086s lookup=0.679ms
i=145128912 bgAllocs=1321200 at=144.486s lookup=0.703ms
GC forced
i=170136450 bgAllocs=1454900 at=158.133s lookup=16.008ms
gc 13 @158.108s 6%: 0.11+17483+0.022 ms clock, 0.44+0.23/17480/34884+0.089 ms cpu, 9639->9656->8463 MB, 12622 MB goal, 4 P
GC forced
gc 14 @295.596s 5%: 0.13+16977+0.024 ms clock, 0.52+28/16977/33847+0.099 ms cpu, 8695->8712->8463 MB, 16927 MB goal, 4 P
GC forced
i=632749868 bgAllocs=3978800 at=432.596s lookup=11.794ms
gc 15 @432.576s 4%: 0.073+17356+0.021 ms clock, 0.29+0.23/17355/34654+0.084 ms cpu, 8697->8714->8463 MB, 16927 MB goal, 4 P
GC forced
gc 16 @569.938s 4%: 0.17+16911+0.021 ms clock, 0.71+4.9/16908/33760+0.086 ms cpu, 8696->8713->8463 MB, 16926 MB goal, 4 P
GC forced
gc 17 @706.859s 3%: 0.19+17107+0.023 ms clock, 0.76+0.41/17106/34143+0.093 ms cpu, 8697->8714->8463 MB, 16927 MB goal, 4 P
GC forced
i=1327474340 bgAllocs=7763400 at=843.987s lookup=8.791ms
gc 18 @843.969s 3%: 0.28+17297+0.022 ms clock, 1.1+0.82/17295/34532+0.088 ms cpu, 8696->8713->8463 MB, 16927 MB goal, 4 P
i=1344317800 bgAllocs=7849300 at=861.294s lookup=13.015ms
...
^C
I suspect this trace section (from the execution trace) corresponds to the at=432.596s lookup=11.794ms latency observed above.
In both cases, the traces started roughly 100 seconds after the start of the process, so the timestamps in the log messages and in the trace images are offset from each other by roughly 100 seconds.
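(For anyone trying to reproduce this, one way to capture a similar execution trace is via runtime/trace; this is a generic sketch under my own assumptions, not necessarily how the traces above were produced.)

package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

// Captures an execution trace to a file that can be viewed with
// `go tool trace trace.out`. Starting the trace after a delay (e.g.
// ~100s in) keeps the file small while still covering a forced GC.
func main() {
	time.Sleep(100 * time.Second) // skip the initial warm-up phase

	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... run the workload of interest here for a while ...
	time.Sleep(60 * time.Second)
}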
Related issues
This issue might very well be a duplicate. For example, there is a recent list of 5 known issues in #14812 (comment). That said, if this issue is explained by one of those 5 items, it is not immediately obvious to me which one, especially since some of the issues in that list have since been resolved, yet this still happens in go1.14rc1. It is also possibly related to #14179, #18155, or others.
Activity
cagedmantis commented on Feb 7, 2020
/cc @aclements @rsc @randall77
heschi commented on Feb 9, 2020
I think this is working as expected. At the heart of github.com/golang/groupcache/lru is a single large map. AFAIK the map has to be marked all in one go, which will take a long time. In the execution trace above you can see a MARK ASSIST segment below G21. If you click on that, I think you'll find that it spent all of its time marking a single object, presumably the map. Austin and others will know better, of course. Since a real server would need to lock the map, sharding it would be a good idea, which would also reduce the latency impact of a mark.
(cc @rsc, since I think @cagedmantis made a typo.)
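(For illustration, here is a minimal sketch of the kind of sharding being suggested; the shard count and wrapper names are hypothetical, and a real server would tune and extend this.)

package main

import (
	"hash/fnv"
	"sync"

	"github.com/golang/groupcache/lru"
)

const numShards = 64 // hypothetical; tune for contention and mark latency

// shardedCache splits one huge LRU into many smaller ones, so lock
// contention is spread out and no single map or list is enormous from
// the GC's point of view.
type shardedCache struct {
	shards [numShards]struct {
		mu    sync.Mutex
		cache *lru.Cache
	}
}

func newShardedCache(maxEntriesPerShard int) *shardedCache {
	c := &shardedCache{}
	for i := range c.shards {
		c.shards[i].cache = lru.New(maxEntriesPerShard)
	}
	return c
}

func (c *shardedCache) shardFor(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % numShards)
}

func (c *shardedCache) Add(key string, value interface{}) {
	s := &c.shards[c.shardFor(key)]
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cache.Add(key, value)
}

func (c *shardedCache) Get(key string) (interface{}, bool) {
	s := &c.shards[c.shardFor(key)]
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.cache.Get(key)
}

func main() {
	c := newShardedCache(1 << 20)
	c.Add("k", 42)
	if v, ok := c.Get("k"); ok {
		_ = v
	}
}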
RLH commented on Feb 9, 2020
randall77 commented on Feb 9, 2020
cagedmantis commented on Feb 10, 2020
Thanks @heschik
aclements commented on Feb 10, 2020
Thanks for trying to put something together to recreate this behavior, especially from the little scraps of information we have. :)
Based on the execution traces, I think you're probably right that the idle GC worker is the root of the problem here. The idle worker currently runs until a scheduler preemption, which happens only every 10 to 20ms. The trace shows a fifth goroutine that appears to enter the run queue a moment after the idle worker kicks in, which is presumably the user goroutine.
I agree with @randall77 that this is not likely to be related to the large map, since we do break up large objects when marking.
In general, long linked lists are not very GC-friendly, since the GC has no choice but to walk the list sequentially. That will inherently serialize that part of marking (assuming it's the only work left), though I don't understand the quoted comment about this happening "with a global lock". I don't think there's been anything like a global lock here since we switched to a concurrent GC in Go 1.5.
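(To make the serialization point concrete, here is a tiny illustration of my own, not taken from the benchmark: a chain reachable only through its head must be discovered node by node, while a large slice of pointers can be scanned in pieces and its elements marked by multiple workers.)

package main

// node is one element of a long singly linked list.
type node struct {
	next    *node
	payload [32]byte
}

// buildList returns a chain reachable only through its head. The GC can
// only discover node i+1 via node i's next pointer, so marking the chain
// proceeds one node at a time regardless of how many mark workers exist.
func buildList(n int) *node {
	var head *node
	for i := 0; i < n; i++ {
		head = &node{next: head}
	}
	return head
}

// buildSlice holds the same number of nodes in a slice. The backing array
// is a single large object the GC can scan in pieces, and the nodes it
// points to can then be marked by multiple workers in parallel.
func buildSlice(n int) []*node {
	s := make([]*node, n)
	for i := range s {
		s[i] = &node{}
	}
	return s
}

var (
	keepList  *node
	keepSlice []*node
)

func main() {
	keepList = buildList(1_000_000)
	keepSlice = buildSlice(1_000_000)
}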
thepudds commented on Feb 11, 2020
The Discord blog post described above said they saw GC-related latency problems with Go 1.9.2 (and elsewhere they later said they had tested, but seemingly not deployed, Go 1.10), so part of the intent here was to see whether the pauses they reported could be reproduced in Go 1.9.2.
Here are some results from running that benchmark above in a loop overnight, with ten 15-minute runs first on 1.9.2 and then on 1.14rc1.
Comparing 1.9.2 to 1.14rc1
Each line here in these two tables shows the counts of delays in different ranges for one 15-minute run:
Go 1.9.2 counts of delays
Go 1.14rc1 counts of delays
I'm including the raw results here partly so that anyone running this benchmark in the future can see some of the current run-to-run variability.
thepudds commented on Feb 11, 2020
Also, the 5 worst lookups seen in 1.9.2 across those ten runs:
As I noted above, this is a simplified benchmark; in particular, the lookups are all done sequentially, so a GC-induced delay of >100ms might be recorded in only one measurement here, whereas in a real system with parallel inbound requests such a delay could impact many requests at once and hence show up in multiple response time measurements.
ianlancetaylor commented on Jun 16, 2020
Rolling forward to 1.16.
ianlancetaylor commented on Apr 23, 2021
I'm guessing that this isn't done for 1.17. Should we move this to 1.18 or should we move it to Backlog?
mknyszek commented on Apr 23, 2021
Looking over this quickly, I think it'll be helped a bit by #44163. It looks like the scheduler saw the chance to schedule an idle GC worker because the only piece of work hadn't yet been observed by the queue, and the idle worker then proceeded to hog the whole quantum.
I think there are other good reasons to remove the idle GC workers, and this is one more. I would really like to remove them early in the cycle for 1.18.
gopherbot commented on Mar 19, 2022
Change https://go.dev/cl/393394 mentions this issue:
runtime: reduce max idle mark workers during periodic GC cycles
mknyszek commented on Apr 26, 2022
Re-reading the issue, https://go.dev/cl/393394 should solve it. Please feel free to reopen if that's not the case.