Description
What version of Go are you using (go version)?
$ go version
go version go1.21.4 linux/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
$ go env
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/builder/.cache/go-build'
GOENV='/home/builder/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/builder/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/builder/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.4'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='0'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3335037910=/tmp/go-build -gno-record-gcc-switches'
What did you do?
Our production service was recently shut down by the OOM killer, which led us to inspect its memory usage in detail.
We discovered that, while running, our process had a memory consumption (RSS) that grew over time and never shrank.
Sample runtime.MemStats report:
"Alloc": 830731560,
"TotalAlloc": 341870177656,
"Sys": 8023999048,
"Lookups": 0,
"Mallocs": 3129044622,
"Frees": 3124956536,
"HeapAlloc": 830731560,
"HeapSys": 7836532736,
"HeapIdle": 6916292608,
"HeapInuse": 920240128,
"HeapReleased": 6703923200,
"HeapObjects": 4088086,
"StackInuse": 15204352,
"StackSys": 15204352,
"MSpanInuse": 8563968,
"MSpanSys": 17338944,
"MCacheInuse": 4800,
"MCacheSys": 15600,
"BuckHashSys": 5794138,
"GCSys": 146092920,
"OtherSys": 3020358,
"NextGC": 1046754240,
"LastGC": 1700579048506142728,
"PauseTotalNs": 108783964,
At that time, the reported RSS for our process was 3.19 GB.
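(For reference, a report like the one above can be produced by dumping runtime.MemStats as JSON; a minimal sketch, assuming a standalone program rather than our actual monitoring code:)

```go
package main

import (
	"encoding/json"
	"os"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	// Dump the stats as JSON; the field names match the report above
	// (Alloc, TotalAlloc, Sys, HeapIdle, HeapReleased, ...).
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	enc.Encode(&ms)
}
```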
We looked at our history in more detail and observed a big change in production behavior when we upgraded our Go version from 1.19.5 to 1.20.0. We unfortunately didn't notice the issue at the time, because we upgrade (and restart) our service on a regular basis.
To confirm this theory, we downgraded our Go version back to 1.19.13, and our memory consumption is now small and stable again.
Here is a graph of the RSS of our service over the last 48 hours; the drop corresponds to our new deployment with go 1.19.13:
It should be noted that our production kernel is a hardened kernel based on grsecurity 5.15.28, which may be related to this issue (randomized heap addresses?).
What did you expect to see?
Constant and stable memory usage.
What did you see instead?
The Go runtime does not seem to release memory back to the system.
Unfortunately, we have only been able to observe this issue on our production system, in production conditions.
We have not yet been able to reproduce the issue on other systems, or by running isolated features in test programs deployed on our production infrastructure.
Activity
seankhliao commented on Nov 22, 2023
Perhaps try looking at profiling.
LeGEC commented on Nov 22, 2023
@seankhliao: we did. The OOM killer incident was a combination of high memory usage and a bug on our side, which we fixed.
After that fix, memory profiles reported only small amounts for 'inuse_size', in the 300-400 MB range, while our process's RSS was several GB. More importantly, the RSS of our process would grow over time while the profiles reported no extra memory in use.
We could definitely optimize our allocations in general, but the fact is that compiling the exact same code with 1.19 produces a binary which doesn't leak. [edit: we will keep watching the "leak" part over the next few days, but the system memory footprint is definitely lower, and shrinks on occasion, which didn't happen with go 1.21]
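(For context, the profiles mentioned above are standard Go heap profiles; a minimal sketch of that kind of setup, assuming the stock net/http/pprof handler rather than our actual service wiring:)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// In-use heap data can then be inspected with:
	//   go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```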
LeGEC commented on Nov 22, 2023
@seankhliao: I see the WaitingForInfo tag; do you have a more specific request in mind?
vanackere commented on Nov 22, 2023
@seankhliao (side note: I work with @LeGEC): we did extensive memory profiling of our processes over several weeks (and can provide full data privately if necessary), but from the point of view of the memory profiler nothing was leaking, and nothing would explain the constant growth in memory usage, most of which is never released to the OS.
This is definitely a runtime issue, as running the exact same code with go 1.19.13 leads to drastically better behavior (see the graph in the first post).
We are also willing to test any intermediate compiler version or run any profiling command that would be useful to track down this issue.
vanackere commented on Nov 22, 2023
And for completeness: we also tried various combinations of debug.FreeOSMemory() / GOMEMLIMIT=1500MiB, with no success at all.
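(For illustration, that combination amounts to something like the following; a rough sketch, not our actual code — GOMEMLIMIT can equivalently be set via debug.SetMemoryLimit:)

```go
package main

import (
	"runtime/debug"
	"time"
)

func main() {
	// Equivalent to running with GOMEMLIMIT=1500MiB.
	debug.SetMemoryLimit(1500 << 20)

	// Periodically force a GC and ask the runtime to return as much
	// freed memory as possible to the OS.
	go func() {
		for range time.Tick(time.Minute) {
			debug.FreeOSMemory()
		}
	}()

	select {} // placeholder for the actual service work
}
```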
Downgrading our production service to a now-unsupported version of the Go compiler was only done as a last-resort attempt, since we have not managed to reproduce the issue outside of production. We'd like to make sure that this issue at least gets fixed for go 1.22, and we will provide all the data necessary.
Unless someone from the Go compiler/runtime team has a better suggestion, we are planning to do a test run with go 1.20.0 tomorrow and update this issue with the corresponding numbers.
mauri870 commented on Nov 22, 2023
I remember that some things changed regarding huge pages on Linux and were backported to 1.21; that may be related.
#61718
It is worth reading the GC guide section on Linux transparent huge pages and GOGC: https://go.dev/doc/gc-guide#Linux_transparent_huge_pages
mauri870 commented on Nov 22, 2023
cc @mknyszek
vanackere commented on Nov 22, 2023
@mauri870 on our production system:
If huge pages are indeed involved: is there a way to disable transparent_hugepage for a single process / without root access? (Alternatively, is there a way to patch the Go compiler to prevent transparent_hugepage usage?)
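(For reference, Linux does expose a per-process opt-out, prctl(PR_SET_THP_DISABLE), which does not require root; a minimal sketch, assuming Linux >= 3.15 — whether it also covers mappings created before the call depends on the kernel version:)

```go
package main

import (
	"fmt"
	"syscall"
)

// PR_SET_THP_DISABLE from <linux/prctl.h> (value 41, available since Linux 3.15).
const prSetTHPDisable = 41

func main() {
	// prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0): disable transparent huge pages
	// for this process; no root privileges needed.
	if _, _, errno := syscall.Syscall6(syscall.SYS_PRCTL, prSetTHPDisable, 1, 0, 0, 0, 0); errno != 0 {
		fmt.Println("prctl(PR_SET_THP_DISABLE) failed:", errno)
	}
}
```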
[Issue title changed from "go 1.20.0 -> 1.21.4: excessive memory consumption (possible runtime memory leak ?)" to "runtime: excessive memory use between 1.19 -> 1.21"]
LeGEC commented on Nov 22, 2023
MemStats of our go1.19 program:
RSS of process: 1158688 kB (so ~1.16 GB)
mknyszek commented on Nov 22, 2023
The huge page related changes can't be the culprit, because huge pages were only forced by the runtime between Go 1.21.0 and Go 1.21.3 (inclusive). Go 1.21.4 no longer adjusts huge page settings at all. I don't see any of those versions in the conversation above.
mknyszek commented on Nov 22, 2023
How are you measuring RSS?
Also, as a side note, "Sys" is virtual memory footprint. It's not going to match up with RSS. "Sys - HeapReleased" will be much closer.
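(In code, that comparison is just a difference of two MemStats fields; a trivial sketch:)

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	// Sys is the total virtual memory obtained from the OS; HeapReleased is
	// the portion already returned to it. Their difference is a closer
	// (though still approximate) analogue of RSS than Sys alone.
	fmt.Printf("Sys - HeapReleased = %d bytes\n", ms.Sys-ms.HeapReleased)
}
```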
Hmm... is it possible this is related to a regression in a specific standard library package for instance? IIRC we didn't see anything like this back when Go 1.20 was released, so it's a bit surprising. It's still possible it's related to the compiler/runtime, so I don't mean to rule it out.
Is it possible to reproduce this on a smaller scale? The difference seems fairly big and "obvious", so if you can bisect down to the commit between Go 1.19 and Go 1.20 that causes it, that would immediately tell us what happened.
vanackere commented on Nov 22, 2023
@mknyszek RSS was measured from the output of ps aux. The value mentioned in the initial post may indeed have been slightly off, but not by much: the RSS of our process right before restarting the service with our version compiled with go 1.19 was 4397688 kB (with a VSZ of 9107484 kB).
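(For completeness, the same numbers can also be read from inside the process via /proc/self/status, whose VmRSS/VmSize fields correspond to the RSS/VSZ columns of ps aux; a minimal sketch:)

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Print only the two fields of interest, reported by the kernel in kB.
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "VmRSS:") || strings.HasPrefix(line, "VmSize:") {
			fmt.Println(line)
		}
	}
}
```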
The same production process with go 1.19.13 currently gives the following values, which are much more coherent with the actual allocations reported by profiling: