
runtime: excessive memory use between 1.21.0 -> 1.21.1 due to hugepages and the linux/amd64 max_ptes_none default of 512 #64332

Closed
@LeGEC

Description


What version of Go are you using (go version)?

$ go version
go version go1.21.4 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/builder/.cache/go-build'
GOENV='/home/builder/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/builder/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/builder/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.4'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='0'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3335037910=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Our production service was recently killed by the OOM killer, which led us to inspect its memory usage in detail.

We discovered that, while running, our process's memory consumption (RSS) grew over time and never shrank.

Sample runtime.MemStats report:

	"Alloc": 830731560,
	"TotalAlloc": 341870177656,
	"Sys": 8023999048,
	"Lookups": 0,
	"Mallocs": 3129044622,
	"Frees": 3124956536,
	"HeapAlloc": 830731560,
	"HeapSys": 7836532736,
	"HeapIdle": 6916292608,
	"HeapInuse": 920240128,
	"HeapReleased": 6703923200,
	"HeapObjects": 4088086,
	"StackInuse": 15204352,
	"StackSys": 15204352,
	"MSpanInuse": 8563968,
	"MSpanSys": 17338944,
	"MCacheInuse": 4800,
	"MCacheSys": 15600,
	"BuckHashSys": 5794138,
	"GCSys": 146092920,
	"OtherSys": 3020358,
	"NextGC": 1046754240,
	"LastGC": 1700579048506142728,
	"PauseTotalNs": 108783964,

At that time, the reported RSS for our process was 3.19 GB.
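For reference, a report like the one above can be produced with a helper along these lines (a minimal sketch, not necessarily the exact code used in the service; the report above also omits the per-size-class and pause arrays):

package main

import (
	"encoding/json"
	"os"
	"runtime"
)

// dumpMemStats writes the current runtime.MemStats as indented JSON,
// producing fields such as Alloc, Sys, HeapIdle and HeapReleased.
func dumpMemStats() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "\t")
	_ = enc.Encode(m)
}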

We looked at our history in more detail and observed a big change in production behavior when we upgraded our Go version from 1.19.5 to 1.20.0. Unfortunately, we didn't notice the issue at the time, because we upgrade (and restart) our service on a regular basis.

To confirm this theory, we downgraded our Go version back to 1.19.13, and our memory consumption is now small and stable again.

Here is a graph of the RSS of our service over the last 48 hours; the drop corresponds to our new deployment with Go 1.19.13:

[image: RSS of the service over the last 48 hours, dropping after the Go 1.19.13 deployment]

It should be noted that our production kernel is a hardened kernel based on grsecurity 5.15.28, which may be related to this issue (randomized heap addresses?).

What did you expect to see?

Constant and stable memory usage.

What did you see instead?

The Go runtime does not seem to release memory back to the system.


Unfortunately, we have only been able to observe this issue on our production system, in production conditions.

We have not yet been able to reproduce the issue on other systems, or by running isolated features in test programs deployed on our production infrastructure.

Activity

@seankhliao (Member) commented on Nov 22, 2023

Perhaps try looking at profiling.

Added the WaitingForInfo label (issue is not actionable because of missing required information, which needs to be provided) on Nov 22, 2023
@LeGEC (Author) commented on Nov 22, 2023

@seankhliao: we did; the OOM-killer incident was a combination of high memory usage and a bug on our side, which we fixed.

After that fix, memory profiles would report only small amounts of inuse_space, in the 300-400 MB range, while our process's RSS was several GB. More importantly, the RSS of our process would grow over time while the profiles reported no additional in-use memory.

We could certainly optimize allocations in general, but the fact remains: compiling the exact same code with 1.19 produces a binary that doesn't leak. [edit: we will watch the "leak" part over the next few days, but the system memory footprint is definitely lower, and it shrinks on occasion, which didn't happen with Go 1.21]
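For context, the kind of inuse_space heap profiling referred to above can be wired in with the standard pprof handler, roughly like this (a minimal sketch, not necessarily the exact setup used in the service):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

// startPprofServer exposes heap profiles on a local port. Profiles can then be
// inspected with:
//   go tool pprof -sample_index=inuse_space http://localhost:6060/debug/pprof/heap
func startPprofServer() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}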

@LeGEC (Author) commented on Nov 22, 2023

@seankhliao: I see the WaitingForInfo tag; do you have a more specific request in mind?

@vanackere (Contributor) commented on Nov 22, 2023

@seankhliao (side note: I work with @LeGEC): we did extensive memory profiling of our processes over several weeks (and can provide the full data privately if necessary), but from the point of view of the memory profiler nothing was leaking and nothing would explain the constant growth in memory usage, most of which was never released to the OS.
This is definitely a runtime issue, as running the exact same code with Go 1.19.13 leads to drastically better behavior (see the graph in the first post).
We are also willing to test any intermediate compiler version or run any profiling command that would be useful to track down this issue.

@vanackere (Contributor) commented on Nov 22, 2023

And for completeness: we also tried various combinations of debug.FreeOSMemory() / GOMEMLIMIT=1500MiB, with no success at all.
Downgrading our production service to a now unsupported version of the Go compiler was only done as a last-resort attempt, since we have not managed to reproduce the issue outside of production. We'd like to make sure that this issue at least gets fixed for Go 1.22, and we will provide all the data necessary.

Unless someone from the Go compiler/runtime team has a better suggestion: we are planning to make a test run with Go 1.20.0 tomorrow and update this issue with the corresponding numbers.
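For reference, the mitigations mentioned above amount to roughly the following (a minimal sketch; the exact limit value and call sites in the service differ):

package main

import (
	"runtime/debug"
	"time"
)

// applyMemoryMitigations sets a soft memory limit (the in-code equivalent of
// GOMEMLIMIT=1500MiB) and periodically forces a GC cycle that returns as much
// memory as possible to the OS.
func applyMemoryMitigations() {
	debug.SetMemoryLimit(1500 << 20)

	go func() {
		for range time.Tick(time.Minute) {
			debug.FreeOSMemory()
		}
	}()
}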

@mauri870 (Member) commented on Nov 22, 2023

I remember some things changed regarding huge pages on Linux and were backported to 1.21; that may be related.

#61718

It is worth reading the GC guide section on transparent huge pages and GOGC: https://go.dev/doc/gc-guide#Linux_transparent_huge_pages
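If THP settings are suspected, it may also help to record the relevant sysfs knobs alongside the memstats. A minimal sketch (using the standard Linux THP paths, which are assumed to exist on the hardened kernel):

package main

import (
	"fmt"
	"os"
)

// dumpTHPSettings prints the transparent huge page knobs discussed in the GC
// guide; max_ptes_none controls how many not-present PTEs khugepaged may fill
// in when collapsing a range into a huge page.
func dumpTHPSettings() {
	paths := []string{
		"/sys/kernel/mm/transparent_hugepage/enabled",
		"/sys/kernel/mm/transparent_hugepage/defrag",
		"/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none",
	}
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			fmt.Printf("%s: %v\n", p, err)
			continue
		}
		fmt.Printf("%s: %s", p, b)
	}
}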

@vanackere (Contributor) commented on Nov 22, 2023

@mauri870 on our production system:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

If huge pages are indeed involved: is there a way to disable transparent_hugepage for a single process, without root access? (Alternatively, is there a way to patch the Go compiler to prevent transparent_hugepage usage?)
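One possible per-process approach, sketched below under the assumption that the kernel supports prctl(PR_SET_THP_DISABLE) (Linux 3.15+) and that golang.org/x/sys/unix is an acceptable dependency, is to opt the process out of THP at startup; whether this actually helps here is untested:

package main

import (
	"log"

	"golang.org/x/sys/unix"
)

// disableTHPForThisProcess marks the whole process as ineligible for
// transparent huge pages, without requiring root access or sysfs changes.
// It should run early in main, before the heap has grown.
func disableTHPForThisProcess() {
	if err := unix.Prctl(unix.PR_SET_THP_DISABLE, 1, 0, 0, 0); err != nil {
		log.Printf("prctl(PR_SET_THP_DISABLE): %v", err)
	}
}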

Changed the title from "go 1.20.0 -> 1.21.4: excessive memory consumption (possible runtime memory leak?)" to "runtime: excessive memory use between 1.19 -> 1.21" on Nov 22, 2023
@LeGEC (Author) commented on Nov 22, 2023

MemStats of our go1.19 program:

	"Alloc": 710580968,
	"TotalAlloc": 252397361888,
	"Sys": 1459456584,
	"Lookups": 0,
	"Mallocs": 2418596147,
	"Frees": 2414925477,
	"HeapAlloc": 710580968,
	"HeapSys": 1360199680,
	"HeapIdle": 517931008,
	"HeapInuse": 842268672,
	"HeapReleased": 372064256,
	"HeapObjects": 3670670,
	"StackInuse": 15532032,
	"StackSys": 15532032,
	"MSpanInuse": 7786368,
	"MSpanSys": 11976192,
	"MCacheInuse": 4800,
	"MCacheSys": 15600,
	"BuckHashSys": 4959765,
	"GCSys": 64211392,
	"OtherSys": 2561923,
	"NextGC": 992932464,
	"LastGC": 1700665950049685014,
	"PauseTotalNs": 101655886,

RSS of process: 1158688 KB (so ~1.16 GB)

@mknyszek (Contributor) commented on Nov 22, 2023

The huge page related changes can't be the culprit, because huge pages were only forced by the runtime between Go 1.21.0 and Go 1.21.3 (inclusive). Go 1.21.4 no longer adjusts huge page settings at all. I don't see any of those versions in the conversation above.

@mknyszek (Contributor) commented on Nov 22, 2023

How are you measuring RSS?

Also, as a side note, "Sys" is virtual memory footprint. It's not going to match up with RSS. "Sys - HeapReleased" will be much closer.
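For illustration, that estimate can be computed directly from runtime.MemStats (a quick sketch):

package main

import (
	"fmt"
	"runtime"
)

// logRetainedEstimate prints Sys - HeapReleased, which approximates the memory
// the runtime has mapped and not yet returned to the OS, and is therefore a
// closer match for RSS than Sys alone.
func logRetainedEstimate() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	retained := m.Sys - m.HeapReleased
	fmt.Printf("approx retained by runtime: %d MiB\n", retained>>20)
}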

Hmm... is it possible this is related to a regression in a specific standard library package for instance? IIRC we didn't see anything like this back when Go 1.20 was released, so it's a bit surprising. It's still possible it's related to the compiler/runtime, so I don't mean to rule it out.

Is it possible to reproduce this on a smaller scale? The difference seems fairly big and "obvious" so if you can bisect down to the commit between Go 1.19 and Go 1.20 that causes it, that would immediately tell us what happened.

@vanackere (Contributor) commented on Nov 22, 2023

@mknyszek: RSS was measured simply from the output of ps aux. The value mentioned in the initial post may indeed have been slightly off, but not by much: the RSS of our process right before restarting the service with our version compiled with Go 1.19 was 4397688 KB (with a VSZ of 9107484 KB).

The same production process built with Go 1.19.13 currently shows the following values, which are much more consistent with the actual allocations reported by profiling:

[image: process memory values with Go 1.19.13]

(88 remaining activity items not shown)


Metadata

Labels: WaitingForInfo (issue is not actionable because of missing required information, which needs to be provided), compiler/runtime (issues related to the Go compiler and/or runtime)

Participants: @LeGEC, @vanackere, @mknyszek, @gopherbot, @mauri870
