runtime: memory corruption crashes since Go 1.9 #69855

Closed
@tedli

Go version

go version go1.21.11 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.11'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build747085668=/tmp/go-build -gno-record-gcc-switches'

What did you do?

A bare-metal Kubernetes cluster has been running for over half a year. I joined some new nodes to this cluster.
The Docker daemon on one of the newly joined nodes hangs.
Restarting the Docker daemon brings it back, but after a few hours it hangs again.
I built the same version of the Docker daemon with the same Go toolchain as the official binary release (Go 1.21), with debugging turned on, and replaced dockerd with the debug build.
I repeated the restart/hang cycle many times. The call stack dumped at hang time varies, and occasionally the daemon panics (in the runtime, not the application).
Building with the latest Go release (1.23.1) did not help, nor did the latest Docker release.
With GODEBUG=gctrace=1, the GC also stops logging when the daemon hangs.

Searching for the panic messages and call stacks turned up some related issues, but almost all of them were closed due to age.

#15658
This issue provides reproduction code, and that code reproduces the crash reliably on this node (Go 1.21, cgo off). With GOMAXPROCS=1, the reproduction code no longer crashes.
I then built the reproduction code with Go 1.9.3, which a comment in that issue says includes a fix. The 1.9.3-built binary no longer crashes (without GOMAXPROCS=1).
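
The reproduction code itself is not quoted in this report. For readers without the linked issue at hand, here is a minimal sketch of a stress test of that general kind. The workload (several goroutines repeatedly spawning a short-lived child process) is an assumption about its shape, not the original code, and `spawnLoop` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"sync"
)

// spawnLoop starts `workers` goroutines. Each one runs the command
// `name` perWorker times and waits for it to exit. It returns how
// many runs succeeded and how many returned an error.
func spawnLoop(workers, perWorker int, name string) (ok, failed int) {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				err := exec.Command(name).Run()
				mu.Lock()
				if err != nil {
					failed++
				} else {
					ok++
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return ok, failed
}

func main() {
	// GOMAXPROCS(0) only reports the current setting; it changes nothing.
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
	// Assumed workload: 6 workers, 100 runs each of the `true` command.
	ok, failed := spawnLoop(6, 100, "true")
	fmt.Printf("ok=%d failed=%d\n", ok, failed)
}
```

Running the same binary once as-is and once with the `GOMAXPROCS=1` environment variable set gives the two configurations compared above.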

#20427

// func nanotime1() int64
TEXT runtime·nanotime1(SB),NOSPLIT,$16-8
// We don't know how much stack space the VDSO code will need,
// so switch to g0.
// In particular, a kernel configured with CONFIG_OPTIMIZE_INLINING=n
// and hardening can use a full page of stack space in gettime_sym
// due to stack probes inserted to avoid stack/heap collisions.
// See issue #20427.

torvalds/linux@889b3c1#diff-c1a25be6ec9efccf08bb1dd54dd545b0ce4a12f6fc1aba602a78bff5a016a8a4L141

Linux removed the CONFIG_OPTIMIZE_INLINING option since 5.4. I followed this manual to rebuild the kernel with the inline macro hardcoded to always_inline (the equivalent of CONFIG_OPTIMIZE_INLINING=n) and booted the always_inline kernel, but had no luck. The reproduction code did seem to live longer, though: without always_inline it crashes within 10 seconds, while with always_inline it can run for up to one minute. The 1.9.3-built binary still never crashes.
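
The runtime comment quoted above concerns the vDSO time call. On linux/amd64, `time.Now` is normally served through that path, so a loop like the following can exercise it in isolation. This is only a sketch under that assumption: `hammerClock` is a hypothetical helper, and a clean run does not rule out the stack problem described in #20427.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// hammerClock calls time.Now() iters times in each of `workers`
// goroutines and counts how often a reading is earlier than the
// previous reading taken in the same goroutine. Sub compares the
// monotonic clock readings, so any nonzero count is suspicious.
func hammerClock(workers, iters int) int64 {
	var backwards int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			prev := time.Now()
			for i := 0; i < iters; i++ {
				now := time.Now()
				if now.Sub(prev) < 0 {
					atomic.AddInt64(&backwards, 1)
				}
				prev = now
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&backwards)
}

func main() {
	// Hypothetical load: 8 goroutines, one million readings each.
	fmt.Println("backwards readings:", hammerClock(8, 1000000))
}
```

If this loop alone crashes or reports backwards readings on the affected node but not on the others, that would point at the clock path rather than at process spawning.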

The affected node has the same specs as the others and was set up with the same Ansible script. A full memtest86+ run passed.
The other nodes work as expected, with no changes to any binary.

One more thing: these nodes are in a poorly maintained data center, and thermal issues and dust have caused trouble on other nodes before. But this does not look like a hardware issue, since it only breaks Go programs: I can still log in over ssh and operate the machine, and all other system components work as expected.

What did you see happen?

The reproduction code from #15658 reliably reproduces the crash on my machine.

What did you expect to see?

The reproduction code should no longer crash, since the problem was fixed as of 1.9.3.

Labels: NeedsInvestigation, compiler/runtime