Skip to content

fatal error: bulkBarrierPreWrite: unaligned arguments (go 1.19.1, looks like regression) #55122

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
akovalenko opened this issue Sep 17, 2022 · 9 comments

Comments

@akovalenko
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.19.1 linux/amd64

Does this issue reproduce with the latest release?

It appears with the latest release (go 1.19.1 at the time of writing), while everything is fine when compiling with go 1.18.6.

The problem is, I can't reproduce it reliably, as I have no testing environment with the same amount of real-life load.
However, while everything is the same, on 1.19.1 I get (1) first crash after 5 hours, (2) second crash after 6 hours, and on 1.18.6 I have 35+ hours without a panic.

More info might eventually appear at lightningnetwork/lnd#6923 if LND developers engage in testing on 1.19

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/akovalenko/.cache/go-build"
GOENV="/home/akovalenko/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/akovalenko/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/akovalenko/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.19.1"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1107032562=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I've built the LND project (see lightningnetwork/lnd#6923 for details) and used it under a significant load for several hours.

What did you expect to see?

No panic

What did you see instead?

Two panics in a row, after several hours each, triggered by the same line of seemingly innocent assignment

Sep 15 17:50:53 murphy lnd[2756008]: fatal error: bulkBarrierPreWrite: unaligned arguments
Sep 15 17:50:53 murphy lnd[2756008]: goroutine 10179 [running]:
Sep 15 17:50:53 murphy lnd[2756008]: runtime.throw({0x1c20812?, 0x9ba5f0?})
Sep 15 17:50:53 murphy lnd[2756008]:         runtime/panic.go:1047 +0x5d fp=0xc0024c2930 sp=0xc0024c2900 pc=0x43807d
Sep 15 17:50:53 murphy lnd[2756008]: runtime.bulkBarrierPreWrite(0x0?, 0x40db47?, 0x20?)
Sep 15 17:50:53 murphy lnd[2756008]:         runtime/mbitmap.go:582 +0x428 fp=0xc0024c29a0 sp=0xc0024c2930 pc=0x415068
Sep 15 17:50:53 murphy lnd[2756008]: runtime.typedmemmove(0x1867fc0, 0xc0083db3f1, 0xc0024c2a84)
Sep 15 17:50:53 murphy lnd[2756008]:         runtime/mbarrier.go:162 +0x45 fp=0xc0024c29d8 sp=0xc0024c29a0 pc=0x414245
Sep 15 17:50:53 murphy lnd[2756008]: github.com/lightningnetwork/lnd/lnwire.ReadElement({0x1f7b820, 0xc01151c7b0}, {0x18b19a0?, 0xc0083db3f1?})
Sep 15 17:50:53 murphy lnd[2756008]:         github.com/lightningnetwork/lnd/lnwire/lnwire.go:477 +0x173c fp=0xc0024c2cc8 sp=0xc0024c29d8 pc=0x9b8b5c
Sep 15 17:50:53 murphy lnd[2756008]: github.com/lightningnetwork/lnd/lnwire.ReadElements(...)
Sep 15 17:50:53 murphy lnd[2756008]:         github.com/lightningnetwork/lnd/lnwire/lnwire.go:879
Sep 15 17:50:53 murphy lnd[2756008]: github.com/lightningnetwork/lnd/lnwire.(*NodeAnnouncement).Decode(0x1f7b820?, {0x1f7b820, 0xc01151c7b0}, 0x2?)
Sep 15 17:50:53 murphy lnd[2756008]:         github.com/lightningnetwork/lnd/lnwire/node_announcement.go:112 +0x150 fp=0xc0024c2d88 sp=0xc0024c2cc8 pc=0x9ba8b0
Sep 15 17:50:53 murphy lnd[2756008]: github.com/lightningnetwork/lnd/lnwire.ReadMessage({0x1f7b820, 0xc01151c7b0}, 0x1c8?)
Sep 15 17:50:53 murphy lnd[2756008]:         github.com/lightningnetwork/lnd/lnwire/message.go:314 +0x91 fp=0xc0024c2de8 sp=0xc0024c2d88 pc=0x9ba1b1
Sep 15 17:50:53 murphy lnd[2756008]: github.com/lightningnetwork/lnd/peer.(*Brontide).readNextMessage.func1(0xc003d3c000)
Sep 15 17:50:53 murphy lnd[2756008]:         github.com/lightningnetwork/lnd/peer/brontide.go:1114 +0x146 fp=0xc0024c2e60 sp=0xc0024c2de8 pc=0x164e026
Sep 15 17:50:53 murphy lnd[2756008]: github.com/lightningnetwork/lnd/pool.(*Read).Submit.func1({0x1f87e28?, 0xc002680ca0?})
Sep 15 17:50:53 murphy lnd[2756008]:         github.com/lightningnetwork/lnd/pool/read.go:50 +0x37 fp=0xc0024c2e88 sp=0xc0024c2e60 pc=0x1648557
Sep 15 17:50:53 murphy lnd[2756008]: github.com/lightningnetwork/lnd/pool.(*Worker).spawnWorker(0xc0088eae60, 0xc002680c90)
Sep 15 17:50:53 murphy lnd[2756008]:         github.com/lightningnetwork/lnd/pool/worker.go:232 +0x316 fp=0xc0024c2fc0 sp=0xc0024c2e88 pc=0x1649556
Sep 15 17:50:53 murphy lnd[2756008]: github.com/lightningnetwork/lnd/pool.(*Worker).requestHandler.func2()
Sep 15 17:50:53 murphy lnd[2756008]:         github.com/lightningnetwork/lnd/pool/worker.go:161 +0x2a fp=0xc0024c2fe0 sp=0xc0024c2fc0 pc=0x16491aa
Sep 15 17:50:53 murphy lnd[2756008]: runtime.goexit()
Sep 15 17:50:53 murphy lnd[2756008]:         runtime/asm_amd64.s:1594 +0x1 fp=0xc0024c2fe8 sp=0xc0024c2fe0 pc=0x46c241

@randall77
Copy link
Contributor

This is strange. Copying a NodeAlias, which is just a named [32]byte, shouldn't be calling typedmemmove, it should compile to a pair of 16-byte moves. Even if it called typedmemmove, that function should not call bulkBarrierPreWrite for a pointerless data type like NodeAlias is.
Are you using any special command-line flags when building?

@akovalenko
Copy link
Author

akovalenko commented Sep 17, 2022

@randall77 surely no special command-line flags from me, seems also nothing special from the build system (just trimpath and other harmless stuff).

While comparing -gcflags -S output, I see:

go 1.18.6:

        0x0b19 02841 (lnwire/lnwire.go:477)     MOVUPS  "".alias+172(SP), X0
        0x0b21 02849 (lnwire/lnwire.go:477)     MOVQ    ""..autotmp_1115+384(SP), CX
        0x0b29 02857 (lnwire/lnwire.go:477)     MOVUPS  X0, (CX)
        0x0b2c 02860 (lnwire/lnwire.go:477)     MOVUPS  "".alias+188(SP), X0
        0x0b34 02868 (lnwire/lnwire.go:477)     MOVUPS  X0, 16(CX)

go1.19.1:

        0x1701 05889 (lnwire/lnwire.go:477)     TESTB   AL, (BX)
        0x1703 05891 (lnwire/lnwire.go:477)     PCDATA  $0, $-2
        0x1703 05891 (lnwire/lnwire.go:477)     CMPL    runtime.writeBarrier(SB), $0
        0x170a 05898 (lnwire/lnwire.go:477)     JNE     5928
        0x170c 05900 (lnwire/lnwire.go:477)     MOVUPS  github.com/lightningnetwork/lnd/lnwire.alias+172(SP), X0
        0x1714 05908 (lnwire/lnwire.go:477)     MOVUPS  X0, (BX)
        0x1717 05911 (lnwire/lnwire.go:477)     MOVUPS  github.com/lightningnetwork/lnd/lnwire.alias+188(SP), X0
        0x171f 05919 (lnwire/lnwire.go:477)     MOVUPS  X0, 16(BX)
        0x1723 05923 (lnwire/lnwire.go:477)     JMP     6062
        0x1728 05928 (lnwire/lnwire.go:477)     LEAQ    type.github.com/lightningnetwork/lnd/lnwire.OpaqueReason(SB), AX
        0x172f 05935 (lnwire/lnwire.go:477)     LEAQ    github.com/lightningnetwork/lnd/lnwire.alias+172(SP), CX
        0x1737 05943 (lnwire/lnwire.go:477)     CALL    runtime.typedmemmove(SB)
        0x173c 05948 (lnwire/lnwire.go:477)     JMP     6062

(it's my first time with go's assembly output, sorry, I probably should have made it more informative but I don't yet know how).

These two pieces are directly from "go build ." under lnwire/, so nothing changes but the Go version, and no flags from makefiles and build scripts are used. Looks like much more interesting things happen in 1.19.1 for the same assignment, and somehow there is a branch with typedmemmove.

@randall77
Copy link
Contributor

That does look like a compiler bug. It seems it is trying to assign to a NodeAlias, which is a [32]byte underneath, from an OpaqueReason, which is a []byte underneath. But only on the write barrier path, maybe?
My immediate suspicion is something going wrong with the type switch. I will continue investigating.

@randall77
Copy link
Contributor

Reproducer:

package main

func main() {
	for i := 0; i < 10000; i++ {
		h(i)
		sink = make([]byte, 1024) // generate some garbage
	}
}

func h(iter int) {
	var x [32]byte
	for i := 0; i < 32; i++ {
		x[i] = 99
	}
	g(&x)
	if x == ([32]byte{}) {
		return
	}
	for i := 0; i < 32; i++ {
		println(x[i])
	}
	panic(iter)
}

//go:noinline
func g(x interface{}) {
	switch e := x.(type) {
	case *[32]byte:
		var c [32]byte
		*e = c
	case *[]byte:
		*e = nil
	}
}

var sink []byte

Fails on 1.19.1 and tip, doesn't fail for 1.19 and earlier.

@gopherbot please open a backport issue for 1.19.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #55124 (for 1.19).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@randall77
Copy link
Contributor

Probably caused by https://go-review.googlesource.com/c/go/+/425076 . Or more correctly, that CL tickled an existing bug.

The problem is that the memmove -> OpMove rewrite uses the type of the first argument to decide what type it is that we're moving. That type is somewhat fragile, and gets broken during type switches on argument values.

We have 2 different IData ops for the same argument, each with different types (*[]byte and *[32]byte). After expand_calls, both of these IData ops are just copies of the input register, and that happens to get the type *[]byte, so the memmove thinks it is operating on a slice, not a [32]byte.

SSA passes shouldn't really depend on the base types of pointer types to make decisions like this. At least, as long as we make pointer casts just OpCopy, and rewrite always does (OpCopy x) -> x. I think the memmove call needs to carry along the type explicitly somewhere.

The CL referenced above was backported to both 1.19 and 1.18. Not sure why 1.18 isn't failing, but it may need the backport as well.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/431496 mentions this issue: cmd/compile: avoid using destination pointer base type in memmove optimization

@randall77
Copy link
Contributor

This bug does happen on 1.18.6, I have a test case in the CL above.

@gopherbot please open a backport issue for 1.18.

@cuonglm
Copy link
Member

cuonglm commented Sep 18, 2022

This bug does happen on 1.18.6, I have a test case in the CL above.

@gopherbot please open a backport issue for 1.18.

You have to create the backport manually, gopherbot only listen to the first backport comment.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants