Multiple nodes panic in grpc at the same time #58327

Closed
skaco opened this issue Dec 29, 2020 · 3 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Comments


skaco commented Dec 29, 2020

Describe the problem

We recently created a cluster of more than 90 nodes that handles very little read/write traffic. Machines occasionally go down because of network or disk problems. Whenever one machine goes down, multiple other nodes panic at the same time.
We tried building with Go 1.11.13 instead of Go 1.10.6 (following the guidance in #42913) and updating the gRPC dependencies (#32961), but nothing changed.

Expected behavior
Other nodes should not panic when a single machine goes down.

Additional data / screenshots
Here are several typical logs.

W201229 06:12:49.088191 24427327 vendor/google.golang.org/grpc/server.go:603 grpc: Server.Serve failed to complete security handshake from "10.233.57.49:51465": EOF
W201229 06:12:50.104517 787955 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W201229 06:12:50.104666 787955 vendor/google.golang.org/grpc/clientconn.go:1440 grpc: addrConn.transportMonitor exits due to: context canceled
W201229 06:12:54.946972 24428078 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: authentication handshake failed: read tcp 10.171.98.19:45762->10.171.92.18:25055: read: connection reset by peer". Reconnecting...
W201229 06:12:54.947130 24428078 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
I201229 06:12:54.947157 24427881 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n18] circuitbreaker: rpc 0.0.0.0:25055 [n59] tripped: failed to connect to n59 at szth-ecom-nova0491.szth.baidu.com:25055: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: authentication handshake failed: read tcp 10.171.98.19:45762->10.171.92.18:25055: read: connection reset by peer"
I201229 06:12:54.947174 24427881 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n18] circuitbreaker: rpc 0.0.0.0:25055 [n59] event: BreakerTripped
I201229 06:12:54.947220 24427622 rpc/nodedialer/nodedialer.go:189 [n18,ts-poll] unable to connect to n59: failed to connect to n59 at szth-ecom-nova0491.szth.baidu.com:25055: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: authentication handshake failed: read tcp 10.171.98.19:45762->10.171.92.18:25055: read: connection reset by peer"
I201229 06:12:54.947251 24428104 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n18] circuitbreaker: rpc 0.0.0.0:25055 [n59] tripped: failed to connect to n59 at szth-ecom-nova0491.szth.baidu.com:25055: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: authentication handshake failed: read tcp 10.171.98.19:45762->10.171.92.18:25055: read: connection reset by peer"
I201229 06:12:54.947271 24428104 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n18] circuitbreaker: rpc 0.0.0.0:25055 [n59] event: BreakerTripped
I201229 06:12:54.947283 24428104 rpc/nodedialer/nodedialer.go:189 [ct-client] unable to connect to n59: failed to connect to n59 at szth-ecom-nova0491.szth.baidu.com:25055: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: authentication handshake failed: read tcp 10.171.98.19:45762->10.171.92.18:25055: read: connection reset by peer"
W201229 06:12:54.947387 24428078 vendor/google.golang.org/grpc/clientconn.go:953 Failed to dial szth-ecom-nova0491.szth.baidu.com:25055: context canceled; please retry.
W201229 06:12:55.261509 24429160 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.171.92.18:25055: connect: connection refused". Reconnecting...
W201229 06:12:56.261042 24429160 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W201229 06:12:56.261101 24429160 vendor/google.golang.org/grpc/clientconn.go:953 Failed to dial szth-ecom-nova0491.szth.baidu.com:25055: context canceled; please retry.
W201229 06:12:56.461476 24429452 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.171.92.18:25055: connect: connection refused". Reconnecting...
W201229 06:12:56.739142 883545 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0855.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W201229 06:12:56.739187 883545 vendor/google.golang.org/grpc/clientconn.go:1440 grpc: addrConn.transportMonitor exits due to: context canceled
W201229 06:12:56.861655 24429460 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0855.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.171.92.158:25055: connect: connection refused". Reconnecting...
I201229 06:12:56.861810 24429437 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n18] circuitbreaker: rpc 0.0.0.0:25055 [n23] tripped: failed to connect to n23 at szth-ecom-nova0855.szth.baidu.com:25055: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.171.92.158:25055: connect: connection refused"
I201229 06:12:56.861829 24429437 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n18] circuitbreaker: rpc 0.0.0.0:25055 [n23] event: BreakerTripped
I201229 06:12:56.861853 24429437 rpc/nodedialer/nodedialer.go:189 [ct-client] unable to connect to n23: failed to connect to n23 at szth-ecom-nova0855.szth.baidu.com:25055: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.171.92.158:25055: connect: connection refused"
W201229 06:12:57.461069 24429452 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W201229 06:12:57.461140 24429452 vendor/google.golang.org/grpc/clientconn.go:953 Failed to dial szth-ecom-nova0491.szth.baidu.com:25055: context canceled; please retry.
W201229 06:12:57.661422 24422601 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.171.92.18:25055: connect: connection refused". Reconnecting...
W201229 06:12:57.861236 24429460 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0855.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W201229 06:12:57.861335 24429460 vendor/google.golang.org/grpc/clientconn.go:953 Failed to dial szth-ecom-nova0855.szth.baidu.com:25055: context canceled; please retry.
unexpected fault address 0xa28e0f70
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0xa28e0f70 pc=0xa28e0f70]
goroutine 256470 [running]:
runtime.throw(0x3f969fa, 0x5)
/home/work/.jumbo/opt/go/src/runtime/panic.go:608 +0x72 fp=0xc00c0c4d40 sp=0xc00c0c4d10 pc=0xd772c2
runtime: unexpected return pc for runtime.sigpanic called from 0xa28e0f70
stack: frame={sp:0xc00c0c4d40, fp:0xc00c0c4d90} stack=[0xc00c0c4000,0xc00c0c5000)
000000c00c0c4c40: 0000000000da20b2 <runtime.writeErr+66> 0000000000000002
000000c00c0c4c50: 000000c00c0c4c88 0000000000d77c8e <runtime.recordForPanic+302>
000000c00c0c4c60: 0000000000da20b2 <runtime.writeErr+66> 0000000000000002
000000c00c0c4c70: 0000000003f9369a 0000000000000001
000000c00c0c4c80: 0000000000000001 000000c00c0c4cc0
000000c00c0c4c90: 0000000000d77eb8 <runtime.gwrite+280> 0000000003f9369a
000000c00c0c4ca0: 0000000000000001 0000000000000001
000000c00c0c4cb0: 000000c00c0c4d26 000000000000000a
000000c00c0c4cc0: 000000c00c0c4d10 0000000000d78648 <runtime.printstring+120>
000000c00c0c4cd0: 0000000000d77487 <runtime.fatalthrow+87> 000000c00c0c4ce0
000000c00c0c4ce0: 0000000000da3e50 <runtime.fatalthrow.func1+0> 000000c007719680
000000c00c0c4cf0: 0000000000d772c2 <runtime.throw+114> 000000c00c0c4d10
000000c00c0c4d00: 000000c00c0c4d30 0000000000d772c2 <runtime.throw+114>
000000c00c0c4d10: 000000c00c0c4d18 0000000000da3dd0 <runtime.throw.func1+0>
000000c00c0c4d20: 0000000003f969fa 0000000000000005
000000c00c0c4d30: 000000c00c0c4d80 0000000000d8cfc5 <runtime.sigpanic+629>
000000c00c0c4d40: <0000000003f969fa 0000000000000005
000000c00c0c4d50: 0000000000000000 0000000004ae3720
000000c00c0c4d60: 00000000a28e0f70 000000c007719680
000000c00c0c4d70: 0000000000000000 0000000000000000
000000c00c0c4d80: 0000000000000202 !00000000a28e0f70
000000c00c0c4d90: >000000000000000a 000000000000000a
000000c00c0c4da0: 0000000000000000 0000000000000391
000000c00c0c4db0: 000000c015242000 000000c00c0c5a60
000000c00c0c4dc0: 0000000000000000 0000000004ae3720
000000c00c0c4dd0: 000000c00010a040 ffffffffffffffff
000000c00c0c4de0: 000000c00c0c5a10 0000000000ddadf0 <syscall.Syscall+48>
000000c00c0c4df0: 0000000000000202 0000000000000033
000000c00c0c4e00: 0000000000000000 0000000000000000
000000c00c0c4e10: 0000000000000000 0000000000000000
000000c00c0c4e20: 000000c00c0c4f00 000000c010e1d980
000000c00c0c4e30: 000000000000003a 000000c00c0c4f10
000000c00c0c4e40: 00000000012536b3 <github.com/cockroachdb/cockroach/vendor/golang.org/x/net/http2.(*Framer).ReadFrame+163> 000000c00d366498
000000c00c0c4e50: 0000000000000009 0000000000000009
000000c00c0c4e60: 0000000004ae3260 0000000000000000
000000c00c0c4e70: 0000000000000000 0000000000000000
000000c00c0c4e80: 000000000000003a 000000000000003a
runtime.sigpanic()
/home/work/.jumbo/opt/go/src/runtime/signal_unix.go:397 +0x275 fp=0xc00c0c4d90 sp=0xc00c0c4d40 pc=0xd8cfc5
created by github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/transport.newHTTP2Client
/home/work/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/transport/http2_client.go:270 +0xad7

The following log is from another node that panicked.

W201229 06:12:51.273800 878879 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W201229 06:12:51.273827 878879 vendor/google.golang.org/grpc/clientconn.go:1440 grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
W201229 06:12:54.947167 20504135 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: authentication handshake failed: read tcp 10.171.46.150:42703->10.171.92.18:25055: read: connection reset by peer". Reconnecting...
W201229 06:12:54.947338 20504135 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W201229 06:12:54.947400 20504135 vendor/google.golang.org/grpc/clientconn.go:953 Failed to dial szth-ecom-nova0491.szth.baidu.com:25055: context canceled; please retry.
W201229 06:12:55.030596 20507419 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.171.92.18:25055: connect: connection refused". Reconnecting...
W201229 06:12:56.030213 20507419 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W201229 06:12:56.030276 20507419 vendor/google.golang.org/grpc/clientconn.go:953 Failed to dial szth-ecom-nova0491.szth.baidu.com:25055: context canceled; please retry.
W201229 06:12:56.230770 20498246 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {szth-ecom-nova0491.szth.baidu.com:25055 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.171.92.18:25055: connect: connection refused". Reconnecting...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x5 pc=0x1252e80]
goroutine 930119 [running]:
runtime: unexpected return pc for github.com/cockroachdb/cockroach/vendor/golang.org/x/net/http2.readFrameHeader called from 0x1
stack: frame={sp:0xc0181ded68, fp:0xc0181dedc8} stack=[0xc0181de000,0xc0181df000)
000000c0181dec68: 000000c0181decc0 0000000004b36f78
000000c0181dec78: 000000c0093ef250 0000001800000018
000000c0181dec88: 00000000012b98b7 <github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).serveStreams+71> 0000000000000009
000000c0181dec98: 000000c0181def30 000000c0076d7500
000000c0181deca8: 0000000000db1558 <io.ReadAtLeast+136> 000000c0093ef220
000000c0181decb8: 000000c0076d7528 0000000000000000
000000c0181decc8: 0000000003b28b20 000000c0073c2e40
000000c0181decd8: 0000000000000000 0000000000000000
000000c0181dece8: 000000c0181ded08 0000000000d75f4e <runtime.panicmem+94>
000000c0181decf8: 0000000003c1e200 00000000068bbc20
000000c0181ded08: 000000c0181ded58 0000000000d8ced2 <runtime.sigpanic+386>
000000c0181ded18: 000000c016e276e0 000000c00567fed8
000000c0181ded28: 0000000000000009 0000000000000009
000000c0181ded38: 0000000000000009 000000c0076d7500
000000c0181ded48: 0000000000000000 0000000000000000
000000c0181ded58: 000000c0181dedb8 0000000001252e80 <github.com/cockroachdb/cockroach/vendor/golang.org/x/net/http2.readFrameHeader+176>
000000c0181ded68: <0000000004ae3260 000000c016e276e0
000000c0181ded78: 000000c00567fed8 0000000000000009
000000c0181ded88: 0000000000000009 0000000000000009
000000c0181ded98: 0000000000000000 0000000000000000
000000c0181deda8: 0000000000000005 000000c0181dfb08
000000c0181dedb8: 00007f0fb765aad0 !0000000000000001
000000c0181dedc8: >0000000000000000 000000c010c3e000
000000c0181dedd8: 0000000000000000 0000000000008000
000000c0181dede8: 0000000000000000 0000000000000000
000000c0181dedf8: 0000000000000000 0000000000000212
000000c0181dee08: 00000000a726b35e 000000000000000a
000000c0181dee18: 0000000004ac5493 0000000000000000
000000c0181dee28: 0000000000000bfb 000000c00ac58800
000000c0181dee38: 000000c0181dfb08 0000000000000000
000000c0181dee48: 000000000000001f ffffffffffffffe0
000000c0181dee58: ffffffffffffffff 000000c0181dfab8
000000c0181dee68: 0000000000ddadf0 <syscall.Syscall+48> 0000000000000212
000000c0181dee78: 0000000000000033 0000000000000000
000000c0181dee88: 0000000000000000 0000000000000000
000000c0181dee98: 0000000000000000 000000c0181def80
000000c0181deea8: 000000c01c5f3e30 000000c01c5f3e30
000000c0181deeb8: 0000000000000000 0000000000000000
encoding/binary.bigEndian.Uint32(...)
/home/work/.jumbo/opt/go/src/encoding/binary/binary.go:112
github.com/cockroachdb/cockroach/vendor/golang.org/x/net/http2.readFrameHeader(0x0, 0xc010c3e000, 0x0, 0x8000, 0x0, 0x0, 0x0, 0x212, 0xa726b35e)
/home/work/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/net/http2/frame.go:245 +0xb0
created by github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).handleRawConn
/home/work/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:638 +0x5f8
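
When comparing crashes across many nodes, it can help to pull the `addr` and `pc` fields out of each `[signal ...]` header. The two traces above differ: in the first, `addr` equals `pc`, which usually suggests the goroutine jumped to an invalid address, while the second faults on a small data address (`0x5`). The following is a minimal sketch in Go, not part of CockroachDB; the `Fault` type and the `ParseFault` and `JumpedToBadAddress` helpers are hypothetical names introduced only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// signalLine matches the header the Go runtime prints for a fatal signal, e.g.
// [signal SIGSEGV: segmentation violation code=0x1 addr=0xa28e0f70 pc=0xa28e0f70]
var signalLine = regexp.MustCompile(
	`\[signal (\w+): .* code=(0x[0-9a-f]+) addr=(0x[0-9a-f]+) pc=(0x[0-9a-f]+)\]`)

// Fault holds the fields of one signal header (hypothetical helper type).
type Fault struct {
	Signal string
	Addr   uint64 // faulting memory address
	PC     uint64 // program counter at the time of the fault
}

// ParseFault extracts the signal name, addr and pc from a single log line.
// It returns false if the line is not a signal header.
func ParseFault(line string) (Fault, bool) {
	m := signalLine.FindStringSubmatch(line)
	if m == nil {
		return Fault{}, false
	}
	// Base 0 lets ParseUint accept the "0x" prefix.
	addr, err := strconv.ParseUint(m[3], 0, 64)
	if err != nil {
		return Fault{}, false
	}
	pc, err := strconv.ParseUint(m[4], 0, 64)
	if err != nil {
		return Fault{}, false
	}
	return Fault{Signal: m[1], Addr: addr, PC: pc}, true
}

// JumpedToBadAddress reports whether the fault address is the program counter
// itself, i.e. the fault happened while fetching the next instruction.
func (f Fault) JumpedToBadAddress() bool {
	return f.Addr == f.PC
}

func main() {
	// The two signal headers quoted in this issue.
	lines := []string{
		"[signal SIGSEGV: segmentation violation code=0x1 addr=0xa28e0f70 pc=0xa28e0f70]",
		"[signal SIGSEGV: segmentation violation code=0x1 addr=0x5 pc=0x1252e80]",
	}
	for _, l := range lines {
		if f, ok := ParseFault(l); ok {
			fmt.Printf("%s addr=%#x pc=%#x addr==pc: %t\n",
				f.Signal, f.Addr, f.PC, f.JumpedToBadAddress())
		}
	}
}
```

Running this over the logs of every panicked node would show whether the crashes share a faulting address or program counter, which is one way to tell a single repeated bug from unrelated memory corruption.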

Environment:

  • CockroachDB version: 2.1.9
  • Server OS: Linux
  • Go version: Go1.10.6/Go1.11.13

blathers-crl bot commented Dec 29, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Dec 29, 2020
@RaduBerinde
Member

CC @tbg

@tbg
Member

tbg commented Jan 11, 2021

I'm not sure what this is, but it looks a little similar to golang/go#36287 (comment).
I have vague memories of us fixing a bug related to signal handling a long time ago. Either way, we no longer support CockroachDB v2.1.

@tbg tbg closed this as completed Jan 11, 2021