rpc: v2.0.7 binary built with go 1.11 gets stuck after node crash #42913

bdarnell · 2019-12-03T16:36:18Z

(A customer encountered this issue in the wild; I'm filing this after the fact for future reference and as evidence of why we should be careful about building old releases with new Go versions.)

The customer reported that after one node in their cluster crashed with a clock offset error (due to a VM migration), the restarted node was unable to successfully rejoin the cluster unless all the other nodes were restarted too.

The customer found #27731, which appeared to have similar symptoms, but we were able to rule that out because A) that issue was fixed in 2.0.6 (the cluster was running 2.0.7) and B) the logs indicated that gossip was working.

Due to the clock offset issue, the node crashed in a loop until clock sync was restored. For the duration of this crash loop, the other nodes showed many "connection refused" errors, as expected. One of them, however, showed a single "read: connection reset by peer" error, after which the "connection refused" errors stopped:

2019-12-02 20:29:11.749467 +0000 UTC W191202 20:29:11.749184 1548 vendor/google.golang.org/grpc/clientconn.go:1158  grpc: addrConn.createTransport failed to connect to {172.16.12.68:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.16.12.68:26257: connect: connection refused". Reconnecting...
2019-12-02 20:29:12.748829 +0000 UTC W191202 20:29:12.748536 1548 vendor/google.golang.org/grpc/clientconn.go:1158  grpc: addrConn.createTransport failed to connect to {172.16.12.68:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
2019-12-02 20:29:12.74902 +0000 UTC W191202 20:29:12.748572 1548 vendor/google.golang.org/grpc/clientconn.go:830  Failed to dial 172.16.12.68:26257: context canceled; please retry.
2019-12-02 20:29:13.162202 +0000 UTC W191202 20:29:13.161927 1587 vendor/google.golang.org/grpc/clientconn.go:830  Failed to dial 172.16.12.68:26257: connection error: desc = "transport: authentication handshake failed: read tcp 172.16.19.72:59066->172.16.12.68:26257: read: connection reset by peer"; please retry.

This reminded me of grpc/grpc-go#1026, in which an uncommon error during the authentication handshake was incorrectly considered "permanent" and led grpc to stop retrying (although the older issue caused it to spam the logs with the same error instead of silently timing out as we were seeing now).

ECONNRESET was considered a "temporary" error in Go 1.10, but not in Go 1.11 (a change that was not mentioned in the release notes!): see golang/go#24808. Our official 2.0.x binaries are built with Go 1.10, but this customer built their own binaries from source with Go 1.11.

CockroachDB 2.0 used gRPC 1.9, which had subtle logic to distinguish "temporary" and "permanent" errors. Built with Go 1.11, that logic would see the ECONNRESET during the authentication handshake as "permanent" and stop retrying, which matches what the logs show. Around the time of Go 1.11's release, gRPC 1.11 was also released with changes to get rid of this error classification and just retry everything (this was also not mentioned in the release notes, but the change was grpc/grpc-go#1856). This change was necessary for compatibility with Go 1.11's temporary error change.

Conclusion: CockroachDB 2.0 should only be built with Go 1.10. We should probably be more prescriptive about using only qualified Go versions for other releases.
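One way to be more prescriptive would be a check against the toolchain version at startup or in the build. The following is only a sketch of that idea, not something CockroachDB is known to ship: `checkGoVersion` is a hypothetical helper, and the "go1.10" value is taken from the conclusion above. `runtime.Version()` is the standard library call that reports the toolchain version string.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// checkGoVersion is a hypothetical helper. It accepts the qualified
// release series itself (e.g. "go1.10") or any patch release of it
// (e.g. "go1.10.8"), and rejects everything else.
func checkGoVersion(actual, qualified string) error {
	// Requiring a "." after the prefix keeps "go1.1" from matching "go1.10".
	if actual == qualified || strings.HasPrefix(actual, qualified+".") {
		return nil
	}
	return fmt.Errorf("built with %s, but only %s.x is qualified for this release", actual, qualified)
}

func main() {
	// runtime.Version reports the toolchain that built this binary.
	if err := checkGoVersion(runtime.Version(), "go1.10"); err != nil {
		fmt.Println("warning:", err)
	}
}
```

Whether such a check should warn or refuse to start is a policy choice; the sketch only warns.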

bdarnell closed this as completed Dec 3, 2019

bdarnell mentioned this issue Dec 3, 2019

Be more prescriptive about Go versions in build-from-source cockroachdb/docs#6097

Closed

skaco mentioned this issue Dec 29, 2020

Multiple nodes panic in grpc at the same time #58327

Closed