
rpc: v2.0.7 binary built with go 1.11 gets stuck after node crash #42913


Closed
bdarnell opened this issue Dec 3, 2019 · 0 comments


bdarnell commented Dec 3, 2019

(A customer encountered this issue in the wild; I'm filing this after the fact for future reference and as evidence of why we should be careful about building old releases with new go versions.)

The customer reported that after one node in their cluster crashed with a clock offset error (due to a VM migration), the restarted node was unable to successfully rejoin the cluster unless all the other nodes were restarted too.

The customer found #27731 which appeared to have similar symptoms, but we were able to rule that out because A) that issue was fixed in 2.0.6 (the cluster was running 2.0.7) and B) logs indicated that gossip was working.

Due to the clock offset issue, the node crashed in a loop until clock sync was restored. For the duration of this crash loop, the other nodes showed many "connection refused" errors, as expected. One of them, however, showed a single "read: connection reset by peer" error, after which the "connection refused" errors stopped:

2019-12-02 20:29:11.749467 +0000 UTC W191202 20:29:11.749184 1548 vendor/google.golang.org/grpc/clientconn.go:1158  grpc: addrConn.createTransport failed to connect to {172.16.12.68:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.16.12.68:26257: connect: connection refused". Reconnecting...
2019-12-02 20:29:12.748829 +0000 UTC W191202 20:29:12.748536 1548 vendor/google.golang.org/grpc/clientconn.go:1158  grpc: addrConn.createTransport failed to connect to {172.16.12.68:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
2019-12-02 20:29:12.74902 +0000 UTC W191202 20:29:12.748572 1548 vendor/google.golang.org/grpc/clientconn.go:830  Failed to dial 172.16.12.68:26257: context canceled; please retry.
2019-12-02 20:29:13.162202 +0000 UTC W191202 20:29:13.161927 1587 vendor/google.golang.org/grpc/clientconn.go:830  Failed to dial 172.16.12.68:26257: connection error: desc = "transport: authentication handshake failed: read tcp 172.16.19.72:59066->172.16.12.68:26257: read: connection reset by peer"; please retry.

This reminded me of grpc/grpc-go#1026, in which an uncommon error during the authentication handshake was incorrectly considered "permanent" and led grpc to stop retrying (although the older issue caused it to spam the logs with the same error instead of silently timing out as we were seeing now).

ECONNRESET was considered a "temporary" error in go 1.10, but not in go 1.11 (without release notes!): golang/go#24808. Our official 2.0.x binaries use go 1.10, but this customer built their own binaries from source with go 1.11.

CockroachDB 2.0 used gRPC 1.9, which had subtle logic to distinguish "temporary" and "permanent" errors. Around the time of go 1.11's release, gRPC 1.11 was also released with changes that removed this error classification and simply retried everything (this was also not mentioned in the release notes, but the change was grpc/grpc-go#1856). This change was necessary for compatibility with go 1.11's temporary error change.

Conclusion: CockroachDB 2.0 should only be built with go 1.10. We should probably be more prescriptive about using only qualified go versions for other releases.
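One possible way to be more prescriptive is a startup check that compares the toolchain that built the binary against a list of qualified versions. This is only a sketch of the idea: `isQualifiedGoVersion` and the `qualified` list are hypothetical, and nothing here describes an existing CockroachDB mechanism. It relies on `runtime.Version()`, which returns strings such as `go1.10.8` for release toolchains.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strings"
)

// isQualifiedGoVersion (hypothetical helper) reports whether version, in the
// form returned by runtime.Version(), belongs to one of the qualified release
// series, e.g. "go1.10" matches "go1.10" and "go1.10.8" but not "go1.11.2".
func isQualifiedGoVersion(version string, qualified []string) bool {
	for _, q := range qualified {
		if version == q || strings.HasPrefix(version, q+".") {
			return true
		}
	}
	return false
}

func main() {
	// Assumed policy for a 2.0.x binary: only the go 1.10 series is qualified.
	qualified := []string{"go1.10"}
	if v := runtime.Version(); !isQualifiedGoVersion(v, qualified) {
		fmt.Fprintf(os.Stderr,
			"warning: built with %s, which is not a qualified toolchain %v\n",
			v, qualified)
	}
}
```

Whether such a check should warn or refuse to start is a policy decision; a warning in the logs would at least have made the mismatched toolchain visible early in this investigation.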
