(A customer encountered this issue in the wild; I'm filing this after the fact for future reference and as evidence of why we should be careful about building old releases with new go versions.)
The customer reported that after one node in their cluster crashed with a clock offset error (due to a VM migration), the restarted node was unable to successfully rejoin the cluster unless all the other nodes were restarted too.
The customer found #27731, which appeared to have similar symptoms, but we were able to rule that out because A) that issue was fixed in 2.0.6 (the cluster was running 2.0.7) and B) logs indicated that gossip was working.
Due to the clock offset issue, the node crashed in a loop until clock sync was restored. For the duration of this crash loop, the other nodes showed many "connection refused" errors, as expected. One of them, however, showed a single "read: connection reset by peer" error, after which the "connection refused" errors stopped:
2019-12-02 20:29:11.749467 +0000 UTC W191202 20:29:11.749184 1548 vendor/google.golang.org/grpc/clientconn.go:1158 grpc: addrConn.createTransport failed to connect to {172.16.12.68:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.16.12.68:26257: connect: connection refused". Reconnecting...
2019-12-02 20:29:12.748829 +0000 UTC W191202 20:29:12.748536 1548 vendor/google.golang.org/grpc/clientconn.go:1158 grpc: addrConn.createTransport failed to connect to {172.16.12.68:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
2019-12-02 20:29:12.74902 +0000 UTC W191202 20:29:12.748572 1548 vendor/google.golang.org/grpc/clientconn.go:830 Failed to dial 172.16.12.68:26257: context canceled; please retry.
2019-12-02 20:29:13.162202 +0000 UTC W191202 20:29:13.161927 1587 vendor/google.golang.org/grpc/clientconn.go:830 Failed to dial 172.16.12.68:26257: connection error: desc = "transport: authentication handshake failed: read tcp 172.16.19.72:59066->172.16.12.68:26257: read: connection reset by peer"; please retry.
This reminded me of grpc/grpc-go#1026, in which an uncommon error during the authentication handshake was incorrectly considered "permanent" and led grpc to stop retrying (although the older issue caused it to spam the logs with the same error instead of silently timing out as we were seeing now).
ECONNRESET was considered a "temporary" error in go 1.10, but not in go 1.11 (without release notes!): golang/go#24808. Our official 2.0.x binaries use go 1.10, but this customer built their own binaries from source with go 1.11.
Cockroach 2.0 used gRPC 1.9, which had subtle logic to distinguish "temporary" and "permanent" errors. Around the time of go 1.11's release, grpc 1.11 was also released with changes to get rid of this error classification and just retry everything (this was also not mentioned in release notes, but the change was grpc/grpc-go#1856). This change was necessary for compatibility with go 1.11's temporary error change. Building grpc 1.9 with go 1.11 therefore left the connection reset classified as "permanent", and grpc stopped retrying.
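The difference between the two retry policies can be sketched with a toy model. This is not grpc's actual code: `dialError`, `fakeDialer`, and `connect` are hypothetical names, and real reconnection logic also applies backoff, which is omitted here.

```go
package main

import "fmt"

// dialError is a hypothetical error carrying a temporary/permanent flag.
type dialError struct {
	msg       string
	temporary bool
}

func (e dialError) Error() string   { return e.msg }
func (e dialError) Temporary() bool { return e.temporary }

// fakeDialer fails with "connection refused", then with "connection reset
// by peer", then succeeds. resetIsTemporary models how the toolchain
// classifies the reset.
func fakeDialer(resetIsTemporary bool) func(attempt int) error {
	return func(attempt int) error {
		switch attempt {
		case 1:
			return dialError{"connection refused", true}
		case 2:
			return dialError{"connection reset by peer", resetIsTemporary}
		default:
			return nil
		}
	}
}

// connect calls dial up to maxAttempts times and returns the number of
// attempts made. With classify set, it gives up on the first error that is
// not marked temporary; otherwise it retries every error.
func connect(dial func(int) error, maxAttempts int, classify bool) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = dial(attempt)
		if err == nil {
			return attempt, nil
		}
		if classify {
			t, ok := err.(interface{ Temporary() bool })
			if !ok || !t.Temporary() {
				return attempt, err
			}
		}
	}
	return maxAttempts, err
}

func main() {
	n, err := connect(fakeDialer(true), 5, true)
	fmt.Println("classify, reset temporary:", n, err)
	n, err = connect(fakeDialer(false), 5, true)
	fmt.Println("classify, reset permanent:", n, err)
	n, err = connect(fakeDialer(false), 5, false)
	fmt.Println("retry everything:", n, err)
}
```

In this model only the middle case gives up, which corresponds to the combination the customer ended up with: a classifying retry policy built with a toolchain that no longer treats the reset as temporary.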
Conclusion: CockroachDB 2.0 should only be built with go 1.10. We should probably be more prescriptive about using only qualified go versions for other releases.
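One way to be more prescriptive would be a startup or build-time check against the toolchain version. The following is only a sketch of that idea, not something CockroachDB is known to do: the allow-list `qualifiedGoSeries` and the helper `isQualified` are hypothetical, while `runtime.Version` is the standard library call that reports the toolchain used for the build.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// qualifiedGoSeries is a hypothetical allow-list of go release series
// that a given CockroachDB release has been qualified against.
var qualifiedGoSeries = []string{"go1.10"}

// isQualified reports whether version (in the form returned by
// runtime.Version, e.g. "go1.10.8") belongs to one of the listed series.
// Pre-release strings such as "go1.10rc1" are deliberately not matched.
func isQualified(version string, series []string) bool {
	for _, s := range series {
		if version == s || strings.HasPrefix(version, s+".") {
			return true
		}
	}
	return false
}

func main() {
	v := runtime.Version()
	if !isQualified(v, qualifiedGoSeries) {
		fmt.Printf("warning: built with %s, which is not a qualified toolchain %v\n", v, qualifiedGoSeries)
		return
	}
	fmt.Printf("built with qualified toolchain %s\n", v)
}
```

Whether such a check should warn or refuse to start is a policy choice; a warning in the logs would at least have made the mismatch visible in this case.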