Analysis and action for UnreachableMasterWithLaggingReplicas #572

Merged: shlomi-noach merged 23 commits into master from unreachable-master-with-lagging-replicas on Aug 23, 2018
Conversation
…artSlave on master's direct replicas
TODO: subtlety: if we restart replication on all replicas, that in itself would cause a situation where all replicas are not replicating at the same time. This would trigger a …
ggunson reviewed on Aug 9, 2018
```go
	return found
}

// emergentlyRestartReplicationOnTopologyInstanceReplicas forces a stop slave + start slave on …
```
Is it useful to just do a stop/start of the IO thread, rather than both threads? On the off chance that the replica is behind?
Yeah, that's a good idea.
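For illustration, an IO-thread-only restart per the suggestion above could look like the following sketch, assuming a plain database/sql connection to the replica; the helper name is hypothetical, not orchestrator's actual API:

```go
package replication

import "database/sql"

// restartIOThread bounces only the replication IO thread. The SQL thread
// keeps applying relay logs, so a replica that is behind continues to
// catch up while the connection to the master is re-established.
func restartIOThread(db *sql.DB) error {
	if _, err := db.Exec("STOP SLAVE IO_THREAD"); err != nil {
		return err
	}
	_, err := db.Exec("START SLAVE IO_THREAD")
	return err
}
```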
…:github/orchestrator into unreachable-master-with-lagging-replicas
…:github/orchestrator into unreachable-master-with-lagging-replicas
…leSlaves, AllMasterSlavesStale
…base_instance_recent_relaylog_history
A new failure analysis, UnreachableMasterWithLaggingReplicas, identifies the case of a master unreachable to orchestrator, where all of its replicas are seemingly OK, but all lagging.

Failure scenario
This is a known scenario in production. A well-known particular cause for this is the Too Many Connections problem on a master: the master is overloaded, connections keep coming in, and finally the master refuses to accept new connections. orchestrator would suddenly be unable to reach the master. But long-running replicas may enjoy the fact that they're using good old connections; they may actually still be able to replicate.

The master may eventually refuse any/all writes. The replicas would still think everything's fine, but they're not getting anything through the replication stream.
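For illustration, this condition is visible client-side: new connections are refused with MySQL error 1040 (ER_CON_COUNT_ERROR) while established replica connections keep working. A minimal sketch using the go-sql-driver/mysql error type; the helper name is hypothetical:

```go
package replication

import (
	"errors"

	"github.com/go-sql-driver/mysql"
)

// masterRefusingConnections reports whether an error from a new connection
// attempt is MySQL's "Too many connections" (error 1040), the overload
// condition described above.
func masterRefusingConnections(err error) bool {
	var mysqlErr *mysql.MySQLError
	return errors.As(err, &mysqlErr) && mysqlErr.Number == 1040
}
```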
If using pt-heartbeat or similar, replication lag will be seen to increase even as Seconds_behind_master may still indicate 0.

Some notes:

- This can happen even where slave_net_timeout is configured and replicas are using heartbeats.
- We have seen this scenario under pt-online-schema-change load (before moving to triggerless gh-ost).

Analysis
To avoid false positives, the analysis checks:

- that orchestrator is set up with a ReplicationLagQuery configuration, i.e. utilizes a heartbeat mechanism such as pt-heartbeat (see the sketch after this list), and does not trust Seconds_Behind_Master to do the right thing (it doesn't);
- that replicas are not configured with SQL_Delay (if they are, then they are in fact expected to lag).
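As an illustration of the first check, here is a minimal sketch of a heartbeat-based lag measurement. The helper name and the meta.heartbeat table are assumptions for the example, not orchestrator's actual ReplicationLagQuery:

```go
package replication

import "database/sql"

// heartbeatLagSeconds is an illustrative lag check: pt-heartbeat writes a
// timestamp on the master every second, so the age of the newest row as
// seen on a replica is true replication lag, regardless of what
// Seconds_Behind_Master claims.
func heartbeatLagSeconds(db *sql.DB) (lag float64, err error) {
	err = db.QueryRow(
		"SELECT UNIX_TIMESTAMP(NOW(6)) - UNIX_TIMESTAMP(MAX(ts)) FROM meta.heartbeat",
	).Scan(&lag)
	return lag, err
}
```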
Action

There are two potential courses of action, and we picked one over the other. One course of action would be to immediately initiate a failover. However, we chose another, because this analysis is a bit in the gray zone: there could be a failure of pt-heartbeat on the master together with a very brief network isolation of the master from orchestrator. The chance is slim, but because this type of analysis is new, we choose to tread carefully and avoid false positive failovers.

We choose a different action: issue a STOP SLAVE; START SLAVE on all of the master's direct replicas (credit @tomkrouper). This would kick the connections on the replicas, and hopefully the re-authentication and re-connection process would make each replica realize the master is broken, the same way orchestrator did, or any app connection did.

That would shortly lead to all replicas being broken, which would lead to a DeadMaster analysis and a failover action.

Noteworthy that this analysis is re-generated every second or so, and that the action taken (restart replication on replicas) is not affected by RecoveryPeriodBlockSeconds. There is an internal throttling mechanism to avoid flooding the replicas with stop slave; start slave operations.
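A sketch of the shape of such a throttled restart, under assumed names and a fixed interval (orchestrator's actual implementation and throttling details differ):

```go
package replication

import (
	"database/sql"
	"sync"
	"time"
)

var (
	restartMu   sync.Mutex
	lastRestart = map[string]time.Time{} // replica "host:port" -> last kick
)

// maybeRestartReplication issues STOP SLAVE; START SLAVE on one replica,
// but at most once per minInterval, so an analysis that re-fires every
// second or so does not flood the replicas with restarts.
func maybeRestartReplication(replicaKey string, db *sql.DB, minInterval time.Duration) error {
	restartMu.Lock()
	if time.Since(lastRestart[replicaKey]) < minInterval {
		restartMu.Unlock()
		return nil // throttled: this replica was kicked recently
	}
	lastRestart[replicaKey] = time.Now()
	restartMu.Unlock()

	if _, err := db.Exec("STOP SLAVE"); err != nil {
		return err
	}
	_, err := db.Exec("START SLAVE")
	return err
}
```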
cc @github/database-infrastructure