This repository was archived by the owner on Feb 18, 2025. It is now read-only.
Merged
23 commits
babf5da
terminology
Aug 6, 2018
e9b3272
CountDelayedReplicas
Aug 7, 2018
b354c5b
CountLaggingReplicas
Aug 7, 2018
0909784
new analysis: UnreachableMasterWithLaggingReplicas
Aug 8, 2018
6f0c630
tests for UnreachableMasterWithLaggingReplicas
Aug 8, 2018
55b5f87
Taking action on UnreachableMasterWithLaggingReplicas: forcing a Rest…
Aug 8, 2018
33305be
emergency operation graceful period
Aug 9, 2018
7bed7e3
Restarting just the IO thread
Aug 9, 2018
f874f8a
whoops, wrong implementation of RestartIOThread
Aug 9, 2018
3f8e5d1
reverting earlier change
Aug 9, 2018
53a332c
Merge branch 'master' into unreachable-master-with-lagging-replicas
Aug 12, 2018
e7381c6
using config.Config.ReasonableReplicationLagSeconds
Aug 14, 2018
787be50
Merge branch 'unreachable-master-with-lagging-replicas' of github.com…
Aug 14, 2018
b12207b
temporary debug message for visibility
Aug 19, 2018
250f019
Merge branch 'master' into unreachable-master-with-lagging-replicas
Aug 19, 2018
4e1ef35
debug messages more informative
Aug 19, 2018
c819f50
Merge branch 'unreachable-master-with-lagging-replicas' of github.com…
Aug 19, 2018
3bb1fa2
Documentation for UnreachableMasterWithLaggingReplicas
Aug 20, 2018
ebaf07a
Removed legacy (and long since non-existing) UnreachableMasterWithSta…
Aug 20, 2018
3825de3
removed use of legacy database_instance_binlog_files_history and data…
Aug 20, 2018
cb39182
too much sleep/retry time for orchestrator-client, reduced a bit
Aug 20, 2018
048146a
Merge branch 'master' into unreachable-master-with-lagging-replicas
Aug 23, 2018
5042119
analysis message cached
Aug 23, 2018
11 changes: 11 additions & 0 deletions docs/failure-detection.md
@@ -35,6 +35,7 @@ Observe the following list of potential failures:
* DeadMasterAndSlaves
* DeadMasterAndSomeSlaves
* DeadMasterWithoutSlaves
* UnreachableMasterWithLaggingReplicas
* UnreachableMaster
* AllMasterSlavesNotReplicating
* AllMasterSlavesNotReplicatingOrDead
@@ -84,6 +85,16 @@ their time to figure out they were failing replication.

This makes for a potential recovery process.

#### `UnreachableMasterWithLaggingReplicas`:

1. Master cannot be reached
2. All of its immediate replicas (excluding SQL-delayed replicas) are lagging

This scenario can happen when the master is overloaded. Clients would see a "Too many connections" error, while the replicas, whose connections were established long ago, claim the master is fine. Similarly, if the master is locked due to some metadata operation, clients would be blocked on connection while replicas _may claim_ everything's fine. However, since apps cannot connect to the master, no actual data gets written, and when using a heartbeat mechanism such as `pt-heartbeat`, we can observe a growing lag on the replicas.

`orchestrator` responds to this scenario by restarting replication on all of the master's immediate replicas. This will close the old client connections on those replicas and attempt to initiate new ones. These may now fail to connect, leading to a complete replication failure on all replicas. This, in turn, will lead `orchestrator` to analyze a `DeadMaster`.
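For illustration, here is a minimal sketch of the detection rule, assuming a simplified stand-in for the `ReplicationAnalysis` struct; in `orchestrator` itself a replica counts as lagging when its lag exceeds `ReasonableReplicationLagSeconds`, and the real condition lives in `go/inst/analysis_dao.go`.

```go
// Minimal sketch (not orchestrator's actual code) of the new detection rule,
// using a simplified stand-in for the ReplicationAnalysis struct.
package main

import "fmt"

type analysis struct {
	IsMaster                      bool
	LastCheckValid                bool
	CountReplicas                 uint
	CountLaggingReplicas          uint // lag above ReasonableReplicationLagSeconds
	CountDelayedReplicas          uint // intentionally delayed via SQL delay
	CountValidReplicatingReplicas uint
}

// Mirrors the condition added in go/inst/analysis_dao.go: the master is
// unreachable, every replica is lagging, not all replicas are intentionally
// delayed, and at least one replica is still replicating.
func isUnreachableMasterWithLaggingReplicas(a analysis) bool {
	return a.IsMaster &&
		!a.LastCheckValid &&
		a.CountLaggingReplicas == a.CountReplicas &&
		a.CountDelayedReplicas < a.CountReplicas &&
		a.CountValidReplicatingReplicas > 0
}

func main() {
	a := analysis{
		IsMaster:                      true,
		LastCheckValid:                false,
		CountReplicas:                 3,
		CountLaggingReplicas:          3,
		CountDelayedReplicas:          0,
		CountValidReplicatingReplicas: 3,
	}
	fmt.Println(isUnreachableMasterWithLaggingReplicas(a)) // true
}
```

The SQL-delayed exclusion keeps intentionally delayed replicas from triggering a false positive: if all replicas lag only because they are configured with a deliberate delay, this analysis does not fire.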


### Failures of no interest

The following scenarios are of no interest to `orchestrator`. While the information and state are available to `orchestrator`, it does not recognize such scenarios as _failures_ per se; no detection hooks are invoked and, obviously, no recoveries are attempted:
2 changes: 0 additions & 2 deletions go/config/config.go
@@ -230,7 +230,6 @@ type Configuration struct {
PostMasterFailoverProcesses []string // Processes to execute after doing a master failover (order of execution undefined). Uses same placeholders as PostFailoverProcesses
PostIntermediateMasterFailoverProcesses []string // Processes to execute after doing an intermediate master failover (order of execution undefined). Uses same placeholders as PostFailoverProcesses
PostGracefulTakeoverProcesses []string // Processes to execute after running a graceful master takeover. Uses same placeholders as PostFailoverProcesses
UnreachableMasterWithStaleSlavesProcesses []string // Processes to execute when detecting an UnreachableMasterWithStaleSlaves scenario.
CoMasterRecoveryMustPromoteOtherCoMaster bool // When 'false', anything can get promoted (and candidates are preferred over others). When 'true', orchestrator will promote the other co-master or else fail
DetachLostSlavesAfterMasterFailover bool // synonym to DetachLostReplicasAfterMasterFailover
DetachLostReplicasAfterMasterFailover bool // Should replicas that are not to be lost in master recovery (i.e. were more up-to-date than promoted replica) be forcibly detached
@@ -391,7 +390,6 @@ func newConfiguration() *Configuration {
PostFailoverProcesses: []string{},
PostUnsuccessfulFailoverProcesses: []string{},
PostGracefulTakeoverProcesses: []string{},
UnreachableMasterWithStaleSlavesProcesses: []string{},
CoMasterRecoveryMustPromoteOtherCoMaster: true,
DetachLostSlavesAfterMasterFailover: true,
ApplyMySQLPromotionAfterMasterFailover: true,
6 changes: 3 additions & 3 deletions go/inst/analysis.go
@@ -32,13 +32,12 @@ const (
DeadMaster = "DeadMaster"
DeadMasterAndSlaves = "DeadMasterAndSlaves"
DeadMasterAndSomeSlaves = "DeadMasterAndSomeSlaves"
UnreachableMasterWithStaleSlaves = "UnreachableMasterWithStaleSlaves"
UnreachableMasterWithLaggingReplicas = "UnreachableMasterWithLaggingReplicas"
UnreachableMaster = "UnreachableMaster"
MasterSingleSlaveNotReplicating = "MasterSingleSlaveNotReplicating"
MasterSingleSlaveDead = "MasterSingleSlaveDead"
AllMasterSlavesNotReplicating = "AllMasterSlavesNotReplicating"
AllMasterSlavesNotReplicatingOrDead = "AllMasterSlavesNotReplicatingOrDead"
AllMasterSlavesStale = "AllMasterSlavesStale"
MasterWithoutSlaves = "MasterWithoutSlaves"
DeadCoMaster = "DeadCoMaster"
DeadCoMasterAndSomeSlaves = "DeadCoMasterAndSomeSlaves"
@@ -111,7 +110,6 @@ type ReplicationAnalysis struct {
CountValidReplicas uint
CountValidReplicatingReplicas uint
CountReplicasFailingToConnectToMaster uint
CountStaleReplicas uint
CountDowntimedReplicas uint
ReplicationDepth uint
SlaveHosts InstanceKeyMap
@@ -132,6 +130,8 @@ type ReplicationAnalysis struct {
CountMixedBasedLoggingReplicas uint
CountRowBasedLoggingReplicas uint
CountDistinctMajorVersionsLoggingReplicas uint
CountDelayedReplicas uint
CountLaggingReplicas uint
IsActionableRecovery bool
ProcessingNodeHostname string
ProcessingNodeToken string
128 changes: 67 additions & 61 deletions go/inst/analysis_dao.go
@@ -55,7 +55,7 @@ func initializeAnalysisDaoPostConfiguration() {
func GetReplicationAnalysis(clusterName string, hints *ReplicationAnalysisHints) ([]ReplicationAnalysis, error) {
result := []ReplicationAnalysis{}

args := sqlutils.Args(ValidSecondsFromSeenToLastAttemptedCheck(), clusterName)
args := sqlutils.Args(ValidSecondsFromSeenToLastAttemptedCheck(), config.Config.ReasonableReplicationLagSeconds, clusterName)
analysisQueryReductionClause := ``
if config.Config.ReduceReplicationAnalysisCount {
analysisQueryReductionClause = `
@@ -64,23 +64,23 @@ func GetReplicationAnalysis(clusterName string, hints *ReplicationAnalysisHints)
master_instance.last_checked <= master_instance.last_seen
and master_instance.last_attempted_check <= master_instance.last_seen + interval ? second
) = 1 /* AS is_last_check_valid */) = 0
OR (IFNULL(SUM(slave_instance.last_checked <= slave_instance.last_seen
AND slave_instance.slave_io_running = 0
AND slave_instance.last_io_error like '%error %connecting to master%'
AND slave_instance.slave_sql_running = 1),
OR (IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen
AND replica_instance.slave_io_running = 0
AND replica_instance.last_io_error like '%error %connecting to master%'
AND replica_instance.slave_sql_running = 1),
0) /* AS count_slaves_failing_to_connect_to_master */ > 0)
OR (IFNULL(SUM(slave_instance.last_checked <= slave_instance.last_seen),
0) /* AS count_valid_slaves */ < COUNT(slave_instance.server_id) /* AS count_slaves */)
OR (IFNULL(SUM(slave_instance.last_checked <= slave_instance.last_seen
AND slave_instance.slave_io_running != 0
AND slave_instance.slave_sql_running != 0),
0) /* AS count_valid_replicating_slaves */ < COUNT(slave_instance.server_id) /* AS count_slaves */)
OR (IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen),
0) /* AS count_valid_slaves */ < COUNT(replica_instance.server_id) /* AS count_slaves */)
OR (IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen
AND replica_instance.slave_io_running != 0
AND replica_instance.slave_sql_running != 0),
0) /* AS count_valid_replicating_slaves */ < COUNT(replica_instance.server_id) /* AS count_slaves */)
OR (MIN(
master_instance.slave_sql_running = 1
AND master_instance.slave_io_running = 0
AND master_instance.last_io_error like '%error %connecting to master%'
) /* AS is_failing_to_connect_to_master */)
OR (COUNT(slave_instance.server_id) /* AS count_slaves */ > 0)
OR (COUNT(replica_instance.server_id) /* AS count_slaves */ > 0)
`
args = append(args, ValidSecondsFromSeenToLastAttemptedCheck())
}
@@ -109,20 +109,20 @@ func GetReplicationAnalysis(clusterName string, hints *ReplicationAnalysisHints)
':',
master_instance.port) = master_instance.cluster_name) AS is_cluster_master,
MIN(master_instance.gtid_mode) AS gtid_mode,
COUNT(slave_instance.server_id) AS count_slaves,
IFNULL(SUM(slave_instance.last_checked <= slave_instance.last_seen),
COUNT(replica_instance.server_id) AS count_slaves,
IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen),
0) AS count_valid_slaves,
IFNULL(SUM(slave_instance.last_checked <= slave_instance.last_seen
AND slave_instance.slave_io_running != 0
AND slave_instance.slave_sql_running != 0),
IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen
AND replica_instance.slave_io_running != 0
AND replica_instance.slave_sql_running != 0),
0) AS count_valid_replicating_slaves,
IFNULL(SUM(slave_instance.last_checked <= slave_instance.last_seen
AND slave_instance.slave_io_running = 0
AND slave_instance.last_io_error like '%%error %%connecting to master%%'
AND slave_instance.slave_sql_running = 1),
IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen
AND replica_instance.slave_io_running = 0
AND replica_instance.last_io_error like '%%error %%connecting to master%%'
AND replica_instance.slave_sql_running = 1),
0) AS count_slaves_failing_to_connect_to_master,
MIN(master_instance.replication_depth) AS replication_depth,
GROUP_CONCAT(concat(slave_instance.Hostname, ':', slave_instance.Port)) as slave_hosts,
GROUP_CONCAT(concat(replica_instance.Hostname, ':', replica_instance.Port)) as slave_hosts,
MIN(
master_instance.slave_sql_running = 1
AND master_instance.slave_io_running = 0
@@ -148,49 +148,53 @@ func GetReplicationAnalysis(clusterName string, hints *ReplicationAnalysisHints)
master_instance.supports_oracle_gtid
) AS supports_oracle_gtid,
SUM(
slave_instance.oracle_gtid
replica_instance.oracle_gtid
) AS count_oracle_gtid_slaves,
IFNULL(SUM(slave_instance.last_checked <= slave_instance.last_seen
AND slave_instance.oracle_gtid != 0),
IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen
AND replica_instance.oracle_gtid != 0),
0) AS count_valid_oracle_gtid_slaves,
SUM(
slave_instance.binlog_server
replica_instance.binlog_server
) AS count_binlog_server_slaves,
IFNULL(SUM(slave_instance.last_checked <= slave_instance.last_seen
AND slave_instance.binlog_server != 0),
IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen
AND replica_instance.binlog_server != 0),
0) AS count_valid_binlog_server_slaves,
MIN(
master_instance.mariadb_gtid
) AS is_mariadb_gtid,
SUM(
slave_instance.mariadb_gtid
replica_instance.mariadb_gtid
) AS count_mariadb_gtid_slaves,
IFNULL(SUM(slave_instance.last_checked <= slave_instance.last_seen
AND slave_instance.mariadb_gtid != 0),
IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen
AND replica_instance.mariadb_gtid != 0),
0) AS count_valid_mariadb_gtid_slaves,
IFNULL(SUM(slave_instance.log_bin
AND slave_instance.log_slave_updates
AND slave_instance.binlog_format = 'STATEMENT'),
IFNULL(SUM(replica_instance.log_bin
AND replica_instance.log_slave_updates
AND replica_instance.binlog_format = 'STATEMENT'),
0) AS count_statement_based_loggin_slaves,
IFNULL(SUM(slave_instance.log_bin
AND slave_instance.log_slave_updates
AND slave_instance.binlog_format = 'MIXED'),
IFNULL(SUM(replica_instance.log_bin
AND replica_instance.log_slave_updates
AND replica_instance.binlog_format = 'MIXED'),
0) AS count_mixed_based_loggin_slaves,
IFNULL(SUM(slave_instance.log_bin
AND slave_instance.log_slave_updates
AND slave_instance.binlog_format = 'ROW'),
IFNULL(SUM(replica_instance.log_bin
AND replica_instance.log_slave_updates
AND replica_instance.binlog_format = 'ROW'),
0) AS count_row_based_loggin_slaves,
IFNULL(MIN(slave_instance.gtid_mode), '')
IFNULL(SUM(replica_instance.sql_delay > 0),
0) AS count_delayed_replicas,
IFNULL(SUM(replica_instance.slave_lag_seconds > ?),
0) AS count_lagging_replicas,
IFNULL(MIN(replica_instance.gtid_mode), '')
AS min_replica_gtid_mode,
IFNULL(MAX(slave_instance.gtid_mode), '')
IFNULL(MAX(replica_instance.gtid_mode), '')
AS max_replica_gtid_mode,
IFNULL(SUM(
replica_downtime.downtime_active is not null
and ifnull(replica_downtime.end_timestamp, now()) > now()),
0) AS count_downtimed_replicas,
COUNT(DISTINCT case
when slave_instance.log_bin AND slave_instance.log_slave_updates
then slave_instance.major_version
when replica_instance.log_bin AND replica_instance.log_slave_updates
then replica_instance.major_version
else NULL
end
) AS count_distinct_logging_major_versions
@@ -199,9 +203,9 @@ func GetReplicationAnalysis(clusterName string, hints *ReplicationAnalysisHints)
LEFT JOIN
hostname_resolve ON (master_instance.hostname = hostname_resolve.hostname)
LEFT JOIN
database_instance slave_instance ON (COALESCE(hostname_resolve.resolved_hostname,
master_instance.hostname) = slave_instance.master_host
AND master_instance.port = slave_instance.master_port)
database_instance replica_instance ON (COALESCE(hostname_resolve.resolved_hostname,
master_instance.hostname) = replica_instance.master_host
AND master_instance.port = replica_instance.master_port)
LEFT JOIN
database_instance_maintenance ON (master_instance.hostname = database_instance_maintenance.hostname
AND master_instance.port = database_instance_maintenance.port
@@ -211,15 +215,11 @@ func GetReplicationAnalysis(clusterName string, hints *ReplicationAnalysisHints)
AND master_instance.port = master_downtime.port
AND master_downtime.downtime_active = 1)
LEFT JOIN
database_instance_downtime as replica_downtime ON (slave_instance.hostname = replica_downtime.hostname
AND slave_instance.port = replica_downtime.port
database_instance_downtime as replica_downtime ON (replica_instance.hostname = replica_downtime.hostname
AND replica_instance.port = replica_downtime.port
AND replica_downtime.downtime_active = 1)
LEFT JOIN
cluster_alias ON (cluster_alias.cluster_name = master_instance.cluster_name)
LEFT JOIN
database_instance_recent_relaylog_history ON (
slave_instance.hostname = database_instance_recent_relaylog_history.hostname
AND slave_instance.port = database_instance_recent_relaylog_history.port)
WHERE
database_instance_maintenance.database_instance_maintenance_id IS NULL
AND ? IN ('', master_instance.cluster_name)
@@ -255,7 +255,6 @@ func GetReplicationAnalysis(clusterName string, hints *ReplicationAnalysisHints)
a.CountValidReplicatingReplicas = m.GetUint("count_valid_replicating_slaves")
a.CountReplicasFailingToConnectToMaster = m.GetUint("count_slaves_failing_to_connect_to_master")
a.CountDowntimedReplicas = m.GetUint("count_downtimed_replicas")
a.CountStaleReplicas = 0
a.ReplicationDepth = m.GetUint("replication_depth")
a.IsFailingToConnectToMaster = m.GetBool("is_failing_to_connect_to_master")
a.IsDowntimed = m.GetBool("is_downtimed")
@@ -283,6 +282,17 @@ func GetReplicationAnalysis(clusterName string, hints *ReplicationAnalysisHints)
a.CountRowBasedLoggingReplicas = m.GetUint("count_row_based_loggin_slaves")
a.CountDistinctMajorVersionsLoggingReplicas = m.GetUint("count_distinct_logging_major_versions")

a.CountDelayedReplicas = m.GetUint("count_delayed_replicas")
a.CountLaggingReplicas = m.GetUint("count_lagging_replicas")

if !a.LastCheckValid {
analysisMessage := fmt.Sprintf("analysis: IsMaster: %+v, LastCheckValid: %+v, LastCheckPartialSuccess: %+v, CountReplicas: %+v, CountValidReplicatingReplicas: %+v, CountLaggingReplicas: %+v, CountDelayedReplicas: %+v, ",
a.IsMaster, a.LastCheckValid, a.LastCheckPartialSuccess, a.CountReplicas, a.CountValidReplicatingReplicas, a.CountLaggingReplicas, a.CountDelayedReplicas,
)
if util.ClearToLog("analysis_dao", analysisMessage) {
log.Debugf(analysisMessage)
}
}
if a.IsMaster && !a.LastCheckValid && a.CountReplicas == 0 {
a.Analysis = DeadMasterWithoutSlaves
a.Description = "Master cannot be reached by orchestrator and has no slave"
@@ -299,9 +309,9 @@ func GetReplicationAnalysis(clusterName string, hints *ReplicationAnalysisHints)
a.Analysis = DeadMasterAndSomeSlaves
a.Description = "Master cannot be reached by orchestrator; some of its replicas are unreachable and none of its reachable replicas is replicating"
//
} else if a.IsMaster && !a.LastCheckValid && a.CountStaleReplicas == a.CountReplicas && a.CountValidReplicatingReplicas > 0 {
a.Analysis = UnreachableMasterWithStaleSlaves
a.Description = "Master cannot be reached by orchestrator and has running yet stale replicas"
} else if a.IsMaster && !a.LastCheckValid && a.CountLaggingReplicas == a.CountReplicas && a.CountDelayedReplicas < a.CountReplicas && a.CountValidReplicatingReplicas > 0 {
a.Analysis = UnreachableMasterWithLaggingReplicas
a.Description = "Master cannot be reached by orchestrator and all of its replicas are lagging"
//
} else if a.IsMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0 {
a.Analysis = UnreachableMaster
Expand All @@ -323,10 +333,6 @@ func GetReplicationAnalysis(clusterName string, hints *ReplicationAnalysisHints)
a.Analysis = AllMasterSlavesNotReplicatingOrDead
a.Description = "Master is reachable but none of its replicas is replicating"
//
} else if a.IsMaster && a.LastCheckValid && a.CountReplicas > 1 && a.CountStaleReplicas == a.CountReplicas && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0 {
a.Analysis = AllMasterSlavesStale
a.Description = "Master is reachable but all of its replicas are stale, although attempting to replicate"
//
} else /* co-master */ if a.IsCoMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas == a.CountReplicas && a.CountValidReplicatingReplicas == 0 {
a.Analysis = DeadCoMaster
a.Description = "Co-master cannot be reached by orchestrator and none of its replicas is replicating"