Skip to content

Conversation

kamil-holubicki
Copy link
Contributor

https://perconadev.atlassian.net/browse/DISTMYSQL-466

Problem:
When recovery is disabled globally, UnreachableMasterWithLaggingReplicas and UnreachableIntermediateMasterWithLaggingReplicas cases cause replica io thread to be restarted.

Cause:
Commit 98bd7f0 added the feature allowing global recovery disable

Commit fc33f3e moved the implementation to
executeCheckAndRecoverFunction

Commit b761fa3 introduced runEmergencyOperations() function. Its purpose was to read topology instance to speed up recovery. The instance was read, then recovery was skipped if disabled globally.

Commit 464a3c1 and 684d6e2 caused the regression. They introduced the call to emergentlyRestartReplicationOnTopologyInstance() from runEmergencyOperations().
openark/orchestrator#572 and openark/orchestrator#1005 provide the detailed explanation, why it was done.

Solution:
If recovery was disabled globally, and this is not forced discovery, skip restart of replicas.

Additionally fixed Instance object read from Orchestrator's backend DB. Such and object was missing QSP member (Query String Provider). As the consequence any query related to master/slave <-> source/replica could not be resolved and failed (because nil string query was executed)

Related issue: https://github.com/openark/orchestrator/issues/0123456789

Description

This PR [briefly explain what is does]

@kamil-holubicki
Copy link
Contributor Author

kamil-holubicki commented Nov 7, 2024

Copy link

@venkatesh-prasad-v venkatesh-prasad-v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// Check for recovery being disabled globally
if recerr != nil {
// Unexpected. Shouldn't get this
log.Errorf("Unable to determine if recovery is disabled globally: %v", err)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we continue execution if recerr is not nil?

additionally, we log err but error returned from IsRecoveryDisabled is named recerr. I suggest renaming recerr to err.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. We should log recerr. Keeping it as recerr, just to do as less changes to this function as needed (we use err afterwards)
As for the flow, this is to keep the original one. If we can't determine if recovery is disabled globally or not (which should not happen), we are continuing as it was enabled (IsRecoveryDisabled returns false in case of error)

…cluster where recovery has been globally disabled

https://perconadev.atlassian.net/browse/DISTMYSQL-466

Problem:
When recovery is disabled globally, UnreachableMasterWithLaggingReplicas
and UnreachableIntermediateMasterWithLaggingReplicas cases cause
replica io thread to be restarted.

Cause:
Commit 98bd7f0 added the feature allowing global recovery disable

Commit fc33f3e moved the implementation to
executeCheckAndRecoverFunction

Commit b761fa3 introduced runEmergencyOperations() function.
Its purpose was to read topology instance to speed up recovery.
The instance was read, then recovery was skipped if disabled globally.

Commit 464a3c1 and 684d6e2 caused the regression. They introduced
the call to emergentlyRestartReplicationOnTopologyInstance() from
runEmergencyOperations().
openark/orchestrator#572 and
openark/orchestrator#1005 provide the detailed
explanation, why it was done.

Solution:
If recovery was disabled globally, and this is not forced discovery,
skip restart of replicas.

Additionally fixed Instance object read from Orchestrator's backend DB.
Such and object was missing QSP member (Query String Provider).
As the consequence any query related to master/slave <-> source/replica
could not be resolved and failed (because nil string query was executed)
@kamil-holubicki kamil-holubicki merged commit 999e977 into percona:master Nov 12, 2024
2 checks passed
kamil-holubicki referenced this pull request in kamil-holubicki/orchestrator Nov 15, 2024
DISTMYSQL-466: RestartReplicationQuick called even from Orchestrator cluster where recovery has been globally disabled
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants