DISTMYSQL-466: RestartReplicationQuick called even from Orchestrator cluster where recovery has been globally disabled #51

kamil-holubicki · 2024-11-07T14:07:00Z

https://perconadev.atlassian.net/browse/DISTMYSQL-466

Problem:
When recovery is disabled globally, the UnreachableMasterWithLaggingReplicas and UnreachableIntermediateMasterWithLaggingReplicas cases still cause the replica IO thread to be restarted.

Cause:
Commit 98bd7f0 added the ability to disable recovery globally.

Commit fc33f3e moved the implementation to executeCheckAndRecoverFunction.

Commit b761fa3 introduced the runEmergencyOperations() function. Its purpose was to read the topology instance to speed up recovery. The instance was read, and then recovery was skipped if it was disabled globally.

Commits 464a3c1 and 684d6e2 caused the regression. They introduced the call to emergentlyRestartReplicationOnTopologyInstance() from runEmergencyOperations().
openark/orchestrator#572 and openark/orchestrator#1005 explain in detail why this was done.

Solution:
If recovery is disabled globally and this is not a forced discovery, skip the restart of replicas.
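A minimal sketch of the guard described above, not the code from this PR. The helper name `shouldRestartReplicas` and both parameters are hypothetical; in the real code the check sits on the emergency-operations path and the inputs come from Orchestrator's own state.

```go
package main

import "fmt"

// shouldRestartReplicas is a hypothetical helper that mirrors the rule in the
// description: skip the restart only when recovery is disabled globally AND
// the operation was not forced.
func shouldRestartReplicas(recoveryDisabledGlobally, forced bool) bool {
	if recoveryDisabledGlobally && !forced {
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldRestartReplicas(true, false))  // disabled, not forced
	fmt.Println(shouldRestartReplicas(true, true))   // disabled, but forced
	fmt.Println(shouldRestartReplicas(false, false)) // recovery enabled
}
```

Only the first call reports that the restart should be skipped; a forced operation still goes through even when recovery is disabled globally.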

Additionally, fixed the Instance object read from Orchestrator's backend DB. Such an object was missing its QSP member (Query String Provider). As a consequence, any query that depends on master/slave <-> source/replica terminology could not be resolved and failed (because a nil query string was executed).
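To illustrate the QSP problem, here is a hedged sketch with entirely hypothetical types. `QueryStringProvider`, `Instance`, `readInstanceFromBackend` and `replicaStatusQuery` are stand-ins invented for this example and do not reproduce Orchestrator's actual definitions. The idea is only that an instance built from a backend row must get its provider assigned, otherwise terminology-dependent queries cannot be resolved.

```go
package main

import (
	"errors"
	"fmt"
)

// QueryStringProvider is a hypothetical stand-in for the QSP member: it
// resolves statements whose wording differs between the two terminologies.
type QueryStringProvider interface {
	ShowReplicaStatus() string
}

type legacyTerms struct{}

func (legacyTerms) ShowReplicaStatus() string { return "SHOW SLAVE STATUS" }

type modernTerms struct{}

func (modernTerms) ShowReplicaStatus() string { return "SHOW REPLICA STATUS" }

// Instance is a simplified, hypothetical version of a topology instance.
type Instance struct {
	Hostname string
	QSP      QueryStringProvider
}

// readInstanceFromBackend mimics building an Instance from a stored row.
// Assigning QSP here is the step the description says was missing.
func readInstanceFromBackend(hostname string, usesReplicaTerminology bool) *Instance {
	inst := &Instance{Hostname: hostname}
	if usesReplicaTerminology {
		inst.QSP = modernTerms{}
	} else {
		inst.QSP = legacyTerms{}
	}
	return inst
}

// replicaStatusQuery fails loudly instead of running an unresolved query.
func replicaStatusQuery(inst *Instance) (string, error) {
	if inst.QSP == nil {
		return "", errors.New("instance has no query string provider")
	}
	return inst.QSP.ShowReplicaStatus(), nil
}

func main() {
	fmt.Println(replicaStatusQuery(readInstanceFromBackend("db1", true)))
	// An instance built without a provider reproduces the failure mode.
	fmt.Println(replicaStatusQuery(&Instance{Hostname: "db2"}))
}
```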


kamil-holubicki · 2024-11-07T14:09:32Z

https://ps80.cd.percona.com/job/mysql-orchestrator-pipeline/45

venkatesh-prasad-v

LGTM

egegunes · 2024-11-08T13:14:20Z

go/logic/topology_recovery.go

+	// Check for recovery being disabled globally
+	if recerr != nil {
+		// Unexpected. Shouldn't get this
+		log.Errorf("Unable to determine if recovery is disabled globally: %v", err)


Should we continue execution if recerr is not nil?

Additionally, we log err, but the error returned from IsRecoveryDisabled is named recerr. I suggest renaming recerr to err.

kamil-holubicki · 2024-11-12

Good catch. We should log recerr. I am keeping the name recerr to make as few changes to this function as possible (we use err afterwards).
As for the flow, this keeps the original behavior. If we can't determine whether recovery is disabled globally (which should not happen), we continue as if it were enabled (IsRecoveryDisabled returns false in case of an error).


kamil-holubicki requested review from egegunes and venkatesh-prasad-v November 7, 2024 14:07

venkatesh-prasad-v approved these changes Nov 8, 2024


egegunes reviewed Nov 8, 2024


kamil-holubicki force-pushed the DISTMYSQL-466 branch from a6a83be to 1eb8047 Compare November 12, 2024 09:32

kamil-holubicki merged commit 999e977 into percona:master Nov 12, 2024
2 checks passed

kamil-holubicki referenced this pull request in kamil-holubicki/orchestrator Nov 15, 2024

Merge pull request #51 from kamil-holubicki/DISTMYSQL-466

422dbb7
