Skip to content

Supervised failover #4265

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 21, 2024
Merged

Supervised failover #4265

merged 8 commits into from
Jun 21, 2024

Conversation

andreyaksenov
Copy link
Contributor

@andreyaksenov andreyaksenov commented Jun 5, 2024

  1. Created a sample app with the supervised failover: supervised_failover.
  2. Created a new Supervised failover topic.
  3. Updated the Configuration reference:
  4. Added the --failover option to the tarantool CLI reference. Note that this reference is moved from the existing Starting and stopping instances topic.
  5. Added the tt cluster failover command reference:
    • A new failover subsection in the tt cluster command reference.
    • New timeout and wait options (applicable to tt cluster failover): Options

@andreyaksenov andreyaksenov force-pushed the supervised-failover branch 2 times, most recently from a0b0108 to a13464d Compare June 10, 2024 13:51
@andreyaksenov andreyaksenov linked an issue Jun 10, 2024 that may be closed by this pull request
@andreyaksenov andreyaksenov force-pushed the supervised-failover branch 5 times, most recently from ab931e5 to 352dd4f Compare June 14, 2024 08:42
@andreyaksenov andreyaksenov linked an issue Jun 14, 2024 that may be closed by this pull request
@andreyaksenov andreyaksenov force-pushed the supervised-failover branch 12 times, most recently from 6d9bc82 to 18283bb Compare June 18, 2024 13:52
@andreyaksenov andreyaksenov marked this pull request as ready for review June 18, 2024 13:57
@andreyaksenov andreyaksenov force-pushed the supervised-failover branch 2 times, most recently from ea4ef0d to 06cdab2 Compare June 18, 2024 14:04
Copy link
Member

@Totktonada Totktonada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the links in the pull request description. It is really convenient to look over the parts that actually needs a review.

I highlighted several points in the failover algorithm overview. I think it worth to discuss f2f how to better describe it in a consice way.

At least, I see that we can discuss the following points:

  • How to better divide it into subsections.
  • Should we divide the service into more-or-less simple blocks and describe them separately or describe how the service works at whole?
  • Should we highlight which actions are performed by which actor (coordinator, instance) and when exactly (on appoint request, on switching to active mode and so on).

I would also like to provide some in-depth materials like ones that are placed in https://github.com/tarantool/enterprise_doc/issues/253 (plus pictures from my presentations). I open a pull request with some drafts, but I likely need your assistance to make it ready for the website.


.. code-block:: console

$ tt cluster failover switch URI INSTANCE_NAME [OPTION ...]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it worth to mention -w here explicitly, because this option is likely often used together with this command.

@andreyaksenov andreyaksenov force-pushed the supervised-failover branch 6 times, most recently from ff4dc5a to 71f58ec Compare June 20, 2024 13:10
@andreyaksenov andreyaksenov force-pushed the supervised-failover branch 4 times, most recently from a8e7524 to a69199e Compare June 20, 2024 15:13
Copy link
Member

@Totktonada Totktonada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the concise description and the reference!

@xuniq xuniq self-requested a review June 20, 2024 17:18
Copy link
Contributor

@p7nov p7nov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the reference part. Looks good, just a couple comments.

.. _tarantool_cli:
.. _configuration_command_options:

tarantool command-line options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no introduction/definition of the tarantool executable.
I'd add a sentence (possibily a note) explaining that tarantool is an executable for running a single instance and there is tt that covers much more scenarios.

``tt cluster failover switch`` appoints the specified instance to be a master.
This command accepts the following arguments and options:

- ``URI``: A :ref:`URI <tt-cluster-uri>` of the cluster configuration storage.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a possible issue that applies to the whole page.
In the tt cluster reference, the URI arguments refers to the config storage URI, while in the rest of the tt reference it's the instance URI. This can confuse readers. Perhaps we should change the argument name on this page to CONFIG_URI or smth like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed the URI argument to CONFIG_URI

.. code-block:: console

$ tt cluster failover switch http://localhost:2379/myapp storage-a-002
To check the switching status, run:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the output needs an explanation. At first, I decided that it's a part of the switch-status reference pasted here by mistake.

This command accepts the following arguments:

- ``URI``: A :ref:`URI <tt-cluster-uri>` of the cluster configuration storage.
- ``TASK_ID``: An identifier of the task used to switch a master instance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's never introduced. Let's explain it here or above (see the comment on failover switch example).


$ tt cluster failover switch-status URI TASK_ID

``tt cluster failover switch-status`` shows the status of switching a master instance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What statuses do we have?

@@ -378,12 +447,24 @@ Options

Skip validation when publishing. Default: `false` (validation is enabled).

.. option:: --t, --timeout UINT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. option:: --t, --timeout UINT
.. option:: -t, --timeout UINT

-t with one hyphen?

Copy link
Contributor

@xuniq xuniq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@andreyaksenov andreyaksenov merged commit 8ff1dd9 into latest Jun 21, 2024
1 check passed
@andreyaksenov andreyaksenov deleted the supervised-failover branch June 21, 2024 11:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tt cluster failover commands
4 participants