design doc: interop monitoring #222

Open: wants to merge 4 commits into main

@tynes tynes commented Mar 19, 2025

The monitoring service is critical for detecting and responding to potentially invalid messages across the Superchain.

Begins to flesh out a plan for how we go about monitoring and alerting
for the interop release.
Comment on lines +115 to +116
If Cross-Validation should promote the block to Cross-Safe, this is an all-hands-on-deck consensus bug, which would naturally
have its own alerts associated in addition to the prior expectation of an operator monitoring the situation.
Contributor

Presumably the runbook would say this should be escalated and the bridge paused (unless it can be established that the invalid message is benign)?

Contributor

IMO there are no benign versions of this situation. The suggestion here is essentially for our operators to directly watch that Invalid Block replacement always happens. If it ever didn't, that would be a protocol-level failure, and depending on the nature of the issue, we could be dealing with an Unsafe Chain Stall or a failure of FP games.

In any case, yes, this would be a situation to escalate to an engineering manager who can help coordinate a larger response.

I think I am being a little overprotective in the operational suggestions here, especially if we deal with invalid messages regularly. But at least for the first few times it happens, I want a human ready to sound the alarm if Cross Safety should fail.
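
A minimal sketch of the escalation rule discussed in this comment, assuming the monitor can classify what eventually happened to a block that contained an invalid executing message. The `Outcome` names and the alert strings are hypothetical and are not taken from the design doc or any existing service.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcomes for a block containing an invalid executing message."""
    PENDING = "pending"        # cross-validation has not decided yet
    REPLACED = "replaced"      # the invalid block was replaced, as expected
    CROSS_SAFE = "cross-safe"  # the invalid block was promoted: protocol-level failure


def alert_level(outcome: Outcome) -> str:
    """Map an outcome to an (assumed) operational response."""
    if outcome is Outcome.CROSS_SAFE:
        # No benign version of this case: page and escalate.
        return "page-and-escalate"
    if outcome is Outcome.PENDING:
        # An operator keeps watching until replacement is confirmed.
        return "operator-watch"
    return "resolved"
```

The point of the sketch is only that the promoted-to-Cross-Safe case has a single response; whether the pending case needs a human at all is the judgment call raised above.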


### Resource Usage

This new service will need minimal CPU/Disk and can be stateless. It will need a connection to one of each Node
Contributor

If it is to be stateless, doesn't this make it difficult to fully track the status of an x-msg (executing message) over time, if (for example) the service is restarted?

Contributor

Yes, that is one weakness of the design as written: the service has to be running to collect new jobs in the first place, and it would drop ongoing jobs if restarted.

The question can be expanded to: even if persisted, if the Monitoring Service is restarted, must it also backfill the gap?

I could see this going either way. The complexity of persisting and backfilling jobs seems like it would make this a lot more work to build. But the risk of having blind or incomplete metrics is scary! Maybe there is a middle ground where we run redundant monitors and tolerate individual ones failing.
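
To make the trade-off concrete, here is a minimal sketch of the redundancy idea, assuming each monitor keeps its jobs only in memory. `JobID`, `Monitor`, `merged_view`, and the status strings are all hypothetical names for illustration and are not part of the proposed service.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class JobID:
    """Hypothetical key identifying one executing message being tracked."""
    chain_id: int
    block_number: int
    log_index: int


@dataclass
class Monitor:
    """Stateless monitor: jobs live only in memory."""
    name: str
    jobs: dict = field(default_factory=dict)  # JobID -> last observed status

    def observe(self, job: JobID, status: str) -> None:
        self.jobs[job] = status

    def restart(self) -> None:
        # No persistence and no backfill: in-flight jobs are dropped.
        self.jobs.clear()


def merged_view(monitors):
    """Union of the jobs known to any replica, keyed by job, then by monitor name."""
    view = {}
    for monitor in monitors:
        for job, status in monitor.jobs.items():
            view.setdefault(job, {})[monitor.name] = status
    return view


if __name__ == "__main__":
    a, b = Monitor("a"), Monitor("b")
    job = JobID(chain_id=1, block_number=100, log_index=0)
    a.observe(job, "unsafe")
    b.observe(job, "unsafe")
    a.restart()  # replica "a" loses the job
    print(merged_view([a, b]))  # the job is still visible through replica "b"
```

This only covers the case where at least one replica stays up for the whole life of a job. If every replica restarts during the same window, the gap is still there, which is where persistence or backfill would come in.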

### Resource Usage

This new service will need minimal CPU/Disk and can be stateless. It will need a connection to one of each Node
for the Superchain it is monitoring.
Contributor

It would be good to add more detail about how many xmsg-mon instances should be deployed in total, and to use the "full validation stack" terminology if appropriate.

For example, should the nodes the xmsg-mon connects to be managed by the same supervisor (I guess yes)?

Contributor

Ya good call, will include!

I'm assuming a monitor would connect to one full-validator yeah, but you could get creative with it if you wanted, and assign nodes from different validators. Or theoretically only attach a subset of the chains, in which case only a subset of jobs would be tracked (why would you do this? idk).

Anyway, best practice is probably one consistent set of nodes, i.e. a "full validator".
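
A minimal sketch of a startup check for that best practice, assuming the monitor is configured with a list of `(chain_id, validator_name)` pairs. The function name, the config shape, and the chain IDs in the usage note are hypothetical.

```python
def check_node_set(expected_chains, nodes):
    """Return a list of problems with a monitor's node set (empty if consistent).

    expected_chains: chain IDs the monitor is meant to cover.
    nodes: (chain_id, validator_name) pairs, one per attached node.
    """
    problems = []
    seen = {}
    for chain_id, validator in nodes:
        if chain_id in seen:
            problems.append(f"chain {chain_id}: more than one node attached")
        seen.setdefault(chain_id, validator)

    # A chain with no node means only a subset of jobs would be tracked.
    for chain_id in sorted(set(expected_chains) - set(seen)):
        problems.append(f"chain {chain_id}: no node attached, its jobs would not be tracked")

    # Mixing validators is possible, but not the suggested default.
    validators = sorted(set(seen.values()))
    if len(validators) > 1:
        problems.append(f"nodes come from multiple validators: {validators}")
    return problems
```

For example, `check_node_set([1, 2], [(1, "v1"), (2, "v1")])` returns an empty list, while dropping the node for chain 2 or attaching it from a second validator each produce one warning.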

Co-authored-by: George Knee <[email protected]>
3 participants