# design doc: interop monitoring #222
Begins to flesh out a plan for how we go about monitoring and alerting for the interop release.
> If Cross-Validation should promote the block to Cross-Safe, this is an all-hands-on-deck consensus bug, which would naturally
> have its own alerts associated in addition to the prior expectation of an operator monitoring the situation.
Presumably the runbook would say this should be escalated and the bridge paused (unless it can be established that the invalid message is benign)?
IMO there are no benign versions of this situation. The suggestion here is essentially for our operators to directly watch that Invalid Block replacement always happens. If it ever didn't, that would be a protocol-level failure, and depending on the nature of the issue, we could be dealing with an Unsafe Chain Stall or a failure of FP games.
In any case, yes, this would be a situation to escalate to an engineering manager who can help coordinate a larger response.
I think I am being a little overprotective in the operational suggestions here, especially if we deal with invalid messages regularly. But at least for the first few times it happens, I want a human ready to sound the alarm if Cross Safety should fail. A sketch of what that watchdog could look like is below.
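To make that concrete, here is a minimal Go sketch of such a watchdog. The `SafetyClient` interface, its method signatures, and the polling cadence are assumptions for illustration only; a real monitor would back these calls with the actual supervisor/node RPCs.

```go
// A minimal sketch, assuming a hypothetical SafetyClient interface. It watches
// a block flagged as containing an invalid message and alarms if that block is
// ever promoted to Cross-Safe instead of being replaced.
package monitor

import (
	"context"
	"log"
	"time"
)

// SafetyLevel mirrors the interop safety levels discussed in the design doc.
type SafetyLevel int

const (
	LocalUnsafe SafetyLevel = iota
	CrossUnsafe
	LocalSafe
	CrossSafe
)

// SafetyClient is an assumed interface, not a real supervisor API.
type SafetyClient interface {
	SafetyOf(ctx context.Context, chainID, blockHash string) (SafetyLevel, error)
	WasReplaced(ctx context.Context, chainID, blockHash string) (bool, error)
}

// WatchInvalidBlock polls until the flagged block is either replaced (the
// expected outcome) or promoted to Cross-Safe (the all-hands-on-deck case).
func WatchInvalidBlock(ctx context.Context, c SafetyClient, chainID, blockHash string) {
	tick := time.NewTicker(12 * time.Second) // polling cadence is arbitrary
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-tick.C:
			replaced, err := c.WasReplaced(ctx, chainID, blockHash)
			if err != nil {
				log.Printf("replacement check failed: %v", err)
				continue
			}
			if replaced {
				log.Printf("block %s on %s replaced as expected", blockHash, chainID)
				return
			}
			level, err := c.SafetyOf(ctx, chainID, blockHash)
			if err != nil {
				log.Printf("safety check failed: %v", err)
				continue
			}
			if level >= CrossSafe {
				// Consensus bug: an invalid block crossed the safety boundary.
				log.Printf("ALERT: invalid block %s promoted to Cross-Safe on chain %s", blockHash, chainID)
				return
			}
		}
	}
}
```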
> ### Resource Usage
>
> This new service will need minimal CPU/Disk and can be stateless. It will need a connection to one of each Node
If it is to be stateless, doesn't this make it difficult to fully track the status of an x-msg (executing message) over time, if (for example) the service is restarted?
Yes, that is one weakness as designed: the service must be running to collect new jobs in the first place, and it would drop ongoing jobs if restarted.
The question can be expanded to: even if jobs were persisted, if the Monitoring Service is restarted, must it also backfill the gap?
I could see this going either way. The complexity of persisting and backfilling jobs seems like it would make this a lot more work to build, but the risk of having blind or incomplete metrics is scary! Maybe there is a middle ground where we run monitors in redundancy and are tolerant of them failing; a rough sketch of the checkpointing side of that is below.
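As a rough illustration of that middle ground, here is a minimal Go sketch of a per-monitor checkpoint that a restarted instance could resume and backfill from. The `Checkpoint` shape and the file handling are assumptions, not anything from the design doc.

```go
// A minimal checkpointing sketch, assuming a tiny JSON file per monitor.
// A restarted monitor loads this and backfills jobs from LastBlock instead
// of starting blind.
package monitor

import (
	"encoding/json"
	"os"
)

// Checkpoint records the last block a monitor enqueued jobs from.
type Checkpoint struct {
	ChainID   string `json:"chainId"`
	LastBlock uint64 `json:"lastBlock"`
}

// SaveCheckpoint writes via a temp file plus rename so a crash mid-write
// cannot corrupt the previous checkpoint.
func SaveCheckpoint(path string, cp Checkpoint) error {
	data, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// LoadCheckpoint returns the saved checkpoint, or the zero value on first
// boot (no file yet), in which case the monitor starts from the current head.
func LoadCheckpoint(path string) (Checkpoint, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return Checkpoint{}, nil
	}
	if err != nil {
		return Checkpoint{}, err
	}
	var cp Checkpoint
	err = json.Unmarshal(data, &cp)
	return cp, err
}
```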
> ### Resource Usage
>
> This new service will need minimal CPU/Disk and can be stateless. It will need a connection to one of each Node
> for the Superchain it is monitoring.
It would be good to add more detail about how many `xmsg-mon` instances should be deployed in total, and to use the "full validation stack" terminology if appropriate. For example, should the nodes the `xmsg-mon` connects to be managed by the same supervisor (I guess yes)?
Ya, good call, will include!
I'm assuming a monitor would connect to one full validator, yeah, but you could get creative with it if you wanted and assign nodes from different validators. Or theoretically attach only a subset of the chains, in which case only a subset of jobs would be tracked (why would you do this? idk).
Anyway, best practice is probably one consistent set of nodes: a "full validator". Something like the config sketched below.
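For illustration, a hypothetical Go config shape for that one-monitor-per-full-validator topology. None of these field names or URLs come from the design doc; the key property is just that all endpoints belong to one consistent validator set managed by the same supervisor.

```go
// A hypothetical config sketch, not a real xmsg-mon schema.
package monitor

// NodeEndpoint pairs a chain with its node's RPC endpoint in the
// full-validation stack being monitored.
type NodeEndpoint struct {
	ChainID string
	RPCURL  string
}

// MonitorConfig describes a single xmsg-mon instance.
type MonitorConfig struct {
	SupervisorRPC string         // the supervisor managing all listed nodes
	Nodes         []NodeEndpoint // one entry per chain in the Superchain
}

// ExampleConfig shows one monitor attached to every chain of one validator.
func ExampleConfig() MonitorConfig {
	return MonitorConfig{
		SupervisorRPC: "http://supervisor.internal:8545", // illustrative URL
		Nodes: []NodeEndpoint{
			{ChainID: "chain-a", RPCURL: "http://node-a.internal:8545"},
			{ChainID: "chain-b", RPCURL: "http://node-b.internal:8545"},
		},
	}
}
```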
Co-authored-by: George Knee <[email protected]>
The monitoring service is a critical component for detecting and responding to potential invalid messages across the Superchain.