# design doc: interop monitoring #222
Begins to flesh out a plan for how we go about monitoring and alerting for the interop release.
> If Cross-Validation should promote the block to Cross-Safe, this is an all-hands-on-deck consensus bug, which would naturally
> have its own alerts associated in addition to the prior expectation of an operator monitoring the situation.
Presumably the runbook would say this should be escalated and the bridge paused (unless it can be established that the invalid message is benign)?
IMO there are no benign versions of this situation. The suggestion here is essentially for our operators to directly watch that Invalid Block replacement always happens. If it ever didn't, that would be a protocol-level failure, and depending on the nature of the issue, we could be dealing with an Unsafe Chain Stall or a failure of FP games.
In any case, yes, this would be a situation to escalate to an engineering manager who can help coordinate a larger response.
I think I am being a little overprotective in the operational suggestions here, especially if we deal with invalid messages regularly. But at least for the first few times it happens, I want a human ready to sound the alarm if Cross Safety should fail. A sketch of what that watchdog could look like is below.
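To make that concrete, here is a minimal Go sketch of such a watchdog. The `SafetyClient` interface, its method signatures, and the polling cadence are assumptions for illustration only; a real monitor would back these calls with the actual supervisor/node RPCs.

```go
// A minimal sketch, assuming a hypothetical SafetyClient interface. It watches
// a block flagged as containing an invalid message and alarms if that block is
// ever promoted to Cross-Safe instead of being replaced.
package monitor

import (
	"context"
	"log"
	"time"
)

// SafetyLevel mirrors the interop safety levels discussed in the design doc.
type SafetyLevel int

const (
	LocalUnsafe SafetyLevel = iota
	CrossUnsafe
	LocalSafe
	CrossSafe
)

// SafetyClient is an assumed interface, not a real supervisor API.
type SafetyClient interface {
	SafetyOf(ctx context.Context, chainID, blockHash string) (SafetyLevel, error)
	WasReplaced(ctx context.Context, chainID, blockHash string) (bool, error)
}

// WatchInvalidBlock polls until the flagged block is either replaced (the
// expected outcome) or promoted to Cross-Safe (the all-hands-on-deck case).
func WatchInvalidBlock(ctx context.Context, c SafetyClient, chainID, blockHash string) {
	tick := time.NewTicker(12 * time.Second) // polling cadence is arbitrary
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-tick.C:
			replaced, err := c.WasReplaced(ctx, chainID, blockHash)
			if err != nil {
				log.Printf("replacement check failed: %v", err)
				continue
			}
			if replaced {
				log.Printf("block %s on %s replaced as expected", blockHash, chainID)
				return
			}
			level, err := c.SafetyOf(ctx, chainID, blockHash)
			if err != nil {
				log.Printf("safety check failed: %v", err)
				continue
			}
			if level >= CrossSafe {
				// Consensus bug: an invalid block crossed the safety boundary.
				log.Printf("ALERT: invalid block %s promoted to Cross-Safe on chain %s", blockHash, chainID)
				return
			}
		}
	}
}
```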
> ### Resource Usage
>
> This new service will need minimal CPU/Disk and can be stateless. It will need a connection to one of each Node
If it is to be stateless, doesn't this make it difficult to fully track the status of an x-msg (executing message) over time, if (for example) the service is restarted?
Yes, that is one weakness as designed: the service must be running to collect new jobs in the first place, and it would drop ongoing jobs if restarted.
The question can be expanded to: even if jobs were persisted, if the Monitoring Service is restarted, must it also backfill the gap?
I could see this going either way. The complexity of persisting and backfilling jobs seems like it would make this a lot more work to build, but the risk of having blind or incomplete metrics is scary! Maybe there is a middle ground where we run monitors in redundancy and are tolerant of them failing; a rough sketch of the checkpointing side of that is below.
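As a rough illustration of that middle ground, here is a minimal Go sketch of a per-monitor checkpoint that a restarted instance could resume and backfill from. The `Checkpoint` shape and the file handling are assumptions, not anything from the design doc.

```go
// A minimal checkpointing sketch, assuming a tiny JSON file per monitor.
// A restarted monitor loads this and backfills jobs from LastBlock instead
// of starting blind.
package monitor

import (
	"encoding/json"
	"os"
)

// Checkpoint records the last block a monitor enqueued jobs from.
type Checkpoint struct {
	ChainID   string `json:"chainId"`
	LastBlock uint64 `json:"lastBlock"`
}

// SaveCheckpoint writes via a temp file plus rename so a crash mid-write
// cannot corrupt the previous checkpoint.
func SaveCheckpoint(path string, cp Checkpoint) error {
	data, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// LoadCheckpoint returns the saved checkpoint, or the zero value on first
// boot (no file yet), in which case the monitor starts from the current head.
func LoadCheckpoint(path string) (Checkpoint, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return Checkpoint{}, nil
	}
	if err != nil {
		return Checkpoint{}, err
	}
	var cp Checkpoint
	err = json.Unmarshal(data, &cp)
	return cp, err
}
```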
> ### Resource Usage
>
> This new service will need minimal CPU/Disk and can be stateless. It will need a connection to one of each Node
> for the Superchain it is monitoring.
It would be good to add more detail about how many `xmsg-mon` instances should be deployed in total, and to use the "full validation stack" terminology if appropriate. For example, should the nodes the `xmsg-mon` connects to be managed by the same supervisor (I guess yes)?
Ya, good call, will include!
I'm assuming a monitor would connect to one full validator, yeah, but you could get creative with it if you wanted and assign nodes from different validators. Or theoretically attach only a subset of the chains, in which case only a subset of jobs would be tracked (why would you do this? idk).
Anyway, best practice is probably one consistent set of nodes: a "full validator". Something like the config sketched below.
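For illustration, a hypothetical Go config shape for that one-monitor-per-full-validator topology. None of these field names or URLs come from the design doc; the key property is just that all endpoints belong to one consistent validator set managed by the same supervisor.

```go
// A hypothetical config sketch, not a real xmsg-mon schema.
package monitor

// NodeEndpoint pairs a chain with its node's RPC endpoint in the
// full-validation stack being monitored.
type NodeEndpoint struct {
	ChainID string
	RPCURL  string
}

// MonitorConfig describes a single xmsg-mon instance.
type MonitorConfig struct {
	SupervisorRPC string         // the supervisor managing all listed nodes
	Nodes         []NodeEndpoint // one entry per chain in the Superchain
}

// ExampleConfig shows one monitor attached to every chain of one validator.
func ExampleConfig() MonitorConfig {
	return MonitorConfig{
		SupervisorRPC: "http://supervisor.internal:8545", // illustrative URL
		Nodes: []NodeEndpoint{
			{ChainID: "chain-a", RPCURL: "http://node-a.internal:8545"},
			{ChainID: "chain-b", RPCURL: "http://node-b.internal:8545"},
		},
	}
}
```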
Co-authored-by: George Knee <[email protected]>
The monitoring service is a critical component for detecting and responding to potential invalid messages across the Superchain.