feat(chain-orchestractor): support retry on failure #322
Conversation
It looks like you have not implemented the retry logic on some critical components. Specifically:
- `persist_l1_consolidated_blocks`
- `consolidate_validated_l2_blocks`
- `handle_l1_reorg`
I also have some concerns about this pattern. Let's say, for example, that we implement the retry mechanism on:
```rust
/// Handles a reorganization event by deleting all indexed data which is greater than the
/// provided block number.
async fn handle_l1_reorg(
    database: Arc<Database>,
    chain_spec: Arc<ChainSpec>,
    l1_block_number: u64,
    l2_client: Arc<P>,
    current_chain: Arc<Mutex<Chain>>,
) -> Result<Option<ChainOrchestratorEvent>, ChainOrchestratorError> {
    let txn = database.tx().await?;
    let UnwindResult { l1_block_number, queue_index, l2_head_block_number, l2_safe_block_info } =
        txn.unwind(chain_spec.genesis_hash(), l1_block_number).await?;
    txn.commit().await?;
    let l2_head_block_info = if let Some(block_number) = l2_head_block_number {
        // Fetch the block hash of the new L2 head block.
        let block_hash = l2_client
            .get_block_by_number(block_number.into())
            .await?
            .expect("L2 head block must exist")
            .header
            .hash_slow();
        // Remove all blocks in the in-memory chain that are greater than the new L2 head block.
        let mut current_chain_headers = current_chain.lock().await;
        current_chain_headers.inner_mut().retain(|h| h.number <= block_number);
        Some(BlockInfo { number: block_number, hash: block_hash })
    } else {
        None
    };
    Ok(Some(ChainOrchestratorEvent::L1Reorg {
        l1_block_number,
        queue_index,
        l2_head_block_info,
        l2_safe_block_info,
    }))
}
```
Suppose we commit the database transaction but then fail on `get_block_by_number` and attempt to retry. On the second attempt, the database has already been unwound, so `txn.unwind` will return a different result and the retry will yield an incorrect outcome.
We have two options to address this:
- Hold the database transaction open until we have the final result ready. This is bad practice, as we shouldn't hold a database lock whilst performing other IO operations.
- Perform only read operations at the start of the function, and finally commit all changes to the database at the end of the function, once we have computed the result.

I think we should migrate to option 2.
Let me know your thoughts.
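To make the idempotency argument concrete, here is a minimal sketch of option 2, using hypothetical stand-in types (`MockDb`, `fetch_head_hash`) rather than the real `Database` and RPC client: all reads happen up front, the fallible RPC call happens in the middle, and the single write is deferred to the end, so a failure before the commit leaves nothing persisted and the whole function can be retried safely.

```rust
// Sketch of option 2 (hypothetical simplified types): read first, commit last.
// If any intermediate step fails, no state has been persisted, so a retry
// re-reads the same state and produces the same result.

#[derive(Debug, Clone, PartialEq)]
struct BlockInfo {
    number: u64,
    hash: u64,
}

struct MockDb {
    committed: Option<u64>,
}

impl MockDb {
    // Read-only: compute the unwind target without mutating state.
    fn plan_unwind(&self, l1_block_number: u64) -> u64 {
        l1_block_number.saturating_sub(1)
    }
    // The only write, performed last.
    fn commit_unwind(&mut self, target: u64) {
        self.committed = Some(target);
    }
}

// `fetch_head_hash` stands in for the fallible `get_block_by_number` RPC call.
fn handle_l1_reorg(
    db: &mut MockDb,
    l1_block_number: u64,
    fetch_head_hash: impl Fn(u64) -> Result<u64, String>,
) -> Result<BlockInfo, String> {
    // 1. Read-only planning phase.
    let target = db.plan_unwind(l1_block_number);
    // 2. Other IO (RPC). A failure here leaves the DB untouched.
    let hash = fetch_head_hash(target)?;
    // 3. Single write at the end, once the full result is known.
    db.commit_unwind(target);
    Ok(BlockInfo { number: target, hash })
}

fn main() {
    let mut db = MockDb { committed: None };
    // First attempt: the RPC fails, and the DB is left unchanged.
    assert!(handle_l1_reorg(&mut db, 10, |_| Err("rpc down".to_string())).is_err());
    assert_eq!(db.committed, None);
    // The retry succeeds and commits exactly once.
    let info = handle_l1_reorg(&mut db, 10, |n| Ok(n * 1000)).unwrap();
    assert_eq!(info, BlockInfo { number: 9, hash: 9000 });
    assert_eq!(db.committed, Some(9));
}
```

The real function would keep its async signature and real transaction types; the point is only the ordering of reads, fallible IO, and the final commit.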
I'm not sure `retry_with_backoff` is the strategy we want; it might be better to have a wrapper around each provider (RPC, DB, ...) that implements a retry strategy.
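As a rough illustration of the per-provider idea, here is a hypothetical sketch (the `RetryingClient` type and its `call` method are invented for this example, and real code would be async with backoff): the retry policy lives inside a thin wrapper around each client, so call sites stay free of retry logic.

```rust
// Hypothetical sketch: a retrying wrapper around a fallible provider
// (RPC, DB, ...), so the retry policy is attached to the provider itself
// instead of being sprinkled over every call site.

struct RetryingClient<C> {
    inner: C,
    max_attempts: u32,
}

impl<C> RetryingClient<C> {
    // Run `op` against the wrapped provider, retrying up to `max_attempts`.
    fn call<T, E>(&self, op: impl Fn(&C) -> Result<T, E>) -> Result<T, E> {
        let mut last_err = None;
        for _ in 0..self.max_attempts {
            match op(&self.inner) {
                Ok(v) => return Ok(v),
                Err(e) => last_err = Some(e),
            }
        }
        Err(last_err.expect("max_attempts must be > 0"))
    }
}

fn main() {
    use std::cell::Cell;
    // A flaky "provider" that fails twice before succeeding.
    let calls = Cell::new(0u32);
    let client = RetryingClient { inner: (), max_attempts: 3 };
    let result: Result<u32, &str> = client.call(|_| {
        calls.set(calls.get() + 1);
        if calls.get() < 3 { Err("transient") } else { Ok(42) }
    });
    assert_eq!(result, Ok(42));
    assert_eq!(calls.get(), 3);
}
```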
Force-pushed from 2e14fa7 to 9d65609
I think we still have the previous idempotency issue.
Edit: never mind
Looks good. I left some comments inline. I am wondering if we need to differentiate between the types of errors that are returned from functions? I think it's fine to leave it as is for now, but in the future maybe we should inspect the returned error before retrying.
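One way to inspect the error before retrying could look like the following sketch (the `OrchestratorError` variants and `retry_if_transient` helper are hypothetical, not the PR's actual types): transient errors such as timeouts are retried, while deterministic ones fail immediately instead of burning retry attempts.

```rust
// Hypothetical sketch: classify errors as retryable or fatal before retrying.

#[derive(Debug, PartialEq)]
enum OrchestratorError {
    Transient(String), // e.g. network timeout: worth retrying
    Fatal(String),     // e.g. invalid input: will fail the same way every time
}

impl OrchestratorError {
    fn is_retryable(&self) -> bool {
        matches!(self, OrchestratorError::Transient(_))
    }
}

// Retry `op` only while the error it returns is classified as transient.
fn retry_if_transient<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, OrchestratorError>,
) -> Result<T, OrchestratorError> {
    let mut attempts = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if e.is_retryable() && attempts + 1 < max_attempts => attempts += 1,
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // A fatal error is surfaced immediately, after a single attempt.
    let mut calls = 0;
    let r: Result<(), _> = retry_if_transient(5, || {
        calls += 1;
        Err(OrchestratorError::Fatal("bad input".into()))
    });
    assert!(r.is_err());
    assert_eq!(calls, 1);
    // A transient error is retried until the operation succeeds.
    let mut n = 0;
    let r = retry_if_transient(5, || {
        n += 1;
        if n < 3 { Err(OrchestratorError::Transient("timeout".into())) } else { Ok(n) }
    });
    assert_eq!(r, Ok(3));
}
```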
Left a comment inline
Left a few comments inline.
lgtm
This PR adds support for retrying on a subset of chain orchestrator failures.
Failure types currently covered:
Corresponding issue: #319