Skip to content

Conversation

yiweichi
Copy link
Member

This PR supports retry on part of chain orchestractor failures.

Currently including failure types:

  • DatabaseError
  • NetworkRequestError
  • RpcError

Corresponding issue: #319

Copy link
Collaborator

@frisitano frisitano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you have not implemented the retry logic on some critical components. Specifically:

  • persist_l1_consolidated_blocks
  • consolidate_validated_l2_blocks
  • handle_l1_reorg

I also have some concerns about this pattern. Lets say for example that we implement the retry mechanic on:

    /// Handles a reorganization event by deleting all indexed data which is greater than the
    /// provided block number.
    async fn handle_l1_reorg(
        database: Arc<Database>,
        chain_spec: Arc<ChainSpec>,
        l1_block_number: u64,
        l2_client: Arc<P>,
        current_chain: Arc<Mutex<Chain>>,
    ) -> Result<Option<ChainOrchestratorEvent>, ChainOrchestratorError> {
        let txn = database.tx().await?;
        let UnwindResult { l1_block_number, queue_index, l2_head_block_number, l2_safe_block_info } =
            txn.unwind(chain_spec.genesis_hash(), l1_block_number).await?;
        txn.commit().await?;
        let l2_head_block_info = if let Some(block_number) = l2_head_block_number {
            // Fetch the block hash of the new L2 head block.
            let block_hash = l2_client
                .get_block_by_number(block_number.into())
                .await?
                .expect("L2 head block must exist")
                .header
                .hash_slow();
            // Remove all blocks in the in-memory chain that are greater than the new L2 head block.
            let mut current_chain_headers = current_chain.lock().await;
            current_chain_headers.inner_mut().retain(|h| h.number <= block_number);
            Some(BlockInfo { number: block_number, hash: block_hash })
        } else {
            None
        };
        Ok(Some(ChainOrchestratorEvent::L1Reorg {
            l1_block_number,
            queue_index,
            l2_head_block_info,
            l2_safe_block_info,
        }))
    }

Let's say we commit the database transaction but then we fail on get_block_by_number and attempt to retry. On the second retry the response from txn.unwind will be different and as such the second retry will yield an incorrect result.

We have two options to address this:

  • Hold the database transaction open until we have the final result ready. This is bad practice as we shouldn't hold a database lock whilst performing other IO operations.
  • Only do read operations at the start of the function and then finaly commit all changes to the database at the end of the function when we have computed the result.

I think we should migrate to pattern 2.

Let me know your thoughts.

Copy link
Collaborator

@greged93 greged93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure retry_with_backoff is the strategy we want, it might be a better to have a wrapper around each provider (RPC, DB,...) that implements a retry strategy.

@yiweichi yiweichi force-pushed the feat-chain-orchestrator-retry-on-failure branch from 2e14fa7 to 9d65609 Compare September 21, 2025 18:58
Copy link
Collaborator

@greged93 greged93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still have the previous idempotency issue.
Edit: never mind

Copy link
Collaborator

@frisitano frisitano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I left some comments inline. I am wondering if we need to differentiate between the types of errors that are returned from functions? I think for now it's fine to leave as it is but in the future maybe we should consider error returned before retrying.

Copy link
Collaborator

@frisitano frisitano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment inline

@yiweichi yiweichi requested a review from frisitano September 23, 2025 07:47
Copy link
Collaborator

@frisitano frisitano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments inline.

@yiweichi yiweichi requested a review from frisitano September 23, 2025 09:27
Copy link
Collaborator

@frisitano frisitano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@yiweichi yiweichi merged commit 84e8cfe into main Sep 23, 2025
13 checks passed
@yiweichi yiweichi deleted the feat-chain-orchestrator-retry-on-failure branch September 23, 2025 10:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants