feat(chain-orchestractor): support retry on failure #322
Conversation
It looks like you have not implemented the retry logic on some critical components. Specifically:
- `persist_l1_consolidated_blocks`
- `consolidate_validated_l2_blocks`
- `handle_l1_reorg`
I also have some concerns about this pattern. Let's say, for example, that we implement the retry mechanism on:
```rust
/// Handles a reorganization event by deleting all indexed data which is greater than the
/// provided block number.
async fn handle_l1_reorg(
    database: Arc<Database>,
    chain_spec: Arc<ChainSpec>,
    l1_block_number: u64,
    l2_client: Arc<P>,
    current_chain: Arc<Mutex<Chain>>,
) -> Result<Option<ChainOrchestratorEvent>, ChainOrchestratorError> {
    let txn = database.tx().await?;
    let UnwindResult { l1_block_number, queue_index, l2_head_block_number, l2_safe_block_info } =
        txn.unwind(chain_spec.genesis_hash(), l1_block_number).await?;
    txn.commit().await?;
    let l2_head_block_info = if let Some(block_number) = l2_head_block_number {
        // Fetch the block hash of the new L2 head block.
        let block_hash = l2_client
            .get_block_by_number(block_number.into())
            .await?
            .expect("L2 head block must exist")
            .header
            .hash_slow();
        // Remove all blocks in the in-memory chain that are greater than the new L2 head block.
        let mut current_chain_headers = current_chain.lock().await;
        current_chain_headers.inner_mut().retain(|h| h.number <= block_number);
        Some(BlockInfo { number: block_number, hash: block_hash })
    } else {
        None
    };
    Ok(Some(ChainOrchestratorEvent::L1Reorg {
        l1_block_number,
        queue_index,
        l2_head_block_info,
        l2_safe_block_info,
    }))
}
```
Suppose we commit the database transaction but then fail on `get_block_by_number` and attempt to retry. On the second attempt, the database has already been unwound, so `txn.unwind` will return a different result and the retry will yield an incorrect outcome.
We have two options to address this:
- Hold the database transaction open until we have the final result ready. This is bad practice, as we shouldn't hold a database lock whilst performing other IO operations.
- Perform only read operations at the start of the function, and finally commit all changes to the database at the end of the function, once we have computed the result.

I think we should migrate to option 2.
Let me know your thoughts.
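To make the idempotency argument concrete, here is a minimal sketch of option 2, using hypothetical stand-in types (`MockDb`, `fetch_head_hash`) rather than the real `Database` and RPC client: all reads happen up front, the fallible RPC call happens in the middle, and the single write is deferred to the end, so a failure before the commit leaves nothing persisted and the whole function can be retried safely.

```rust
// Sketch of option 2 (hypothetical simplified types): read first, commit last.
// If any intermediate step fails, no state has been persisted, so a retry
// re-reads the same state and produces the same result.

#[derive(Debug, Clone, PartialEq)]
struct BlockInfo {
    number: u64,
    hash: u64,
}

struct MockDb {
    committed: Option<u64>,
}

impl MockDb {
    // Read-only: compute the unwind target without mutating state.
    fn plan_unwind(&self, l1_block_number: u64) -> u64 {
        l1_block_number.saturating_sub(1)
    }
    // The only write, performed last.
    fn commit_unwind(&mut self, target: u64) {
        self.committed = Some(target);
    }
}

// `fetch_head_hash` stands in for the fallible `get_block_by_number` RPC call.
fn handle_l1_reorg(
    db: &mut MockDb,
    l1_block_number: u64,
    fetch_head_hash: impl Fn(u64) -> Result<u64, String>,
) -> Result<BlockInfo, String> {
    // 1. Read-only planning phase.
    let target = db.plan_unwind(l1_block_number);
    // 2. Other IO (RPC). A failure here leaves the DB untouched.
    let hash = fetch_head_hash(target)?;
    // 3. Single write at the end, once the full result is known.
    db.commit_unwind(target);
    Ok(BlockInfo { number: target, hash })
}

fn main() {
    let mut db = MockDb { committed: None };
    // First attempt: the RPC fails, and the DB is left unchanged.
    assert!(handle_l1_reorg(&mut db, 10, |_| Err("rpc down".to_string())).is_err());
    assert_eq!(db.committed, None);
    // The retry succeeds and commits exactly once.
    let info = handle_l1_reorg(&mut db, 10, |n| Ok(n * 1000)).unwrap();
    assert_eq!(info, BlockInfo { number: 9, hash: 9000 });
    assert_eq!(db.committed, Some(9));
}
```

The real function would keep its async signature and real transaction types; the point is only the ordering of reads, fallible IO, and the final commit.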
I'm not sure `retry_with_backoff` is the strategy we want; it might be better to have a wrapper around each provider (RPC, DB, ...) that implements a retry strategy.
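As a rough illustration of the per-provider idea, here is a hypothetical sketch (the `RetryingClient` type and its `call` method are invented for this example, and real code would be async with backoff): the retry policy lives inside a thin wrapper around each client, so call sites stay free of retry logic.

```rust
// Hypothetical sketch: a retrying wrapper around a fallible provider
// (RPC, DB, ...), so the retry policy is attached to the provider itself
// instead of being sprinkled over every call site.

struct RetryingClient<C> {
    inner: C,
    max_attempts: u32,
}

impl<C> RetryingClient<C> {
    // Run `op` against the wrapped provider, retrying up to `max_attempts`.
    fn call<T, E>(&self, op: impl Fn(&C) -> Result<T, E>) -> Result<T, E> {
        let mut last_err = None;
        for _ in 0..self.max_attempts {
            match op(&self.inner) {
                Ok(v) => return Ok(v),
                Err(e) => last_err = Some(e),
            }
        }
        Err(last_err.expect("max_attempts must be > 0"))
    }
}

fn main() {
    use std::cell::Cell;
    // A flaky "provider" that fails twice before succeeding.
    let calls = Cell::new(0u32);
    let client = RetryingClient { inner: (), max_attempts: 3 };
    let result: Result<u32, &str> = client.call(|_| {
        calls.set(calls.get() + 1);
        if calls.get() < 3 { Err("transient") } else { Ok(42) }
    });
    assert_eq!(result, Ok(42));
    assert_eq!(calls.get(), 3);
}
```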
Force-pushed from 2e14fa7 to 9d65609
I think we still have the previous idempotency issue.
Edit: never mind
Looks good. I left some comments inline. I am wondering if we need to differentiate between the types of errors that are returned from functions? I think it's fine to leave it as is for now, but in the future maybe we should inspect the returned error before retrying.
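One way to inspect the error before retrying could look like the following sketch (the `OrchestratorError` variants and `retry_if_transient` helper are hypothetical, not the PR's actual types): transient errors such as timeouts are retried, while deterministic ones fail immediately instead of burning retry attempts.

```rust
// Hypothetical sketch: classify errors as retryable or fatal before retrying.

#[derive(Debug, PartialEq)]
enum OrchestratorError {
    Transient(String), // e.g. network timeout: worth retrying
    Fatal(String),     // e.g. invalid input: will fail the same way every time
}

impl OrchestratorError {
    fn is_retryable(&self) -> bool {
        matches!(self, OrchestratorError::Transient(_))
    }
}

// Retry `op` only while the error it returns is classified as transient.
fn retry_if_transient<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, OrchestratorError>,
) -> Result<T, OrchestratorError> {
    let mut attempts = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if e.is_retryable() && attempts + 1 < max_attempts => attempts += 1,
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // A fatal error is surfaced immediately, after a single attempt.
    let mut calls = 0;
    let r: Result<(), _> = retry_if_transient(5, || {
        calls += 1;
        Err(OrchestratorError::Fatal("bad input".into()))
    });
    assert!(r.is_err());
    assert_eq!(calls, 1);
    // A transient error is retried until the operation succeeds.
    let mut n = 0;
    let r = retry_if_transient(5, || {
        n += 1;
        if n < 3 { Err(OrchestratorError::Transient("timeout".into())) } else { Ok(n) }
    });
    assert_eq!(r, Ok(3));
}
```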
Left a comment inline
Left a few comments inline.
lgtm
This PR adds support for retrying on a subset of chain orchestrator failures.
Failure types currently covered:
Corresponding issue: #319