Implement concurrent broadcast tolerance for distributed watchtowers #679
Conversation
Force-pushed from 6ebfabf to 431e0e8
Codecov Report

@@            Coverage Diff             @@
##           master     #679      +/-   ##
==========================================
- Coverage   91.92%   91.91%   -0.02%
==========================================
  Files          35       35
  Lines       20082    20154      +72
==========================================
+ Hits        18461    18524      +63
- Misses       1621     1630       +9

Continue to review the full report at Codecov.
lightning/src/ln/channelmonitor.rs
Outdated
self.current_local_commitment_number = 0xffff_ffff_ffff - ((((sequence & 0xffffff) << 3*8) | (locktime as u64 & 0xffffff)) ^ self.commitment_transaction_number_obscure_factor);
mem::swap(&mut new_local_commitment_tx, &mut self.current_local_commitment_tx);
self.prev_local_signed_commitment_tx = Some(new_local_commitment_tx);
if self.local_tx_signed {
	return Err(MonitorUpdateError("Latest local commitment signed has already been signed, update is rejected"));
Updating before returning an Err probably needs a documentation update to indicate to the user that they should still write out the new version to disk even in the case of an Err (which is a super surprising API).
Added a new comment at the top-level API: ManyChannelMonitor::update_monitor.
I know this API is surprising, but you rarely have a distributed system driven by both a central coordinator and a p2p network.
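For illustration only, here is a rough, self-contained sketch of a persistence layer honoring that contract: it writes out the (possibly mutated) monitor even when the update itself is rejected. All type and function names below are invented for the example and are not the actual rust-lightning API.

// Hypothetical, simplified shapes -- not the actual rust-lightning types.
use std::collections::HashMap;
use std::sync::Mutex;

struct MonitorUpdateError(&'static str);
struct ChannelMonitorUpdate { /* serialized update steps */ }
struct ChannelMonitor { /* channel state needed to claim on-chain */ }

impl ChannelMonitor {
    /// May mutate internal state *and* return Err (e.g. when a local
    /// commitment was already signed and the update must be rejected).
    fn update_monitor(&mut self, _update: &ChannelMonitorUpdate) -> Result<(), MonitorUpdateError> {
        Ok(())
    }
    fn serialize(&self) -> Vec<u8> { Vec::new() }
}

struct PersistingMonitorStore {
    monitors: Mutex<HashMap<u64, ChannelMonitor>>,
}

impl PersistingMonitorStore {
    fn update_monitor(&self, channel_id: u64, update: ChannelMonitorUpdate) -> Result<(), MonitorUpdateError> {
        let mut monitors = self.monitors.lock().unwrap();
        let monitor = monitors.get_mut(&channel_id).expect("unknown channel");
        // Apply the update first and remember the outcome...
        let result = monitor.update_monitor(&update);
        // ...but persist the (possibly mutated) monitor unconditionally:
        // even a rejected update may have stored a commitment transaction
        // that this watchtower later needs to claim from.
        write_to_disk(channel_id, &monitor.serialize());
        result
    }
}

fn write_to_disk(_channel_id: u64, _bytes: &[u8]) { /* fsync, replicate, ... */ }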
lightning/src/ln/channelmonitor.rs
Outdated
if let Err(_) = self.onchain_tx_handler.provide_latest_local_tx(commitment_tx) {
	return Err(MonitorUpdateError("Local commitment signed has already been signed, no further update of LOCAL commitment transaction is allowed"));
}
self.onchain_tx_handler.provide_latest_local_tx(commitment_tx);
Can we assert somewhere in here that prev_local_signed_commitment_tx must have been None before this? I suppose return Err and debug_assert!(false).
We don't explicitly prune out prev_local_signed_commitment=None.
We could add a round-trip between coordinator and watchtowers with a ChannelMonitorUpdateStep::PrunePrevLocalCommitment, but that would be a latency hit for nothing, assuming we never return an Err if we broadcast a local commitment and the coordinator fails to advance the channel forward by releasing the secret. Note that the broadcast-commitment-then-reject-update case is covered in the new test (watchtower_alice failing the update_monitor).
Ah, right.
lightning/src/ln/functional_tests.rs
Outdated
#[test]
fn test_concurrent_monitor_claim() {
// Watchtower Alice receives block 135, broadcasts state X, rejects state Y.
Can you make this comment more chronological? I think this is what threw me off the other day:
Watchtower A receives block, broadcasts state,
then channel receives new state Y, sending it to both watchtowers,
Bob accepts Y, then receives block and broadcasts the latest state Y,
Alice rejects state Y, but Bob has already broadcasted it,
Y confirms.
I didn't read through the test yet, but now I better understand what you're going for. A few relatively minor comments, but conceptually this looks good.
Force-pushed from 431e0e8 to 963aaa6
@@ -823,6 +830,10 @@ pub struct ChannelMonitor<ChanSigner: ChannelKeys> {
// Set once we've signed a local commitment transaction and handed it over to our
// OnchainTxHandler. After this is set, no future updates to our local commitment transactions
// may occur, and we fail any such monitor updates.
//
// In case of update rejection due to a locally already signed commitment transaction, we
Maybe:
Note that even when we fail a local commitment transaction update, we still store the update to ensure we can claim from it in case a duplicate copy of this ChannelMonitor broadcasts it.
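As a rough illustration of the behaviour this flag gates (store the update unconditionally, then reject it if a local commitment was already signed), here is a simplified, self-contained sketch; the names loosely mirror the fields discussed above, but this is not the actual ChannelMonitor code.

// Simplified, self-contained sketch of the intended behaviour.
struct MonitorUpdateError(&'static str);

struct MonitorState {
    current_local_commitment_tx: Option<Vec<u8>>,
    prev_local_signed_commitment_tx: Option<Vec<u8>>,
    /// Set once a local commitment has been signed and handed to the
    /// broadcaster; afterwards no further local-commitment updates are valid.
    local_tx_signed: bool,
}

impl MonitorState {
    fn provide_latest_local_commitment_tx(&mut self, tx: Vec<u8>) -> Result<(), MonitorUpdateError> {
        // Store the update *first*, unconditionally: even if we reject it,
        // a duplicate copy of this monitor may broadcast this commitment,
        // and we must be able to claim our outputs from it.
        self.prev_local_signed_commitment_tx = self.current_local_commitment_tx.take();
        self.current_local_commitment_tx = Some(tx);

        if self.local_tx_signed {
            return Err(MonitorUpdateError(
                "Latest local commitment already signed, update rejected"));
        }
        Ok(())
    }
}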
@@ -887,6 +898,11 @@ pub trait ManyChannelMonitor: Send + Sync {
///
/// Any spends of outputs which should have been registered which aren't passed to
/// ChannelMonitors via block_connected may result in FUNDS LOSS.
///
/// In case of distributed watchtowers deployment, even if an Err is return, the new version
This needs more detail about the exact setup of a distributed ChannelMonitor, as several are possible (I think it's a misnomer to call it a watchtower, as that implies something specific about availability of keys which isn't true today), and probably this comment should be on ChannelMonitor::update_monitor, not the trait (though it could be referenced in the trait), because the trait's return value is only relevant for ChannelManager.
I know writing documentation for a distributed ChannelMonitor is tracked by #604. I intend to do it post-anchor, as there are implications for bumping-utxo management, namely that you should also duplicate those keys across distributed monitors.
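Pending that documentation, a toy model of the kind of deployment under discussion might look like the sketch below: each monitor replica follows its own chain view, every replica stores the update even when it rejects it, and the coordinator only treats the update as accepted (and releases the revocation secret) if every replica accepted. All names are invented; this is not the rust-lightning API.

// Invented types: a rough model of "one channel, several monitor replicas".
struct Update { new_commitment_tx: Vec<u8> }

struct Replica {
    // Each replica follows its own chain view, so `local_tx_signed` can
    // differ between replicas at the moment an update arrives.
    local_tx_signed: bool,
    stored_commitments: Vec<Vec<u8>>,
}

impl Replica {
    fn apply(&mut self, update: &Update) -> Result<(), ()> {
        // Store unconditionally so this replica can still claim from the
        // new state even if it rejects the update.
        self.stored_commitments.push(update.new_commitment_tx.clone());
        if self.local_tx_signed { Err(()) } else { Ok(()) }
    }
}

/// Returns Ok only if *every* replica accepted; otherwise the coordinator
/// must not release the revocation secret for the previous state.
fn fan_out(replicas: &mut [Replica], update: &Update) -> Result<(), ()> {
    let mut any_rejected = false;
    for replica in replicas.iter_mut() {
        if replica.apply(update).is_err() {
            any_rejected = true;
        }
    }
    if any_rejected { Err(()) } else { Ok(()) }
}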
lightning/src/ln/channelmonitor.rs
Outdated
///
/// Should also be used to indicate a failure to update the local persisted copy of the channel
/// monitor.
/// At reception of this error, channel manager MUST force-close the channel and return at
I don't think it makes sense to describe this in these terms - what ChannelManager MUST do isn't interesting to our users, it's what we do! Instead, we should describe it as "ChannelManager will do X".
Addressed.
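For context, the behaviour being documented (per the commit message later in this thread) is roughly: on a permanent monitor-update failure, ChannelManager force-closes, asking the monitor(s) to broadcast the latest local commitment, and never releases the revocation secret for the rejected state. A simplified sketch with invented names follows; it is not the actual rust-lightning error handling.

// Hypothetical, simplified shapes -- not the actual rust-lightning API.
enum MonitorErr {
    /// The update failed permanently (e.g. a watchtower copy rejected it);
    /// the channel cannot continue off-chain.
    PermanentFailure,
    /// The update could not be applied right now but may be retried.
    TemporaryFailure,
}

struct ChannelManagerLike;

impl ChannelManagerLike {
    fn handle_monitor_update_result(&self, channel_id: u64, res: Result<(), MonitorErr>) {
        match res {
            Ok(()) => {
                // Safe to keep advancing the channel state off-chain.
            }
            Err(MonitorErr::TemporaryFailure) => {
                // Pause the channel and retry the update later.
            }
            Err(MonitorErr::PermanentFailure) => {
                // Force-close: ask the monitor(s) to broadcast the latest
                // local commitment they hold, and never release the
                // revocation secret for the rejected state.
                self.force_close_channel(channel_id);
            }
        }
    }

    fn force_close_channel(&self, _channel_id: u64) { /* ... */ }
}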
lightning/src/ln/channelmonitor.rs
Outdated
/// This failure may also signal a failure to update the local persisted copy of one of
/// the channel monitor instance.
///
/// In case of a distributed watchtower deployment, failures sources can be divergent and
I mean also because there's no use in any more granular errors :).
Force-pushed from 764f245 to aaf8c83
Updated; the complete documentation I will do post-anchor/chaininterface refactoring.
Force-pushed from aaf8c83 to 2ad5aa1
lightning/src/ln/channelmonitor.rs
Outdated
/// This failure may also signal a failure to update the local persisted copy of one of
/// the channel monitor instance.
///
/// Note that even when we fail a local commitment transaction update, we still store the
This should be phrased as "you must" and probably be more clear (not "we" in the sense of "rust-lightning does"). Also, we should distinguish between "ChannelMonitor returned" and "you are returning from a ManyChannelMonitor/chain::Watch", though if you want to just make @jkczyz deal with this in #649 that's ok too.
Force-pushed from bb73439 to 7cd09c5
With a distributed watchtower deployment, where each monitor is plugged into its own chain view, there is no guarantee that blocks are going to be seen in the same order. Watchtowers may diverge in their acceptance of a submitted `commitment_signed` update due to a block timing out an HTLC and provoking a broadcast by one subset of watchtowers that has not yet been seen by the other subset. Any update rejected by one of the watchtowers must block the offchain coordinator from moving the channel state forward and releasing the revocation secret for the previous state. In this case, we want any watchtower from the rejecting subset to still be able to claim outputs if the concurrent state, as accepted by the other subset, is confirming. This improves overall watchtower-system fault tolerance.
This change stores the local commitment transaction unconditionally and fails the update if an already signed commitment transaction is known (ChannelMonitor.local_tx_signed=true).
Watchtower Alice receives block 134, broadcasts state X, and rejects state Y. Watchtower Bob accepts state Y, receives block 135, and broadcasts state Y. State Y confirms onchain. Alice must be able to claim her outputs.
In a distributed watchtower deployment, there may be multiple sources of failure. In either case, the channel manager must send a final update asking its channel monitor(s) to broadcast the latest state available. The revocation secret must not be released for the faulty channel. In the future, we may return a wider range of failure types to allow more fine-grained processing decisions (e.g., if a local disk failure occurs but a redundant remote channel copy is available, the channel may still be moved forward).
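To make the commit messages above concrete, here is a toy, self-contained walk-through of the Alice/Bob scenario; the types are invented for the example and do not correspond to the real test or to rust-lightning types.

// Toy end-to-end walk-through of the scenario above (invented types).
#[derive(Clone, PartialEq)]
struct State(&'static str);

struct Watchtower {
    name: &'static str,
    local_tx_signed: bool,
    // Every state ever handed to this watchtower, accepted or not.
    known_states: Vec<State>,
}

impl Watchtower {
    fn on_block(&mut self, latest: &State) -> State {
        // A block made us broadcast our latest known state (e.g. an HTLC timed out).
        self.local_tx_signed = true;
        latest.clone()
    }
    fn update(&mut self, new_state: State) -> Result<(), ()> {
        // Store unconditionally, reject if a local commitment was already signed.
        self.known_states.push(new_state);
        if self.local_tx_signed { Err(()) } else { Ok(()) }
    }
    fn can_claim_from(&self, confirmed: &State) -> bool {
        self.known_states.contains(confirmed)
    }
}

fn main() {
    let x = State("X");
    let y = State("Y");
    let mut alice = Watchtower { name: "alice", local_tx_signed: false, known_states: vec![x.clone()] };
    let mut bob = Watchtower { name: "bob", local_tx_signed: false, known_states: vec![x.clone()] };

    // Alice sees block 134 first and broadcasts state X...
    let _alice_broadcast = alice.on_block(&x);
    // ...then the coordinator submits state Y to both watchtowers.
    assert!(alice.update(y.clone()).is_err()); // rejected, but still stored
    assert!(bob.update(y.clone()).is_ok());
    // Bob sees block 135 and broadcasts state Y, which confirms onchain.
    let confirmed = bob.on_block(&y);
    // Alice rejected Y, yet she can still claim her outputs from it.
    assert!(alice.can_claim_from(&confirmed));
    println!("{} can claim from the confirmed state", alice.name);
}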
Force-pushed from 7cd09c5 to 6622ea7
Implement concurrent broadcast tolerance for distributed watchtowers
Follow-up to #667, which lets us easily implement concurrent tolerance and get rid of confusing OnchainTxHandler/ChannelMonitor comments.