Implement concurrent broadcast tolerance for distributed watchtowers #679
Conversation
Force-pushed from 6ebfabf to 431e0e8
Codecov Report

@@            Coverage Diff             @@
##           master     #679      +/-   ##
==========================================
- Coverage   91.92%   91.91%   -0.02%
==========================================
  Files          35       35
  Lines       20082    20154      +72
==========================================
+ Hits        18461    18524      +63
- Misses       1621     1630       +9

Continue to review the full report at Codecov.
lightning/src/ln/channelmonitor.rs
Outdated
self.current_local_commitment_number = 0xffff_ffff_ffff - ((((sequence & 0xffffff) << 3*8) | (locktime as u64 & 0xffffff)) ^ self.commitment_transaction_number_obscure_factor);
mem::swap(&mut new_local_commitment_tx, &mut self.current_local_commitment_tx);
self.prev_local_signed_commitment_tx = Some(new_local_commitment_tx);
if self.local_tx_signed {
	return Err(MonitorUpdateError("Latest local commitment signed has already been signed, update is rejected"));
Updating before returning an Err probably needs a documentation update to indicate to the user that they should still write out the new version to disk even in the case of an Err (which is a super surprising API).
Added a new comment at the top-level API: ManyChannelMonitor::update_monitor.
I know this API is surprising, but you rarely have a distributed system driven by both a central coordinator and a p2p network.
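For illustration only, here is a rough, self-contained sketch of a persistence layer honoring that contract: it writes out the (possibly mutated) monitor even when the update itself is rejected. All type and function names below are invented for the example and are not the actual rust-lightning API.

// Hypothetical, simplified shapes -- not the actual rust-lightning types.
use std::collections::HashMap;
use std::sync::Mutex;

struct MonitorUpdateError(&'static str);
struct ChannelMonitorUpdate { /* serialized update steps */ }
struct ChannelMonitor { /* channel state needed to claim on-chain */ }

impl ChannelMonitor {
    /// May mutate internal state *and* return Err (e.g. when a local
    /// commitment was already signed and the update must be rejected).
    fn update_monitor(&mut self, _update: &ChannelMonitorUpdate) -> Result<(), MonitorUpdateError> {
        Ok(())
    }
    fn serialize(&self) -> Vec<u8> { Vec::new() }
}

struct PersistingMonitorStore {
    monitors: Mutex<HashMap<u64, ChannelMonitor>>,
}

impl PersistingMonitorStore {
    fn update_monitor(&self, channel_id: u64, update: ChannelMonitorUpdate) -> Result<(), MonitorUpdateError> {
        let mut monitors = self.monitors.lock().unwrap();
        let monitor = monitors.get_mut(&channel_id).expect("unknown channel");
        // Apply the update first and remember the outcome...
        let result = monitor.update_monitor(&update);
        // ...but persist the (possibly mutated) monitor unconditionally:
        // even a rejected update may have stored a commitment transaction
        // that this watchtower later needs to claim from.
        write_to_disk(channel_id, &monitor.serialize());
        result
    }
}

fn write_to_disk(_channel_id: u64, _bytes: &[u8]) { /* fsync, replicate, ... */ }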
lightning/src/ln/channelmonitor.rs
Outdated
if let Err(_) = self.onchain_tx_handler.provide_latest_local_tx(commitment_tx) {
	return Err(MonitorUpdateError("Local commitment signed has already been signed, no further update of LOCAL commitment transaction is allowed"));
}
self.onchain_tx_handler.provide_latest_local_tx(commitment_tx);
Can we assert somewhere in here that prev_local_signed_commitment_tx must have been None before this? I suppose return Err and debug_assert!(false).
We don't explicitly prune out prev_local_signed_commitment=None.
We could add a round-trip between coordinator and watchtowers with a ChannelMonitorUpdateStep::PrunePrevLocalCommitment, but that would be a latency hit for nothing, assuming we never return an Err if we broadcast a local commitment and the coordinator fails to advance the channel forward by releasing the secret. Note that the broadcast-commitment-then-reject-update case is covered in the new test (watchtower_alice failing the update_monitor).
Ah, right.
lightning/src/ln/functional_tests.rs
Outdated
#[test]
fn test_concurrent_monitor_claim() {
// Watchtower Alice receives block 135, broadcasts state X, rejects state Y.
Can you make this comment more chronological? I think this is what threw me off the other day:
Watchtower A receives block, broadcasts state,
then channel receives new state Y, sending it to both watchtowers,
Bob accepts Y, then receives block and broadcasts the latest state Y,
Alice rejects state Y, but Bob has already broadcasted it,
Y confirms.
I didn't read through the test yet, but now I better understand what you're going for. A few relatively minor comments, but conceptually this looks good.
Force-pushed from 431e0e8 to 963aaa6
@@ -823,6 +830,10 @@ pub struct ChannelMonitor<ChanSigner: ChannelKeys> {
// Set once we've signed a local commitment transaction and handed it over to our
// OnchainTxHandler. After this is set, no future updates to our local commitment transactions
// may occur, and we fail any such monitor updates.
//
// In case of update rejection due to a locally already signed commitment transaction, we
Maybe:
Note that even when we fail a local commitment transaction update, we still store the update to ensure we can claim from it in case a duplicate copy of this ChannelMonitor broadcasts it.
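As a rough illustration of the behaviour this flag gates (store the update unconditionally, then reject it if a local commitment was already signed), here is a simplified, self-contained sketch; the names loosely mirror the fields discussed above, but this is not the actual ChannelMonitor code.

// Simplified, self-contained sketch of the intended behaviour.
struct MonitorUpdateError(&'static str);

struct MonitorState {
    current_local_commitment_tx: Option<Vec<u8>>,
    prev_local_signed_commitment_tx: Option<Vec<u8>>,
    /// Set once a local commitment has been signed and handed to the
    /// broadcaster; afterwards no further local-commitment updates are valid.
    local_tx_signed: bool,
}

impl MonitorState {
    fn provide_latest_local_commitment_tx(&mut self, tx: Vec<u8>) -> Result<(), MonitorUpdateError> {
        // Store the update *first*, unconditionally: even if we reject it,
        // a duplicate copy of this monitor may broadcast this commitment,
        // and we must be able to claim our outputs from it.
        self.prev_local_signed_commitment_tx = self.current_local_commitment_tx.take();
        self.current_local_commitment_tx = Some(tx);

        if self.local_tx_signed {
            return Err(MonitorUpdateError(
                "Latest local commitment already signed, update rejected"));
        }
        Ok(())
    }
}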
@@ -887,6 +898,11 @@ pub trait ManyChannelMonitor: Send + Sync {
///
/// Any spends of outputs which should have been registered which aren't passed to
/// ChannelMonitors via block_connected may result in FUNDS LOSS.
///
/// In case of distributed watchtowers deployment, even if an Err is return, the new version
This needs more detail about the exact setup of a distributed ChannelMonitor, as several are possible (I think it's a misnomer to call it a watchtower, as that implies something specific about availability of keys which isn't true today), and probably this comment should be on ChannelMonitor::update_monitor, not the trait (though it could be referenced in the trait), because the trait's return value is only relevant for ChannelManager.
I know writing documentation for a distributed ChannelMonitor is tracked by #604. I intend to do it post-anchor, as there are implications for bumping-utxo management, namely that you should also duplicate those keys across distributed monitors.
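Pending that documentation, a toy model of the kind of deployment under discussion might look like the sketch below: each monitor replica follows its own chain view, every replica stores the update even when it rejects it, and the coordinator only treats the update as accepted (and releases the revocation secret) if every replica accepted. All names are invented; this is not the rust-lightning API.

// Invented types: a rough model of "one channel, several monitor replicas".
struct Update { new_commitment_tx: Vec<u8> }

struct Replica {
    // Each replica follows its own chain view, so `local_tx_signed` can
    // differ between replicas at the moment an update arrives.
    local_tx_signed: bool,
    stored_commitments: Vec<Vec<u8>>,
}

impl Replica {
    fn apply(&mut self, update: &Update) -> Result<(), ()> {
        // Store unconditionally so this replica can still claim from the
        // new state even if it rejects the update.
        self.stored_commitments.push(update.new_commitment_tx.clone());
        if self.local_tx_signed { Err(()) } else { Ok(()) }
    }
}

/// Returns Ok only if *every* replica accepted; otherwise the coordinator
/// must not release the revocation secret for the previous state.
fn fan_out(replicas: &mut [Replica], update: &Update) -> Result<(), ()> {
    let mut any_rejected = false;
    for replica in replicas.iter_mut() {
        if replica.apply(update).is_err() {
            any_rejected = true;
        }
    }
    if any_rejected { Err(()) } else { Ok(()) }
}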
lightning/src/ln/channelmonitor.rs
Outdated
///
/// Should also be used to indicate a failure to update the local persisted copy of the channel
/// monitor.
/// At reception of this error, channel manager MUST force-close the channel and return at
I don't think it makes sense to describe this in these terms - what ChannelManager MUST do isn't interesting to our users, it's what we do! Instead, we should describe it as "ChannelManager will do X".
Addressed.
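For context, the behaviour being documented (per the commit message later in this thread) is roughly: on a permanent monitor-update failure, ChannelManager force-closes, asking the monitor(s) to broadcast the latest local commitment, and never releases the revocation secret for the rejected state. A simplified sketch with invented names follows; it is not the actual rust-lightning error handling.

// Hypothetical, simplified shapes -- not the actual rust-lightning API.
enum MonitorErr {
    /// The update failed permanently (e.g. a watchtower copy rejected it);
    /// the channel cannot continue off-chain.
    PermanentFailure,
    /// The update could not be applied right now but may be retried.
    TemporaryFailure,
}

struct ChannelManagerLike;

impl ChannelManagerLike {
    fn handle_monitor_update_result(&self, channel_id: u64, res: Result<(), MonitorErr>) {
        match res {
            Ok(()) => {
                // Safe to keep advancing the channel state off-chain.
            }
            Err(MonitorErr::TemporaryFailure) => {
                // Pause the channel and retry the update later.
            }
            Err(MonitorErr::PermanentFailure) => {
                // Force-close: ask the monitor(s) to broadcast the latest
                // local commitment they hold, and never release the
                // revocation secret for the rejected state.
                self.force_close_channel(channel_id);
            }
        }
    }

    fn force_close_channel(&self, _channel_id: u64) { /* ... */ }
}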
lightning/src/ln/channelmonitor.rs
Outdated
/// This failure may also signal a failure to update the local persisted copy of one of
/// the channel monitor instance.
///
/// In case of a distributed watchtower deployment, failures sources can be divergent and
I mean also because there's no use in any more granular errors :).
Force-pushed from 764f245 to aaf8c83
Updated; the complete documentation I will do post-anchor/chaininterface refactoring.
Force-pushed from aaf8c83 to 2ad5aa1
lightning/src/ln/channelmonitor.rs
Outdated
/// This failure may also signal a failure to update the local persisted copy of one of
/// the channel monitor instance.
///
/// Note that even when we fail a local commitment transaction update, we still store the
This should be phrased as "you must" and probably be more clear (not "we" in the sense of "rust-lightning does"). Also, we should distinguish between "ChannelMonitor returned" and "you are returning from a ManyChannelMonitor/chain::Watch", though if you want to just make @jkczyz deal with this in #649 that's ok too.
Force-pushed from bb73439 to 7cd09c5
With a distributed watchtower deployment, where each monitor is plugged into its own chain view, there is no guarantee that blocks are going to be seen in the same order. Watchtowers may diverge in their acceptance of a submitted `commitment_signed` update due to a block timing out an HTLC and provoking a broadcast by one subset of watchtowers that has not yet been seen by the other subset. Any update rejected by one of the watchtowers must block the offchain coordinator from moving the channel state forward and releasing the revocation secret for the previous state. In this case, we want any watchtower from the rejecting subset to still be able to claim outputs if the concurrent state, as accepted by the other subset, is confirming. This improves overall watchtower-system fault tolerance.
This change stores the local commitment transaction unconditionally and fails the update if an already signed commitment transaction is known (ChannelMonitor.local_tx_signed=true).
Watchtower Alice receives block 134, broadcasts state X, and rejects state Y. Watchtower Bob accepts state Y, receives block 135, and broadcasts state Y. State Y confirms onchain. Alice must be able to claim her outputs.
In a distributed watchtower deployment, there may be multiple sources of failure. In either case, the channel manager must send a final update asking its channel monitor(s) to broadcast the latest state available. The revocation secret must not be released for the faulty channel. In the future, we may return a wider range of failure types to allow more fine-grained processing decisions (e.g., if a local disk failure occurs but a redundant remote channel copy is available, the channel may still be moved forward).
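To make the commit messages above concrete, here is a toy, self-contained walk-through of the Alice/Bob scenario; the types are invented for the example and do not correspond to the real test or to rust-lightning types.

// Toy end-to-end walk-through of the scenario above (invented types).
#[derive(Clone, PartialEq)]
struct State(&'static str);

struct Watchtower {
    name: &'static str,
    local_tx_signed: bool,
    // Every state ever handed to this watchtower, accepted or not.
    known_states: Vec<State>,
}

impl Watchtower {
    fn on_block(&mut self, latest: &State) -> State {
        // A block made us broadcast our latest known state (e.g. an HTLC timed out).
        self.local_tx_signed = true;
        latest.clone()
    }
    fn update(&mut self, new_state: State) -> Result<(), ()> {
        // Store unconditionally, reject if a local commitment was already signed.
        self.known_states.push(new_state);
        if self.local_tx_signed { Err(()) } else { Ok(()) }
    }
    fn can_claim_from(&self, confirmed: &State) -> bool {
        self.known_states.contains(confirmed)
    }
}

fn main() {
    let x = State("X");
    let y = State("Y");
    let mut alice = Watchtower { name: "alice", local_tx_signed: false, known_states: vec![x.clone()] };
    let mut bob = Watchtower { name: "bob", local_tx_signed: false, known_states: vec![x.clone()] };

    // Alice sees block 134 first and broadcasts state X...
    let _alice_broadcast = alice.on_block(&x);
    // ...then the coordinator submits state Y to both watchtowers.
    assert!(alice.update(y.clone()).is_err()); // rejected, but still stored
    assert!(bob.update(y.clone()).is_ok());
    // Bob sees block 135 and broadcasts state Y, which confirms onchain.
    let confirmed = bob.on_block(&y);
    // Alice rejected Y, yet she can still claim her outputs from it.
    assert!(alice.can_claim_from(&confirmed));
    println!("{} can claim from the confirmed state", alice.name);
}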
Force-pushed from 7cd09c5 to 6622ea7
Implement concurrent broadcast tolerance for distributed watchtowers
Follow-up to #667, which lets us easily implement concurrent tolerance and get rid of confusing OnchainTxHandler/ChannelMonitor comments.