-
Notifications
You must be signed in to change notification settings - Fork 407
Add ChannelDataPersister trait and point ChainMonitor to it. #681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ChannelDataPersister trait and point ChainMonitor to it. #681
Conversation
Codecov Report
@@ Coverage Diff @@
## master #681 +/- ##
==========================================
+ Coverage 91.34% 91.36% +0.01%
==========================================
Files 37 37
Lines 21974 22067 +93
==========================================
+ Hits 20073 20162 +89
- Misses 1901 1905 +4
Continue to review full report at Codecov.
|
just confirming that there's no plan to persist anything before the funding tx is constructed. i.e. if the node restarts after a channel open but before |
Channel data is persisted at the same time as before this PR. Looking in |
595db8a
to
5bb24fb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach ACK.
I still need to think about the consistency requirements with regards to higher layer of channel monitors but overall the flexibility offered is super cool :)
if let Some(orig_monitor) = monitors.get_mut(&outpoint) { | ||
log_trace!(self.logger, "Updating Channel Monitor for channel {}", log_funding_info!(orig_monitor)); | ||
if let Err(e) = orig_monitor.update_monitor(&update, &self.broadcaster, &self.logger) { | ||
return Err(e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per changes at https://github.com/rust-bitcoin/rust-lightning/pull/679/files#diff-dfe4c687889c83110048d0bb7a250ee2R147 we shouldn't return early, instead opting to store the new updated ChannelMonitor. Somewhat awkwardly, we probably need to only do so on some potential MonitorUpdateErrors - We should probably migrate towards those errors containing a flag to indicate if the situation implies a bogus update or simply a locked channel.e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok cool, I interpret this as: it'd be a good idea to add that "bogus update or locked channel" flag you mention in this PR.
Looking great so far to me, bu definitely curious about the persistence pipeline for memory -> disk during state negotiation procedures. I think that is where revisiting incremental updates would be very helpful, as those ought to be way less performance-heavy. |
This is great. My progress had been blocked with the lack of this feature in RL. Previously I could not know precisely when I should take SCB, because when a channel is closed there is no way to know about it besides polling the ChannelManager. |
5bb24fb
to
bd7e4b9
Compare
Hmm this PR isn't doesn't have the feature of notifying you on channel close (well, technically you'll get a callback on force close, but not on coop close). I filed #695 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running out of steam tonight. Will continue reviewing tomorrow.
35b34be
to
3ae7184
Compare
ff43ee6
to
627871c
Compare
lightning/src/ln/data_persister.rs
Outdated
/// supposed to be persisting bytes of data, backup sources may be considered | ||
/// "in consensus" if a majority of them agree on the current state and are on | ||
/// the highest known commitment number. See individual methods for more info. | ||
pub trait ChannelDataPersister: Send + Sync { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the risk of being pedantic: how about simply ChannelPersister
? In other words, would removing the word "data" lead to any ambiguity? Or should this rather be called ChannelMonitorPersister
if it needs to differentiate from persisting Channel
s?
From a different angle, data_persister
as used here as a module name and later as a variable name is ambiguous when reading code. Generally speaking, "data" usually can be substituted with some abstraction (e.g. channel) to provide clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, should we factor in that we intend to persist other data through this module in the future as well (e.g. ChannelManager
s, Channel
s, etc)? Maybe it should just be called the Persister
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that ChannelMonitor
and Channel
should be persisted atomically together, otherwise inconsistency could result. So that could be part of this trait.
Perhaps the non-channel-specific things in ChannelManager
should be persisted using a different trait. That could be named e.g. NodePersister
or such, since it includes non-channel relevant things like the announcement serial number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. The thing is that iiuc, stuff that needs to be persisted is pretty much divided into ChannelMonitor
s (which need to be time sensitively persisted whenever a channel is updated) and everything else, which should be persisted sometimes and would suck to lose but isn't at the same level of sensitivity.
How often the user wants to persist the less-sensitive stuff is subjective. We could make that decision for them (like the ChannelDataPersister
does) but different users may have different fault-tolerance preferences. Not sure what the solution is here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok discussed offline, update:
There's only 2 structs that need to be persisted: (1) the ChannelMonitor
s, and (2) the ChannelManager
. The ChannelMonitor
s are sensitive data, and them being out of date could result in funds lost. However, the ChannelManager
is not sensitive, and if e.g. a node starts up with an out of date ChannelManager
, this will result in channels being force closed, but there's no chance for loss of funds.
So basically this PR is gonna focus on persisting ChannelMonitor
s, and we'll figure out how ChannelManager
persistence is gonna work later. And the remaining open question is what to name the ChannelDataPersister
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see several fields in Channel
that might be critical for correct functioning so need persistence, including:
channel_state
channel_outbound
value_to_self_msat
short_channel_id
(also needed to restoreChannelHolder
correctly)
and maybe others. They probably won't cause loss of funds if out of date, but the system just won't work when restarted and restored from disk.
Also, ChannelHolder
might have information that is required in order to be able to claim HTLCs, so if those are lost we can lose the HTLC value.
If we lose ChannelManager.last_node_announcement_serial
, what is the impact?
Lastly, the KeysManager
inside the ChannelManager
has some state, including channel_child_index
and rand_bytes_child_index
. I'm not sure what is the impact if we end up with the same keyset for two channels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and maybe others. They probably won't cause loss of funds if out of date, but the system just won't work when restarted and restored from disk.
You need to store the whole channel policy as announced to your counterparty, including its own policy, otherwise you may have channel policy discrepancies and your counterparty closing the channel unexpectedly (e.g a max_accepted_htlcs
violation).
Also, ChannelHolder might have information that is required in order to be able to claim HTLCs, so if those are lost we can lose the HTLC value.
Good point, you need to store HTLC-relay data too, otherwise you may lost an in-flight update_fail_htlc
form an downstream channel to an upstream one.
I don't think we need to store the Events as they should be re-generated if we re-apply the state correctly to every component.
If we lose ChannelManager.last_node_announcement_serial, what is the impact?
BOLT7:
if node_id is NOT previously known from a channel_announcement message, OR if timestamp is NOT greater than the last-received node_announcement from this node_id: SHOULD ignore the message.
Lastly, the KeysManager inside the ChannelManager has some state, including channel_child_index and rand_bytes_child_index. I'm not sure what is the impact if we end up with the same keyset for two channels.
Fund safety, so we should be careful storing it.
lightning/src/ln/data_persister.rs
Outdated
/// | ||
/// [`update_id`]: struct.ChannelMonitorUpdate.html#structfield.update_id | ||
/// [`ChannelMonitor::write_for_disk`]: ../../chain/channelmonitor/struct.ChannelMonitor.html#method.write_for_disk | ||
fn persist_channel_data(&self, id: OutPoint, data: &ChannelMonitor<Self::Keys>) -> Result<(), ChannelMonitorUpdateErr>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on the outcome of the above naming discussion, consider naming this persist_new_channel
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our discussion in the call, it seemed like our current idea of what our sample module will look like is: on persist_channel_data
, the monitor update will be somehow persisted, and persistence of the whole ChannelManager
will be kicked off in the background.
Given this, maybe Persister
is the best name, and then indeed we can rename these methods persist_new_channel
/ update_channel
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, ChainMonitor
will use the persister as currently implemented in this PR. ChannelManager
can fire an event for persisting the ChannelManager
that users can then process in the background. The latter can be implemented in a follow-up PR.
Regarding naming, Persister
is probably fine so long as we don't need another trait to persist something else. Or possibly simply Persist
to follow the loose rust trait naming conventions, which I went with in the chain
module traits. Additionally, this trait could be moved to channelmonitor.rs
which would make for a less ambiguous, module-qualified name: channelmonitor::Persist
.
I don't have strong opinion yet on the function names. IIRC, we even discussed renaming ChannelMonitor
(which could affect these function names), but I'm forgetting where that conversation went. I think I was opposed to it but can't recall the alternative name. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to the names you suggested and channelmonitor::Persist
. I think it's much nicer.
pub struct MonitorUpdateError { | ||
/// The human-readable error message string. | ||
pub msg: &'static str, | ||
/// A monitor update may have an error but still need to be persisted to disk. | ||
pub persist_updated_monitor: bool | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make this error an enum instead of adding a bool field. Not sure about naming but maybe something akin to the variants in ChannelMonitorUpdateErr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lmk what you think of the naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, w/r/t the naming of ChannelMonitorUpdateErr
and MonitorUpdateError
-- what do you think of just renaming MonitorUpdateError
to InternalMonitorErr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on our offline discussion, I'm starting to become of the mind that these should be reduced to a single error type. My reasoning is as follows:
ChannelMonitor
's methods can return errors.chain::Watch
's interface is defined in terms ofChannelMonitor
,ChannelMonitorUpdate
, andChannelMonitorUpdateErr
.ChainMonitor
is an implementation ofchain::Watch
, but there may be other implementations of the trait.
Therefore, whether a ChannelMonitor
must be persisted isn't necessarily always an internal detail handled by ChainMonitor
; another implementation of chain::Watch
or someone using ChannelMonitor
directly would also need to do so.
My thought then would be to have one error type where the PermanentFailure
variant has a bool indicating if it should be persisted.
That said, I'd like to get a better idea about where in the code that the "failure but persist" error will occur. I couldn't quite make that out from #681 (comment). Additionally, the docs for PermanentFailure
indicate it can be used in two different ways. This makes me wonder if it should then be two different variants (and if so would this would remove the need for the bool?).
Thoughts? Let me know if I misunderstood anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think they're definitely very different in their purpose, though I don't necessarily think that implies they can't be the same type -
MonitorUpdateError
is an error type returned by RL to the user to indicate that the update provided to the monitor was somehow bogus. It instructs the user whether they should still persist the updated object to disk, but in any case the update was not fully applied and ready to move forward.ChannelMonitorUpdateErr
is an error type generated by the user instructing theChannel{,Manager}
what the current state of the monitor update is. It is maybe more aptly namedChannelMonitorUpdateStatus
(and replacing theResult
with just a general done/in-flight/failed enum).
The should-persist flag is definitely very awkward, and we should take pains in the documentation to point out that it is only relevant if you are running in a multi-primary redundant ChannelMonitor
setup where each ChannelMonitor
can unilaterally broadcast a transaction. In other words, for most folks, we can pretend its not there. See #679 for the one case where its ever set/matters. For normal folks, the MonitorUpdateError
is just a human-readable error string that's arguably only useful for opening an issue on our github.
Indeed, for more complicated setups, arguably, they start to blend a bit more, but I do think there's value in them being different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good observation. I like where you're going with making ChannelMonitorUpdateErr
a status instead of an error.
May need to think on this more, but just to throw an idea out there: would it make sense to have the chain::Watch
interface return something like Result<ChannelMonitorUpdateStatus, MonitorUpdateError>
? Possibly renaming these in some way, of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Result<ChannelMonitorUpdateStatus, MonitorUpdateError>
Hmm, interesting point. I suppose we could, though it doesn't mean anything to the ChannelManager
which receives it - Err(_)
would be converted to a PermanentFailure
status either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Result<ChannelMonitorUpdateStatus, MonitorUpdateError>
Hmm, interesting point. I suppose we could, though it doesn't mean anything to the
ChannelManager
which receives it -Err(_)
would be converted to aPermanentFailure
status either way.
Not 100% sure if I follow that last point. Presumably, ChannelManager
would process either the status if the Result
is ok
or the error if it is err
using handle_monitor_err!
.
It's the ChainMonitor
that currently converts any MonitorUpdateError
to a PermanentFailure
. In what I'm floating, ChainMonitor
would examine the Result
and persist if needed and then simply return the Result
as is back to ChannelManager
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, my point was only that that would be communicating extra information to the ChannelManager
that it doesnt care about - PermanentFailure
is the same to it as Err(_)
.
edb2855
to
4ff66ef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New code looks pretty good, a few comments and I still need to thoroughly scan the new documentation, but can do after 649. Don't have strong feelings on naming.
} | ||
if should_persist { | ||
match self.data_persister.update_channel_data(outpoint, &update, orig_monitor) { | ||
Err(e) => return Err(e), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This must not "upgrade" a PermanentFailure
in res to a TemporaryFailure
.
lightning-data-persister/src/lib.rs
Outdated
{ | ||
let mut f = fs::File::create(&tmp_filename)?; | ||
monitor.write_for_disk(&mut f)?; | ||
f.sync_all()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this to be sensible on OSX, it looks like you need at least rust stdlib 1.36, at least going by this commit: rust-lang/rust@d602a6b. This should be documented somwehre.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I documented this but only by leaving a comment. Do you think it'd be ideal to add a README for gotchas like this? (It'd be somewhat redundant to the comments in the file.)
5cc5671
to
2d88541
Compare
ef92009
to
9ca8891
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Largely happy where this is now! 👍 Happy to discuss the testing strategy more offline.
// Test failure to rename in the process of atomically creating a channel | ||
// monitor's file. We induce this failure by making the `tmp` file a | ||
// directory. | ||
// Explanation: given "from" = the file being renamed, "to" = the | ||
// renamee that already exists: Windows should fail because it'll fail | ||
// whenever "to" is a directory, and Unix should fail because if "from" is a | ||
// file, then "to" is also required to be a file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this actually test temp file creation (line 81) rather than the rename (line 85)? It's a good test to have regardless, though I think in both cases creating a read-only version of the temp and final files, respectively, may be sufficient to handle any platform... shooting from the hip on the last part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in both cases creating a read-only version of the temp and final files, respectively, may be sufficient to handle any platform
I wasn't sure what this part meant.
(Btw I remembered last night you're on vacation right now, I'm just responding to your responses for posterity). But, good catch and I kept this test and added an actual rename faliure unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was just thinking for either test, the presence of a read-only file ("to" or "from" depending on the test) might be sufficient to trigger the case without using a directory. Using a directory is an extra detail for the reader (different type vs invalid permission), so if possible it may be better to avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't seem work, at least on unix. This is what I tried changing the test to: https://i.imgur.com/yOHqQB5.png (sorry not the most readable). Lmk if i'm missing something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this really needs to be fixed, otherwise remove_dir_all
is wiping out all the intent of persister as it doesn't check for emptiness ?
New tests sound good.
fc5d755
to
d546c91
Compare
Yeah really good catch that Jeff also noticed, fixed. |
07de61f
to
ed6101f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested ACK ed6101f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. A few nits, only really worth addressing if you're squashing/making other changes anyway. What say you @jkczyz
The utilities will still be part of cfg(test). The purpose of this is to enable other crates in the workspace to use the test utilities.
Implementors of Persist are responsible for persisting new ChannelMonitors and ChannelMonitor updates to disk/backups.
- The ChainMonitor should: Whenever a new channel is added or updated, these updates should be conveyed to the persister and persisted to disk. Even if the update errors while it's being applied, the updated monitor still needs to be persisted.
d3aeaf8
to
8a9e1f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. A few nits, only really worth addressing if you're squashing/making other changes anyway. What say you @jkczyz
LGTM. Sorry about the delay! Feel free to address any comments in a follow-up, if needed.
// Test failure to rename in the process of atomically creating a channel | ||
// monitor's file. We induce this failure by making the `tmp` file a | ||
// directory. | ||
// Explanation: given "from" = the file being renamed, "to" = the | ||
// renamee that already exists: Windows should fail because it'll fail | ||
// whenever "to" is a directory, and Unix should fail because if "from" is a | ||
// file, then "to" is also required to be a file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was just thinking for either test, the presence of a read-only file ("to" or "from" depending on the test) might be sufficient to trigger the case without using a directory. Using a directory is an extra detail for the reader (different type vs invalid permission), so if possible it may be better to avoid.
// Set up a dummy channel and force close. This will produce a monitor | ||
// that we can then use to test persistence. | ||
let chanmon_cfgs = create_chanmon_cfgs(2); | ||
let node_cfgs = create_node_cfgs(2, &chanmon_cfgs); | ||
let node_chanmgrs = create_node_chanmgrs(2, &node_cfgs, &[None, None]); | ||
let nodes = create_network(2, &node_cfgs, &node_chanmgrs); | ||
let chan = create_announced_chan_between_nodes(&nodes, 0, 1, InitFeatures::known(), InitFeatures::known()); | ||
nodes[1].node.force_close_channel(&chan.2); | ||
let mut added_monitors = nodes[1].chain_monitor.added_monitors.lock().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid this setup, could you test the persister using write_channel_data
like the earlier tests? Likewise in the next test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it's nice to be able to test that it returns a PermanentError
(since ultimately that's what we care about)
Intended to be a cross-platform implementation of the channelmonitor::Persist trait. This adds a new lightning-persister crate, that uses the newly exposed lightning crate's test utilities. Notably, this crate is pretty small right now. However, due to future plans to add more data persistence (e.g. persisting the ChannelManager, etc) and a desire to avoid pulling in filesystem usage into the core lightning package, it is best for it to be separated out. Note: Windows necessitates the use of OpenOptions with the `write` permission enabled to `sync_all` on a newly opened channel's data file.
If a persister returns a temporary failure, the channel monitor should be able to be put on ice and then revived later. If a persister returns a permanent failure, the channel should be force closed.
This function does not necessarily write to disk, it can serialize to anything that implements Writer.
8a9e1f3
to
cda8fb7
Compare
One or two tests may need a small follow-up, but looks like mostly all comments addressed :) |
…gdevkit#681) Previously we'd ignored generic arguments in traits, leading to bogus code generation after the Persister trait was added in lightningdevkit#681. This adds minimal support for it, fixing code generation on latest upstream.
…gdevkit#681) Previously we'd ignored generic arguments in traits, leading to bogus code generation after the Persister trait was added in lightningdevkit#681. This adds minimal support for it, fixing code generation on latest upstream.
Bindings Updates for #681 and more Clones
Intended to be a simple way for users to know where and how to put their
backup and persistence logic.
Sanity check/architecture reviews appreciated.
TODOS:
std::path
fsync
'ing the parent directory on writeDepends on #713 and #649.