Rewrite lightning-net-tokio using async/await and tokio 0.2 #472

Merged

Conversation

TheBlueMatt
Collaborator

This is a rather major rewrite, using async/await and tokio 0.2,
which cleans up the code a ton as well as adds significantly to
readability.

CC #423.

@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch 3 times, most recently from b34f482 to f0178a4 Compare February 1, 2020 22:46
@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch from f0178a4 to 2dd2273 Compare February 3, 2020 23:48
@TheBlueMatt
Collaborator Author

Pushed with a few fixes, plus a workaround for tokio-rs/tokio#2213 (not sure whether to call it a bug, will wait for the tokio folks to comment).

@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch from 2dd2273 to 6bc40a0 Compare February 14, 2020 06:55
@TheBlueMatt
Collaborator Author

Tokio upstream fixed 2213, so I've reverted the awkward fix for it, though this is now blocked until the next tokio release.

Contributor

@jkczyz jkczyz left a comment

FYI, I'm reviewing this as entirely new code since #339 was not reviewed by anyone.

}

/// Process incoming messages and feed outgoing messages on the provided socket generated by
/// accepting an incoming connection (by scheduling futures with tokio::spawn).
///
/// You should poll the Receive end of event_notify and call get_and_clear_pending_events() on
/// ChannelManager and ChannelMonitor objects.
pub fn setup_inbound<CMH: ChannelMessageHandler + 'static>(peer_manager: Arc<peer_handler::PeerManager<SocketDescriptor<CMH>, Arc<CMH>>>, event_notify: mpsc::Sender<()>, stream: TcpStream) {
let (reader, us) = Self::new(event_notify, stream);
pub async fn setup_inbound<CMH: ChannelMessageHandler + 'static>(peer_manager: Arc<peer_handler::PeerManager<SocketDescriptor, Arc<CMH>>>, event_notify: mpsc::Sender<()>, stream: TcpStream) {
Contributor

Trying to understand the design decisions: what's the reasoning behind not returning a Connection? It seems like it could be an ideal abstraction for users to interact with.

Collaborator Author

Just made Connection non-pub as it didn't make sense to expose it. Let me know if it's clearer now.

pub struct Connection {
writer: Option<mpsc::Sender<bytes::Bytes>>,
writer: Option<io::WriteHalf<TcpStream>>,
event_notify: mpsc::Sender<()>,
Contributor

Could you explain the purpose of event_notify to me? It isn't readily apparent how this is supposed to be used within this module and by the user.

Collaborator Author

Hopefully the docs on the connection functions are a bit clearer now.

Contributor

Added a comment on one of those functions with regards to clarification.

The module-level documentation should describe how events are used with the API, including a concise example, as it seems that is the primary interaction users have after creating connections. The explanation should not need to be repeated on all three connection functions.

Comment on lines 168 to 169
/// You should poll the Receive end of event_notify and call get_and_clear_pending_events() on
/// ChannelManager and ChannelMonitor objects.
Contributor

This comment seems out of place as it presupposes the reader knows how event_notify is used in the API, which AFAICT is never explained. The context around calling "get_and_clear_pending_events() on ChannelManager and ChannelMonitor objects" is also missing.

Comment on lines 165 to 166
/// Process incoming messages and feed outgoing messages on the provided socket generated by
/// accepting an incoming connection (by scheduling futures with tokio::spawn).
Contributor

This sentence is a bit verbose, which indicates to me that this may be doing too much. As per my comment on line 179, I'd like to explore whether we can provide an asynchronous API to the user based around a Connection abstraction. But I need to get a better understanding of the design before giving any concrete recommendations with regards to its interface.


(reader, us)
(reader, receiver,
Arc::new(Mutex::new(Self {
Contributor

I'm not quite sure I follow why an Arc and a Mutex are needed. Could you describe a scenario for me where a Connection may be accessed concurrently?

}
}
impl<CMH: ChannelMessageHandler> peer_handler::SocketDescriptor for SocketDescriptor<CMH> {
impl peer_handler::SocketDescriptor for SocketDescriptor {
Contributor

Could you provide a high-level description of this implementation?

Contributor

Just wanted to ping this comment. Trying to get a grasp of how send_data works, specifically wrt how Waker and RawWaker fit in. Feel free to point to a reference.

Collaborator Author

I totally missed this, gotta love the GitHub UI. The only way I know how to surface old comments again is to leave a response as a part of a review, then GH will show that response duplicatively but at the end of the discussion.

In general, Rust futures (which you can mentally model as green threads) work by calling a poll function, which either makes progress or it doesn't. To know when to poll, you provide a waker which receives a call when the runtime (in this case Tokio) thinks polling makes sense now (ie because we have more room in the socket write buffer, as returned by select() etc). Here, we use that to wake the read/process "thread" and then it calls write_buffer_space_avail() which will call send_data() and we'll end up back at poll().
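To make that cycle concrete, here is a minimal, hypothetical sketch (not code from this PR) of a hand-written future that parks itself until some other code (say, a send_data()-style path noticing free buffer space) wakes it:

use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

// Shared state between the future and whoever will eventually wake it.
struct Shared {
    ready: bool,
    waker: Option<Waker>,
}

struct WriteReady(Arc<Mutex<Shared>>);

impl Future for WriteReady {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut shared = self.0.lock().unwrap();
        if shared.ready {
            Poll::Ready(())
        } else {
            // Not ready: stash the waker so the other side can wake this
            // task, causing the runtime to call poll() again.
            shared.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

// Called from "the other side" (e.g. when the socket has buffer space again):
fn mark_ready(shared: &Arc<Mutex<Shared>>) {
    let mut s = shared.lock().unwrap();
    s.ready = true;
    if let Some(waker) = s.waker.take() {
        waker.wake();
    }
}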

Contributor

Thanks for explaining. I have a high-level understanding of how this works, and I don't have a problem following what send_data is doing.

The only open question I have is: why are RawWaker, RawWakerVTable, and the unsafe code needed? I couldn't find them in the async book, which instead demonstrates creating a Waker by implementing the ArcWake trait. But I don't know if that's applicable in this case.

No need to hold the review up on this, but curious for my own understanding of why one approach is chosen over another.

Collaborator Author

It looks like the ArcWake trait is just a utility to do essentially what we're doing, but provided by the futures crate, which we don't depend on here.

Collaborator Author

Sadly, the way Futures were stabilized upstream is pretty bare-bones. The only way to implement your own future is with all the unsafe code and raw function pointers you see here, or via the async keyword on a function.
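For the curious, this is roughly what that bare-bones stable surface looks like: a hypothetical, minimal no-op waker built from RawWaker/RawWakerVTable (illustrative only; the PR's real waker carries actual wake logic rather than no-ops):

use std::task::{RawWaker, RawWakerVTable, Waker};

// The data pointer would normally point at whatever needs notifying
// (e.g. an Arc); here it is unused, so all the functions are no-ops.
unsafe fn clone_raw(data: *const ()) -> RawWaker {
    RawWaker::new(data, &VTABLE)
}
unsafe fn wake_raw(_data: *const ()) {}
unsafe fn wake_by_ref_raw(_data: *const ()) {}
unsafe fn drop_raw(_data: *const ()) {}

static VTABLE: RawWakerVTable =
    RawWakerVTable::new(clone_raw, wake_raw, wake_by_ref_raw, drop_raw);

fn noop_waker() -> Waker {
    // Safety: none of the vtable functions dereference the data pointer.
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}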

@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch from 6bc40a0 to d115e42 Compare February 21, 2020 21:05
@TheBlueMatt
Collaborator Author

I added a bunch of docs that should clarify things a little bit, let me know if it's better now or where I should shore it up.

Comment on lines 190 to 193
/// You should poll the Receive end of event_notify and call get_and_clear_pending_events() on
/// ChannelManager and ChannelMonitor objects after each receive event. In general such an event
/// should occur after nearly every read event, as any read event may trigger an event which must
/// be handled.
Contributor

I'm a little confused by the uses of "event" in this documentation. Specifically, I'm having a hard time differentiating between "read event", "receive event", and the more generic "event" (e.g., "as any read event may trigger an event"). Is the term "event" being overloaded here to refer to two different things?


@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch from d115e42 to 725bb3e Compare February 23, 2020 22:34
@TheBlueMatt
Collaborator Author

Added an example and some more links in the Events module doc, hopefully that's a bit clearer, especially since it's, indeed, somewhat overloaded.

Contributor

@jkczyz jkczyz left a comment

Added an example and some more links in the Events module doc, hopefully that's a bit clearer, especially since it's, indeed, somewhat overloaded.

Let me make sure I understand how the term had been overloaded. Lightning events are generated by the PeerManager via ChannelManager when messages are received and also by ChannelMonitor when blocks are connected. This is one use of "events".

Then in our tokio module, the unit type () is used as an event by passing it through an mpsc::channel in order to notify that the aforementioned Lightning events have been received. This "receive event" indicates that the ChannelManager and ChannelMonitor should be queried for those Lightning events.

Is that correct? If so, could these concepts be combined? That is, could we instead have the tokio module send the Lightning events over the mpsc::channel? This seems like exactly the use case that the mpsc::channel abstraction was designed for.

@TheBlueMatt
Collaborator Author

Let me make sure I understand how the term had been overloaded. Lightning events are generated by the PeerManager via ChannelManager when messages are received and also by ChannelMonitor when blocks are connected. This is one use of "events".

Yep!

Then in our tokio module, the unit type () is used as an event by passing it through an mpsc::channel in order to notify that the aforementioned Lightning events have been received

Err, not received, only that "something has happened, this something may or may not have caused some of the aforementioned events to have been created".

This "receive event" indicates that the ChannelManager and ChannelMonitor should be queried for those Lightning events.

Yep!

Is that correct? If so, could these concepts be combined? That is, could we instead have the tokio module send the Lightning events over the mpsc::channel? This seems like exactly the use case that the mpsc::channel abstraction was designed for.

Hmm, would be somewhat awkward, IMO. We handle the () here: https://github.com/TheBlueMatt/rust-lightning-bitcoinrpc/blob/master/src/main.rs#L123 and we write to it in a number of other places (not just the tokio net stuff) including eg when the user asks us to send a payment.

More importantly, we started in #474 (and I have a pending commit or two, not yet PR'd, to do the same for events) making Events harder to miss when crashing/shutting down before having handled all the in-user-buffer Events, by moving them to our own buffers and serializing them.
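To illustrate the notification pattern being discussed, here is a hedged sketch of what a user-side loop might look like. The EventsProvider trait below is a local stand-in for ChannelManager/ChannelMonitor's get_and_clear_pending_events(); it is not this crate's API, and the names are placeholders:

use std::sync::Arc;
use tokio::sync::mpsc;

// Stand-in for the get_and_clear_pending_events() calls on ChannelManager /
// ChannelMonitor; the concrete rust-lightning types are elided here.
trait EventsProvider {
    type Event;
    fn get_and_clear_pending_events(&self) -> Vec<Self::Event>;
}

async fn drive_events<P: EventsProvider>(provider: Arc<P>, mut event_notify: mpsc::Receiver<()>) {
    // Each () only means "something happened, go check for pending Events";
    // it says nothing about what happened, or whether any Event resulted.
    while let Some(()) = event_notify.recv().await {
        for event in provider.get_and_clear_pending_events() {
            // Hand the Event off to application logic here.
            let _ = event;
        }
    }
}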

@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch from 725bb3e to 66e6066 Compare February 24, 2020 21:54
Contributor

@valentinewallace valentinewallace left a comment

Initial pass. The recently added comments have been helpful!

fn schedule_read<CMH: ChannelMessageHandler + 'static>(peer_manager: Arc<peer_handler::PeerManager<SocketDescriptor<CMH>, Arc<CMH>>>, us: Arc<Mutex<Self>>, reader: futures::stream::SplitStream<tokio_codec::Framed<TcpStream, tokio_codec::BytesCodec>>) {
let us_ref = us.clone();
let us_close_ref = us.clone();
async fn schedule_read<CMH: ChannelMessageHandler + 'static>(peer_manager: Arc<peer_handler::PeerManager<SocketDescriptor, Arc<CMH>>>, us: Arc<Mutex<Self>>, mut reader: io::ReadHalf<TcpStream>, mut write_event: mpsc::Receiver<()>) {
Contributor

IIUC, write_event receives write_events (I assume from the remote peer) --- maybe a better name would be write_event_receiver?

Collaborator Author

Done, and realized while I was at it that some comments still referred to PeerManager::write_event, which doesn't exist anymore (it's now PeerManager::write_buffer_space_avail).

tokio::select! {
v = write_event.recv() => select_write_ev!(v),
read = reader.read(&mut buf) => match read {
Ok(0) => {
Contributor

Hm, so reading 0 bytes means the connection is closed, because it'll just block if the connection is still open but there's nothing to read?

Collaborator Author

Yep! It's pretty standard for socket APIs to do that.
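For reference, the standard shape of such a read loop, assuming a tokio TcpStream split into read/write halves as in this PR (a sketch, not the PR's code):

use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

async fn read_loop(mut reader: tokio::io::ReadHalf<TcpStream>) {
    let mut buf = [0u8; 4096];
    loop {
        match reader.read(&mut buf).await {
            Ok(0) => break,  // EOF: the peer closed the connection.
            Ok(len) => {
                // Hand &buf[..len] to PeerManager::read_event() here.
                let _ = &buf[..len];
            },
            Err(_) => break, // I/O error: also treat as a disconnect.
        }
    }
}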

pub fn setup_inbound<CMH: ChannelMessageHandler + 'static>(peer_manager: Arc<peer_handler::PeerManager<SocketDescriptor<CMH>, Arc<CMH>>>, event_notify: mpsc::Sender<()>, stream: TcpStream) {
let (reader, us) = Self::new(event_notify, stream);
/// Process incoming messages and feed outgoing messages on the provided socket generated by
/// accepting an incoming connection (by scheduling futures with tokio::spawn).
Contributor

Got confused by (by scheduling futures with tokio::spawn). Does this make sense as a replacement:

s/(by scheduling futures with tokio::spawn)/(by giving the PeerManager a socket to write peer data out on, scheduling data reads from the same socket, and transmitting said data to the PeerManager)/

Collaborator Author

Hmm, actually I intended to inform the user that, via tokio::spawn, things would run free. I clarified a bit.
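In other words, something like this hypothetical accept loop, where handle_connection stands in for this crate's setup_inbound: once handed to tokio::spawn, the per-connection work keeps running on the runtime independently of the caller.

use tokio::net::{TcpListener, TcpStream};

async fn accept_loop(mut listener: TcpListener) -> std::io::Result<()> {
    loop {
        let (stream, _addr) = listener.accept().await?;
        // The spawned task "runs free": it is scheduled by tokio and
        // outlives this loop iteration (and the caller, if need be).
        tokio::spawn(handle_connection(stream));
    }
}

async fn handle_connection(stream: TcpStream) {
    // Per-connection processing (reads, writes, peer handling) lives here.
    let _ = stream;
}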

// If we get disconnected via SocketDescriptor::disconnect_socket(), we don't call
// disconnect_event(), but if we get an Err return value out of PeerManager, in general, we do.
// We track here whether we'll need to call disconnect_event() after the socket closes.
need_disconnect_event: bool,
Contributor

Wish there was a way to get rid of this variable, feels a bit hacky.

I kept having to re-reference what it means, so maybe rename more explicitly to something like "call_disconnect_event_on_shutdown"?

Collaborator Author

Sure, renamed. Honestly after all the back and forth on it and bugs and whatnot I think I agree - we should just always require one, though the handling inside PeerHandler will likely get a bunch more complicated, so I don't know if it makes sense to run forward and do it ASAP.

Contributor

Doesn't seem to be renamed? Not a huge deal tho.

we should just always require one, though the handling inside PeerHandler will likely get a bunch more complicated, so I don't know if it makes sense to run forward and do it ASAP.

Cool, might file an issue for that then!

Contributor

@jkczyz jkczyz left a comment

Err, not received, only that "something has happened, this something may or may not have caused some of the aforementioned events to have been created".

For the case where the "something" is not one of the aforementioned events, what's the use of notifying? Could you give me an example to help understand?

Hmm, would be somewhat awkward, IMO. We handle the () here: https://github.com/TheBlueMatt/rust-lightning-bitcoinrpc/blob/master/src/main.rs#L123 and we write to it in a number of other places (not just the tokio net stuff) including eg when the user asks us to send a payment.

Could you add the write use case to the documentation? AFAICT, this usage is not mentioned anywhere. Specifically, it seems like the user is required to notify whenever they use the ChannelManager in some manner.

This is an excellent opportunity to add (small) tests demonstrating each behavior of the library.

Comment on lines 5 to 7
//! TcpStream and a reference to a PeerManager and the rest is handled", with the exception of the
//! act of reading bytes from the remote peer possibly triggering the need to handle some
//! [Events](../lightning/util/events/enum.Event.html).
Contributor

The exception here seems less of an exception and more of the primary way of interacting with the API. I would recommend rewording this to indicate as such.

Collaborator Author

Hmm, that was meant as "the only thing that isn't as trivial is Event handling". Hope it's clearer now.

@TheBlueMatt
Collaborator Author

For the case where the "something" is not one of the aforementioned events, what's the use of notifying? Could you give me an example to help understand?

Right, sorry, from lightning_net_tokio it's always either bytes-read-from-socket or socket-disconnected, though in theory it could be others (and is used generically across the rust-lightning-bitcoinrpc sample app).

Could you add the write use case to the documentation? AFAICT, this usage is not mentioned anywhere. Specifically, it seems like the user is required to notify whenever they use the ChannelManager in some manner.

Hmm, I think maybe I miscommunicated a bit here - within rust-lightning-bitcoinrpc we use the same Sender as we pass into lightning-net-tokio to handle things like "we've sent money, we should go poke the PeerHandler to see if it wants to send any data" (which of course ends up calling send_data within the SocketDescriptor), but that is somewhat unrelated to its use in lightning_net_tokio.

@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch from 66e6066 to 6f5731d Compare February 25, 2020 03:13
Contributor

@valentinewallace valentinewallace left a comment

I'd second Jeff's comment:

This is an excellent opportunity to add (small) tests demonstrating each behavior of the library.

But if that comment were addressed, then even though I wouldn't say I'm thrilled with the parsability (mostly w/r/t some of the weird bools like need_disconnect_event, disconnect_block, and disconnect), I'd be okay with revisiting and trying to simplify later.

// Ignore full errors as we just need them to poll after this point, so if the user
// hasn't received the last send yet, it doesn't matter.
assert!(e.is_full());
// Whenever we want to block on reading or waiting for reading to resume, we have to
Contributor

nit: s/or waiting/or are waiting

Collaborator Author

I don't think so - "we want to block on reading or block on waiting" is the de-sugaring.

@TheBlueMatt TheBlueMatt removed this from the 0.0.10 milestone Feb 26, 2020
@TheBlueMatt
Collaborator Author

I did find one more thing to test - running without a many-threaded environment has different behavior the first time you go to write bytes. This is a neat test in that it also exercises our write-blocking and RawWakerVTable logic.

@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch from 6edd4ca to 2e0ea18 Compare February 28, 2020 19:21
Comment on lines 164 to 192
tokio::select! {
v = write_avail_receiver.recv() => select_write_ev!(v),
read = reader.read(&mut buf) => match read {
Ok(0) => {
println!("Connection closed");
break;
},
Ok(len) => {
if let Some(blocker) = {
let mut lock = us.lock().unwrap();
if lock.disconnect_state == DisconnectionState::RLTriggeredDisconnect {
shutdown_socket!("disconnect_socket() call from RL");
}
if lock.read_paused {
let (sender, blocker) = oneshot::channel();
lock.read_blocker = Some(sender);
Some(blocker)
} else { None }
} {
tokio::select! {
res = blocker => {
res.unwrap(); // We should never drop the sender without sending () into it!
if us.lock().unwrap().disconnect_state == DisconnectionState::RLTriggeredDisconnect {
shutdown_socket!("disconnect_socket() call from RL");
}
},
v = write_avail_receiver.recv() => select_write_ev!(v),
}
}
Contributor

The logic around paused reads with the nested select! is a bit difficult to follow. I wonder if this could be simplified a bit using a precondition on the read branch.

let read_paused = us.lock().unwrap().read_paused;
tokio::select! {
    v = write_avail_receiver.recv() => {
        // ...
    },
    read = reader.read(&mut buf), if !read_paused => match read {
        // ...
    },
}

Then I think this would eliminate the need for the blocker and thus the nested select!. Instead, we would write until send_data indicates to resume reading, in which case the read branch would be reenabled.

Collaborator Author

Hmm, I'm not convinced that select!{} definitely won't call the read() future if !read_paused (instead of blocking on the future and never taking that branch). I'd buy that it likely won't, but there's sadly way too little documentation on it.

Collaborator Author

In reading the documentation, I realize my comment was definitely not right, as the docs explicitly call out this case and say that the future won't be polled. That said, I don't think it eliminates the need for the blocker - we still need a way to wake up the reader in case we got a write with resume_read set.

I initially tried to rewrite it with just one Sender which could wake the read task, but sadly we need to make sure that if we wake it due to a write_avail event, we always call write_buff_space_avail. This would mean either the channel has an unlimited depth and we scan it for any write_avail events, or we add yet another bool to indicate that write_avail needs to be called. I find both worse than keeping the current model of just having two Senders, but I removed the Option<> and simplified the logic a ton.
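For readers following along, the "blocker" mechanism under discussion has roughly this shape (a sketch only; field and function names are illustrative, not the PR's exact code):

use std::sync::{Arc, Mutex};
use tokio::sync::oneshot;

struct ConnState {
    read_paused: bool,
    read_blocker: Option<oneshot::Sender<()>>,
}

async fn maybe_wait_for_resume(us: &Arc<Mutex<ConnState>>) {
    let blocker = {
        let mut lock = us.lock().unwrap();
        if lock.read_paused {
            let (sender, receiver) = oneshot::channel();
            lock.read_blocker = Some(sender);
            Some(receiver)
        } else { None }
    };
    if let Some(receiver) = blocker {
        // Parked here until resume_read() below fires.
        let _ = receiver.await;
    }
}

// Called from the send_data() path when resume_read is set:
fn resume_read(us: &Arc<Mutex<ConnState>>) {
    let mut lock = us.lock().unwrap();
    lock.read_paused = false;
    if let Some(sender) = lock.read_blocker.take() {
        let _ = sender.send(());
    }
}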

Contributor

Didn't notice your reply before commenting, so deleted it. Looks much simpler now!

Contributor

Ah, with regard to the blocker, isn't writing the only way to resume reading? That is, we can't resume reading until send_data is called with resume_read set true. That was why I thought the blocker was not needed. I haven't fully reviewed the send_data implementation, so I could be missing something.

@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch from 2e0ea18 to 829e176 Compare March 1, 2020 04:21
@TheBlueMatt
Collaborator Author

I saw the issue in tokio 2213 again, but I think now I realize they can't possibly fix it as it would break other things (hence, NOTABUG, I guess) - if you have pending writes frozen and then you go to write again, it doesn't make sense for the new write to complete before the pending write(s) get woken up to write their data first. Thus, it becomes a race between the select() thread waking up the pending writes and the new writers making it into some checks (unless they cache the writeable state, which also kinda probably sucks). Anyway, it turns out it's in fact not only easy to work around, but the workaround makes a few things probably cheaper than they were to begin with, so no harm in keeping it anyway.

Comment on lines 136 to 159
macro_rules! prepare_read_write_call {
() => { {
let mut us_lock = us.lock().unwrap();
if us_lock.disconnect_state == DisconnectionState::RLTriggeredDisconnect {
shutdown_socket!("disconnect_socket() call from RL");
}
},
Err(e) => {
us_ref.lock().unwrap().need_disconnect = false;
return future::Either::B(future::result(Err(std::io::Error::new(std::io::ErrorKind::InvalidData, e))));
}
us_lock.block_disconnect_socket = true;
} }
}
Contributor

Rather than using a macro and needing to set block_disconnect_socket false in multiple places (because of breaks), could you make a simple RAII struct (DisconnectBlocker?) for managing this state?

Then prepare_read_write_call could be renamed to something less ambiguous like check_for_shutdown, if still needed (see comment on disconnect check).
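A hypothetical sketch of the RAII idea being suggested here (not adopted as-is in the PR; as the reply below notes, the disconnect-state check needs to happen under the same lock):

use std::sync::{Arc, Mutex};

struct ConnState {
    block_disconnect_socket: bool,
}

struct DisconnectBlocker {
    us: Arc<Mutex<ConnState>>,
}

impl DisconnectBlocker {
    fn new(us: Arc<Mutex<ConnState>>) -> Self {
        us.lock().unwrap().block_disconnect_socket = true;
        DisconnectBlocker { us }
    }
}

impl Drop for DisconnectBlocker {
    fn drop(&mut self) {
        // Cleared automatically on every exit path, including early breaks.
        self.us.lock().unwrap().block_disconnect_socket = false;
    }
}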

Collaborator Author

Using an RAII struct seems hard, see below, as we need to check the disconnect state in the same lock as setting block_disconnect.

Contributor

Could block_disconnect_socket just be a lock? It seems to be functioning as one. Then you would get the RAII behavior for free.

Collaborator Author

Hmm, this seemed like a great idea, but once I tried to start writing it I realized we'd need to be able to hold a reference to that lock without the outer lock (as we can't hold the outer lock while calling into write_buffer_space_avail), which would imply a second Arc which seems super wasteful. If we ever end up with CPU issues on the spinlock I'd be somewhat surprised, so for now I'd say leave it.

Contributor

Will defer to you on performance. From a code complexity perspective, if not using a mutex I'd then recommend:

  • moving us_lock.block_disconnect_socket = true; out of prepare_read_write_call to avoid obfuscating the connection with us_lock.block_disconnect_socket = false;
  • rename prepare_read_write_call to something more meaningful like shutdown_socket_if_requested

Comment on lines +439 to +451
// Happy-path return:
if !us.block_disconnect_socket { return; }
Contributor

Isn't this handled by the following loop?

Collaborator Author

Not without an extra lock().
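To spell out the happy path under discussion, here is a rough, hypothetical sketch of the disconnect_socket() shape (field names are placeholders, not the PR's exact code): the first lock both records the disconnect request and checks the flag, so the spin loop with its repeated lock() calls is only entered when a read/write call into Rust-Lightning is actually in flight.

use std::sync::{Arc, Mutex};

struct ConnState {
    rl_requested_disconnect: bool, // hypothetical name
    block_disconnect_socket: bool,
}

fn disconnect_socket(us: &Arc<Mutex<ConnState>>) {
    {
        let mut lock = us.lock().unwrap();
        lock.rl_requested_disconnect = true;
        // Happy path: no read_event()/write_buffer_space_avail() call is in
        // flight, so we can return without ever spinning.
        if !lock.block_disconnect_socket { return; }
    }
    // Busy-wait until the read task finishes its current call into
    // Rust-Lightning, so no such call can happen after we return.
    loop {
        let lock = us.lock().unwrap();
        if !lock.block_disconnect_socket { return; }
    }
}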

Comment on lines 139 to 156
if us_lock.disconnect_state == DisconnectionState::RLTriggeredDisconnect {
shutdown_socket!("disconnect_socket() call from RL");
}
Contributor

I'm not entirely in love with how disconnect_state is used. It seems to have two purposes:

  1. signaling that SocketDescriptor::disconnect_socket was called
  2. signaling a manner in which the select loop has exited

Could (1) be communicated via a channel? Then a branch could be added to the select! for handling disconnection. Or would this be problematic if another branch is chosen even if a disconnect has been signaled?

Maybe a channel would be too much, but just wanted to note my concern. See also my other comment on breaking from the loop.

Collaborator Author

The issue is that we have to check disconnect_state in the same lock where we set block_disconnect_socket to meet the requirement that "after disconnect_socket() returns, no future calls into read_event or write_buffer_space_avail are made". I'm not sure how to accomplish that with a channel. I could go back to having two bools - one to indicate disconnection should occur and one to indicate that we need to call socket_disconnected. You do, however, point out a simplification here in that we can keep the need_disconnect_event bool in the schedule_read stack.

Comment on lines 129 to 148
macro_rules! shutdown_socket {
($err: expr) => { {
println!("Disconnecting peer due to {}!", $err);
break;
} }
Contributor

@jkczyz jkczyz Mar 2, 2020

It's strange using a macro for this but not for the cases where the connection has been closed based on the result of reader.read(). I understand in the shutdown_socket case it is to reduce code duplication, but I don't like the inconsistency in how the loop break is obscured behind a macro in some cases but not others.

Perhaps a cleaner way to organize the code would be to encapsulate tokio::select! in a method that returned an enum indicating the possible outcomes. Something like Result<(), DisconnectReason>.

enum DisconnectReason {
    ConnectionClosed(String),
    PeerDisconnected(String),
}

Then you can have a straight-line match to handle the possible cases explicitly, which I think would be more readable. You may not even need disconnect_state or at least could reduce its purpose to signaling from SocketDescriptor::disconnect_socket.
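A hypothetical sketch of the straight-line handling being proposed, where next_io_event() is a stand-in for the select!-wrapping method:

enum DisconnectReason {
    ConnectionClosed(String),
    PeerDisconnected(String),
}

// Stand-in for the proposed method wrapping the tokio::select! body.
async fn next_io_event() -> Result<(), DisconnectReason> {
    Ok(())
}

async fn read_loop() {
    loop {
        match next_io_event().await {
            Ok(()) => {},
            Err(DisconnectReason::ConnectionClosed(err)) => {
                println!("Connection closed: {}", err);
                break;
            },
            Err(DisconnectReason::PeerDisconnected(err)) => {
                println!("Disconnecting peer due to {}!", err);
                break;
            },
        }
    }
}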

Collaborator Author

Let me know what you think now. I agree it should at least all use the macro, so I went ahead and did that. Further, if you try to break; the loop directly now you'll get an uninit'd variable (as the need_disconnect_event field is now on the stack). I can make it a separate function but I'm not a huge fan of an fn for the read loop and then cleanup going into a different one. A different thing we could do is give the loop a return value (yes, Rust is crazy, you can do that) so that you have to provide the need_disconnect_event field in the break; call.

Contributor

Definitely looking better modulo my comment on block_disconnect_socket.

I kinda prefer if the loop returned a variable, actually. Additionally, I think the uses of shutdown_socket would read better with an enum reason similar to the one given in my earlier comment, only without the string. Otherwise, it's not clear what the true and false literals represent without looking at the definition.

Collaborator Author

Ok, I made the loop return a local enum, though I fear we've gotten into diminishing returns at this point.

@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch 7 times, most recently from 96f70e9 to 02d4c8a Compare March 4, 2020 18:57
@codecov

codecov bot commented Mar 4, 2020

Codecov Report

Merging #472 into master will increase coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #472      +/-   ##
==========================================
+ Coverage   90.07%   90.13%   +0.05%     
==========================================
  Files          34       34              
  Lines       19053    19052       -1     
==========================================
+ Hits        17162    17172      +10     
+ Misses       1891     1880      -11
Impacted Files Coverage Δ
lightning/src/ln/peer_handler.rs 48.7% <100%> (ø) ⬆️
lightning/src/ln/chanmon_update_fail_tests.rs 97.32% <0%> (-0.01%) ⬇️
lightning/src/ln/functional_tests.rs 96.52% <0%> (+0.23%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 83c9eb4...174ccf5. Read the comment docs.

@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch from 02d4c8a to 87f7ad5 Compare March 5, 2020 03:04
Comment on lines 130 to 131
// If Rust-Lightning tells us to disconnect, we don't, then, call socket_disconnect() after
// closing the socket, but if the socket closes on its own (ie our counterparty does it), we do.
Contributor

I'm extremely confused by this documentation. I'm trying to correspond it to the enum variants and how they are used in the code but am failing.

If Rust-Lightning tells us to disconnect, we don't, then, call socket_disconnect() after closing the socket

Is this saying we don't disconnect? But then says we close the socket. Isn't closing the socket disconnecting? What is meant by "disconnect" here?

Does this correspond to the PeerDisconnected variant? Seems odd to have this sentence first when that variant is second. I'd recommend just putting the appropriate documentation on each variant. Then have the enum documentation itself say what the enum is used for not how it is used.

but if the socket closes on its own (ie our counterparty does it), we do.

Does this correspond to the CloseConnection variant? Isn't that when read returns 0 or an error? But in those cases, PeerDisconnected is used in the code.

I also read it as the imperative "close connection" which would make sense for "we do" if I understood what that is referring to. :P

Collaborator Author

Comment better now? I reworded it to include the enum in the text so that hopefully it's clearer. I think we're way past the point where this should be merged and we can do more cleanups in follow-ons if there's still confusion.

Contributor

Comment better now? I reworded it to include the enum in the text so that hopefully it's clearer.

Ok, after a lot of staring and conferring with Val, the confusing part is ", then,". When reading this:

If Rust-Lightning tells us to disconnect, we don't, then, call socket_disconnect() after closing the socket

I took this as, when Rust-Lighting tells us to disconnect:

  • We don't disconnect
  • Then we call socket_disconnected()
  • But only after closing the socket

Which was confusing because it sounded like you were saying we don't disconnect but we close the socket.

However, I can now see the ", then," is a manner of speaking (in particular, your manner of speaking) which IMHO does not come across well when reading it. Given that I now recognize this (both from experience over chat and in person) but still couldn't parse the comment correctly, it indicates others may have the same problem.

This is why I tend to request very succinct writing, cutting out unnecessary verbiage; it avoids needless confusion.

In this case, regardless of rewording, I think that instead of repeating how the code works the documentation should say what the enum is used for. FWIW, I'd recommend something like this:

// The manner in which the socket was disconnected.
enum Disconnect {
    // Indicates Rust Lightning disconnected the socket as a result of a PeerHandleError.
    CloseConnection,
    // Indicates the peer disconnected the socket and thus Rust Lightning should be notified via socket_disconnected().
    PeerDisconnected,
}

Irrespective of whether the above wording is accurate (I'm not saying it is), writing a comment like this may clarify the language as there may be ambiguity between "close", "disconnect", "socket", and "connection". Which could also result in more accurate enum variant naming.

I'd also like to use this opportunity to make a request regarding code reviews. When I'm generally confused by something and ask a question for clarification, I'm not necessarily requesting that a change be made. I'm really just looking for an answer to the question. And from that I may suggest a specific change that I think would help remove confusion.

This is particularly the case with regards to comments. That is much more efficient than rewriting something and asking, "How about now?"

I think we're way past the point where this should be merged and we can do more cleanups in follow-ons if there's still confusion.

First off, thank you for reaching out to me on Slack saying to not take this the wrong way. I genuinely appreciate you doing that.

I just want to note that reviewers have an important role to play in determining whether a change is merged. And any confusion should err on the side of the reviewers given they are looking at the code largely from the perspective of a reader rather than the writer.

I do agree that schedule_read looks to be in pretty good shape. That said, it is only part of the code under review. I did leave a comment about send_data asking for guidance in reviewing it. I would like to review that before merging so would appreciate a follow up there.

Collaborator Author

When I'm generally confused by something and ask a question for clarification, I'm not necessarily requesting that a change be made

Hmm, fair, ok. I generally err on the side of "answer by updating the comment" as it usually implies less likelihood of people being confused down the road, since someone else is likely to have the same question.

As for the review process in general, I agree, only noting that we don't have the resources to block things for as long as we are. That's probably an indication of a problem different from reviewers taking too long (documentation not up to par, for example), of course. In this specific case, probably the documentation on PeerHandler itself needs to be the focus of the improvements, though we can do that in a separate PR.

@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch from 87f7ad5 to ca3c8d9 Compare March 5, 2020 23:28
Comment on lines 130 to 133
// If Rust-Lightning tells us to disconnect (ie Disconnect::CloseConnection), we don't, then,
// call socket_disconnected() (informing Rust-Lightning that we're disconnecting) after
// closing the socket, but if the socket closes on its own (ie our counterparty closes the
// connection and we set disconnect_type to PeerDisconnected), we do.
Contributor

@valentinewallace valentinewallace Mar 6, 2020

This confused me at first too; I suggest this phrasing --

If Rust-Lightning tells us to disconnect (i.e. through Disconnect::CloseConnection), then we close the socket but do not call PeerManager's socket_disconnected() (which informs Rust-Lightning that we're disconnecting). But, if the socket closes on its own, (i.e. if our counterparty closes the connection, causing us to set disconnect_type to PeerDisconnected), we do call PeerManager's socket_disconnected().

Collaborator Author

I took Jeff's suggested wording by mixing it into the enum. Is that clearer now?

Contributor

Great wording now!

It looks like we don't currently use the Vec as a Vec, and can
happily take a slice, which makes things easier on the calling
side.
@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch from ca3c8d9 to 3f5821b Compare March 10, 2020 16:34
Contributor

@jkczyz jkczyz left a comment

Left a few comments around clarity in the implementation comments. Otherwise, LGTM.


Comment on lines +599 to +601
tokio::time::timeout(Duration::from_secs(10), a_connected.recv()).await.unwrap();
tokio::time::timeout(Duration::from_secs(1), b_connected.recv()).await.unwrap();
Contributor

Any reason the timeouts are different for these? Same below.

Collaborator Author

Because by the time the connection has done the handshake and disconnected from one end, the other end should be essentially already disconnected.

This is a rather major rewrite, using async/await and tokio 0.2,
which cleans up the code a ton as well as adds significantly to
readability.
If rust-lightning tells us to disconnect a socket after we read
some bytes from the socket, but before we actually give those bytes
to rust-lightning, we may end up calling rust-lightning with a
Descriptor that isn't registered anymore.

Sadly, there really isn't a good way to solve this, and it should
be a pretty quick event, so we just busy-wait.
@TheBlueMatt TheBlueMatt force-pushed the 2020-01-net-async-await branch from 3f5821b to 174ccf5 Compare March 11, 2020 16:20
@TheBlueMatt
Collaborator Author

Happy to take any followup comments in more PRs, so feel free to continue review if anyone isn't happy with where this is, but Jeff's ACK is good enough for me.

@TheBlueMatt TheBlueMatt merged commit d27e9e1 into lightningdevkit:master Mar 11, 2020