Skip to content

[peer_handler] Take the peers lock before getting messages to send #891

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

TheBlueMatt
Copy link
Collaborator

Previously, if a user simultaneously called
PeerHandler::process_events() from two threads, we'd race, which
ended up sending messages out-of-order in the real world.
Specifically, we first called get_and_clear_pending_msg_events,
then take the peers lock and push the messages we got into the
sending queue. Two threads may both get some set of messages to
send, but then race each other into the peers lock and send the
messages in random order.

Because we already hold the peers lock when calling most message
handler functions, we can simply take the lock before calling
get_and_clear_pending_msg_events, solving the race.

Previously, if a user simultaneously called
`PeerHandler::process_events()` from two threads, we'd race, which
ended up sending messages out-of-order in the real world.
Specifically, we first called `get_and_clear_pending_msg_events`,
then take the `peers` lock and push the messages we got into the
sending queue. Two threads may both get some set of messages to
send, but then race each other into the `peers` lock and send the
messages in random order.

Because we already hold the `peers` lock when calling most message
handler functions, we can simply take the lock before calling
`get_and_clear_pending_msg_events`, solving the race.
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this pull request Apr 21, 2021
While trying to debug the issue ultimately tracked down to a
`PeerHandler` locking bug in lightningdevkit#891, the ability to deliver only
individual messages at a time in chanmon_consistency looked
important. Specifically, it initially appeared there may be a race
when an update_add_htlc was delivered, then a node sent a payment,
and only after that, the corresponding commitment-signed was
delivered.

This commit adds such an ability, greatly expanding the potential
for chanmon_consistency to identify channel state machine bugs.
@codecov
Copy link

codecov bot commented Apr 21, 2021

Codecov Report

Merging #891 (bfd1128) into main (f40e47c) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #891      +/-   ##
==========================================
- Coverage   90.30%   90.29%   -0.01%     
==========================================
  Files          57       57              
  Lines       29225    29225              
==========================================
- Hits        26392    26390       -2     
- Misses       2833     2835       +2     
Impacted Files Coverage Δ
lightning/src/ln/peer_handler.rs 44.24% <100.00%> (ø)
lightning/src/ln/functional_tests.rs 97.00% <0.00%> (-0.04%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f40e47c...bfd1128. Read the comment docs.

TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this pull request Apr 21, 2021
While trying to debug the issue ultimately tracked down to a
`PeerHandler` locking bug in lightningdevkit#891, the ability to deliver only
individual messages at a time in chanmon_consistency looked
important. Specifically, it initially appeared there may be a race
when an update_add_htlc was delivered, then a node sent a payment,
and only after that, the corresponding commitment-signed was
delivered.

This commit adds such an ability, greatly expanding the potential
for chanmon_consistency to identify channel state machine bugs.
@TheBlueMatt TheBlueMatt added this to the 0.0.14 milestone Apr 21, 2021
@devrandom
Copy link
Member

utACK

If I understand correctly, read handling locks this in do_read_event, and write handling locks in process_events.

@devrandom
Copy link
Member

Fixes #888

@TheBlueMatt TheBlueMatt merged commit 4f6a038 into lightningdevkit:main Apr 22, 2021
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this pull request May 24, 2021
While trying to debug the issue ultimately tracked down to a
`PeerHandler` locking bug in lightningdevkit#891, the ability to deliver only
individual messages at a time in chanmon_consistency looked
important. Specifically, it initially appeared there may be a race
when an update_add_htlc was delivered, then a node sent a payment,
and only after that, the corresponding commitment-signed was
delivered.

This commit adds such an ability, greatly expanding the potential
for chanmon_consistency to identify channel state machine bugs.
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this pull request May 25, 2021
While trying to debug the issue ultimately tracked down to a
`PeerHandler` locking bug in lightningdevkit#891, the ability to deliver only
individual messages at a time in chanmon_consistency looked
important. Specifically, it initially appeared there may be a race
when an update_add_htlc was delivered, then a node sent a payment,
and only after that, the corresponding commitment-signed was
delivered.

This commit adds such an ability, greatly expanding the potential
for chanmon_consistency to identify channel state machine bugs.
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this pull request May 27, 2021
While trying to debug the issue ultimately tracked down to a
`PeerHandler` locking bug in lightningdevkit#891, the ability to deliver only
individual messages at a time in chanmon_consistency looked
important. Specifically, it initially appeared there may be a race
when an update_add_htlc was delivered, then a node sent a payment,
and only after that, the corresponding commitment-signed was
delivered.

This commit adds such an ability, greatly expanding the potential
for chanmon_consistency to identify channel state machine bugs.
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this pull request May 31, 2021
While trying to debug the issue ultimately tracked down to a
`PeerHandler` locking bug in lightningdevkit#891, the ability to deliver only
individual messages at a time in chanmon_consistency looked
important. Specifically, it initially appeared there may be a race
when an update_add_htlc was delivered, then a node sent a payment,
and only after that, the corresponding commitment-signed was
delivered.

This commit adds such an ability, greatly expanding the potential
for chanmon_consistency to identify channel state machine bugs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants