[peer_handler] Take the peers lock before getting messages to send #891

TheBlueMatt · 2021-04-21T22:04:02Z

Previously, if a user simultaneously called
PeerHandler::process_events() from two threads, we'd race, which
ended up sending messages out-of-order in the real world.
Specifically, we first called get_and_clear_pending_msg_events,
then take the peers lock and push the messages we got into the
sending queue. Two threads may both get some set of messages to
send, but then race each other into the peers lock and send the
messages in random order.

Because we already hold the peers lock when calling most message
handler functions, we can simply take the lock before calling
get_and_clear_pending_msg_events, solving the race.

Previously, if a user simultaneously called `PeerHandler::process_events()` from two threads, we'd race, which ended up sending messages out-of-order in the real world. Specifically, we first called `get_and_clear_pending_msg_events`, then take the `peers` lock and push the messages we got into the sending queue. Two threads may both get some set of messages to send, but then race each other into the `peers` lock and send the messages in random order. Because we already hold the `peers` lock when calling most message handler functions, we can simply take the lock before calling `get_and_clear_pending_msg_events`, solving the race.

While trying to debug the issue ultimately tracked down to a `PeerHandler` locking bug in lightningdevkit#891, the ability to deliver only individual messages at a time in chanmon_consistency looked important. Specifically, it initially appeared there may be a race when an update_add_htlc was delivered, then a node sent a payment, and only after that, the corresponding commitment-signed was delivered. This commit adds such an ability, greatly expanding the potential for chanmon_consistency to identify channel state machine bugs.

codecov · 2021-04-21T22:23:05Z

Codecov Report

Merging #891 (bfd1128) into main (f40e47c) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #891      +/-   ##
==========================================
- Coverage   90.30%   90.29%   -0.01%     
==========================================
  Files          57       57              
  Lines       29225    29225              
==========================================
- Hits        26392    26390       -2     
- Misses       2833     2835       +2

Impacted Files	Coverage Δ
lightning/src/ln/peer_handler.rs	`44.24% <100.00%> (ø)`
lightning/src/ln/functional_tests.rs	`97.00% <0.00%> (-0.04%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f40e47c...bfd1128. Read the comment docs.

While trying to debug the issue ultimately tracked down to a `PeerHandler` locking bug in lightningdevkit#891, the ability to deliver only individual messages at a time in chanmon_consistency looked important. Specifically, it initially appeared there may be a race when an update_add_htlc was delivered, then a node sent a payment, and only after that, the corresponding commitment-signed was delivered. This commit adds such an ability, greatly expanding the potential for chanmon_consistency to identify channel state machine bugs.

devrandom · 2021-04-22T08:47:47Z

utACK

If I understand correctly, read handling locks this in do_read_event, and write handling locks in process_events.

devrandom · 2021-04-22T08:48:07Z

Fixes #888

While trying to debug the issue ultimately tracked down to a `PeerHandler` locking bug in lightningdevkit#891, the ability to deliver only individual messages at a time in chanmon_consistency looked important. Specifically, it initially appeared there may be a race when an update_add_htlc was delivered, then a node sent a payment, and only after that, the corresponding commitment-signed was delivered. This commit adds such an ability, greatly expanding the potential for chanmon_consistency to identify channel state machine bugs.

TheBlueMatt added this to the 0.0.14 milestone Apr 21, 2021

jkczyz approved these changes Apr 22, 2021

View reviewed changes

TheBlueMatt merged commit 4f6a038 into lightningdevkit:main Apr 22, 2021

TheBlueMatt mentioned this pull request Apr 22, 2021

Signature verification fails due to different output values #888

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[peer_handler] Take the peers lock before getting messages to send #891

[peer_handler] Take the peers lock before getting messages to send #891

Uh oh!

TheBlueMatt commented Apr 21, 2021

Uh oh!

codecov bot commented Apr 21, 2021 •

edited

Loading

Uh oh!

devrandom commented Apr 22, 2021

Uh oh!

devrandom commented Apr 22, 2021

Uh oh!

Uh oh!

[peer_handler] Take the peers lock before getting messages to send #891

[peer_handler] Take the peers lock before getting messages to send #891

Uh oh!

Conversation

TheBlueMatt commented Apr 21, 2021

Uh oh!

codecov bot commented Apr 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

devrandom commented Apr 22, 2021

Uh oh!

devrandom commented Apr 22, 2021

Uh oh!

Uh oh!

codecov bot commented Apr 21, 2021 •

edited

Loading