fix: 0rtt flakes #3496
Conversation
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3496/docs/iroh/ Last updated: 2025-10-07T06:43:29Z
// Invalidate paths that are not in the new address set.
// If the remote node sends us new addresses, old ones are likely stale (e.g., after restart).
// This forces immediate re-evaluation and prevents preferring "known bad" outdated paths
// over "unknown potentially good" new paths.
I'm not sure. On a cursory look I think we get a NodeAddr for other reasons, e.g. when a user dials with a NodeAddr. And that could be totally out of date and other info could be fresher.
In the multipath branch I currently send the initial packets to all known remotes, so whatever the working path is, it should find it. I think it might need some kind of expiry of paths at some point, but I wasn't going to do that immediately.
I've honestly been avoiding thinking too much about how to robustly solve these kinds of issues in the current version, because it just slows down the multipath work... but if others agree there's an improvement here, that's fine.
I'm also unsure; I'll have to dig in a bit more to find out whether this could have unintended side effects (i.e., falsely invalidating good paths).
So what's our path forward here: should I drop the path update and just make sure we survive errors in the loop? That's a "good enough" fix for the flaky tests for now, just not a proper fix for the underlying issue.
Yeah, let's do that and file an issue with the details that you discovered. Then we re-examine this once multipath has landed.
Reverted the path invalidation logic and followed up with issue #3504.
Description
A rough description of what was happening:
Being slow enough made the tests sometimes pass. Locally the tests now pass ~instantly, versus 15 seconds or so, after the fix to gracefully handle errors (sketched below).
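A minimal sketch of the "gracefully handle errors" shape, assuming the fix keeps the event loop alive on per-event errors instead of aborting; the function and types here are illustrative stand-ins, not the actual iroh code:

```rust
/// Illustrative only: drive a stream of events, logging and skipping
/// per-event errors instead of propagating them and killing the loop.
fn drive_events<I>(events: I)
where
    I: IntoIterator<Item = Result<String, String>>,
{
    for event in events {
        match event {
            Ok(ev) => {
                // Handle the event as before.
                println!("handling {ev}");
            }
            Err(err) => {
                // Previously an error here would end the loop (and the test
                // only passed if timing happened to hide it); now we log and
                // continue so a transient failure doesn't stall progress.
                eprintln!("ignoring transient error: {err}");
            }
        }
    }
}
```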
I'm unsure whether this is safe at all, e.g. handling path updates from other angles, partial updates, etc.
If it turns out too risky, we can drop the path invalidation changes and just keep the graceful error handling. Tests will pass.
Breaking Changes
Notes & open questions
Change checklist
quic-rpc
iroh-gossip
iroh-blobs
dumbpipe
sendme