Retry mechanism insufficient in dropped / missing NATS messages scenario #285
Comments
Thank you for writing all of this up @domdom82, I am in favor of all of the proposals you mentioned. Additionally, all of these seem like great options for PRs for people who want to work toward becoming approvers in the networking area!
Hi @ameowlia, I've added two PRs to implement the first two features (make retries configurable + try all endpoints).
For the TLS problems there is an upstream issue to export them, so we won't have to string-compare in the classifier: golang/go#35234. Linking for reference.
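Until that upstream issue lands, classification has to match on error strings. A minimal sketch of what such a string-comparing classifier could look like; the function name and substrings are illustrative assumptions, not gorouter's actual classifier:

```go
package classifier

import "strings"

// isTLSFailure is the kind of string comparison the comment above refers
// to: until golang/go#35234 exports typed errors, the only way to tell
// TLS/x509 failures apart is by matching on the error message. The
// substrings below are illustrative examples, not an exhaustive list.
func isTLSFailure(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "tls: handshake failure") ||
		strings.Contains(msg, "x509: certificate")
}
```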
Issue
We've found an interesting problem that can occur when gorouter has to drop messages from NATS or otherwise does not receive "router.unregister" messages for apps that have moved.
Affected Versions
All
Context
In one of the recent stemcell updates there was a bug that caused stemcells to crash randomly (see mailing-list discussion and skipped stemcell release). This led to massive numbers of "router.unregister" messages never being sent as the apps were restarted elsewhere. Adding insult to injury, users who ran 3 or more instances of their apps for "high availability" were hurt the most and were the least available, because all 3 instances went stale in the routing table. Gorouter subsequently tried calling them one after another and then gave up before reaching the new endpoints in the list, so these users saw lots of 502s.
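To make the failure mode concrete, here is a simplified model of the behavior described above (a sketch, not gorouter's actual code): with a retry budget of 3 and three stale endpoints at the front of the pool, the router gives up before it ever reaches a fresh endpoint.

```go
package main

import "fmt"

const maxRetries = 3

// route models a single request: up to maxRetries attempts walk the
// endpoint pool from the front and give up if every attempt hits a
// stale endpoint.
func route(pool []string, stale map[string]bool) (string, error) {
	for attempt := 0; attempt < maxRetries && attempt < len(pool); attempt++ {
		if ep := pool[attempt]; !stale[ep] {
			return ep, nil
		}
	}
	return "", fmt.Errorf("502: gave up after %d attempts on stale endpoints", maxRetries)
}

func main() {
	// Three crashed-but-still-registered instances sit in front of the
	// replacement endpoints, so all three attempts are wasted.
	pool := []string{"stale-1", "stale-2", "stale-3", "fresh-1", "fresh-2"}
	stale := map[string]bool{"stale-1": true, "stale-2": true, "stale-3": true}

	_, err := route(pool, stale)
	fmt.Println(err) // 502: gave up after 3 attempts on stale endpoints
}
```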
Steps to Reproduce
Once reloaded, you can retry the curl until you see a 502. Rinse and repeat.
Expected result
User sees 200 OK from their app eventually.
Current result
User sees "502 Bad Gateway: Registered endpoint failed to handle the request" even though their app is running fine.
Possible Fix
Make MaxRetries configurable so that it can be larger than 3 for these edge cases. (This makes sense if the dial timeout is set to a low value, so that multiple timeouts don't add up to long response times for the client.)
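A hedged sketch of that trade-off in Go; the names (newBackendClient, getWithRetries) and knobs are illustrative assumptions, not gorouter's actual code or config keys. With a small dial timeout, even a generous retry budget keeps the worst-case latency around maxRetries × dialTimeout.

```go
package proxy

import (
	"net"
	"net/http"
	"time"
)

// newBackendClient builds an HTTP client whose dial attempts fail fast,
// so a larger retry budget doesn't translate into long client-facing
// response times. A dead VM typically makes the dial hang until timeout,
// so the timeout should be small.
func newBackendClient(dialTimeout time.Duration) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: (&net.Dialer{Timeout: dialTimeout}).DialContext,
		},
	}
}

// getWithRetries makes up to maxRetries attempts; the worst case is
// roughly maxRetries * dialTimeout of added latency.
func getWithRetries(client *http.Client, url string, maxRetries int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		resp, err := client.Get(url)
		if err == nil {
			return resp, nil
		}
		lastErr = err // e.g. a dial timeout on a stale endpoint; try again
	}
	return nil, lastErr
}
```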
Additional Context
There are other facets to this issue, which should be addressed in separate issues.
These failures usually don't manifest as "connection refused" but mostly as slow "dial timeout" errors, which may add up to long response times. So to address this sub-problem, the load balancing algorithm may retry based on AZ tags. This is discussed in another issue.
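As a rough sketch of that AZ-based idea (an illustration of the concept, not an actual gorouter implementation), the retry logic could stably push endpoints from the failed endpoint's AZ to the back of the candidate list, so the next attempt prefers a different zone:

```go
package azretry

import "sort"

// Endpoint is a minimal stand-in for a routing-table entry tagged with
// its availability zone.
type Endpoint struct {
	Addr string
	AZ   string
}

// reorderAfterFailure stably moves endpoints that share the failed
// endpoint's AZ to the back, since AZ-level outages tend to take out
// whole groups of instances at once.
func reorderAfterFailure(endpoints []Endpoint, failedAZ string) {
	sort.SliceStable(endpoints, func(i, j int) bool {
		return endpoints[i].AZ != failedAZ && endpoints[j].AZ == failedAZ
	})
}
```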