Retry mechanism insufficient in dropped / missing NATS messages scenario #285

Closed
domdom82 opened this issue Sep 8, 2022 · 3 comments · Fixed by #298 or cloudfoundry/gorouter#335

domdom82 (Contributor) commented Sep 8, 2022

Issue

We've found an interesting problem that can occur when gorouter has to drop messages from NATS or otherwise does not receive "router.unregister" messages for apps that have moved.

  • During load tests we've created an overload situation where NATS emits too many messages for gorouter to consume or even buffer, so it has to drop them eventually.
  • When an app gets redeployed on a different cell, its route emitter sends a "router.unregister" message to instruct gorouter to delete the endpoint
  • If that message gets dropped, gorouter ends up with a stale endpoint
  • If TLS routes are enabled, the pruning cycle won't touch the route and it will stay there until a request is sent to the app again
  • If the app has 3 or more instances, this can lead to 3 contiguous stale entries in the routing table for the route
  • Even if the app has 3 healthy instances later in the list, those will never be tried, because gorouter aborts after 3 retries (a sketch of this behavior follows this list)
  • The user sees a 502 even though their app is perfectly healthy
  • The same problem arose in the context of randomly crashing Diego cells (see Context below)
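
To make the interaction concrete, here is a minimal sketch (illustrative only, not gorouter's actual code; all names are simplified) of how a hard cap of 3 attempts is exhausted on stale endpoints before a healthy one is ever tried, assuming the stale entries happen to be picked first:

```go
// Illustrative sketch only; names and behavior are simplified, not gorouter's code.
package main

import (
	"errors"
	"fmt"
)

const maxRetries = 3 // gorouter currently gives up after this many attempts

type endpoint struct {
	addr  string
	stale bool // instance is gone, but its "router.unregister" was never received
}

// dial simulates connecting to a backend: stale endpoints fail fast.
func dial(e endpoint) error {
	if e.stale {
		return errors.New("dial tcp " + e.addr + ": connect: connection refused")
	}
	return nil
}

func main() {
	// Three stale entries happen to come first; healthy instances sit behind them.
	pool := []endpoint{
		{"10.1.9.0:61080", true},
		{"10.1.9.0:61082", true},
		{"10.1.9.0:61084", true},
		{"10.1.8.0:61080", false},
		{"10.1.8.0:61082", false},
	}

	for attempt, e := range pool {
		if attempt >= maxRetries {
			// All retries were burned on stale endpoints; the client gets a 502
			// even though healthy instances remain in the pool.
			fmt.Println("502 Bad Gateway: retries exhausted")
			return
		}
		if err := dial(e); err != nil {
			fmt.Printf("attempt %d failed: %v\n", attempt+1, err)
			continue
		}
		fmt.Println("200 OK from", e.addr)
		return
	}
}
```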

Affected Versions

All

Context

In one of the recent stemcell updates there was a bug that caused stemcells to crash randomly (see mailing-list discussion and skipped stemcell release). This led to massive amounts of "router.unregister" messages not being sent as the apps got restarted elsewhere. Adding insult to injury, users who ran 3 or more instances of their apps for "high availability" were hurt the most and were the least available, because all 3 instances became stale in the routing table. Gorouter subsequently tried calling them one after another and then gave up before reaching the new endpoints in the list. So these users saw lots of 502s.

Steps to Reproduce

  1. Deploy CF with only 2 cells. This helps gather a large number of instances on one cell.
  2. Deploy an app, any app.
  3. Scale the app to 10 instances.
  4. You should now have 5 instances on each cell.
  5. Log on to any gorouter
  6. Save this gorouter's route table using the nats-client
/var/vcap/jobs/gorouter/bin# ./nats_client save routes-before-crash.json
Saving route table to routes-before-crash.json
Done
  7. Double-check the number of endpoints for your app in the JSON file. There should be 10 entries.
  8. Log on to one of the 2 cells.
  9. Crash-reboot the cell. This gives route-emitter no chance to unregister any route.
diego-cell/d8eadf50-399a-4d8e-a998-f46dc7f15feb:~# echo 1 > /proc/sys/kernel/sysrq
diego-cell/d8eadf50-399a-4d8e-a998-f46dc7f15feb:~# echo b > /proc/sysrq-trigger
Connection to 10.1.9.0 closed by remote host.
Connection to 10.1.9.0 closed.
(...dead)
  10. Wait for your app instances to recover; they should be restarted on the remaining cell.
  11. Save the gorouter's route table again
/var/vcap/jobs/gorouter/bin# ./nats_client save routes-after-crash.json
Saving route table to routes-after-crash.json
Done
  12. Do a diff of the two JSON files. routes-after-crash.json should now have 15 entries for your app (10 running instances on cell A, 5 stale instances on the crashed cell B).
  13. Curl your app. If you're "lucky", you will see a 502 because gorouter tried 3 of the 5 stale instances in a row:
curl -v https://my-app.cf.com
(...)
> GET / HTTP/2
> Host: my-app.cf.com
> user-agent: curl/7.85.0
> accept: */*
>
< HTTP/2 502
< content-type: text/plain; charset=utf-8
< x-cf-routererror: endpoint_failure (dial tcp 10.1.9.0:61080: connect: connection refused)
< x-content-type-options: nosniff
< content-length: 67
< date: Fri, 04 Nov 2022 09:53:09 GMT
< strict-transport-security: max-age=31536000; includeSubDomains; preload;
<
502 Bad Gateway: Registered endpoint failed to handle the request.
  14. If you see a 200 OK, it means gorouter hit one of the good endpoints and probably pruned some of the bad ones. It's random. If you never see a 502, you can reset gorouter to the previous state by loading the route table from after the crash:
/var/vcap/jobs/gorouter/bin# ./nats_client load routes-after-crash.json
Loading route table from routes-after-crash.json
Done

Once reloaded, you can retry the curl until you see a 502. Rinse and repeat.

Expected result

User sees 200 OK from their app eventually.

Current result

User sees "502 Bad Gateway: Registered endpoint failed to handle the request" even though their app is running fine.

Possible Fix

  • Make MaxRetries configurable so that it can be larger than 3 for these edge cases.
  • Another fix could be to keep retrying until there are no more endpoints available, i.e. not have a limit at all. This makes sense if the dial timeout is set to a low value, so that multiple failed attempts don't add up to a long response time for the client. (A sketch of this idea follows below.)
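
As a rough sketch of the second idea (hypothetical helper names, not the actual gorouter change), the retry loop would walk the whole endpoint list and rely on a short dial timeout to keep the worst case at roughly the number of endpoints times the dial timeout:

```go
// Sketch of "retry until no endpoints are left"; tryAll and dialTimeout are
// illustrative names, not gorouter's API.
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

const dialTimeout = 1 * time.Second // keep this low so each failed attempt stays cheap

// tryAll keeps going through the endpoint list instead of stopping after a
// fixed number of retries; the worst case is roughly len(addrs) * dialTimeout.
func tryAll(addrs []string) (net.Conn, error) {
	var lastErr error
	for _, addr := range addrs {
		conn, err := net.DialTimeout("tcp", addr, dialTimeout)
		if err != nil {
			lastErr = err // remember the failure and move on to the next endpoint
			continue
		}
		return conn, nil // first healthy endpoint wins
	}
	if lastErr == nil {
		lastErr = errors.New("no endpoints available")
	}
	return nil, lastErr
}

func main() {
	conn, err := tryAll([]string{"10.1.9.0:61080", "10.1.8.0:61080"})
	if err != nil {
		fmt.Println("502:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```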

Additional Context

There are other facets to this issue, which should be addressed in separate issues.

  • AZ outages can lead to large portions of the route table becoming stale. What makes this worse is that AZ outages usually don't result in fast-fail errors such as connection refused, but mostly in slow dial-timeout errors, which can add up to long response times. To address this sub-problem, the load-balancing algorithm may retry based on AZ tags. This is discussed in another issue.
  • We have observed TLS 1.3 errors that can appear during handshakes but are not yet covered by the retriable classifier group, so these kinds of errors are currently not retried, even though no request has actually reached the backend app yet. There is currently no upstream issue for that, but we will open one. I've added this information just to give complete context on the kinds of issues around retries. (A sketch of the classifier problem follows below.)
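
For context on the last point, here is an illustrative sketch (the function name and matched message are examples, not gorouter's actual retriable-error classifier) of why string comparison is the only option until crypto/tls exposes typed errors:

```go
// Illustrative only; not gorouter's actual classifier code.
package main

import (
	"fmt"
	"strings"
)

// isRetriableTLSError has to guess from the error text whether the TLS
// handshake failed before any request reached the backend.
func isRetriableTLSError(err error) bool {
	if err == nil {
		return false
	}
	// Fragile: matches on message text. If crypto/tls exported typed alert
	// errors, this could become a type check (errors.As) instead.
	return strings.Contains(err.Error(), "remote error: tls:")
}

func main() {
	err := fmt.Errorf("remote error: tls: handshake failure")
	fmt.Println("retriable:", isRetriableTLSError(err))
}
```
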
@domdom82 domdom82 changed the title Retry mechanism insufficient in dropped NATS messages scenario Retry mechanism insufficient in dropped / missing NATS messages scenario Nov 4, 2022
ameowlia (Member) commented Nov 7, 2022

Thank you for writing all of this up, @domdom82.

I am in favor of all of the proposals you mentioned:

  • making MaxRetries configurable
  • option to try all endpoints
  • load balancing across AZs
  • fixing the TLS 1.3 errors to make them retriable

Additionally, all of these seem like great options for PRs for people who want to work towards becoming approvers in the networking area!

domdom82 (Contributor, Author) commented:

Hi @ameowlia, I've added two PRs to implement the first two features (make retries configurable + try all endpoints):

domdom82 (Contributor, Author) commented:

For the TLS problems, there is an upstream issue to export these errors as types so we won't have to do string comparison in the classifier: golang/go#35234

Linking for reference.
