Retry mechanism insufficient in dropped / missing NATS messages scenario #285

Closed
domdom82 opened this issue Sep 8, 2022 · 3 comments · Fixed by #298 or cloudfoundry/gorouter#335

domdom82 (Contributor) commented Sep 8, 2022

Issue

We've found an interesting problem that can occur when gorouter has to drop messages from NATS or otherwise does not receive "router.unregister" messages for apps that have moved.

  • During load tests we've created an overload situation where NATS emits too many messages for gorouter to consume or even buffer, so it has to drop them eventually.
  • When an app gets redeployed on a different cell, its route emitter sends a "router.unregister" message to instruct gorouter to delete the endpoint
  • If that message gets dropped, gorouter ends up with a stale endpoint
  • If TLS routes are enabled, the pruning cycle won't touch the route and it will stay there until a request is sent to the app again
  • If the app has 3 or more instances, this can lead to 3 contiguous stale entries in the routing table for the route
  • Even if the app has 3 healthy instances later in the list, those will never be tried, because gorouter aborts after 3 retries (a sketch of this behavior follows this list)
  • The user sees a 502 even though their app is perfectly healthy
  • The same problem arose in the context of randomly crashing Diego cells (see Context below)
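
To make the interaction concrete, here is a minimal sketch (illustrative only, not gorouter's actual code; all names are simplified) of how a hard cap of 3 attempts is exhausted on stale endpoints before a healthy one is ever tried, assuming the stale entries happen to be picked first:

```go
// Illustrative sketch only; names and behavior are simplified, not gorouter's code.
package main

import (
	"errors"
	"fmt"
)

const maxRetries = 3 // gorouter currently gives up after this many attempts

type endpoint struct {
	addr  string
	stale bool // instance is gone, but its "router.unregister" was never received
}

// dial simulates connecting to a backend: stale endpoints fail fast.
func dial(e endpoint) error {
	if e.stale {
		return errors.New("dial tcp " + e.addr + ": connect: connection refused")
	}
	return nil
}

func main() {
	// Three stale entries happen to come first; healthy instances sit behind them.
	pool := []endpoint{
		{"10.1.9.0:61080", true},
		{"10.1.9.0:61082", true},
		{"10.1.9.0:61084", true},
		{"10.1.8.0:61080", false},
		{"10.1.8.0:61082", false},
	}

	for attempt, e := range pool {
		if attempt >= maxRetries {
			// All retries were burned on stale endpoints; the client gets a 502
			// even though healthy instances remain in the pool.
			fmt.Println("502 Bad Gateway: retries exhausted")
			return
		}
		if err := dial(e); err != nil {
			fmt.Printf("attempt %d failed: %v\n", attempt+1, err)
			continue
		}
		fmt.Println("200 OK from", e.addr)
		return
	}
}
```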

Affected Versions

All

Context

In one of the recent stemcell updates there was a bug that caused stemcells to crash randomly (see mailing-list discussion and skipped stemcell release). This led to massive amounts of "router.unregister" messages not being sent as the apps got restarted elsewhere. Adding insult to injury, users who ran 3 or more instances of their apps for "high availability" were hurt the most and were the least available, because all 3 instances became stale in the routing table. Gorouter subsequently tried calling them one after another and then gave up before reaching the new endpoints in the list. So these users saw lots of 502s.

Steps to Reproduce

  1. Deploy CF with only 2 cells. This helps gather a large number of instances on one cell.
  2. Deploy an app, any app.
  3. Scale the app to 10 instances.
  4. You should now have 5 instances on each cell.
  5. Log on to any gorouter
  6. Save this gorouter's route table using the nats-client
/var/vcap/jobs/gorouter/bin# ./nats_client save routes-before-crash.json
Saving route table to routes-before-crash.json
Done
  7. Double-check the number of endpoints for your app in the JSON file. There should be 10 entries.
  8. Log on to one of the 2 cells.
  9. Crash-reboot the cell. This gives route-emitter no chance to unregister any route.
diego-cell/d8eadf50-399a-4d8e-a998-f46dc7f15feb:~# echo 1 > /proc/sys/kernel/sysrq
diego-cell/d8eadf50-399a-4d8e-a998-f46dc7f15feb:~# echo b > /proc/sysrq-trigger
Connection to 10.1.9.0 closed by remote host.
Connection to 10.1.9.0 closed.
(...dead)
  10. Wait for your app instances to recover; they should be restarted on the remaining cell.
  11. Save the gorouter's route table again
/var/vcap/jobs/gorouter/bin# ./nats_client save routes-after-crash.json
Saving route table to routes-after-crash.json
Done
  12. Do a diff of the two JSON files. routes-after-crash.json should now have 15 entries for your app (10 running instances on cell A, 5 stale instances on the crashed cell B).
  13. Curl your app. If you're "lucky", you will see a 502 because gorouter tried 3 of the 5 stale instances in a row:
curl -v https://my-app.cf.com
(...)
> GET / HTTP/2
> Host: my-app.cf.com
> user-agent: curl/7.85.0
> accept: */*
>
< HTTP/2 502
< content-type: text/plain; charset=utf-8
< x-cf-routererror: endpoint_failure (dial tcp 10.1.9.0:61080: connect: connection refused)
< x-content-type-options: nosniff
< content-length: 67
< date: Fri, 04 Nov 2022 09:53:09 GMT
< strict-transport-security: max-age=31536000; includeSubDomains; preload;
<
502 Bad Gateway: Registered endpoint failed to handle the request.
  14. If you see a 200 OK, it means gorouter hit one of the good endpoints and probably pruned some of the bad ones. It's random. If you never see a 502, you can reset gorouter to the previous state by loading the route table from after the crash:
/var/vcap/jobs/gorouter/bin# ./nats_client load routes-after-crash.json
Loading route table from routes-after-crash.json
Done

Once reloaded, you can retry the curl until you see a 502. Rinse and repeat.

Expected result

User sees 200 OK from their app eventually.

Current result

User sees "502 Bad Gateway: Registered endpoint failed to handle the request" even though their app is running fine.

Possible Fix

  • Make MaxRetries configurable so that it can be larger than 3 for these edge cases.
  • Another fix could be to keep retrying until there are no more endpoints available, i.e. not have a limit at all. This makes sense if the dial timeout is set to a low value, so that multiple failed attempts don't add up to a long response time for the client. (A sketch of this idea follows below.)
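
As a rough sketch of the second idea (hypothetical helper names, not the actual gorouter change), the retry loop would walk the whole endpoint list and rely on a short dial timeout to keep the worst case at roughly the number of endpoints times the dial timeout:

```go
// Sketch of "retry until no endpoints are left"; tryAll and dialTimeout are
// illustrative names, not gorouter's API.
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

const dialTimeout = 1 * time.Second // keep this low so each failed attempt stays cheap

// tryAll keeps going through the endpoint list instead of stopping after a
// fixed number of retries; the worst case is roughly len(addrs) * dialTimeout.
func tryAll(addrs []string) (net.Conn, error) {
	var lastErr error
	for _, addr := range addrs {
		conn, err := net.DialTimeout("tcp", addr, dialTimeout)
		if err != nil {
			lastErr = err // remember the failure and move on to the next endpoint
			continue
		}
		return conn, nil // first healthy endpoint wins
	}
	if lastErr == nil {
		lastErr = errors.New("no endpoints available")
	}
	return nil, lastErr
}

func main() {
	conn, err := tryAll([]string{"10.1.9.0:61080", "10.1.8.0:61080"})
	if err != nil {
		fmt.Println("502:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```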

Additional Context

There are other facets to this issue, which should be addressed in separate issues.

  • AZ outages can lead to large portions of the route table becoming stale. What makes this worse is that AZ outages usually don't result in fast-fail errors such as connection refused, but mostly in slow dial-timeout errors, which can add up to long response times. To address this sub-problem, the load-balancing algorithm may retry based on AZ tags. This is discussed in another issue.
  • We have observed TLS 1.3 errors that can appear during handshakes but are not yet covered by the retriable classifier group, so these kinds of errors are currently not retried, even though no request has actually reached the backend app yet. There is currently no upstream issue for that, but we will open one. I've added this information just to give complete context on the kinds of issues around retries. (A sketch of the classifier problem follows below.)
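
For context on the last point, here is an illustrative sketch (the function name and matched message are examples, not gorouter's actual retriable-error classifier) of why string comparison is the only option until crypto/tls exposes typed errors:

```go
// Illustrative only; not gorouter's actual classifier code.
package main

import (
	"fmt"
	"strings"
)

// isRetriableTLSError has to guess from the error text whether the TLS
// handshake failed before any request reached the backend.
func isRetriableTLSError(err error) bool {
	if err == nil {
		return false
	}
	// Fragile: matches on message text. If crypto/tls exported typed alert
	// errors, this could become a type check (errors.As) instead.
	return strings.Contains(err.Error(), "remote error: tls:")
}

func main() {
	err := fmt.Errorf("remote error: tls: handshake failure")
	fmt.Println("retriable:", isRetriableTLSError(err))
}
```
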
@domdom82 domdom82 changed the title Retry mechanism insufficient in dropped NATS messages scenario Retry mechanism insufficient in dropped / missing NATS messages scenario Nov 4, 2022
ameowlia (Member) commented Nov 7, 2022

Thank you for writing all of this up, @domdom82.

I am in favor of all of the proposals you mentioned:

  • making MaxRetries configurable
  • option to try all endpoints
  • load balancing across AZs
  • fixing the TLS 1.3 errors to make them retriable

Additionally, all of these seem like great options for PRs for people who want to work towards becoming approvers in the networking area!

domdom82 (Contributor, Author) commented:

Hi @ameowlia, I've added two PRs to implement the first two features (make retries configurable + try all endpoints):

domdom82 (Contributor, Author) commented:

For the TLS problems, there is an upstream issue to export these errors as types so we won't have to do string comparison in the classifier: golang/go#35234

Linking for reference.
