Description
When the OSS ClusterClient uses the RouteByLatency
option and a shard is completely down, it will ping every server in the cluster 10 times as much as every 200ms. This can lead to millions of pings per second for production Redis clusters with a large number of clients.
Expected Behavior
RouteByLatency
should limit pings to servers only when latency to servers are expected to have change or when a server does not yet have a latency measurement.
Current Behavior
RouteByLatency
can ping as much as up to 10 times to every server every 200 ms, and will ping at this frequency when all servers of a shard are marked as failing. This happens for every cluster client.
Possible Solution
The behavior to reload the state when there are failovers/topology changes is advantageous as we want the cluster client to have the latest cluster state; this seems to be the only way a server can go from a "failing" state back to a healthy state. Often servers that were healthy before and after the state refresh and were present in both will not have changes in their latency, but servers that may be recovering from failures may be moved in a cloud service provider and have different latencies.
We can consider splitting the conditions for probing latencies of servers - on topology changes/failovers, probe the latency of servers that are new or were previously failed in the state. Otherwise, probe the latency of servers only if the previous latency measurement is sufficiently old (> 10s).
Steps to Reproduce
- Create a OSS Redis cluster with 3 shards and no read replicas.
- Create a test client that issues requests to every shard, example test client gist.
- Shut down redis on a single server in the 3 server cluster - failing one third of the slots.
- Observe the ping rate to the redis cluster.
Change in ping rate from a single cluster client.
Production cluster ping rate during a shard failure.
Context (Environment)
Use of RouteByLatency
introduces additional risk to production clusters, it can limit availability of requests
to all shards when a single shard fails.
In a production incident, this occurred in an 80 node (40 shard) cluster with hundreds of thousands of client connections.
Detailed Description
How/When does go-redis ping servers to determine latency
RouteByLatency
will issue pings to redis servers to determine their latency. It does this by issuing 10 pings to every
server in the cluster.
It does this under several conditions:
- A new OSS cluster client is created.
- A client receives an unexpected
READONLY
reply from a redis client (link) - A client receives a
MOVED
reply from a redis client (link) - Roughly every 10 seconds, anytime the cluster state is required to process a command (link).
Each of these places calls LazyReload
which limits reloading to once every 200ms.
Anytime a reload happens, we load the state and perform GC
one minute after loading the state. This GC
operation will issue a call to update the latencies to each of the servers. (Call to GC, call to update server latency).
Why this happens frequently when all servers in a shard are down?
When all servers in a shard are failed routing to the "closest server" (link) returns a Random server. This random server will not have the slots
associated with the key and will return a MOVED
response triggering the LazyReload
.
Possible Implementation
Open to discussing implementations in more detail and contributing the fix.