`RouteByLatency` leads to significant ping load when all servers of a shard are failed

When the OSS ClusterClient uses the `RouteByLatency` option and a shard is completely down, it will ping every server in the cluster 10 times as much as every 200ms. This can lead to millions of pings per second for production Redis clusters with a large number of clients.

## Expected Behavior
`RouteByLatency` should limit pings to servers only when latency to servers are expected to have change or when a server does not yet have a latency measurement.

## Current Behavior

`RouteByLatency` can ping as much as up to 10 times to every server every 200 ms, and will ping at this frequency when all servers of a shard are marked as failing. This happens for every cluster client.

## Possible Solution
The behavior to reload the state when there are failovers/topology changes is advantageous as we want the cluster client to have the latest cluster state; this seems to be the only way a server can go from a "failing" state back to a healthy state. Often servers that were healthy before and after the state refresh and were present in both will not have changes in their latency, but servers that may be recovering from failures may be moved in a cloud service provider and have different latencies.

We can consider splitting the conditions for probing latencies of servers - on topology changes/failovers, probe the latency of servers that are new or were previously failed in the state. Otherwise, probe the latency of servers only if the previous latency measurement is sufficiently old (> 10s).

## Steps to Reproduce

1. Create a OSS Redis cluster with 3 shards and no read replicas. 
2. Create a test client that issues requests to every shard, [example test client gist](https://gist.github.com/justinmir/fcf949544d7f11f43d56ab0121449b36).
3. Shut down redis on a single server in the 3 server cluster - failing one third of the slots.
4. Observe the ping rate to the redis cluster.

**Change in ping rate from a single cluster client.**
![image](https://github.com/redis/go-redis/assets/8886628/e0097694-b84d-449a-a6d6-696fe641777b)

**Production cluster ping rate during a shard failure.**
![image](https://github.com/redis/go-redis/assets/8886628/6de27d7e-7144-4951-8884-92cb0c122381)

## Context (Environment)
Use of `RouteByLatency` introduces additional risk to production clusters, it can limit availability of requests
to all shards when a single shard fails.

In a production incident, this occurred in an 80 node (40 shard) cluster with hundreds of thousands of client connections.

## Detailed Description

**How/When does go-redis ping servers to determine latency**
`RouteByLatency` will issue pings to redis servers to determine their latency. It does this by issuing 10 pings to every
server in the cluster.

It does this under several conditions:
1. A new OSS cluster client is created.
2. A client receives an unexpected `READONLY` reply from a redis client ([link](https://github.com/redis/go-redis/blob/master/osscluster.go#L946))
3. **A client receives a `MOVED` reply from a redis client ([link](https://github.com/redis/go-redis/blob/master/osscluster.go#L964))**
4. Roughly every 10 seconds, anytime the cluster state is required to process a command ([link](https://github.com/redis/go-redis/blob/master/osscluster.go#L825)).

Each of these places calls `LazyReload` which limits reloading to [once every 200ms](https://github.com/redis/go-redis/blob/master/osscluster.go#L814).

Anytime a reload happens, we load the state and perform `GC` one minute after loading the state. This `GC` operation will issue a call to update the latencies to each of the servers. ([Call to GC](https://github.com/redis/go-redis/blob/master/osscluster.go#L656), [call to update server latency](https://github.com/redis/go-redis/blob/master/osscluster.go#L491)).

**Why this happens frequently when all servers in a shard are down?**
When all servers in a shard are failed routing to the "closest server" ([link](https://github.com/redis/go-redis/blob/master/osscluster.go#L725)) returns a Random server. This random server will not have the slots
associated with the key and will return a `MOVED` response triggering the `LazyReload`.

## Possible Implementation
Open to discussing implementations in more detail and contributing the fix.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`RouteByLatency` leads to significant ping load when all servers of a shard are failed #2782

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce

Context (Environment)

Detailed Description

Possible Implementation

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

RouteByLatency leads to significant ping load when all servers of a shard are failed #2782

Description

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce

Context (Environment)

Detailed Description

Possible Implementation

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`RouteByLatency` leads to significant ping load when all servers of a shard are failed #2782