Description
Expected Behavior
When restarting the Redis server, due to the default of MaxRetries: 3, I'd expect go-redis to attempt to reconnect to the server and retry the query.
Current Behavior
If there are enough old connections in the connection pool, all retry attempts are done on broken connections and an error is returned without even attempting to reconnect.
Possible Solution
Some options:
- Perform a ping on the connection before reusing it (adds latency if the connection is fine; see the sketch after this list)
- Don't count retries on connections from the pool towards the retry limit (might lead to a huge number of retries for a single query)
- Clear the pool on errors (might be overly pessimistic)
- Always use a fresh connection for retries
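Purely to make the first option concrete, here is a rough sketch of where such a liveness probe would sit and what it costs; the pool and pooledConn interfaces below are hypothetical stand-ins, not go-redis internals.

package sketch

import "context"

// Hypothetical, minimal view of a connection pool; not the go-redis API.
type pooledConn interface {
	Ping(ctx context.Context) error
}

type pool interface {
	// Get returns a connection and reports whether it was reused from the pool.
	Get(ctx context.Context) (cn pooledConn, reused bool, err error)
	Remove(cn pooledConn)
}

// getHealthy sketches option 1: probe reused connections with a cheap ping
// before handing them out and discard any that fail the probe. The price is
// one extra round trip per command even when the connection is fine.
func getHealthy(ctx context.Context, p pool) (pooledConn, error) {
	for {
		cn, reused, err := p.Get(ctx)
		if err != nil {
			return nil, err
		}
		if !reused {
			return cn, nil // freshly dialed connections need no probe
		}
		if pingErr := cn.Ping(ctx); pingErr != nil {
			p.Remove(cn) // stale (e.g. killed by a restart): drop it and try the next one
			continue
		}
		return cn, nil
	}
}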
Steps to Reproduce
The following code reproduces the issue. Instead of restarting the server, it sets a dialer that simply closes each connection after 5 seconds.
package main

import (
	"context"
	"log"
	"net"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()

	// Redis client with a dialer that kills connections after 5 seconds (to simulate a server restart).
	client := redis.NewClient(&redis.Options{
		Dialer: func(ctx context.Context, network, addr string) (net.Conn, error) {
			log.Print("Dialer called")
			conn, err := net.Dial(network, addr)
			if err == nil {
				go func() {
					time.Sleep(5 * time.Second)
					conn.Close()
				}()
			}
			return conn, err
		},
	})

	// Function to ping the server and log errors if they occur.
	ping := func() {
		_, err := client.Ping(ctx).Result()
		if err != nil {
			log.Print(err)
		}
	}

	// Perform some pings to fill the connection pool.
	for i := 0; i < 10; i++ {
		go ping()
	}

	// Wait for connections to die.
	time.Sleep(10 * time.Second)

	// Perform another ping and log pool stats before and after.
	log.Printf("%#v", client.PoolStats())
	ping()
	log.Printf("%#v", client.PoolStats())
}
Example output (note that the dialer is not called for the last ping and the hit count in the pool stats increases by 4, i.e. the initial attempt and all 3 retries were done using stale connections from the pool):
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:41:00 &redis.PoolStats{Hits:0x0, Misses:0xa, Timeouts:0x0, TotalConns:0xa, IdleConns:0xa, StaleConns:0x0}
2021/04/23 13:41:00 set tcp 127.0.0.1:45526: use of closed network connection
2021/04/23 13:41:00 &redis.PoolStats{Hits:0x4, Misses:0xa, Timeouts:0x0, TotalConns:0x6, IdleConns:0x6, StaleConns:0x0}
You get similar behavior when you restart Redis at the right time instead of using this dialer. In this case, the error will be EOF instead.
When adding MaxRetries: 20 to the redis.Options, the last 3 lines of the output look like this instead (note that there were 10 hits and then 1 miss that called the dialer):
2021/04/23 13:48:46 &redis.PoolStats{Hits:0x0, Misses:0xa, Timeouts:0x0, TotalConns:0xa, IdleConns:0xa, StaleConns:0x0}
2021/04/23 13:48:49 Dialer called
2021/04/23 13:48:49 &redis.PoolStats{Hits:0xa, Misses:0xb, Timeouts:0x0, TotalConns:0x1, IdleConns:0x1, StaleConns:0x0}
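For reference, that variant only changes the client construction in the reproducer; a minimal sketch, assuming the connection-killing dialer closure from above is stored in a dialer variable:

client := redis.NewClient(&redis.Options{
	Dialer:     dialer, // same connection-killing dialer as in the reproducer
	MaxRetries: 20,     // more retries than there are stale connections in the pool
})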
Context (Environment)
I want my application to gracefully handle Redis restarts and had hoped that this was handled automatically by go-redis due to the default of MaxRetries: 3.
Activity
vmihailenco commented on Apr 23, 2021
All these options look worse than the current behavior. Even with large pools like 100 connections, we are talking only about 100/3 = 33 failed commands.
Al2Klimov commented on Apr 23, 2021
Could you please explain why exactly you don't prefer them, especially the first one?
Why not check whether a connection is sane before using it "in production"?
vmihailenco commented on Apr 23, 2021
Because that means executing a ping command for each command we want to process. That means 2x latency and 2x the number of commands.
Al2Klimov commented on Apr 23, 2021
And the last suggested solution?
If and only if something went wrong and we have to retry, do it with a new connection to avoid the risk of this issue?
vmihailenco commented on Apr 25, 2021
That is the most reasonable option, if you can provide arguments why we should add some special logic to handle retries. As I said, we are talking about 100/3 = 33 failed commands. In practice even that is an exaggeration, because there is a small delay before a command is retried. So it is unlikely that the command will be retried using bad connections.
julianbrost commented on Apr 26, 2021
Maybe I don't get the purpose of the retry functionality in go-redis. I'd expect it to handle situations like a Redis restart where all existing connections fail simultaneously, so that I don't have to handle this in my application code. The current implementation leads to errors returned from go-redis in situations where I would not expect them.
That's only the case if you perform lots of concurrent actions on the pool. In situations where only a few goroutines interact with Redis, it is quite likely that you receive an error from go-redis even though an attempt on a fresh connection would have worked perfectly fine. Reducing the pool size is not an option if the amount of concurrent Redis queries changes over time.
Al2Klimov commented on Apr 26, 2021
Well... unlikely != impossible. So I'd like to reduce 33, or even 3, to 0.
Pro: on Redis restart, commands are not just unlikely, but definitely not retried using bad connections. This increases app stability. Also, apps don't have to use pool size + x for max retries, and Redis restart recoveries don't take much longer.
Contra: you possibly re-dial unnecessarily, but if one connection breaks, the others likely do as well (Redis restart or similar), so this should not hurt much.
Shall I make a PR?
vmihailenco commented on May 5, 2021
This is what database/sql does - https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1808-L1816. You will have to
It is going to be non-trivial, but feel free to give it a try :)
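For context, the retry loop in the linked database/sql code boils down to roughly the following pattern (a simplified paraphrase, not the actual implementation): try a cached/pooled connection a bounded number of times, and if every attempt fails with driver.ErrBadConn, make one last attempt that forces a brand-new connection.

package sketch

import (
	"database/sql/driver"
	"errors"
)

const maxBadConnRetries = 2

// retry paraphrases the database/sql pattern: op is told whether it may reuse
// a cached connection or must dial a new one.
func retry(op func(forceNewConn bool) error) error {
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		err = op(false) // reuse a cached/pooled connection if one is available
		if !errors.Is(err, driver.ErrBadConn) {
			return err
		}
	}
	// Every cached attempt hit a bad connection: last resort on a fresh connection.
	return op(true)
}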
Al2Klimov commented on May 5, 2021
And the much simpler suggestion (#1737 (comment)) is not an option for you?
vmihailenco commented on May 5, 2021
database/sql does a similar thing, but only as an extra last retry. The implementation cost will not be any different, and it will match the database/sql behavior, which is a nice thing.
Al2Klimov commented on May 5, 2021
Problems aren’t stop signs, but signposts (for me). :)
Al2Klimov commented on May 5, 2021
Hm... the command processing path is kinda complex (cmdable -> process -> _process -> withConn -> With* -> ...).
Where would you like to see the check of whether the just-returned error is X/Y/Z, along with returning ErrBadConn?
I'd rather not just shoot in the dark by myself, as you may say afterwards: err... no.
Or: do we even need that and ErrBadConn itself? Can't we just say retryable error -> bad conn?
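To make that question concrete, a classification along those lines might look something like the helper below; the error set is taken from the symptoms in this issue (EOF, "use of closed network connection") and is an assumption, not what go-redis does internally.

package sketch

import (
	"errors"
	"io"
	"net"
	"syscall"
)

// isBadConn is a hypothetical classifier: errors that strongly suggest the
// pooled connection itself is dead, so any retry should use a fresh connection.
func isBadConn(err error) bool {
	return errors.Is(err, io.EOF) || // server closed the connection (e.g. restart)
		errors.Is(err, net.ErrClosed) || // "use of closed network connection"
		errors.Is(err, syscall.ECONNRESET) ||
		errors.Is(err, syscall.EPIPE)
}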
ryanschneider commented on Jun 25, 2021
FWIW I can very reliably reproduce this issue on AWS ElastiCache using a small Redis instance w/ primary and replica in non-clustered mode with the default parameter group. Importantly, this sets close-on-replica-write to true. I then have multiple goroutines reading and writing from the Redis server and initiate a failover. Invariably this leads to read-only commands like HKEYS failing with an EOF error, presumably since they are reusing connections that were closed due to the close-on-replica-write setting but left in the pool.
ryanschneider commented on Jun 25, 2021
What about adding a user-supplied OnReuseConnection hook, like OnConnect, that is called before a connection is reused from the pool?
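For comparison, the hook that exists today is OnConnect, which only fires when a new connection is dialed; that gap is exactly what the proposed OnReuseConnection hook would fill. A minimal sketch of the existing v8 hook, reusing the imports from the reproducer above:

client := redis.NewClient(&redis.Options{
	// OnConnect runs once per newly dialed connection, not for connections
	// handed back out of the pool.
	OnConnect: func(ctx context.Context, cn *redis.Conn) error {
		log.Print("new connection established")
		return cn.Ping(ctx).Err() // reject the new connection if it is not usable
	},
})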
config.Redis#NewClient(): work-around redis/go-redis#1737
github-actions commented on Aug 5, 2025
This issue has been automatically marked as stale due to inactivity.
It will be closed in 30 days if no further activity occurs.
If you believe this issue is still relevant, please add a comment to keep it open.
ndyakov commented on Aug 6, 2025
The connection health is checked; we have to verify if this is sufficient. @julianbrost if you are still experiencing this with 9.12.0, let us know.
julianbrost commented on Aug 19, 2025
Just tried my reproducer with v9.12.1 and now the dialer is called again so a fresh connection is created. Output now looks like this:
So looks like it's fixed, thank you!