Skip to content

Retries may be done using broken connections from the pool #1737

@julianbrost

Description

@julianbrost

Expected Behavior

When restarting the Redis server, due to the default of MaxRetries: 3, I'd expect go-redis to attempt to reconnect to the server and retry the query.

Current Behavior

If there are enough old connections in the connection pool, all retry attempts are done on broken connections and an error is returned without even attempting to reconnect.

Possible Solution

Some options:

  1. Perform a ping on the connection before reusing it (adds latency if the connection is fine)
  2. Don't count retries on connections from the pool towards the retry limit (might lead to a huge number of retries for a single query)
  3. Clear the pool on errors (might be overly pessimistic)
  4. Always use a fresh connection for retries

Steps to Reproduce

The following code reproduces the issue. Instead of restarting the server, it sets a dialer that simply closes each connection after 5 seconds.

package main

import (
	"context"
	"github.com/go-redis/redis/v8"
	"log"
	"net"
	"time"
)

func main() {
	ctx := context.Background()

	// Redis client with a dialer that kills connections after 5 seconds (to simulate a server restart).
	client := redis.NewClient(&redis.Options{
		Dialer: func(ctx context.Context, network, addr string) (net.Conn, error) {
			log.Print("Dialer called")
			conn, err := net.Dial(network, addr)
			if err == nil {
				go func() {
					time.Sleep(5 * time.Second)
					conn.Close()
				}()
			}
			return conn, err
		},
	})

	// Function to ping the server and log errors if they occur.
	ping := func() {
		_, err := client.Ping(ctx).Result()
		if err != nil {
			log.Print(err)
		}
	}

	// Perform some pings to fill the connection pool.
	for i := 0; i < 10; i++ {
		go ping()
	}

	// Wait for connections to die.
	time.Sleep(10 * time.Second)

	// Perform another ping and log pool stats before and after.
	log.Printf("%#v", client.PoolStats())
	ping()
	log.Printf("%#v", client.PoolStats())
}

Example output (note that the dialer is not called for the last ping and the hit count in the pool stats increases by 4, i.e. the initial attempt and all 3 retries were done using stale connections from the pool):

2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:41:00 &redis.PoolStats{Hits:0x0, Misses:0xa, Timeouts:0x0, TotalConns:0xa, IdleConns:0xa, StaleConns:0x0}
2021/04/23 13:41:00 set tcp 127.0.0.1:45526: use of closed network connection
2021/04/23 13:41:00 &redis.PoolStats{Hits:0x4, Misses:0xa, Timeouts:0x0, TotalConns:0x6, IdleConns:0x6, StaleConns:0x0}

You get similar behavior when you restart Redis at the right time instead of using this dialer. In this case, the error will be EOF instead.

When adding MaxRetries: 20 to the redis.Options, the last 3 lines of the output look like this instead (note that there were 10 hits and then 1 miss that called the dialer):

2021/04/23 13:48:46 &redis.PoolStats{Hits:0x0, Misses:0xa, Timeouts:0x0, TotalConns:0xa, IdleConns:0xa, StaleConns:0x0}
2021/04/23 13:48:49 Dialer called
2021/04/23 13:48:49 &redis.PoolStats{Hits:0xa, Misses:0xb, Timeouts:0x0, TotalConns:0x1, IdleConns:0x1, StaleConns:0x0}

Context (Environment)

I want my application to gracefully handle Redis starts and had hoped that this was handled automatically by go-redis due to the default of MaxRetries: 3.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions