On node versions higher than 10.19.0, a socket disconnect can permanently break the internal command_queue state #1593

Open
jakepruitt opened this issue Apr 5, 2021 · 8 comments

@jakepruitt

Issue

The node-redis client has a race condition in its handling of write error events from the TCP socket when the socket disconnects during the execution of the internal_send_command() function, specifically on versions of Node.js >= 10.23.2.

The issue stems from a change between Node.js 10.19.0 and 10.23.2 in the way error events are emitted from net sockets. Whereas socket EPIPE error events are emitted asynchronously on 10.19.0, they are emitted synchronously on higher versions, which means this.flush_and_error() may run inside connection_gone() before internal_send_command() reaches its this.command_queue.push() call. The command queue then gets polluted with a stray command, which throws off the synchronization between the command queue and the commands sent to redis once redis reconnects.

To be a bit more concrete, here is the same socket disconnect in the same part of the code, handled on 10.19.0 (where .flush_and_error() is called after .command_queue.push()) versus on 10.23.2 (where .flush_and_error() fires midway through internal_send_command, while the write to redis is being sent).

On 10.19.0:

[Screenshot: Screen_Shot_2021-04-02_at_10_14_30_AM]

On node 10.23.2:

[Screenshot: Screen_Shot_2021-04-02_at_10_09_51_AM]

I was not able to pinpoint the exact change in libuv that is causing the problem, but it was definitely introduced between libuv 1.25.0 and 1.34.2, since those are the versions that ship with node 10.19.0 and 10.23.2. Without in-depth knowledge, I believe libuv/libuv#2425 may have something to do with it.
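
For anyone who wants to see the ordering problem without stepping through the library, here is a minimal, self-contained sketch. This is not node-redis code: commandQueue, flushAndError and internalSendCommand merely stand in for command_queue, flush_and_error() and internal_send_command() as described above.

'use strict';

// commandQueue stands in for client.command_queue, flushAndError for the
// flush_and_error() call made from connection_gone(), and internalSendCommand
// for the write-then-push sequence inside internal_send_command().
const commandQueue = [];

function flushAndError() {
  // connection_gone() errors out and drops every currently queued command.
  commandQueue.length = 0;
}

function writeToBrokenSocket({ emitErrorSynchronously }) {
  // On Node >= 10.23.2 (newer libuv) the EPIPE 'error' event can fire
  // synchronously inside the write call; on 10.19.0 it fired on a later tick.
  if (emitErrorSynchronously) {
    flushAndError();
  } else {
    process.nextTick(flushAndError);
  }
}

function internalSendCommand(command, opts) {
  writeToBrokenSocket(opts);   // 1. write (may run the flush right now)
  commandQueue.push(command);  // 2. push -- too late if the flush already ran
}

internalSendCommand('SET one ...', { emitErrorSynchronously: true });
process.nextTick(() => {
  // One stale entry survives the flush and will be matched against the wrong
  // reply once the client reconnects. With asynchronous emission the push
  // happens first and the later flush errors the command out cleanly.
  console.log('stale commands left in queue:', commandQueue.length); // 1
});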

Reproducing

This issue is inherently difficult to reproduce, since the perfect conditions need to be aligned for a write to be happening at the same time as a disconnect. When trying to trigger the event by terminating an active redis-server process, it usually takes me 10-20 tries before I can get an error case. For small writes, this probability is even lower.

I wrote a small script to reproduce this issue:

'use strict';

const redis = require('.');
const assert = require('assert');

let onestring = '', twostring = '', threestring = '';
for (let i = 0; i < 30002; i++) {
  onestring += 'one';
  twostring += 'two';
  threestring += 'three';
}


const client = redis.createClient(6379, 'localhost', {
  ttl: 86400,
  return_buffers: true,
  timeout: 200,
  retry_strategy: function(options) {
    console.log('Redis retry #', options.attempt);
    // Try to reconnect every 5 seconds
    if (options.error && options.error.code === 'ECONNREFUSED') {
      console.log('Redis server refused the connection');
      return 5000;
    } // If there's some other error, log it and retry
    if (options.error) {
      console.log(options.error);
      return 2000;
      // otherwise, try reconnecting every 2 seconds
    } else return 2000;
  }
})
  .on('error', console.log)
  .on('ready', function() {
    console.log('Redis connection established.');
  });
client.command_queue_high_water = 2000;

setInterval(() => {
  client.set('one', onestring, (err, res) => {
    console.log(`set one err: ${err}`);
    console.log(`set one res: ${res}`);
    if (res) assert.equal(res, 'OK');
    client.set('two', twostring, (err, res) => {
      console.log(`set two err: ${err}`);
      console.log(`set two res: ${res}`);
      if (res) assert.equal(res, 'OK');
      client.set('three', threestring, (err, res) => {
        console.log(`set three err: ${err}`);
        console.log(`set three res: ${res}`);
        if (res) assert.equal(res, 'OK');
        client.get('one', (err, res) => {
          console.log(`get one err: ${err}`);
          console.log(`get one res: ${res && res.slice(0, 10)}`);
          if (res) assert.equal(res.slice(0,10).toString(), 'oneoneoneo');
          client.get('two', (err, res) => {
            console.log(`get two err: ${err}`);
            console.log(`get two res: ${res && res.slice(0, 10)}`);
            if (res) assert.equal(res.slice(0,10).toString(), 'twotwotwot');
            client.get('three', (err, res) => {
              console.log(`get three err: ${err}`);
              console.log(`get three res: ${res && res.slice(0, 10)}`);
              if (res) assert.equal(res.slice(0,10).toString(), 'threethree');
            });
          });
        });
      });
    });
  });
}, process.env.INTERVAL || 100);

(Sorry for the pyramid-of-doom shape of this test.)

By running redis-server in one terminal and this script in another with NODE_DEBUG=redis INTERVAL=10 node script.js, then terminating and restarting redis-server in the first terminal, I can eventually trigger an assertion error:

assert.js:84
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: <Buffer 23 20 53 65 72 76 65 72 0d 0a 72 65 64 69 73 5f 76 65 72 73 69 6f 6e 3a 34 2e 30 2e 39 0d 0a 72 65 64 69 73 5f 67 69 ... == 'OK'
    at Command.client.set [as callback] (/Users/jake/Projects/node-redis/script.js:43:21)

Since this issue is so difficult to trigger, I usually need to retry the connection severing several times.

Potential solutions

This problem can be solved by adding a setImmediate() or process.nextTick() into the stream.on('error') event handling. I do not know where it would be best to add this, but there are a few candidate locations in the error-handling path.

My own moral compass points toward putting it inside the flush_and_error handler, so that flushing is guaranteed to happen on the next tick and the command queue cannot gain a stray entry after being flushed. This would also be the least intrusive and most targeted approach. I could also see the case for putting it at the highest level of the stream 'error' handling, since that would effectively make the behavior on node >= 10.23.2 equivalent to that of 10.19.0.
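
As a very rough illustration of the first option, here is a hedged, untested sketch that defers the flush by one tick via a monkey-patch from application code. The (error_attributes, options) parameter names are assumed from the node-redis 2.x/3.x internals and may not match exactly; an in-library fix would look different.

'use strict';

const redis = require('redis'); // node-redis 2.x/3.x
const { RedisClient } = redis;

// Defer flush_and_error() to the next tick so that a synchronously emitted
// socket 'error' cannot empty command_queue in the middle of
// internal_send_command(). Parameter names below are assumed, not verified.
const originalFlushAndError = RedisClient.prototype.flush_and_error;

RedisClient.prototype.flush_and_error = function (error_attributes, options) {
  process.nextTick(() => {
    originalFlushAndError.call(this, error_attributes, options);
  });
};

Note that a patch like this defers every flush by one tick, not only the flushes triggered by synchronous socket errors.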

Environment

  • Node.js Version: I can reproduce this on 10.23.2, node 12, and node 14. I do not see this issue on node 10.19.0. The key component seems to be a change in libuv to the way sockets emit events.
  • Redis Version: Able to reproduce with 4.0.9 and 6.0.5
  • node-redis Version: Originally detected on 2.8.0, able to reproduce on 3.x as well.
  • Platform: Mac and Linux
@schancel

schancel commented Apr 15, 2021

I'm running into what I think is this same problem, but it manifests as a callback undefined error.

Possibly related? #1427

@jakepruitt
Author

@schancel that does sound similar! I'll see if I can create a fix that covers both cases.

@shammahatdivvy

Thank you

@cit68

cit68 commented Jan 3, 2022

Hi @jakepruitt, I finally found someone with the same issue.
Can you suggest a workaround until it is officially fixed?

redis version: 6.0.3
node-redis version: 4.0.1
node version: 14.15.1

@jakepruitt
Author

#1603 works as a workaround!

@OzQu

OzQu commented Feb 9, 2022

Hi @cit68, do you have a good way to reproduce the issue with node-redis v4? We've tried to reproduce it with node-redis v4 and have not been able to, at least not yet.

@nicklvsa

Hi 👋, we've seen this problem appear again while running node-redis v4 and node 16. It appears to be triggered by the same steps as listed in the original post. Below are the steps to reproduce this problem locally, along with the requirements to get it set up.

Our exact environment

  • Node runtime version: v16.13.2
  • Node-redis version: v4.0.3
  • Platform: MacOS & Linux

Prerequisites

  • NodeJS (and related dependencies of the node script listed below)
  • Docker

Scripts:

  1. Bash run script (run.sh):
echo "Press CTRL+C to exit once complete!"
sleep 2 
docker run --name my-redis-container -p 7001:6379 -d redis

run_reboot() {
    sleep 3
    while :
    do
        redis-cli -p 7001 CLIENT KILL TYPE normal SKIPME no
        sleep 0.0$RANDOM
    done
}

run_terminate() {
    sleep 1
    while :
    do
        docker restart my-redis-container -t 0
        sleep 2
    done
}

run_reboot > /dev/null 2>&1 &
p1=$!
run_terminate > /dev/null 2>&1 &
p2=$!
INTERVAL=5 node test.js &
p3=$!

trap 'kill "$p1"; kill "$p2"; kill "$p3"' SIGINT
while kill -s 0 "$p1" || kill -s 0 "$p2" || kill -s 0 "$p3"; do
    wait "$p1"; wait "$p2"; wait "$p3"
done &>/dev/null
  2. NodeJS client script (test.js):
const redis = require('redis');
const assert = require('assert');
const child_process = require('child_process');
const AssertionError = assert.AssertionError;

const REDIS_HOST = 'localhost';
const REDIS_PORT = '7001';

const buildClient = () => {
    const client = redis.createClient({
        socket: {
            host: REDIS_HOST,
            port: REDIS_PORT,
        },
        return_buffers: true,
    });

    client
        .on('error', console.log)
        .on('ready', () => {
            console.log('Redis connection established.');
        });
    
    return client;
};

let onestring = '', twostring = '', threestring = '';
for (let i = 0; i < 30002; i++) {
    onestring += 'one';
    twostring += 'two';
    threestring += 'three';
}

async function main() {
    try { 
        let client = buildClient();
        await client.connect();

        setInterval(async () => {
            try {
                const res1 = await client.set('one', onestring);

                if (res1) assert.equal(res1, 'OK');

                const res2 = await client.set('two', twostring);
                if (res2) assert.equal(res2, 'OK');

                const res3 = await client.set('three', threestring);
                if (res3) assert.equal(res3, 'OK');

                const get1 = await client.get('one');
                if (get1) assert.equal(get1.slice(0,10).toString(), 'oneoneoneo');

                const get2 = await client.get('two');
                if (get2) assert.equal(get2.slice(0,10).toString(), 'twotwotwot');

                const get3 = await client.get('three');
                if (get3) assert.equal(get3.slice(0,10).toString(), 'threethree');

            } catch (e) {
                if (e instanceof AssertionError) {
                    throw e;
                }
                console.log(e);
            }
        }, process.env.INTERVAL || 100);
    } catch(e) {
        console.log(e);
        main();
    }
}

main();

How to reproduce this issue

  1. Ensure the above scripts and their dependencies are working / installed locally.
  2. Ensure run.sh has permission to run (chmod +x run.sh).
  3. Run ./run.sh and let the script run until you see an exception similar to the one listed below:
node:internal/process/promises:246
          triggerUncaughtException(err, true /* fromPromise */);
          ^

AssertionError [ERR_ASSERTION]: 'oneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneoneo... == 'OK'

(Note: the logging should stop as soon as an assertion error occurs.)
Once you see the error above, the problem has been reproduced and you can send a SIGINT to the process via CTRL+C.

Understanding

  • We are pretty sure the cause of this is at least similar in nature to the original issue described in the first post of this thread. We wanted to see if anyone on the node-redis contributing team has run into this issue, or has any insight into a possible fix for the library.
  • We currently have a work-around fix that is external to the node-redis library on our end, but a patch to this library would be the ultimate fix and would allow us to remove some temporary work-arounds.

Thanks!

@System-Glitch

Hello, any update on this issue? This is affecting one of our applications pretty badly. Any workaround while waiting for #1603 to be merged?
