On node versions higher than 10.19.0, a socket disconnect can permanently break the internal command_queue state #1593
Comments
I'm running into what I think is this same problem, but it manifests as a Possibly related? #1427
@schancel that does sound similar! I'll see if I can create a fix that covers both cases.
Thank you
hi @jakepruitt, finally I can find someone with the same issue. redis version: 6.0.3
#1603 works as a workaround!
Hi @cit68, do you have a good way to reproduce the issue with node-redis v4? We've tried to reproduce it with node-redis v4 and have not been able to, at least not yet.
Hi 👋, we've seen this problem appear again while running node-redis v4 and node 16. It also appears to be triggered by the same steps as listed in the original post. Below I will list steps to reproduce this problem locally, and the requirements to set it up.
Our exact environment
Prerequisites
Scripts:
echo "Press CTRL+C to exit once complete!"
sleep 2
docker run --name my-redis-container -p 7001:6379 -d redis
run_reboot() {
sleep 3
while :
do
redis-cli -p 7001 CLIENT KILL TYPE normal SKIPME no
sleep 0.0$RANDOM
done
}
run_terminate() {
sleep 1
while :
do
docker restart my-redis-container -t 0
sleep 2
done
}
run_reboot > /dev/null 2>&1 &
p1=$!
run_terminate > /dev/null 2>&1 &
p2=$!
INTERVAL=5 node test.js &
p3=$!
trap 'kill "$p1"; kill "$p2"; kill "$p3"' SIGINT
while kill -s 0 "$p1" || kill -s 0 "$p2" || kill -s 0 "$p3"; do
wait "$p1"; wait "$p2"; wait "$p3"
done &>/dev/null
const redis = require('redis');
const assert = require('assert');
const child_process = require('child_process');
const AssertionError = assert.AssertionError;
const REDIS_HOST = 'localhost';
const REDIS_PORT = '7001';
const buildClient = () => {
const client = redis.createClient({
socket: {
host: REDIS_HOST,
port: REDIS_PORT,
},
return_buffers: true,
});
client
.on('error', console.log)
.on('ready', () => {
console.log('Redis connection established.');
});
return client;
};
let onestring = '', twostring = '', threestring = '';
for (let i = 0; i < 30002; i++) {
onestring += 'one';
twostring += 'two';
threestring += 'three';
}
async function main() {
try {
let client = buildClient();
await client.connect();
setInterval(async () => {
try {
const res1 = await client.set('one', onestring);
if (res1) assert.equal(res1, 'OK');
const res2 = await client.set('two', twostring);
if (res2) assert.equal(res2, 'OK');
const res3 = await client.set('three', threestring);
if (res3) assert.equal(res3, 'OK');
const get1 = await client.get('one');
if (get1) assert.equal(get1.slice(0,10).toString(), 'oneoneoneo');
const get2 = await client.get('two');
if (get2) assert.equal(get2.slice(0,10).toString(), 'twotwotwot');
const get3 = await client.get('three');
if (get3) assert.equal(get3.slice(0,10).toString(), 'threethree');
} catch(e) {
if (e instanceof AssertionError) {
throw e;
}
console.log(e)
}
}, process.env.INTERVAL || 100);
} catch(e) {
console.log(e);
main();
}
}
main();
How to reproduce this issue
(Note: the logging should stop as soon as an assertion error occurs)
Understanding
Thanks!
Hello, any update on this issue? This is affecting one of our applications pretty badly. Is there any workaround while waiting for #1603 to be merged?
Issue
The node-redis client has a race condition in the handling of write error events from the TCP socket if the TCP socket disconnects during the execution of the `internal_send_command()` function, specifically on versions of Node.js >= 10.23.2.
The issue stems from a change in the way error events are emitted from net sockets between 10.19.0 and 10.23.2. Whereas socket `EPIPE` error events are emitted asynchronously on 10.19.0, they are emitted synchronously on higher versions, which means the `internal_send_command` function may wind up executing the `this.flush_and_error()` function within `connection_gone` before the `this.command_queue.push()` call. This means the command queue gets polluted with a stray command that throws off the synchronization between the command queue and the commands sent to redis once redis reconnects.
To be a bit more concrete, here is an example of the same socket disconnect in the same part of the code being handled on 10.19.0 (where the `.flush_and_error()` is called after `.command_queue.push()`) vs 10.23.2, where the `.flush_and_error` happens midway through the code in `internal_send_command` that sends the write to redis.
On 10.19.0:
On node 10.23.2:
I was not able to pinpoint the exact change in libuv that is causing the problem, but it was definitely introduced in libuv between versions 1.25.0 and 1.34.2, since that is the difference between node 10.19.0 and 10.23.2. Without in-depth knowledge, I believe libuv/libuv#2425 may have something to do with it.
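The ordering problem described above can be illustrated with a small toy model. This is only a sketch: `ToyClient`, `brokenStream`, and `queueAfterFailedWrite` are hypothetical names invented for illustration and are not part of node-redis. The model assumes only what the description states, namely that the write happens before the push onto the command queue and that the error handler flushes the queue.

```javascript
// Toy model only: ToyClient is NOT node-redis. It mimics the ordering
// described above (write first, then push onto the command queue).
const { EventEmitter } = require('events');

class ToyClient {
  constructor(stream) {
    this.commandQueue = [];
    this.stream = stream;
    // Stand-in for connection_gone(): drop everything queued so far.
    stream.on('error', () => this.flushAndError());
  }

  flushAndError() {
    this.commandQueue.length = 0;
  }

  sendCommand(name) {
    this.stream.write(name);      // may emit 'error' before returning
    this.commandQueue.push(name); // runs after the write
  }
}

// Hypothetical stream whose write() always fails, either synchronously
// or on a later turn of the event loop.
function brokenStream(sync) {
  const stream = new EventEmitter();
  stream.write = () => {
    const fail = () => stream.emit('error', new Error('EPIPE'));
    if (sync) fail();
    else setImmediate(fail);
  };
  return stream;
}

// Resolves with the queue contents once any deferred error has fired.
function queueAfterFailedWrite(sync) {
  return new Promise((resolve) => {
    const client = new ToyClient(brokenStream(sync));
    client.sendCommand('SET one');
    setImmediate(() => setImmediate(() => resolve([...client.commandQueue])));
  });
}

queueAfterFailedWrite(false).then((q) => console.log('async error:', q));
queueAfterFailedWrite(true).then((q) => console.log('sync error: ', q));
```

With the asynchronous error the flush runs after the push, so the queue ends up empty. With the synchronous error the flush runs on an empty queue, the push happens afterwards, and a stray `SET one` entry survives the disconnect.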
Reproducing
This issue is inherently difficult to reproduce, since the perfect conditions need to be aligned for a write to be happening at the same time as a disconnect. When trying to trigger the event by terminating an active redis-server process, it usually takes me 10-20 tries before I can get an error case. For small writes, this probability is even lower.
I wrote a small script to reproduce this issue:
(sorry for such a pyramid of doom shape to this test).
By running `redis-server` in one terminal and this script in another, with `NODE_DEBUG=redis INTERVAL=10 node script.js`, then terminating and restarting redis-server in the other terminal, I eventually trigger an error on the assert that says:
Since this issue is so difficult to trigger, I usually need to retry the connection severing several times.
Potential solutions
This problem can be solved by adding a `setImmediate()` or `process.nextTick()` into the `stream.on('error')` event handling. I do not know where it would be best to add this, but it seems like there are a few candidates:
- where `on('error')` is fired
- `connection_gone()`
- the `flush_and_error()` call that clears the command queue in `connection_gone()`
- `flush_and_error()`
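As a rough illustration of the last candidate, the sketch below defers the flush with `setImmediate()` inside the flush function itself. `DeferredFlushClient` and the failing stream are hypothetical stand-ins, not node-redis code, and the real `flush_and_error()` may be called from other places that this toy model does not account for.

```javascript
// Toy model only: a hypothetical client showing a deferred flush.
const { EventEmitter } = require('events');

class DeferredFlushClient {
  constructor(stream) {
    this.commandQueue = [];
    this.stream = stream;
    stream.on('error', (err) => this.flushAndError(err));
  }

  flushAndError(err) {
    // Defer the flush so that a sendCommand() call already in progress
    // finishes its push before the queue is cleared.
    setImmediate(() => {
      for (const cmd of this.commandQueue.splice(0)) cmd.reject(err);
    });
  }

  sendCommand(name) {
    return new Promise((resolve, reject) => {
      this.stream.write(name); // may emit 'error' synchronously
      this.commandQueue.push({ name, resolve, reject });
    });
  }
}

// Hypothetical stream whose write() always fails synchronously.
const stream = new EventEmitter();
stream.write = () => {
  stream.emit('error', new Error('EPIPE'));
};

const client = new DeferredFlushClient(stream);
const result = client.sendCommand('SET one').catch((err) => err.message);

// The command is still queued here because the flush has not run yet.
const queuedRightAfterSend = client.commandQueue.length;

result.then((msg) => {
  console.log('rejected with:', msg);
  console.log('queue length after flush:', client.commandQueue.length);
});
```

Because the flush runs on a later turn of the event loop, the in-flight command is pushed first and then rejected along with everything else, so nothing stray is left in the queue when the connection comes back.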
My own moral compass points toward putting it inside the `flush_and_error` handler so that flushing is guaranteed to happen on the next tick, and the command queue would not grow after getting flushed. This would also be the least intrusive and most targeted approach. I could also see the case for putting it at the highest level, since that would effectively make the behavior on node >= 10.23.2 and 10.19.0 equivalent.
Environment
4.0.9 and 6.0.5
Mac and Linux