Description
Describe the bug
I'm seeing an issue where ingesters sometimes fail to leave the ring. This happens regardless of which KV store is used. It looks as though there is a race condition between closing the lifecycler loop and leaving the ring; a minimal sketch of the suspected ordering issue is included after the logs below. Example logs using etcd as the KV store:
cortex-ingester-5 cortex level=info ts=2021-09-08T17:47:21.918300489Z caller=lifecycler.go:754 msg="changing instance state from" old_state=ACTIVE new_state=LEAVING ring=ingester
cortex-ingester-5 cortex {"level":"warn","ts":"2021-09-08T17:47:42.803Z","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008901c0/#initially=[cortex-etcd-0.cortex-etcd:2379;cortex-etcd-1.cortex-etcd:2379;cortex-etcd-2.cortex-etcd:2379]","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}
To Reproduce
Steps to reproduce the behavior:
I've been able to reproduce it starting from a completely blank deployment: spin up some ingesters and connect them to the ring, then do a rolling restart on them and everything looks good — every ingester leaves the ring and rejoins properly. After that rolling restart is done, do another rolling restart and some ingesters fail to leave the ring. It doesn't matter whether I use memberlist, etcd, or consul.
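Not part of the original repro steps, but a small Go helper like the following can make the stuck state easier to spot after each rolling restart. It polls a ring status page and reports entries that are not ACTIVE; the URL, port, and the exact strings on the page are assumptions about a typical deployment (Cortex exposes a ring status page on the distributor), so adjust them for your setup.

```go
// Hypothetical helper: poll the ring status page and flag stuck instances.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

func main() {
	// Assumption: ring page exposed by the distributor on its HTTP port.
	const ringURL = "http://cortex-distributor:8080/ring"

	for i := 0; i < 30; i++ {
		resp, err := http.Get(ringURL)
		if err != nil {
			fmt.Println("ring page not reachable:", err)
			time.Sleep(10 * time.Second)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		page := string(body)
		if strings.Contains(page, "LEAVING") || strings.Contains(page, "Unhealthy") {
			fmt.Println("found an ingester stuck in a LEAVING/Unhealthy state")
		} else {
			fmt.Println("all ring entries look ACTIVE")
		}
		time.Sleep(10 * time.Second)
	}
}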
Expected behavior
Ingesters should leave the ring no matter how many times they are restarted when unregister-on-shutdown is set to true.
Environment:
- Infrastructure: Kubernetes
- Deployment tool: N/A
Storage Engine
- Blocks
- Chunks
Additional Context
I found this bug while testing a lower replication factor. I suspect most deployments miss it because a replication factor of 3 with extend-writes hides the issue. With a lower replication factor, if an ingester fails to leave the ring, all writes fail.