Description
Describe the bug
When performing a scale-up of store-gateway pods followed by a scale-down, memberlist entries of the deleted store-gateway pods sporadically re-appear a few hours later as unhealthy in the memberlist ring.
The system does not recover from these ghost entries; they appear and disappear at random.
In our case we scaled from 12 to 80 replicas and back to 12, but this also happens with smaller scale-ups.
We verified that every unhealthy entry reported by the metrics references a store-gateway pod that no longer exists.
This is indicated in the logs by messages like:
msg="auto-forgetting instance from the ring because it is unhealthy for a long time" instance=store-gateway-15
To Reproduce
Steps to reproduce the behavior:
- Start Cortex, using memberlist for store-gateway ring (efd1de4)
- Scale up the store-gateway deployment
- Scale down the store-gateway deployment (a scripted sketch of these two steps follows this list)
- Keep the k8s cluster running for a few hours
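A hypothetical way to script the two scaling steps with client-go (the namespace cortex, the deployment name store-gateway, and the 15-minute wait are illustrative assumptions, not exact values from our setup; requires a context-aware client-go, v0.18 or newer):

```go
// Sketch of the scale-up/scale-down reproduction via the Kubernetes API.
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// setReplicas updates the scale subresource of a Deployment.
func setReplicas(ctx context.Context, cs *kubernetes.Clientset, ns, name string, replicas int32) error {
	scale, err := cs.AppsV1().Deployments(ns).GetScale(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = replicas
	_, err = cs.AppsV1().Deployments(ns).UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// Scale 12 -> 80, let the ring settle, then back down to 12.
	if err := setReplicas(ctx, cs, "cortex", "store-gateway", 80); err != nil {
		log.Fatal(err)
	}
	time.Sleep(15 * time.Minute)
	if err := setReplicas(ctx, cs, "cortex", "store-gateway", 12); err != nil {
		log.Fatal(err)
	}
	// Then leave the cluster running for a few hours and watch the ring page.
}
```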
Relevant section of the Cortex configuration:
memberlist:
  bind_port: 7946
  join_members:
    - distributor-memberlist.cortex.svc.cluster.local:7946
    - compactor-memberlist.cortex.svc.cluster.local:7946
  abort_if_cluster_join_fails: false
  rejoin_interval: 10m
  left_ingesters_timeout: 20m

store_gateway:
  sharding_enabled: true
  sharding_strategy: default
  sharding_ring:
    kvstore:
      store: memberlist
      prefix: store-gateway-v1/
    heartbeat_timeout: 10m
    zone_awareness_enabled: true
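For reference, a minimal Go sketch (ad-hoc structs of our own, not Cortex's config types) that pulls the timing-related settings out of the snippet above for easier comparison:

```go
// Load the timing-related fields from the configuration snippet above.
// Struct and field names here are ours, chosen only for this sketch.
package main

import (
	"fmt"
	"log"

	"gopkg.in/yaml.v2"
)

const raw = `
memberlist:
  rejoin_interval: 10m
  left_ingesters_timeout: 20m
store_gateway:
  sharding_ring:
    heartbeat_timeout: 10m
`

type config struct {
	Memberlist struct {
		RejoinInterval       string `yaml:"rejoin_interval"`
		LeftIngestersTimeout string `yaml:"left_ingesters_timeout"`
	} `yaml:"memberlist"`
	StoreGateway struct {
		ShardingRing struct {
			HeartbeatTimeout string `yaml:"heartbeat_timeout"`
		} `yaml:"sharding_ring"`
	} `yaml:"store_gateway"`
}

func main() {
	var c config
	if err := yaml.Unmarshal([]byte(raw), &c); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("rejoin_interval=%s left_ingesters_timeout=%s heartbeat_timeout=%s\n",
		c.Memberlist.RejoinInterval,
		c.Memberlist.LeftIngestersTimeout,
		c.StoreGateway.ShardingRing.HeartbeatTimeout)
}
```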
Expected behavior
Inspecting the Cortex store-gateway ring status over the lifetime of the cluster, it should not contain unhealthy entries for deleted store-gateway pods.
Environment:
- Infrastructure: Kubernetes
- Deployment tool: helm, custom chart
Storage Engine
- Blocks
- Chunks
Additional Context
#3603 was a PR intended to fix this, but it seems it does not cover some edge cases.