Unable to complete query with a single unavailable store-gateway, with shuffle-sharding and zone-awareness

**Describe the bug**
Queriers do not seem to handle store-gateway failures gracefully, even with zone-awareness and replication enabled.

I have 3 store-gateways, each in a different AZ, and a replication factor of 3. If the queriers lose their network connection to one store-gateway, or a store-gateway fails to respond for any reason, the queriers respond with a 5xx.

I'm wondering if this is the expected failure mode, or a misconfiguration on my side.

In https://github.com/cortexproject/cortex/blob/464c4243311eb727395faa0bd2a5ea5f75965250/pkg/querier/blocks_store_queryable.go#L694 , it seems that if we fail to query any one store-gateway, we fail the whole call, even if we have successfully gathered blocks from the other store-gateways. Since our replication factor is 3, I do expect the querier to query all store-gateways, but a single store-gateway failure should not result in a 5xx on the query.

If we are failing a query because a single store-gateway failed, can we instead return the blocks that were successfully queried, and have `queryWithConsistencyCheck` https://github.com/cortexproject/cortex/blob/464c4243311eb727395faa0bd2a5ea5f75965250/pkg/querier/blocks_store_queryable.go#L503 reattempt the missing blocks on other store-gateways?
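To make the proposal concrete, here is a minimal Go sketch of the behaviour I have in mind. It is not Cortex code: `fetchFunc`, `fetchTolerant`, `queryWithRetry`, `missing`, and the gateway and block names are all hypothetical, and the replica lookup is a plain map rather than the store-gateway ring. The idea is only that a per-gateway error is recorded instead of aborting the call, and the blocks that were not served are retried on a replica that has not been tried yet.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errUnavailable simulates the gRPC "Unavailable" error from the report.
var errUnavailable = errors.New("rpc error: code = Unavailable")

// fetchFunc is a hypothetical stand-in for a store-gateway series request.
// It returns the IDs of the blocks the gateway actually served.
type fetchFunc func(gateway string, blockIDs []string) ([]string, error)

// fetchTolerant queries every gateway in the plan. Instead of failing on the
// first error, it returns the blocks that were served plus per-gateway errors.
func fetchTolerant(plan map[string][]string, fetch fetchFunc) ([]string, map[string]error) {
	var queried []string
	errs := map[string]error{}
	for gw, blocks := range plan {
		got, err := fetch(gw, blocks)
		if err != nil {
			errs[gw] = err // remember the failure, keep going
			continue
		}
		queried = append(queried, got...)
	}
	sort.Strings(queried)
	return queried, errs
}

// missing returns the blocks in want that are not present in got.
func missing(want, got []string) []string {
	seen := map[string]bool{}
	for _, b := range got {
		seen[b] = true
	}
	var out []string
	for _, b := range want {
		if !seen[b] {
			out = append(out, b)
		}
	}
	return out
}

// queryWithRetry retries blocks that were not served on the next replica that
// has not been attempted yet, up to maxAttempts rounds.
func queryWithRetry(blocks []string, replicas map[string][]string, fetch fetchFunc, maxAttempts int) ([]string, error) {
	tried := map[string]map[string]bool{} // blockID -> gateways already attempted
	remaining := append([]string(nil), blocks...)
	var queried []string

	for attempt := 0; attempt < maxAttempts && len(remaining) > 0; attempt++ {
		plan := map[string][]string{}
		for _, b := range remaining {
			for _, gw := range replicas[b] {
				if tried[b][gw] {
					continue
				}
				if tried[b] == nil {
					tried[b] = map[string]bool{}
				}
				tried[b][gw] = true
				plan[gw] = append(plan[gw], b)
				break
			}
		}
		if len(plan) == 0 {
			break // no untried replica left for any remaining block
		}
		got, _ := fetchTolerant(plan, fetch)
		queried = append(queried, got...)
		remaining = missing(remaining, got)
	}
	if len(remaining) > 0 {
		return queried, fmt.Errorf("blocks still missing after retries: %v", remaining)
	}
	sort.Strings(queried)
	return queried, nil
}

func main() {
	// Each block is replicated to all three zones (replication factor 3).
	replicas := map[string][]string{
		"block-1": {"sg-zone-a", "sg-zone-b", "sg-zone-c"},
		"block-2": {"sg-zone-b", "sg-zone-c", "sg-zone-a"},
	}
	// sg-zone-b is unreachable, as in the reproduction steps.
	fetch := func(gw string, blocks []string) ([]string, error) {
		if gw == "sg-zone-b" {
			return nil, errUnavailable
		}
		return blocks, nil
	}
	got, err := queryWithRetry([]string{"block-1", "block-2"}, replicas, fetch, 3)
	fmt.Println(got, err)
}
```

In this sketch the first round serves `block-1` from `sg-zone-a` and fails on `sg-zone-b`, and the second round picks up `block-2` from `sg-zone-c`, so the query succeeds. It only errors out once a block has no untried replica left.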

**To Reproduce**
Steps to reproduce the behavior:
1. Start Cortex (3291733c24b77f666dec7a6b632eec285abef44c)
2. Configure the store-gateway with shuffle-sharding and zone-awareness:

```
store_gateway:
  sharding_enabled: true
  sharding_ring:
    replication_factor: 3
    zone_awareness_enabled: true
    instance_availability_zone: AZ
```

3. Scale up the store-gateway to 3 pods
4. Cut the connection between one of the store-gateways and all queriers
5. Make a query
6. The query fails with the following error:

```
{"status":"error","errorType":"internal","error":"expanding series: failed to fetch series from IP:Port: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp IP:Port: i/o timeout""}
```

**Expected behavior**
I expect queries to succeed, since the store-gateway is zone-aware and the data should be replicated to 3 instances. A single store-gateway failure shouldn't fail the query.

**Environment:**
 - Infrastructure: Kubernetes
 - Deployment tool: Helm
 - AZs: 3 availability zones
 - Number of store-gateways: 3

**Storage Engine**
- [X] Blocks
- [ ] Chunks

**Additional Context**
* I used chaos-mesh to simulate the network failure




Unable to complete query with a single unavailable store-gateway, with shuffle-sharding and zone-awareness #4529
