Description
Describe the bug
Queriers do not seem to handle store-gateway failure gracefully, even with zone-awareness and replication enabled.
I have 3 store-gateways, each in a different AZ, and a replication factor of 3. If the queriers lose their network connection to a store-gateway, or a store-gateway fails to respond for any reason, the queriers respond with a 5xx.
I'm wondering if this is the expected failure mode, or a misconfiguration on my side.
In cortex/pkg/querier/blocks_store_queryable.go (line 694 in 464c424), if we are failing a query because a single store-gateway failed, can we instead return the blocks that were queried, and have queryWithConsistencyCheck (cortex/pkg/querier/blocks_store_queryable.go, line 503 in 464c424) deal with the blocks that are still missing?
To Reproduce
Steps to reproduce the behavior:
- Start Cortex (3291733)
- Configure store-gateway with shuffle-sharding and zone-awareness
  store_gateway:
    sharding_enabled: true
    sharding_ring:
      replication_factor: 3
      zone_awareness_enabled: true
      instance_availability_zone: AZ
- Scale up store-gateway to 3 pods
- Cut the connection between one of the store-gateways and all queriers
- Make a query
- The query fails with the following error:
{"status":"error","errorType":"internal","error":"expanding series: failed to fetch series from IP:Port: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp IP:Port: i/o timeout\""}
Expected behavior
I expect queries to succeed, since the store-gateway is zone-aware and the data should be replicated to 3 instances. The failure of any single store-gateway shouldn't fail the query.
Environment:
- Infrastructure: kubernetes
- Deployment tool: helm
- AZs: 3 availability zones
- Number of store-gateway: 3
Storage Engine
- Blocks
- Chunks
Additional Context
- I used chaos-mesh to help with simulating network failure