
Unable to complete query with a single unavailable store-gateway, with shuffle-sharding and zone-awareness #4529

Closed
@roystchiang

Describe the bug
Queriers do not seem to handle store-gateway failure gracefully, even with zone-awareness and replication enabled.

I have 3 store-gateways, each in a different AZ, and a replication factor of 3. If a querier loses its network connection to a store-gateway, or a store-gateway fails to respond for any reason, the querier responds with a 5xx.

I'm wondering if this is the expected failure mode, or a misconfiguration on my side.

In the querier, at

if err := g.Wait(); err != nil {

it seems that if we fail to query any one store-gateway, we fail the whole call, even if we have successfully gathered blocks from the other store-gateways. Since our replication factor is 3, I expect the querier to query all store-gateways, but a single store-gateway failure should not result in a 5xx on the query.

If we are failing a query because a single store-gateway failed, can we instead return the blocks that were queried, and have the retry loop in queryWithConsistencyCheck

for attempt := 1; attempt <= maxFetchSeriesAttempts; attempt++ {

reattempt the missing blocks?
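A rough sketch of the behavior I am proposing, under simplifying assumptions: every replica holds every block (as with 3 store-gateways and a replication factor of 3), and replicas are tried one per attempt. The gateway type, queryTolerant, subtract, and maxAttempts are hypothetical names for illustration, not Cortex identifiers.

```go
package main

import (
	"errors"
	"fmt"
)

// gateway is a hypothetical stand-in for a store-gateway client.
type gateway struct {
	addr   string
	up     bool
	blocks map[string]bool // blocks this replica holds
}

// query returns the subset of wanted blocks this gateway could serve.
func (g gateway) query(wanted []string) ([]string, error) {
	if !g.up {
		return nil, errors.New("connection error: i/o timeout")
	}
	var got []string
	for _, b := range wanted {
		if g.blocks[b] {
			got = append(got, b)
		}
	}
	return got, nil
}

const maxAttempts = 3 // hypothetical; plays the role of maxFetchSeriesAttempts

// subtract returns the entries of all that are not in done.
func subtract(all, done []string) []string {
	seen := make(map[string]bool, len(done))
	for _, b := range done {
		seen[b] = true
	}
	var rest []string
	for _, b := range all {
		if !seen[b] {
			rest = append(rest, b)
		}
	}
	return rest
}

// queryTolerant tries one replica per attempt. A failed attempt leaves its
// blocks in the missing set, and the next attempt retries only those blocks.
func queryTolerant(replicas []gateway, wanted []string) ([]string, error) {
	missing := wanted
	var queried []string
	for attempt := 1; attempt <= maxAttempts && attempt <= len(replicas) && len(missing) > 0; attempt++ {
		got, err := replicas[attempt-1].query(missing)
		if err != nil {
			continue // tolerate the failure and retry on the next replica
		}
		queried = append(queried, got...)
		missing = subtract(missing, got)
	}
	if len(missing) > 0 {
		return queried, fmt.Errorf("consistency check failed: missing blocks %v", missing)
	}
	return queried, nil
}

func main() {
	all := map[string]bool{"block-1": true, "block-2": true}
	replicas := []gateway{
		{addr: "sg-a", up: false, blocks: all}, // unreachable replica
		{addr: "sg-b", up: true, blocks: all},
	}
	fmt.Println(queryTolerant(replicas, []string{"block-1", "block-2"}))
}
```

Here the query only fails if blocks are still missing after all attempts, rather than on the first store-gateway error.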

To Reproduce
Steps to reproduce the behavior:

  1. Start Cortex (3291733)
  2. Configure store-gateway with shuffle-sharding and zone-awareness:
     store_gateway:
       sharding_enabled: true
       sharding_ring:
         replication_factor: 3
         zone_awareness_enabled: true
         instance_availability_zone: AZ
  3. Scale up store-gateway to 3 pods
  4. Cut the connection between one of the store-gateways and all queriers
  5. Make a query
  6. The query fails with the following error:
{"status":"error","errorType":"internal","error":"expanding series: failed to fetch series from IP:Port: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp IP:Port: i/o timeout""}

Expected behavior
I expect queries to succeed, since the store-gateways are zone-aware and the data should be replicated to 3 instances. No single store-gateway failure should fail the query.

Environment:

  • Infrastructure: kubernetes
  • Deployment tool: helm
  • AZs: 3 availability zones
  • Number of store-gateway: 3

Storage Engine

  • Blocks
  • Chunks

Additional Context

  • I used chaos-mesh to help with simulating network failure
