Skip to content

Bug in ring.shuffleShard() with Zone Aware Replication #5467

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
harry671003 opened this issue Jul 20, 2023 · 2 comments · Fixed by #5489
Closed

Bug in ring.shuffleShard() with Zone Aware Replication #5467

harry671003 opened this issue Jul 20, 2023 · 2 comments · Fixed by #5489

Comments

@harry671003
Copy link
Contributor

Bug:

There is a bug in how the sub ring for a tenant is obtained in ring.shuffleShard(). Ideally when shard size increases or decreases by one, the replicas for any block should only change at-most by one instance. But in the cortex implementation more than one instance can change if the shardSize is not divisible by numZones.

The code for shuffle sharding works as follows ring.shuffleShard():

  • Get all the zones r.ringZones
  • Calculate numInstancesPerZone = int(math.Ceil(float64(shardSize) / float64(numZones)))
  • For each Zone in r.ringZones:
    • Find the first numInstancesPerZone unique instances

Explanation

Assumtions:

  • Assume there are 11 store-gateways sg1 to sg11.
  • Assume current tenant shard size is 9 and it's increased to 10

Initial state

SG_Ring(1)

In the above diagram, the store-gateway tokens are arranged in the order as shown in the first figure. Assume TENANT_ID is hashed just before sg9's token.

When shardSize=9

ShardSize=9

  • When shardSize = 9, the shardSize is evenly divisible by number of zones.
numInstancesPerZone = int(math.Ceil(float64(shardSize) / float64(numZones)))
numInstancesPerZone = int(math.Ceil(9/3))
numInstancesPerZone = 3
  • The algorithm for shuffle sharding will pick the first 3 unique instances from each zone with a maximum of 9 instances.
  • The subRing for the tenant will be [sg9, sg5, sg2, sg4, sg1, sg6, sg7, sg8, sg3]
  • We'll have 3 instances per AZ:
    • AZ1: [sg9, sg2, sg6]
    • AZ2: [sg5, sg1, sg7]
    • AZ3: [sg4, sg8, sg3]
  • The replicas for the block will be [sg8, sg9, sg5]

When shardSize=10

ShardSize=10(1)

  • When shardSize=10, the shardSize is not evenly divisible by number of zones.
numInstancesPerZone = int(math.Ceil(float64(shardSize) / float64(numZones)))
numInstancesPerZone = int(math.Ceil(10/3)) = int(math.Ceil(3.33))
numInstancesPerZone = 4
  • The algorithm for shuffle sharding will pick the first 4 unique instances from each zone with a maximum of 10 instances.
  • The subRing for the tenant will be [sg9, sg5, sg2, sg4, sg1, sg6, sg7, sg8, sg3]
  • Two new instances were added to the subRing: [sg10, sg11]
  • We'll have 4 instances in AZ3 and AZ2 and only 2 instances in AZ1:
    • AZ1: [sg9, sg2, sg6, sg11]
    • AZ2: [sg15 sg1, sg7, sg10]
    • AZ2: [sg4, sg8]
  • The replicas for the block will be [sg10, sg11, sg8]
  • The block replicas also changed by 2 when shard size increased from 9 to 11.

To Reproduce
Steps to reproduce the behavior:

  • To reproduce, the store-gateway tenant shard size should be increased by one from a value divisible by number of zones.

Expected behavior

  • Only one replica should change for any block when tenant shard size increases/decrease by 1.

Possible Fix

  • When the shardSize is not divisible by the zones, only the first zone should get the extra instance.
  • The instances per zone shouldn't differ by more than one.
@yeya24
Copy link
Contributor

yeya24 commented Jul 27, 2023

To clarify:
In the current implementation, when zone awareness is enabled, the shardSize will round up to multiple of numZones so when numZones is 3, shardSize of 10, 11 and 12 will always round up to 12.

Only one replica should change for any block when tenant shard size increases/decrease by 1

I am not sure if the current implementation is designed to support this or it always wants to increase/decrease by numZones. See https://github.com/cortexproject/cortex/blob/master/pkg/ring/ring_test.go#L1800. I think the current implementation considers consistency but it expects changing by numZones each time.

Based on the points mentioned above, assuming numZones is 3, current shard size is 3 and we increased it to 4. It actually goes to 6 and there will be 3 new instances. The issue seems like in store gateway we don't have enough retries when we add 3 new instances because number of retry is hardcoded to 3.
If we can have number of retries for store gateway to be numZones + 1. I guess we won't hit issues like consistency?

@harry671003
Copy link
Contributor Author

Thanks for fixing this @yeya24.

If we can have number of retries for store gateway to be numZones + 1. I guess we won't hit issues like consistency?

If we have RF=3 and numZones=3, we can't retry numZones + 1 because we won't have enough replicas for a block to retry on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants