
Refactor block storage User Scanner and introduce user index file #6780


Merged
merged 13 commits into cortexproject:master from user-scanner
Jun 5, 2025

Conversation

yeya24
Contributor

@yeya24 yeya24 commented Jun 2, 2025

What this PR does:

This PR refactors the existing block storage user scanner package into a new user scanner with a configurable strategy. The default value is list, which scans the object storage the same way as before.

A new strategy called user_index is introduced in this PR. It is intended to reduce List API calls to the object storage by caching the user list in a file. It is similar to the idea of the bucket index, and it contains the lists of active, deleting, and deleted users.

  • Active users are users that have data in the bucket and no tenant deletion marker.
  • Deleting users are users that have data in the bucket but also have the tenant deletion marker; they are still within the tenant deletion delay.
  • Deleted users are users for which only the tenant deletion marker is left and all data has been deleted.

The user_index strategy falls back to the list strategy if the user index file is corrupted, not found, or exceeds the max stale period. When the user_index strategy is used, the compactor is responsible for updating the user index file periodically.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Contributor

@danielblando danielblando left a comment


Thanks. This looks awesome

@danielblando
Contributor

Any reason you split the old deleted into the new deleting and deleted? I couldn't see any benefits.

@danielblando
Contributor

About the updater, I like the way you did it with a fake tenantId because it gives us the owner of the update, but it doesn't work in cases where sharding is off or the compactor is not running.
I was wondering if it would make sense to not have an updater and add that logic when scanning users instead: if we are getting the data in ScanUser and GetUpdatedAt is older than c.compactorCfg.CleanupInterval, we kick off a goroutine to update it.
WDYT? One issue I see is having concurrent calls to update.

@yeya24
Contributor Author

yeya24 commented Jun 3, 2025

Any reason you split the old deleted into the new deleting and deleted? I couldn't see any benefits.

Because today different components have different behaviors. The Store Gateway will try to load blocks for deleting tenants, meaning tenants that have the tenant deletion marker and data in their folder but are still within the tenant cleanup delay.

Other components only care about active users, or about active users plus tenants with a deletion marker.

I don't want to change the Store Gateway behavior, so I separated it out. The behavior should remain the same for those components.

About the updater, I like the way you did it with a fake tenantId because it gives us the owner of the update, but it doesn't work in cases where sharding is off or the compactor is not running.

That's correct. I thought about updating the user index as part of the scanner call, but ended up splitting it out along with the ownership logic to make things cleaner.

For the case of the compactor not running, I think it is fine: the user index has a max stale period and components can easily fall back to the list scanner. The worst case is the same as today, so it shouldn't be a big deal.

When sharding is off, yes, I want to avoid updating the same user index file from all compactors. I think it is better to just fall back to list for all compactors and rely on the tenants iter caching bucket. I can update the doc to mention that this feature only applies when sharding is enabled. I would say if the scale is small enough that sharding is disabled, then the user index is unnecessary for that use case.

yeya24 added 10 commits June 3, 2025 17:15
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: yeya24 <[email protected]>
Signed-off-by: yeya24 <[email protected]>
Signed-off-by: yeya24 <[email protected]>
Signed-off-by: yeya24 <[email protected]>
Signed-off-by: yeya24 <[email protected]>
Signed-off-by: yeya24 <[email protected]>
Member

@alanprot alanprot left a comment


This is great!!!

LGTM! thanks for doing this!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jun 4, 2025
Contributor

@justinjung04 justinjung04 left a comment


This looks great!

One question -- why is there caching in the user_index strategy? I thought the user index would be updated on the cleanup interval, and that we always want to read the latest user index. Can you elaborate on situations where the TTL would be smaller or greater than the cleanup interval?

@yeya24
Contributor Author

yeya24 commented Jun 4, 2025

One question -- why is there caching in the user_index strategy? I thought the user index would be updated on the cleanup interval, and that we always want to read the latest user index. Can you elaborate on situations where the TTL would be smaller or greater than the cleanup interval?

So the idea of caching is to share the user scan result within the same process. For example, the compactor and the cleaner run in the same process but each has its own scanner, since they scan users from different goroutines. We share the base scanner so that, with client-side caching, we don't have to scan twice.

This caching is most useful for the list strategy, though the user index could also benefit. I don't think we have to read the latest user index: users don't change frequently, and we are fine as long as we get the latest users before queryStoreAfter.

Contributor

@danielblando danielblando left a comment


Awesome

yeya24 added 2 commits June 4, 2025 13:41
Signed-off-by: yeya24 <[email protected]>
Signed-off-by: yeya24 <[email protected]>
@yeya24 yeya24 merged commit 844fa55 into cortexproject:master Jun 5, 2025
15 of 17 checks passed
@yeya24 yeya24 deleted the user-scanner branch June 5, 2025 04:33
Labels
component/compactor lgtm This PR has been approved by a maintainer size/XXL storage/blocks Blocks storage engine
4 participants