Skip to content

Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys. #6312

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

alanprot
Copy link
Member

@alanprot alanprot commented Nov 5, 2024

What this PR does:

We can have a race when we partially fetch data on the multilevel cache. The reason for that is internally, the object store changes the dic returned by the cache and this races with the cache internal implementation.

Ex: https://github.com/thanos-io/thanos/blob/d6d19c568f8dc6005a9184d3a5a994da4f0d8c75/pkg/store/cache/caching_bucket.go#L469

Right now this PR only have the unit test demonstrating the issue.

PS: I also optimized the cache to not fetch keys that were already found on the previous cache implementation.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • [NA] Documentation added
  • [NA] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@alanprot alanprot force-pushed the chunks-multilevel-cache-race branch from 6c779f0 to 3532f52 Compare November 5, 2024 04:07
@alanprot
Copy link
Member Author

alanprot commented Nov 5, 2024

Test failing before the fix (first commit)

==================
WARNING: DATA RACE
Read at 0x00c0010bc9f0 by goroutine 1663:
  runtime.mapiterinit()
      /usr/local/go/src/runtime/map.go:877 +0x0
  github.com/thanos-io/thanos/pkg/cache.(*InMemoryCache).Store()
      /__w/cortex/cortex/vendor/github.com/thanos-io/thanos/pkg/cache/inmemory.go:294 +0x73
  github.com/cortexproject/cortex/pkg/storage/tsdb.(*multiLevelChunkCache).Fetch.func1.1()
      /__w/cortex/cortex/pkg/storage/tsdb/multilevel_chunk_cache.go:137 +0xc1
  github.com/thanos-io/thanos/pkg/cacheutil.(*AsyncOperationProcessor).asyncQueueProcessLoop()
      /__w/cortex/cortex/vendor/github.com/thanos-io/thanos/pkg/cacheutil/async_op.go:55 +0xc1
  github.com/thanos-io/thanos/pkg/cacheutil.NewAsyncOperationProcessor.gowrap1()
      /__w/cortex/cortex/vendor/github.com/thanos-io/thanos/pkg/cacheutil/async_op.go:36 +0x33

Previous write at 0x00c0010bc9f0 by goroutine 1651:
  runtime.mapdelete_faststr()
      /usr/local/go/src/runtime/map_faststr.go:321 +0x0
  github.com/cortexproject/cortex/pkg/storage/tsdb.Test_MultiLevelChunkCacheFetchRace()
      /__w/cortex/cortex/pkg/storage/tsdb/multilevel_chunk_cache_test.go:106 +0xa56
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44

Goroutine 1663 (running) created at:
  github.com/thanos-io/thanos/pkg/cacheutil.NewAsyncOperationProcessor()
      /__w/cortex/cortex/vendor/github.com/thanos-io/thanos/pkg/cacheutil/async_op.go:36 +0x13e
  github.com/cortexproject/cortex/pkg/storage/tsdb.newMultiLevelChunkCache()
      /__w/cortex/cortex/pkg/storage/tsdb/multilevel_chunk_cache.go:63 +0x90
  github.com/cortexproject/cortex/pkg/storage/tsdb.Test_MultiLevelChunkCacheFetchRace()
      /__w/cortex/cortex/pkg/storage/tsdb/multilevel_chunk_cache_test.go:99 +0x89b
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44

Goroutine 1651 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1743 +0x825
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:2168 +0x85
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.runTests()
      /usr/local/go/src/testing/testing.go:2166 +0x8be
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:2034 +0xf17
  main.main()
      _testmain.go:91 +0x164
==================
--- FAIL: Test_MultiLevelChunkCacheFetchRace (0.00s)
    testing.go:[139](https://github.com/cortexproject/cortex/actions/runs/11677372741/job/32515160421#step:6:140)9: race detected during execution of test

@alanprot alanprot changed the title Fix race on chunks multilevel cache Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys. Nov 5, 2024
@alanprot alanprot marked this pull request as ready for review November 5, 2024 04:27
@dosubot dosubot bot added the storage/chunks Chunks storage engine label Nov 5, 2024
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks a lot for fixing this issue!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 5, 2024
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we do better to catch this before? Maybe we shouldn't use mock in our existing test

@alanprot
Copy link
Member Author

alanprot commented Nov 5, 2024

How can we do better to catch this before? Maybe we shouldn't use mock in our existing test

I think the only way is integ tests.. cause the root cause was how the results of the cache interact with the overall code … but catching that would need a very specific test I guess .. idk if we will ever be able to test all combinations of configs and edge cases tbh :(

@alanprot alanprot merged commit 661f47b into cortexproject:master Nov 5, 2024
16 checks passed
@alanprot alanprot deleted the chunks-multilevel-cache-race branch November 5, 2024 07:09
alanprot added a commit to alanprot/cortex that referenced this pull request Nov 5, 2024
…ready found keys. (cortexproject#6312)

* Creating a test to show the race on the multilevel cache

Signed-off-by: alanprot <[email protected]>

* fix the race problem

* Only fetch keys that were not found on the previous cache

Signed-off-by: alanprot <[email protected]>

---------

Signed-off-by: alanprot <[email protected]>
alanprot added a commit that referenced this pull request Nov 5, 2024
* Implementing Expanded Postings Cache

Signed-off-by: alanprot <[email protected]>

* small nit

Signed-off-by: alanprot <[email protected]>

* refactoring the cache so we dont need to call expire on every request

Signed-off-by: alanprot <[email protected]>

* Update total cache size when updating the item

Signed-off-by: alanprot <[email protected]>

* Fix fuzzy test after change the flag name

Signed-off-by: alanprot <[email protected]>

* remove max item config + create a new test case with only head cache enabled

Signed-off-by: alanprot <[email protected]>

* Documenting enabled as first field on the config

Signed-off-by: alanprot <[email protected]>

* Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys.  (#6312)

* Creating a test to show the race on the multilevel cache

Signed-off-by: alanprot <[email protected]>

* fix the race problem

* Only fetch keys that were not found on the previous cache

Signed-off-by: alanprot <[email protected]>

---------

Signed-off-by: alanprot <[email protected]>

* Improve Doc

Signed-off-by: alanprot <[email protected]>

* create new cortex_ingester_expanded_postings_non_cacheable_queries metric

Signed-off-by: alanprot <[email protected]>

---------

Signed-off-by: alanprot <[email protected]>
CharlieTLe pushed a commit to CharlieTLe/cortex that referenced this pull request Dec 3, 2024
…ready found keys. (cortexproject#6312)

* Creating a test to show the race on the multilevel cache

Signed-off-by: alanprot <[email protected]>

* fix the race problem

* Only fetch keys that were not found on the previous cache

Signed-off-by: alanprot <[email protected]>

---------

Signed-off-by: alanprot <[email protected]>
CharlieTLe pushed a commit to CharlieTLe/cortex that referenced this pull request Dec 3, 2024
* Implementing Expanded Postings Cache

Signed-off-by: alanprot <[email protected]>

* small nit

Signed-off-by: alanprot <[email protected]>

* refactoring the cache so we dont need to call expire on every request

Signed-off-by: alanprot <[email protected]>

* Update total cache size when updating the item

Signed-off-by: alanprot <[email protected]>

* Fix fuzzy test after change the flag name

Signed-off-by: alanprot <[email protected]>

* remove max item config + create a new test case with only head cache enabled

Signed-off-by: alanprot <[email protected]>

* Documenting enabled as first field on the config

Signed-off-by: alanprot <[email protected]>

* Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys.  (cortexproject#6312)

* Creating a test to show the race on the multilevel cache

Signed-off-by: alanprot <[email protected]>

* fix the race problem

* Only fetch keys that were not found on the previous cache

Signed-off-by: alanprot <[email protected]>

---------

Signed-off-by: alanprot <[email protected]>

* Improve Doc

Signed-off-by: alanprot <[email protected]>

* create new cortex_ingester_expanded_postings_non_cacheable_queries metric

Signed-off-by: alanprot <[email protected]>

---------

Signed-off-by: alanprot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm This PR has been approved by a maintainer size/M storage/chunks Chunks storage engine
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants