Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys. #6312

alanprot · 2024-11-05T04:03:54Z

What this PR does:

We can have a race when we partially fetch data on the multilevel cache. The reason for that is internally, the object store changes the dic returned by the cache and this races with the cache internal implementation.

Ex: https://github.com/thanos-io/thanos/blob/d6d19c568f8dc6005a9184d3a5a994da4f0d8c75/pkg/store/cache/caching_bucket.go#L469

Right now this PR only have the unit test demonstrating the issue.

PS: I also optimized the cache to not fetch keys that were already found on the previous cache implementation.

Which issue(s) this PR fixes:
Fixes #

Checklist

Tests updated
[NA] Documentation added
[NA] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: alanprot <[email protected]>

alanprot · 2024-11-05T04:25:53Z

Test failing before the fix (first commit)

==================
WARNING: DATA RACE
Read at 0x00c0010bc9f0 by goroutine 1663:
  runtime.mapiterinit()
      /usr/local/go/src/runtime/map.go:877 +0x0
  github.com/thanos-io/thanos/pkg/cache.(*InMemoryCache).Store()
      /__w/cortex/cortex/vendor/github.com/thanos-io/thanos/pkg/cache/inmemory.go:294 +0x73
  github.com/cortexproject/cortex/pkg/storage/tsdb.(*multiLevelChunkCache).Fetch.func1.1()
      /__w/cortex/cortex/pkg/storage/tsdb/multilevel_chunk_cache.go:137 +0xc1
  github.com/thanos-io/thanos/pkg/cacheutil.(*AsyncOperationProcessor).asyncQueueProcessLoop()
      /__w/cortex/cortex/vendor/github.com/thanos-io/thanos/pkg/cacheutil/async_op.go:55 +0xc1
  github.com/thanos-io/thanos/pkg/cacheutil.NewAsyncOperationProcessor.gowrap1()
      /__w/cortex/cortex/vendor/github.com/thanos-io/thanos/pkg/cacheutil/async_op.go:36 +0x33

Previous write at 0x00c0010bc9f0 by goroutine 1651:
  runtime.mapdelete_faststr()
      /usr/local/go/src/runtime/map_faststr.go:321 +0x0
  github.com/cortexproject/cortex/pkg/storage/tsdb.Test_MultiLevelChunkCacheFetchRace()
      /__w/cortex/cortex/pkg/storage/tsdb/multilevel_chunk_cache_test.go:106 +0xa56
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44

Goroutine 1663 (running) created at:
  github.com/thanos-io/thanos/pkg/cacheutil.NewAsyncOperationProcessor()
      /__w/cortex/cortex/vendor/github.com/thanos-io/thanos/pkg/cacheutil/async_op.go:36 +0x13e
  github.com/cortexproject/cortex/pkg/storage/tsdb.newMultiLevelChunkCache()
      /__w/cortex/cortex/pkg/storage/tsdb/multilevel_chunk_cache.go:63 +0x90
  github.com/cortexproject/cortex/pkg/storage/tsdb.Test_MultiLevelChunkCacheFetchRace()
      /__w/cortex/cortex/pkg/storage/tsdb/multilevel_chunk_cache_test.go:99 +0x89b
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44

Goroutine 1651 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1743 +0x825
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:2168 +0x85
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.runTests()
      /usr/local/go/src/testing/testing.go:2166 +0x8be
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:2034 +0xf17
  main.main()
      _testmain.go:91 +0x164
==================
--- FAIL: Test_MultiLevelChunkCacheFetchRace (0.00s)
    testing.go:[139](https://github.com/cortexproject/cortex/actions/runs/11677372741/job/32515160421#step:6:140)9: race detected during execution of test

yeya24

Makes sense. Thanks a lot for fixing this issue!

yeya24

How can we do better to catch this before? Maybe we shouldn't use mock in our existing test

alanprot · 2024-11-05T07:08:09Z

How can we do better to catch this before? Maybe we shouldn't use mock in our existing test

I think the only way is integ tests.. cause the root cause was how the results of the cache interact with the overall code … but catching that would need a very specific test I guess .. idk if we will ever be able to test all combinations of configs and edge cases tbh :(

…ready found keys. (cortexproject#6312) * Creating a test to show the race on the multilevel cache Signed-off-by: alanprot <[email protected]> * fix the race problem * Only fetch keys that were not found on the previous cache Signed-off-by: alanprot <[email protected]> --------- Signed-off-by: alanprot <[email protected]>

* Implementing Expanded Postings Cache Signed-off-by: alanprot <[email protected]> * small nit Signed-off-by: alanprot <[email protected]> * refactoring the cache so we dont need to call expire on every request Signed-off-by: alanprot <[email protected]> * Update total cache size when updating the item Signed-off-by: alanprot <[email protected]> * Fix fuzzy test after change the flag name Signed-off-by: alanprot <[email protected]> * remove max item config + create a new test case with only head cache enabled Signed-off-by: alanprot <[email protected]> * Documenting enabled as first field on the config Signed-off-by: alanprot <[email protected]> * Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys. (#6312) * Creating a test to show the race on the multilevel cache Signed-off-by: alanprot <[email protected]> * fix the race problem * Only fetch keys that were not found on the previous cache Signed-off-by: alanprot <[email protected]> --------- Signed-off-by: alanprot <[email protected]> * Improve Doc Signed-off-by: alanprot <[email protected]> * create new cortex_ingester_expanded_postings_non_cacheable_queries metric Signed-off-by: alanprot <[email protected]> --------- Signed-off-by: alanprot <[email protected]>

…ready found keys. (cortexproject#6312) * Creating a test to show the race on the multilevel cache Signed-off-by: alanprot <[email protected]> * fix the race problem * Only fetch keys that were not found on the previous cache Signed-off-by: alanprot <[email protected]> --------- Signed-off-by: alanprot <[email protected]>

* Implementing Expanded Postings Cache Signed-off-by: alanprot <[email protected]> * small nit Signed-off-by: alanprot <[email protected]> * refactoring the cache so we dont need to call expire on every request Signed-off-by: alanprot <[email protected]> * Update total cache size when updating the item Signed-off-by: alanprot <[email protected]> * Fix fuzzy test after change the flag name Signed-off-by: alanprot <[email protected]> * remove max item config + create a new test case with only head cache enabled Signed-off-by: alanprot <[email protected]> * Documenting enabled as first field on the config Signed-off-by: alanprot <[email protected]> * Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys. (cortexproject#6312) * Creating a test to show the race on the multilevel cache Signed-off-by: alanprot <[email protected]> * fix the race problem * Only fetch keys that were not found on the previous cache Signed-off-by: alanprot <[email protected]> --------- Signed-off-by: alanprot <[email protected]> * Improve Doc Signed-off-by: alanprot <[email protected]> * create new cortex_ingester_expanded_postings_non_cacheable_queries metric Signed-off-by: alanprot <[email protected]> --------- Signed-off-by: alanprot <[email protected]>

pull-request-size bot added the size/M label Nov 5, 2024

Creating a test to show the race on the multilevel cache

3532f52

Signed-off-by: alanprot <[email protected]>

alanprot force-pushed the chunks-multilevel-cache-race branch from 6c779f0 to 3532f52 Compare November 5, 2024 04:07

alanprot added 2 commits November 4, 2024 20:15

fix the race problem

7cb636b

Only fetch keys that were not found on the previous cache

1dfbf45

Signed-off-by: alanprot <[email protected]>

alanprot changed the title ~~Fix race on chunks multilevel cache~~ Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys. Nov 5, 2024

alanprot marked this pull request as ready for review November 5, 2024 04:27

dosubot bot added the storage/chunks Chunks storage engine label Nov 5, 2024

yeya24 approved these changes Nov 5, 2024

View reviewed changes

dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 5, 2024

yeya24 reviewed Nov 5, 2024

View reviewed changes

alanprot merged commit 661f47b into cortexproject:master Nov 5, 2024
16 checks passed

alanprot deleted the chunks-multilevel-cache-race branch November 5, 2024 07:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys. #6312

Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys. #6312

Uh oh!

alanprot commented Nov 5, 2024 •

edited

Loading

Uh oh!

alanprot commented Nov 5, 2024

Uh oh!

yeya24 left a comment

Uh oh!

yeya24 left a comment

Uh oh!

alanprot commented Nov 5, 2024

Uh oh!

Uh oh!

Uh oh!

Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys. #6312

Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys. #6312

Uh oh!

Conversation

alanprot commented Nov 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alanprot commented Nov 5, 2024

Uh oh!

yeya24 left a comment

Choose a reason for hiding this comment

Uh oh!

yeya24 left a comment

Choose a reason for hiding this comment

Uh oh!

alanprot commented Nov 5, 2024

Uh oh!

Uh oh!

Uh oh!

alanprot commented Nov 5, 2024 •

edited

Loading