
Cache Expended Posting on ingesters #6296


Merged
merged 10 commits into from
Nov 5, 2024

Conversation

alanprot
Member

@alanprot alanprot commented Oct 30, 2024

What this PR does:
This PR introduces caching of expanded postings, similar to the approach used in SG, as seen in this change.

  • Key Updates:
    1. Compacted Block Caching:
      • Expanded postings are cached for all queries against compacted blocks. This is safe because compacted blocks are immutable.
    2. Head Block Caching:
      • Caching is applied only to select queries that contain an equality matcher on the metric name (__name__).
        For such cases, a seed is added to the cache key, allowing invalidation whenever a new series is added or removed for a given metric name.
      • Example:
        • Suppose the seed for cpu_utilization is 21. Queries would generate cache keys as follows:
          • sum(cpu_utilization{service="A"}) => Key: 21|head|__name__=cpu_utilization|service=A
          • sum(cpu_utilization{service="A", op="GET"}) => Key: 21|head|__name__=cpu_utilization|service=A|op=GET
    3. Head Cache Invalidation:
      On data ingestion, if a new series is added with cpu_utilization as the metric name, the seed changes, invalidating all cached entries for that metric.
    4. Memory Bound for Seeds:
      To control memory usage, a fixed-size seed slice is used. Metric names are hashed to determine their respective seed.
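The seed-based keying and invalidation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Cortex implementation; the names `seedByHash`, `headCacheKey`, and the FNV hash choice are hypothetical:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
	"sync/atomic"
)

// seedByHash holds a fixed-size slice of seeds to bound memory usage.
// Metric names are hashed into a slot, so distinct metrics may share a
// seed (bumping one then also invalidates the other's entries).
type seedByHash struct {
	seeds []int64
}

func newSeedByHash(size int) *seedByHash {
	return &seedByHash{seeds: make([]int64, size)}
}

func (s *seedByHash) index(metricName string) int {
	h := fnv.New32a()
	h.Write([]byte(metricName))
	return int(h.Sum32()) % len(s.seeds)
}

// Seed returns the current seed for a metric name.
func (s *seedByHash) Seed(metricName string) int64 {
	return atomic.LoadInt64(&s.seeds[s.index(metricName)])
}

// Bump changes the seed for a metric name, invalidating every cached
// head-block key that embedded the old seed.
func (s *seedByHash) Bump(metricName string) {
	atomic.AddInt64(&s.seeds[s.index(metricName)], 1)
}

// headCacheKey builds a key like "21|head|__name__=cpu_utilization|service=A".
func headCacheKey(seed int64, matchers []string) string {
	return fmt.Sprintf("%d|head|%s", seed, strings.Join(matchers, "|"))
}

func main() {
	seeds := newSeedByHash(1024)
	matchers := []string{"__name__=cpu_utilization", "service=A"}

	fmt.Println(headCacheKey(seeds.Seed("cpu_utilization"), matchers))
	// → 0|head|__name__=cpu_utilization|service=A

	// A new series for cpu_utilization is ingested: bump the seed, so
	// subsequent lookups produce a different key and miss the cache.
	seeds.Bump("cpu_utilization")
	fmt.Println(headCacheKey(seeds.Seed("cpu_utilization"), matchers))
	// → 1|head|__name__=cpu_utilization|service=A
}
```

The fixed-size slice trades occasional spurious invalidations (two metric names hashing to the same slot) for a hard memory bound, which is the design choice point 4 describes.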

In load testing, this change demonstrated significant CPU and latency improvements on ingesters experiencing heavy query loads from rulers. In such cases, the same queries are evaluated constantly (every evaluation interval), and the postings are expanded for every iteration, leading to unnecessarily high CPU usage.

Here are some graphs:
Screenshot 2024-11-01 at 2 40 32 PM

Flame graphs before and after (we can see that after the change, CPU is spent mostly in the Push method, whereas before it was spent mostly in the QueryStream method):

Screenshot 2024-11-01 at 2 44 28 PM

Screenshot 2024-11-01 at 2 42 57 PM

The idea was inspired by thanos-io/thanos#6420 and by the promise-based cache implementation in grafana/mimir-prometheus#14.
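A promise-based cache in the spirit of that implementation can be sketched as below. This is a simplified, hypothetical illustration (the real Thanos/mimir-prometheus code differs, handles errors, eviction, and context cancellation): concurrent queries for the same key expand postings only once, and latecomers wait on the first caller's result.

```go
package main

import (
	"fmt"
	"sync"
)

// promise represents an in-flight or completed computation for one key.
// Waiters block on done until the first caller stores val and closes it.
type promise struct {
	done chan struct{}
	val  []int // hypothetical stand-in for expanded postings (series refs)
}

type promiseCache struct {
	mu       sync.Mutex
	promises map[string]*promise
}

func newPromiseCache() *promiseCache {
	return &promiseCache{promises: map[string]*promise{}}
}

// Get returns the value for key, computing it via fn only for the first
// caller; all concurrent callers for the same key share one computation.
func (c *promiseCache) Get(key string, fn func() []int) []int {
	c.mu.Lock()
	p, ok := c.promises[key]
	if !ok {
		p = &promise{done: make(chan struct{})}
		c.promises[key] = p
		c.mu.Unlock()
		p.val = fn() // only this goroutine runs the expensive expansion
		close(p.done)
		return p.val
	}
	c.mu.Unlock()
	<-p.done // later callers block until the first one finishes
	return p.val
}

func main() {
	cache := newPromiseCache()
	calls := 0
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// e.g. four ruler queries hitting the same cache key at once
			cache.Get("21|head|__name__=cpu_utilization", func() []int {
				calls++ // safe: only the promise creator runs fn
				return []int{1, 2, 3}
			})
		}()
	}
	wg.Wait()
	fmt.Println(calls) // → 1: postings were expanded only once
}
```

This matches the ruler workload described above: the same queries arrive every evaluation interval, so deduplicating the expansion work per key removes most of the repeated CPU cost.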

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@alanprot alanprot force-pushed the cache-postings branch 14 times, most recently from d33a2a8 to 9cdddd1 Compare November 2, 2024 00:01
@alanprot alanprot changed the title [Wip] Cache Expended Posting on ingesters Cache Expended Posting on ingesters Nov 2, 2024
@alanprot alanprot marked this pull request as ready for review November 2, 2024 00:19
@alanprot alanprot force-pushed the cache-postings branch 2 times, most recently from a5e3830 to 6937829 Compare November 2, 2024 01:05
@justinjung04
Contributor

I love the test results! Thanks for doing this!!

@yeya24 yeya24 mentioned this pull request Nov 2, 2024
3 tasks
@alanprot alanprot force-pushed the cache-postings branch 2 times, most recently from d443d80 to 42e2235 Compare November 4, 2024 22:48
alanprot and others added 3 commits November 5, 2024 10:33
…ready found keys. (cortexproject#6312)

* Creating a test to show the race on the multilevel cache

Signed-off-by: alanprot <[email protected]>

* fix the race problem

* Only fetch keys that were not found on the previous cache

Signed-off-by: alanprot <[email protected]>

---------

Signed-off-by: alanprot <[email protected]>
Signed-off-by: alanprot <[email protected]>
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 5, 2024
@alanprot alanprot merged commit 3267515 into cortexproject:master Nov 5, 2024
16 checks passed
CharlieTLe pushed a commit to CharlieTLe/cortex that referenced this pull request Dec 3, 2024
* Implementing Expanded Postings Cache

Signed-off-by: alanprot <[email protected]>

* small nit

Signed-off-by: alanprot <[email protected]>

* refactoring the cache so we dont need to call expire on every request

Signed-off-by: alanprot <[email protected]>

* Update total cache size when updating the item

Signed-off-by: alanprot <[email protected]>

* Fix fuzzy test after change the flag name

Signed-off-by: alanprot <[email protected]>

* remove max item config + create a new test case with only head cache enabled

Signed-off-by: alanprot <[email protected]>

* Documenting enabled as first field on the config

Signed-off-by: alanprot <[email protected]>

* Fix race on chunks multilevel cache + Optimize to avoid refetching already found keys.  (cortexproject#6312)

* Creating a test to show the race on the multilevel cache

Signed-off-by: alanprot <[email protected]>

* fix the race problem

* Only fetch keys that were not found on the previous cache

Signed-off-by: alanprot <[email protected]>

---------

Signed-off-by: alanprot <[email protected]>

* Improve Doc

Signed-off-by: alanprot <[email protected]>

* create new cortex_ingester_expanded_postings_non_cacheable_queries metric

Signed-off-by: alanprot <[email protected]>

---------

Signed-off-by: alanprot <[email protected]>
Labels
component/ingester lgtm This PR has been approved by a maintainer size/XXL type/performance
3 participants