
Commit 2081854

address comments
1 parent 6e37330 commit 2081854

6 files changed (+80 -7 lines changed)

CHANGELOG.md

Lines changed: 0 additions & 1 deletion
@@ -5,7 +5,6 @@
 * [FEATURE] Query Frontend: Add dynamic interval size for query splitting. This is enabled by configuring experimental flags `querier.max-shards-per-query` and/or `querier.max-fetched-data-duration-per-query`. The split interval size is dynamically increased to maintain a number of shards and total duration fetched below the configured values. #6458
 * [FEATURE] Querier/Ruler: Add `query_partial_data` and `rules_partial_data` limits to allow queries/rules to be evaluated with data from a single zone, if other zones are not available. #6526
 * [FEATURE] Update prometheus alertmanager version to v0.28.0 and add new integration msteamsv2, jira, and rocketchat. #6590
-* [FEATURE] Ingester: Add a `-ingester.enable-ooo-native-histograms` flag to enable out-of-order native histogram ingestion per tenant. It only takes effect when `-blocks-storage.tsdb.enable-native-histograms=true` and `-ingester.out-of-order-time-window` > 0. It is applied after the restart if it is changed at runtime through the runtime config. #6626
 * [FEATURE] Ingester/StoreGateway: Add `resource-thresholds` in ingesters and store gateways to throttle query requests when the pods are under resource pressure. #6674
 * [FEATURE] Ingester: Support out-of-order native histogram ingestion. It automatically enabled when `-ingester.out-of-order-time-window > 0` and `-blocks-storage.tsdb.enable-native-histograms=true`. #6626 #6663
 * [ENHANCEMENT] Alertmanager: Add nflog and silences maintenance metrics. #6659

docs/configuration/config-file-reference.md

Lines changed: 4 additions & 0 deletions
@@ -273,10 +273,14 @@ query_scheduler:
 
 resource_thresholds:
   # Utilization threshold for CPU in percentage, between 0 and 1. 0 to disable.
+  # The CPU utilization metric is from github.com/prometheus/procfs, which is a
+  # close estimate. Applicable to ingesters and store-gateways only.
   # CLI flag: -resource-thresholds.cpu
   [cpu: <float> | default = 0]
 
   # Utilization threshold for heap in percentage, between 0 and 1. 0 to disable.
+  # The heap utilization metric is from runtime/metrics, which is a close
+  # estimate. Applicable to ingesters and store-gateways only.
   # CLI flag: -resource-thresholds.heap
   [heap: <float> | default = 0]
 

Lines changed: 55 additions & 0 deletions
---
title: "Protecting Cortex from Heavy Queries"
linkTitle: "Protecting Cortex from Heavy Queries"
weight: 11
slug: protecting-cortex-from-heavy-queries
---
PromQL is powerful, and a single query can fetch a very wide range of data and process a huge number of samples. Heavy queries can cause:

1. CPU on any query component to be partially exhausted, increasing latency and causing incoming queries to queue up with a high chance of timing out.
2. CPU on any query component to be fully exhausted, slowing down GC and eventually leading to the pod running out of memory and being killed.
3. Heap memory on any query component to be exhausted, leading to the pod running out of memory and being killed.

It's important to protect Cortex components by setting appropriate limits and throttling configurations based on your infrastructure and the data ingested by your customers.
## Static limits

There are a number of static limits that you can configure to block heavy queries from running; a combined example is sketched after the individual limits below.

### Max outstanding requests per tenant

See https://cortexmetrics.io/docs/configuration/configuration-file/#query_frontend_config:~:text=max_outstanding_requests_per_tenant for details.

### Max data bytes fetched per (sharded) query

See https://cortexmetrics.io/docs/configuration/configuration-file/#query_frontend_config:~:text=max_fetched_data_bytes_per_query for details.

### Max series fetched per (sharded) query

See https://cortexmetrics.io/docs/configuration/configuration-file/#query_frontend_config:~:text=max_fetched_series_per_query for details.

### Max chunks fetched per (sharded) query

See https://cortexmetrics.io/docs/configuration/configuration-file/#query_frontend_config:~:text=max_fetched_chunk_bytes_per_query for details.

### Max samples fetched per (sharded) query

See https://cortexmetrics.io/docs/configuration/configuration-file/#querier_config:~:text=max_samples for details.

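As an illustration only, the data-volume limits above can be combined roughly as in the sketch below; the values are arbitrary placeholders rather than recommendations, and the outstanding-requests limit is configured on the query frontend/scheduler as described at the first link. Consult the linked configuration reference for the authoritative section and defaults of each option.

```
# Illustrative values only; tune them for your own workload.
limits:
  max_fetched_series_per_query: 100000
  max_fetched_chunk_bytes_per_query: 1073741824  # 1 GiB
  max_fetched_data_bytes_per_query: 2147483648   # 2 GiB
querier:
  max_samples: 50000000
```
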
## Resource-based throttling

Although static limits can protect Cortex components from specific query patterns, they are not generic enough to cover every combination of bad query patterns. For example, what if a query fetches relatively large postings, series, and chunks that are each slightly below the individual limits? For a more generic solution, you can enable resource-based throttling by setting CPU and heap utilization thresholds.

Currently, resource-based throttling only rejects incoming query requests with error code 429 (too many requests) when resource usage breaches the configured thresholds.

For example, the following configuration will start throttling query requests if either CPU or heap utilization is above 80%, leaving 20% of headroom for inflight requests.

```
target: ingester
resource_thresholds:
  cpu: 0.8
  heap: 0.8
```
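
The same thresholds can also be set with the `-resource-thresholds.cpu` and `-resource-thresholds.heap` CLI flags; they apply to ingesters and store-gateways only.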

See https://cortexmetrics.io/docs/configuration/configuration-file/#generic-placeholders:~:text=resource_thresholds for details.

pkg/configs/resources.go

Lines changed: 2 additions & 2 deletions
@@ -13,8 +13,8 @@ type Resources struct {
 }
 
 func (cfg *Resources) RegisterFlags(f *flag.FlagSet) {
-	f.Float64Var(&cfg.CPU, "resource-thresholds.cpu", 0, "Utilization threshold for CPU in percentage, between 0 and 1. 0 to disable.")
-	f.Float64Var(&cfg.Heap, "resource-thresholds.heap", 0, "Utilization threshold for heap in percentage, between 0 and 1. 0 to disable.")
+	f.Float64Var(&cfg.CPU, "resource-thresholds.cpu", 0, "Utilization threshold for CPU in percentage, between 0 and 1. 0 to disable. The CPU utilization metric is from github.com/prometheus/procfs, which is a close estimate. Applicable to ingesters and store-gateways only.")
+	f.Float64Var(&cfg.Heap, "resource-thresholds.heap", 0, "Utilization threshold for heap in percentage, between 0 and 1. 0 to disable. The heap utilization metric is from runtime/metrics, which is a close estimate. Applicable to ingesters and store-gateways only.")
 }
 
 func (cfg *Resources) Validate() error {
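
To illustrate what these registrations do, here is a hypothetical, self-contained snippet (not Cortex code; struct and function names are made up) that registers the same two flags on a fresh FlagSet and parses example values:

```go
package main

import (
	"flag"
	"fmt"
)

// resources mirrors the shape of configs.Resources for illustration only.
type resources struct {
	CPU  float64
	Heap float64
}

func (cfg *resources) registerFlags(f *flag.FlagSet) {
	f.Float64Var(&cfg.CPU, "resource-thresholds.cpu", 0, "Utilization threshold for CPU, between 0 and 1. 0 to disable.")
	f.Float64Var(&cfg.Heap, "resource-thresholds.heap", 0, "Utilization threshold for heap, between 0 and 1. 0 to disable.")
}

func main() {
	var cfg resources
	fs := flag.NewFlagSet("example", flag.ExitOnError)
	cfg.registerFlags(fs)

	// Equivalent to passing the flags on the ingester/store-gateway command line.
	_ = fs.Parse([]string{"-resource-thresholds.cpu=0.8", "-resource-thresholds.heap=0.8"})
	fmt.Printf("cpu=%.1f heap=%.1f\n", cfg.CPU, cfg.Heap) // prints: cpu=0.8 heap=0.8
}
```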

pkg/cortex/modules.go

Lines changed: 4 additions & 0 deletions
@@ -777,6 +777,10 @@ func (t *Cortex) initResourceMonitor() (services.Service, error) {
 
 	scanner, err := resource.NewScanner()
 	if err != nil {
+		if errors.As(err, resource.UnsupportedOSError{}) {
+			level.Warn(util_log.Logger).Log("msg", "Skipping resource monitor", "err", err.Error())
+			return nil, nil
+		}
 		return nil, err
 	}
 
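One subtlety worth noting: `errors.As` requires its target to be a non-nil pointer to a type that implements `error`. Since `UnsupportedOSError` defines `Error()` on a pointer receiver and `NewScanner` returns `&UnsupportedOSError{}`, the match is conventionally written against a `*UnsupportedOSError` variable. The following is a hypothetical, self-contained sketch of that pattern (not Cortex code):

```go
package main

import (
	"errors"
	"fmt"
)

// UnsupportedOSError mirrors the error type added in pkg/util/resource below.
type UnsupportedOSError struct{}

func (e *UnsupportedOSError) Error() string { return "resource scanner is only supported in linux" }

// newScanner stands in for resource.NewScanner running on a non-Linux OS.
func newScanner() error { return &UnsupportedOSError{} }

func main() {
	err := newScanner()

	// errors.As needs a pointer to the concrete error type (*UnsupportedOSError
	// here), i.e. a **UnsupportedOSError target; passing a bare struct value panics.
	var unsupportedOSErr *UnsupportedOSError
	if errors.As(err, &unsupportedOSErr) {
		fmt.Println("skipping resource monitor:", err)
		return
	}
	fmt.Println("unexpected error:", err)
}
```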

pkg/util/resource/monitor.go

Lines changed: 15 additions & 4 deletions
@@ -4,6 +4,7 @@ import (
 	"context"
 	"fmt"
 	"net/http"
+	"runtime"
 	"runtime/metrics"
 	"sync"
 	"time"
@@ -24,6 +25,12 @@ func (e *ExhaustedError) Error() string {
 	return "resource exhausted"
 }
 
+type UnsupportedOSError struct{}
+
+func (e *UnsupportedOSError) Error() string {
+	return "resource scanner is only supported in linux"
+}
+
 const heapMetricName = "/memory/classes/heap/objects:bytes"
 const monitorInterval = time.Second
 const dataPointsToAvg = 30
@@ -43,6 +50,10 @@
 }
 
 func NewScanner() (*Scanner, error) {
+	if runtime.GOOS != "linux" {
+		return nil, &UnsupportedOSError{}
+	}
+
 	proc, err := procfs.Self()
 	if err != nil {
 		return nil, errors.Wrap(err, "error reading proc directory")
@@ -89,8 +100,8 @@ type Monitor struct {
 
 	// Variables to calculate average CPU utilization
 	index int
-	cpuRates []float64
-	cpuIntervals []float64
+	cpuRates [dataPointsToAvg]float64
+	cpuIntervals [dataPointsToAvg]float64
 	totalCPU float64
 	totalInterval float64
 	lastCPU float64
@@ -105,8 +116,8 @@ func NewMonitor(thresholds configs.Resources, limits configs.Resources, scanner
 		containerLimit: limits,
 		scanner: scanner,
 
-		cpuRates: make([]float64, dataPointsToAvg),
-		cpuIntervals: make([]float64, dataPointsToAvg),
+		cpuRates: [dataPointsToAvg]float64{},
+		cpuIntervals: [dataPointsToAvg]float64{},
 
 		lock: sync.RWMutex{},
 	}
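
The `index`, `totalCPU`, `totalInterval` fields and the two fixed-size arrays above suggest a rolling average of CPU usage over the last `dataPointsToAvg` scans. The actual update logic is not part of this diff; the following is a hypothetical, self-contained sketch of how such a fixed-window average can be maintained with arrays of that shape:

```go
package main

import "fmt"

const dataPointsToAvg = 30

// rollingCPU keeps the last N (cpu-seconds, interval-seconds) samples in two
// fixed-size arrays and maintains running totals for an O(1) average.
type rollingCPU struct {
	index         int
	cpuRates      [dataPointsToAvg]float64
	cpuIntervals  [dataPointsToAvg]float64
	totalCPU      float64
	totalInterval float64
}

// observe records one sample, evicting the oldest one so the running totals
// always cover exactly the last dataPointsToAvg points.
func (r *rollingCPU) observe(cpuSeconds, intervalSeconds float64) {
	r.totalCPU += cpuSeconds - r.cpuRates[r.index]
	r.totalInterval += intervalSeconds - r.cpuIntervals[r.index]
	r.cpuRates[r.index] = cpuSeconds
	r.cpuIntervals[r.index] = intervalSeconds
	r.index = (r.index + 1) % dataPointsToAvg
}

// utilization returns the average number of CPU cores used over the window.
func (r *rollingCPU) utilization() float64 {
	if r.totalInterval == 0 {
		return 0
	}
	return r.totalCPU / r.totalInterval
}

func main() {
	var r rollingCPU
	for i := 0; i < 60; i++ {
		r.observe(0.4, 1.0) // 0.4 CPU-seconds used per 1-second scrape
	}
	fmt.Printf("avg CPU utilization: %.2f\n", r.utilization()) // ~0.40
}
```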
