Commit 3a47852

Increase LivenessCheck timeout
Signed-off-by: Anand Rajagopal <[email protected]>
1 parent d829d65 commit 3a47852

3 files changed: +53 -4 lines changed


docs/configuration/config-file-reference.md (4 additions, 0 deletions)

@@ -4460,6 +4460,10 @@ ring:
 # Enable high availability
 # CLI flag: -ruler.enable-ha-evaluation
 [enable_ha_evaluation: <boolean> | default = false]
+
+# Health check timeout for evaluation HA
+# CLI flag: -ruler.eval-ha-healthcheck-timeout
+[eval_ha_health_check_timeout: <duration> | default = 1s]
 ```

 ### `ruler_storage_config`
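Both the YAML setting and the CLI flag take a `<duration>` value in Go duration syntax (`1s`, `500ms`, `1.5h`, and so on). The sketch below only illustrates how such strings map to a `time.Duration` using the standard library; it is not Cortex code.

```go
package main

import (
    "fmt"
    "time"
)

func main() {
    // The default "1s", and overrides such as "500ms" or "2s", follow
    // Go's duration syntax and parse into a time.Duration.
    for _, raw := range []string{"1s", "500ms", "2s"} {
        d, err := time.ParseDuration(raw)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s -> %v\n", raw, d)
    }
}
```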
New file (45 additions, 0 deletions)

---
title: "Ruler High Availability"
linkTitle: "Ruler high availability"
weight: 10
slug: ruler-high-availability
---
This guide explains the concepts behind ruler high availability and when to use this feature.

## Background

When rulers are deployed using shuffle sharding, each rule group is evaluated by a single ruler only. All the rulers in the hash ring will pick the same ruler instance for a given tenant, rule group, and namespace. To learn more about shuffle sharding, please refer to the [dedicated guide](./shuffle-sharding.md).

There are several scenarios in which rule groups might not be evaluated. A few of them are described below:
- **Bad underlying node**<br />
  If the underlying node is unhealthy and unable to send heartbeats, it might take several minutes for the other rulers to mark that ruler as unhealthy in the ring. During this time, no ruler will evaluate the rule groups owned by the ruler running on the unhealthy node.
- **OOM Kills**<br />
  If a ruler gets OOM (Out Of Memory) killed, it has no chance to mark itself as `LEAVING`, and therefore the other rulers will not attempt to take ownership of the rule groups that were being evaluated by the ruler experiencing OOM kills.
- **Availability zone outage**<br />
  If one AZ becomes unavailable, all the rulers in that AZ might experience a network partition while the hash ring still reflects them as healthy. As in the other scenarios, the rulers in the remaining AZs will not attempt to take ownership of the rule groups being evaluated by pods in the affected AZ.
In addition to rule evaluation being affected, the ruler APIs will also return 5xx errors in the scenarios mentioned above.

## Replication factor
The hash ring returns a number of instances equal to the replication factor for a given tenant, rule group, and namespace. For example, if RF=2 the hash ring returns 2 instances, and if RF=3 it returns 3 instances. If AZ awareness is enabled, the hash ring picks rulers from different AZs. The rulers are picked for each tenant, rule group, and namespace combination.
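As a rough illustration of this selection, the sketch below builds a toy hash ring, hashes the tenant/namespace/rule group key to a token, and walks the ring to collect up to RF distinct instances, preferring distinct AZs. Names such as `pickReplicas` and the hashing details are assumptions for illustration, not the actual Cortex ring implementation.

```go
package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

type instance struct {
    addr  string
    zone  string
    token uint32
}

// pickReplicas walks a toy hash ring clockwise from the key's token and
// returns up to rf instances, preferring one instance per availability zone.
func pickReplicas(ring []instance, key string, rf int) []instance {
    sort.Slice(ring, func(i, j int) bool { return ring[i].token < ring[j].token })

    h := fnv.New32a()
    h.Write([]byte(key))
    token := h.Sum32()

    // First instance whose token is >= the key's token (wrapping around).
    start := sort.Search(len(ring), func(i int) bool { return ring[i].token >= token }) % len(ring)

    var picked []instance
    seenZones := map[string]bool{}
    for i := 0; i < len(ring) && len(picked) < rf; i++ {
        cand := ring[(start+i)%len(ring)]
        if seenZones[cand.zone] {
            continue // zone awareness: skip AZs that already hold a replica
        }
        seenZones[cand.zone] = true
        picked = append(picked, cand)
    }
    return picked
}

func main() {
    ring := []instance{
        {addr: "ruler-1:9095", zone: "az-a", token: 100},
        {addr: "ruler-2:9095", zone: "az-b", token: 900},
        {addr: "ruler-3:9095", zone: "az-c", token: 2000},
    }
    // One key per tenant, namespace, and rule group combination.
    key := "tenant-1/namespace-1/group-1"
    for _, inst := range pickReplicas(ring, key, 2) {
        fmt.Println(inst.addr, inst.zone)
    }
}
```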
## Enabling high availability for evaluation

Setting the flag `-ruler.enable-ha-evaluation` to true and setting `ruler.ring.replication-factor` > 1 enables non-primary rulers (replicas 2..n) to check whether replicas 1..n-1 are healthy. For example, if the replication factor is set to 2, the non-primary ruler checks whether the primary is healthy. If the primary is not healthy, the secondary ruler evaluates the rule group. If the primary ruler for that rule group is healthy, the non-primary ruler either drops ownership or does not take ownership. This check is performed by each ruler when syncing rule groups from storage. This reduces the chances of missing rule group evaluations, and the maximum duration of missed evaluations is limited to the sync interval of the rule groups.
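A compact sketch of that ownership decision, assuming a hypothetical liveness probe bounded by the configured health check timeout; this shows only the shape of the check, not the actual `nonPrimaryInstanceOwnsRuleGroup` implementation.

```go
package main

import (
    "context"
    "fmt"
    "time"
)

// livenessCheck stands in for a liveness RPC to another ruler; the real
// probe and its transport are assumptions here.
type livenessCheck func(ctx context.Context, addr string) bool

// shouldOwn reports whether a non-primary replica should evaluate a rule
// group: it takes ownership only if every higher-priority replica fails its
// liveness check within the configured timeout.
func shouldOwn(myIndex int, replicas []string, timeout time.Duration, isHealthy livenessCheck) bool {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    for i := 0; i < myIndex; i++ {
        if isHealthy(ctx, replicas[i]) {
            // A higher-priority replica is alive, so it keeps the rule group.
            return false
        }
    }
    return true
}

func main() {
    replicas := []string{"ruler-1:9095", "ruler-2:9095"}
    probe := func(ctx context.Context, addr string) bool {
        return addr != "ruler-1:9095" // pretend the primary is down
    }
    // The secondary (index 1) takes over because the primary fails the probe.
    fmt.Println(shouldOwn(1, replicas, time.Second, probe))
}
```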
## Enabling high availability for API

Setting the replication factor > 1 instructs non-primary rulers to store a backup of the rule groups. It is important to note that the backup does not contain any state. This allows API calls to be fault-tolerant. Depending upon

pkg/ruler/ruler.go (4 additions, 4 deletions)

@@ -81,8 +81,6 @@ const (
     unknownHealthFilter string = "unknown"
     okHealthFilter      string = "ok"
     errHealthFilter     string = "err"
-
-    livenessCheckTimeout = 100 * time.Millisecond
 )
 
 type DisabledRuleGroupErr struct {
@@ -161,7 +159,8 @@ type Config struct {
     EnableQueryStats      bool `yaml:"query_stats_enabled"`
     DisableRuleGroupLabel bool `yaml:"disable_rule_group_label"`
 
-    EnableHAEvaluation bool `yaml:"enable_ha_evaluation"`
+    EnableHAEvaluation       bool          `yaml:"enable_ha_evaluation"`
+    EvalHAHealthCheckTimeout time.Duration `yaml:"eval_ha_health_check_timeout"`
 }
 
 // Validate config and returns error on failure
@@ -238,6 +237,7 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
     f.BoolVar(&cfg.DisableRuleGroupLabel, "ruler.disable-rule-group-label", false, "Disable the rule_group label on exported metrics")
 
     f.BoolVar(&cfg.EnableHAEvaluation, "ruler.enable-ha-evaluation", false, "Enable high availability")
+    f.DurationVar(&cfg.EvalHAHealthCheckTimeout, "ruler.eval-ha-healthcheck-timeout", 1*time.Second, "Health check timeout for evaluation HA")
 
     cfg.RingCheckPeriod = 5 * time.Second
 }
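The registration above uses the standard library `flag.DurationVar`, so the timeout can be overridden on the command line, for example with `-ruler.eval-ha-healthcheck-timeout=500ms`. Below is a standalone sketch of the same pattern; the `cfg` struct is a stand-in for illustration, not the Cortex ruler `Config`.

```go
package main

import (
    "flag"
    "fmt"
    "time"
)

// cfg mirrors only the two new settings, for illustration.
type cfg struct {
    EnableHAEvaluation       bool
    EvalHAHealthCheckTimeout time.Duration
}

func main() {
    var c cfg
    fs := flag.NewFlagSet("ruler", flag.ExitOnError)
    fs.BoolVar(&c.EnableHAEvaluation, "ruler.enable-ha-evaluation", false, "Enable high availability")
    fs.DurationVar(&c.EvalHAHealthCheckTimeout, "ruler.eval-ha-healthcheck-timeout", time.Second, "Health check timeout for evaluation HA")

    // Override the 1s default from the command line.
    _ = fs.Parse([]string{"-ruler.enable-ha-evaluation=true", "-ruler.eval-ha-healthcheck-timeout=500ms"})
    fmt.Println(c.EnableHAEvaluation, c.EvalHAHealthCheckTimeout) // true 500ms
}
```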
@@ -590,7 +590,7 @@ func (r *Ruler) nonPrimaryInstanceOwnsRuleGroup(g *rulespb.RuleGroupDesc, replic
     responseChan := make(chan *LivenessCheckResponse, len(jobs))
 
     ctx := user.InjectOrgID(context.Background(), userID)
-    ctx, cancel := context.WithTimeout(ctx, livenessCheckTimeout)
+    ctx, cancel := context.WithTimeout(ctx, r.cfg.EvalHAHealthCheckTimeout)
     defer cancel()
 
     err := concurrency.ForEach(ctx, jobs, len(jobs), func(ctx context.Context, job interface{}) error {
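This hunk replaces the previously hard-coded 100ms liveness budget with the configurable timeout; all replica checks run under a single `context.WithTimeout`, so a dead or unreachable peer can delay the ownership decision by at most that duration. The sketch below shows the same pattern in a self-contained form, with plain goroutines standing in for the `concurrency.ForEach` helper and a hypothetical probe function.

```go
package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

// probeAll runs one liveness probe per address concurrently. Every probe
// shares the same deadline, so the whole fan-out is bounded by timeout.
func probeAll(addrs []string, timeout time.Duration, probe func(context.Context, string) error) map[string]error {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    var (
        mu      sync.Mutex
        wg      sync.WaitGroup
        results = make(map[string]error, len(addrs))
    )
    for _, addr := range addrs {
        wg.Add(1)
        go func(addr string) {
            defer wg.Done()
            err := probe(ctx, addr)
            mu.Lock()
            results[addr] = err
            mu.Unlock()
        }(addr)
    }
    wg.Wait()
    return results
}

func main() {
    // A probe that never answers: it returns only when the shared context expires.
    hang := func(ctx context.Context, addr string) error {
        <-ctx.Done()
        return ctx.Err()
    }
    start := time.Now()
    res := probeAll([]string{"ruler-1:9095", "ruler-2:9095"}, time.Second, hang)
    fmt.Println(res, "after", time.Since(start).Round(100*time.Millisecond))
}
```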
