Commit dbc67ff

Ruler HA - Proposal
Signed-off-by: Anand Rajagopal <[email protected]>
1 parent a011efc commit dbc67ff

1 file changed: +52 -0 lines changed

docs/proposals/ruler-ha-new.md

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
---
title: "Ruler HA"
linkTitle: "Ruler HA"
weight: 1
slug: ruler-ha
---

- Author: [Anand Rajagopal](https://github.com/rajagopalanand)
- Date: April 2024
- Status: Proposed
---

## Problem

Rulers in Cortex currently run with a replication factor of 1, so each rule group is assigned to exactly one ruler. This lack of redundancy creates the following risks:

- Rule group evaluation
  - Missed evaluations due to a ruler outage, possibly caused by a deployment, noisy neighbour, hardware failure, etc.
  - Missed evaluations during a ruler brownout caused by other tenants' rule groups sharing the same ruler (noisy neighbour)
- API
  - Inconsistent API results during resharding (e.g. due to a deployment), when rulers are in a transition state while loading rule groups

This proposal attempts to mitigate the above risks by enabling a ruler replication factor greater than 1, so that more than one ruler is assigned the same rule group.

## Proposal

### Make ReplicationFactor configurable

ReplicationFactor in the ruler is currently hardcoded to 1. Making it a configurable parameter is the first step towards enabling HA in the ruler, and it is also the mechanism by which the user turns the feature on. The parameter defaults to 1, which means the feature is off by default.
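
As an illustration of what such a parameter could look like, here is a minimal Go sketch using only the standard `flag` package. The `RingConfig` type, the `parse` helper, and the flag name `ruler.ring.replication-factor` are assumptions made for this example and are not taken from the proposal.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// RingConfig is a hypothetical, trimmed-down stand-in for the ruler ring
// configuration. It is not the actual Cortex type.
type RingConfig struct {
	ReplicationFactor int
}

// RegisterFlags registers the (assumed) flag name. The default of 1 keeps
// ruler HA turned off unless the operator opts in.
func (c *RingConfig) RegisterFlags(f *flag.FlagSet) {
	f.IntVar(&c.ReplicationFactor, "ruler.ring.replication-factor", 1,
		"Number of rulers that load each rule group. 1 disables ruler HA.")
}

// Validate rejects values that would leave a rule group without any ruler.
func (c *RingConfig) Validate() error {
	if c.ReplicationFactor < 1 {
		return errors.New("replication factor must be at least 1")
	}
	return nil
}

// HAEnabled reports whether more than one ruler loads each rule group.
func (c *RingConfig) HAEnabled() bool { return c.ReplicationFactor > 1 }

// parse builds a config from command-line style arguments so the example
// can be exercised without touching os.Args.
func parse(args []string) (RingConfig, error) {
	var cfg RingConfig
	fs := flag.NewFlagSet("ruler", flag.ContinueOnError)
	cfg.RegisterFlags(fs)
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	return cfg, cfg.Validate()
}

func main() {
	cfg, err := parse([]string{"-ruler.ring.replication-factor=3"})
	fmt.Println(cfg.ReplicationFactor, cfg.HAEnabled(), err)
}
```

Keeping the default at 1 means existing deployments behave exactly as before until the value is raised.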

A replication factor greater than 1 will result in multiple rulers loading the same rule groups, but only one ruler evaluating each rule group. The replicas remain in a "passive" state until it is necessary for them to become active.

This redundancy allows rule evaluations that would otherwise be missed during a single-ruler outage to be covered by other rulers that have loaded the same rule groups.
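
The proposal does not specify how a passive replica decides to become active. One possible rule, sketched below in Go purely as an assumption, is that every ruler sees the replica set for a rule group in the same order and the first healthy ruler in that order is the active one. The `Replica` type and both function names are hypothetical.

```go
package main

import "fmt"

// Replica is a hypothetical view of one ruler that has loaded a rule group.
type Replica struct {
	Addr    string
	Healthy bool
}

// activeReplica returns the address of the first healthy ruler in the set.
// Assumption: the set is ordered identically on every ruler, so all rulers
// reach the same answer without coordinating.
func activeReplica(replicas []Replica) (string, bool) {
	for _, r := range replicas {
		if r.Healthy {
			return r.Addr, true
		}
	}
	return "", false
}

// shouldEvaluate reports whether the ruler at self is the active replica
// and therefore the one that should evaluate the rule group.
func shouldEvaluate(self string, replicas []Replica) bool {
	addr, ok := activeReplica(replicas)
	return ok && addr == self
}

func main() {
	set := []Replica{
		{Addr: "ruler-0", Healthy: false}, // primary is down
		{Addr: "ruler-1", Healthy: true},
		{Addr: "ruler-2", Healthy: true},
	}
	for _, r := range set {
		fmt.Printf("%s evaluates: %v\n", r.Addr, shouldEvaluate(r.Addr, set))
	}
}
```

With `ruler-0` unhealthy, `ruler-1` takes over and `ruler-2` stays passive; once `ruler-0` recovers, the same rule hands evaluation back to it.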

To avoid inconsistent rule group state, which is maintained by Prometheus, the author proposes a change to the Prometheus rule group evaluation logic, as described below.

### Prometheus change

The author proposes making a change to Prometheus to allow for pausing and resuming (or activating and deactivating) a rule group, as described in [prometheus/prometheus#13630](https://github.com/prometheus/prometheus/issues/13630).

If the proposal is not accepted by the Prometheus community, the fallback is to maintain a fork of Prometheus for Cortex with modified rule group evaluation behavior. This [draft PR](https://github.com/prometheus/prometheus/pull/13858) shows the changes required in Prometheus to support pausing and resuming rule group evaluation.
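
To make the pause and resume idea concrete, the following Go sketch shows the general shape of such a switch. It is not the Prometheus API and does not reproduce the draft PR; the `Group` type and its `Pause`, `Resume`, and `Tick` methods are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Group is a hypothetical stand-in for a rule group. It is not the
// Prometheus rules.Group type.
type Group struct {
	paused atomic.Bool // may be flipped from another goroutine
	evals  int         // only touched by the evaluation loop
}

// Pause makes later ticks skip evaluation (the "passive" state).
func (g *Group) Pause() { g.paused.Store(true) }

// Resume makes later ticks evaluate again (the "active" state).
func (g *Group) Resume() { g.paused.Store(false) }

// Tick stands in for one evaluation interval. It returns true if the group
// was evaluated and false if the tick was skipped because the group is paused.
// Assumption: Tick is only ever called from a single evaluation goroutine.
func (g *Group) Tick() bool {
	if g.paused.Load() {
		return false
	}
	g.evals++ // a real implementation would evaluate the rules here
	return true
}

func main() {
	g := &Group{}
	g.Pause()
	fmt.Println(g.Tick(), g.evals) // skipped while paused
	g.Resume()
	fmt.Println(g.Tick(), g.evals) // evaluated after resume
}
```

The point of the sketch is that a paused group stays loaded and keeps its schedule, so resuming it does not require reloading rule definitions.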

### API HA

An interim solution is provided in PR [#5773](https://github.com/cortexproject/cortex/issues/5773). It will be modified so that the replicas return both active and passive rule groups, and the API handler will continue to de-duplicate the results. The difference is that, after Ruler HA, a replica could return the proper rule group state if that replica has evaluated the rule group.

PRs:

* Prometheus PR [#13858](https://github.com/prometheus/prometheus/pull/13858) [draft]
* For API HA [#5773](https://github.com/cortexproject/cortex/issues/5773)