Commit 3a47852

Increase LivenessCheck timeout
Signed-off-by: Anand Rajagopal <[email protected]>
1 parent d829d65 commit 3a47852

3 files changed: +53 -4 lines changed


docs/configuration/config-file-reference.md (4 additions, 0 deletions)

@@ -4460,6 +4460,10 @@ ring:
 # Enable high availability
 # CLI flag: -ruler.enable-ha-evaluation
 [enable_ha_evaluation: <boolean> | default = false]
+
+# Health check timeout for evaluation HA
+# CLI flag: -ruler.eval-ha-healthcheck-timeout
+[eval_ha_health_check_timeout: <duration> | default = 1s]
 ```

 ### `ruler_storage_config`
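Both the YAML setting and the CLI flag take a `<duration>` value in Go duration syntax (`1s`, `500ms`, `1.5h`, and so on). The sketch below only illustrates how such strings map to a `time.Duration` using the standard library; it is not Cortex code.

```go
package main

import (
    "fmt"
    "time"
)

func main() {
    // The default "1s", and overrides such as "500ms" or "2s", follow
    // Go's duration syntax and parse into a time.Duration.
    for _, raw := range []string{"1s", "500ms", "2s"} {
        d, err := time.ParseDuration(raw)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s -> %v\n", raw, d)
    }
}
```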
New file (45 additions, 0 deletions)

---
title: "Ruler High Availability"
linkTitle: "Ruler high availability"
weight: 10
slug: ruler-high-availability
---
This guide explains the concepts behind ruler high availability and when to use this feature.

## Background

When rulers are deployed using shuffle sharding, each rule group is evaluated by a single ruler only. All the rulers in the hash ring will pick the same ruler instance for a given tenant, rule group, and namespace. To learn more about shuffle sharding, please refer to the [dedicated guide](./shuffle-sharding.md).

There are several scenarios in which rule groups might not be evaluated. A few of them are described below:
- **Bad underlying node**<br />
  If the underlying node is unhealthy and unable to send heartbeats, it might take several minutes for the other rulers to mark that ruler as unhealthy in the ring. During this time, no ruler will evaluate the rule groups owned by the ruler running on the unhealthy node.
- **OOM Kills**<br />
  If a ruler gets OOM (Out Of Memory) killed, it has no chance to mark itself as `LEAVING`, and therefore the other rulers will not attempt to take ownership of the rule groups that were being evaluated by the ruler experiencing OOM kills.
- **Availability zone outage**<br />
  If one AZ becomes unavailable, all the rulers in that AZ might experience a network partition while the hash ring still reflects them as healthy. As in the other scenarios, the rulers in the remaining AZs will not attempt to take ownership of the rule groups being evaluated by pods in the affected AZ.
In addition to rule evaluation being affected, the ruler APIs will also return 5xx errors in the scenarios mentioned above.

## Replication factor
The hash ring returns a number of instances equal to the replication factor for a given tenant, rule group, and namespace. For example, if RF=2 the hash ring returns 2 instances, and if RF=3 it returns 3 instances. If AZ awareness is enabled, the hash ring picks rulers from different AZs. The rulers are picked for each tenant, rule group, and namespace combination.
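As a rough illustration of this selection, the sketch below builds a toy hash ring, hashes the tenant/namespace/rule group key to a token, and walks the ring to collect up to RF distinct instances, preferring distinct AZs. Names such as `pickReplicas` and the hashing details are assumptions for illustration, not the actual Cortex ring implementation.

```go
package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

type instance struct {
    addr  string
    zone  string
    token uint32
}

// pickReplicas walks a toy hash ring clockwise from the key's token and
// returns up to rf instances, preferring one instance per availability zone.
func pickReplicas(ring []instance, key string, rf int) []instance {
    sort.Slice(ring, func(i, j int) bool { return ring[i].token < ring[j].token })

    h := fnv.New32a()
    h.Write([]byte(key))
    token := h.Sum32()

    // First instance whose token is >= the key's token (wrapping around).
    start := sort.Search(len(ring), func(i int) bool { return ring[i].token >= token }) % len(ring)

    var picked []instance
    seenZones := map[string]bool{}
    for i := 0; i < len(ring) && len(picked) < rf; i++ {
        cand := ring[(start+i)%len(ring)]
        if seenZones[cand.zone] {
            continue // zone awareness: skip AZs that already hold a replica
        }
        seenZones[cand.zone] = true
        picked = append(picked, cand)
    }
    return picked
}

func main() {
    ring := []instance{
        {addr: "ruler-1:9095", zone: "az-a", token: 100},
        {addr: "ruler-2:9095", zone: "az-b", token: 900},
        {addr: "ruler-3:9095", zone: "az-c", token: 2000},
    }
    // One key per tenant, namespace, and rule group combination.
    key := "tenant-1/namespace-1/group-1"
    for _, inst := range pickReplicas(ring, key, 2) {
        fmt.Println(inst.addr, inst.zone)
    }
}
```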
## Enabling high availability for evaluation

Setting the flag `-ruler.enable-ha-evaluation` to true and setting `ruler.ring.replication-factor` > 1 enables non-primary rulers (replicas 2..n) to check whether replicas 1..n-1 are healthy. For example, if the replication factor is set to 2, the non-primary ruler checks whether the primary is healthy. If the primary is not healthy, the secondary ruler evaluates the rule group. If the primary ruler for that rule group is healthy, the non-primary ruler either drops ownership or does not take ownership. This check is performed by each ruler when syncing rule groups from storage. This reduces the chances of missing rule group evaluations, and the maximum duration of missed evaluations is limited to the sync interval of the rule groups.
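A compact sketch of that ownership decision, assuming a hypothetical liveness probe bounded by the configured health check timeout; this shows only the shape of the check, not the actual `nonPrimaryInstanceOwnsRuleGroup` implementation.

```go
package main

import (
    "context"
    "fmt"
    "time"
)

// livenessCheck stands in for a liveness RPC to another ruler; the real
// probe and its transport are assumptions here.
type livenessCheck func(ctx context.Context, addr string) bool

// shouldOwn reports whether a non-primary replica should evaluate a rule
// group: it takes ownership only if every higher-priority replica fails its
// liveness check within the configured timeout.
func shouldOwn(myIndex int, replicas []string, timeout time.Duration, isHealthy livenessCheck) bool {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    for i := 0; i < myIndex; i++ {
        if isHealthy(ctx, replicas[i]) {
            // A higher-priority replica is alive, so it keeps the rule group.
            return false
        }
    }
    return true
}

func main() {
    replicas := []string{"ruler-1:9095", "ruler-2:9095"}
    probe := func(ctx context.Context, addr string) bool {
        return addr != "ruler-1:9095" // pretend the primary is down
    }
    // The secondary (index 1) takes over because the primary fails the probe.
    fmt.Println(shouldOwn(1, replicas, time.Second, probe))
}
```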
## Enabling high availability for API

Setting the replication factor > 1 instructs non-primary rulers to store a backup of the rule groups. It is important to note that the backup does not contain any state. This allows API calls to be fault-tolerant. Depending upon

pkg/ruler/ruler.go (4 additions, 4 deletions)

@@ -81,8 +81,6 @@ const (
     unknownHealthFilter string = "unknown"
     okHealthFilter      string = "ok"
     errHealthFilter     string = "err"
-
-    livenessCheckTimeout = 100 * time.Millisecond
 )
 
 type DisabledRuleGroupErr struct {
@@ -161,7 +159,8 @@ type Config struct {
     EnableQueryStats      bool `yaml:"query_stats_enabled"`
     DisableRuleGroupLabel bool `yaml:"disable_rule_group_label"`
 
-    EnableHAEvaluation bool `yaml:"enable_ha_evaluation"`
+    EnableHAEvaluation       bool          `yaml:"enable_ha_evaluation"`
+    EvalHAHealthCheckTimeout time.Duration `yaml:"eval_ha_health_check_timeout"`
 }
 
 // Validate config and returns error on failure
@@ -238,6 +237,7 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
     f.BoolVar(&cfg.DisableRuleGroupLabel, "ruler.disable-rule-group-label", false, "Disable the rule_group label on exported metrics")
 
     f.BoolVar(&cfg.EnableHAEvaluation, "ruler.enable-ha-evaluation", false, "Enable high availability")
+    f.DurationVar(&cfg.EvalHAHealthCheckTimeout, "ruler.eval-ha-healthcheck-timeout", 1*time.Second, "Health check timeout for evaluation HA")
 
     cfg.RingCheckPeriod = 5 * time.Second
 }
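The registration above uses the standard library `flag.DurationVar`, so the timeout can be overridden on the command line, for example with `-ruler.eval-ha-healthcheck-timeout=500ms`. Below is a standalone sketch of the same pattern; the `cfg` struct is a stand-in for illustration, not the Cortex ruler `Config`.

```go
package main

import (
    "flag"
    "fmt"
    "time"
)

// cfg mirrors only the two new settings, for illustration.
type cfg struct {
    EnableHAEvaluation       bool
    EvalHAHealthCheckTimeout time.Duration
}

func main() {
    var c cfg
    fs := flag.NewFlagSet("ruler", flag.ExitOnError)
    fs.BoolVar(&c.EnableHAEvaluation, "ruler.enable-ha-evaluation", false, "Enable high availability")
    fs.DurationVar(&c.EvalHAHealthCheckTimeout, "ruler.eval-ha-healthcheck-timeout", time.Second, "Health check timeout for evaluation HA")

    // Override the 1s default from the command line.
    _ = fs.Parse([]string{"-ruler.enable-ha-evaluation=true", "-ruler.eval-ha-healthcheck-timeout=500ms"})
    fmt.Println(c.EnableHAEvaluation, c.EvalHAHealthCheckTimeout) // true 500ms
}
```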
@@ -590,7 +590,7 @@ func (r *Ruler) nonPrimaryInstanceOwnsRuleGroup(g *rulespb.RuleGroupDesc, replic
     responseChan := make(chan *LivenessCheckResponse, len(jobs))
 
     ctx := user.InjectOrgID(context.Background(), userID)
-    ctx, cancel := context.WithTimeout(ctx, livenessCheckTimeout)
+    ctx, cancel := context.WithTimeout(ctx, r.cfg.EvalHAHealthCheckTimeout)
     defer cancel()
 
     err := concurrency.ForEach(ctx, jobs, len(jobs), func(ctx context.Context, job interface{}) error {
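This hunk replaces the previously hard-coded 100ms liveness budget with the configurable timeout; all replica checks run under a single `context.WithTimeout`, so a dead or unreachable peer can delay the ownership decision by at most that duration. The sketch below shows the same pattern in a self-contained form, with plain goroutines standing in for the `concurrency.ForEach` helper and a hypothetical probe function.

```go
package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

// probeAll runs one liveness probe per address concurrently. Every probe
// shares the same deadline, so the whole fan-out is bounded by timeout.
func probeAll(addrs []string, timeout time.Duration, probe func(context.Context, string) error) map[string]error {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    var (
        mu      sync.Mutex
        wg      sync.WaitGroup
        results = make(map[string]error, len(addrs))
    )
    for _, addr := range addrs {
        wg.Add(1)
        go func(addr string) {
            defer wg.Done()
            err := probe(ctx, addr)
            mu.Lock()
            results[addr] = err
            mu.Unlock()
        }(addr)
    }
    wg.Wait()
    return results
}

func main() {
    // A probe that never answers: it returns only when the shared context expires.
    hang := func(ctx context.Context, addr string) error {
        <-ctx.Done()
        return ctx.Err()
    }
    start := time.Now()
    res := probeAll([]string{"ruler-1:9095", "ruler-2:9095"}, time.Second, hang)
    fmt.Println(res, "after", time.Since(start).Round(100*time.Millisecond))
}
```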
