Minimize missed rule group evaluations #6129
Conversation
# Enable high availability
# CLI flag: -ruler.enable-ha
[enable_ha: <boolean> | default = false]
enable_ha feels like a bad name to me. Can we just replace this with checking the replication factor? If we can't, can we be specific about what HA we want to enable? For example, enable_ha_evaluation.
API HA checks only for the replication factor. My thought was to use a separate flag in case someone wants to use one or the other and not both. I can change this to enable_ha_evaluation.
# Timeout for liveness checks performed during rule sync
# CLI flag: -ruler.liveness-check-timeout
[liveness_check_timeout: <duration> | default = 1s]
Is 1s a good default here?
This is a simple/fast check. I thought 1s might be big enough as a default. If RF=2, this should check only one Ruler. If RF=3, it will check 2 Rulers. Do you think it needs to be higher/lower?
Do we need this config at all? Can't we just use a sensible timeout?
I removed the config and set the default to 100ms since the gRPC calls are pretty lightweight.
pkg/ruler/client_pool_test.go
Outdated
@@ -5,6 +5,8 @@ import (
	"net"
	"testing"

	"github.com/cortexproject/cortex/pkg/util/services"
Please group imports.
pkg/ruler/ruler.go
Outdated
@@ -273,6 +281,13 @@ type MultiTenantManager interface {
// | +-----------------+ |
// |                       |
// +---------------------------------------------------------------+

type ruler interface {
Is this necessary?
Please see #6129 (comment)
pkg/ruler/ruler.go
Outdated
@@ -488,11 +517,11 @@ func tokenForGroup(g *rulespb.RuleGroupDesc) uint32 {
	return ringHasher.Sum32()
}

-func instanceOwnsRuleGroup(r ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, instanceAddr string, forBackup bool) (bool, error) {
+func instanceOwnsRuleGroup(r ruler, rr ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, instanceAddr string, forBackup bool) (bool, error) {
Should we just use *Ruler as the method receiver here?
The reason for introducing an interface is to narrow down what is available for functions to consume. Existing functions such as filterRuleGroups are not methods, to avoid accidentally using the Ruler's ring directly. If I pass in the Ruler struct/object as-is, it will expose the ring to those functions, which could lead to accidental use of it. That is why I introduced an interface: it narrows the scope of what is available to these functions. Of course, someone in the future could just add a method on the interface that exposes the ring. Having said all this, I'm open to suggestions here.
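To illustrate the design choice, here is a minimal, self-contained sketch (the type and method names are illustrative assumptions, not the actual code in this PR): a narrow interface hands helper functions only what they need, so they cannot reach the Ruler's ring by accident.

package main

import "fmt"

// Ring stands in for the Ruler's hash ring.
type Ring struct{}

// Ruler is the wide struct; it carries the ring.
type Ruler struct {
	ring *Ring
	addr string
}

func (r *Ruler) Addr() string { return r.addr }

// rulerView is the narrow interface handed to helper functions.
// It deliberately exposes no way to reach the ring.
type rulerView interface {
	Addr() string
}

// describeOwner can only use what rulerView exposes.
func describeOwner(r rulerView) string {
	return fmt.Sprintf("owner candidate: %s", r.Addr())
}

func main() {
	r := &Ruler{ring: &Ring{}, addr: "ruler-1:9095"}
	fmt.Println(describeOwner(r)) // the helper never sees r.ring
}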
Existing functions such as filterRuleGroups are not methods, to avoid accidentally using the Ruler's ring directly. If I pass in the Ruler struct/object as-is, it will expose the ring to those functions, which could lead to accidental use of it.

I am not sure I get the concern here. What's the worst that could go wrong if the Ruler's ring is accessed directly? instanceOwnsRuleGroup seems to only do read operations on the ring, so it shouldn't be a big issue.
I changed filterRuleGroups and other functions to methods on the Ruler and removed the interface.
pkg/ruler/ruler.go
Outdated
if r.Config().EnableHA {
	for i, ruler := range rlrs.Instances {
		if ruler.Addr == instanceAddr && i == 0 {
			level.Debug(r.Logger()).Log("msg", "primary taking ownership", "user", g.User, "group", g.Name, "namespace", g.Namespace, "token", hash, "ruler", instanceAddr)
I wonder if it really helps to log the token here. Does it help debug HA issues?
It helps with troubleshooting evaluation issues. Users can filter logs by token and know what happened with a particular rule group.
Users can filter logs by token and know what happened with a particular rule group

Are we going to document this clearly? Otherwise I don't think it is something easy to debug as a user. Tbh, I don't know how I would map a particular rule group to a hash just by looking at the log. Aren't group and namespace already enough to locate a particular rule group?
Removed
pkg/ruler/ruler.go
Outdated
	return &LivenessCheckResponse{State: int32(r.State())}, nil
}

func nonPrimaryInstanceOwnsRuleGroup(r ruler, g *rulespb.RuleGroupDesc, replicas []string, selfAddress string) bool {
It would be great to have some comments on this function for better readability
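For readers following the thread, here is a hedged sketch of the idea behind nonPrimaryInstanceOwnsRuleGroup as described in this conversation (not the PR's exact implementation; the helper names and addresses are made up): a non-primary ruler claims a rule group only when every replica ranked ahead of it fails the liveness check.

package main

import "fmt"

// livenessCheck stands in for the gRPC LivenessCheck call made to another ruler.
func livenessCheck(addr string) bool {
	// Pretend the primary (ruler-1) is down and everything else is up.
	return addr != "ruler-1:9095"
}

// nonPrimaryOwnsRuleGroup returns true only when none of the replicas ranked
// ahead of this ruler respond as alive and running.
func nonPrimaryOwnsRuleGroup(replicasAhead []string) bool {
	for _, addr := range replicasAhead {
		if livenessCheck(addr) {
			// A higher-priority replica is alive; it keeps ownership.
			return false
		}
	}
	return true
}

func main() {
	// This ruler is the second replica, so only the primary is ahead of it.
	fmt.Println(nonPrimaryOwnsRuleGroup([]string{"ruler-1:9095"})) // prints true: the primary is down
}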
pkg/ruler/ruler.go
Outdated
@@ -1164,6 +1285,8 @@ func (r *Ruler) getShardedRules(ctx context.Context, userID string, rulesRequest
		}
	}
	// Concurrently fetch rules from all rulers.
	ctx, cancel := context.WithTimeout(ctx, r.cfg.ListRulesFanoutTimeout)
This will be a behavior change for all Cortex users. Can we just use the API request's own timeout?
If the client sets the timeout, then this timeout/deadline will be the same, and it will fail the overall API request for the client. The desired behavior is not to fail the whole API request but instead to be fault tolerant if a subset of Rulers do not respond in a certain amount of time. Ideally, r.cfg.ListRulesFanoutTimeout should be less than the API request's timeout. If we do not want to change existing behavior, we can either set the default timeout to a much bigger number (max int) or wrap this in the enable_ha_evaluation flag, or perhaps there is a better option. Open to recommendations.
+1 to Ben's comment... I think the solution here is not to set a timeout; instead, we should return as soon as we have a complete answer (quorum?).
Working on this
I'm curious, is this line change related to minimizing missed evaluations or is it unrelated? If it's unrelated and you are solving another problem, I would rather have this in a different PR, just because this can introduce regressions.
With Ruler HA, we can return as soon as we have a complete answer, but a new integration test I wrote was failing. The test fails because the IP of the ruler that gets killed is not removed from the Docker network fast enough during the test, and this causes the gRPC call to that ruler to hang and the test to time out. This scenario can happen in non-test environments as well.
+1 to Ben's comment... I think the solution here is not to set a timeout; instead, we should return as soon as we have a complete answer (quorum?)

When Ruler API HA is enabled and the rules are backed up, the backed-up rules do not contain any state. In other words, they do not contain any alerts. Getting a complete answer means we have to get rules from 100% of the Rulers.

As an example: 3 AZs, RF = 3, ruler count per AZ = 3 (total = 9). Assume 1 ruler in AZ 2 is unhealthy. What we want is results from 8 rulers. Similarly, if we have 2 bad rulers, then we need results from 7. In this scenario, we can tolerate 2 AZs being down (RF = 3), but if we bail as soon as results from 1 AZ are retrieved, then the results will contain a lot of backup rules that do not contain alerts.

I moved the timeout inside concurrency.ForEach so that each Rules() fan-out call gets its own timeout rather than the timeout being set for the whole getShardedRules. Let me know if this is acceptable or if there are other ideas to handle this scenario.
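A minimal sketch of that direction, using only the standard library instead of the project's concurrency.ForEach helper (the function, addresses, and timeout value are illustrative assumptions): the deadline applies to each fan-out call individually, so one unresponsive ruler cannot stall the whole operation.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// fetchRules is a stand-in for the per-ruler Rules() gRPC call.
func fetchRules(ctx context.Context, addr string) error {
	select {
	case <-time.After(50 * time.Millisecond): // pretend the call succeeded
		return nil
	case <-ctx.Done(): // the per-call deadline was hit
		return ctx.Err()
	}
}

func main() {
	parent := context.Background()
	addrs := []string{"ruler-1:9095", "ruler-2:9095", "ruler-3:9095"}

	var wg sync.WaitGroup
	for _, addr := range addrs {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			// The timeout covers this call only, not the whole fan-out.
			ctx, cancel := context.WithTimeout(parent, 2*time.Minute)
			defer cancel()
			if err := fetchRules(ctx, addr); err != nil {
				fmt.Println("skipping ruler", addr, "err:", err)
				return
			}
			fmt.Println("got rules from", addr)
		}(addr)
	}
	wg.Wait()
}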
pkg/ruler/ruler.go
Outdated
@@ -518,6 +563,68 @@ func instanceOwnsRuleGroup(r ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRu
	return ownsRuleGroup, nil
}

func (r *Ruler) LivenessCheck(_ context.Context, request *LivenessCheckRequest) (*LivenessCheckResponse, error) {
	level.Debug(r.logger).Log("msg", "liveness check request", "request received from", request.RulerAddress)
Seems RulerAddress is only used here for debug logging purposes. Can we remove that?
The whole log is a debug log line, and the log is not useful if it does not have the ruler address. If someone wants to troubleshoot, having the address of the ruler that performed the liveness check would be useful, I think.
Hmm, can't we log it on the client side? Wouldn't that have the same effect?
Removed
func (r *Ruler) LivenessCheck(_ context.Context, request *LivenessCheckRequest) (*LivenessCheckResponse, error) {
	level.Debug(r.logger).Log("msg", "liveness check request", "request received from", request.RulerAddress)
	if r.lifecycler.ServiceContext().Err() != nil || r.subservices.IsStopped() {
		return nil, errors.New("ruler's context is canceled and might be stopping soon")
I am unsure about this, tbh. When the r.lifecycler.ServiceContext().Err() != nil || r.subservices.IsStopped() condition is met, is the gRPC server still working correctly to handle the liveness check?
Depending on how the Ruler is configured, r.lifecycler.ServiceContext() can be canceled while the gRPC server stays active for a long time. When r.lifecycler.ServiceContext() is canceled, Prometheus will stop evaluating rule groups as soon as possible, but if the Ruler instance has not shut down yet, it will respond to the secondary as though it is active, and this causes missed evaluations. This check is necessary.

r.subservices.IsStopped() is debatable. I only saw this in my testing in some cases during hard restarts (kill -9), when the gRPC server was already running but the ring was not fully operational.
pkg/ruler/ruler.go
Outdated
err := concurrency.ForEach(ctx, jobs, len(jobs), func(ctx context.Context, job interface{}) error {
	addr := job.(string)
	rulerClient, err := r.GetClientFor(addr)
	if err != nil {
		errorChan <- err
		level.Debug(r.Logger()).Log("msg", "unable to get client for ruler", "ruler addr", addr, "token", rgToken)
		return nil
	}
	level.Debug(r.Logger()).Log("msg", "performing liveness check against", "addr", addr, "for", g.Name, "token", rgToken, "instance addr", selfAddress)

	resp, err := rulerClient.LivenessCheck(ctx, &LivenessCheckRequest{
		RulerAddress: selfAddress,
	})
	if err != nil {
		errorChan <- err
		level.Debug(r.Logger()).Log("msg", "liveness check failed", "addr", addr, "for", g.Name, "err", err.Error(), "token", rgToken)
		return nil
	}
	level.Debug(r.Logger()).Log("msg", "liveness check succeeded", "addr", addr, "for", g.Name, "token", rgToken, "ruler state", services.State(resp.GetState()))
	responseChan <- resp
	return nil
})

close(errorChan)
close(responseChan)
If one ruler is taking time to reply to the liveness check, wouldn't that make this whole call hang?
Discussed offline... I just think the timeout could be hardcoded and not a config.
Set the timeout to 100ms and removed the config
pkg/ruler/ruler.go
Outdated
cfg.RingCheckPeriod = 5 * time.Second
cfg.LivenessCheckTimeout = 100 * time.Millisecond
Can this just be a const? No need for it to be in the config struct, I guess.
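A sketch of what that suggestion could look like (the constant name is an assumption):

package ruler

import "time"

// livenessCheckTimeout is hardcoded rather than exposed as a config option,
// per the review discussion.
const livenessCheckTimeout = 100 * time.Millisecond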
pkg/ruler/ruler.go
Outdated
@@ -220,7 +225,12 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
	f.BoolVar(&cfg.EnableQueryStats, "ruler.query-stats-enabled", false, "Report the wall time for ruler queries to complete as a per user metric and as an info level log message.")
	f.BoolVar(&cfg.DisableRuleGroupLabel, "ruler.disable-rule-group-label", false, "Disable the rule_group label on exported metrics")

	f.BoolVar(&cfg.EnableHAEvaluation, "ruler.enable-ha-evaluation", false, "Enable high availability")
	f.DurationVar(&cfg.ListRulesFanoutTimeout, "ruler.list-rules-fanout-timeout", 2*time.Minute, "Timeout for fanout calls to other rulers")
	f.DurationVar(&cfg.LivenessCheckTimeout, "ruler.liveness-check-timeout", 1*time.Second, "Timeout for liveness checks performed during rule sync")
and this should be removed as well?
integration/ruler_test.go
Outdated
expectedNames[i] = ruleName

if num%2 == 0 {
	alertCount++
is this variable used anywhere?
Removed
integration/ruler_test.go
Outdated
c, err := e2ecortex.NewClient("", "", "", ruler1.HTTPEndpoint(), "user-1")
require.NoError(t, err)
namespaceNames := []string{"test1", "test2", "test3", "test4", "test5"}
namespaceNameCount := make([]int, 5)
Total nit: in case you want to add more users in the tests.
-namespaceNameCount := make([]int, 5)
+namespaceNameCount := make([]int, len(namespaceNames))
results, err := c.GetPrometheusRules(e2ecortex.RuleFilter{})
require.NoError(t, err)
require.Equal(t, numRulesGroups, len(results))
I'm a bit confused about how this test is validating HA. Should we validate that the total number of evaluations across all rules is what is expected?
I updated the test to assert that the rules are evaluated by the remaining rulers
pkg/ruler/ruler.go
Outdated
level.Debug(r.Logger()).Log("msg", "performing liveness check against", "addr", addr, "for", g.Name, "instance addr", selfAddress) | ||
|
||
resp, err := rulerClient.LivenessCheck(ctx, &LivenessCheckRequest{ | ||
RulerAddress: selfAddress, |
Any reason why we are passing the address of the current ruler as a parameter here? I checked and it is not being used.
It's not being used now but I was using it for troubleshooting/logging on the server side. It could still be useful for troubleshooting
If it is supposed to be used for troubleshooting, we can add a debug log statement in the handler, no? Leaving it in this state makes me think of YAGNI.
I also think we can remove it for now and add it when needed.
pkg/ruler/ruler.go
Outdated
	} else {
		// Even if the replication factor is set to a number bigger than 1, only the first ruler evaluates the rule group
		ownsRuleGroup = rlrs.Instances[0].Addr == instanceAddr
	}
}

if ownsRuleGroup && ruleGroupDisabled(g, disabledRuleGroups) {
Nit: we could create a function to generate the return value in the case of ownsRuleGroup = true. This way we can return directly from the code with early returns instead of using this nesting, which is confusing and hard to reason about.
func (r *Ruler) instanceOwnsRuleGroup(rr ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, instanceAddr string, forBackup bool) (bool, error) {
	hash := tokenForGroup(g)

	rlrs, err := rr.Get(hash, RingOp, nil, nil, nil)
	if err != nil {
		return false, errors.Wrap(err, "error reading ring to verify rule group ownership")
	}

	if forBackup {
		// Only the second up to the last replica are used as backup
		for i := 1; i < len(rlrs.Instances); i++ {
			if rlrs.Instances[i].Addr == instanceAddr {
				return ownsRuleGroupOrDisable(g, disabledRuleGroups)
			}
		}
	}

	if rlrs.Instances[0].Addr == instanceAddr {
		// Regardless of whether ruler HA is enabled or not, in this case this ruler is the primary
		return ownsRuleGroupOrDisable(g, disabledRuleGroups)
	}

	if r.Config().EnableHAEvaluation {
		for i, ruler := range rlrs.Instances {
			if i == 0 {
				// We already checked for primary
				continue
			}
			if ruler.Addr == instanceAddr && r.nonPrimaryInstanceOwnsRuleGroup(g, rlrs.GetAddresses()[:i], instanceAddr) {
				level.Info(r.Logger()).Log("msg", "non-primary ruler taking ownership", "user", g.User, "group", g.Name, "namespace", g.Namespace, "ruler", instanceAddr)
				return ownsRuleGroupOrDisable(g, disabledRuleGroups)
			}
		}
	}

	// Means for sure that the instance is not the owner
	return false, nil
}

func ownsRuleGroupOrDisable(g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups) (bool, error) {
	if ruleGroupDisabled(g, disabledRuleGroups) {
		return false, &DisabledRuleGroupErr{Message: fmt.Sprintf("rule group %s, namespace %s, user %s is disabled", g.Name, g.Namespace, g.User)}
	}
	return true, nil
}
done
pkg/ruler/ruler.go
Outdated
rulerClient, err := r.GetClientFor(addr)
if err != nil {
	errorChan <- err
	level.Debug(r.Logger()).Log("msg", "unable to get client for ruler", "ruler addr", addr)
Should the level here be higher than debug?
if ctx.Err() != nil {
	level.Info(r.logger).Log("msg", "context is canceled. not syncing rules")
	return
}
Why did we need to introduce this?
When the ruler is shutting down or the pod is terminating, syncs still happen. This is because the manager waits for rule groups to finish evaluating before stopping. I think it's better not to sync rule groups when the ruler is shutting down.
It's just weird that you check it here, because this context can be cancelled anywhere in the code after this line as well, so what you're trying to prevent can happen anyway?
pkg/ruler/ruler.go
Outdated
-// but only ring passed as parameter.
-func filterBackupRuleGroups(userID string, ruleGroups []*rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, ring ring.ReadRing, instanceAddr string, log log.Logger, ringCheckErrors prometheus.Counter) []*rulespb.RuleGroupDesc {
+// This method must not use r.ring, but only ring passed as parameter
+func (r *Ruler) filterBackupRuleGroups(userID string, ruleGroups []*rulespb.RuleGroupDesc, owned []*rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, ring ring.ReadRing, instanceAddr string, log log.Logger, ringCheckErrors prometheus.Counter) []*rulespb.RuleGroupDesc {
Seems like lots of parameters can be removed from this function now that it is a method of the Ruler struct?
pkg/ruler/ruler.go
Outdated
@@ -1189,6 +1300,8 @@ func (r *Ruler) getShardedRules(ctx context.Context, userID string, rulesRequest
		return errors.Wrapf(err, "unable to get client for ruler %s", addr)
	}

	ctx, cancel := context.WithTimeout(ctx, r.cfg.ListRulesFanoutTimeout)
I think this change is not related to this PR? Can we remove it from here and create another PR where we can discuss possible solutions for this?
This is related, as we discussed offline. Changed it to remote_timeout under the ruler_client section.
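A hedged sketch of that direction (the package, field, and flag names are assumptions based on the config doc shown later in this thread, not necessarily the exact PR code): the timeout becomes a generic setting on the ruler client config rather than a ruler-specific fan-out option.

package client

import (
	"flag"
	"time"
)

// Config holds the ruler client settings; only the new field is shown here.
type Config struct {
	RemoteTimeout time.Duration
}

// RegisterFlagsWithPrefix wires the timeout under the ruler_client section,
// e.g. prefix = "ruler.client".
func (cfg *Config) RegisterFlagsWithPrefix(prefix string, f *flag.FlagSet) {
	f.DurationVar(&cfg.RemoteTimeout, prefix+".remote-timeout", 2*time.Minute, "Timeout for downstream rulers.")
}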
Approved... other than changing the ruler timeout config to be under the ruler_client section, it LGTM.
Pushed up a new commit. Please take a look.
@@ -4142,6 +4142,10 @@ ruler_client:
  # CLI flag: -ruler.client.tls-insecure-skip-verify
  [tls_insecure_skip_verify: <boolean> | default = false]

  # Timeout for downstream rulers.
I think we need a doc talking about Ruler HA and how to enable it with the various configs/flags. We added 3 new flags in this PR, but I am confused about whether I need to tune/change them somehow or whether they should work out of the box.
I agree. I will submit another PR for the doc
Note that #5862 needs to be merged before this PR. Without enough approvals on the proposal itself, I don't think we can merge the actual implementation.
Thanks
Umm, I think we forgot to mark this as experimental. We can do it in the next PR.
What this PR does:
Currently, once a Ruler instance loads a rule group, it evaluates it continuously. If the instance evaluating the rule group becomes unavailable, there is a high chance of missed evaluations before the instance becomes available again or another instance loads and evaluates the rule group. Ruler instances can become unavailable for a variety of reasons, including bad underlying nodes and OOM kills. This issue can be exacerbated if the Ruler instance appears healthy within the cluster ring but is actually in an unhealthy state.
This PR addresses the problem by introducing a check to ensure that the primary Ruler is alive and in a running state. Here’s how it works: during rule sync, a non-primary Ruler performs a liveness check (a gRPC call) against the replicas ranked ahead of it in the ring, and it takes ownership of a rule group only if those replicas do not respond as alive and running.
With this change, the maximum duration of missed evaluations will be limited to the sync interval of the rule groups, reducing the impact of primary Ruler unavailability.
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]