[alerts] add alert when autoscaler adds nodes rapidly #10016
Not sure this is ready yet, and I'm not sure I totally understand it either.
summary: "Autoscaler is adding new nodes rapidly",
description: 'Autoscaler in cluster {{ $labels.cluster }} is rapidly adding new nodes.',
},
expr: '((sum(cluster_autoscaler_nodes_count) by (cluster)) - (sum(cluster_autoscaler_nodes_count offset 10m) by (cluster))) > 10',
This would have fired a lot recently. Are all of those valid? Seems like a lot.
Hmmm.
We do seem to scale them up quite aggressively, not sure if this is normal or not:
grafana link
I can tweak the alert to be less aggressive, so that it catches only really large spikes like the one on May 11th. wdyt?
Yes, I think that would be a good skateboard. I'm just trying to avoid sending unnecessary pages knowing that we might have to tweak this again in the future.
15 may be a good number; it reduces this to 2 incidents.
done!
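For anyone following the threshold discussion, here is a rough sketch of what the expression compares: the node count now against the node count 10 minutes earlier. It is a simplified model rather than how Prometheus evaluates the rule. The function name `rapid_scale_up_minutes` and the per-minute `counts` series are made up for illustration and are not data from the linked dashboard.

```python
from typing import List


def rapid_scale_up_minutes(
    node_counts: List[int], threshold: int, window: int = 10
) -> List[int]:
    """Return the minute indices where the node count grew by more than
    `threshold` compared with `window` minutes earlier.

    Simplified stand-in for: sum(nodes) - sum(nodes offset 10m) > threshold
    """
    return [
        t
        for t in range(window, len(node_counts))
        if node_counts[t] - node_counts[t - window] > threshold
    ]


# Hypothetical per-minute node counts for one cluster (illustrative only):
# a jump of 12 nodes at minute 10, then a jump of 18 nodes at minute 20.
counts = [20] * 10 + [32] * 10 + [50] * 10

matches_at_10 = rapid_scale_up_minutes(counts, threshold=10)
matches_at_15 = rapid_scale_up_minutes(counts, threshold=15)

print(len(matches_at_10))  # both jumps exceed a threshold of 10
print(len(matches_at_15))  # only the larger jump exceeds 15
```

In this toy series, raising the threshold from 10 to 15 drops the smaller jump and keeps the larger one, which is the kind of effect described above.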
labels: {
severity: 'critical',
},
'for': '1m',
Why is duration needed in this condition?
@ArthurSens wdyt?
technically not needed.
Cool, let's see if we can remove it. You can check to see if it passes validation.
and done!
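For context on the `for` question, here is a small sketch of the general idea behind a hold duration: the condition has to stay true across consecutive rule evaluations before the alert fires, and with no hold it fires on the first true evaluation. This is a simplified model and not Prometheus internals. The function `firing_evaluations` and the sample `condition` list are hypothetical.

```python
from typing import List


def firing_evaluations(condition: List[bool], hold_evals: int) -> List[int]:
    """Return the evaluation indices at which an alert would be firing.

    `hold_evals` is the number of earlier consecutive true evaluations
    required before firing; 0 models a rule with no `for` clause.
    """
    fired = []
    streak = 0  # consecutive evaluations where the condition was true
    for i, ok in enumerate(condition):
        streak = streak + 1 if ok else 0
        if streak > hold_evals:
            fired.append(i)
    return fired


# Hypothetical evaluation results of the alert expression (illustrative only).
condition = [False, True, True, True, False, True, False]

print(firing_evaluations(condition, hold_evals=0))  # fires on every true evaluation
print(firing_evaluations(condition, hold_evals=1))  # skips one-off true evaluations
```

Because the expression already compares against the value from 10 minutes earlier, a sustained jump in node count keeps the condition true for roughly that long, so a short hold changes little in that case.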
/lgtm
Description
Related Issue(s)
Fixes #9946
How to test
Test the alert expression in Grafana
Release Notes
Documentation