
[alerts] add alert when autoscaler adds nodes rapidly #10016

Merged: 1 commit merged into main from pavel/9946 on May 19, 2022

Conversation

sagor999 (Contributor)

Description

Related Issue(s)

Fixes #9946

How to test

Test the alert expression in Grafana.
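
One way to do this (not prescribed by the PR; assuming the default Prometheus data source) is to paste the rule's expression into Grafana's Explore view and check how often it would have crossed the threshold over the past days, e.g.:

((sum(cluster_autoscaler_nodes_count) by (cluster)) - (sum(cluster_autoscaler_nodes_count offset 10m) by (cluster))) > 10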

Release Notes

none

Documentation

@sagor999 sagor999 marked this pull request as ready for review May 13, 2022 19:42
@sagor999 sagor999 requested a review from a team May 13, 2022 19:42
@github-actions github-actions bot added the team: workspace label May 13, 2022
@kylos101 kylos101 (Contributor) left a comment


Not sure this is ready yet, and I'm not sure I fully understand it either.

summary: "Autoscaler is adding new nodes rapidly",
description: 'Autoscaler in cluster {{ $labels.cluster }} is rapidly adding new nodes.',
},
expr: '((sum(cluster_autoscaler_nodes_count) by (cluster)) - (sum(cluster_autoscaler_nodes_count offset 10m) by (cluster))) > 10',
Contributor:

This would have fired a lot recently; are all of those valid? Seems like a lot.

Contributor (Author):

Hmmm.
We do seem to scale them up quite aggressively; I'm not sure whether this is normal or not:
grafana link

I can tweak the alert to be less aggressive so it only catches really large spikes like the one on May 11th. wdyt?

Contributor:

Yes, I think that would be a good skateboard. I'm just trying to avoid sending unnecessary pages, knowing that we might have to tweak this again in the future.

15 may be a good number; it reduces this to 2 incidents.

Contributor (Author):

done!
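
For reference, a minimal sketch of the tweaked expression, assuming the same jsonnet layout as the excerpt quoted above, with only the threshold raised from 10 to 15:

expr: '((sum(cluster_autoscaler_nodes_count) by (cluster)) - (sum(cluster_autoscaler_nodes_count offset 10m) by (cluster))) > 15',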

labels: {
severity: 'critical',
},
'for': '1m',
Contributor:

Why is duration needed in this condition?

@ArthurSens wdyt?

Contributor (Author):

Technically, it's not needed.

Contributor:

Cool, let's see if we can remove it? You can check whether it still passes validation.

Contributor (Author):

and done!
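
For reference, a minimal sketch of how the complete rule might read once the 'for' field is dropped, assembled from the excerpts quoted in this review; the alert name and the annotations wrapper are assumptions based on the usual Prometheus rule shape and are not shown in the quoted diff:

{
  // assumed alert name; not part of the quoted diff
  alert: 'AutoscalerAddsNodesRapidly',
  labels: {
    severity: 'critical',
  },
  // no 'for' field: the alert fires as soon as the expression evaluates to true
  annotations: {
    summary: "Autoscaler is adding new nodes rapidly",
    description: 'Autoscaler in cluster {{ $labels.cluster }} is rapidly adding new nodes.',
  },
  expr: '((sum(cluster_autoscaler_nodes_count) by (cluster)) - (sum(cluster_autoscaler_nodes_count offset 10m) by (cluster))) > 15',
}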

@aledbf aledbf (Member) commented May 19, 2022

/lgtm

@aledbf aledbf self-requested a review May 19, 2022 07:24
@roboquat roboquat merged commit ad8d971 into main May 19, 2022
@roboquat roboquat deleted the pavel/9946 branch May 19, 2022 07:25
@roboquat roboquat added the deployed: workspace and deployed labels May 25, 2022
Labels
deployed: workspace, deployed, release-note-none, size/S, team: workspace
Projects
None yet
Development

Successfully merging this pull request may close this issue: Fire alert when nodes are added too quickly (#9946)
4 participants