Skip to content

Conversation

gagan16k
Copy link
Contributor

@gagan16k gagan16k commented Aug 7, 2025

What this PR does / why we need it:
This PR adds changes such that a gauge metric scraped on prometheus gets cleared and re-populated on every poll, so that it reflects the latest state of the reference slice *machineSet.Status.FailedMachines.
The logic to set this variable has also been updated so that it contains the current set of machines in the Failed state.

Which issue(s) this PR fixes:
Fixes #476

Special notes for your reviewer:
IT for mcm-provider-aws passed

Testing was done by setting the machineCreationTimeout to a value lesser than the time it takes for a machine to be provisioned and move into the Running state (1m). The machine moves to the Failed state, when it can be checked against a prometheus graph to verify if it was recorded accurately

Some changes were required to ensure a Failed machine stays failed, like an artificial delay in the goroutine marking a machine for deletion

image Stacked graph reflecting number of failed machines accurately matched number in the cluster

Release note:

Fixed metric `mcm_machine_set_failed_machines` and underlying variable `*machineSet.Status.FailedMachines` so that they reflect the current state of machines

@gagan16k gagan16k requested a review from a team as a code owner August 7, 2025 10:16
@gardener-robot gardener-robot added the needs/review Needs review label Aug 7, 2025
@CLAassistant
Copy link

CLAassistant commented Aug 7, 2025

CLA assistant check
All committers have signed the CLA.

@gardener-robot gardener-robot added the size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) label Aug 7, 2025
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 7, 2025
Copy link
Contributor

@takoverflow takoverflow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Please update the release notes (you can check other PRs for reference)

Copy link
Contributor

@aaronfern aaronfern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!
Just a small nit

Co-authored-by: Aaron Francis Fernandes <[email protected]>
@aaronfern aaronfern merged commit 817afda into gardener:master Aug 14, 2025
14 checks passed
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Aug 14, 2025
@gagan16k gagan16k deleted the failedmachines-metric-ch branch August 15, 2025 23:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs/review Needs review reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MCM does not reset the failed_machines gauge once the machine is deleted
6 participants