
fix container_oom_events_total always returns 0. #3278


Open
chengjoey wants to merge 1 commit into master from fix/container-oom-total

Conversation


@chengjoey chengjoey commented Mar 22, 2023

fix #3015
In a Kubernetes pod, if a container is OOM-killed, it will be deleted and a new container will be created. Therefore, the container_oom_events_total metric will always be 0. this pr refactor the collector of oom events, and retain the deleted container oom information for a period of events. And add flag oom_event_retain_time to decide how long the oom event will be keep, default is 5 minutes
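A minimal sketch of the retention idea described above (hypothetical names and structure, not the actual cadvisor code): OOM events live in their own map, keyed by container name, so they survive the container being deleted and recreated. The exporter would then walk this store when building container_oom_events_total instead of relying on the live container list.

```go
package oomevents

import (
	"sync"
	"time"
)

// oomRecord keeps just enough information to keep exporting
// container_oom_events_total after the container itself is gone.
type oomRecord struct {
	ContainerName string
	Labels        map[string]string
	Count         uint64    // number of OOM kills observed
	LastSeen      time.Time // time of the most recent OOM kill
}

// oomStore retains OOM events independently of the container lifecycle.
type oomStore struct {
	mu         sync.Mutex
	retainTime time.Duration // how long to keep events for deleted containers
	records    map[string]*oomRecord
}

func newOOMStore(retain time.Duration) *oomStore {
	return &oomStore{retainTime: retain, records: map[string]*oomRecord{}}
}

// AddEvent records one OOM kill for a container; the entry is created on first
// use and is not removed when the container is deleted, only when it expires.
func (s *oomStore) AddEvent(name string, labels map[string]string, at time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.records[name]
	if !ok {
		r = &oomRecord{ContainerName: name, Labels: labels}
		s.records[name] = r
	}
	r.Count++
	r.LastSeen = at
}
```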

@k8s-ci-robot
Collaborator

Hi @chengjoey. Thanks for your PR.

I'm waiting for a Google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chengjoey
Author

/assign @iwankgb @kragniz

Is it feasible to keep the metrics of OOM-killed containers without deleting them?
Please take a look.

@chengjoey
Author

/kind bug

@chengjoey chengjoey force-pushed the fix/container-oom-total branch from dcbab71 to 70b1b02 on March 22, 2023 15:05
@iwankgb
Collaborator

iwankgb commented Mar 24, 2023

What happens when PID 1 forks another process and the forked process gets OOM-killed?

@chengjoey
Author

What happens when PID 1 forks another process and the forked process gets OOM-killed?

The OOM kill of the forked process is still logged to the kernel log (/dev/kmsg), so it should still be possible to associate the event with the corresponding container.
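To illustrate why the fork case still works, here is a rough sketch of the kind of kernel-log matching involved; the regex is an approximation of the memcg OOM lines the kernel writes to /dev/kmsg, not the exact pattern cadvisor's oomparser uses.

```go
package oomevents

import "regexp"

// Approximate shape of a cgroup (memcg) OOM kill line in /dev/kmsg, e.g.
//   oom-kill:constraint=CONSTRAINT_MEMCG,...,task_memcg=/kubepods/burstable/pod.../<id>,task=stress,pid=1234,uid=0
// The victim's memory cgroup path identifies the container even when the killed
// process is a child forked by PID 1 rather than PID 1 itself.
var memcgLine = regexp.MustCompile(`task_memcg=([^,]+),task=([^,]+),pid=(\d+)`)

// parseMemcg extracts the victim's memory cgroup path and task name from one line.
func parseMemcg(line string) (cgroupPath, task string, ok bool) {
	m := memcgLine.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}
```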

@szuecs

szuecs commented May 10, 2023

@chengjoey what happens if a container is killed every second?
Do I understand correctly that we keep creating new container metrics, or would this counter increase to >1?
If we keep creating new container metrics, I would call this feature a memory leak, because every container start creates a new set of metrics and, as far as I understand, we would now store them forever. (I am not very familiar with the code.)

@ishworgurung

memory leak [ ... ]

@szuecs In what way would it be a memory leak?

@szuecs

szuecs commented Aug 11, 2023

@ishworgurung maybe the wording is not correct, but as far as I understand it will increase memory over time, which is never GCed, and eventually cAdvisor itself gets OOM-killed.
Increasing the counter is great, though. Having old metrics retained forever is likely an issue.
Maybe it's also not part of this PR, so feel free to ignore this.

@chengjoey chengjoey force-pushed the fix/container-oom-total branch from 70b1b02 to 02b6c33 on September 4, 2023 08:25
@chengjoey
Author

@ishworgurung maybe the wording is not correct, but as far as I understand it will increase memory over time, which is never GCed, and eventually cAdvisor itself gets OOM-killed. Increasing the counter is great, though. Having old metrics retained forever is likely an issue. Maybe it's also not part of this PR, so feel free to ignore this.

Hi @szuecs @ishworgurung, I have made modifications in this PR: the OOM event metric information is kept in a separate map, and the new flag oom_event_retain_time configures the retention time. OOM metrics older than this are deleted to prevent memory leaks.

@iwankgb could you please take a look and review when you have time?
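Continuing the hypothetical oomStore sketch shown after the PR description, the retention flag and pruning could look roughly like this; the flag name mirrors the PR's oom_event_retain_time, everything else is illustrative rather than cadvisor's real wiring.

```go
package oomevents

import (
	"flag"
	"time"
)

// Retention flag in the spirit of the PR's oom_event_retain_time; the real flag
// handling lives elsewhere in cadvisor, this is only an illustration.
var oomEventRetainTime = flag.Duration("oom_event_retain_time", 5*time.Minute,
	"how long OOM events of deleted containers are retained before being dropped")

// Prune drops records whose most recent OOM event is older than the retention
// window, which bounds memory use even if containers are OOM-killed every second.
func (s *oomStore) Prune(now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for name, r := range s.records {
		if now.Sub(r.LastSeen) > s.retainTime {
			delete(s.records, name)
		}
	}
}

// startPruning runs Prune periodically until stop is closed.
func startPruning(s *oomStore, stop <-chan struct{}) {
	ticker := time.NewTicker(time.Minute)
	go func() {
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case now := <-ticker.C:
				s.Prune(now)
			}
		}
	}()
}
```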

@dims
Collaborator

dims commented Oct 16, 2023

/ok-to-test

@dims
Collaborator

dims commented Oct 16, 2023

@chengjoey please resolve merge conflicts.

In a Kubernetes pod, if a container is OOM-killed, it will be deleted and a new container will be created. Therefore, the `container_oom_events_total` metric will always be 0. Refactor the OOM event collector and retain the OOM information of deleted containers for a period of time.

Signed-off-by: joey <[email protected]>
@chengjoey chengjoey force-pushed the fix/container-oom-total branch from 02b6c33 to 0b6dfeb on October 17, 2023 03:38
@chengjoey
Author

/test pull-cadvisor-e2e

@chengjoey
Author

@chengjoey please resolve merge conflicts.

Thanks @dims, the PR has been rebased.

@chengjoey
Author

/test pull-cadvisor-e2e

2 similar comments
@chengjoey
Author

/test pull-cadvisor-e2e

@chengjoey
Author

/test pull-cadvisor-e2e

@k8s-ci-robot
Collaborator

@chengjoey: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cadvisor-e2e | Commit: 0b6dfeb | Required: true | Rerun command: /test pull-cadvisor-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@nlamirault

Hi, any news on this bugfix?
We're waiting for this to enable an alert on OOMKilled events (kubernetes-monitoring/kubernetes-mixin#822).
Thanks.

@taraspos

taraspos commented Apr 25, 2024

Hi!
Just wanted to bump this issue again. It would be great to get it fixed.

@tsipo

tsipo commented Apr 29, 2024

Hi @pschichtel and others.
I was testing this issue on Kubernetes using a small image I have built (see here). I noticed that when I use that tool in forked mode (this and this use case), I do get container_oom_events_total == 1, because the container continues to live long enough after the OOM kill for cAdvisor to be scraped. If I change the AFTER_FORK_INTERVAL env var to 0, so the container exits immediately, I don't get container_oom_events_total == 1, because the container is de-registered immediately from cAdvisor.
FYI.

@frittentheke

frittentheke commented Aug 16, 2024

@dims @iwankgb @kragniz may I kindly ask for an update on this issue?
Monitoring OOM events seems to be in a rather confusing, slightly messy state...

I was discussing the monitoring aspects of OOM events in Kubernetes with e.g. @dgrisonnet in kubernetes/kubernetes#69676 (comment) and was pointed at kubernetes/kubernetes#108004, which supposedly enabled the kubelet to expose the OOM metrics from cAdvisor. But in the end those metrics need to actually be available, which, looking at this very issue / PR, still seems to be broken?

@chengjoey
Author

Right now, the main cause is probably that when a container OOMs, the kubelet kills the container and creates a new one, so the historical OOM events are lost. I think the idea of the current PR is correct; maybe there are some things I'm missing.

@sellers

sellers commented Nov 26, 2024

Right now, the main cause is probably that when a container OOMs, the kubelet kills the container and creates a new one, so the historical OOM events are lost. I think the idea of the current PR is correct; maybe there are some things I'm missing.

Do I understand correctly that this whole issue is a race condition, in that Prometheus cannot scrape the container to get the metric for the OOM event because the very container that provides that metric has OOMed? If so, that would give the merge request offered here more context for those following this thread. It would also give context to @tsipo's test cases above.

@frittentheke

frittentheke commented Apr 23, 2025

@dims may I kindly ask again if there is any way forward with this PR? Google's Autopilot paper (https://research.google/pubs/autopilot-workload-autoscaling-at-google-scale/) talks at length about how OOM events are used to rightsize application resource requests, so having cAdvisor and Kubernetes do their best to provide these metrics and events should be obvious.
Please kindly also see my references in #3278 (comment).

Comment on lines +1317 to +1319
if err := m.addOrUpdateOomInfo(cont, oomInstance.TimeOfDeath); err != nil {
	klog.Errorf("failed to add OOM info for %q: %v", oomInstance.ContainerName, err)
}

@dgrisonnet dgrisonnet Apr 24, 2025


This is dependent on the existence of the container:

cadvisor/manager/manager.go

Lines 1307 to 1311 in 0b6dfeb

conts, err := m.getRequestedContainers(oomInstance.ContainerName, request)
if err != nil {
	klog.V(2).Infof("failed getting container info for %q: %v", oomInstance.ContainerName, err)
	continue
}

If we want to address the current dependency issue of the metric on the container lifecycle, I think we shouldn't rely on the containers seen by the manager at all and rather just produce metrics based on the info that cadvisor parses via the oom parser.
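A rough sketch of that direction, assuming hypothetical label names: the counter is fed purely from what the OOM parser reports, so it does not matter whether the manager still knows the container. The label set here is illustrative, not cadvisor's actual labels.

```go
package oomevents

import "github.com/prometheus/client_golang/prometheus"

// Counter keyed only by information the OOM parser can provide on its own.
var oomKillsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "container_oom_events_total",
		Help: "Count of OOM kill events observed per memory cgroup.",
	},
	[]string{"memcg_path", "process"},
)

// recordOOM is called for every parsed kernel OOM line; it needs nothing from
// the container manager, so a container deleted right after the kill still
// leaves its event behind in the counter.
func recordOOM(memcgPath, process string) {
	oomKillsTotal.WithLabelValues(memcgPath, process).Inc()
}
```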


Shouldn't it be possible to track the necessary information before containers are killed, so that it is available independently of the container lifecycle?


@dgrisonnet dgrisonnet Apr 24, 2025


It is possible, but that would require caching all the container data.
For a separate reason, there was an attempt to create a cache in #2974, but I don't know if any of the maintainers would be willing to pull the trigger on such a big change. At least it wasn't considered acceptable in the other PR due to the memory implications.
Also, Kubernetes is and has been trying to move away from cAdvisor in favor of container-runtime metrics, which doesn't speak in favor of such an investment in cAdvisor.


It is possible, but that would require caching all the container data.

True, but not all of the container data, just the bare minimum needed to produce useful metrics.
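A sketch of what "the bare minimum" might mean in practice (field names are hypothetical): cache only the labels needed to make an OOM metric useful, populated while the container is still known to cAdvisor and consulted later when its OOM event is exported.

```go
package oomevents

import "sync"

// containerLabels is the small slice of container metadata worth keeping around
// for OOM metrics; everything else (stats, specs, histories) stays uncached.
type containerLabels struct {
	Name      string // container name or cgroup path
	Pod       string
	Namespace string
	Image     string
}

// labelCache remembers labels while the container exists so they can still be
// attached to an OOM event exported after the container is gone.
type labelCache struct {
	mu     sync.RWMutex
	byName map[string]containerLabels
}

func (c *labelCache) Remember(name string, l containerLabels) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.byName == nil {
		c.byName = map[string]containerLabels{}
	}
	c.byName[name] = l
}

func (c *labelCache) Lookup(name string) (containerLabels, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	l, ok := c.byName[name]
	return l, ok
}
```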

@@ -1832,24 +1822,21 @@ func (c *PrometheusCollector) collectContainersInfo(ch chan<- prometheus.Metric)
		}
	}

	if c.includedMetrics.Has(container.OOMMetrics) {
		for _, oomInfo := range c.infoProvider.GetOOMInfos() {
			labels, values := genLabelValues(rawLabels, oomInfo.MetricLabels)


rawLabels also depends on the presence of the container. Instead of using the default labels, which we can't always rely on with OOM kills, it might be better to define new labels specific to the OOM metrics.


Successfully merging this pull request may close these issues.

container_oom_events_total always returns 0