fix container_oom_events_total always returns 0. #3278
base: master
Conversation
Hi @chengjoey. Thanks for your PR. I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind bug
force-pushed from dcbab71 to 70b1b02
What happens when PID 1 forks another process and the forked process gets OOM-killed?
When the forked process is OOM-killed, the relevant log information can still be read from /dev/kmsg, so it should still be possible to associate the event with the corresponding container.
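(A minimal sketch, for illustration only and not cadvisor's actual oomparser code: it assumes a modern kernel whose oom-kill record includes task_memcg= and task= fields, and the regex, names, and program structure are my own assumptions.)

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// Assumed kernel log shape: the "oom-kill:" record carries the victim's cgroup
// path (task_memcg=...) and process name (task=...), even when the victim is a
// process forked by PID 1 rather than PID 1 itself.
var oomKillRe = regexp.MustCompile(`oom-kill:.*task_memcg=([^,]+).*,task=([^,]+)`)

func main() {
	f, err := os.Open("/dev/kmsg") // usually requires root
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	// Each read from /dev/kmsg returns one kernel log record, so give the
	// scanner a buffer large enough to hold a full record.
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 8192), 8192)
	for scanner.Scan() {
		if m := oomKillRe.FindStringSubmatch(scanner.Text()); m != nil {
			// m[1] is the cgroup path, which is what lets the event be tied
			// back to the owning container; m[2] is the killed process name.
			fmt.Printf("OOM kill in cgroup %s (process %s)\n", m[1], m[2])
		}
	}
}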
@chengjoey what happens if a container is killed every second?
@szuecs In what way would it be a memory leak?
@ishworgurung maybe the wording is not correct, but it will increase memory over time, which is never GCed, and eventually cadvisor itself gets OOM-killed. As far as I understand.
force-pushed from 70b1b02 to 02b6c33
hi @szuecs @ishworgurung, I have made modifications in this PR, putting the OOM event metric information in a separate map and adding the flag oom_event_retain_time to control how long it is retained. @iwankgb could you please take a look when you have time?
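(Roughly, the retention idea could look like the sketch below; the type and field names are illustrative, not the PR's actual identifiers, and the retention window would come from the new flag.)

package oomcache

import (
	"sync"
	"time"
)

// oomEntry is an illustrative record of OOM kills for one container,
// kept even after the container itself has been deleted.
type oomEntry struct {
	count    uint64
	lastSeen time.Time
}

type oomCache struct {
	mu         sync.Mutex
	entries    map[string]*oomEntry // keyed by container name
	retainTime time.Duration        // e.g. taken from an oom_event_retain_time flag
}

func newOOMCache(retain time.Duration) *oomCache {
	return &oomCache{entries: map[string]*oomEntry{}, retainTime: retain}
}

// record bumps the counter for a container when an OOM event is parsed.
func (c *oomCache) record(containerName string, at time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[containerName]
	if !ok {
		e = &oomEntry{}
		c.entries[containerName] = e
	}
	e.count++
	e.lastSeen = at
}

// gc drops entries that have not seen an OOM event within the retention
// window, bounding memory use even when containers churn rapidly.
func (c *oomCache) gc(now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for name, e := range c.entries {
		if now.Sub(e.lastSeen) > c.retainTime {
			delete(c.entries, name)
		}
	}
}

The periodic gc pass is what keeps memory bounded even if containers are OOM-killed every second, which addresses the leak concern raised above.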
/ok-to-test
@chengjoey please resolve merge conflicts.
In a Kubernetes pod, if a container is OOM-killed, it will be deleted and a new container will be created. Therefore, the `container_oom_events_total` metric will always be 0. Refactor the collector of OOM events, and retain the OOM information of deleted containers for a period of time. Signed-off-by: joey <[email protected]>
force-pushed from 02b6c33 to 0b6dfeb
/test pull-cadvisor-e2e
Thanks @dims, the PR has been rebased.
/test pull-cadvisor-e2e
/test pull-cadvisor-e2e
/test pull-cadvisor-e2e
@chengjoey: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Hi, any news on this bugfix?
Hi!
Hi @pschichtel and others.
@dims @iwankgb @kragniz may I kindly ask for an update on this issue? I was discussing the monitoring aspects of OOM events in Kubernetes with e.g. @dgrisonnet in kubernetes/kubernetes#69676 (comment) and was pointed at kubernetes/kubernetes#108004, which supposedly enabled Kubelet to expose the OOM metrics from cadvisor. But in the end those metrics need to be available, and looking at this very issue / PR here, that still seems to be buggy?
For now, the main reason is probably that when a container OOMs, Kubelet kills the container and creates a new one, so the historical OOM events are lost. I think the current PR idea is correct. Maybe there is something I am missing.
@dims may I kindly ask again if there is any way forward with this PR? Google's Autopilot paper (https://research.google/pubs/autopilot-workload-autoscaling-at-google-scale/) talks at length about how OOM events are used to rightsize application resource requests - so having cadvisor and Kubernetes do their best to provide these metrics and events should be obvious.
if err := m.addOrUpdateOomInfo(cont, oomInstance.TimeOfDeath); err != nil {
	klog.Errorf("failed to add OOM info for %q: %v", oomInstance.ContainerName, err)
}
This is dependent on the existence of the container:
Lines 1307 to 1311 in 0b6dfeb
conts, err := m.getRequestedContainers(oomInstance.ContainerName, request)
if err != nil {
	klog.V(2).Infof("failed getting container info for %q: %v", oomInstance.ContainerName, err)
	continue
}
If we want to address the current dependency issue of the metric on the container lifecycle, I think we shouldn't rely on the containers seen by the manager at all and rather just produce metrics based on the info that cadvisor parses via the oom parser.
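(A hedged sketch of that suggestion: a collector that serves container_oom_events_total purely from whatever the OOM parser produced, never consulting the manager's container list. The OOMRecord type and the records callback are assumptions made for illustration.)

package collector

import (
	"github.com/prometheus/client_golang/prometheus"
)

// OOMRecord is an assumed shape for what the OOM parser yields.
type OOMRecord struct {
	ContainerName string
	Count         uint64
}

// oomCollector exposes container_oom_events_total from parser output only,
// independent of whether the container still exists in the manager.
type oomCollector struct {
	records func() []OOMRecord
	desc    *prometheus.Desc
}

func newOOMCollector(records func() []OOMRecord) *oomCollector {
	return &oomCollector{
		records: records,
		desc: prometheus.NewDesc(
			"container_oom_events_total",
			"Count of OOM events observed for the container.",
			[]string{"container_name"}, nil),
	}
}

func (c *oomCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *oomCollector) Collect(ch chan<- prometheus.Metric) {
	for _, r := range c.records() {
		ch <- prometheus.MustNewConstMetric(
			c.desc, prometheus.CounterValue, float64(r.Count), r.ContainerName)
	}
}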
Shouldn't it be possible to track the necessary information before containers are killed, so that it is available independent of the container lifecycle?
It is possible, but that would require caching all the container data.
For a separate reason, there was an attempt to create a cache in #2974, but I don't know if any of the maintainers would be willing to pull the trigger on such a big change. At least it wasn't conceivable in the other PR due to the memory implications.
Also, Kubernetes has been trying to move away from cadvisor in favor of container-runtime metrics, which doesn't speak in favor of such an investment in cadvisor.
It is possible, but that would require caching all the container data.
True, but not all container data, just the bare minimum to be able to produce useful metrics.
@@ -1832,24 +1822,21 @@ func (c *PrometheusCollector) collectContainersInfo(ch chan<- prometheus.Metric)
	}
}

if c.includedMetrics.Has(container.OOMMetrics) {
	for _, oomInfo := range c.infoProvider.GetOOMInfos() {
		labels, values := genLabelValues(rawLabels, oomInfo.MetricLabels)
rawLabels also depends on the presence of the container. Instead of using the default labels, which we can't always rely on with OOM kills, it might be better to define new labels specific to the OOM metrics.
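(To illustrate that, the metric could declare its own label set filled entirely from the parsed kernel event instead of rawLabels; the label names below are assumptions, not an agreed-upon scheme.)

package collector

import "github.com/prometheus/client_golang/prometheus"

// OOM-specific labels derived only from the parsed /dev/kmsg record, so the
// metric no longer depends on the container still being known to the manager.
var oomEventsDesc = prometheus.NewDesc(
	"container_oom_events_total",
	"Count of OOM events observed for the container.",
	[]string{"container_name", "oom_victim_process", "oom_victim_pid"},
	nil,
)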
fix #3015
In a Kubernetes pod, if a container is OOM-killed, it will be deleted and a new container will be created. Therefore, the `container_oom_events_total` metric will always be 0. This PR refactors the collector of OOM events and retains the OOM information of deleted containers for a period of time. It also adds a flag `oom_event_retain_time` to decide how long OOM events are kept; the default is 5 minutes.
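(For context, a flag like the one described could be registered roughly as below with Go's standard flag package; the name and default follow the PR description, while the variable name and surrounding program are illustrative.)

package main

import (
	"flag"
	"fmt"
	"time"
)

// Mirrors the described oom_event_retain_time flag: how long OOM information
// for deleted containers is kept before being dropped, defaulting to 5 minutes.
var oomEventRetainTime = flag.Duration("oom_event_retain_time", 5*time.Minute,
	"how long to retain OOM event information for deleted containers")

func main() {
	flag.Parse()
	fmt.Println("retaining OOM events for", *oomEventRetainTime)
}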