
Manager pod has multiple restarts with OOMKilled reason #1416

@vbedida79

Description


Summary

The operator's controller-manager pod has multiple restarts with the OOMKilled termination reason.

Details

The controller-manager pod for version 0.26.1 has multiple restarts with the OOMKilled reason on OCP 4.12. As a workaround, we increased the pod's memory limit from 50 MB to 100 MB, which stopped the restarts.
After the memory increase, we observed memory usage growing slowly over the course of three days, from an initial 70 MB to 108 MB now. Could this indicate an internal memory leak?
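
For reference, the workaround amounts to raising the manager container's memory limit in the operator Deployment. A minimal sketch of the changed resources stanza; the deployment name, namespace, and container name below are assumptions and may differ per install:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: inteldeviceplugins-controller-manager   # assumed name; check your install
      namespace: inteldeviceplugins-system          # assumed namespace
    spec:
      template:
        spec:
          containers:
          - name: manager
            resources:
              limits:
                memory: 100Mi   # raised from 50Mi to stop the OOMKilled restarts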
The pod logs do not show any errors:

I0512 17:44:29.821411 1 reconciler.go:233] "intel-device-plugins-manager: " controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" SgxDevicePlugin="sgxdeviceplugin-sample" namespace="" name="sgxdeviceplugin-sample" reconcileID=f73e9186-de1a-4b4c-a7c1-50e73a749a63 ="(MISSING)"
I0512 17:44:29.821583 1 reconciler.go:233] "intel-device-plugins-manager: " controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" GpuDevicePlugin="gpudeviceplugin-sample" namespace="" name="gpudeviceplugin-sample" reconcileID=0c1fac7a-7709-4dcd-b123-54f8a453ff4f ="(MISSING)"
I0512 17:44:29.821603 1 reconciler.go:233] "intel-device-plugins-manager: " controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" QatDevicePlugin="qatdeviceplugin-sample" namespace="" name="qatdeviceplugin-sample" reconcileID=84c9d221-e5f4-4348-b7da-a2acae338b90 ="(MISSING)"
I0512 17:44:29.828747 1 reconciler.go:233] "intel-device-plugins-manager: " controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" QatDevicePlugin="qatdeviceplugin-sample" namespace="" name="qatdeviceplugin-sample" reconcileID=c89e9d61-74b2-40f6-aa6c-3bda3437ce9e ="(MISSING)"

Possible solutions

  1. Is increasing the memory limit an effective fix? The root cause could be an internal memory leak in the application; profiling the manager would confirm or rule this out (see the sketch after this list).
  2. A temporary mitigation could be to run more than one replica of the pod so that restarts do not overlap. This would not address a memory leak, though.
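
One way to investigate a suspected leak is to expose Go's pprof endpoints in the manager binary and compare heap profiles taken a few hours apart. The following is a minimal sketch, not the operator's actual main.go; wiring it into main() and the listen address are assumptions:

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Serve pprof on localhost only; port-forward into the pod to reach it.
        go func() {
            if err := http.ListenAndServe("localhost:6060", nil); err != nil {
                log.Printf("pprof server stopped: %v", err)
            }
        }()

        // ... start the controller manager as usual ...
        select {} // placeholder to keep this sketch running
    }

Heap snapshots could then be compared with go tool pprof http://localhost:6060/debug/pprof/heap; steadily growing inuse_space across snapshots would point to a leak rather than normal caching.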
