Description
Summary
The operator's controller-manager pod restarts repeatedly and is terminated with reason OOMKilled.
Details
The controller-manager pod for operator version 0.26.1 restarts repeatedly with an OOMKilled reason on OCP 4.12. As a workaround, we increased the pod's memory limit from 50 MB to 100 MB, which stopped the restarts.
After the memory increase, we observed memory usage growing slowly over the course of 3 days, from initially 70 MB to currently 108 MB. Could this be caused by an internal memory leak?
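For reference, a minimal sketch of the limit change we applied (the deployment name, namespace, and container name are assumptions; adjust them to your installation, and note the manifest uses Mi where the text above says MB):

```yaml
# Hypothetical fragment of the controller-manager Deployment;
# only the resources section is relevant, other names are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inteldeviceplugins-controller-manager
  namespace: inteldeviceplugins-system
spec:
  template:
    spec:
      containers:
        - name: manager
          resources:
            requests:
              memory: 50Mi
            limits:
              memory: 100Mi   # raised from 50Mi; this stopped the OOMKilled restarts
```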
The pod logs do not show any errors:
```
I0512 17:44:29.821411 1 reconciler.go:233] "intel-device-plugins-manager: " controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" SgxDevicePlugin="sgxdeviceplugin-sample" namespace="" name="sgxdeviceplugin-sample" reconcileID=f73e9186-de1a-4b4c-a7c1-50e73a749a63 ="(MISSING)"
I0512 17:44:29.821583 1 reconciler.go:233] "intel-device-plugins-manager: " controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" GpuDevicePlugin="gpudeviceplugin-sample" namespace="" name="gpudeviceplugin-sample" reconcileID=0c1fac7a-7709-4dcd-b123-54f8a453ff4f ="(MISSING)"
I0512 17:44:29.821603 1 reconciler.go:233] "intel-device-plugins-manager: " controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" QatDevicePlugin="qatdeviceplugin-sample" namespace="" name="qatdeviceplugin-sample" reconcileID=84c9d221-e5f4-4348-b7da-a2acae338b90 ="(MISSING)"
I0512 17:44:29.828747 1 reconciler.go:233] "intel-device-plugins-manager: " controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" QatDevicePlugin="qatdeviceplugin-sample" namespace="" name="qatdeviceplugin-sample" reconcileID=c89e9d61-74b2-40f6-aa6c-3bda3437ce9e ="(MISSING)"
```
Possible solutions
- Is increasing the memory limit an effective fix? The root cause could be an internal memory leak in the application (see the profiling sketch after this list).
- A temporary workaround could be to increase the number of replicas so that pod restarts do not overlap, although this would not address a potential memory leak.
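To confirm or rule out a leak, comparing heap profiles over time could help. A minimal sketch, assuming the manager's main.go could be modified to expose pprof via the Go standard library (the port and placement are illustrative, not the operator's actual code):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on http.DefaultServeMux
)

func main() {
	// Hypothetical addition to the manager's main(): serve pprof on a
	// side port so heap profiles can be collected from the running pod.
	go func() {
		if err := http.ListenAndServe("localhost:8082", nil); err != nil {
			log.Printf("pprof server stopped: %v", err)
		}
	}()

	// ... the existing manager setup and mgr.Start(ctx) would follow here ...
	select {} // placeholder to keep this sketch runnable on its own
}
```

With that in place, `kubectl port-forward <pod> 8082:8082` followed by `go tool pprof http://localhost:8082/debug/pprof/heap` would allow comparing heap profiles a few hours apart; steadily growing allocations in the same call path would support the leak theory.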