Description
Bug description
We have observed an error in production, in preview-env and in integration tests that leaves the workspace with the status "The container could not be located when the pod was terminated". We checked the related GCP log and no data loss happened.
Questions
- Is "The container could not be located when the pod was deleted. The container used to be Running" also happening?
Yes! Check this GCP log.
- Is this only happening on stop? Milan's scenario seems to indicate otherwise.
Plan
- Verify whether, as things stand now, data loss occurs when this error happens.
- Check Milan's case to understand whether it happened while the workspace was in the running phase.
Old description
We try to get the status of the pod when it is no longer running, at a point in time we cannot determine. Logs.
Impact to the user:
(1) the workspace is generally left in a failed state. Users can try to restart it, because failed is a terminal phase.
(2) user data may be lost.
This error message ("The container could not be located when the pod was terminated") comes from the kubelet.
https://github.com/kubernetes/kubernetes/blob/4aa451e8458a7cbf78ed464e9e47e87d424541ce/pkg/kubelet/kubelet_pods.go#L1810-L1817
Potentially related to this Kubernetes bug: kubernetes/kubernetes#104107
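To help with the research, here is a minimal Go sketch of how affected pods could be flagged by matching the two messages quoted in this issue. The names `containerStatus` and `lostContainers` are hypothetical and only for illustration. The struct is a simplified stand-in, not the Kubernetes API type, and this is not code taken from the kubelet or from our own components.

```go
package main

import "strings"

// The two messages quoted in this issue.
const (
	msgTerminated = "The container could not be located when the pod was terminated"
	msgDeleted    = "The container could not be located when the pod was deleted"
)

// containerStatus is a hypothetical, simplified stand-in for the parts of a
// container status we care about here. It is NOT the Kubernetes API type.
type containerStatus struct {
	Name              string
	TerminatedMessage string // empty if the container has no terminated state
}

// lostContainers returns the names of containers whose termination message
// contains one of the two messages above, in input order.
func lostContainers(statuses []containerStatus) []string {
	var names []string
	for _, s := range statuses {
		if strings.Contains(s.TerminatedMessage, msgTerminated) ||
			strings.Contains(s.TerminatedMessage, msgDeleted) {
			names = append(names, s.Name)
		}
	}
	return names
}
```

Substring matching is used instead of equality so that longer variants, such as the "The container used to be Running" suffix mentioned in the questions above, are still caught.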
Steps to reproduce
I don't know
Workspace affected
No response
Expected behavior
This error message does not occur in production.
Example repository
No response
Anything else?
This has been happening in gen59, gen60 and gen61, too. Logs.
Definition of done
Let's spend some time researching whether this is a Kubernetes bug or could in fact be caused by other circumstances too. Please timebox this to 2 hours, then share the results with the team in Slack so we can socialize next steps.
Why research? Because the workspaces impacted by this bug end with a Failed status. cc @geropl I'm not sure whether a workspace ending in a Failed status will negatively impact UBP. I assume not, but wanted to check.