OOM status not reported for Task/Batch APIs at all times #1806

Closed
RobertLucian opened this issue Jan 20, 2021 · 0 comments · Fixed by #1807
Labels
bug Something isn't working

Description

There are multiple cases in which the OOM error is not reported in cortex get (a detection sketch follows the list below):

  1. Exit codes 137/236/237/350/363/370 appear in the logs, but the container status reports exit code 0 with reason OOMKilled, and the Job is marked as Successful.
  2. Exit codes 137/236/237/350 appear in the logs, but the container status reports exit code 0 with reason OOMKilled, and the Job is marked as Failed.
  3. The pod is evicted by Kubernetes, the Job is marked as Successful, and the pod's reason carries a "memory was too low, had to be evicted"-like message.
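
For reference, a rough sketch of how OOM could be detected from the container statuses and the pod reason rather than from the exit code or the Job condition (this uses the official Kubernetes Python client for illustration only; the Cortex operator is written in Go, and the pod name/namespace below are placeholders):

```python
# Illustrative sketch, not the Cortex implementation: inspect container statuses
# instead of trusting the exit code or the Job's Succeeded/Failed condition.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Placeholder pod name and namespace.
pod = v1.read_namespaced_pod(name="batch-worker-pod", namespace="default")

oom_killed = False
for cs in pod.status.container_statuses or []:
    # Check both the current and the last terminated state; the reason can be
    # OOMKilled even when the reported exit code is 0 (cases 1 and 2 above).
    for term in (cs.state.terminated, cs.last_state.terminated):
        if term is not None and term.reason == "OOMKilled":
            oom_killed = True

# Case 3: the pod itself was evicted by Kubernetes due to memory pressure.
evicted = pod.status.reason == "Evicted"

print("OOM:", oom_killed or evicted)
```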

Reproducibility

Set a very low mem request in the cortex.yaml config and allocate a large numpy array in the job.
Submit the job and notice that the job status is not set to OOM.
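
A minimal sketch of such a job body, assuming the mem request/limit in cortex.yaml is set to something small (the exact cortex.yaml fields and handler layout depend on the Cortex version; the array size below is arbitrary):

```python
# Illustrative job body: with mem set very low in cortex.yaml, this allocation
# exceeds the container's memory limit and the container gets OOM-killed.
import numpy as np

# ~8 GiB of float64s, far more than a small mem request/limit.
payload = np.ones((1024, 1024, 1024), dtype=np.float64)
print(payload.sum())
```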
