Skip to content

[ws-manager] Add missing check to fix OOM error #8372

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 23, 2022
Merged

[ws-manager] Add missing check to fix OOM error #8372

merged 2 commits into from
Feb 23, 2022

Conversation

princerachit
Copy link
Contributor

@princerachit princerachit commented Feb 22, 2022

Description

We were bailing out on errors when we can retry or safely ignore the error:
1. Pod deletion error - When a pod is already deleted, any attempt to delete it would result in an error. If the error is not found error then it is safe to ignore this error and retry our attempt to create a pod.
2. Finalizer removal error - If a pod gets deleted before we attempt to explicitly remove the finalizer OR the pod object is changed before we attempt to remove the finalizer, then update would fail. In these cases we should check (a) the pod still exists and (b) retry updating it.

Related Issue(s)

🤞🏾 Hopefully Fixes #8238

How to test

We are not able to reproduce this issue in a non prod env so the only way to validate if this fixes the problem is to deploy in prod and monitor.

Release Notes

None

Documentation

@princerachit princerachit requested a review from a team February 22, 2022 06:06
@github-actions github-actions bot added the team: workspace Issue belongs to the Workspace team label Feb 22, 2022
@codecov
Copy link

codecov bot commented Feb 22, 2022

Codecov Report

Merging #8372 (76e1a8e) into main (aa52b3e) will increase coverage by 20.98%.
The diff coverage is 0.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #8372       +/-   ##
===========================================
+ Coverage   12.31%   33.30%   +20.98%     
===========================================
  Files          20       31       +11     
  Lines        1161     4573     +3412     
===========================================
+ Hits          143     1523     +1380     
- Misses       1014     2934     +1920     
- Partials        4      116      +112     
Flag Coverage Δ
components-gitpod-cli-app 11.17% <ø> (ø)
components-local-app-app-darwin-amd64 ?
components-local-app-app-darwin-arm64 ?
components-local-app-app-linux-amd64 ?
components-local-app-app-linux-arm64 ?
components-local-app-app-windows-386 ?
components-local-app-app-windows-amd64 ?
components-local-app-app-windows-arm64 ?
components-ws-manager-app 39.44% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
components/ws-manager/pkg/manager/manager.go 20.46% <0.00%> (ø)
components/local-app/pkg/auth/auth.go
components/local-app/pkg/auth/pkce.go
components/ws-manager/pkg/manager/monitor.go 9.46% <0.00%> (ø)
components/ws-manager/pkg/manager/manager_ee.go 0.00% <0.00%> (ø)
components/ws-manager/pkg/manager/metrics.go 11.11% <0.00%> (ø)
components/ws-manager/pkg/manager/status.go 74.58% <0.00%> (ø)
components/ws-manager/pkg/manager/annotations.go 65.11% <0.00%> (ø)
...s/ws-manager/pkg/manager/internal/grpcpool/pool.go 71.83% <0.00%> (ø)
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aa52b3e...76e1a8e. Read the comment docs.

@princerachit
Copy link
Contributor Author

Marked this as draft because I saw other errors not related to pod deletion e.g. finalizer removal

@sagor999
Copy link
Contributor

When the pod does not exist and we try to delete the pod, we still get an error.

hm. that was kinda intentional. since if we fail to delete, then we not going to be able to create new one.
did you see this error in logs?

also I would assume it would fail with delete error, and not out of memory error in this case.

@princerachit
Copy link
Contributor Author

princerachit commented Feb 22, 2022

When the pod does not exist and we try to delete the pod, we still get an error.

hm. that was kinda intentional. since if we fail to delete, then we not going to be able to create new one. did you see this error in logs?

If the pod does not exist then we should not be erroring out. Unsure why that would be intentional. You will always fail to delete what is already deleted, right?

also I would assume it would fail with delete error, and not out of memory error in this case.

In the logs I can see that it says deletion failed and returns the pod ran to completion error. However, in the DB I still see the OOM error. I am not very sure where is that being written from.

@princerachit princerachit marked this pull request as ready for review February 22, 2022 07:30
@sagor999
Copy link
Contributor

If the pod does not exist then we should not be erroring out. Unsure why that would be intentional. You will always fail to delete what is already deleted, right?

my concern here is why it was deleted already. or which service deleted it

@princerachit
Copy link
Contributor Author

my concern here is why it was deleted already. or which service deleted it

That is a valid question, and I don't have answer to that.

@roboquat roboquat merged commit c126e60 into main Feb 23, 2022
@roboquat roboquat deleted the prs/fix-oom branch February 23, 2022 16:01
@roboquat roboquat added deployed: workspace Workspace team change is running in production deployed Change is completely running in production labels Feb 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deployed: workspace Workspace team change is running in production deployed Change is completely running in production release-note-none size/M team: workspace Issue belongs to the Workspace team
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Oh, no! Something went wrong! OutOfmemory: Pod Node didn't have enough resource: memory, requested: 2576980378, used: 65006469130, capacity: 66388008960
4 participants