
Cordon build nodes if their disk is more than 80% full #10116

Merged
merged 2 commits into from
May 19, 2022

@mads-hartmann mads-hartmann commented May 19, 2022

Description

This changes the platform-trigger-werft-cleanup job. Previously it triggered a separate Werft job, scheduled on each build node, that performed a cleanup on that node. It now SSHes into each instance and simply cordons it if the disk is more than 80% full.
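The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `df --output=pcent` invocation, the helper names (`parse_disk_used_pct`, `should_cordon`, `check_node`), the `THRESHOLD_PCT` variable, and the plain `ssh` call are all assumptions. Only the `tail -n1 | tr -d '[:space:]'` parsing style and the 80% threshold come from the PR itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed default; the PR cordons nodes whose disk is more than 80% full.
THRESHOLD_PCT="${THRESHOLD_PCT:-80}"

# Reads `df --output=pcent <mount>` output on stdin and prints the bare
# number from the last line, e.g. " 83%" becomes "83".
parse_disk_used_pct() {
  tail -n1 | tr -d '[:space:]%'
}

# Succeeds (exit 0) only when usage is strictly above the threshold.
should_cordon() {
  local used_pct="$1" threshold="$2"
  [ "$used_pct" -gt "$threshold" ]
}

# Hypothetical wiring: how the node is reached over SSH and which mount
# point is inspected are assumptions, not details taken from the PR.
check_node() {
  local node="$1" mount_point="$2"
  local used_pct
  used_pct=$(ssh "$node" df --output=pcent "$mount_point" | parse_disk_used_pct)
  if should_cordon "$used_pct" "$THRESHOLD_PCT"; then
    kubectl cordon "$node"
  else
    echo "Skipping $node: disk is ${used_pct}% full"
  fi
}
```

Keeping the parsing and the comparison in separate functions means both can be exercised locally without SSH access or a cluster; only `check_node` touches real infrastructure.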

This approach has two benefits:

  1. By cordoning the nodes rather than trying to perform cleanup, we simplify the implementation and make it more robust against new things filling up the disk in the future.
  2. The previous cleanup job used docker system prune, which had the downside of potentially breaking currently running builds. Because of this we couldn't run the job too often. Now we can run the job as often as we'd like. I have changed it to run every 4 hours.

The relevant changes to the service account have been made in https://github.com/gitpod-io/ops/pull/2375

Related Issue(s)

Fixes https://github.com/gitpod-io/ops/issues/2050
Fixes https://github.com/gitpod-io/ops/issues/1227

How to test

See the comments in the code on how to run this. Here are two examples:

  • Here's a job which cordoned a node (I ran it with the threshold set to 30) (link). I manually uncordoned the node again after.
  • Here's a job where I had set the threshold to 80 so it skipped both nodes (link).

Release Notes

NONE

Documentation

N/A

| tail -n1 \
| tr -d '[:space:]'
)
echo "The disk is %${disk_used_pct} full" | werft log slice "$slice_id"

nit: put the "%" after the number to make the output more readable :)
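One way the suggested output format could look, as a small sketch. The `format_disk_usage` helper is hypothetical and only illustrates placing the percent sign after the number; in the PR the message is piped to `werft log slice`.

```shell
# Hypothetical helper: prints the usage message with "%" after the number.
# In a printf format string, "%%" emits a literal percent sign.
format_disk_usage() {
  local disk_used_pct="$1"
  printf 'The disk is %s%% full\n' "$disk_used_pct"
}

format_disk_usage 83
```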

@meysholdt meysholdt left a comment

🎉 for simplicity and robustness.

Nit:
The new code only checks if the localSSD is full and ignores the root volume.
Not sure if the root volume sometimes runs full, too.
Either way, I think this is out of scope for this PR and just something to keep in mind.
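If the root volume ever needs to be covered as well, one possible shape for the check is sketched below. This is not part of the PR: the `any_over_threshold` helper is hypothetical, and the caller would have to supply the usage percentage of each volume it cares about.

```shell
# Hypothetical helper: succeeds if any of the given usage percentages is
# strictly above the threshold. Usage: any_over_threshold <threshold> <pct>...
any_over_threshold() {
  local threshold="$1"
  shift
  local pct
  for pct in "$@"; do
    if [ "$pct" -gt "$threshold" ]; then
      return 0
    fi
  done
  return 1
}

# Example with made-up numbers: first volume at 42%, second at 85%.
if any_over_threshold 80 42 85; then
  echo "at least one volume is over the threshold"
fi
```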

@roboquat roboquat merged commit 24f2e29 into main May 19, 2022
@roboquat roboquat deleted the mads/cordon-build-nodes-when-full branch May 19, 2022 08:37

mads-hartmann commented May 19, 2022

@meysholdt Thanks for the review! I haven't seen the root disk run full yet, but let's keep an eye out for it ☺️
