Cordon build nodes if their disk is more than 80% full #10116

mads-hartmann · 2022-05-19T08:01:35Z

Description

This changes the platform-trigger-werft-cleanup job. Previously it triggered a separate Werft job, scheduled on each build node, that performed a cleanup on that node. Now it SSHes into each build node and simply cordons it if the disk is more than 80% full.
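For readers who want the shape of the check, here is a minimal sketch, not the script from this PR. The function names, the plain `ssh` invocation, the mount point argument, and the use of `df --output=pcent` (GNU coreutils) are all assumptions made for illustration; the real script may reach the nodes and parse the output differently.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, the ssh call, and the mount point are assumptions,
# not taken from .werft/platform-trigger-werft-cleanup.sh.

THRESHOLD_PCT="${THRESHOLD_PCT:-80}"

# Reads `df --output=pcent <path>` output on stdin and prints the used
# percentage as bare digits (the last line, with whitespace and '%' removed).
parse_disk_used_pct() {
    tail -n1 | tr -d '[:space:]%'
}

# Succeeds only when usage is strictly above the threshold.
is_over_threshold() {
    local used_pct="$1" threshold_pct="$2"
    [ "$used_pct" -gt "$threshold_pct" ]
}

# Hypothetical driver: check one node and cordon it if its disk is too full.
cordon_if_full() {
    local node="$1" mount_point="$2" used_pct
    used_pct=$(ssh "$node" df --output=pcent "$mount_point" | parse_disk_used_pct)
    if is_over_threshold "$used_pct" "$THRESHOLD_PCT"; then
        echo "Cordoning $node: disk is ${used_pct}% full"
        kubectl cordon "$node"
    else
        echo "Skipping $node: disk is ${used_pct}% full"
    fi
}
```

Because the job only cordons and never deletes anything, a cordoned node has to be uncordoned by hand (`kubectl uncordon <node>`) once it has been cleaned up or replaced.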

This approach has two benefits:

By cordoning the nodes rather than trying to perform cleanup, we simplify the implementation and make it more robust against new things filling up the disk in the future.
The previous cleanup job used docker system prune, which had the downside of potentially breaking currently running builds. Because of this we couldn't run the job too often. Now we can run the job as often as we'd like. I have changed it to run every 4 hours.

The relevant changes to the service account have been made here: https://github.com/gitpod-io/ops/pull/2375

Related Issue(s)

Fixes https://github.com/gitpod-io/ops/issues/2050
Fixes https://github.com/gitpod-io/ops/issues/1227

How to test

See the comments in the code on how to run this. Here are two examples:

Here's a job which cordoned a node (I ran it with the threshold set to 30) (link). I manually uncordoned the node again after.
Here's a job where I had set the threshold to 80 so it skipped both nodes (link).

Release Notes

NONE

Documentation

N/A

meysholdt · 2022-05-19T08:34:07Z

.werft/platform-trigger-werft-cleanup.sh

+        | tail -n1 \
+        | tr -d '[:space:]'
+    )
+    echo "The disk is %${disk_used_pct} full" | werft log slice "$slice_id"


nit: put the "%" after the number to make the output more readable :)
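A sketch of what the suggested message format could look like. The helper name `format_disk_usage` is hypothetical; the script in this PR inlines the `echo` and pipes it to `werft log slice`.

```shell
# Hypothetical helper: formats the log line with the percent sign after the number.
format_disk_usage() {
    local disk_used_pct="$1"
    echo "The disk is ${disk_used_pct}% full"
}
```

In the script this would be used as `format_disk_usage "$disk_used_pct" | werft log slice "$slice_id"`.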

meysholdt

🎉 for simplicity and robustness.

Nit:
The new code only checks if the localSSD is full and ignores the root volume.
Not sure if the root volume sometimes runs full, too.
Either way, I think this is out of scope for this PR and just something to keep in mind.

mads-hartmann · 2022-05-19T08:39:08Z

@meysholdt Thanks for the review! I haven't seen the root disk run full yet, but let's keep an eye out for it ☺️

mads-hartmann added 2 commits May 19, 2022 07:47

Cordon build nodes if their disk is more than 80% full

c07617b

Run job every 4 hours

07db6f9

mads-hartmann requested a review from a team May 19, 2022 08:01

roboquat added release-note-none size/L labels May 19, 2022

github-actions bot added team: devx and removed size/L labels May 19, 2022

roboquat added the size/L label May 19, 2022

meysholdt reviewed May 19, 2022

meysholdt approved these changes May 19, 2022

roboquat merged commit 24f2e29 into main May 19, 2022

roboquat deleted the mads/cordon-build-nodes-when-full branch May 19, 2022 08:37

mads-hartmann mentioned this pull request May 19, 2022

Move % to after the number #10119

Merged
