Description
The comment on CleanUpOldVMs (and CleanUpOldPodsLoop) includes:
This is the safety mechanism to delete VMs which stray from the
normal deleting process. VMs are created to run a single build and
should be shut down by a controlling process. Due to various types
of failures, they might get stranded. To prevent them from getting
stranded and wasting resources forever, we instead set the
"delete-at" metadata attribute on them when created to some time
that's well beyond their expected lifetime.
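For concreteness, here's a rough sketch of that delete-at safety net. The types and helpers below are placeholders for illustration, not the real GCE or x/build API:

```go
// Simplified sketch of the "delete-at" mechanism described above.
// VM, newVMMetadata, and cleanUpOldVMs are illustrative stand-ins,
// not the coordinator's actual code.
package main

import (
	"fmt"
	"strconv"
	"time"
)

type VM struct {
	Name     string
	Metadata map[string]string // e.g. "delete-at" -> unix seconds
}

// newVMMetadata stamps a new VM with a delete-at time that is meant
// to be well beyond the expected build duration (hypothetically 2h).
func newVMMetadata(timeout time.Duration) map[string]string {
	return map[string]string{
		"delete-at": strconv.FormatInt(time.Now().Add(timeout).Unix(), 10),
	}
}

// cleanUpOldVMs is the safety mechanism: any VM whose delete-at time
// has passed is assumed stranded and gets deleted.
func cleanUpOldVMs(vms []VM, now time.Time) {
	for _, vm := range vms {
		v, ok := vm.Metadata["delete-at"]
		if !ok {
			continue
		}
		sec, err := strconv.ParseInt(v, 10, 64)
		if err != nil {
			continue
		}
		if now.After(time.Unix(sec, 0)) {
			fmt.Printf("deleting stranded VM %s\n", vm.Name)
			// deleteVM(vm) would go here.
		}
	}
}

func main() {
	vms := []VM{
		{Name: "buildlet-linux-1", Metadata: newVMMetadata(2 * time.Hour)},
		{Name: "buildlet-stale", Metadata: map[string]string{"delete-at": "0"}},
	}
	cleanUpOldVMs(vms, time.Now())
}
```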
This mechanism requires maintaining a timeout for builds, one that's always "well beyond their expected lifetime". If that stops being true, then (depending also on the state of #42699) resources may be wasted due to multiple retries, as happened in #49666 and #52591 in 2021-2022.
Since the coordinator knows about all the builds it started, and already deletes builds it doesn't know about (e.g., ones left over from a previous coordinator instance), I don't think a timer is actually needed for that. However, it might still be useful for handling stalls or other unexpected reasons a build keeps running beyond a "reasonable" timeframe, so maybe we'll always need to maintain such a timeout.
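To make that alternative concrete, here's a minimal sketch of cleanup keyed on the set of builds the coordinator knows it started, rather than on a per-VM timer. The data structures and names are made up, not the coordinator's real ones:

```go
// Sketch only: the coordinator's actual bookkeeping differs.
package main

import "fmt"

// cleanUpUnknownVMs deletes any running VM that this coordinator
// instance did not start (e.g. leftovers from a previous instance).
// known is the set of VM names for builds the coordinator started.
func cleanUpUnknownVMs(running []string, known map[string]bool) {
	for _, name := range running {
		if !known[name] {
			fmt.Printf("deleting unknown VM %s\n", name)
			// deleteVM(name) would go here.
		}
	}
}

func main() {
	known := map[string]bool{"buildlet-linux-1": true}
	cleanUpUnknownVMs([]string{"buildlet-linux-1", "buildlet-old-7"}, known)
	// Note: a VM the coordinator knows about but whose build has
	// stalled is not caught by this; that's the case a timeout
	// would still cover.
}
```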
One of the things we can do in either case is add better metrics/monitoring, so we find out when normal builds start to get dangerously close to the limit before it starts to cause problems.
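For example, something along these lines could work. The metric name, buckets, and the Prometheus dependency are purely illustrative; the coordinator's actual metrics plumbing may look quite different:

```go
// Illustrative only: metric and label names are made up, and the
// coordinator may not use Prometheus at all.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var buildDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "coordinator_build_duration_seconds",
		Help: "Wall-clock duration of builds; compare against the global VM/pod timeout.",
		// Buckets chosen so the last few sit near a hypothetical 2h limit.
		Buckets: []float64{600, 1800, 2700, 3600, 5400, 6300, 7200},
	},
	[]string{"builder"},
)

func init() {
	prometheus.MustRegister(buildDuration)
}

// recordBuild would be called when a build finishes (or is killed),
// so an alert can fire when durations approach the timeout.
func recordBuild(builder string, start time.Time) {
	buildDuration.WithLabelValues(builder).Observe(time.Since(start).Seconds())
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```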
CL 406216 increased the global timeout for builds from 45 mins to 2 hours to accommodate longtest builders, and this is the tracking issue to figure out what we want to do in this space long term. (Possibly simply bump it up from 2 hours if some builds need even longer in the future.)
CC @golang/release.
Activity
gopherbot commented on May 16, 2022
Change https://go.dev/cl/406216 mentions this issue:
cmd/coordinator: consolidate and increase global VM deletion timeout
bcmills commented on May 16, 2022
FWIW, with the current builder triage process I'm using there is a natural limit on builder time, which is the interval between a CL being submitted and its triage being performed.
I've been using the day boundary as the triage cutoff, and I think the timestamps that fetchlogs uses are in UTC. I'm in UTC-4 and I don't start triage until at least 9AM local time, so that gives a natural limit of ~13h (less scheduling latency) before the tests for a CL committed just before midnight would start to overrun the triage window.
The issue title was changed from "x/build/cmd/coordinator: decide what to do about a global build time limit" to "x/build/cmd/coordinator: decide what to do about the global VM/pod timeout".