
x/build/cmd/coordinator: decide what to do about the global VM/pod timeout #52929

Open
@dmitshur

Description

The comment of CleanUpOldVMs (and CleanUpOldPodsLoop) includes:

> This is the safety mechanism to delete VMs which stray from the
> normal deleting process. VMs are created to run a single build and
> should be shut down by a controlling process. Due to various types
> of failures, they might get stranded. To prevent them from getting
> stranded and wasting resources forever, we instead set the
> "delete-at" metadata attribute on them when created to some time
> that's well beyond their expected lifetime.

This mechanism requires maintaining a timeout for builds, one that's always "well beyond their expected lifetime". If that stops being true then, depending on the state of #42699, resources may be wasted due to multiple retries (as happened in #49666 and #52591 in 2021-2022).

Since coordinator knows about all the builds it started, and already deletes builds that it doesn't know about (e.g., because they're left over from a previous instance of coordinator), I don't think a timer is actually needed to catch stranded VMs. However, it might still be useful to handle stalls or other unexpected reasons why a build keeps going beyond a "reasonable" timeframe. So maybe we'll always need to maintain such a timeout.
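The timer-free alternative amounts to a set difference between what exists and what coordinator knows it started. A rough sketch, with hypothetical names (`strayInstances` and the instance names are illustrative, not taken from x/build):

```go
package main

import (
	"fmt"
	"sort"
)

// strayInstances returns, in sorted order, the names in existing that
// are not marked as known. Hypothetical helper for illustration only.
func strayInstances(existing []string, known map[string]bool) []string {
	var stray []string
	for _, name := range existing {
		if !known[name] {
			stray = append(stray, name)
		}
	}
	sort.Strings(stray)
	return stray
}

func main() {
	// Instances this coordinator process started itself.
	known := map[string]bool{"buildlet-a": true}
	// Instances that currently exist, e.g. from listing the project.
	existing := []string{"buildlet-b", "buildlet-a"}

	// Anything not known is a candidate for deletion, no timer involved.
	fmt.Println(strayInstances(existing, known))
}
```

This only covers instances that coordinator has lost track of. A known build that has stalled would never show up here, which is the case a timeout still handles.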

One of the things we can do in either case is add better metrics/monitoring, so we find out when normal builds start to get dangerously close to the limit before it starts to cause problems.

CL 406216 increased the global timeout for builds from 45 mins to 2 hours to accommodate longtest builders, and this is the tracking issue to figure out what we want to do in this space long term. (Possibly simply bump it up from 2 hours if some builds need even longer in the future.)

CC @golang/release.

Activity

added the labels Builders (x/build issues: builders, bots, dashboards) and NeedsInvestigation (someone must examine and confirm this is a valid issue and not a duplicate of an existing one) on May 16, 2022
added this to the Unreleased milestone on May 16, 2022
gopherbot commented on May 16, 2022

Change https://go.dev/cl/406216 mentions this issue: cmd/coordinator: consolidate and increase global VM deletion timeout

bcmills commented on May 16, 2022

FWIW, with the current builder triage process I'm using there is a natural limit on builder time, which is the interval between a CL being submitted and its triage being performed.

I've been using the day boundary as the triage cutoff, and I think the timestamps that fetchlogs uses are in UTC. I'm in UTC-4 and I don't start triage until at least 9AM local time, so that gives a natural limit of ~13h (less scheduling latency) before the tests for a CL committed just before midnight would start to overrun the triage window.

changed the title from "x/build/cmd/coordinator: decide what to do about a global build time limit" to "x/build/cmd/coordinator: decide what to do about the global VM/pod timeout" on May 18, 2022