Description
The comment on CleanUpOldVMs (and CleanUpOldPodsLoop) includes:
This is the safety mechanism to delete VMs which stray from the
normal deleting process. VMs are created to run a single build and
should be shut down by a controlling process. Due to various types
of failures, they might get stranded. To prevent them from getting
stranded and wasting resources forever, we instead set the
"delete-at" metadata attribute on them when created to some time
that's well beyond their expected lifetime.
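For concreteness, here's a rough sketch of that delete-at safety net. The types and helpers below are placeholders for illustration, not the real GCE or x/build API:

```go
// Simplified sketch of the "delete-at" mechanism described above.
// VM, newVMMetadata, and cleanUpOldVMs are illustrative stand-ins,
// not the coordinator's actual code.
package main

import (
	"fmt"
	"strconv"
	"time"
)

type VM struct {
	Name     string
	Metadata map[string]string // e.g. "delete-at" -> unix seconds
}

// newVMMetadata stamps a new VM with a delete-at time that is meant
// to be well beyond the expected build duration (hypothetically 2h).
func newVMMetadata(timeout time.Duration) map[string]string {
	return map[string]string{
		"delete-at": strconv.FormatInt(time.Now().Add(timeout).Unix(), 10),
	}
}

// cleanUpOldVMs is the safety mechanism: any VM whose delete-at time
// has passed is assumed stranded and gets deleted.
func cleanUpOldVMs(vms []VM, now time.Time) {
	for _, vm := range vms {
		v, ok := vm.Metadata["delete-at"]
		if !ok {
			continue
		}
		sec, err := strconv.ParseInt(v, 10, 64)
		if err != nil {
			continue
		}
		if now.After(time.Unix(sec, 0)) {
			fmt.Printf("deleting stranded VM %s\n", vm.Name)
			// deleteVM(vm) would go here.
		}
	}
}

func main() {
	vms := []VM{
		{Name: "buildlet-linux-1", Metadata: newVMMetadata(2 * time.Hour)},
		{Name: "buildlet-stale", Metadata: map[string]string{"delete-at": "0"}},
	}
	cleanUpOldVMs(vms, time.Now())
}
```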
This mechanism requires maintaining a timeout for builds, one that's always "well beyond their expected lifetime". If that stops being true, then (depending also on the state of #42699) resources may be wasted due to multiple retries, as happened in #49666 and #52591 in 2021-2022.
Since the coordinator knows about all the builds it started, and already deletes builds it doesn't know about (e.g., ones left over from a previous coordinator instance), I don't think a timer is actually needed for that. However, it might still be useful for handling stalls or other unexpected reasons a build keeps running beyond a "reasonable" timeframe, so maybe we'll always need to maintain such a timeout.
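To make that alternative concrete, here's a minimal sketch of cleanup keyed on the set of builds the coordinator knows it started, rather than on a per-VM timer. The data structures and names are made up, not the coordinator's real ones:

```go
// Sketch only: the coordinator's actual bookkeeping differs.
package main

import "fmt"

// cleanUpUnknownVMs deletes any running VM that this coordinator
// instance did not start (e.g. leftovers from a previous instance).
// known is the set of VM names for builds the coordinator started.
func cleanUpUnknownVMs(running []string, known map[string]bool) {
	for _, name := range running {
		if !known[name] {
			fmt.Printf("deleting unknown VM %s\n", name)
			// deleteVM(name) would go here.
		}
	}
}

func main() {
	known := map[string]bool{"buildlet-linux-1": true}
	cleanUpUnknownVMs([]string{"buildlet-linux-1", "buildlet-old-7"}, known)
	// Note: a VM the coordinator knows about but whose build has
	// stalled is not caught by this; that's the case a timeout
	// would still cover.
}
```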
One of the things we can do in either case is add better metrics/monitoring, so we find out when normal builds start to get dangerously close to the limit before it starts to cause problems.
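For example, something along these lines could work. The metric name, buckets, and the Prometheus dependency are purely illustrative; the coordinator's actual metrics plumbing may look quite different:

```go
// Illustrative only: metric and label names are made up, and the
// coordinator may not use Prometheus at all.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var buildDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "coordinator_build_duration_seconds",
		Help: "Wall-clock duration of builds; compare against the global VM/pod timeout.",
		// Buckets chosen so the last few sit near a hypothetical 2h limit.
		Buckets: []float64{600, 1800, 2700, 3600, 5400, 6300, 7200},
	},
	[]string{"builder"},
)

func init() {
	prometheus.MustRegister(buildDuration)
}

// recordBuild would be called when a build finishes (or is killed),
// so an alert can fire when durations approach the timeout.
func recordBuild(builder string, start time.Time) {
	buildDuration.WithLabelValues(builder).Observe(time.Since(start).Seconds())
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```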
CL 406216 increased the global timeout for builds from 45 mins to 2 hours to accommodate longtest builders, and this is the tracking issue to figure out what we want to do in this space long term. (Possibly simply bump it up from 2 hours if some builds need even longer in the future.)
CC @golang/release.
Activity
gopherbot commented on May 16, 2022
Change https://go.dev/cl/406216 mentions this issue:
cmd/coordinator: consolidate and increase global VM deletion timeout
bcmills commented on May 16, 2022
FWIW, with the current builder triage process I'm using there is a natural limit on builder time, which is the interval between a CL being submitted and its triage being performed.
I've been using the day boundary as the triage cutoff, and I think the timestamps that fetchlogs uses are in UTC. I'm in UTC-4 and I don't start triage until at least 9AM local time, so that gives a natural limit of ~13h (less scheduling latency) before the tests for a CL committed just before midnight would start to overrun the triage window.
The issue title was changed from "x/build/cmd/coordinator: decide what to do about a global build time limit" to "x/build/cmd/coordinator: decide what to do about the global VM/pod timeout".