Poll GCE long-running operations to completion #17511

@ouillie

Description

/kind bug

See this Slack thread for context.

The problem is that kOps sometimes does not poll long-running operations to completion. It definitely fails to do so here, and probably fails to do so elsewhere as well. GCP APIs that return long-running operations often look successful at first and only surface errors later when polled, which is what happened here.

For instance, the operation returned by Insert on that line was:

```
{    <nil>  0 244554770875779894 2025-07-22T18:07:37.440-07:00 <nil> compute#operation operation-1753232857301-63a8e55ab031b-a4907b35-77a94a7e  insert 0  https://www.googleapis.com/compute/v1/projects/vimana-dev/zones/us-west1-a/operations/operation-1753232857301-63a8e55ab031b-a4907b35-77a94a7e <nil> 2025-07-22T18:07:37.449-07:00 RUNNING  140040117571914550 https://www.googleapis.com/compute/v1/projects/vimana-dev/zones/us-west1-a/disks/a-etcd-main-demo-vimana-host [email protected] [] https://www.googleapis.com/compute/v1/projects/vimana-dev/zones/us-west1-a {200 map[Alt-Svc:[h3=":443"; ma=2592000,h3-29=":443"; ma=2592000] Content-Type:[application/json; charset=UTF-8] Date:[Wed, 23 Jul 2025 01:07:37 GMT] Server:[ESF] Vary:[Origin X-Origin Referer] X-Content-Type-Options:[nosniff] X-Frame-Options:[SAMEORIGIN] X-Xss-Protection:[0]]} [] []}
```

Notice that the Status is RUNNING and the Progress is 0, so kOps doesn't actually know yet whether the operation will ultimately succeed; it just assumes that it will.

1. What kops version are you running? The command kops version will display
this information.

1.32.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

3. What cloud provider are you using?

GCE

4. What commands did you run? What is the simplest way to reproduce this issue?

I ran into this issue because I tried to provision a cluster in a GCP zone that had run out of disks. There are probably other conditions that could trigger the problem, but my actual trigger is hard to reproduce.

5. What happened after the commands executed?

kOps finished and seemed to think everything was fine, but then the etcd-manager containers on the control plane node were never able to find the disks they were supposed to use.

6. What did you expect to happen?

The disks would exist.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?
