Poll GCE long-running operations to completion #17511

@ouillie

Description

/kind bug

See this Slack thread for context.

The problem is that kOps sometimes does not poll long-running operations to completion. It definitely fails to do so here, and probably fails to do so elsewhere as well. GCP APIs that return long-running operations often look successful at first and only surface errors later when polled, which is what happened here.

For instance, the operation returned by Insert on that line was:

```
{    <nil>  0 244554770875779894 2025-07-22T18:07:37.440-07:00 <nil> compute#operation operation-1753232857301-63a8e55ab031b-a4907b35-77a94a7e  insert 0  https://www.googleapis.com/compute/v1/projects/vimana-dev/zones/us-west1-a/operations/operation-1753232857301-63a8e55ab031b-a4907b35-77a94a7e <nil> 2025-07-22T18:07:37.449-07:00 RUNNING  140040117571914550 https://www.googleapis.com/compute/v1/projects/vimana-dev/zones/us-west1-a/disks/a-etcd-main-demo-vimana-host [email protected] [] https://www.googleapis.com/compute/v1/projects/vimana-dev/zones/us-west1-a {200 map[Alt-Svc:[h3=":443"; ma=2592000,h3-29=":443"; ma=2592000] Content-Type:[application/json; charset=UTF-8] Date:[Wed, 23 Jul 2025 01:07:37 GMT] Server:[ESF] Vary:[Origin X-Origin Referer] X-Content-Type-Options:[nosniff] X-Frame-Options:[SAMEORIGIN] X-Xss-Protection:[0]]} [] []}
```

Notice that the Status is RUNNING and the Progress is 0, so kOps doesn't actually know yet whether the operation will ultimately succeed; it just assumes that it will.

1. What kops version are you running? The command kops version will display
this information.

1.32.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

3. What cloud provider are you using?

GCE

4. What commands did you run? What is the simplest way to reproduce this issue?

I ran into this issue because I tried to provision a cluster in a GCP zone that had run out of disks. There are probably other conditions that could trigger the problem, but my actual trigger is hard to reproduce.

5. What happened after the commands executed?

kOps finished and seemed to think everything was fine, but then the etcd-manager containers on the control plane node were never able to find the disks they were supposed to use.

6. What did you expect to happen?

The disks would exist.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?
