Description
/kind bug
See this Slack thread for context.
The problem is that kOps sometimes does not poll long-running operations to completion. It definitely fails to do so here, and probably fails to do so elsewhere as well. GCP APIs that return long-running operations often look successful at first, and only surface errors later on when polled, which was happening here.
For instance, the operation returned by Insert on that line was:

{ <nil> 0 244554770875779894 2025-07-22T18:07:37.440-07:00 <nil> compute#operation operation-1753232857301-63a8e55ab031b-a4907b35-77a94a7e insert 0 https://www.googleapis.com/compute/v1/projects/vimana-dev/zones/us-west1-a/operations/operation-1753232857301-63a8e55ab031b-a4907b35-77a94a7e <nil> 2025-07-22T18:07:37.449-07:00 RUNNING 140040117571914550 https://www.googleapis.com/compute/v1/projects/vimana-dev/zones/us-west1-a/disks/a-etcd-main-demo-vimana-host [email protected] [] https://www.googleapis.com/compute/v1/projects/vimana-dev/zones/us-west1-a {200 map[Alt-Svc:[h3=":443"; ma=2592000,h3-29=":443"; ma=2592000] Content-Type:[application/json; charset=UTF-8] Date:[Wed, 23 Jul 2025 01:07:37 GMT] Server:[ESF] Vary:[Origin X-Origin Referer] X-Content-Type-Options:[nosniff] X-Frame-Options:[SAMEORIGIN] X-Xss-Protection:[0]]} [] []}
Notice how the Status is RUNNING and the Progress is 0, so kOps really doesn't know yet whether the operation is ultimately successful, but just assumes that it will be.
1. What kops version are you running? The command kops version will display this information.
1.32.1
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
3. What cloud provider are you using?
GCE
4. What commands did you run? What is the simplest way to reproduce this issue?
I ran into this issue because I tried to provision a cluster in a GCP zone that had run out of disks. There's probably some other condition(s) that could trigger the problem, but my actual trigger is hard to reproduce.
5. What happened after the commands executed?
kOps finished and seemed to think everything was fine, but then the etcd-manager
containers on the control plane node were never able to find the disks they were supposed to use.
6. What did you expect to happen?
The disks would exist.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else we need to know?