[cluster-autoscaler][AWS] Repeated deletion every 10s of same node takes AWS ASG down to min size killing multiple running pods #4095

@f-ld

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
1.14.1 (but investigations show it could be the same with newer versions)

What environment is this in?:
Running on a kops-managed cluster on AWS

What did you expect to happen?:
I would not expect all nodes of a node group to be terminated while pods are running or waiting to run on them.

What happened instead?:
See logs below.
Some of our nodes take a long time to join the k8s cluster: AWS creates the instances (it seems) but they do not register. After 15 minutes the cluster autoscaler tries to remove the "unregistered node", which usually works well.
But from time to time we see something like what is shown in the attached screenshot:

  • every 10 seconds the autoscaler attempts to remove the same node
  • because of this piece of code, the group size is decreased on every attempt, down to the min (0), at which point it complains that it cannot go below the min
  • then the group notices it needs more nodes (because more pods are created), so it starts counting up from 0 again.
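
To make the failure mode above concrete, here is a minimal toy model in Go. It is not the cluster-autoscaler code: the `asg` type, `terminateAndDecrement`, and `simulate` are hypothetical names for illustration. It assumes that every removal attempt for the same instance decrements the group's desired capacity, and that the only guard is the min-size check. The starting capacity of 6 is illustrative only.

```go
package main

import (
	"errors"
	"fmt"
)

// asg is a toy stand-in for an Auto Scaling group (hypothetical type,
// not the one used by cluster-autoscaler).
type asg struct {
	minSize int
	desired int
}

var errMinSize = errors.New("node group min size reached")

// terminateAndDecrement models a removal request that also decrements the
// desired capacity. It does not check whether instanceID is already being
// terminated, which is the assumption behind the behavior described above.
func (g *asg) terminateAndDecrement(instanceID string) error {
	if g.desired <= g.minSize {
		return errMinSize
	}
	g.desired--
	return nil
}

// simulate repeats the removal of the same instance, as a loop that runs
// every 10 seconds would, and returns how many attempts succeeded.
func simulate(g *asg, instanceID string, attempts int) (int, error) {
	succeeded := 0
	for i := 0; i < attempts; i++ {
		if err := g.terminateAndDecrement(instanceID); err != nil {
			return succeeded, err
		}
		succeeded++
	}
	return succeeded, nil
}

func main() {
	g := &asg{minSize: 0, desired: 6}
	n, err := simulate(g, "i-0d9e512ddf8a7321b", 10)
	// Prints: successful=6 desired=0 err=node group min size reached
	fmt.Printf("successful=%d desired=%d err=%v\n", n, g.desired, err)
}
```

In this model a single stuck instance is enough to drain the whole group, because each repeated request removes one unit of capacity even though only one instance is actually being terminated.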

On AWS side, in CloudTrail we can see for example:

  • SetDesiredCapacity to 6
  • Multiple calls to TerminateInstanceInAutoScalingGroup for instance "i-0d9e512ddf8a7321b"
  • SetDesiredCapacity to 2

How to reproduce it (as minimally and precisely as possible):
This has happened twice in several months (or we only noticed it twice).
No idea how to reproduce it. Understanding how the code involved can fail without returning an error would probably be a good clue.

Anything else we need to know?:
We are in the process of upgrading our k8s cluster versions, but I am opening this ticket because, from what I can see, there have not been big changes around this code.

Logs
Here are the logs, filtered on the ID of the instance that was removed multiple times. The log line at W0520 06:21:48.687757 shows the problem: we reached min size (0) because of the repeated deletion.

Note: I think the AccessDenied errors that appear after some time are due to the node no longer existing (so AWS returns 403 instead of 404). We were not changing anything on the AWS permissions side at that time that could explain them.

I0520 06:20:35.811497       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:20:35.995040       1 auto_scaling_groups.go:254] Terminating EC2 instance: i-0d9e512ddf8a7321b
I0520 06:20:35.995161       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"80f3f121-b1ae-11eb-962e-0e8021dc4e39", APIVersion:"v1", ResourceVersion:"626201893", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:20:46.214697       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:20:46.360828       1 auto_scaling_groups.go:254] Terminating EC2 instance: i-0d9e512ddf8a7321b
I0520 06:20:46.360991       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"80f3f121-b1ae-11eb-962e-0e8021dc4e39", APIVersion:"v1", ResourceVersion:"626202044", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:20:56.694748       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:20:56.910024       1 auto_scaling_groups.go:254] Terminating EC2 instance: i-0d9e512ddf8a7321b
I0520 06:20:56.910156       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"80f3f121-b1ae-11eb-962e-0e8021dc4e39", APIVersion:"v1", ResourceVersion:"626202171", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:21:07.127117       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:21:07.266096       1 auto_scaling_groups.go:254] Terminating EC2 instance: i-0d9e512ddf8a7321b
I0520 06:21:07.266237       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"80f3f121-b1ae-11eb-962e-0e8021dc4e39", APIVersion:"v1", ResourceVersion:"626202291", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:21:17.521201       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:21:17.761485       1 auto_scaling_groups.go:254] Terminating EC2 instance: i-0d9e512ddf8a7321b
I0520 06:21:17.761607       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"80f3f121-b1ae-11eb-962e-0e8021dc4e39", APIVersion:"v1", ResourceVersion:"626202417", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:21:28.010358       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:21:28.151005       1 auto_scaling_groups.go:254] Terminating EC2 instance: i-0d9e512ddf8a7321b
I0520 06:21:28.151150       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"80f3f121-b1ae-11eb-962e-0e8021dc4e39", APIVersion:"v1", ResourceVersion:"626202546", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:21:38.320254       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:21:38.411622       1 auto_scaling_groups.go:254] Terminating EC2 instance: i-0d9e512ddf8a7321b
I0520 06:21:38.411751       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"80f3f121-b1ae-11eb-962e-0e8021dc4e39", APIVersion:"v1", ResourceVersion:"626202670", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:21:48.687750       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
W0520 06:21:48.687757       1 utils.go:483] Failed to remove node aws:///us-east-1b/i-0d9e512ddf8a7321b: node group min size reached, skipping unregistered node removal
I0520 06:21:59.385188       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:21:59.524269       1 auto_scaling_groups.go:254] Terminating EC2 instance: i-0d9e512ddf8a7321b
I0520 06:21:59.524412       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"80f3f121-b1ae-11eb-962e-0e8021dc4e39", APIVersion:"v1", ResourceVersion:"626202932", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:22:09.732725       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:22:09.852224       1 auto_scaling_groups.go:254] Terminating EC2 instance: i-0d9e512ddf8a7321b
I0520 06:22:09.852425       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"80f3f121-b1ae-11eb-962e-0e8021dc4e39", APIVersion:"v1", ResourceVersion:"626203056", FieldPath:""}): type: 'Normal' reason: 'DeleteUnregistered' Removed unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
I0520 06:22:21.040063       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
W0520 06:22:21.040071       1 utils.go:483] Failed to remove node aws:///us-east-1b/i-0d9e512ddf8a7321b: node group min size reached, skipping unregistered node removal
I0520 06:22:31.359712       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
W0520 06:22:31.428291       1 utils.go:488] Failed to remove node aws:///us-east-1b/i-0d9e512ddf8a7321b: AccessDenied: User: arn:aws:sts::836782323787:assumed-role/masters.voxeet-kops-us-prod-us-east-1.k8s.local/i-038c4f916b52f0fde is not authorized to perform: autoscaling:TerminateInstanceInAutoScalingGroup
I0520 06:22:31.428434       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"80f3f121-b1ae-11eb-962e-0e8021dc4e39", APIVersion:"v1", ResourceVersion:"626203321", FieldPath:""}): type: 'Warning' reason: 'DeleteUnregisteredFailed' Failed to remove node aws:///us-east-1b/i-0d9e512ddf8a7321b: AccessDenied: User: arn:aws:sts::836782323787:assumed-role/masters.voxeet-kops-us-prod-us-east-1.k8s.local/i-038c4f916b52f0fde is not authorized to perform: autoscaling:TerminateInstanceInAutoScalingGroup
I0520 06:22:41.644046       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
W0520 06:22:41.712900       1 utils.go:488] Failed to remove node aws:///us-east-1b/i-0d9e512ddf8a7321b: AccessDenied: User: arn:aws:sts::836782323787:assumed-role/masters.voxeet-kops-us-prod-us-east-1.k8s.local/i-038c4f916b52f0fde is not authorized to perform: autoscaling:TerminateInstanceInAutoScalingGroup
I0520 06:22:41.713048       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"80f3f121-b1ae-11eb-962e-0e8021dc4e39", APIVersion:"v1", ResourceVersion:"626203575", FieldPath:""}): type: 'Warning' reason: 'DeleteUnregisteredFailed' Failed to remove node aws:///us-east-1b/i-0d9e512ddf8a7321b: AccessDenied: User: arn:aws:sts::836782323787:assumed-role/masters.voxeet-kops-us-prod-us-east-1.k8s.local/i-038c4f916b52f0fde is not authorized to perform: autoscaling:TerminateInstanceInAutoScalingGroup
I0520 06:22:51.916874       1 utils.go:467] Removing unregistered node aws:///us-east-1b/i-0d9e512ddf8a7321b
W0520 06:22:52.016132       1 utils.go:488] Failed to remove node aws:///us-east-1b/i-0d9e512ddf8a7321b: AccessDenied: User: arn:aws:sts::836782323787:assumed-role/masters.voxeet-kops-us-prod-us-east-1.k8s.local/i-038c4f916b52f0fde is not authorized to perform: autoscaling:TerminateInstanceInAutoScalingGroup
I0520 06:22:52.016285       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"80f3f121-b1ae-11eb-962e-0e8021dc4e39", APIVersion:"v1", ResourceVersion:"626203764", FieldPath:""}): type: 'Warning' reason: 'DeleteUnregisteredFailed' Failed to remove node aws:///us-east-1b/i-0d9e512ddf8a7321b: AccessDenied: User: arn:aws:sts::836782323787:assumed-role/masters.voxeet-kops-us-prod-us-east-1.k8s.local/i-038c4f916b52f0fde is not authorized to perform: autoscaling:TerminateInstanceInAutoScalingGroup
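
For anyone who wants to check their own cluster-autoscaler logs for the same pattern, here is a small sketch in Go that counts termination requests per instance ID. It assumes the log message format shown above ("Terminating EC2 instance: <id>"); `countTerminations` is a hypothetical helper, and the sample lines are taken from the logs in this issue.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// terminateRe matches the message logged before each termination request,
// assuming the format seen in the logs above.
var terminateRe = regexp.MustCompile(`Terminating EC2 instance: (i-[0-9a-f]+)`)

// countTerminations returns, for each instance ID, how many termination
// requests appear in the given log text.
func countTerminations(logs string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		if m := terminateRe.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	sample := `I0520 06:20:35.995040       1 auto_scaling_groups.go:254] Terminating EC2 instance: i-0d9e512ddf8a7321b
I0520 06:20:46.360828       1 auto_scaling_groups.go:254] Terminating EC2 instance: i-0d9e512ddf8a7321b
W0520 06:21:48.687757       1 utils.go:483] Failed to remove node aws:///us-east-1b/i-0d9e512ddf8a7321b: node group min size reached, skipping unregistered node removal`

	// Any instance with more than one termination request is suspicious.
	for id, n := range countTerminations(sample) {
		if n > 1 {
			fmt.Printf("%s: %d termination requests\n", id, n)
		}
	}
}
```

An instance that shows up more than once is a candidate for the repeated deletion described in this issue.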

Metadata

Labels

area/cluster-autoscaler, kind/bug, lifecycle/rotten
