Azure AKS cluster autoscaler hangs during up scaling #1661

@netiperher

Description

CloudProvider: Azure AKS
Kubernetes version: 1.11.5
Cluster Autoscaler version: 1.3.5
ACS-Engine version: v0.26.3-aks

We run an Azure AKS cluster with multiple nodes, each running a single pod that requires all of the node's resources. The pods belong to a StatefulSet that we scale quite aggressively from 2 to 20 replicas and back to 2 at regular intervals.

Sometimes (<10% of cases) scaling up nodes takes considerably longer than usual, and this is often associated with the autoscaler pod crashing. When everything works normally, going from 2 to 20 nodes takes 21 minutes on average; when it does not, scaling up can take as long as 75 minutes.

Attached are the logs and a screenshot of the cluster_autoscaler_nodes_count metric from an occasion when upscaling stalled: the cluster scaled up to 16 nodes and then nothing happened for 30 minutes.

[Screenshot of the cluster_autoscaler_nodes_count metric, taken 2019-02-07 at 18:40:29]

Logs before crash (16:30 to 17:15)
autoscaler-azure-cluster-autoscaler-7c8df96664-h8fs4-before.log

Logs after crash (17:15 to 17:45)
autoscaler-azure-cluster-autoscaler-7c8df96664-h8fs4-after.log

The relevant log lines are around this line:

E0207 16:49:23.319809       1 static_autoscaler.go:283] Failed to scale up: failed to increase node group size: Code="" Message=""
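For anyone going through the attached logs, here is a minimal sketch for pulling out the failed scale-up entries. It assumes the usual klog header layout (severity letter, MMDD, timestamp, process ID, source file and line, then the message). The name `find_scale_up_failures` is hypothetical and not part of the autoscaler.

```python
import re

# Assumed klog header layout: severity letter, MMDD, time with microseconds,
# process ID, source file:line, then "] " followed by the message.
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<pid>\d+) (?P<source>[^:\s]+:\d+)\] (?P<message>.*)$"
)


def find_scale_up_failures(lines):
    """Return parsed fields for every error-level 'Failed to scale up' line."""
    failures = []
    for line in lines:
        match = KLOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # not a klog-formatted line, e.g. a stack trace
        if match["severity"] == "E" and match["message"].startswith("Failed to scale up"):
            failures.append(match.groupdict())
    return failures


if __name__ == "__main__":
    sample = [
        'E0207 16:49:23.319809       1 static_autoscaler.go:283] '
        'Failed to scale up: failed to increase node group size: Code="" Message=""',
    ]
    for entry in find_scale_up_failures(sample):
        print(entry["time"], entry["source"], entry["message"])
```

To run it against one of the attached files, pass the open file object, for example `find_scale_up_failures(open("autoscaler-azure-cluster-autoscaler-7c8df96664-h8fs4-before.log"))`.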

cc @feiskyer
