Description
CloudProvider: Azure AKS
Kubernetes version: 1.11.5
Cluster Autoscaler version: 1.3.5
ACS-Engine version: v0.26.3-aks
In our setup we run an Azure AKS cluster with multiple nodes, each running a single pod that requests all of the node's resources. The pods are part of a StatefulSet that we scale quite aggressively from 2 to 20 replicas and back to 2 at regular intervals.
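For reference, the scaling cycle is roughly equivalent to the sketch below, written against the official Kubernetes Python client; the namespace and StatefulSet name are placeholders, not the actual names from our cluster.

```python
# Minimal sketch of the scale-up/scale-down cycle described above.
# Assumes the official Kubernetes Python client; "default" and "workload"
# are placeholder namespace/StatefulSet names, not taken from this issue.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def scale_statefulset(namespace: str, name: str, replicas: int) -> None:
    # Patch only spec.replicas; since each pod requests a full node,
    # the cluster autoscaler has to add roughly one node per replica.
    apps.patch_namespaced_stateful_set(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

scale_statefulset("default", "workload", 20)  # aggressive scale-up
# ... later, at the next interval ...
scale_statefulset("default", "workload", 2)   # scale back down
```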
Sometimes (<10% of the time) scaling up the nodes takes considerably longer than usual, and this is often associated with the autoscaler pod crashing. When everything works normally, going from 2 to 20 nodes takes us 21 minutes on average; when it does not, scaling up can take as long as 75 minutes.
Attached are logs and a screenshot of the cluster_autoscaler_nodes_count metric from a run where upscaling stopped after reaching 16 nodes and then nothing happened for 30 minutes.
Logs before crash (16:30 to 17:15)
autoscaler-azure-cluster-autoscaler-7c8df96664-h8fs4-before.log
Logs after crash (17:15 to 17:45)
autoscaler-azure-cluster-autoscaler-7c8df96664-h8fs4-after.log
The relevant log lines are around:
E0207 16:49:23.319809 1 static_autoscaler.go:283] Failed to scale up: failed to increase node group size: Code="" Message=""
cc @feiskyer