Description
CloudProvider: Azure AKS
Kubernetes version: 1.11.5
Cluster Autoscaler version: 1.3.5
ACS-Engine version: v0.26.3-aks
In our setup we run an Azure AKS cluster with multiple nodes, each running a single pod that requests all of the node's resources. The pods are part of a StatefulSet that we scale quite aggressively from 2 to 20 replicas and back to 2 at regular intervals.
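For reference, the scaling cycle is roughly equivalent to the sketch below, written against the official Kubernetes Python client; the namespace and StatefulSet name are placeholders, not the actual names from our cluster.

```python
# Minimal sketch of the scale-up/scale-down cycle described above.
# Assumes the official Kubernetes Python client; "default" and "workload"
# are placeholder namespace/StatefulSet names, not taken from this issue.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def scale_statefulset(namespace: str, name: str, replicas: int) -> None:
    # Patch only spec.replicas; since each pod requests a full node,
    # the cluster autoscaler has to add roughly one node per replica.
    apps.patch_namespaced_stateful_set(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

scale_statefulset("default", "workload", 20)  # aggressive scale-up
# ... later, at the next interval ...
scale_statefulset("default", "workload", 2)   # scale back down
```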
Sometimes (<10% of the time) scaling up the nodes takes considerably longer than usual, and this is often associated with the autoscaler pod crashing. When everything works normally, going from 2 to 20 nodes takes us 21 minutes on average; when it does not, scaling up can take as long as 75 minutes.
Attached are logs and a screenshot of the cluster_autoscaler_nodes_count metric from a run where upscaling stopped after reaching 16 nodes and then nothing happened for 30 minutes.
Logs before crash (16:30 to 17:15)
autoscaler-azure-cluster-autoscaler-7c8df96664-h8fs4-before.log
Logs after crash (17:15 to 17:45)
autoscaler-azure-cluster-autoscaler-7c8df96664-h8fs4-after.log
The relevant log lines are around:
E0207 16:49:23.319809 1 static_autoscaler.go:283] Failed to scale up: failed to increase node group size: Code="" Message=""
cc @feiskyer