bulk scale-up in azure creates only one node per iteration sometimes #1984

@palmerabollo

Description

I think that cluster-autoscaler (CA) 1.3.x in Azure has problems dealing with affinity rules.

I use the following deployment to deploy a "pause" pod with two rules:

  • nodeAffinity: pods must run on a node in the agentpool named "genmlow"
  • podAntiAffinity: no two "pause" pods may be scheduled on the same node

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
  labels:
    app: pause
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: poolName
                    operator: In
                    values:
                      - genmlow
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - pause
              topologyKey: kubernetes.io/hostname
      containers:
      - image: "karlherler/pause:1.0"
        name: pause

The agentpool "genmlow" uses Standard_DS2_v2 machines (8 GB of memory) in a virtual machine scale set.

When I scale the deployment to 10 replicas (kubectl scale deployment pause --replicas=10), the cluster autoscaler (version 1.3.9, Kubernetes 1.11.8) creates only one node per iteration, as if it were ignoring the affinity rules. See the cluster-autoscaler logs below, where the node count goes 0->1->2->...->N.

I0503 14:03:19.299146       1 azure_manager.go:261] Refreshed ASG list, next refresh after 2019-05-03 14:04:19.2991386 +0000 UTC m=+948.211672501
I0503 14:03:19.993383       1 scale_up.go:249] Pod default/pause-66cf84dcdb-2khzb is unschedulable
I0503 14:03:19.993412       1 scale_up.go:249] Pod default/pause-66cf84dcdb-l7587 is unschedulable
I0503 14:03:19.993418       1 scale_up.go:249] Pod default/pause-66cf84dcdb-t5mb8 is unschedulable
I0503 14:03:19.993422       1 scale_up.go:249] Pod default/pause-66cf84dcdb-xp2kn is unschedulable
I0503 14:03:19.993426       1 scale_up.go:249] Pod default/pause-66cf84dcdb-rpskf is unschedulable
I0503 14:03:19.993429       1 scale_up.go:249] Pod default/pause-66cf84dcdb-kkxc5 is unschedulable
I0503 14:03:19.993433       1 scale_up.go:249] Pod default/pause-66cf84dcdb-lbprj is unschedulable
I0503 14:03:19.993437       1 scale_up.go:249] Pod default/pause-66cf84dcdb-lmwmf is unschedulable
I0503 14:03:19.993441       1 scale_up.go:249] Pod default/pause-66cf84dcdb-c8njm is unschedulable
I0503 14:03:19.993446       1 scale_up.go:249] Pod default/pause-66cf84dcdb-gg6xh is unschedulable
...
I0503 14:03:20.071931       1 utils.go:187] Pod pause-66cf84dcdb-kkxc5 can't be scheduled on k8s-genl-24772259-vmss. Used cached predicate check results
I0503 14:03:20.072229       1 utils.go:187] Pod pause-66cf84dcdb-lbprj can't be scheduled on k8s-genl-24772259-vmss. Used cached predicate check results
I0503 14:03:20.072529       1 utils.go:187] Pod pause-66cf84dcdb-lmwmf can't be scheduled on k8s-genl-24772259-vmss. Used cached predicate check results
I0503 14:03:20.073242       1 utils.go:187] Pod pause-66cf84dcdb-c8njm can't be scheduled on k8s-genl-24772259-vmss. Used cached predicate check results
...
I0503 14:03:20.076758       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:03:20.076770       1 scale_up.go:382] Estimated 1 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:03:20.076783       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 0->1 (max: 1000)}]
I0503 14:03:20.076796       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 1
...
I0503 14:06:13.334377       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:06:13.334411       1 scale_up.go:382] Estimated 1 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:06:13.334470       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 1->2 (max: 1000)}]
I0503 14:06:13.334503       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 2
...
I0503 14:09:02.059191       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:09:02.059243       1 scale_up.go:382] Estimated 1 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:09:02.059310       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 2->3 (max: 1000)}]
I0503 14:09:02.059350       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 3
...
I0503 14:11:50.214206       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:11:50.214228       1 scale_up.go:382] Estimated 1 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:11:50.214245       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 3->4 (max: 1000)}]
I0503 14:11:50.214262       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 4
...
...
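For context on what an estimate respecting the anti-affinity rule should look like: a required podAntiAffinity on topologyKey kubernetes.io/hostname means no two "pause" pods can share a node, so 10 pending replicas on an empty pool need 10 new nodes. A minimal sketch of both estimates (illustrative only; this is not the autoscaler's actual estimator code, and the 7 GiB allocatable figure is an assumption):

```python
import math

# With required podAntiAffinity on kubernetes.io/hostname, no two
# matching pods may share a node: each pending pod needs its own node.
def nodes_needed_with_hostname_anti_affinity(pending_pods: int) -> int:
    return pending_pods

# A packer that only bins by resource requests concludes that ten
# request-less (zero-sized) pods all fit on a single node.
def naive_nodes_needed(requests_gib, node_alloc_gib):
    total = sum(requests_gib)
    return max(math.ceil(total / node_alloc_gib), 1)

print(nodes_needed_with_hostname_anti_affinity(10))  # 10
print(naive_nodes_needed([0] * 10, 7.0))             # 1
```

The second function matches the "Estimated 1 nodes needed" lines above: with no requests, the pods look zero-sized to a request-only packer.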

However, it only behaves this way when the pod has no resource requests. If I add the following requests:

  resources:
    requests:
      memory: 5Gi

Everything works as expected: the cluster autoscaler creates the 10 virtual machines in a single batch (0->10). I guess this is because the autoscaler now knows it cannot fit two pods on a single node (5Gi + 5Gi > 8GB), even though it still seems to ignore the affinity rules.

I0503 14:31:36.574678       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:31:36.574722       1 scale_up.go:382] Estimated 10 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:31:36.574752       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 0->10 (max: 1000)}]
I0503 14:31:36.574786       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 10
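The 0->10 jump is consistent with simple request-based binpacking: with a 5Gi memory request, at most one pod fits per node of this size, so 10 replicas require 10 nodes. A quick arithmetic sketch (illustrative only; 7 GiB is an assumed allocatable figure, and any value below 10 GiB gives the same answer):

```python
import math

def estimated_nodes(replicas: int, request_gib: float, node_alloc_gib: float) -> int:
    """Nodes needed when packing purely by memory requests."""
    pods_per_node = int(node_alloc_gib // request_gib)  # 7 // 5 -> 1 pod per node
    return math.ceil(replicas / pods_per_node)

print(estimated_nodes(10, 5.0, 7.0))  # 10: matches the 0->10 scale-up plan
```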

It looks like a bug to me. The same setup on AWS (cluster autoscaler 1.2.x instead of 1.3.x is the only difference) works fine: the CA creates the 10 virtual machines whether or not you specify the container memory requests.
