
Conversation

@feiskyer (Member) commented Jul 1, 2019

This PR allows scaling multiple Azure vmss synchronously by delaying the vmss capacity updates in different goroutines.

To make it work, the upcoming nodes from getUpcomingNodeInfos() are also given distinct names, so that multiple nodes won't be merged into one node in filterOutSchedulableByPacking().

Partially fix #2044 (similar node issues are tracked at #2094)
Fix #1984

/cc @andyzhangx @nilo19
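
For illustration only, here is a minimal Go sketch of the delayed-update idea described above; the ScaleSet type, its fields, and the updateVMSSCapacity helper are simplified stand-ins rather than the actual cluster-autoscaler code:

package main

import (
	"fmt"
	"sync"
	"time"
)

// ScaleSet is a simplified stand-in for the Azure scale set wrapper in
// cluster-autoscaler; only the fields needed for this sketch are included.
type ScaleSet struct {
	name    string
	mutex   sync.Mutex
	curSize int64
}

// SetScaleSetSize records the new target size right away and pushes the
// capacity update to the cloud API from a separate goroutine, so that scale-up
// calls for different scale sets do not serialize on the slow API round trip.
func (s *ScaleSet) SetScaleSetSize(size int64) {
	s.mutex.Lock()
	s.curSize = size
	s.mutex.Unlock()

	go s.updateVMSSCapacity(size)
}

// updateVMSSCapacity stands in for the real PUT on the scale set capacity.
func (s *ScaleSet) updateVMSSCapacity(size int64) {
	time.Sleep(100 * time.Millisecond) // simulated API latency
	fmt.Printf("scale set %s resized to %d\n", s.name, size)
}

func main() {
	pools := []*ScaleSet{{name: "vmss-a"}, {name: "vmss-b"}}
	for i, p := range pools {
		p.SetScaleSetSize(int64(3 + i)) // both updates proceed in parallel
	}
	time.Sleep(200 * time.Millisecond) // crude wait for the goroutines in this toy example
}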

@k8s-ci-robot (Contributor)

@feiskyer: GitHub didn't allow me to request PR reviews from the following users: nilo19.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested a review from andyzhangx July 1, 2019 01:15
@k8s-ci-robot added the cncf-cla: yes label (the PR's author has signed the CNCF CLA) and the size/M label (changes 30-99 lines, ignoring generated files) Jul 1, 2019
@andyzhangx (Member) left a comment

The PR title should be "Allow scaling multiple Azure vmss simultaneously"; actually you are using an asynchronous way.

upcomingNodes = append(upcomingNodes, nodeTemplate.Node())
// Ensure new nodes have different names because nodeName is used as a map key.
node := nodeTemplate.Node().DeepCopy()
node.Name = fmt.Sprintf("%s-%d", node.Name, rand.Int63())
Member

will this affect non-Azure nodes as well?

@feiskyer (Member Author)

Yep, it would affect non-Azure nodes, but I think it should be fixed for all cloud providers.

@losipiuk @mwielgus Could you help to take a look at this?

@losipiuk (Contributor) commented Jul 2, 2019

It feels like a serious bug. Thanks for spotting and fixing it.
One note here: could you please use a counter (increased atomically using AddUint64) instead of a random value?
Also, could we mutate the UID too? We should probably extract a helper function for that, buildNodeForNodeTemplate?

cc: @vivekbagade. Vivek, could you please scan the code and see whether there are any more places where we use the node name as a dictionary key?
We could also add a sanity check in filterOutSchedulableUsingPacking to detect the situation when it gets a list of nodes with repeating names.
Also, we should probably use the UID (not the name) as the map key.

Contributor

We use the node name as a dictionary key in a few places. Need to check whether they could be causing any issues. Even if they are not, we should probably change this to avoid future issues.

Contributor

This is because of the CreateNodeNameToInfoMap func, which could potentially mask a few nodes.
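
For illustration, a self-contained sketch (not the actual scheduler code) of how a name-keyed map collapses identically named upcoming nodes:

package main

import "fmt"

// nodeInfo is a toy stand-in for the scheduler's NodeInfo; only the name matters here.
type nodeInfo struct{ name string }

// buildNodeNameToInfoMap mirrors the shape of CreateNodeNameToInfoMap: entries
// that share a name overwrite each other, so several upcoming nodes cloned from
// the same template collapse into a single map entry.
func buildNodeNameToInfoMap(nodes []*nodeInfo) map[string]*nodeInfo {
	m := make(map[string]*nodeInfo, len(nodes))
	for _, n := range nodes {
		m[n.name] = n
	}
	return m
}

func main() {
	upcoming := []*nodeInfo{{"template-node"}, {"template-node"}, {"template-node"}}
	fmt.Println(len(buildNodeNameToInfoMap(upcoming))) // 1: two of the three nodes are masked
}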

@feiskyer (Member Author)

Thanks for the suggestion, will update the PR.

> cc: @vivekbagade. Vivek, could you please scan the code and see whether there are any more places where we use the node name as a dictionary key?
> We could also add a sanity check in filterOutSchedulableUsingPacking to detect the situation when it gets a list of nodes with repeating names.
> Also, we should probably use the UID (not the name) as the map key.

That looks good. We can do the check and optimization after the code scan.

@feiskyer (Member Author)

> Could you please use a counter (increased atomically using AddUint64) instead of a random value?

Actually, the index would be OK. The node name here is only used for this single filterOutSchedulableUsingPacking() step.
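
For illustration, a self-contained sketch of the index-based naming; the helper name matches the buildNodeForNodeTemplate that appears later in the diff, but the signature and the UID handling here are simplified assumptions, not the PR's exact code:

package main

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// buildNodeForNodeTemplate sketches the index-based approach: copy the
// template node and give each upcoming node a unique name (and, per the
// review suggestion above, a unique UID) so name-keyed maps keep them apart.
// The actual helper in the PR takes a *schedulernodeinfo.NodeInfo; a plain
// *apiv1.Node is used here to keep the sketch self-contained.
func buildNodeForNodeTemplate(template *apiv1.Node, index int) *apiv1.Node {
	node := template.DeepCopy()
	node.Name = fmt.Sprintf("%s-%d", node.Name, index)
	node.UID = types.UID(fmt.Sprintf("%s-%d", node.UID, index)) // assumption: the UID is made unique too
	return node
}

func main() {
	template := &apiv1.Node{ObjectMeta: metav1.ObjectMeta{Name: "template-node", UID: "template-uid"}}
	for i := 0; i < 3; i++ {
		fmt.Println(buildNodeForNodeTemplate(template, i).Name) // template-node-0, template-node-1, template-node-2
	}
}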

@feiskyer (Member Author)

@losipiuk Updated, PTAL

// Invalidate the vmss size cache, so that it will be refetched from the API.
scaleSet.mutex.Lock()
defer scaleSet.mutex.Unlock()
scaleSet.lastRefresh = time.Now().Add(-1 * 15 * time.Second)
Member

Can you explain why we are setting lastRefresh to 15 seconds before Now()? Also, why not use Sub()? https://golang.org/pkg/time/#Time.Sub

@feiskyer (Member Author)

The 15 seconds comes from L122; I renamed it to vmssSizeRefreshPeriod in PR #2151 but forgot to change it here. Let me change this to vmssSizeRefreshPeriod as well, for clarity.

Sub is a different use case: it accepts a time.Time parameter, not a time.Duration.
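
For illustration, a small Go snippet showing the Add-versus-Sub distinction and why back-dating lastRefresh forces a refresh; vmssSizeRefreshPeriod here is a local constant, not the actual cluster-autoscaler value:

package main

import (
	"fmt"
	"time"
)

// A stand-in for the refresh interval; the real constant lives in the
// cluster-autoscaler Azure provider.
const vmssSizeRefreshPeriod = 15 * time.Second

func main() {
	now := time.Now()

	// time.Time.Add takes a Duration, so a negative duration moves the
	// timestamp into the past. Setting lastRefresh this way makes the cache
	// look expired, forcing the next read to hit the API.
	lastRefresh := now.Add(-vmssSizeRefreshPeriod)
	fmt.Println(lastRefresh.Add(vmssSizeRefreshPeriod).After(now)) // false: cache considered stale

	// time.Time.Sub takes another Time and returns the Duration between them,
	// so it answers a different question ("how long since?").
	fmt.Println(now.Sub(lastRefresh)) // 15s
}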

@andyzhangx (Member) left a comment

/lgtm
LGTM on the Azure part

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Jul 2, 2019
return found && oldest.Add(unschedulablePodWithGpuTimeBuffer).After(currentTime)
}

func buildNodeForNodeTemplate(nodeTemplate *schedulernodeinfo.NodeInfo, index int) *apiv1.Node {
Contributor

Looks good.
@vivekbagade is doing some testing on this (thanks!).
I will LGTM when we are done with that.

@CecileRobertMichon (Member) left a comment

lgtm

@vivekbagade (Contributor)

@losipiuk My testing is done. Works as expected. LGTM from my side.

@losipiuk (Contributor) commented Jul 3, 2019

/lgtm
/approve

@losipiuk (Contributor) commented Jul 3, 2019

Thanks!

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: losipiuk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jul 3, 2019
@k8s-ci-robot k8s-ci-robot merged commit a342a76 into kubernetes:master Jul 3, 2019
@feiskyer feiskyer deleted the bulk-scale-up branch July 3, 2019 12:26
@feiskyer (Member Author) commented Jul 3, 2019

@losipiuk Thanks

Successfully merging this pull request may close these issues:
Azure CA doesn't scale multiple agent pools in parallel
bulk scale-up in azure creates only one node per iteration sometimes