Bug 1913960: rebase on top of kubernetes/autoscaler 1.20 #185

elmiko · 2021-01-07T20:52:14Z

1.20 autoscaler rebase process

inspired by the commit description for the 1.19 rebase.
pr #164

identify carry commits:

git log --oneline --no-merges 78d401fd83e3a003c01056fff3f955850d06a756..openshift/master

where 78d401 reflects the changes since our last rebase (1.19.0). this is the
list of commits we will need to apply onto the new upstream version of the
autoscaler. ideally, some of these commits can be dropped.

After identifying the carry commits, the next step is to create the new commit-tree that
will be used for the rebase and then cherry pick the carry commits into the new branch.
The following commands cover these steps:

Process

$ git remote update # make sure we update our refs
$ git checkout 57cf6d00ef8af2760f3d94dc789002ef92eb4689 # cluster-autoscaler-1.20
$ git checkout -b merge-tmp # create a branch to do our merge work from
$ git checkout openshift/master # we want to be at the tip of the openshift master branch when we run the next command
$ echo 'merge upstream/cluster-autoscaler-1.20' | git commit-tree merge-tmp^{tree} -p HEAD -p merge-tmp -F -
deadbeef12345678 # id of new merge commit
$ git branch merge-1.20 deadbeef12345678 # create a new branch for the cherry-pick work
$ git checkout merge-1.20 # make sure we are on the proper branch
$ git cherry-pick <carry commits>

With the merge-1.20 branch in place, I cherry picked the carry commits which applied, resolved merge conflicts,
and finally tested the resulting tree against the unit test and end-to-end suite.

Carried Commits

These commits are for features which have not yet been accepted upstream, are integral to our CI platform, or are
specific to the releases we create for OpenShift.

ac7890394 UPSTREAM: <carry>: openshift: add fixes for unstructured scalable resources
3ad14d355 UPSTREAM: 3708: openshift: Bump github.com/heketi/heketi to v10.1.0
a4099f2d7 UPSTREAM: <carry>: Updating vertical-pod-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/vertical-pod-autoscaler.yml
30a061c91 UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/atomic-openshift-cluster-autoscaler.yml
22838a30c UPSTREAM: <carry>: openshift: Add tests for nodegroup TemplateNodeInfo
5b045be8c UPSTREAM: <carry>: openshift: Copy standard labels from an existing node if one exists
889ce351a UPSTREAM: <carry>: openshift: Ensure stable versions of generic labels are also included
55c138cd5 UPSTREAM: <carry>: openshift: Updating vertical-pod-autoscaler/Dockerfile.rhel baseimages to match ocp-build-data config
c0ac04d2b UPSTREAM: <carry>: openshift: Updating images/cluster-autoscaler/Dockerfile.rhel baseimages to match ocp-build-data config
3fabddac9 UPSTREAM: <carry>: openshift: remove Dockerfile.rhel7 files for ca and vpa
31d20311d UPSTREAM: <carry>: openshift: Add vertical-pod-autoscaler/Dockerfile.rhel to match build configuration in ocp-build-data
ac7669062 UPSTREAM: <carry>: openshift: Add images/cluster-autoscaler/Dockerfile.rhel to match build configuration in ocp-build-data
88e10aa44 UPSTREAM: <carry>: openshift: Update VPA builder images to use go 1.14
bce8208b3 UPSTREAM: <carry>: openshift: Implement scale from zero
a2156bf25 UPSTREAM: <carry>: openshift: Rename FailureMessage to ErrorMessage
45a0bd5e3 UPSTREAM: <carry>: openshift: update cluster-autoscaler OWNERS
f8624cb76 UPSTREAM: <carry>: openshift: Add OpenShift VPA image builds
104f3d494 UPSTREAM: <carry>: openshift: Revert "Adding config for event filtering"
d008d774a UPSTREAM: <carry>: openshift: Set default annotations, and cluster name label to use machine.openshift.io
937d51963 UPSTREAM: <carry>: Bump scripts go version to 1.13 and update path to cloudprovider/clusterapi
17de84708 UPSTREAM: <carry>: openshift: Extend makefile with 'make goimports' target
1a1f1fc76 UPSTREAM: <carry>: Fix git commit message verification script
9dd65caae UPSTREAM: <carry>: openshift: Switch builds to use Go 1.12
0ba73333a UPSTREAM: <carry>: openshift: create git history verification script
a2760f607 UPSTREAM: <carry>: openshift: Add fmt, lint, vet scripts/Makefile
0f70d6ae2 UPSTREAM: <carry>: openshift: Add a RHEL7 dockerfile and standarize format
a8e852169 UPSTREAM: <carry>: openshift: cluster-autoscaler.spec: set golang_version to 1.10
ce4bdc21d UPSTREAM: <carry>: openshift: cluster-autoscaler.spec: bump golang_version to 1.10.4
906f8f7eb UPSTREAM: <carry>: openshift: Fix spec file to be consistent
eafb0c97b UPSTREAM: <carry>: openshift: Bump embedded tools
99ec67a06 UPSTREAM: <carry>: openshift: Fix the spec and hack scripts so the package can be built in both build systems
5381d0447 UPSTREAM: <carry>: openshift: Add openshift/release Makefile and hack scripts
381e5a654 UPSTREAM: <carry>: openshift: Add dockerfile for cluster autoscaler.
7665c1f10 UPSTREAM: <carry>: openshift: Add spec file for cluster-autoscaler.

Dropped Commits

These commits were carried over the 1.19.x release history and represent work that has been accepted upstream.

99572119c UPSTREAM: <drop>: openshift: Update group identifier to use for Cluster API annotations
61b8048c4 UPSTREAM: <drop>: openshift: Support using --cloud-config for clusterapi provider
bf74515dc UPSTREAM: <carry>: openshift: add fixes for node autodiscovery
71ef9a5ba UPSTREAM: <drop>: openshift: Add node autodiscovery to cluster-autoscaler clusterapi provider
08af78fb5 UPSTREAM: <drop>: openshift: Improve Cluster API tests to work better with constrained resources
08111cefe UPSTREAM: <drop>: openshift: Convert clusterapi provider to use unstructured
19a57d277 UPSTREAM: <drop>: openshift: Update vendor to pull in necessary new paths for client-go
adaf35c01 UPSTREAM: <carry>: openshift: Fallback to Status Replicas if Replicas nil when listing NodeGroups
bb9b9bbe2 UPSTREAM: <carry>: openshift: Fallback to status if replicas nil in spec

Break up the logic in azure_manager

Fixes bug reported by go vet check stringintconv. In go casting an integer to string does not result in the string representation of the number, instead the result is a rune representing the codepoint of that number. Signed-off-by: Tobias Kohlbau <[email protected]>

CA: fix integer to string conversion

Signed-off-by: Marques Johansson <[email protected]>

This is the scale-down equivalent of kubernetes#3429 and it speeds-up findUnneeded by 5x+ in very large clusters (by reducing the number of expensive PreFilter calls #nodes times). A side effect of this change is removing "Simulating scheduling of <pod> to <node> return error <error>" logs. Using FitsAny we no longer have per-node scheduler errors that we could log. I think that's actually a good thing - even with klogx this log was incredibly spammy in cluster with >100 nodes and its practical value was questionable.

Use FitsAny in drain simulation

k8s Azure clients keeps tracks of previous HTTP 429 and Retry-After cool down periods. On subsequent calls, they will notice the ongoing throttling window and will return a synthetic errors (without HTTPStatusCode) rather than submitting a throttled request to the ARM API: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/vmssvmclient/azure_vmssvmclient.go#L154-L158 https://github.com/kubernetes/autoscaler/blob/a5ed2cc3fe0aabd92c7758e39f1a9c9fe3bd6505/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/retry/azure_error.go#L118-L123 Some CA components can cope with a temporarily outdated object view when throttled. They call in to `isAzureRequestsThrottled()` on clients errors to return stale objects from cache (if any) and extend the object's refresh period (if any). But this only works for the first API call (returning HTTP 429). Next calls in the same throttling window (per Retry-After header) won't be identified as throttled by `isAzureRequestsThrottled` due to their nul `HTTPStatusCode`. This can makes the CA panic during startup due a failing cache init, when more than one VMSS call hits throttling. We've seen this causing early restarts loops, re-scanning every VMSS due to cold cache on start, keeping the subscription throttled. Practically this change allows the 3 call sites (`scaleSet.Nodes()`, `scaleSet.getCurSize()`, and `AgentPool.getVirtualMachinesFromCache()`) to serve from cache (and extend the object's next refresh deadline) as they would do on the first HTTP 429 hit, rather than returning an error.

Make output of recommender tests easier to read

[cluster-autoscaler][clusterapi] Add support for node autodiscovery to clusterapi provider

remove duplicated values

add Packet cloudprovider owners

…dme. It is currently missing the link and prevent from navigating to Huawei cloud provider readme.md

Add HuaweiCloud info link to FAQ/Documentation section in CA main readme

…throttling Azure: serve stale on ongoing throttling

…er/aws/fix-link-in-readme Fix markdown style link in README

This change should substantially decrease the number of GCE Read Requests when Node Deletion takes place. As Read Requests take place for each node, for large node cluster this is impact-ful.

…uration Decrease the number of GCE Read Requests when node deletion.

…-if-updating dont update capacity if VMSS provisioning state is updating

…io/apiserver

…rhel to match build configuration in ocp-build-data This change adds a new Dockerfile.rhel file to control building release images. It updates the baseimages in the Dockerfile used for promotion in order to ensure it matches the configuration in the [ocp-build-data repository](https://github.com/openshift/ocp-build-data/tree/openshift-4.6-rhel-8/images) used for producing release artifacts. After this change merges, the release files in https://github.com/openshift/release/blob/master/ci-operator/config/openshift/kubernetes-autoscaler/openshift-kubernetes-autoscaler-master.yaml should be updated with the new dockerfile path.

…d vpa This change removes the now deprecated Dockerfile.rhel7 files from the cluster-autoscaler and the vertical-pod-autoscaler. These files are now replaced by the Dockerfile.rhel file.

…erfile.rhel baseimages to match ocp-build-data config this is a copy of the autogenerated commit, original message: This PR is autogenerated by the [ocp-build-data-enforcer][1]. It updates the base images in the Dockerfile used for promotion in order to ensure it matches the configuration in the [ocp-build-data repository][2] used for producing release artifacts. Instead of merging this PR you can also create an alternate PR that includes the changes found here. If you believe the content of this PR is incorrect, please contact the dptp team in #aos-art. [1]: https://github.com/openshift/ci-tools/tree/master/cmd/ocp-build-data-enforcer [2]: https://github.com/openshift/ocp-build-data/tree/openshift-4.6/images

…file.rhel baseimages to match ocp-build-data config this is a copy of the autogenerated commit, original message: This PR is autogenerated by the [ocp-build-data-enforcer][1]. It updates the base images in the Dockerfile used for promotion in order to ensure it matches the configuration in the [ocp-build-data repository][2] used for producing release artifacts. Instead of merging this PR you can also create an alternate PR that includes the changes found here. If you believe the content of this PR is incorrect, please contact the dptp team in #aos-art. [1]: https://github.com/openshift/ci-tools/tree/master/cmd/ocp-build-data-enforcer [2]: https://github.com/openshift/ocp-build-data/tree/openshift-4.6/images

…s are also included

…ode if one exists

…er & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/atomic-openshift-cluster-autoscaler.yml

…ages to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/vertical-pod-autoscaler.yml

Updating vendor against [email protected]:kubernetes/kubernetes.git:3eb90c19d0cf90b756c3e08e32c6495b91e0aeed (3eb90c1)

…ources and node group discovery This change adds several things which were removed or refactored during the conversion to use unstructured types and automatic node group discovery. These changes are mostly focused on the unit tests with a few extra bits in the business logic. Fixes scalableResourceProviderIDs to use proper unstructured resource object. Adds unit test fixes to account for dynamic client updates brought in during vendor updates, see kubernetes@4550bfe for the source of conflict. Adds fix to newNodeGroupFromScalableResource to ensure that scaling from zero is respected. Fixes the unit tests to use the unstructured package helper functions for inspecting the objects. Adds unit tests for unstructured annotations to ensure that the cpu, memory, gpu, and max pods information is properly parsed. Adds unit tests to ensure that the logic for scaling from zero is properly observed.

elmiko · 2021-01-08T04:19:27Z

/retest

openshift-ci · 2021-01-08T06:35:27Z

@elmiko: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/git-history	`be51a45`	link	`/test git-history`
ci/prow/e2e-gcp-operator	`be51a45`	link	`/test e2e-gcp-operator`
ci/prow/e2e-azure-operator	`be51a45`	link	`/test e2e-azure-operator`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

JoelSpeed

We should check that the new workflows that have been added won't affect our PRs

Changes in the autoscaler to be aware of:

Daemonset Pods are now evicted on drain, but no error is returned if this fails

JoelSpeed · 2021-01-08T10:19:23Z

cluster-autoscaler/Dockerfile

@@ -1,19 +0,0 @@
-# Copyright 2016 The Kubernetes Authors. All rights reserved


How does our release image get built? Is this going to affect that?

We have our own dockerfiles in images/cluster-autoscaler, this doesn't affect our build process, no issue here

we should exclusively be using the ./images/cluster-autoscaler/Dockerfile.rhel file to build our stuff.

JoelSpeed · 2021-01-08T10:55:01Z

/bugzilla refresh

openshift-ci-robot · 2021-01-08T10:55:09Z

@JoelSpeed: This pull request references Bugzilla bug 1913960, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.7.0) matches configured target release for branch (4.7.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

JoelSpeed · 2021-01-08T12:56:33Z

/approve

openshift-ci-robot · 2021-01-08T12:57:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [JoelSpeed]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

michaelgugino · 2021-01-08T13:33:26Z

/lgtm

openshift-ci-robot · 2021-01-08T13:45:05Z

@elmiko: All pull requests linked via external trackers have merged:

openshift/kubernetes-autoscaler#185

Bugzilla bug 1913960 has been moved to the MODIFIED state.

In response to this:

Bug 1913960: rebase on top of kubernetes/autoscaler 1.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot and others added 30 commits August 26, 2020 21:11

Merge pull request kubernetes#3460 from marwanad/restructure-code

d9e8fa3

Break up the logic in azure_manager

Move rounding logic inside scaleQuantityProportionally

825ec2c

Merge pull request kubernetes#3473 from tobiaskohlbau/CA/fixGoVet

bbc60b1

CA: fix integer to string conversion

add Packet cloudprovider owners

c619071

Signed-off-by: Marques Johansson <[email protected]>

Merge pull request kubernetes#3482 from MaciekPytel/fitsany_scaledown

a5ed2cc

Use FitsAny in drain simulation

Make output of recommender tests easier to read

79ed69e

Merge pull request kubernetes#3485 from jbartosik/test-log-format

cbe0add

Make output of recommender tests easier to read

Merge pull request kubernetes#3314 from detiber/autoDiscovery

f0938fc

[cluster-autoscaler][clusterapi] Add support for node autodiscovery to clusterapi provider

remove dupliated values

5dc52f2

fix: chart version bump

51f6eaf

Merge pull request kubernetes#3487 from gkarthiks/master

2457ad9

remove duplicated values

Merge pull request kubernetes#3480 from displague/packet-owners

b8e285b

add Packet cloudprovider owners

add HuaweiCloud info link to FAQ/Documentation section in CA main rea…

f15b612

…dme. It is currently missing the link and prevent from navigating to Huawei cloud provider readme.md

Merge pull request kubernetes#3491 from paulmchen/master

38d9cd7

Add HuaweiCloud info link to FAQ/Documentation section in CA main readme

Merge pull request kubernetes#3484 from DataDog/azure-detect-ongoing-…

70bef96

…throttling Azure: serve stale on ongoing throttling

Fix markdown style link in README

204d1cb

Merge pull request kubernetes#3492 from shibataka000/cluster-autoscal…

a80841e

…er/aws/fix-link-in-readme Fix markdown style link in README

Add names to controller fetcher test cases

05e28ec

Decrease the number of GCE Read Requests when node deletion.

2a85789

This change should substantially decrease the number of GCE Read Requests when Node Deletion takes place. As Read Requests take place for each node, for large node cluster this is impact-ful.

Go fmt lint

821f42d

Merge pull request kubernetes#3500 from jayantjain93/read-request-sat…

63259fb

…uration Decrease the number of GCE Read Requests when node deletion.

correctly classify error for failed scale ups

a3bada3

dont update capacity if VMSS provisioning state is updating

6da62d0

Merge pull request kubernetes#3510 from marwanad/dont-update-capacity…

24c1f90

…-if-updating dont update capacity if VMSS provisioning state is updating

Fix typo in chart README

03f60a4

Fix typo in scale_down.go

ce19ac9

Update imports after quota is migrated from k8s.io/kubernetes to k8s.…

a64c8d6

…io/apiserver

elmiko and others added 11 commits January 7, 2021 21:20

UPSTREAM: <carry>: openshift: remove Dockerfile.rhel7 files for ca an…

889eb8a

…d vpa This change removes the now deprecated Dockerfile.rhel7 files from the cluster-autoscaler and the vertical-pod-autoscaler. These files are now replaced by the Dockerfile.rhel file.

UPSTREAM: <carry>: openshift: Ensure stable versions of generic label…

c0a685b

…s are also included

UPSTREAM: <carry>: openshift: Copy standard labels from an existing n…

65b1b7d

…ode if one exists

UPSTREAM: <carry>: openshift: Add tests for nodegroup TemplateNodeInfo

c20293a

UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler build…

3bebbae

…er & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/atomic-openshift-cluster-autoscaler.yml

UPSTREAM: <carry>: Updating vertical-pod-autoscaler builder & base im…

024d0fa

…ages to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/vertical-pod-autoscaler.yml

UPSTREAM: 3708: openshift: Bump github.com/heketi/heketi to v10.1.0

8667f0c

Updating vendor against [email protected]:kubernetes/kubernetes.git:3eb90c19d0cf90b756c3e08e32c6495b91e0aeed (3eb90c1)

elmiko force-pushed the openshift-rebase-1.20 branch from 245ece6 to be51a45 Compare January 8, 2021 02:20

JoelSpeed reviewed Jan 8, 2021

View reviewed changes

openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 8, 2021

openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Jan 8, 2021

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 8, 2021

openshift-ci-robot assigned michaelgugino Jan 8, 2021

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2021

openshift-merge-robot merged commit 7f1d0ad into openshift:master Jan 8, 2021

elmiko deleted the openshift-rebase-1.20 branch January 8, 2021 13:54

sgreene570 mentioned this pull request Feb 22, 2021

NE-514: Rebase to upstream v1.8.1 openshift/coredns#52

Merged

2 tasks

elmiko mentioned this pull request Apr 28, 2021

Bug 1948719: rebase on top of kubernetes/autoscaler 1.21 #201

Merged

		@@ -1,19 +0,0 @@
		# Copyright 2016 The Kubernetes Authors. All rights reserved

Bug 1913960: rebase on top of kubernetes/autoscaler 1.20 #185

Bug 1913960: rebase on top of kubernetes/autoscaler 1.20 #185

Uh oh!

Conversation

elmiko commented Jan 7, 2021

1.20 autoscaler rebase process

Uh oh!

elmiko commented Jan 8, 2021

Uh oh!

openshift-ci bot commented Jan 8, 2021

Uh oh!

JoelSpeed left a comment

Choose a reason for hiding this comment

Uh oh!

JoelSpeed Jan 8, 2021

Choose a reason for hiding this comment

Uh oh!

JoelSpeed Jan 8, 2021

Choose a reason for hiding this comment

Uh oh!

elmiko Jan 8, 2021

Choose a reason for hiding this comment

Uh oh!

JoelSpeed commented Jan 8, 2021

Uh oh!

openshift-ci-robot commented Jan 8, 2021

Uh oh!

JoelSpeed commented Jan 8, 2021

Uh oh!

openshift-ci-robot commented Jan 8, 2021

Uh oh!

michaelgugino commented Jan 8, 2021

Uh oh!

openshift-ci-robot commented Jan 8, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

51 participants