Skip to content

Conversation

elmiko
Copy link

@elmiko elmiko commented Jan 7, 2021

1.20 autoscaler rebase process

inspired by the commit description for the 1.19 rebase.
pr #164

identify carry commits:

git log --oneline --no-merges 78d401fd83e3a003c01056fff3f955850d06a756..openshift/master

where 78d401 reflects the changes since our last rebase (1.19.0). this is the
list of commits we will need to apply onto the new upstream version of the
autoscaler. ideally, some of these commits can be dropped.

After identifying the carry commits, the next step is to create the new commit-tree that
will be used for the rebase and then cherry pick the carry commits into the new branch.
The following commands cover these steps:

Process

$ git remote update # make sure we update our refs
$ git checkout 57cf6d00ef8af2760f3d94dc789002ef92eb4689 # cluster-autoscaler-1.20
$ git checkout -b merge-tmp # create a branch to do our merge work from
$ git checkout openshift/master # we want to be at the tip of the openshift master branch when we run the next command
$ echo 'merge upstream/cluster-autoscaler-1.20' | git commit-tree merge-tmp^{tree} -p HEAD -p merge-tmp -F -
deadbeef12345678 # id of new merge commit
$ git branch merge-1.20 deadbeef12345678 # create a new branch for the cherry-pick work
$ git checkout merge-1.20 # make sure we are on the proper branch
$ git cherry-pick <carry commits>

With the merge-1.20 branch in place, I cherry picked the carry commits which applied, resolved merge conflicts,
and finally tested the resulting tree against the unit test and end-to-end suite.

Carried Commits

These commits are for features which have not yet been accepted upstream, are integral to our CI platform, or are
specific to the releases we create for OpenShift.

ac7890394 UPSTREAM: <carry>: openshift: add fixes for unstructured scalable resources
3ad14d355 UPSTREAM: 3708: openshift: Bump github.com/heketi/heketi to v10.1.0
a4099f2d7 UPSTREAM: <carry>: Updating vertical-pod-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/vertical-pod-autoscaler.yml
30a061c91 UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/atomic-openshift-cluster-autoscaler.yml
22838a30c UPSTREAM: <carry>: openshift: Add tests for nodegroup TemplateNodeInfo
5b045be8c UPSTREAM: <carry>: openshift: Copy standard labels from an existing node if one exists
889ce351a UPSTREAM: <carry>: openshift: Ensure stable versions of generic labels are also included
55c138cd5 UPSTREAM: <carry>: openshift: Updating vertical-pod-autoscaler/Dockerfile.rhel baseimages to match ocp-build-data config
c0ac04d2b UPSTREAM: <carry>: openshift: Updating images/cluster-autoscaler/Dockerfile.rhel baseimages to match ocp-build-data config
3fabddac9 UPSTREAM: <carry>: openshift: remove Dockerfile.rhel7 files for ca and vpa
31d20311d UPSTREAM: <carry>: openshift: Add vertical-pod-autoscaler/Dockerfile.rhel to match build configuration in ocp-build-data
ac7669062 UPSTREAM: <carry>: openshift: Add images/cluster-autoscaler/Dockerfile.rhel to match build configuration in ocp-build-data
88e10aa44 UPSTREAM: <carry>: openshift: Update VPA builder images to use go 1.14
bce8208b3 UPSTREAM: <carry>: openshift: Implement scale from zero
a2156bf25 UPSTREAM: <carry>: openshift: Rename FailureMessage to ErrorMessage
45a0bd5e3 UPSTREAM: <carry>: openshift: update cluster-autoscaler OWNERS
f8624cb76 UPSTREAM: <carry>: openshift: Add OpenShift VPA image builds
104f3d494 UPSTREAM: <carry>: openshift: Revert "Adding config for event filtering"
d008d774a UPSTREAM: <carry>: openshift: Set default annotations, and cluster name label to use machine.openshift.io
937d51963 UPSTREAM: <carry>: Bump scripts go version to 1.13 and update path to cloudprovider/clusterapi
17de84708 UPSTREAM: <carry>: openshift: Extend makefile with 'make goimports' target
1a1f1fc76 UPSTREAM: <carry>: Fix git commit message verification script
9dd65caae UPSTREAM: <carry>: openshift: Switch builds to use Go 1.12
0ba73333a UPSTREAM: <carry>: openshift: create git history verification script
a2760f607 UPSTREAM: <carry>: openshift: Add fmt, lint, vet scripts/Makefile
0f70d6ae2 UPSTREAM: <carry>: openshift: Add a RHEL7 dockerfile and standarize format
a8e852169 UPSTREAM: <carry>: openshift: cluster-autoscaler.spec: set golang_version to 1.10
ce4bdc21d UPSTREAM: <carry>: openshift: cluster-autoscaler.spec: bump golang_version to 1.10.4
906f8f7eb UPSTREAM: <carry>: openshift: Fix spec file to be consistent
eafb0c97b UPSTREAM: <carry>: openshift: Bump embedded tools
99ec67a06 UPSTREAM: <carry>: openshift: Fix the spec and hack scripts so the package can be built in both build systems
5381d0447 UPSTREAM: <carry>: openshift: Add openshift/release Makefile and hack scripts
381e5a654 UPSTREAM: <carry>: openshift: Add dockerfile for cluster autoscaler.
7665c1f10 UPSTREAM: <carry>: openshift: Add spec file for cluster-autoscaler.

Dropped Commits

These commits were carried over the 1.19.x release history and represent work that has been accepted upstream.

99572119c UPSTREAM: <drop>: openshift: Update group identifier to use for Cluster API annotations
61b8048c4 UPSTREAM: <drop>: openshift: Support using --cloud-config for clusterapi provider
bf74515dc UPSTREAM: <carry>: openshift: add fixes for node autodiscovery
71ef9a5ba UPSTREAM: <drop>: openshift: Add node autodiscovery to cluster-autoscaler clusterapi provider
08af78fb5 UPSTREAM: <drop>: openshift: Improve Cluster API tests to work better with constrained resources
08111cefe UPSTREAM: <drop>: openshift: Convert clusterapi provider to use unstructured
19a57d277 UPSTREAM: <drop>: openshift: Update vendor to pull in necessary new paths for client-go
adaf35c01 UPSTREAM: <carry>: openshift: Fallback to Status Replicas if Replicas nil when listing NodeGroups
bb9b9bbe2 UPSTREAM: <carry>: openshift: Fallback to status if replicas nil in spec

k8s-ci-robot and others added 30 commits August 26, 2020 21:11
Fixes bug reported by go vet check stringintconv. In go casting an
integer to string does not result in the string representation of
the number, instead the result is a rune representing
the codepoint of that number.

Signed-off-by: Tobias Kohlbau <[email protected]>
Signed-off-by: Marques Johansson <[email protected]>
This is the scale-down equivalent of kubernetes#3429 and it speeds-up
findUnneeded by 5x+ in very large clusters (by reducing the
number of expensive PreFilter calls #nodes times).

A side effect of this change is removing
"Simulating scheduling of <pod> to <node> return error <error>"
logs. Using FitsAny we no longer have per-node scheduler errors
that we could log. I think that's actually a good thing - even
with klogx this log was incredibly spammy in cluster with >100
nodes and its practical value was questionable.
k8s Azure clients keeps tracks of previous HTTP 429 and Retry-After cool
down periods. On subsequent calls, they will notice the ongoing throttling
window and will return a synthetic errors (without HTTPStatusCode) rather
than submitting a throttled request to the ARM API:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/vmssvmclient/azure_vmssvmclient.go#L154-L158
https://github.com/kubernetes/autoscaler/blob/a5ed2cc3fe0aabd92c7758e39f1a9c9fe3bd6505/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/retry/azure_error.go#L118-L123

Some CA components can cope with a temporarily outdated object view
when throttled. They call in to `isAzureRequestsThrottled()` on clients
errors to return stale objects from cache (if any) and extend the object's
refresh period (if any).

But this only works for the first API call (returning HTTP 429).
Next calls in the same throttling window (per Retry-After header)
won't be identified as throttled by `isAzureRequestsThrottled` due
to their nul `HTTPStatusCode`.

This can makes the CA panic during startup due a failing cache init, when
more than one VMSS call hits throttling. We've seen this causing early
restarts loops, re-scanning every VMSS due to cold cache on start,
keeping the subscription throttled.

Practically this change allows the 3 call sites (`scaleSet.Nodes()`,
`scaleSet.getCurSize()`, and `AgentPool.getVirtualMachinesFromCache()`) to
serve from cache (and extend the object's next refresh deadline) as they
would do on the first HTTP 429 hit, rather than returning an error.
Make output of recommender tests easier to read
[cluster-autoscaler][clusterapi] Add support for node autodiscovery to clusterapi provider
…dme. It is currently missing the link and prevent from navigating to Huawei cloud provider readme.md
Add HuaweiCloud info link to FAQ/Documentation section in CA main readme
…throttling

Azure: serve stale on ongoing throttling
…er/aws/fix-link-in-readme

Fix markdown style link in README
This change should substantially decrease the number of GCE Read Requests when Node Deletion takes place. As Read Requests take place for each node, for large node cluster this is impact-ful.
…uration

Decrease the number of GCE Read Requests when node deletion.
…-if-updating

dont update capacity if VMSS provisioning state is updating
elmiko and others added 11 commits January 7, 2021 21:20
…rhel to match build configuration in ocp-build-data

This change adds a new Dockerfile.rhel file to control building release
images. It updates the baseimages in the Dockerfile used for promotion
in order to ensure it matches the configuration in the
[ocp-build-data
repository](https://github.com/openshift/ocp-build-data/tree/openshift-4.6-rhel-8/images)
used for producing release artifacts.

After this change merges, the release files in
https://github.com/openshift/release/blob/master/ci-operator/config/openshift/kubernetes-autoscaler/openshift-kubernetes-autoscaler-master.yaml
should be updated with the new dockerfile path.
…d vpa

This change removes the now deprecated Dockerfile.rhel7 files from the
cluster-autoscaler and the vertical-pod-autoscaler. These files are now
replaced by the Dockerfile.rhel file.
…erfile.rhel baseimages to match ocp-build-data config

this is a copy of the autogenerated commit, original message:

This PR is autogenerated by the [ocp-build-data-enforcer][1].
It updates the base images in the Dockerfile used for promotion in order
to ensure it
matches the configuration in the [ocp-build-data repository][2] used
for producing release artifacts.

Instead of merging this PR you can also create an alternate PR that
includes the changes found here.

If you believe the content of this PR is incorrect, please contact the
dptp team in #aos-art.

[1]:
https://github.com/openshift/ci-tools/tree/master/cmd/ocp-build-data-enforcer
[2]:
https://github.com/openshift/ocp-build-data/tree/openshift-4.6/images
…file.rhel baseimages to match ocp-build-data config

this is a copy of the autogenerated commit, original message:

This PR is autogenerated by the [ocp-build-data-enforcer][1].
It updates the base images in the Dockerfile used for promotion in order
to ensure it
matches the configuration in the [ocp-build-data repository][2] used
for producing release artifacts.

Instead of merging this PR you can also create an alternate PR that
includes the changes found here.

If you believe the content of this PR is incorrect, please contact the
dptp team in #aos-art.

[1]:
https://github.com/openshift/ci-tools/tree/master/cmd/ocp-build-data-enforcer
[2]:
https://github.com/openshift/ocp-build-data/tree/openshift-4.6/images
Updating vendor against [email protected]:kubernetes/kubernetes.git:3eb90c19d0cf90b756c3e08e32c6495b91e0aeed (3eb90c1)
…ources and node group discovery

This change adds several things which were removed or refactored during
the conversion to use unstructured types and automatic node group
discovery. These changes are mostly focused on the unit tests with a few
extra bits in the business logic.

Fixes scalableResourceProviderIDs to use proper unstructured resource
object.

Adds unit test fixes to account for dynamic client updates brought in
during vendor updates, see
kubernetes@4550bfe
for the source of conflict.

Adds fix to newNodeGroupFromScalableResource to ensure that scaling from
zero is respected.

Fixes the unit tests to use the unstructured package helper
functions for inspecting the objects.

Adds unit tests for unstructured annotations to ensure that the cpu,
memory, gpu, and max pods information is properly parsed.

Adds unit tests to ensure that the logic for scaling from zero is
properly observed.
@elmiko elmiko force-pushed the openshift-rebase-1.20 branch from 245ece6 to be51a45 Compare January 8, 2021 02:20
@elmiko
Copy link
Author

elmiko commented Jan 8, 2021

/retest

@openshift-ci
Copy link

openshift-ci bot commented Jan 8, 2021

@elmiko: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/git-history be51a45 link /test git-history
ci/prow/e2e-gcp-operator be51a45 link /test e2e-gcp-operator
ci/prow/e2e-azure-operator be51a45 link /test e2e-azure-operator

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check that the new workflows that have been added won't affect our PRs

Changes in the autoscaler to be aware of:

  • Daemonset Pods are now evicted on drain, but no error is returned if this fails

@@ -1,19 +0,0 @@
# Copyright 2016 The Kubernetes Authors. All rights reserved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does our release image get built? Is this going to affect that?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have our own dockerfiles in images/cluster-autoscaler, this doesn't affect our build process, no issue here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should exclusively be using the ./images/cluster-autoscaler/Dockerfile.rhel file to build our stuff.

@JoelSpeed
Copy link

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 8, 2021
@openshift-ci-robot
Copy link

@JoelSpeed: This pull request references Bugzilla bug 1913960, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Jan 8, 2021
@JoelSpeed
Copy link

/approve

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 8, 2021
@michaelgugino
Copy link

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2021
@openshift-merge-robot openshift-merge-robot merged commit 7f1d0ad into openshift:master Jan 8, 2021
@openshift-ci-robot
Copy link

@elmiko: All pull requests linked via external trackers have merged:

Bugzilla bug 1913960 has been moved to the MODIFIED state.

In response to this:

Bug 1913960: rebase on top of kubernetes/autoscaler 1.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.