Bug 1913960: rebase on top of kubernetes/autoscaler 1.20 #185
Conversation
Break up the logic in azure_manager
Fixes a bug reported by the go vet check stringintconv. In Go, converting an integer to a string does not produce the decimal representation of the number; instead, the result is a one-rune string containing the Unicode code point with that value. Signed-off-by: Tobias Kohlbau <[email protected]>
CA: fix integer to string conversion
Signed-off-by: Marques Johansson <[email protected]>
This is the scale-down equivalent of kubernetes#3429 and it speeds up findUnneeded by 5x+ in very large clusters (by avoiding repeating the expensive PreFilter call once per node). A side effect of this change is removing "Simulating scheduling of <pod> to <node> return error <error>" logs. Using FitsAny we no longer have per-node scheduler errors that we could log. I think that's actually a good thing - even with klogx this log was incredibly spammy in clusters with >100 nodes and its practical value was questionable.
Use FitsAny in drain simulation
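A rough sketch of the optimization shape described above, using local stand-in types rather than the real scheduler framework API: run the expensive pod-scoped PreFilter step once, then do only the cheap per-node Filter check, stopping at the first node that fits.

```go
package main

import "fmt"

// Illustrative stand-ins; the real autoscaler drives k8s scheduler
// framework plugins, not these simplified types.
type Pod struct {
	Name       string
	CPURequest int
}
type Node struct {
	Name    string
	CPUFree int
}

// preFilter models the expensive, pod-scoped preprocessing step. The point
// of a FitsAny-style check is to run it once per pod instead of once per
// (pod, node) pair.
func preFilter(p Pod) int { return p.CPURequest }

// filter models the cheap per-node feasibility check.
func filter(need int, n Node) bool { return n.CPUFree >= need }

// fitsAny returns the name of the first node the pod fits on, or "".
// Note there is no per-node error to log: we only learn whether some
// node fits, which is why the spammy per-node log line goes away.
func fitsAny(p Pod, nodes []Node) string {
	need := preFilter(p) // expensive step, executed once
	for _, n := range nodes {
		if filter(need, n) {
			return n.Name // stop at the first fit
		}
	}
	return ""
}

func main() {
	nodes := []Node{{"n1", 1}, {"n2", 4}}
	fmt.Println(fitsAny(Pod{"p", 2}, nodes)) // n2
}
```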
k8s Azure clients keep track of previous HTTP 429 responses and Retry-After cool-down periods. On subsequent calls, they will notice the ongoing throttling window and will return a synthetic error (without HTTPStatusCode) rather than submitting a throttled request to the ARM API: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/vmssvmclient/azure_vmssvmclient.go#L154-L158 https://github.com/kubernetes/autoscaler/blob/a5ed2cc3fe0aabd92c7758e39f1a9c9fe3bd6505/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/retry/azure_error.go#L118-L123 Some CA components can cope with a temporarily outdated object view when throttled. They call into `isAzureRequestsThrottled()` on client errors to return stale objects from cache (if any) and extend the object's refresh period (if any). But this only works for the first API call (the one returning HTTP 429). Later calls in the same throttling window (per the Retry-After header) won't be identified as throttled by `isAzureRequestsThrottled` due to their nil `HTTPStatusCode`. This can make the CA panic during startup due to a failing cache init, when more than one VMSS call hits throttling. We've seen this cause early restart loops, re-scanning every VMSS due to a cold cache on start, keeping the subscription throttled. Practically, this change allows the 3 call sites (`scaleSet.Nodes()`, `scaleSet.getCurSize()`, and `AgentPool.getVirtualMachinesFromCache()`) to serve from cache (and extend the object's next refresh deadline) as they would do on the first HTTP 429 hit, rather than returning an error.
Make output of recommender tests easier to read
[cluster-autoscaler][clusterapi] Add support for node autodiscovery to clusterapi provider
remove duplicated values
add Packet cloudprovider owners
…dme. It is currently missing the link and prevents navigating to the Huawei cloud provider readme.md
Add HuaweiCloud info link to FAQ/Documentation section in CA main readme
…throttling Azure: serve stale on ongoing throttling
…er/aws/fix-link-in-readme Fix markdown style link in README
This change should substantially decrease the number of GCE Read Requests when node deletion takes place. Since a Read Request is made for each node, the impact is significant for clusters with many nodes.
…uration Decrease the number of GCE Read Requests during node deletion.
…-if-updating don't update capacity if VMSS provisioning state is updating
…rhel to match build configuration in ocp-build-data This change adds a new Dockerfile.rhel file to control building release images. It updates the baseimages in the Dockerfile used for promotion in order to ensure it matches the configuration in the [ocp-build-data repository](https://github.com/openshift/ocp-build-data/tree/openshift-4.6-rhel-8/images) used for producing release artifacts. After this change merges, the release files in https://github.com/openshift/release/blob/master/ci-operator/config/openshift/kubernetes-autoscaler/openshift-kubernetes-autoscaler-master.yaml should be updated with the new dockerfile path.
…d vpa This change removes the now deprecated Dockerfile.rhel7 files from the cluster-autoscaler and the vertical-pod-autoscaler. These files are now replaced by the Dockerfile.rhel file.
…erfile.rhel baseimages to match ocp-build-data config this is a copy of the autogenerated commit, original message: This PR is autogenerated by the [ocp-build-data-enforcer][1]. It updates the base images in the Dockerfile used for promotion in order to ensure it matches the configuration in the [ocp-build-data repository][2] used for producing release artifacts. Instead of merging this PR you can also create an alternate PR that includes the changes found here. If you believe the content of this PR is incorrect, please contact the dptp team in #aos-art. [1]: https://github.com/openshift/ci-tools/tree/master/cmd/ocp-build-data-enforcer [2]: https://github.com/openshift/ocp-build-data/tree/openshift-4.6/images
…file.rhel baseimages to match ocp-build-data config this is a copy of the autogenerated commit, original message: This PR is autogenerated by the [ocp-build-data-enforcer][1]. It updates the base images in the Dockerfile used for promotion in order to ensure it matches the configuration in the [ocp-build-data repository][2] used for producing release artifacts. Instead of merging this PR you can also create an alternate PR that includes the changes found here. If you believe the content of this PR is incorrect, please contact the dptp team in #aos-art. [1]: https://github.com/openshift/ci-tools/tree/master/cmd/ocp-build-data-enforcer [2]: https://github.com/openshift/ocp-build-data/tree/openshift-4.6/images
…s are also included
…ode if one exists
…er & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/atomic-openshift-cluster-autoscaler.yml
Updating vendor against [email protected]:kubernetes/kubernetes.git:3eb90c19d0cf90b756c3e08e32c6495b91e0aeed (3eb90c1)
…ources and node group discovery This change adds several things which were removed or refactored during the conversion to use unstructured types and automatic node group discovery. These changes are mostly focused on the unit tests, with a few extra bits in the business logic.
- Fixes scalableResourceProviderIDs to use a proper unstructured resource object.
- Adds unit test fixes to account for dynamic client updates brought in during vendor updates; see kubernetes@4550bfe for the source of conflict.
- Adds a fix to newNodeGroupFromScalableResource to ensure that scaling from zero is respected.
- Fixes the unit tests to use the unstructured package helper functions for inspecting the objects.
- Adds unit tests for unstructured annotations to ensure that the cpu, memory, gpu, and max pods information is properly parsed.
- Adds unit tests to ensure that the logic for scaling from zero is properly observed.
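As a rough illustration of the scale-from-zero annotation parsing mentioned above, here is a self-contained sketch. The annotation keys and the plain-integer value format are assumptions for illustration; the real clusterapi provider reads capacity annotations from unstructured MachineSet/MachineDeployment objects and parses Kubernetes resource quantities (e.g. "16384Mi"), not bare integers.

```go
package main

import (
	"fmt"
	"strconv"
)

// capacityFromAnnotations extracts scale-from-zero capacity hints from a
// scalable resource's annotations. Keys and plain-int values are
// illustrative stand-ins for the provider's real quantity parsing.
func capacityFromAnnotations(ann map[string]string) (cpu, memMiB, gpu int64, ok bool) {
	parse := func(key string) (int64, bool) {
		v, found := ann[key]
		if !found {
			return 0, false
		}
		n, err := strconv.ParseInt(v, 10, 64)
		return n, err == nil
	}
	var okCPU, okMem bool
	cpu, okCPU = parse("capacity.cluster-autoscaler.kubernetes.io/cpu")
	memMiB, okMem = parse("capacity.cluster-autoscaler.kubernetes.io/memory")
	gpu, _ = parse("capacity.cluster-autoscaler.kubernetes.io/gpu-count") // optional
	// cpu and memory are required to size a node group from zero.
	return cpu, memMiB, gpu, okCPU && okMem
}

func main() {
	ann := map[string]string{
		"capacity.cluster-autoscaler.kubernetes.io/cpu":    "4",
		"capacity.cluster-autoscaler.kubernetes.io/memory": "16384",
	}
	cpu, mem, gpu, ok := capacityFromAnnotations(ann)
	fmt.Println(cpu, mem, gpu, ok) // 4 16384 0 true
}
```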
Compare: 245ece6 to be51a45
/retest
@elmiko: The following tests failed:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
We should check that the new workflows that have been added won't affect our PRs
Changes in the autoscaler to be aware of:
- Daemonset Pods are now evicted on drain, but no error is returned if this fails
@@ -1,19 +0,0 @@
# Copyright 2016 The Kubernetes Authors. All rights reserved
How does our release image get built? Is this going to affect that?
We have our own dockerfiles in images/cluster-autoscaler, this doesn't affect our build process, no issue here
we should exclusively be using the ./images/cluster-autoscaler/Dockerfile.rhel file to build our stuff.
/bugzilla refresh
@JoelSpeed: This pull request references Bugzilla bug 1913960, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: JoelSpeed. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm
@elmiko: All pull requests linked via external trackers have merged: Bugzilla bug 1913960 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
1.20 autoscaler rebase process
inspired by the commit description for the 1.19 rebase.
pr #164
identify carry commits:
where 78d401 reflects the changes since our last rebase (1.19.0). This is the list of commits we will need to apply onto the new upstream version of the autoscaler. Ideally, some of these commits can be dropped.
After identifying the carry commits, the next step is to create the new commit-tree that
will be used for the rebase and then cherry pick the carry commits into the new branch.
The following commands cover these steps:
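The command block appears to have been lost in extraction; a hedged reconstruction of the usual commit-tree rebase flow follows. The `78d401` ref is the merge point mentioned above; the remote name, upstream tag, and cherry-pick placeholder are illustrative assumptions, not the exact commands from the original PR.

```shell
# List the carry commits since the last rebase point (78d401).
git log --oneline --no-merges 78d401..openshift/master

# Start the new branch from our current history, then create a merge
# commit whose tree is exactly the upstream 1.20 release tree.
git checkout -b merge-1.20 openshift/master
TREE=$(git rev-parse 'cluster-autoscaler-1.20.0^{tree}')
MERGE=$(git commit-tree -p HEAD -p cluster-autoscaler-1.20.0 \
        -m "Merge kubernetes/autoscaler cluster-autoscaler-1.20.0" "$TREE")
git reset --hard "$MERGE"

# Re-apply the carry commits on top, resolving conflicts as they arise.
git cherry-pick <carry-commit>...
```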
Process
With the merge-1.20 branch in place, I cherry-picked the carry commits which applied, resolved merge conflicts, and finally tested the resulting tree against the unit test and end-to-end suites.
Carried Commits
These commits are for features which have not yet been accepted upstream, are integral to our CI platform, or are
specific to the releases we create for OpenShift.
Dropped Commits
These commits were carried through the 1.19.x release history and represent work that has since been accepted upstream.