OCPCLOUD-2835: rebase on upstream 1.32.0 release #340

elmiko · 2025-02-27T19:53:45Z

This commit rebases the autoscaler on top of the Kubernetes/Autoscaler 1.32.0 release. There are several commits that we carry on top of the upstream autoscaler and the rebase process allows us to preserve those. Here is a description of the process I used to create this PR.

(inspired by the commit description for the 1.18 rebase. pr #139)

Process

First we need to identify the carry commits that we currently have, this is done against our previous rebase to catch new changes. Once identified we will drop commits which have merged upstream and only carry unique commits. (see below for the carried and dropped commits).

Identify carry commits (run from the openshift/master branch), these are the commits that begin with UPSTREAM: up until the merge commit for the previous rebase commit (merge upstream/cluster-autoscaler-1.31.0)

git log -n 30 --oneline --no-merges

Important Note for next rebase: due to an error in the upstream cluster-autoscaler-1.32.0 tag, this rebase uses the cluster-autoscaler-1.32 branch as the starting point.

After identifying the carry commits, the next step is to create the new commit-tree that will be used for the rebase and then cherry pick the carry commits into the new branch. The following commands cover these steps:

$ git remote update # make sure we update our refs
$ git checkout cluster-autoscaler-release-1.32
$ git checkout -b merge-tmp # create a temporary branch for our merge commit
$ git checkout openshift/master # we want to be at the tip of the openshift master branch when we run the next command
$ echo 'merge upstream/cluster-autoscaler-release-1.32' | git commit-tree merge-tmp^{tree} -p HEAD -p merge-tmp -F - # create a new merge commit for our history
deadbeef12345678 # id of new merge commit
$ git branch merge-1.32 deadbeef12345678 # create a new branch for the cherry-pick work
$ git checkout merge-1.32
$ git cherry-pick <carry commits> # cherry pick the needed commits into the new branch

With the merge-1.32 branch in place, I cherry picked the carry commits which applied, resolved merge conflicts, and finally tested the resulting tree against the unit test and end-to-end suite.

Carried Commits

These commits are for features which have not yet been accepted upstream, are integral to our CI platform, or are specific to the releases we create for OpenShift.

f42d15b60 UPSTREAM: <carry>: Update unit tests using upstream annotations new values
d90f8b39d UPSTREAM: <carry>: Add unit tests for upstream annotations
d1fdf4c97 UPSTREAM: <carry>: Update to prefer upstream annotations if present
1409aa59f UPSTREAM: <carry>: Fix unstructured taint parsing in Cluster API provider
43224a99b UPSTREAM: <carry>: revert capacity annotations
646212363 UPSTREAM: <carry>: add machine api label and taint functionality
15d127772 UPSTREAM: <carry>: Have VPA ignore phantom containers named "POD"
4dd4b0916 UPSTREAM: <carry>: Handle old Machine API specific machine delete annotation
c80d1386c UPSTREAM: <carry>: Rename FailureMessage to ErrorMessage
512dc15a9 UPSTREAM: <carry>: configure repository for OpenShift releases

Squashed Commits

These commits were squashed into the carried commits to help reduce the length of our history. All these commits have been squashed into their topically related commits.

958747fa3 UPSTREAM: <carry>: Update VPA dockerfile to 4.19
1ed6c1547 UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler-container image to be consistent with ART for 4.19 Reconciling with https://github.com/openshift/ocp-build-data/tree/a39508c86497b4e5e463d7b2c78e51e577be9e7d/images/atomic-openshift-cluster-autoscaler.yml
018c6237c UPSTREAM: <carry>: Updating ose-vertical-pod-autoscaler-container image to be consistent with ART for 4.19 Reconciling with https://github.com/openshift/ocp-build-data/tree/a39508c86497b4e5e463d7b2c78e51e577be9e7d/images/vertical-pod-autoscaler.yml
3a27e72c0 UPSTREAM: <carry>: VPA OWNERS: Remove John Kyros, et al., add Max Cao
33194fb53 UPSTREAM: <carry>: Update VPA dockerfiles

Dropped Commits

These commits were dropped.

91e89f91e UPSTREAM: <carry>: :bug:(metrics) Initialize metrics for autoscaler errors, scale events, and pod evictions
a1f51718a UPSTREAM: 7388: cleanup clusterapi imports

Of special note in this rebase is this commit

43224a99b UPSTREAM: <carry>: revert capacity annotations

due to the scale from zero changes being accepted upstream we can now drop our carried patch. but, the upstream implementation has differed slightly from our's (mainly around annotation names). we will need to carry this patch until we can fix all the providers to properly use the new annotations. This patch can be dropped once the epic contained in https://issues.redhat.com/browse/OCPCLOUD-2136 is completed.

Change log level of boot dist type and size defaulting in gce_price

…ud-endpoint-reloving-logging 7437 Add logging for endpoint resolving errors

Fix typo in error message

exclude self-managed nodes from being processed

…tion in Cluster Autoscaler Signed-off-by: Maxim Rubchinsky <[email protected]>

Signed-off-by: Omer Aplatony <[email protected]>

…5.75.2-2 Upgrade OCI providers SDK to v65.75.2.

Add flag to force remove long unregistered nodes

…dNodeInfo We need AddNodeInfo in order to propagate DRA objects through the snapshot, which makes AddNodeWithPods redundant.

AddNodes() is redundant - it was indended for batch adding nodes, with batch-specific optimizations in mind probably. However, it has always been implemented as just iterating over AddNode(), and is only used in test code. Most of the uses in the test code were initializing the cluster state. They are replaced with SetClusterState(), which will later be needed for handling DRA anyway (we'll have to start tracking things that aren't node- or pod-scoped). The other uses are replaced with inline loops over AddNode().

The method is already accessible via StorageInfos(), it's redundant.

AddNodeInfo already provides the same functionality, and has to be used in production code in order to propagate DRA objects correctly. Uses in production are replaced with SetClusterState(), which will later take DRA objects into account. Uses in the test code are replaced with AddNodeInfo().

RemoveNode is renamed to RemoveNodeInfo for consistency with other NodeInfo methods. For DRA, the snapshot will have to potentially allocate ResourceClaims when adding a Pod to a Node, and deallocate them when removing a Pod from a Node. This will happen in new methods added to ClusterSnapshot in later commits - SchedulePod and UnschedulePod. These new methods should be the "default" way of moving pods around the snapshot going forward. However, we'll still need to be able to add and remove pods from the snapshot "forcefully" to handle some corner cases (e.g. expendable pods). AddPod is renamed to ForceAddPod, and RemovePod to ForceRemovePod to highlight that these are no longer the "default" methods of moving pods around the snapshot, and are bypassing something important.

It's now redundant - SetClusterState with empty arguments does the same thing.

…groups Add support for node pool placement group config

…eanup CA: refactor ClusterSnapshot methods

…d-rrsa-new-env-vars 7435 Support New Alibaba Cloud ENV Variables names for RRSA Authorization

VPA - Update docs a little

elmiko · 2025-04-15T11:11:15Z

this needs the carry patches we added for ocpbugs-11115

Add improved error handling for machines phase in the ClusterAPI node group implementation. When a machine is in Deleting/Failed/Pending phase, mark the cloudprovider.Instance with a status for cluster-autoscaler recovery actions. The changes: - Enhance Nodes listing to allow reporting the machine phase in Instance status - Add error status reporting for failed machines This change helps identify and manage failed machines more effectively, allowing the autoscaler to make better scaling decisions.

elmiko · 2025-04-15T12:09:51Z

added the last carry patch from the upstream, we should be ok to merge this now.

elmiko · 2025-04-15T12:10:05Z

/hold cancel

JoelSpeed · 2025-04-15T15:09:47Z

/lgtm

openshift-ci-robot · 2025-04-15T15:49:13Z

/retest-required

Remaining retests: 0 against base HEAD b1bbcd4 and 2 for PR HEAD d0100ff in total

openshift-ci-robot · 2025-04-15T19:10:29Z

/retest-required

Remaining retests: 0 against base HEAD b1bbcd4 and 2 for PR HEAD d0100ff in total

openshift-ci-robot · 2025-04-16T01:50:57Z

/retest-required

Remaining retests: 0 against base HEAD b1bbcd4 and 2 for PR HEAD d0100ff in total

openshift-ci · 2025-04-16T04:01:09Z

@elmiko: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/git-history	`d0100ff`	link	false	`/test git-history`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2025-04-16T08:50:31Z

/retest-required

Remaining retests: 0 against base HEAD b1bbcd4 and 2 for PR HEAD d0100ff in total

elmiko · 2025-04-16T11:05:39Z

i think we are again seeing slow node creation on azure.

/test e2e-azure-periodic-pre

damdo · 2025-04-16T11:19:19Z

It looks like the git-history checker is picking up the azure: increase UT coverage in azure_vms_pool commit erroneously as a non-conformant non-upstream message.

It looks like this repo is using a custom script https://github.com/openshift/kubernetes-autoscaler/blob/master/hack/verify_history.sh

and this might be a bug in it. I would suggest using commitchecker instead of it: https://github.com/openshift/release/blob/209cbb25d63506072a3089c85b2efdc84624c6e1/ci-operator/config/openshift/cluster-api/openshift-cluster-api-release-4.20.yaml#L173-L178

JoelSpeed · 2025-04-16T11:21:25Z

/test e2e-azure-periodic-pre

elmiko · 2025-04-16T12:25:52Z

i've created openshift/release#63941 to address the history stuff.

openshift-ci-robot · 2025-04-16T13:57:08Z

/retest-required

Remaining retests: 0 against base HEAD b1bbcd4 and 2 for PR HEAD d0100ff in total

elmiko · 2025-04-16T14:15:17Z

/test e2e-hypershift

elmiko · 2025-04-16T17:17:06Z

/test e2e-hypershift

damdo · 2025-04-17T11:20:17Z

/tide refresh

sdodson · 2025-04-17T12:57:49Z

/payload 4.19 nightly blocking

openshift-ci · 2025-04-17T12:57:53Z

@sdodson: trigger 11 job(s) of type blocking for the nightly release of OCP 4.19

periodic-ci-openshift-release-master-ci-4.19-e2e-aws-upgrade-ovn-single-node
periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade
periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance
periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-serial
periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview
periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview-serial
periodic-ci-openshift-release-master-nightly-4.19-fips-payload-scan
periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-bm
periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/917fc7c0-1b8b-11f0-96bd-dba6a0b812d0-0

sdodson · 2025-04-17T19:05:40Z

/test e2e-hypershift

openshift-bot · 2025-04-18T02:11:14Z

[ART PR BUILD NOTIFIER]

Distgit: atomic-openshift-cluster-autoscaler
This PR has been included in build atomic-openshift-cluster-autoscaler-container-v4.19.0-202504180114.p0.g8eef719.assembly.stream.el9.
All builds following this will include this PR.

openshift-bot · 2025-04-18T02:11:18Z

[ART PR BUILD NOTIFIER]

Distgit: vertical-pod-autoscaler
This PR has been included in build ose-vertical-pod-autoscaler-container-v4.19.0-202504180114.p0.g8eef719.assembly.stream.el9.
All builds following this will include this PR.

elmiko · 2025-04-18T11:24:27Z

/test

PBundyra and others added 30 commits November 14, 2024 13:50

Introduce noRetry Parameter for checkcapacity ProvisioningRequest

6e956b8

change default lease name for vpa-recommender to "vpa-recommender-lease"

088da37

update format

a6ddc24

Merge pull request kubernetes#7459 from damikag/update-bootdisk-logs

267a0d8

Change log level of boot dist type and size defaulting in gce_price

Merge pull request kubernetes#7438 from maximrub/bug-7437-alibaba-clo…

009f2b8

…ud-endpoint-reloving-logging 7437 Add logging for endpoint resolving errors

Fix typo in error message

e749f20

Merge pull request kubernetes#7503 from adrianmoisey/typo

97f4089

Fix typo in error message

Merge pull request kubernetes#7453 from gvnc/oci-self-managed-nodes-fix

b01bff1

exclude self-managed nodes from being processed

7435 Support New Alibaba Cloud ENV Variables names for RRSA Authoriza…

dcd6d6a

…tion in Cluster Autoscaler Signed-off-by: Maxim Rubchinsky <[email protected]>

VPA: Poll/PollImmediate PollUntilContextTimeout in v1 e2e tests

f796c7d

Signed-off-by: Omer Aplatony <[email protected]>

Add unit test

3b371f7

Add ForceDeleteNodes method to NodeGroup interface

3b47908

Add ForceDeleteNodes implementation for GCE cloud provider

c5f13bb

Add flag to force remove long unregistered nodes

a0bf108

Split removeOldUnregisteredNodes method

1580315

Merge pull request kubernetes#7385 from jlamillan/jlamillan/oci_sdk_6…

2d37aee

…5.75.2-2 Upgrade OCI providers SDK to v65.75.2.

Merge pull request kubernetes#7493 from BigDarkClown/remove-unneeded

a01276e

Add flag to force remove long unregistered nodes

CA: remove AddNodeWithPods from ClusterSnapshot, replace uses with Ad…

269c7a3

…dNodeInfo We need AddNodeInfo in order to propagate DRA objects through the snapshot, which makes AddNodeWithPods redundant.

CA: remove redundant IsPVCUsedByPods from ClusterSnapshot

3860388

The method is already accessible via StorageInfos(), it's redundant.

CA: remove Clear from ClusterSnapshot

473a1a8

It's now redundant - SetClusterState with empty arguments does the same thing.

Merge pull request kubernetes#6999 from dominic-p/iss-5919-placement-…

4c37ff3

…groups Add support for node pool placement group config

Add logging, add comment, move test util function

ddcb59d

Address nits

14067b7

Merge pull request kubernetes#7466 from towca/jtuznik/dra-snapshot-cl…

30e57c9

…eanup CA: refactor ClusterSnapshot methods

Merge pull request kubernetes#7436 from maximrub/fr-7435-alibaba-clou…

5458e1c

…d-rrsa-new-env-vars 7435 Support New Alibaba Cloud ENV Variables names for RRSA Authorization

refactor(hetzner): refactored placement group code

64495d9

Merge pull request kubernetes#7494 from adrianmoisey/readme_updates

855b940

VPA - Update docs a little

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2025

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2025

elmiko mentioned this pull request Apr 17, 2025

OCPBUGS-54231: Fix cool down status condition to trigger scale down #355

Merged

openshift-merge-bot bot merged commit 8eef719 into openshift:master Apr 17, 2025
14 of 16 checks passed

elmiko deleted the merge-1.32 branch April 18, 2025 11:24

OCPCLOUD-2835: rebase on upstream 1.32.0 release #340

OCPCLOUD-2835: rebase on upstream 1.32.0 release #340

Uh oh!

Conversation

elmiko commented Feb 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elmiko commented Apr 15, 2025

Uh oh!

elmiko commented Apr 15, 2025

Uh oh!

elmiko commented Apr 15, 2025

Uh oh!

JoelSpeed commented Apr 15, 2025

Uh oh!

openshift-ci-robot commented Apr 15, 2025

Uh oh!

openshift-ci-robot commented Apr 15, 2025

Uh oh!

openshift-ci-robot commented Apr 16, 2025

Uh oh!

openshift-ci bot commented Apr 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci-robot commented Apr 16, 2025

Uh oh!

elmiko commented Apr 16, 2025

Uh oh!

damdo commented Apr 16, 2025

Uh oh!

JoelSpeed commented Apr 16, 2025

Uh oh!

elmiko commented Apr 16, 2025

Uh oh!

openshift-ci-robot commented Apr 16, 2025

Uh oh!

elmiko commented Apr 16, 2025

Uh oh!

elmiko commented Apr 16, 2025

Uh oh!

damdo commented Apr 17, 2025

Uh oh!

sdodson commented Apr 17, 2025

Uh oh!

openshift-ci bot commented Apr 17, 2025

Uh oh!

sdodson commented Apr 17, 2025

Uh oh!

Uh oh!

openshift-bot commented Apr 18, 2025

Uh oh!

openshift-bot commented Apr 18, 2025

Uh oh!

elmiko commented Apr 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

37 participants

elmiko commented Feb 27, 2025 •

edited

Loading

openshift-ci bot commented Apr 16, 2025 •

edited

Loading