Skip to content

Conversation

elmiko
Copy link

@elmiko elmiko commented Feb 27, 2025

This commit rebases the autoscaler on top of the Kubernetes/Autoscaler 1.32.0 release. There are several commits that we carry on top of the upstream autoscaler and the rebase process allows us to preserve those. Here is a description of the process I used to create this PR.

(inspired by the commit description for the 1.18 rebase. pr #139)

Process

First we need to identify the carry commits that we currently have, this is done against our previous rebase to catch new changes. Once identified we will drop commits which have merged upstream and only carry unique commits. (see below for the carried and dropped commits).

Identify carry commits (run from the openshift/master branch), these are the commits that begin with UPSTREAM: up until the merge commit for the previous rebase commit (merge upstream/cluster-autoscaler-1.31.0)

git log -n 30 --oneline --no-merges

Important Note for next rebase: due to an error in the upstream cluster-autoscaler-1.32.0 tag, this rebase uses the cluster-autoscaler-1.32 branch as the starting point.

After identifying the carry commits, the next step is to create the new commit-tree that will be used for the rebase and then cherry pick the carry commits into the new branch. The following commands cover these steps:

$ git remote update # make sure we update our refs
$ git checkout cluster-autoscaler-release-1.32
$ git checkout -b merge-tmp # create a temporary branch for our merge commit
$ git checkout openshift/master # we want to be at the tip of the openshift master branch when we run the next command
$ echo 'merge upstream/cluster-autoscaler-release-1.32' | git commit-tree merge-tmp^{tree} -p HEAD -p merge-tmp -F - # create a new merge commit for our history
deadbeef12345678 # id of new merge commit
$ git branch merge-1.32 deadbeef12345678 # create a new branch for the cherry-pick work
$ git checkout merge-1.32
$ git cherry-pick <carry commits> # cherry pick the needed commits into the new branch

With the merge-1.32 branch in place, I cherry picked the carry commits which applied, resolved merge conflicts, and finally tested the resulting tree against the unit test and end-to-end suite.

Carried Commits

These commits are for features which have not yet been accepted upstream, are integral to our CI platform, or are specific to the releases we create for OpenShift.

f42d15b60 UPSTREAM: <carry>: Update unit tests using upstream annotations new values
d90f8b39d UPSTREAM: <carry>: Add unit tests for upstream annotations
d1fdf4c97 UPSTREAM: <carry>: Update to prefer upstream annotations if present
1409aa59f UPSTREAM: <carry>: Fix unstructured taint parsing in Cluster API provider
43224a99b UPSTREAM: <carry>: revert capacity annotations
646212363 UPSTREAM: <carry>: add machine api label and taint functionality
15d127772 UPSTREAM: <carry>: Have VPA ignore phantom containers named "POD"
4dd4b0916 UPSTREAM: <carry>: Handle old Machine API specific machine delete annotation
c80d1386c UPSTREAM: <carry>: Rename FailureMessage to ErrorMessage
512dc15a9 UPSTREAM: <carry>: configure repository for OpenShift releases

Squashed Commits

These commits were squashed into the carried commits to help reduce the length of our history. All these commits have been squashed into their topically related commits.

958747fa3 UPSTREAM: <carry>: Update VPA dockerfile to 4.19
1ed6c1547 UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler-container image to be consistent with ART for 4.19 Reconciling with https://github.com/openshift/ocp-build-data/tree/a39508c86497b4e5e463d7b2c78e51e577be9e7d/images/atomic-openshift-cluster-autoscaler.yml
018c6237c UPSTREAM: <carry>: Updating ose-vertical-pod-autoscaler-container image to be consistent with ART for 4.19 Reconciling with https://github.com/openshift/ocp-build-data/tree/a39508c86497b4e5e463d7b2c78e51e577be9e7d/images/vertical-pod-autoscaler.yml
3a27e72c0 UPSTREAM: <carry>: VPA OWNERS: Remove John Kyros, et al., add Max Cao
33194fb53 UPSTREAM: <carry>: Update VPA dockerfiles

Dropped Commits

These commits were dropped.

91e89f91e UPSTREAM: <carry>: :bug:(metrics) Initialize metrics for autoscaler errors, scale events, and pod evictions
a1f51718a UPSTREAM: 7388: cleanup clusterapi imports

Of special note in this rebase is this commit

43224a99b UPSTREAM: <carry>: revert capacity annotations

due to the scale from zero changes being accepted upstream we can now drop our carried patch. but, the upstream implementation has differed slightly from our's (mainly around annotation names). we will need to carry this patch until we can fix all the providers to properly use the new annotations. This patch can be dropped once the epic contained in https://issues.redhat.com/browse/OCPCLOUD-2136 is completed.

PBundyra and others added 30 commits November 14, 2024 13:50
Change log level of boot dist type and size defaulting in gce_price
…ud-endpoint-reloving-logging

7437 Add logging for endpoint resolving errors
exclude self-managed nodes from being processed
…tion in Cluster Autoscaler

Signed-off-by: Maxim Rubchinsky <[email protected]>
…5.75.2-2

Upgrade OCI providers SDK to v65.75.2.
Add flag to force remove long unregistered nodes
…dNodeInfo

We need AddNodeInfo in order to propagate DRA objects through the
snapshot, which makes AddNodeWithPods redundant.
AddNodes() is redundant - it was indended for batch adding nodes,
with batch-specific optimizations in mind probably. However, it
has always been implemented as just iterating over AddNode(), and
is only used in test code.

Most of the uses in the test code were initializing the cluster state.
They are replaced with SetClusterState(), which will later be needed for
handling DRA anyway (we'll have to start tracking things that aren't
node- or pod-scoped). The other uses are replaced with inline loops over
AddNode().
The method is already accessible via StorageInfos(), it's
redundant.
AddNodeInfo already provides the same functionality, and has to be used
in production code in order to propagate DRA objects correctly.

Uses in production are replaced with SetClusterState(), which will later
take DRA objects into account. Uses in the test code are replaced with
AddNodeInfo().
RemoveNode is renamed to RemoveNodeInfo for consistency with other
NodeInfo methods.

For DRA, the snapshot will have to potentially allocate ResourceClaims
when adding a Pod to a Node, and deallocate them when removing a Pod
from a Node. This will happen in new methods added to ClusterSnapshot
in later commits - SchedulePod and UnschedulePod. These new methods
should be the "default" way of moving pods around the snapshot going
forward.

However, we'll still need to be able to add and remove pods from the
snapshot "forcefully" to handle some corner cases (e.g. expendable pods).
AddPod is renamed to ForceAddPod, and RemovePod to ForceRemovePod to
highlight that these are no longer the "default" methods of moving pods
around the snapshot, and are bypassing something important.
It's now redundant - SetClusterState with empty arguments does the same
thing.
…groups

Add support for node pool placement group config
…eanup

CA: refactor ClusterSnapshot methods
…d-rrsa-new-env-vars

7435 Support New Alibaba Cloud ENV Variables names for RRSA Authorization
@elmiko
Copy link
Author

elmiko commented Apr 15, 2025

this needs the carry patches we added for ocpbugs-11115

Add improved error handling for machines phase in the ClusterAPI node group
implementation. When a machine is in Deleting/Failed/Pending phase, mark the cloudprovider.Instance
with a status for cluster-autoscaler recovery actions.

The changes:
- Enhance Nodes listing to allow reporting the machine phase in Instance status
- Add error status reporting for failed machines

This change helps identify and manage failed machines more effectively,
allowing the autoscaler to make better scaling decisions.
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2025
@elmiko
Copy link
Author

elmiko commented Apr 15, 2025

added the last carry patch from the upstream, we should be ok to merge this now.

@elmiko
Copy link
Author

elmiko commented Apr 15, 2025

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2025
@JoelSpeed
Copy link

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2025
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD b1bbcd4 and 2 for PR HEAD d0100ff in total

2 similar comments
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD b1bbcd4 and 2 for PR HEAD d0100ff in total

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD b1bbcd4 and 2 for PR HEAD d0100ff in total

Copy link

openshift-ci bot commented Apr 16, 2025

@elmiko: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/git-history d0100ff link false /test git-history

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD b1bbcd4 and 2 for PR HEAD d0100ff in total

@elmiko
Copy link
Author

elmiko commented Apr 16, 2025

i think we are again seeing slow node creation on azure.

/test e2e-azure-periodic-pre

@damdo
Copy link
Member

damdo commented Apr 16, 2025

It looks like the git-history checker is picking up the azure: increase UT coverage in azure_vms_pool commit erroneously as a non-conformant non-upstream message.

It looks like this repo is using a custom script https://github.com/openshift/kubernetes-autoscaler/blob/master/hack/verify_history.sh

and this might be a bug in it. I would suggest using commitchecker instead of it: https://github.com/openshift/release/blob/209cbb25d63506072a3089c85b2efdc84624c6e1/ci-operator/config/openshift/cluster-api/openshift-cluster-api-release-4.20.yaml#L173-L178

@JoelSpeed
Copy link

/test e2e-azure-periodic-pre

@elmiko
Copy link
Author

elmiko commented Apr 16, 2025

i've created openshift/release#63941 to address the history stuff.

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD b1bbcd4 and 2 for PR HEAD d0100ff in total

@elmiko
Copy link
Author

elmiko commented Apr 16, 2025

/test e2e-hypershift

1 similar comment
@elmiko
Copy link
Author

elmiko commented Apr 16, 2025

/test e2e-hypershift

@damdo
Copy link
Member

damdo commented Apr 17, 2025

/tide refresh

@sdodson
Copy link
Member

sdodson commented Apr 17, 2025

/payload 4.19 nightly blocking

Copy link

openshift-ci bot commented Apr 17, 2025

@sdodson: trigger 11 job(s) of type blocking for the nightly release of OCP 4.19

  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-nightly-4.19-fips-payload-scan
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-bm
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/917fc7c0-1b8b-11f0-96bd-dba6a0b812d0-0

@sdodson
Copy link
Member

sdodson commented Apr 17, 2025

/test e2e-hypershift

@openshift-merge-bot openshift-merge-bot bot merged commit 8eef719 into openshift:master Apr 17, 2025
14 of 16 checks passed
@openshift-bot
Copy link

[ART PR BUILD NOTIFIER]

Distgit: atomic-openshift-cluster-autoscaler
This PR has been included in build atomic-openshift-cluster-autoscaler-container-v4.19.0-202504180114.p0.g8eef719.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link

[ART PR BUILD NOTIFIER]

Distgit: vertical-pod-autoscaler
This PR has been included in build ose-vertical-pod-autoscaler-container-v4.19.0-202504180114.p0.g8eef719.assembly.stream.el9.
All builds following this will include this PR.

@elmiko
Copy link
Author

elmiko commented Apr 18, 2025

/test

@elmiko elmiko deleted the merge-1.32 branch April 18, 2025 11:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.