Machines not scaling due to ID character limit #182

VanillaSpoon · 2023-11-20T13:46:40Z

Issue link

closes #143

What changes have been made

This PR is for 320e5f0 and will be rebased after Hypershift Integration has been merged.

This PR updates the node pool and machine pool ID generation convention for instascale, replacing the appended instance type with a string of 4 random characters.
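A minimal Go sketch of the random-suffix half of this convention. `randomSuffix` and `suffixCharset` are hypothetical names. The lowercase-alphanumeric charset is an assumption chosen to keep IDs DNS-label safe, not necessarily what the PR uses.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// suffixCharset is an assumption: lowercase letters and digits keep the
// generated ID DNS-label safe. The charset instascale actually uses may differ.
const suffixCharset = "abcdefghijklmnopqrstuvwxyz0123456789"

// randomSuffix returns n characters drawn uniformly from suffixCharset,
// using crypto/rand so no seeding is required.
func randomSuffix(n int) (string, error) {
	b := make([]byte, n)
	limit := big.NewInt(int64(len(suffixCharset)))
	for i := range b {
		idx, err := rand.Int(rand.Reader, limit)
		if err != nil {
			return "", err
		}
		b[i] = suffixCharset[idx.Int64()]
	}
	return string(b), nil
}

func main() {
	s, err := randomSuffix(4)
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // four random characters; output differs on every run
}
```

A 36-symbol charset gives four characters 36^4 = 1,679,616 possible suffixes. Collisions between pools that share a truncated prefix should therefore be rare, although they are not impossible.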

The prefix is the AppWrapper name, truncated if the name plus the suffix would exceed the character limit (15 characters for node pools, 30 for machine pools).
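The truncation rule can be sketched in Go as follows. This sketch assumes the limits apply to the whole ID and that a `-` separates prefix and suffix. `buildPoolID` is a hypothetical helper, not the PR's actual function.

```go
package main

import (
	"fmt"
	"strings"
)

// Total ID length limits quoted in the PR description.
const (
	nodePoolIDLimit    = 15
	machinePoolIDLimit = 30
)

// buildPoolID joins an AppWrapper name and a suffix with "-" (an assumed
// separator), truncating the name so the whole ID fits within limit.
// Kubernetes object names are ASCII, so byte slicing is safe here.
func buildPoolID(appWrapperName, suffix string, limit int) string {
	maxPrefix := limit - len(suffix) - 1 // reserve room for "-" and the suffix
	if maxPrefix < 0 {
		maxPrefix = 0
	}
	prefix := appWrapperName
	if len(prefix) > maxPrefix {
		prefix = prefix[:maxPrefix]
	}
	// Avoid a "--" in the ID if truncation stopped on a hyphen.
	prefix = strings.TrimRight(prefix, "-")
	return prefix + "-" + suffix
}

func main() {
	name := "raycluster-complete" // 19 characters
	fmt.Println(buildPoolID(name, "x7k2", nodePoolIDLimit))    // raycluster-x7k2
	fmt.Println(buildPoolID(name, "x7k2", machinePoolIDLimit)) // raycluster-complete-x7k2
}
```

Taking the suffix as a parameter keeps the truncation logic deterministic and easy to unit test, separate from the random generation.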

This change also updates the scale-down functions to scale down machines by a label pointing to the related AppWrapper (composed of the AppWrapper name and namespace), as opposed to by the machine ID.

These changes also rely on the codeflare-operator e2e tests being updated.

Verification steps

Dispatch an appwrapper (with a long name :) ) on hypershift and OSD with instascale enabled and a specified instance type.

Ensure the machine scales up as expected, and the machine name is truncated to a suitable character length.
Ensure the machine scales down once complete.

Checks

- I've made sure the tests are passing.
- Testing Strategy
  - Unit tests
  - Manual tests
  - Testing is not required for this change

Fiona-Waters

PR looks great; I love the name truncation for node and machine pools. This might be related to your other PR, but I think we could add some error logging in the finalizeScalingDownMachines function so that we can more easily find where the errors are coming from.
I tested this on a ROSA hosted/HyperShift cluster and everything worked as expected: the node pool was scaled up and its name was updated appropriately.
I also tested it on an OSD cluster, but the machine pools did not scale. I couldn't find the culprit and think we should test on another OSD cluster to see whether the errors appear there too.

2023-11-22T15:07:33Z ERROR Reconciler error {"controller": "appwrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "AppWrapper": {"name":"raycluster-complete","namespace":"default"}, "namespace": "default", "name": "raycluster-complete", "reconcileID": "a9baadbf-6a38-4d62-966b-05c2eaf75f3c", "error": "status is 404, identifier is '404', code is 'CLUSTERS-MGMT-404' and operation identifier is 'b22bd107-f842-4009-bc58-815a8628475e': Cluster 'machine_pools' not found"}

VanillaSpoon · 2023-11-24T15:59:52Z

Thanks for your review and feedback @Fiona-Waters :)

I have pushed updates to the logging in the HyperShift PR.

Fiona-Waters · 2023-11-27T09:19:14Z

You're welcome. Has it been retested on OSD?

VanillaSpoon · 2023-11-28T15:46:57Z

Hi Fiona,
This has been retested on OSD and HyperShift, and the errors are now handled more gracefully. Thanks for the feedback. The only commit for this PR is "add: naming convention adjustments to nodepools, and machinepools".

However, I have rebased again with the others for testing purposes :)

Bobbins228

Tested it out, all good. I will drop an lgtm when nodepools is merged. Good stuff, Eoin!

Fiona-Waters · 2024-01-25T09:56:21Z

Just successfully ran this while testing the nodepool e2e.
/lgtm

Bobbins228

/lgtm

astefanutti · 2024-01-25T15:18:37Z

/lgtm

astefanutti · 2024-01-25T15:18:43Z

/approve

openshift-ci · 2024-01-25T15:18:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [astefanutti]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the do-not-merge/work-in-progress label Nov 20, 2023

This was referenced Nov 20, 2023

add: function to return machine pool labels project-codeflare/codeflare-common#15

Merged

test: update machinepool tests to identify machines based on label project-codeflare/codeflare-operator#413

Merged

Fiona-Waters reviewed Nov 22, 2023


VanillaSpoon force-pushed the machineIDConvention branch from 5b00964 to be477bc November 24, 2023 15:57

VanillaSpoon force-pushed the machineIDConvention branch from be477bc to 3780c79 November 28, 2023 15:43

openshift-merge-robot added the needs-rebase label Dec 8, 2023

VanillaSpoon force-pushed the machineIDConvention branch from 3780c79 to 320e5f0 January 16, 2024 12:38

openshift-merge-robot removed the needs-rebase label Jan 16, 2024

Bobbins228 reviewed Jan 19, 2024


add: machine scaledown through the use of machine labels, and addition of machine name character limit

4bf437d

VanillaSpoon force-pushed the machineIDConvention branch from 320e5f0 to 4bf437d January 23, 2024 10:17

VanillaSpoon marked this pull request as ready for review January 23, 2024 10:18

openshift-ci bot removed the do-not-merge/work-in-progress label Jan 23, 2024

openshift-ci bot requested review from anishasthana and Maxusmusti January 23, 2024 10:18

openshift-ci bot assigned Fiona-Waters Jan 25, 2024

openshift-ci bot added the lgtm label Jan 25, 2024

Bobbins228 reviewed Jan 25, 2024


openshift-ci bot assigned Bobbins228 Jan 25, 2024

openshift-ci bot assigned astefanutti Jan 25, 2024

openshift-ci bot added the approved label Jan 25, 2024

openshift-merge-bot bot merged commit 067ae9b into project-codeflare:main Jan 25, 2024
