Skip to content

Machines not scaling due to ID character limit #182

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

VanillaSpoon
Copy link
Contributor

@VanillaSpoon VanillaSpoon commented Nov 20, 2023

Issue link

closes #143

What changes have been made

This Pr is for 320e5f0 and will be rebased after Hypershift Integration has been merged.

This pr contains an update to the nodepools, and machinepool ID generation convention for instascale. Replacing the appended instance-type with a string of 4 random characters.

The prefix is the appwrapper name (Truncated if over the character limit when appended to the suffix, Nodepools 15, MachinePools 30).

This change also updates the scaledown functions to scaledown machines with a label pointing to the related appwrapper (composed of the appwrapper name, and namespace), as apposed to the machine ID.

These changes also rely on the codeflare-operator e2e tests being updated.

Verification steps

Dispatch an appwrapper (with a long name :) ) on hypershift and OSD with instascale enabled and a specified instance type.

Ensure the machine scales up as expected, and the machine name is truncated to a suitable character length.
Ensure the machine scales down once complete.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

Copy link
Contributor

@Fiona-Waters Fiona-Waters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks great, love the name truncation for node and machine pools. I think it might be related to your other PR but I think we could add some error logging in the finalizeScalingDownMachines function so that we can find where the errors are coming from more easily.
I tested this on a rosa hosted/hypershift cluster and all worked as expected, the node pool was scaled up and name updated appropriately.
I also tested it on an OSD cluster but came across some issues with the machine pools not scaling. I didn't find the culprit and think we should test on another osd cluster to see if errors are seen there also.

2023-11-22T15:07:33Z ERROR Reconciler error {"controller": "appwrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "AppWrapper": {"name":"raycluster-complete","namespace":"default"}, "namespace": "default", "name": "raycluster-complete", "reconcileID": "a9baadbf-6a38-4d62-966b-05c2eaf75f3c", "error": "status is 404, identifier is '404', code is 'CLUSTERS-MGMT-404' and operation identifier is 'b22bd107-f842-4009-bc58-815a8628475e': Cluster 'machine_pools' not found"}

@VanillaSpoon
Copy link
Contributor Author

Thanks for your review and feedback @Fiona-Waters :)

I have pushed updates to the logging within the hypershift pr

@Fiona-Waters
Copy link
Contributor

Thanks for your review and feedback @Fiona-Waters :)

I have pushed updates to the logging within the hypershift pr

You're welcome. Has it been retested on OSD?

@VanillaSpoon
Copy link
Contributor Author

Hi Fiona,
This has be retested on OSD and Hypershift, the errors are handled in a more graceful manner now. Thanks for the feedback. The only commit for this pr is add: naming convention adjustments to nodepools, and machinepools

However, I have rebased again with the others for testing purposes :)

Copy link
Contributor

@Bobbins228 Bobbins228 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested it out all good I will drop a lgtm when nodepools is merged good stuff Eoin

@VanillaSpoon VanillaSpoon marked this pull request as ready for review January 23, 2024 10:18
@Fiona-Waters
Copy link
Contributor

Just successfully ran this while testing the nodepool e2e.
/lgtm

Copy link
Contributor

@Bobbins228 Bobbins228 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@astefanutti
Copy link
Contributor

/lgtm

@astefanutti
Copy link
Contributor

/approve

Copy link

openshift-ci bot commented Jan 25, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 067ae9b into project-codeflare:main Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NodePool and MachinePool name character limit
5 participants