[Spot/Serve] Fix controller resources fetching #3468

Merged
merged 2 commits into from
Apr 23, 2024

@Michaelvll commented Apr 23, 2024

The previous code fetched the controller record using the wrong name, so when the controller already existed, starting a service or spot job failed to reuse the existing controller's resources.
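A minimal sketch of the failure mode, assuming (as the diff in this PR suggests) that each controller spec carries both a display `name` and a separate `cluster_name`. `ControllerSpec`, the example strings, and the in-memory `_CLUSTERS` table are hypothetical stand-ins, not SkyPilot's actual definitions; `get_cluster_from_name` below is a local stub that only borrows the name of the real function.

```python
import dataclasses
import enum
from typing import Dict, Optional


@dataclasses.dataclass(frozen=True)
class ControllerSpec:
    """Hypothetical spec: a display label plus the recorded cluster name."""
    name: str
    cluster_name: str


class Controllers(enum.Enum):
    # Both strings are made-up values for illustration.
    SERVE = ControllerSpec(name='serve controller',
                           cluster_name='sky-serve-controller-demo')


# Stand-in for the cluster table, keyed by cluster name.
_CLUSTERS: Dict[str, dict] = {
    'sky-serve-controller-demo': {'cloud': 'AWS', 'instance_type': 'm6i.xlarge'},
}


def get_cluster_from_name(cluster_name: str) -> Optional[dict]:
    """Local stub: return the record for a cluster name, or None."""
    return _CLUSTERS.get(cluster_name)


controller = Controllers.SERVE

# Looking up by the display label misses the record even though it exists.
exists_before_fix = get_cluster_from_name(controller.value.name) is not None

# Looking up by the recorded cluster name finds it.
exists_after_fix = get_cluster_from_name(
    controller.value.cluster_name) is not None
```

Under these assumptions the lookup by display name reports that no controller exists, which is consistent with the mismatch error in the reproduction steps.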

To reproduce:

  1. Have a serve controller on AWS
  2. sky serve up --cloud gcp ...
    The service launching command will fail with the following error:
sky.exceptions.ResourcesMismatchError: Requested resources do not match the existing cluster.
  Requested:    {1x GCP(cpus=4+, disk_size=200, ports=['30001-30100'])}
  Existing:     1x AWS(m6i.xlarge, disk_size=200, ports=['30001-30100'])

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • The reproducible scripts above
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@@ -337,7 +337,7 @@ def get_controller_resources(
         controller_resources)[0]
 
     controller_exist = (global_user_state.get_cluster_from_name(
-        controller.value.name) is not None)
+        controller.value.cluster_name) is not None)
Member

When the controller exists, why return controller_resources_to_use (which is copied from the default resources) rather than that controller’s launched resources?

Collaborator Author

I suppose the default controller resources can always be scheduled on an existing controller, but I changed it to return the existing controller's resources for clarity.
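The choice discussed in this thread can be sketched roughly as follows. This is an illustration only: `choose_controller_resources`, the dict-based records, and the `launched_resources` key are hypothetical and do not reflect the actual body of `get_controller_resources`.

```python
from typing import Optional


def choose_controller_resources(default_resources: dict,
                                existing_record: Optional[dict]) -> dict:
    """Return the resources a new service or spot job should request.

    If a controller record exists, reuse what it was launched with;
    otherwise fall back to a copy of the defaults.
    """
    if existing_record is None:
        return dict(default_resources)
    return dict(existing_record['launched_resources'])


# Hypothetical inputs loosely modeled on the error message in this PR.
defaults = {'cloud': None, 'cpus': '4+', 'disk_size': 200}
existing = {
    'launched_resources': {
        'cloud': 'AWS',
        'instance_type': 'm6i.xlarge',
        'disk_size': 200,
    }
}

fresh = choose_controller_resources(defaults, None)
reused = choose_controller_resources(defaults, existing)
```

Returning the launched resources makes the request match the existing cluster by construction, rather than relying on the defaults happening to fit it.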

@Michaelvll merged commit e7b812c into master Apr 23, 2024
20 checks passed
@Michaelvll deleted the fix-controller-name branch April 23, 2024 22:39