
Initial Hyperstack cloud implementation #5220


Open
wants to merge 12 commits into base: master

Conversation

@vvidovic (Contributor) commented Apr 15, 2025

Initial implementation of the SkyPilot provisioning for the Hyperstack cloud.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@vvidovic (Contributor Author):

If I am not mistaken, the tests are failing because the catalog for Hyperstack is missing. This should be resolved after the corresponding PR for skypilot-catalog is merged.

@cblmemo (Collaborator) left a comment

Thanks for adding this @vvidovic ! This is exciting. Left some discussions.

Also, could you help run smoke test on hyperstack to make sure it works properly?

@@ -897,7 +898,7 @@ def _add_auth_to_cluster_config(cloud: clouds.Cloud, cluster_config_file: str):
     elif isinstance(cloud, clouds.Fluidstack):
         config = auth.setup_fluidstack_authentication(config)
     else:
-        assert False, cloud
+        assert False, f'{cloud} - adding authentication to cluster config not implemented.'
Collaborator:

Why change this?

Contributor Author:

While adding the Hyperstack provider, I ran into many errors with non-descriptive messages. I thought that this kind of more descriptive error message could help future implementations more easily find where the error was thrown from.

Of course, it is quite easy to remove that if this argument doesn't make sense.

Collaborator:

Oh, good catch!

I thought that since we are asserting false here, this code path should not be entered, so adding extra information is not of much use. But if you actually encountered this code path, a valid choice would be raising some error instead - that would be more informative. Could you help change this to a ValueError or something? Thanks!
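
For illustration, a minimal sketch of what the suggested change could look like (a sketch only, assuming a plain ValueError; the exact message wording is not decided in this thread):

# Hypothetical sketch: replace the bare `assert False, cloud` with an
# informative exception so an unsupported cloud fails with a clear message.
class UnsupportedCloud:
    """Stand-in for a cloud object without authentication support."""


def add_auth_to_cluster_config(cloud, config):
    # Per-cloud branches (AWS, GCP, Fluidstack, ...) elided for brevity.
    raise ValueError(
        f'Adding authentication to the cluster config is not implemented '
        f'for cloud: {cloud!r}.')


# Example: an unsupported cloud now raises a descriptive ValueError instead
# of a bare AssertionError.
try:
    add_auth_to_cluster_config(UnsupportedCloud(), config={})
except ValueError as e:
    print(e)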

Comment on lines 251 to 268
    @classmethod
    def _check_credentials(cls) -> Tuple[bool, Optional[str]]:
        try:
            assert os.path.exists(
                os.path.expanduser(hyperstack_utils.HYPERSTACK_API_KEY_PATH))
        except AssertionError:
            return False, ('Failed to access Hyperstack Cloud'
                           ' with credentials. '
                           'To configure credentials, go to:\n '
                           ' https://console.hyperstack.cloud/api-keys \n '
                           'to obtain an API key, '
                           'then add save the contents '
                           'to ~/.hyperstack/api_key \n')
        except requests.exceptions.ConnectionError:
            return False, ('Failed to verify Hyperstack Cloud credentials. '
                           'Check your network connection '
                           'and try again.')
        return True, None
Collaborator:

When will a connection error be raised? This seems to only check whether a path exists on the local system.

Also, can we directly use the result of os.path.exists instead of asserting it and catching the exception?

Contributor Author:

It seems I improperly copied some other cloud logic, let me fix that.

I believe the Fluidstack check is the one this was based on:

Contributor Author:

done.
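
For reference, a minimal sketch of the direct check suggested above (a sketch only; the final error-message wording in the PR may differ):

# Hypothetical sketch: use os.path.exists directly instead of asserting it
# and catching AssertionError (and drop the unreachable ConnectionError case).
import os
from typing import Optional, Tuple

HYPERSTACK_API_KEY_PATH = '~/.hyperstack/api_key'


def check_credentials() -> Tuple[bool, Optional[str]]:
    """Return (ok, hint), where hint explains how to configure credentials."""
    if os.path.exists(os.path.expanduser(HYPERSTACK_API_KEY_PATH)):
        return True, None
    return False, ('Failed to access Hyperstack Cloud with credentials. '
                   'To configure credentials, obtain an API key at '
                   'https://console.hyperstack.cloud/api-keys and save it '
                   f'to {HYPERSTACK_API_KEY_PATH}.')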

Comment on lines 293 to 313
    @classmethod
    def query_status(cls, name: str, tag_filters: Dict[str, str],
                     region: Optional[str], zone: Optional[str],
                     **kwargs) -> List[status_lib.ClusterStatus]:
        status_map = {
            'booting': status_lib.ClusterStatus.INIT,
            'active': status_lib.ClusterStatus.UP,
            'unhealthy': status_lib.ClusterStatus.INIT,
            'terminating': None,
            'terminated': None,
        }
        # TODO(ewzeng): filter by hash_filter_string to be safe
        status_list = []
        vms = hyperstack_utils.HyperstackClient().list_instances()
        possible_names = [f'{name}-head', f'{name}-worker']
        for node in vms:
            if node.get('name') in possible_names:
                node_status = status_map[node['status']]
                if node_status is not None:
                    status_list.append(node_status)
        return status_list
Collaborator:

skypilot/sky/clouds/aws.py, lines 1036 to 1041 (at cad0356):

    @classmethod
    def query_status(cls, name: str, tag_filters: Dict[str, str],
                     region: Optional[str], zone: Optional[str],
                     **kwargs) -> List['status_lib.ClusterStatus']:
        # TODO(suquark): deprecate this method
        assert False, 'This code path should not be used.'

This code path should be deprecated? Can we remove it to reduce confusion?

Contributor Author:

True, this method was not used and statuses were not correct for the Hyperstack.

Contributor Author:

done.

[--api-key-path API_KEY_PATH]

If neither --api-key nor --api-key-path are provided, this script will parse
`~/.hyperstack/credentials.yaml` to look for IBM API key (`iam_api_key`).
Collaborator:

Suggested change
- `~/.hyperstack/credentials.yaml` to look for IBM API key (`iam_api_key`).
+ `~/.hyperstack/credentials.yaml` to look for IAM API key (`iam_api_key`).

should this be IAM?

Collaborator:

also, should the path be ~/.hyperstack/api_key?

Suggested change
- `~/.hyperstack/credentials.yaml` to look for IBM API key (`iam_api_key`).
+ `~/.hyperstack/api_key` to look for IBM API key (`iam_api_key`).

Collaborator:

Can we add a precedence between those CLI args?

Contributor Author:

It was an outdated comment copied from the IBM cloud fetcher.

What do you mean by adding a precedence? The first one found is used:

  • api_key arg
  • api_key_path arg
  • default API key path

I think that should be fine.

Contributor Author:

done.

Collaborator:

Sorry, I mean adding the precedence in the comments, which is basically this list's order.
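
For illustration, the precedence note could read roughly like this in the fetcher's usage comment (hypothetical wording only):

# API key resolution order (first match wins):
#   1. --api-key
#   2. --api-key-path
#   3. ~/.hyperstack/api_key (default path)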

encoding='utf-8') as f:
api_key_res = f.read().strip()
else:
# Read from ~/.lambda_cloud/lambda_keys
Collaborator:

should this be hyperstack?

Contributor Author:

done.

Comment on lines 53 to 67
    def __init__(self):
        self.api_key = read_contents(
            os.path.expanduser(HYPERSTACK_API_KEY_PATH)).strip()

    def get_flavors(self):
        response = requests.get(f'{ENDPOINT}/core/flavors',
                                headers={'api_key': self.api_key})
        raise_hyperstack_error(response)
        response_json = response.json()
        flavor_groups = response_json['data']
        flavors = [
            flavor for group in flavor_groups for flavor in group['flavors']
        ]

        return flavors
Collaborator:

type annotation for the return value?

Contributor Author:

Added List[dict]. Done.
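
For illustration, the annotated signature would look roughly like this (a sketch; a stricter alias such as List[Dict[str, Any]] would also be possible):

# Hypothetical sketch of the return-type annotation discussed above.
from typing import List


class HyperstackClient:

    def get_flavors(self) -> List[dict]:
        """Return the flattened list of flavor dicts from /core/flavors."""
        ...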

Comment on lines 106 to 126
    security_rules = [
        dict(
            direction='ingress',
            protocol='tcp',
            ethertype='IPv4',
            remote_ip_prefix='0.0.0.0/0',
            port_range_min=22,
            port_range_max=22,
        )
    ]
    if ports:
        for p in ports:
            security_rules.append(
                dict(
                    direction='ingress',
                    protocol='tcp',
                    ethertype='IPv4',
                    remote_ip_prefix='0.0.0.0/0',
                    port_range_min=p,
                    port_range_max=p,
                ))
Collaborator:

can we have a function to generate one single rule and reuse it?

Contributor Author:

Do you maybe know if there is a SkyPilot configuration approach that would let us set the remote_ip_prefix (the IP address range that is allowed to connect to the port)?
(Sorry for the digression.)

Contributor Author:

done.
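
For illustration, a minimal sketch of the single-rule helper asked for above (the function names and the remote_ip_prefix default are assumptions, not necessarily what the PR ends up with):

# Hypothetical sketch: build one ingress rule and reuse it for SSH and any
# user-requested ports.
from typing import List, Optional


def _ingress_rule(port: int, remote_ip_prefix: str = '0.0.0.0/0') -> dict:
    """Return one TCP ingress rule opening `port` to `remote_ip_prefix`."""
    return dict(
        direction='ingress',
        protocol='tcp',
        ethertype='IPv4',
        remote_ip_prefix=remote_ip_prefix,
        port_range_min=port,
        port_range_max=port,
    )


def build_security_rules(ports: Optional[List[int]] = None) -> List[dict]:
    """SSH (port 22) is always opened; extra ports are appended if given."""
    rules = [_ingress_rule(22)]
    for p in ports or []:
        rules.append(_ingress_rule(p))
    return rules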

Collaborator:

Nope - currently we assume all IPs can access SkyPilot's VMs. To confirm, this is essentially the outbound port, right?

Collaborator:

Filed an issue #5288 to keep track of it. Feel free to discuss any related topics under that issue :))

Contributor Author (@vvidovic, Apr 20, 2025):

I didn't mean the outbound port, but the outbound IP address.
It is just an additional check on the VM side so that it is not reachable from the entire Internet but only from a specific IP or IP range.

For example, you could want to allow access only from:

  • AWS (because some other component of your system is hosted in AWS)
  • your company VPN (for example, you run some internal LLM model in the cloud but want only your employees to be able to connect to it)

For example, on Hyperstack you can configure a firewall inbound rule with:

  • ethernet type: IPv4 or IPv6
  • protocol: TCP, UDP or ICMP
  • port range: the port(s) that will be opened on the VM instance
  • remote IPs: the IP subnet or IP address of the remote client that may connect to the "port range"

Collaborator:

Got it. Currently we don't have this limitation built in. Just reposted this in the issue, and I think we can skip this limit in this PR :))


return instance_ids

def list_ssh_keys(self, environment_name: str):
Collaborator:

type annotation for the return value?

Contributor Author:

Added List[dict]. Done.

Comment on lines 71 to 73
add_active: bool = True,
add_pending: bool = True,
add_stopped: bool = True,
Collaborator:

This is a little bit confusing. Can we have a status filter like this?

def _filter_instances(cluster_name_on_cloud: str,
                      status_filters: Optional[List[str]],
                      head_only: bool = False) -> Dict[str, Any]:
    instances = utils.list_instances()
    possible_names = [f'{cluster_name_on_cloud}-head']
    if not head_only:
        possible_names.append(f'{cluster_name_on_cloud}-worker')
    filtered_instances = {}
    for instance_id, instance in instances.items():
        if (status_filters is not None and
                instance['status'] not in status_filters):
            continue
        if instance.get('name') in possible_names:
            filtered_instances[instance_id] = instance
    return filtered_instances

Contributor Author:

My initial implementation was as you suggested, but it looked more confusing to me; the intent of the function was not clear: "Do we want to show active, pending and/or stopped VMs?"
However, you are correct that three booleans look a bit confusing too.
Let me spend some more time thinking about this...

Contributor Author:

Maybe I can add "status groups" to filter by?
(PENDING, ACTIVE, STOPPED)

Collaborator:

IIUC the "status group to filter" is exactly the status_filters: Optional[List[str]] argument in the above function? Could you elaborate on the differences?

Contributor Author:

There are 9 recognised Hyperstack statuses (from the Hyperstack responses or web UI):

  • 1 of those is ACTIVE
  • 1 of those is STOPPED
  • all other statuses are PENDING (BUILD, CREATING, STARTING, REBOOTING, HARD_REBOOT & DELETING)

That means that each time we want to filter by the Hyperstack status group "PENDING", we would need to add all of those statuses to the list. And when we want to combine, for example, "ACTIVE" and "PENDING", it would be harder to see that we want "all pending statuses" + "the active status".

Collaborator:

Got it. Then adding a status group to filter sounds great to me!
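
For illustration, a minimal sketch of the status-group filter agreed on here (the group membership follows the statuses listed in this thread; the exact Hyperstack status strings and helper names are assumptions, not the PR's final code):

# Hypothetical sketch only.
import enum
from typing import Any, Dict, List, Optional


class StatusGroup(enum.Enum):
    ACTIVE = frozenset({'ACTIVE'})
    STOPPED = frozenset({'STOPPED'})
    PENDING = frozenset({'BUILD', 'CREATING', 'STARTING', 'REBOOTING',
                         'HARD_REBOOT', 'DELETING'})


def filter_instances(instances: Dict[str, Dict[str, Any]],
                     cluster_name_on_cloud: str,
                     status_groups: Optional[List[StatusGroup]] = None,
                     head_only: bool = False) -> Dict[str, Dict[str, Any]]:
    """Keep this cluster's instances whose status falls in the given groups."""
    possible_names = [f'{cluster_name_on_cloud}-head']
    if not head_only:
        possible_names.append(f'{cluster_name_on_cloud}-worker')
    allowed = None
    if status_groups is not None:
        allowed = set().union(*(group.value for group in status_groups))
    filtered = {}
    for instance_id, instance in instances.items():
        if instance.get('name') not in possible_names:
            continue
        if allowed is not None and instance['status'] not in allowed:
            continue
        filtered[instance_id] = instance
    return filtered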

# subprocess_utils.run_in_parallel(get_internal_ip,
# list(running_instances.values()))
head_instance_id = None
Collaborator:

Suggested change
- # subprocess_utils.run_in_parallel(get_internal_ip,
- # list(running_instances.values()))
- head_instance_id = None
+ head_instance_id = None

accidental?

Contributor Author:

done.

@vvidovic force-pushed the master branch 2 times, most recently from 86f5244 to beb3523 on April 18, 2025 20:57

Commit: Initial implementation of the SkyPilot provisioning for the Hyperstack cloud.
@vvidovic (Contributor Author):

@cblmemo - thanks for the review. There are 2 issues left:

  • more descriptive error message (remove or not)
  • simplify filtering instances method

I made a few other changes:

I implemented the changes using a force push to my fork. The downside is that there is no history of the changes I made based on your review, but it makes the future merge simpler.
If you don't like this approach, I will continue using normal commits and maybe rebase them at the end (for a simpler merge).

@vvidovic (Contributor Author) commented Apr 19, 2025

Thanks for adding this @vvidovic ! This is exciting. Left some discussions.

Also, could you help run smoke test on hyperstack to make sure it works properly?

I have run the following tests:

# Check credentials:
python sky/cli.py check

python sky/cli.py launch -c test-single-instance --cloud hyperstack --gpus RTX-A6000 echo hi
python sky/cli.py stop test-single-instance
python sky/cli.py start test-single-instance

python sky/cli.py launch -c test-single-instance echo hi
python sky/cli.py down test-single-instance

# Terminate a stopped instance:
python sky/cli.py launch -c test-single-instance --cloud hyperstack --gpus RTX-A6000 echo hi
python sky/cli.py stop test-single-instance; sky down test-single-instance

python sky/cli.py launch --cloud hyperstack --gpus RTX-A6000 -c test-autostop -i 1 echo hi
python sky/cli.py launch --cloud hyperstack --gpus RTX-A6000 -c test-autodown -i 1 --down echo hi
# Make sure it works with other clouds:
python sky/cli.py launch -c test-all echo hi

I tried to run the following, but these are failing:

pytest tests/test_smoke.py --hyperstack

There are a lot of errors related to storage; here is the end of the output:

...
==================================================================================================== short test summary info ====================================================================================================
FAILED tests/test_smoke.py::test_example_app - FileNotFoundError: [Errno 2] No such file or directory: 'sky'
...
ERROR tests/test_smoke.py::TestStorageWithCredentials::test_public_bucket[s3://digitalcorpora-StoreType.S3] - sky.exceptions.ResourcesUnavailableError: Storage 'store: s3' specified, but AWS access is disabled. To fix, enable AWS by running `sky check`. More info: https://docs.skypilot.co/en/latest/getting-started/installation.h...
101 failed, 4 passed, 209 skipped, 5104 warnings, 4 errors in 7.92s

I didn't implement any storage-related functions for Hyperstack, but I am not sure whether that is what is causing this issue.

@cblmemo (Collaborator) commented Apr 19, 2025

(quoting @vvidovic's comment above about the two remaining issues and the force push)

Thanks for your consideration! Yeah, a force push will make review harder due to the missing history.

Actually, we have the squash & merge feature enabled in our repo, which means that you can just use normal git merges and all changes will be combined into one commit in our repo. So I don't think you need to worry about that :)

@cblmemo (Collaborator) commented Apr 19, 2025

(quoting @vvidovic's test report above, including the failing pytest tests/test_smoke.py --hyperstack run and the storage-related errors)

There are some tests that require a cloud with storage enabled, like AWS or GCP. You can add a mark to skip them. Reference:

@pytest.mark.no_fluidstack # Fluidstack does not support T4 gpus for now
@pytest.mark.no_vast # Vast has low availability of T4 GPUs

You might also need to modify some places in the smoke tests to enable Hyperstack there. You can reference a recent PR: #4573
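
For illustration, a sketch of how such a skip mark might be applied to a storage-dependent smoke test (the mark name no_hyperstack and the test body are assumptions, mirroring the no_fluidstack / no_vast marks quoted above):

# Hypothetical example only: mark name and test body are illustrative.
import pytest


@pytest.mark.no_hyperstack  # Hyperstack has no object-storage support yet
def test_storage_backed_task():
    # The real smoke test would launch a task that mounts an S3/GCS bucket.
    pass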

@vvidovic (Contributor Author):

@cblmemo - it seems that I had issues with the test configuration. I needed to add the new Hyperstack cloud in a few places in the test files. There was one error in the catalog (Hyperstack vs hyperstack), so I hope everything works now.
Unfortunately, Hyperstack is currently low on resources - no RTX-A4000 or RTX-A6000 instances are currently available.

I removed those from my local catalog and the single test run was successful:

pytest tests/smoke_tests/test_basic.py --hyperstack -k test_minimal

D 04-22 14:45:18 skypilot_config.py:263] using default user config file: ~/.sky/config.yaml
D 04-22 14:45:18 skypilot_config.py:282] using default project config file: .sky.yaml
D 04-22 14:45:18 skypilot_config.py:303] final config: {}
bringing up nodes...
[minimal] Test started. Log: less -r /tmp/minimal-_rjiu_yy.log
[minimal] Passed.
[minimal] Log: less -r /tmp/minimal-_rjiu_yy.log
[minimal] 
.
1 passed, 656 warnings in 330.78s (0:05:30)

I will try to run as many tests as possible and let you know the results.

Commit: Tests skipped for Hyperstack:
- tests requiring aws or gcp
- tests using other clouds
- K80s not available
- T4 not available
- Spot instances not supported
@cblmemo (Collaborator) commented Apr 27, 2025

(quoting @vvidovic's comment above about the test-configuration fixes and the successful test_minimal run)

Thanks! Ideally all the smoke tests should pass. Please ping me when all of them are resolved and I'll take another pass ;)
