-
Notifications
You must be signed in to change notification settings - Fork 633
Initial Hyperstack cloud implementation #5220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
9d8a7fc
to
02cab37
Compare
If I am not mistaken, tests are failing because of the missing catalog for the Hyperstack. This should be resolved after merging the PR for the skypilot-catalog. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this @vvidovic ! This is exciting. Left some discussions.
Also, could you help run smoke test on hyperstack to make sure it works properly?
sky/backends/backend_utils.py
Outdated
@@ -897,7 +898,7 @@ def _add_auth_to_cluster_config(cloud: clouds.Cloud, cluster_config_file: str): | |||
elif isinstance(cloud, clouds.Fluidstack): | |||
config = auth.setup_fluidstack_authentication(config) | |||
else: | |||
assert False, cloud | |||
assert False, f'{cloud} - adding authentication to cluster config not implemented.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why changing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the process of adding the Hyperstack provider, there were many errors with non-descriptive messages. I was thinking that this kind of more descriptive error message could help future implementations to easier find where from the error was thrown.
Of course, it is quite easy to remove that if this argument doesn't make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good catch!
I thought that since here we are asserting false, this code path should not be entered and thus adding extra information is not of big use. But if you actually encountered this code path, a valid choice would be raising some error instead - that would be more informative. Could you help changing this to a ValueError
or sth? thanks!
sky/clouds/hyperstack.py
Outdated
@classmethod | ||
def _check_credentials(cls) -> Tuple[bool, Optional[str]]: | ||
try: | ||
assert os.path.exists( | ||
os.path.expanduser(hyperstack_utils.HYPERSTACK_API_KEY_PATH)) | ||
except AssertionError: | ||
return False, ('Failed to access Hyperstack Cloud' | ||
' with credentials. ' | ||
'To configure credentials, go to:\n ' | ||
' https://console.hyperstack.cloud/api-keys \n ' | ||
'to obtain an API key, ' | ||
'then add save the contents ' | ||
'to ~/.hyperstack/api_key \n') | ||
except requests.exceptions.ConnectionError: | ||
return False, ('Failed to verify Hyperstack Cloud credentials. ' | ||
'Check your network connection ' | ||
'and try again.') | ||
return True, None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will a connection error being raised? this seems only checking if there is a path in local system?
also, can we directly use the result of os.path.exists instead of assert it and catching it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems I improperly copied some other cloud logic, let me fix that.
I believe the Fluidstack check is the one this was based on:
skypilot/sky/clouds/fluidstack.py
Line 261 in 34973e2
def _check_compute_credentials(cls) -> Tuple[bool, Optional[str]]:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
sky/clouds/hyperstack.py
Outdated
@classmethod | ||
def query_status(cls, name: str, tag_filters: Dict[str, str], | ||
region: Optional[str], zone: Optional[str], | ||
**kwargs) -> List[status_lib.ClusterStatus]: | ||
status_map = { | ||
'booting': status_lib.ClusterStatus.INIT, | ||
'active': status_lib.ClusterStatus.UP, | ||
'unhealthy': status_lib.ClusterStatus.INIT, | ||
'terminating': None, | ||
'terminated': None, | ||
} | ||
# TODO(ewzeng): filter by hash_filter_string to be safe | ||
status_list = [] | ||
vms = hyperstack_utils.HyperstackClient().list_instances() | ||
possible_names = [f'{name}-head', f'{name}-worker'] | ||
for node in vms: | ||
if node.get('name') in possible_names: | ||
node_status = status_map[node['status']] | ||
if node_status is not None: | ||
status_list.append(node_status) | ||
return status_list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 1036 to 1041 in cad0356
@classmethod | |
def query_status(cls, name: str, tag_filters: Dict[str, str], | |
region: Optional[str], zone: Optional[str], | |
**kwargs) -> List['status_lib.ClusterStatus']: | |
# TODO(suquark): deprecate this method | |
assert False, 'This code path should not be used.' |
this code path should be deprecated? can we remove it to reduce confusion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, this method was not used and statuses were not correct for the Hyperstack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
[--api-key-path API_KEY_PATH] | ||
|
||
If neither --api-key nor --api-key-path are provided, this script will parse | ||
`~/.hyperstack/credentials.yaml` to look for IBM API key (`iam_api_key`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`~/.hyperstack/credentials.yaml` to look for IBM API key (`iam_api_key`). | |
`~/.hyperstack/credentials.yaml` to look for IAM API key (`iam_api_key`). |
should this be IAM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, should the path be ~/.hyperstack/api_key
?
`~/.hyperstack/credentials.yaml` to look for IBM API key (`iam_api_key`). | |
`~/.hyperstack/api_key` to look for IBM API key (`iam_api_key`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a precedence between those cli args?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was an outdated comment copied from the IBM cloud fetcher.
How do you mean to add a precedence? The first one found is used:
api_key
argapi_key_path
arg- default API key path
I think that should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sry, i mean adding the precedence in the comments, which is basically this list's order.
encoding='utf-8') as f: | ||
api_key_res = f.read().strip() | ||
else: | ||
# Read from ~/.lambda_cloud/lambda_keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be hyperstack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
def __init__(self): | ||
self.api_key = read_contents( | ||
os.path.expanduser(HYPERSTACK_API_KEY_PATH)).strip() | ||
|
||
def get_flavors(self): | ||
response = requests.get(f'{ENDPOINT}/core/flavors', | ||
headers={'api_key': self.api_key}) | ||
raise_hyperstack_error(response) | ||
response_json = response.json() | ||
flavor_groups = response_json['data'] | ||
flavors = [ | ||
flavor for group in flavor_groups for flavor in group['flavors'] | ||
] | ||
|
||
return flavors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type annotation for the return value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added List[dict]
. Done.
security_rules = [ | ||
dict( | ||
direction='ingress', | ||
protocol='tcp', | ||
ethertype='IPv4', | ||
remote_ip_prefix='0.0.0.0/0', | ||
port_range_min=22, | ||
port_range_max=22, | ||
) | ||
] | ||
if ports: | ||
for p in ports: | ||
security_rules.append( | ||
dict( | ||
direction='ingress', | ||
protocol='tcp', | ||
ethertype='IPv4', | ||
remote_ip_prefix='0.0.0.0/0', | ||
port_range_min=p, | ||
port_range_max=p, | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have a function to generate one single rule and reuse it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you maybe know if there is a SkyPilot configuration approach that would enable us to set the remote_ip_prefix
(the IP address range from which it is allowed to connect to the port)?
(sorry for the digression)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope - currently we assume all ip can access the skypilot's vm. To confirm, this is essentially the outbound port, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed an issue #5288 to keep track of it. Feel free to discuss any related topics under that issue :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't mean the outbound port, but the outbound IP address.
It is just an additional check on the VM side that it is not available to the entire Internet but only to some specific IP or IP range of addresses.
For example, you could want to allow access from the:
- AWS (because some other component of your system is hosted in AWS)
- your company VPN (for example, you run some internal LLM model in the cloud, but want to allow only your employees to connect to it)
For example, on the Hypercloud, you can configure a Firewall inbound rule with:
- ethernet type: IPv4 or IPv6
- protocol: TCP, UDP or ICMP
- port range: the port(s) that will be opened on the VM instance
- remote IPs: IP subnet or IP address of the remote client that can connect to the "port range"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Currently we dont have this limitation built-in. Just reposted this in the issue and I think we can skip this limit in this PR :))
|
||
return instance_ids | ||
|
||
def list_ssh_keys(self, environment_name: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type annotation for the return value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added List[dict]
. Done.
sky/provision/hyperstack/instance.py
Outdated
add_active: bool = True, | ||
add_pending: bool = True, | ||
add_stopped: bool = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little bit confusing. Can we have a status filter like this?
skypilot/sky/provision/runpod/instance.py
Lines 19 to 35 in 134b8fe
def _filter_instances(cluster_name_on_cloud: str, | |
status_filters: Optional[List[str]], | |
head_only: bool = False) -> Dict[str, Any]: | |
instances = utils.list_instances() | |
possible_names = [f'{cluster_name_on_cloud}-head'] | |
if not head_only: | |
possible_names.append(f'{cluster_name_on_cloud}-worker') | |
filtered_instances = {} | |
for instance_id, instance in instances.items(): | |
if (status_filters is not None and | |
instance['status'] not in status_filters): | |
continue | |
if instance.get('name') in possible_names: | |
filtered_instances[instance_id] = instance | |
return filtered_instances |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An initial implementation was as you suggested, but it looked more confusing to me. The intent of the function was not clear: "Do we want to show active, pending and/or stopped VMs?"
However, you are correct that 3 booleans look a bit confusing too.
Let me spend some more time thinking about this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I can add a "status groups" to filter?
(PENDING
, ACTIVE
, STOPPED
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC the "status group to filter" is exactly the status_filters: Optional[List[str]]
argument in the above function? Could you elaborate on the differences?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 9 recognised Hyperstack statuses (from the Hypercloud responses or web UI):
- 1 of those is
ACTIVE
- 1 of those is
STOPPED
- all other statuses are
PENDING
(BUILD
,CREATING
,STARTING
,REBOOTING
,HARD_REBOOT
&DELETING
)
That means that each time we would like to filter by Hyperstack status group "PENDING" we would need to add all 6 of those to the list and when we want to combine for example "ACTIVE" and "PENDING" it would be harder to understand that we want "all pending statuses" + "active status".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Then adding a status group to filter sounds great to me!
sky/provision/hyperstack/instance.py
Outdated
# subprocess_utils.run_in_parallel(get_internal_ip, | ||
# list(running_instances.values())) | ||
head_instance_id = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# subprocess_utils.run_in_parallel(get_internal_ip, | |
# list(running_instances.values())) | |
head_instance_id = None | |
head_instance_id = None |
accidental?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
86f5244
to
beb3523
Compare
Initial implementation of the SkyPilot provisioning for the Hyperstack cloud.
@cblmemo - thanks for the review. There are 2 issues left:
I made a few other changes:
I implemented changes using a force push to my fork. This has a downside in that there is no history of changes I performed based on your review, but it makes the future merge simpler. |
I have run the following tests:
I tried to run the following, but these are failing:
There are a lot of errors related to the storage, here is the end of the output:
I didn't implement any storage-related functions for the Hyperstack, but I am not sure this is causing this issue. |
Thanks for your consideration! Yeah, force push will make review harder due to the missing history.. Actually, we have the squash & merge feature enabled in our repo, which means that you can just use the normal git merge and all changes will be combined into one commit to our repo. So I dont think you need to worry about that :) |
There are some tests that requires a cloud with storage enabled, like aws or gcp. You can add a mark to skip them. Reference: skypilot/tests/smoke_tests/test_sky_serve.py Lines 320 to 321 in d40331d
You might also need to modify some places in the smoke test to enable hyperstack in the smoke test. You can reference to one recent PR: #4573 |
Grouping statuses in 3 groups, relevant for the business logic.
The first letter must be lowercase.
@cblmemo - it seems that I had issues with test configurations. I needed to add the new Hyperstack cloud in few places in test files. There was one error in catalog ( I removed those from my local catalog and the single test run was successful:
I will try to run as many tests as possible and let you know the results. |
Tests skipped for Hyperstack: - tests requiring aws or gcp - tests using other clouds - K80s not available - T4 not available - Spot instances not supported
Thanks! Ideally all the smoke test should pass. Please ping me when all of them are resolved and I'll take another pass ;) |
Initial implementation of the SkyPilot provisioning for the Hyperstack cloud.
Tested (run the relevant ones):
bash format.sh
/smoke-test
(CI) orpytest tests/test_smoke.py
(local)/smoke-test -k test_name
(CI) orpytest tests/test_smoke.py::test_name
(local)/quicktest-core
(CI) orpytest tests/smoke_tests/test_backward_compat.py
(local)