[k8s] Enable multiple kubernetes contexts for failover #3968
Thanks @Michaelvll! Took a quick look
sky/clouds/kubernetes.py
Outdated
allowed_contexts = skypilot_config.get_nested(
    ('kubernetes', 'allowed_contexts'), None)
if allowed_contexts is None:
    return cls._regions
[commentary, no action required] I am liking the idea of using regions (instead of clouds) to do multi-kubernetes. In the future, if we want to enable multi-k8s out of the box, we can simply return all contexts here :)
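To make the "contexts as regions" idea concrete, here is a minimal sketch, not the PR's actual code. `Region`, `get_nested`, `regions_from_contexts`, and the context name `gke-cluster` are hypothetical; the config is a plain dict standing in for a parsed ~/.sky/config.yaml; and falling back to the current context when no allow-list is set is an assumption about what `cls._regions` holds.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical region object whose name is a kubeconfig context."""
    name: str


def get_nested(config: Dict[str, Any], keys: Tuple[str, ...],
               default: Any = None) -> Any:
    """Walk a nested dict by keys; return default if any key is missing."""
    cur: Any = config
    for key in keys:
        if not isinstance(cur, dict) or key not in cur:
            return default
        cur = cur[key]
    return cur


def regions_from_contexts(config: Dict[str, Any], current_context: str,
                          existing_contexts: Set[str]) -> List[Region]:
    """Map allowed Kubernetes contexts to regions, keeping config order."""
    allowed = get_nested(config, ('kubernetes', 'allowed_contexts'), None)
    if allowed is None:
        # Assumption: with no allow-list, only the active context is used.
        return [Region(current_context)]
    # Config order is treated as failover order; unknown contexts are skipped.
    return [Region(ctx) for ctx in allowed if ctx in existing_contexts]


# Equivalent of:
#   kubernetes:
#     allowed_contexts: [kind-skypilot, gke-cluster]
config = {'kubernetes': {'allowed_contexts': ['kind-skypilot', 'gke-cluster']}}
regions = regions_from_contexts(config, 'kind-skypilot',
                                {'kind-skypilot', 'gke-cluster'})
print([region.name for region in regions])
```

Returning every known context when the allow-list is absent, as suggested above for out-of-the-box multi-k8s, would only change the fallback branch.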
Conceptually, I find the following mapping clearer:
k8s contexts -> local cloud config profiles.
Because both of them contain:
- the identity to use for accessing the resource pool (k8s: user + namespace; cloud config: account)
- the resource pool to look at (k8s: cluster; cloud config: project to use)
I think the current way is a simple workaround for now, but we may need a better design in the future. The main confusion with using regions may come from the fact that multiple contexts can map to the same k8s cluster with a different namespace or user.
Agreed, we probably need a better solution. I just realized many properties in the config may need to be updated in the near future to work well for multi-cluster (e.g., some contexts may need ports: ingress, while others may need ports: loadbalancer. Same for other fields).
Yes, while updating the code to always show the context as the region, I also realized that more places need to be updated, especially the failover code in Kubernetes._get_feasible_launchable_resources. If we have two clusters with different resource sets, our failover will likely disregard all the Kubernetes clusters if the cluster without the resource is the currently active context.
Marking this PR as draft for now to fix this issue.
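The failure mode described here can be illustrated with a small sketch. It is only a model of the problem, not the logic in Kubernetes._get_feasible_launchable_resources: the function names, the context names `small` and `large`, and the "largest node CPU count per context" table are all hypothetical.

```python
from typing import Dict, List


def feasible_active_only(requested_cpus: float, active_context: str,
                         max_node_cpus: Dict[str, float]) -> List[str]:
    """Problematic variant: only the active context is consulted."""
    if max_node_cpus.get(active_context, 0) >= requested_cpus:
        return [active_context]
    return []


def feasible_per_context(requested_cpus: float, contexts: List[str],
                         max_node_cpus: Dict[str, float]) -> List[str]:
    """Check each allowed context separately, preserving failover order."""
    return [ctx for ctx in contexts
            if max_node_cpus.get(ctx, 0) >= requested_cpus]


# Hypothetical capacities: the largest node in each cluster, in CPUs.
capacity = {'small': 2, 'large': 8}

# With 'small' active, a 4-CPU request rules out Kubernetes entirely...
print(feasible_active_only(4, 'small', capacity))
# ...whereas a per-context check still finds the second cluster.
print(feasible_per_context(4, ['small', 'large'], capacity))
```

Checking each context independently is what lets a request that does not fit the active cluster still fail over to another one.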
Thanks @Michaelvll! Tested it and works nicely. Left some comments.
We realized that we need to update the code that checks resource feasibility on a Kubernetes cluster to support different contexts and make failover fully functional. Changed this PR to draft for now to fix that issue.
Fixed the feasible-resources check; it now seems to work with multiple Kubernetes clusters containing different resources.
Test setup
TODO:
Co-authored-by: Romil Bhardwaj <[email protected]>
This is awesome, thanks @Michaelvll! Tested failover on GKE + local. Left some comments
This should be ready for another look. @romilbhardwaj : ) Future TODOs:
test = Test(
    'kubernetes-context-failover',
    [
        'sky show-gpus --cloud kubernetes --region kind-skypilot | grep H100 | grep "1, 2, 3, 4, 5, 6, 7, 8"',
Since this test will fail if the sky local up cluster is not set up, can we add a quick error message for the dev running this test at this line? Something along the lines of: "Unable to find mocked GPUs in the sky local up cluster. Please read the instructions for test_kubernetes_context_failover on how to set it up".
Or better yet, automate the sky local up setup :)
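One possible way to attach such a hint to a shell command in the test list is sketched below. `with_hint` is a hypothetical helper, not something the test suite is known to provide; it relies only on POSIX shell `||` semantics, where the fallback runs when the whole pipeline exits non-zero.

```python
import shlex


def with_hint(command: str, hint: str) -> str:
    """Return a shell command that prints hint and exits 1 if command fails."""
    # shlex.quote keeps the hint a single shell word even with spaces.
    return f'{command} || {{ echo {shlex.quote(hint)}; exit 1; }}'


check_gpus = with_hint(
    'sky show-gpus --cloud kubernetes --region kind-skypilot '
    '| grep H100 | grep "1, 2, 3, 4, 5, 6, 7, 8"',
    'Unable to find mocked GPUs in the sky local up cluster. '
    'Please read the instructions for test_kubernetes_context_failover '
    'on how to set it up.')
print(check_gpus)
```

Because `||` binds more loosely than `|`, the hint is printed when the last grep in the pipeline finds no match, which is the case the reviewer is worried about.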
Added. Thanks!
Automatic local up is a bit scary: when I tried it, I realized that multiple future tests may share the same local k8s cluster, which can cause issues if each of them tries to modify that cluster.
Awesome work @Michaelvll! LGTM.
This allows users to specify the following ~/.sky/config.yaml to enable SkyPilot to fail over through different Kubernetes contexts.
TODO:
- region vs context
- show-gpus should show resources from all allowed_contexts (left for future)
Tested (run the relevant ones):
- bash format.sh
- sky launch -c test --cloud kubernetes --cpus 4 echo hi with two k8s clusters, one with nodes having fewer than 4 CPUs, one with nodes having more than 4 CPUs; it correctly fails over from the first k8s cluster to the second one
- allowed_contexts, and sky exec / sky launch again on the existing SkyPilot cluster
- pytest tests/test_smoke.py
- pytest tests/test_smoke.py::test_fill_in_the_name
- conda deactivate; bash -i tests/backward_compatibility_tests.sh