[k8s] sky check detects unlabeled nodes #5065


Merged: 8 commits, Apr 2, 2025

@SeungjinYang (Collaborator) commented Mar 28, 2025

Adds a hint to the output of `sky check kubernetes` for any nodes that have accelerators but are not properly labeled for use with SkyPilot.

The logic to detect contexts without proper labels can later be used to automatically label GPUs on those contexts.
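
As a rough illustration of the detection described above, here is a minimal sketch that works on plain dictionaries rather than a live cluster. The resource key `nvidia.com/gpu` and the label key `skypilot.co/accelerator` come from the output quoted in this PR; the function names, the per-context input shape, and the sample data are hypothetical and are not taken from the SkyPilot code base.

```python
from typing import Dict, List

# Keys as they appear in the log output quoted in this PR.
GPU_RESOURCE_KEY = "nvidia.com/gpu"
ACCELERATOR_LABEL_KEY = "skypilot.co/accelerator"


def has_gpu_capacity(node: dict) -> bool:
    """Return True if the node advertises at least one GPU in its capacity."""
    raw = node.get("capacity", {}).get(GPU_RESOURCE_KEY, "0")
    try:
        return int(raw) > 0
    except (TypeError, ValueError):
        # Treat unparsable quantities as "no GPU" in this sketch.
        return False


def find_unlabeled_gpu_nodes(
        nodes_by_context: Dict[str, List[dict]]) -> Dict[str, List[str]]:
    """Map each context to its GPU nodes that lack the accelerator label.

    Hypothetical helper: contexts with no unlabeled GPU nodes are omitted.
    """
    unlabeled: Dict[str, List[str]] = {}
    for context, nodes in nodes_by_context.items():
        names = [
            node["name"] for node in nodes
            if has_gpu_capacity(node)
            and ACCELERATOR_LABEL_KEY not in node.get("labels", {})
        ]
        if names:
            unlabeled[context] = names
    return unlabeled


# Made-up sample data: two contexts, one of which has an unlabeled GPU node.
sample = {
    "ctx-a": [
        {"name": "gpu-1", "capacity": {"nvidia.com/gpu": "1"}, "labels": {}},
        {"name": "gpu-2", "capacity": {"nvidia.com/gpu": "4"},
         "labels": {"skypilot.co/accelerator": "v100"}},
        {"name": "cpu-1", "capacity": {"cpu": "8"}, "labels": {}},
    ],
    "ctx-b": [
        {"name": "gpu-3", "capacity": {"nvidia.com/gpu": "1"},
         "labels": {"skypilot.co/accelerator": "t4"}},
    ],
}

print(find_unlabeled_gpu_nodes(sample))  # {'ctx-a': ['gpu-1']}
```

Returning a per-context mapping, rather than a flat list, is one way to let the same result feed both a per-context hint and a later auto-labeling step.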

Related issue: #4958
Fixes #5023

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Manually run sky check kubernetes on a cluster w/o labeling it beforehand
  • Manually run the GPU labeler script to check that jobs are launched only on unlabeled nodes.
$ python -m sky.utils.kubernetes.gpu_labeler   
Found 2 unlabeled GPU nodes in the cluster
Using default RuntimeClass for GPU labeling.
Created GPU labeler job for node ip-192-168-14-191.ec2.internal
Created GPU labeler job for node ip-192-168-24-7.ec2.internal
GPU labeling started - this may take 10 min or more to complete.
To check the status of GPU labeling jobs, run `kubectl get jobs -n kube-system -l job=sky-gpu-labeler`
You can check if nodes have been labeled by running `kubectl describe nodes` and looking for labels of the format `skypilot.co/accelerator: <gpu_name>`. 
$ ... # after some time
$ kubectl describe nodes | grep "skypilot.co"
                    skypilot.co/accelerator=v100
                    skypilot.co/accelerator=t4
$ python -m sky.utils.kubernetes.gpu_labeler 
No unlabeled GPU nodes found in the cluster. If you have unlabeled GPU nodes, please ensure that they have the resource `nvidia.com/gpu: <number of GPUs>` in their capacity.
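
The `kubectl describe nodes | grep "skypilot.co"` check above can also be done programmatically. The sketch below parses the JSON printed by `kubectl get nodes -o json` and reports the accelerator label of each GPU node. The helper name and the sample document are hypothetical; only the `nvidia.com/gpu` resource key and the `skypilot.co/accelerator` label key are taken from the output above.

```python
import json
from typing import Dict, Optional

ACCELERATOR_LABEL_KEY = "skypilot.co/accelerator"
GPU_RESOURCE_KEY = "nvidia.com/gpu"


def accelerator_labels(node_list_json: str) -> Dict[str, Optional[str]]:
    """Map each GPU node name to its accelerator label, or None if unlabeled.

    Expects the JSON document printed by `kubectl get nodes -o json`.
    """
    result: Dict[str, Optional[str]] = {}
    for item in json.loads(node_list_json).get("items", []):
        metadata = item.get("metadata", {})
        capacity = item.get("status", {}).get("capacity", {})
        if GPU_RESOURCE_KEY not in capacity:
            continue  # Not a GPU node, so there is nothing to label.
        labels = metadata.get("labels") or {}
        name = metadata.get("name", "<unknown>")
        result[name] = labels.get(ACCELERATOR_LABEL_KEY)
    return result


# Made-up sample in the shape of a Kubernetes NodeList.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "node-a",
                      "labels": {"skypilot.co/accelerator": "v100"}},
         "status": {"capacity": {"nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "node-b", "labels": {}},
         "status": {"capacity": {"nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "node-c", "labels": {}},
         "status": {"capacity": {"cpu": "8"}}},
    ]
})

print(accelerator_labels(SAMPLE))  # {'node-a': 'v100', 'node-b': None}
```

Against a real cluster, the input string could be obtained with `subprocess.run(["kubectl", "get", "nodes", "-o", "json"], capture_output=True, text=True, check=True).stdout`. A node that maps to `None` is one the labeler would still need to process.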

@SeungjinYang changed the title from "sky check detects unlabeled nodes" to "[k8s] sky check detects unlabeled nodes" on Mar 28, 2025

@Michaelvll (Collaborator) left a comment

Awesome @SeungjinYang!

@Michaelvll (Collaborator) left a comment

Thanks @SeungjinYang! LGTM.

@cg505 removed their request for review on April 2, 2025, 19:06
@SeungjinYang merged commit 6cf52ce into master on Apr 2, 2025
20 checks passed
@SeungjinYang deleted the autolabel-gpus branch on April 2, 2025, 20:06

Linked issue (may be closed by this pull request): [k8s] GPU labeler script should skip already labeled nodes