[k8s] GPU labeler script should skip already labeled nodes #5023

Closed
SeungjinYang opened this issue Mar 24, 2025 · 0 comments · Fixed by #5065
SeungjinYang commented Mar 24, 2025

sky/utils/kubernetes/gpu_labeler.py is a utility script that users can run to label GPU nodes in their k8s cluster. See https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html#automatically-labelling-nodes for how to use it.

The Python script labels GPU nodes by finding every node that exposes the nvidia.com/gpu resource and scheduling a pod on each one. That pod adds the necessary GPU label (specifically, skypilot.co/accelerator: <gpu_name>). The relevant logic is copied here:

        # Get the list of nodes with GPUs
        gpu_nodes = []
        for node in nodes:
            if kubernetes_utils.get_gpu_resource_key() in node.status.capacity:
                gpu_nodes.append(node)
        ... # launch labeling job on each node

While this script works, it launches a labeling job on every node with a GPU resource, regardless of whether the node has already been labeled.

Consider a k8s cluster whose GPU nodes were labeled in the past and which has since had additional nodes join to scale workloads. A user may run the GPU labeler script to label only the newly joined nodes, but the script will also schedule pods on the nodes that are already labeled. This is inefficient, and we'd like to avoid it.

In the for loop shown above, we could check whether the node already has a skypilot.co/accelerator label. If it does, we should not launch a job to label that node.
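
A minimal sketch of that check, not the actual fix. It assumes nodes look like the Kubernetes Python client's node objects, with `metadata.labels` possibly `None`, and it hardcodes the resource key and label key quoted in this issue instead of calling `kubernetes_utils.get_gpu_resource_key()`. The names `nodes_needing_label` and `make_node` are hypothetical, and the stand-in nodes are built with `SimpleNamespace` so the snippet runs without a cluster.

```python
from types import SimpleNamespace

# Assumptions for illustration: keys are hardcoded from the issue text.
GPU_RESOURCE_KEY = 'nvidia.com/gpu'
GPU_LABEL_KEY = 'skypilot.co/accelerator'


def nodes_needing_label(nodes):
    """Return GPU nodes that do not yet carry the accelerator label."""
    unlabeled = []
    for node in nodes:
        capacity = node.status.capacity or {}
        if GPU_RESOURCE_KEY not in capacity:
            continue  # Not a GPU node.
        # A node without any labels may report None rather than {}.
        labels = node.metadata.labels or {}
        if GPU_LABEL_KEY in labels:
            continue  # Already labeled: skip launching a labeling job.
        unlabeled.append(node)
    return unlabeled


def make_node(name, capacity, labels):
    """Hypothetical stand-in for a node object, for local testing only."""
    return SimpleNamespace(
        metadata=SimpleNamespace(name=name, labels=labels),
        status=SimpleNamespace(capacity=capacity),
    )


example_nodes = [
    make_node('cpu-only', {'cpu': '8'}, {'role': 'worker'}),
    make_node('gpu-old', {GPU_RESOURCE_KEY: '4'}, {GPU_LABEL_KEY: 'a100'}),
    make_node('gpu-new', {GPU_RESOURCE_KEY: '4'}, {'role': 'worker'}),
    make_node('gpu-no-labels', {GPU_RESOURCE_KEY: '1'}, None),
]

to_label = [n.metadata.name for n in nodes_needing_label(example_nodes)]
print(to_label)
```

With this filter, only nodes in the returned list would get a labeling job; the already labeled `gpu-old` node and the `cpu-only` node are skipped.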
