`sky/utils/kubernetes/gpu_labeler.py` is a utility script a user can run to label GPU nodes in their k8s cluster. See https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html#automatically-labelling-nodes for how this script may be used. The Python script labels GPU nodes by finding all nodes with the `nvidia.com/gpu` resource on them and scheduling a pod on each node that adds the necessary GPU label (specifically, the `skypilot.co/accelerator: <gpu_name>` label). The relevant logic is copied here:
```python
# Get the list of nodes with GPUs
gpu_nodes = []
for node in nodes:
    if kubernetes_utils.get_gpu_resource_key() in node.status.capacity:
        gpu_nodes.append(node)
...  # launch labeling job on each node
```
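For context, the label the job applies ends up in each node's `metadata.labels`. A quick way to see which nodes are already labeled, as a minimal sketch using the official `kubernetes` Python client (assuming a local kubeconfig; the labeler script itself goes through SkyPilot's `kubernetes_utils` wrappers):

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig; use
# config.load_incluster_config() when running inside a pod.
config.load_kube_config()

v1 = client.CoreV1Api()
for node in v1.list_node().items:
    labels = node.metadata.labels or {}
    # Nodes the labeler has already processed carry this label.
    print(node.metadata.name,
          labels.get('skypilot.co/accelerator', '<unlabeled>'))
```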
While this script works, it launches a labeling job on every node with a GPU resource, regardless of whether the node has already been labeled.
One could imagine a k8s cluster whose GPU nodes were labeled in the past, but which has since had additional nodes join to better scale workloads. In that case, a user may run the GPU labeler script to label the newly joined nodes, but the script will schedule pods even on the already-labeled nodes. This is inefficient, and we'd like to avoid it.
We could check, in the for loop above, whether the node already has a `skypilot.co/accelerator` label; if it does, we should not launch a job to label that node. A sketch of this check follows.
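As a minimal sketch (assuming the same `nodes` list and `kubernetes_utils` helper as the snippet above; the `gpu_label_key` name here is illustrative, not an existing constant in the script):

```python
gpu_label_key = 'skypilot.co/accelerator'

gpu_nodes = []
for node in nodes:
    if kubernetes_utils.get_gpu_resource_key() not in node.status.capacity:
        continue
    # Skip nodes that already carry the SkyPilot accelerator label,
    # so labeling jobs are only launched on newly joined nodes.
    labels = node.metadata.labels or {}
    if gpu_label_key in labels:
        continue
    gpu_nodes.append(node)
...  # launch labeling job on each node in gpu_nodes
```

This keeps the existing behavior for unlabeled nodes while making repeated runs of the script effectively idempotent.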