Ray head node defaulting to --num-cpus 500m (invalid value for --num-cpus) #5315
Comments
Did the error's appearance coincide with a SkyPilot update? If not, maybe something changed with how RunPod is setting up their containers. Any chance you can check the value of …?
First off, no coincidence with a recent update; I froze both the API server on my cluster and my local CLI install on …
…and I verified (I'm not sure how to really run …).
Not sure if this is helpful (i.e. if it's about the same for all automatic Ray installs done by SkyPilot), but this is the full command that's logged: …
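(For context, a minimal sketch of the mismatch, not SkyPilot's actual code: Kubernetes expresses CPU requests as quantities like 500m, i.e. 500 millicores, while Ray's --num-cpus flag expects a plain number, so the raw quantity has to be converted before it can be passed through. The helper name below is hypothetical.)

```python
def k8s_cpu_quantity_to_cpus(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity (e.g. '500m' or '2') to a CPU count.

    Hypothetical helper for illustration only: '500m' means 500 millicores,
    i.e. 0.5 CPUs. Passing the raw string straight to `ray start --num-cpus`
    fails because it is not a number.
    """
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000.0
    return float(quantity)


assert k8s_cpu_quantity_to_cpus("500m") == 0.5
assert k8s_cpu_quantity_to_cpus("2") == 2.0
```

(Depending on the Ray version, --num-cpus may additionally need to be an integer, so the converted value might still need rounding.)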
Looks like the Ray opts in question are templated at …
I'll be honest, something really weird is going on. There is no way the Kubernetes provisioner should have any impact on RunPod. Have you ever used Kubernetes at all? If you SSH into your serve controller, does …?
I can't spend more time on this today, but my only remotely viable idea right now is: the serve controller is on k8s with the standard 500m CPU request. We won't hit this error on initial provisioning.
- skypilot/sky/clouds/kubernetes.py, lines 539 to 541 in baba253
- skypilot/sky/provision/kubernetes/instance.py, line 1033 in baba253
- skypilot/sky/provision/kubernetes/instance.py, line 1061 in baba253
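(One way to check that hypothesis, as a sketch: ask Kubernetes for the controller pod's CPU request directly. The pod name below is hypothetical; the real serve controller pod name will differ.)

```python
import subprocess

# Hypothetical pod name; look it up with `kubectl get pods` in the
# namespace SkyPilot deploys the serve controller into.
POD = "sky-serve-controller-xxxx-head"

# Print the CPU *request* of the pod's first container. If the hypothesis
# above is right, this prints "500m".
cpu_request = subprocess.run(
    [
        "kubectl", "get", "pod", POD,
        "-o", "jsonpath={.spec.containers[0].resources.requests.cpu}",
    ],
    capture_output=True,
    text=True,
    check=True,
).stdout
print(cpu_request)
```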
@turtlebasket Also, could you run …?
Ah whoops, I thought this was an issue with RP instances and not the controller 🤦♂️. Inspected the controller pod and it was indeed using 0.5 cores, despite my most recent config specifying 1+. I must have deployed with 0.5 and left it running. It wasn't being recreated by subsequent launches, but I thought it was, seeing the …
Also, excuse my ignorance about the implementation details here, but why is the k8s CPU claim passed directly on the second …?
First start path: …
Unhealthy ray path: provisioner.py _post_provision_setup, which uses the bad value from cluster_info.
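(To make the distinction concrete, a minimal sketch of the failure pattern being described, with hypothetical names rather than SkyPilot's actual structures: on the recovery path the CPU count is re-read from stored cluster metadata, so if that metadata holds the raw Kubernetes quantity, it flows straight into the Ray flag.)

```python
from dataclasses import dataclass


@dataclass
class ClusterMetadata:
    """Hypothetical stand-in for the per-cluster info consulted on restart."""
    cpu_request: str  # may hold a raw Kubernetes quantity such as '500m'


def build_ray_start_command(meta: ClusterMetadata) -> str:
    # On the unhealthy-ray path the value comes from the stored metadata,
    # so a quantity like '500m' ends up in the flag and Ray rejects it.
    return f"ray start --head --num-cpus={meta.cpu_request}"


print(build_ray_start_command(ClusterMetadata(cpu_request="500m")))
# ray start --head --num-cpus=500m  <- invalid value for --num-cpus
```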
@turtlebasket could you try this patch? #5340 |
(On 1.0.0.dev20250413) I'm encountering: … when running sky serve up service.yml --env-file <env> with the below config: …
Strangely, the above deployment config was working perfectly a day or two ago; this behavior basically appeared overnight with no new config changes on my end. My first thought was that this could be the controller, but my sky config specifies 1 core and 2 GB of memory, so I don't think the controller is the cause of the fractional-core param being passed in here...