Ray head node defaulting to --num-cpus 500m (invalid value for --num-cpus) #5315

Open
turtlebasket opened this issue Apr 22, 2025 · 11 comments · May be fixed by #5340

turtlebasket commented Apr 22, 2025

(On 1.0.0.dev20250413) I'm encountering:

RuntimeError: Failed to start ray on the head node (exit code 1). Error: 
===== stdout ===== 
2025-04-22 23:07:08,623 INFO scripts.py:1163 -- Did not find any active Ray processes.
Usage: ray start [OPTIONS]
Try 'ray start --help' for help.

Error: Invalid value for '--num-cpus': '500m' is not a valid integer.

===== stderr =====
command terminated with exit code 1

…when running sky serve up service.yml --env-file <env> with the below config:

service:
  readiness_probe: /healthcheck
  replica_policy:
    min_replicas: 2
    max_replicas: 12
    target_qps_per_replica: 10
  tls:
    certfile: certs/skyserve.crt
    keyfile: certs/skyserve.key

resources:
  image_id: docker:pytorch/pytorch:2.6.0-cuda12.4-cudnn9-runtime
  cloud: runpod
  region: US
  ports: 8000
  cpus: 4+
  memory: 8+
  disk_size: 32
  accelerators:
    - RTX4090
  use_spot: false

workdir: .

envs:
  HF_TOKEN: ${HF_TOKEN}
  HF_MODEL: ${HF_MODEL}

setup: |
  # <many (valid) setup commands>

run: |
  source /venv/bin/activate
  python main.py

Strangely, the above deployment config was working perfectly a day or two ago; this behavior basically appeared overnight with no new config changes on my end. My first thought was that this could be the controller, but my sky config specifies 1 CPU core and 2 GB of memory for it, so I don't think the controller is the source of the fractional-core value being passed here...
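
For reference, the controller sizing I'm referring to lives in my ~/.sky/config.yaml, roughly like this (a sketch from memory; see SkyPilot's config reference for the exact serve.controller.resources keys):

# ~/.sky/config.yaml (sketch)
serve:
  controller:
    resources:
      cpus: 1+
      memory: 2+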

cg505 commented Apr 23, 2025

Did the error appearing coincide with a SkyPilot update? If not, maybe something changed with how RunPod is setting up their containers. Any chance you can check the value of psutil.cpu_count() on the failing RunPod instances?
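
i.e., something along the lines of (run on the failing instance):

python3 -c "import psutil; print(psutil.cpu_count())"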

turtlebasket commented Apr 23, 2025

First off, there's no coincidence with a recent update: I froze both the API server on my cluster and my local CLI install at 1.0.0.dev20250413 due to an API mismatch issue when setting everything up earlier this week, and I haven't updated since. As for the RunPod instance, I can see that we're at least provisioning an instance with the right specs:

Each replica will use the following resources (estimated):
Considered resources (1 node):
-------------------------------------------------------------------------------------------------
 CLOUD    INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
-------------------------------------------------------------------------------------------------
 RunPod   1x_RTX4090_SECURE   16      24        RTX4090:1      US            0.74          ✔     
-------------------------------------------------------------------------------------------------

...and I verified that psutil.cpu_count() returns correct values on large RunPod Secure Cloud instances I provision myself, so it's all the more bewildering that Ray is being started with only a fractional core.

(I'm not sure how to actually run psutil.cpu_count() in this situation... setup/run commands happen after Ray is initialized, yes? Would this involve manually editing the CLI/API server code?)

turtlebasket commented Apr 23, 2025

Not sure if this is helpful (i.e. if it's about the same for all automatic Ray setups done by SkyPilot), but this is the full command that's logged:

I 04-22 22:57:37 instance_setup.py:375] Running command on head node: $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) $([ -s ~/.sky/ray_path ] && cat ~/.sky/ray_path 2> /dev/null || which ray) stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 RAY_worker_maximum_startup_concurrency=$(( 3 * $(nproc --all) )) $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) $([ -s ~/.sky/ray_path ] && cat ~/.sky/ray_path 2> /dev/null || which ray) start --head --disable-usage-stats --port=6380 --dashboard-port=8266 --min-worker-port 11002 --object-manager-port=8076 --temp-dir=/tmp/ray_skypilot --object-store-memory=500000000 --num-cpus=500m || exit 1;which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;$([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -c 'import json, os; json.dump({"ray_port":6380, "ray_dashboard_port":8266}, open(os.path.expanduser("~/.sky/ray_port.json"), "w", encoding="utf-8"))';while `RAY_ADDRESS=127.0.0.1:6380 $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) $([ -s ~/.sky/ray_path ] && cat ~/.sky/ray_path 2> /dev/null || which ray) status | grep -q "No cluster status."`; do sleep 0.5; echo "Waiting ray cluster to be initialized"; done;
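
For what it's worth, the failure seems to come down to just that flag value; stripped to a minimal sketch (not re-run verbatim):

ray start --head --num-cpus=500m   # rejected: '500m' is not a valid integer
ray start --head --num-cpus=1      # an integer value is accepted
ray stop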

turtlebasket commented Apr 23, 2025

Looks like the Ray opts in question are templated at sky/provision/kubernetes/instance.py:1059. The broader problem seems to be that Kubernetes resource requests can't just be passed through as Ray options...?
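
e.g. a Kubernetes CPU quantity like 500m would have to be converted to a whole number of CPUs before being handed to ray start. A hypothetical helper, just to illustrate the mismatch (not SkyPilot code):

def k8s_cpus_to_ray_num_cpus(cpu_request: str) -> int:
    """Convert a Kubernetes CPU quantity ('500m', '2', '1.5') into the
    integer that `ray start --num-cpus` expects (floor, but never below 1)."""
    if cpu_request.endswith('m'):
        cpus = int(cpu_request[:-1]) / 1000  # millicores -> cores
    else:
        cpus = float(cpu_request)
    return max(int(cpus), 1)

assert k8s_cpus_to_ray_num_cpus('500m') == 1
assert k8s_cpus_to_ray_num_cpus('4') == 4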

cg505 commented Apr 24, 2025

I will be honest, something really really weird is going on. There is no way the kubernetes provisioner should have any impact on runpod. Have you ever used kubernetes at all? If you ssh into your serve controller, does kubectl get nodes return anything?
Could you provide the full debug logs in case there are any additional hints?

cg505 commented Apr 24, 2025

I can't spend more time on this today, but my only remotely viable idea right now is: the serve controller is on k8s with the standard 500m CPU request. We won't hit this error on initial provisioning:

# 'num-cpus' must be an integer, but we should not set it to 0 if
# cpus is <1.
'num-cpus': str(max(int(cpus), 1)),
but we could hit it if the Ray cluster degrades and we try to restart Ray; in this case the CPU request value (via cluster_info) is used directly:
cpu_request = head_spec.containers[0].resources.requests['cpu']
'num-cpus': cpu_request,
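
Illustratively (just the shape of the two paths, not the actual call sites):

# First-provision path: `cpus` has already been parsed into a float,
# so the flag gets clamped to a valid integer.
cpus = 0.5                              # a 500m CPU request
num_cpus_flag = str(max(int(cpus), 1))  # -> '1', accepted by `ray start`

# Ray-restart path: the raw Kubernetes request string from cluster_info
# is forwarded as-is.
cpu_request = '500m'
num_cpus_flag = cpu_request             # -> '500m', rejected by `ray start`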

cg505 commented Apr 24, 2025

@turtlebasket Also, could you run sky status -ru and post the full output?

turtlebasket commented Apr 24, 2025

Ah whoops, I thought this was an issue with the RunPod instances and not the controller 🤦‍♂

Inspected the controller pod and it was indeed requesting 0.5 CPU, despite my most recent config specifying 1+. I must have deployed it with 0.5 and left it running. It wasn't being recreated by subsequent launches, but I thought it was, since I kept seeing Launching serve controller on Kubernetes. -> Pod is up.
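
(For anyone checking the same thing: I looked at the pod's CPU request with something like the following, where <serve-controller-pod> is whatever your controller pod is named:)

kubectl get pod <serve-controller-pod> -o jsonpath='{.spec.containers[0].resources.requests.cpu}'
# -> 500m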

turtlebasket commented Apr 24, 2025

Also, excuse my ignorance about the implementation details here, but why is the k8s CPU request passed through directly on the second ray start attempt?

cg505 commented Apr 24, 2025

First start path: kubernetes-ray.yml.j2 uses {{ray_head_start_command}}, which is set in clouds/kubernetes.py make_deploy_resources_variables.

Unhealthy-Ray path: provisioner.py _post_provision_setup, which uses the bad value from cluster_info. Technically the first launch also hits this path, but Ray should theoretically already be healthy at that point.

cg505 linked a pull request (#5340) on Apr 24, 2025 that will close this issue
cg505 commented Apr 24, 2025

@turtlebasket could you try this patch? #5340
