Ray head node defaulting to --num-cpus 500m (invalid value for --num-cpus) #5315

Open
turtlebasket opened this issue Apr 22, 2025 · 11 comments · May be fixed by #5340

turtlebasket commented Apr 22, 2025

(On 1.0.0.dev20250413) I'm encountering:

RuntimeError: Failed to start ray on the head node (exit code 1). Error: 
===== stdout ===== 
2025-04-22 23:07:08,623 INFO scripts.py:1163 -- Did not find any active Ray processes.
Usage: ray start [OPTIONS]
Try 'ray start --help' for help.

Error: Invalid value for '--num-cpus': '500m' is not a valid integer.

===== stderr =====
command terminated with exit code 1

…when running sky serve up service.yml --env-file <env> with the below config:

service:
  readiness_probe: /healthcheck
  replica_policy:
    min_replicas: 2
    max_replicas: 12
    target_qps_per_replica: 10
  tls:
    certfile: certs/skyserve.crt
    keyfile: certs/skyserve.key

resources:
  image_id: docker:pytorch/pytorch:2.6.0-cuda12.4-cudnn9-runtime
  cloud: runpod
  region: US
  ports: 8000
  cpus: 4+
  memory: 8+
  disk_size: 32
  accelerators:
    - RTX4090
  use_spot: false

workdir: .

envs:
  HF_TOKEN: ${HF_TOKEN}
  HF_MODEL: ${HF_MODEL}

setup: |
  # <many (valid) setup commands>

run: |
  source /venv/bin/activate
  python main.py

Strangely, the above deployment config was working perfectly a day or two ago; this behavior basically appeared overnight with no new config changes on my end. My first thought was that this could be the controller, but my sky config specifies 1 CPU core and 2 GB of memory for it, so I don't think the controller is the source of the fractional-core value being passed here...
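
For reference, the controller sizing I'm referring to lives in my ~/.sky/config.yaml, roughly like this (a sketch from memory; see SkyPilot's config reference for the exact serve.controller.resources keys):

# ~/.sky/config.yaml (sketch)
serve:
  controller:
    resources:
      cpus: 1+
      memory: 2+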

cg505 commented Apr 23, 2025

Did the error appearing coincide with a SkyPilot update? If not, maybe something changed with how RunPod is setting up their containers. Any chance you can check the value of psutil.cpu_count() on the failing RunPod instances?
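
i.e., something along the lines of (run on the failing instance):

python3 -c "import psutil; print(psutil.cpu_count())"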

turtlebasket commented Apr 23, 2025

First off, there's no coincidence with a recent update: I froze both the API server on my cluster and my local CLI install at 1.0.0.dev20250413 due to an API mismatch issue when setting everything up earlier this week, and I haven't updated since. As for the RunPod instance, I can see that we're at least provisioning an instance with the right specs:

Each replica will use the following resources (estimated):
Considered resources (1 node):
-------------------------------------------------------------------------------------------------
 CLOUD    INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
-------------------------------------------------------------------------------------------------
 RunPod   1x_RTX4090_SECURE   16      24        RTX4090:1      US            0.74          ✔     
-------------------------------------------------------------------------------------------------

...and I verified that psutil.cpu_count() returns correct values on large RunPod Secure Cloud instances I provision myself, so it's all the more bewildering that Ray is being started with only a fractional core.

(I'm not sure how to actually run psutil.cpu_count() in this situation... setup/run commands happen after Ray is initialized, yes? Would this involve manually editing the CLI/API server code?)

turtlebasket commented Apr 23, 2025

Not sure if this is helpful (i.e. if it's about the same for all automatic Ray setups done by SkyPilot), but this is the full command that's logged:

I 04-22 22:57:37 instance_setup.py:375] Running command on head node: $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) $([ -s ~/.sky/ray_path ] && cat ~/.sky/ray_path 2> /dev/null || which ray) stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 RAY_worker_maximum_startup_concurrency=$(( 3 * $(nproc --all) )) $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) $([ -s ~/.sky/ray_path ] && cat ~/.sky/ray_path 2> /dev/null || which ray) start --head --disable-usage-stats --port=6380 --dashboard-port=8266 --min-worker-port 11002 --object-manager-port=8076 --temp-dir=/tmp/ray_skypilot --object-store-memory=500000000 --num-cpus=500m || exit 1;which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;$([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -c 'import json, os; json.dump({"ray_port":6380, "ray_dashboard_port":8266}, open(os.path.expanduser("~/.sky/ray_port.json"), "w", encoding="utf-8"))';while `RAY_ADDRESS=127.0.0.1:6380 $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) $([ -s ~/.sky/ray_path ] && cat ~/.sky/ray_path 2> /dev/null || which ray) status | grep -q "No cluster status."`; do sleep 0.5; echo "Waiting ray cluster to be initialized"; done;
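
For what it's worth, the failure seems to come down to just that flag value; stripped to a minimal sketch (not re-run verbatim):

ray start --head --num-cpus=500m   # rejected: '500m' is not a valid integer
ray start --head --num-cpus=1      # an integer value is accepted
ray stop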

turtlebasket commented Apr 23, 2025

Looks like the Ray opts in question are templated at sky/provision/kubernetes/instance.py:1059. The broader problem seems to be that Kubernetes resource requests can't just be passed through as Ray options...?
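
e.g. a Kubernetes CPU quantity like 500m would have to be converted to a whole number of CPUs before being handed to ray start. A hypothetical helper, just to illustrate the mismatch (not SkyPilot code):

def k8s_cpus_to_ray_num_cpus(cpu_request: str) -> int:
    """Convert a Kubernetes CPU quantity ('500m', '2', '1.5') into the
    integer that `ray start --num-cpus` expects (floor, but never below 1)."""
    if cpu_request.endswith('m'):
        cpus = int(cpu_request[:-1]) / 1000  # millicores -> cores
    else:
        cpus = float(cpu_request)
    return max(int(cpus), 1)

assert k8s_cpus_to_ray_num_cpus('500m') == 1
assert k8s_cpus_to_ray_num_cpus('4') == 4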

cg505 commented Apr 24, 2025

I will be honest, something really really weird is going on. There is no way the kubernetes provisioner should have any impact on runpod. Have you ever used kubernetes at all? If you ssh into your serve controller, does kubectl get nodes return anything?
Could you provide the full debug logs in case there are any additional hints?

cg505 commented Apr 24, 2025

I can't spend more time on this today, but my only remotely viable idea right now is: the serve controller is on k8s with the standard 500m CPU request. We won't hit this error on initial provisioning:

# 'num-cpus' must be an integer, but we should not set it to 0 if
# cpus is <1.
'num-cpus': str(max(int(cpus), 1)),
but we could hit it if the Ray cluster degrades and we try to restart Ray; in this case the CPU request value (via cluster_info) is used directly:
cpu_request = head_spec.containers[0].resources.requests['cpu']
'num-cpus': cpu_request,
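
Illustratively (just the shape of the two paths, not the actual call sites):

# First-provision path: `cpus` has already been parsed into a float,
# so the flag gets clamped to a valid integer.
cpus = 0.5                              # a 500m CPU request
num_cpus_flag = str(max(int(cpus), 1))  # -> '1', accepted by `ray start`

# Ray-restart path: the raw Kubernetes request string from cluster_info
# is forwarded as-is.
cpu_request = '500m'
num_cpus_flag = cpu_request             # -> '500m', rejected by `ray start`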

cg505 commented Apr 24, 2025

@turtlebasket Also, could you run sky status -ru and post the full output?

turtlebasket commented Apr 24, 2025

Ah whoops, I thought this was an issue with the RunPod instances and not the controller 🤦‍♂

Inspected the controller pod and it was indeed requesting 0.5 CPU, despite my most recent config specifying 1+. I must have deployed it with 0.5 and left it running. It wasn't being recreated by subsequent launches, but I thought it was, since I kept seeing Launching serve controller on Kubernetes. -> Pod is up.
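
(For anyone checking the same thing: I looked at the pod's CPU request with something like the following, where <serve-controller-pod> is whatever your controller pod is named:)

kubectl get pod <serve-controller-pod> -o jsonpath='{.spec.containers[0].resources.requests.cpu}'
# -> 500m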

turtlebasket commented Apr 24, 2025

Also, excuse my ignorance about the implementation details here, but why is the k8s CPU request passed through directly on the second ray start attempt?

cg505 commented Apr 24, 2025

First start path: kubernetes-ray.yml.j2 uses {{ray_head_start_command}}, which is set in clouds/kubernetes.py make_deploy_resources_variables.

Unhealthy-Ray path: provisioner.py _post_provision_setup, which uses the bad value from cluster_info. Technically the first launch also hits this path, but Ray should theoretically already be healthy at that point.

cg505 linked a pull request (#5340) on Apr 24, 2025 that will close this issue
cg505 commented Apr 24, 2025

@turtlebasket could you try this patch? #5340
