[Spot] OOM on spot controller when 4 spot jobs run concurrently for more than 5 days #2668
The OOM issue might be related to https://discuss.ray.io/t/how-to-get-gcs-server-momery-distribution-to-debug-memory-continued-increasement/10030/4 and ray-project/ray#34619. We can consider reducing the value of …. Based on our current way of launching the SkyPilot job, each spot job will have > 16 Ray tasks (1 for each node for …). That said, we still need to figure out why the top memory consumer is the multiprocessing process.
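To confirm which processes actually dominate memory on the controller over time (independently of the report Ray prints on OOM), a small psutil script can be run periodically. This is a debugging sketch only, not part of SkyPilot; the top-10 cutoff simply mirrors the "Top 10 memory users" table below.

```python
# Debugging sketch (not part of SkyPilot): list the top memory consumers on the
# controller VM, similar to the "Top 10 memory users" table Ray prints on OOM.
import psutil

procs = []
for p in psutil.process_iter(["pid", "cmdline", "memory_info"]):
    mem = p.info["memory_info"]
    if mem is None:  # access to this process was denied; skip it
        continue
    rss_gb = mem.rss / 1024 ** 3
    procs.append((rss_gb, p.info["pid"], " ".join(p.info["cmdline"] or [])))

for rss_gb, pid, cmd in sorted(procs, reverse=True)[:10]:
    print(f"{pid:>8} {rss_gb:6.2f} GB  {cmd[:100]}")
```

Logging this every few minutes while the 4 spot jobs run should show whether the multiprocessing-spawned controller processes grow steadily or whether the GCS server does.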
Additional logging:
(<task-name>, pid=1598212) I 10-06 15:35:09 cloud_vm_ray_backend.py:1807] Launching on AWS us-west-2 (us-west-2b)
Traceback (most recent call last):
File "/home/ubuntu/.sky/sky_app/sky_job_55", line 452, in <module>
returncodes = ray.get(futures)
File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2537, in get
raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: <ip>, ID: 752666e00e5bf1cca24c49c6a40e0a68b0ba099d9f8b04ce6f5929a3) where the task (task ID: 6aae942a0d68c18ef28b98818a2417ea163325070a000000, name=<task-name>,, pid=1598212, memory used=0.09GB) was running was 29.39GB / 30.84GB (0.952786), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: cc44e735f5d501996315691dcc97f6e752974ba935bd9ab27ac20060) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip <ip>`. To see the logs of the worker, use `ray logs worker-cc44e735f5d501996315691dcc97f6e752974ba935bd9ab27ac20060*out -ip <ip>. Top 10 memory users:
PID MEM(GB) COMMAND
950070 15.12 /opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_...
1598333 5.11 /opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_...
1611919 5.06 /opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_...
1732 0.54 /opt/conda/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray_skypilot/...
1883 0.15 /opt/conda/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray...
1977 0.12 /opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/dashboard/agent.py --node-i...
1952252 0.11 /opt/conda/bin/python3 /tmp/skypilot_ray_up_j5y2hven.py
1766 0.11 /opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor...
949935 0.09 python3 -u /home/ubuntu/.sky/sky_app/sky_job_53
1598164 0.09 python3 -u /home/ubuntu/.sky/sky_app/sky_job_54
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

> sudo env "PATH=$PATH" py-spy dump --pid 950070
Process 950070: /opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=10, pipe_handle=12) --multiprocessing-fork
Python v3.10.6 (/opt/conda/bin/python3.10)
Thread 950070 (idle): "MainThread"
_run_one_task (sky/spot/controller.py:209)
run (sky/spot/controller.py:338)
_run_controller (sky/spot/controller.py:406)
run (multiprocessing/process.py:108)
_bootstrap (multiprocessing/process.py:314)
_main (multiprocessing/spawn.py:129)
spawn_main (multiprocessing/spawn.py:116)
<module> (<string>:1)
It seems our while loop in the controller increases memory consumption by ~5 MB per iteration (one iteration every ~20 seconds), i.e., about 5 MB × (86400 s / 20 s) ≈ 21.6 GB after a day (24 hours). See skypilot/sky/spot/controller.py, lines 209 to 275 at ea12df2.
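If the loop body can be exercised in isolation, tracemalloc can help localize which allocations accumulate across iterations. A minimal sketch, assuming such a harness exists; `run_one_iteration` is a hypothetical stand-in for the body of the controller's while loop, not an actual SkyPilot function:

```python
# Sketch for localizing a per-iteration leak; `run_one_iteration` is a
# hypothetical stand-in for the body of the controller's while loop.
import tracemalloc

def run_one_iteration():
    ...  # placeholder for the loop body under test

tracemalloc.start()
baseline = tracemalloc.take_snapshot()
for i in range(10):
    run_one_iteration()
    snapshot = tracemalloc.take_snapshot()
    # Top sources of new allocations since the baseline snapshot.
    for stat in snapshot.compare_to(baseline, "lineno")[:5]:
        print(f"iter {i}: {stat}")
```

If the reported growth per iteration stays roughly constant (~5 MB here), the offending call sites should show up in the `compare_to` output.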
Bisected the git history and found the commit that causes this OOM issue: #2288
A user reported two issues on a spot controller where OOM happened:
- `sky spot --controller` …
- A spot job is shown as `FAILED_CONTROLLER` in `sky spot queue` while the spot controller process is still running, causing the job to continue untracked, i.e., a resource leak. The other spot jobs were not affected.

This is a serious issue, as it leaks the spot jobs' resources.

We offered a workaround: ask the user to log into the spot controller, kill the spot controller process, and manually `sky down` the stale cluster.

However, several important TODOs remain (sorted by priority):
- When a job is marked `FAILED_CONTROLLER`, kill the controller process and clean up its clusters (having it in skylet should be an option). Added in [Spot] Cleanup zombie controller processes for OOM corner cases #2670

Some additional information: the user had about 50 spot jobs finished while 4 spot jobs were running, and each job has 4 or 16 nodes.
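For reference, the manual workaround roughly corresponds to the steps below, run on the spot controller. This is a hedged sketch only, with placeholder values (`CONTROLLER_PID` taken from the py-spy dump above, `CLUSTER_NAME` standing in for the stale cluster); the automated cleanup itself was added separately in #2670.

```python
# Rough sketch of the manual cleanup; CONTROLLER_PID and CLUSTER_NAME are placeholders.
import signal
import subprocess

import psutil

CONTROLLER_PID = 950070                # zombie controller process (from the py-spy dump above)
CLUSTER_NAME = "<stale-spot-cluster>"  # cluster left behind by the FAILED_CONTROLLER job

# 1. Kill the zombie spot controller process and its children.
controller = psutil.Process(CONTROLLER_PID)
for child in controller.children(recursive=True):
    child.send_signal(signal.SIGTERM)
controller.send_signal(signal.SIGTERM)

# 2. Tear down the stale cluster so its resources are not leaked.
subprocess.run(["sky", "down", "--yes", CLUSTER_NAME], check=True)
```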