
[Spot] OOM on spot controller when 4 spot jobs run concurrently for more than 5 days #2668


Closed
Michaelvll opened this issue Oct 6, 2023 · 4 comments · Fixed by #2675

@Michaelvll
Collaborator

Michaelvll commented Oct 6, 2023

A user reported two issues on a spot controller where OOM happened:

  1. A spot job was killed due to OOM, with the following error shown in sky spot --controller:
Traceback (most recent call last):
  File "/home/ubuntu/.sky/sky_app/sky_job_48", line 452, in <module>
    returncodes = ray.get(futures)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2537, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: xxx.xx.xx.xxx, ID: 752666e00e5bf1cca24c49c6a40e0a68b0ba099d9f8b04ce6f5929a3) where the task (task ID: 1571137c1ceea2f1a9ed70b6ec66c436e5fde3c403000000, name=<task-name>,, pid=2936, memory used=0.10GB) was running was 29.30GB / 30.84GB (0.950025), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: fc5c3a323b9b91b8eeb4252276fe5711a946bfb4257c85202b9b8f6b) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip xxx.xx.xx.xxx`. To see the logs of the worker, use `ray logs worker-fc5c3a323b9b91b8eeb4252276fe5711a946bfb4257c85202b9b8f6b*out -ip xxx.xx.xx.xxx. Top 10 memory users:
PID	MEM(GB)	COMMAND
18410	26.84	/opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_...
1732	0.48	/opt/conda/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray_skypilot/...
1766	0.11	/opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor...
1977	0.11	/opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/dashboard/agent.py --node-i...
2936	0.10	ray::<task-name>,
1883	0.10	/opt/conda/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray...
18323	0.10	python3 -u /home/ubuntu/.sky/sky_app/sky_job_48
18260	0.08	ray::JobSupervisor
616770	0.08	python3 /opt/conda/lib/python3.10/site-packages/sky/skylet/subprocess_daemon.py --parent-pid 18410 -...
18376	0.08	python3 /opt/conda/lib/python3.10/site-packages/sky/skylet/subprocess_daemon.py --parent-pid 2936 --...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
Shared connection to xxx.xx.xx.xxx closed.
  2. Another spot job was shown as FAILED_CONTROLLER in sky spot queue while the spot controller process was still running, causing the job to continue running untracked, i.e., resource leakage. The other spot jobs were not affected.

This is a serious issue, as it causes leakage of spot jobs.

We offered a workaround: ask the user to log into the spot controller, kill the spot controller process, and manually sky down the stale cluster.

However, there are several important TODOs (sorted by priority):

  1. We need a safety net for controllers that turn into FAILED_CONTROLLER, to kill the controller process and clean up the clusters (having it in skylet should be an option). Added in [Spot] Cleanup zombie controller processes for OOM corner cases #2670
  2. Figure out why the OOM happens when there are only 4 spot jobs.

Some additional information: the user has about 50 finished spot jobs while 4 spot jobs are running, and each job has 4 or 16 nodes.

@Michaelvll Michaelvll added the P0 label Oct 6, 2023
@Michaelvll
Collaborator Author

Michaelvll commented Oct 6, 2023

  2. Why the OOM happens when there are only 4 spot jobs

The OOM issue might be related to this issue: https://discuss.ray.io/t/how-to-get-gcs-server-momery-distribution-to-debug-memory-continued-increasement/10030/4 and ray-project/ray#34619

We can consider reducing the value of RAY_maximum_gcs_destroyed_actor_cached_count to less than 100k.
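A minimal sketch of how that knob could be applied, assuming the env var is exported before the controller's ray start; the launch path and the value below are assumptions, not from this thread:

# Hypothetical sketch: start the Ray head with a smaller GCS cache of destroyed
# tasks/actors, so dead entries from finished spot jobs take less GCS memory.
import os
import subprocess

env = os.environ.copy()
# Assumed value; the suggestion above is only to go below 100k.
env['RAY_maximum_gcs_destroyed_actor_cached_count'] = '10000'
subprocess.run(['ray', 'start', '--head'], env=env, check=True)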

Based on our current way of launching the SkyPilot job, each spot job will have more than 16 (one per node for setup) + 16 (one per node for run) + 1 (ray job supervisor) = 33 ray tasks. According to the reference above, each dead ray task can cause ~15MB of memory consumption, leading to roughly 50 * 30 * 15 / 1024 ≈ 22GB for the dead ray tasks. That is close to the 32GB memory upper bound of the spot controller, considering there are also some ray up workloads that can take a lot of memory.
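As a quick sanity check of that estimate (the numbers are the rough ones above, not measurements from the controller):

# Back-of-the-envelope check of the dead-ray-task memory estimate.
finished_jobs = 50        # jobs already finished on this controller
tasks_per_job = 30        # ~16 setup + ~16 run + 1 supervisor, rounded
mb_per_dead_task = 15     # per the linked Ray discussion
total_gb = finished_jobs * tasks_per_job * mb_per_dead_task / 1024
print(f'{total_gb:.1f} GB')  # ~22.0 GB, against a 32GB controller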

That said, we still need to figure out why the top memory consumer is the multiprocessing process.

@Michaelvll
Collaborator Author

Michaelvll commented Oct 6, 2023

Additional logging:

(<task-name>, pid=1598212) I 10-06 15:35:09 cloud_vm_ray_backend.py:1807] Launching on AWS us-west-2 (us-west-2b)
Traceback (most recent call last):
  File "/home/ubuntu/.sky/sky_app/sky_job_55", line 452, in <module>
    returncodes = ray.get(futures)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2537, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: <ip>, ID: 752666e00e5bf1cca24c49c6a40e0a68b0ba099d9f8b04ce6f5929a3) where the task (task ID: 6aae942a0d68c18ef28b98818a2417ea163325070a000000, name=<task-name>,, pid=1598212, memory used=0.09GB) was running was 29.39GB / 30.84GB (0.952786), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: cc44e735f5d501996315691dcc97f6e752974ba935bd9ab27ac20060) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip <ip>`. To see the logs of the worker, use `ray logs worker-cc44e735f5d501996315691dcc97f6e752974ba935bd9ab27ac20060*out -ip <ip>. Top 10 memory users:
PID	MEM(GB)	COMMAND
950070	15.12	/opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_...
1598333	5.11	/opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_...
1611919	5.06	/opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_...
1732	0.54	/opt/conda/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray_skypilot/...
1883	0.15	/opt/conda/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray...
1977	0.12	/opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/dashboard/agent.py --node-i...
1952252	0.11	/opt/conda/bin/python3 /tmp/skypilot_ray_up_j5y2hven.py
1766	0.11	/opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor...
949935	0.09	python3 -u /home/ubuntu/.sky/sky_app/sky_job_53
1598164	0.09	python3 -u /home/ubuntu/.sky/sky_app/sky_job_54
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
> sudo env "PATH=$PATH" py-spy dump --pid 950070
Process 950070: /opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=10, pipe_handle=12) --multiprocessing-fork
Python v3.10.6 (/opt/conda/bin/python3.10)

Thread 950070 (idle): "MainThread"
    _run_one_task (sky/spot/controller.py:209)
    run (sky/spot/controller.py:338)
    _run_controller (sky/spot/controller.py:406)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:129)
    spawn_main (multiprocessing/spawn.py:116)
    <module> (<string>:1)

@Michaelvll
Collaborator Author

Michaelvll commented Oct 6, 2023

It seems the while loop in our controller increases memory consumption by ~5MB per iteration (every 20 seconds). After a day (24 hours, i.e. ~4,320 iterations), that adds up to ~21GB of memory. A rough way to confirm the per-iteration growth is sketched below, after the repro command.

time.sleep(spot_utils.JOB_STATUS_CHECK_GAP_SECONDS)
# Check the network connection to avoid false alarm for job failure.
# Network glitch was observed even in the VM.
try:
    backend_utils.check_network_connection()
except exceptions.NetworkError:
    logger.info(
        'Network is not available. Retrying again in '
        f'{spot_utils.JOB_STATUS_CHECK_GAP_SECONDS} seconds.')
    continue
# NOTE: we do not check cluster status first because race condition
# can occur, i.e. cluster can be down during the job status check.
job_status = spot_utils.get_job_status(self._backend, cluster_name)
if job_status == job_lib.JobStatus.SUCCEEDED:
    end_time = spot_utils.get_job_timestamp(self._backend,
                                            cluster_name,
                                            get_end_time=True)
    # The job is done.
    spot_state.set_succeeded(self._job_id,
                             task_id,
                             end_time=end_time,
                             callback_func=callback_func)
    logger.info(
        f'Spot job {self._job_id} (task: {task_id}) SUCCEEDED. '
        f'Cleaning up the spot cluster {cluster_name}.')
    # Only clean up the spot cluster, not the storages, because
    # tasks may share storages.
    recovery_strategy.terminate_cluster(cluster_name=cluster_name)
    return True
# For single-node jobs, nonterminated job_status indicates a
# healthy cluster. We can safely continue monitoring.
# For multi-node jobs, since the job may not be set to FAILED
# immediately (depending on user program) when only some of the
# nodes are preempted, need to check the actual cluster status.
if (job_status is not None and not job_status.is_terminal() and
        task.num_nodes == 1):
    continue
if job_status in [
        job_lib.JobStatus.FAILED, job_lib.JobStatus.FAILED_SETUP
]:
    # Add a grace period before the check of preemption to avoid
    # false alarm for job failure.
    time.sleep(5)
# Pull the actual cluster status from the cloud provider to
# determine whether the cluster is preempted.
(cluster_status,
 handle) = backend_utils.refresh_cluster_status_handle(
     cluster_name,
     force_refresh_statuses=set(status_lib.ClusterStatus))
if cluster_status != status_lib.ClusterStatus.UP:
    # The cluster is (partially) preempted. It can be down, INIT
    # or STOPPED, based on the interruption behavior of the cloud.
    # Spot recovery is needed (will be done later in the code).
    cluster_status_str = ('' if cluster_status is None else
                          f' (status: {cluster_status.value})')
    logger.info(
        f'Cluster is preempted{cluster_status_str}. Recovering...')
else:
    if job_status is not None and not job_status.is_terminal():
        # The multi-node job is still running, continue monitoring.
        continue

Reproduction command:

sky spot launch -n test-long-running-job --cloud aws --cpus 2 --num-nodes 16 sleep 1000000000000000
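A rough way to confirm the suspected per-iteration growth would be to log the controller process's RSS around each monitoring iteration. This is only an illustrative sketch, not how the ~5MB number was obtained; psutil and the bare loop are assumptions:

# Illustrative sketch: watch the controller's RSS once per monitoring iteration.
# At ~5MB growth per 20-second iteration, ~4320 iterations/day would add ~21GB.
import time
import psutil

proc = psutil.Process()
while True:
    rss_mb = proc.memory_info().rss / 1024 / 1024
    print(f'controller RSS: {rss_mb:.1f} MB')
    time.sleep(20)  # stand-in for spot_utils.JOB_STATUS_CHECK_GAP_SECONDS
    # ... job status check / recovery logic from the loop above ...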

@Michaelvll
Collaborator Author

Bisected the git history and found the commit that causes this OOM issue: #2288
