
[Spot] OOM on spot controller when 4 spot jobs run concurrently for more than 5 days #2668


Closed
Michaelvll opened this issue Oct 6, 2023 · 4 comments · Fixed by #2675

@Michaelvll
Collaborator

Michaelvll commented Oct 6, 2023

A user reported two issues on a spot controller where OOM happened:

  1. A spot job was killed due to OOM, with the following error shown in sky spot --controller:
Traceback (most recent call last):
  File "/home/ubuntu/.sky/sky_app/sky_job_48", line 452, in <module>
    returncodes = ray.get(futures)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2537, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: xxx.xx.xx.xxx, ID: 752666e00e5bf1cca24c49c6a40e0a68b0ba099d9f8b04ce6f5929a3) where the task (task ID: 1571137c1ceea2f1a9ed70b6ec66c436e5fde3c403000000, name=<task-name>,, pid=2936, memory used=0.10GB) was running was 29.30GB / 30.84GB (0.950025), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: fc5c3a323b9b91b8eeb4252276fe5711a946bfb4257c85202b9b8f6b) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip xxx.xx.xx.xxx`. To see the logs of the worker, use `ray logs worker-fc5c3a323b9b91b8eeb4252276fe5711a946bfb4257c85202b9b8f6b*out -ip xxx.xx.xx.xxx. Top 10 memory users:
PID	MEM(GB)	COMMAND
18410	26.84	/opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_...
1732	0.48	/opt/conda/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray_skypilot/...
1766	0.11	/opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor...
1977	0.11	/opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/dashboard/agent.py --node-i...
2936	0.10	ray::<task-name>,
1883	0.10	/opt/conda/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray...
18323	0.10	python3 -u /home/ubuntu/.sky/sky_app/sky_job_48
18260	0.08	ray::JobSupervisor
616770	0.08	python3 /opt/conda/lib/python3.10/site-packages/sky/skylet/subprocess_daemon.py --parent-pid 18410 -...
18376	0.08	python3 /opt/conda/lib/python3.10/site-packages/sky/skylet/subprocess_daemon.py --parent-pid 2936 --...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
Shared connection to xxx.xx.xx.xxx closed.
  2. Another spot job was shown as FAILED_CONTROLLER in sky spot queue while the spot controller process was still running, causing the job to continue running untracked, i.e., resource leakage. The other spot jobs were not affected.

This is a serious issue, as it causes leakage of spot jobs.

We offered a workaround: ask the user to log into the spot controller, kill the spot controller process, and manually sky down the stale cluster.

However, there are several important TODOs (sorted by priority):

  1. We need a safety net for controllers that turn into FAILED_CONTROLLER, to kill the controller process and clean up the clusters (having it in skylet should be an option). Added in [Spot] Cleanup zombie controller processes for OOM corner cases #2670
  2. Figure out why the OOM happens when there are only 4 spot jobs.

Some additional information: the user has about 50 finished spot jobs while 4 spot jobs are running, and each job has 4 or 16 nodes.

@Michaelvll Michaelvll added the P0 label Oct 6, 2023
@Michaelvll
Collaborator Author

Michaelvll commented Oct 6, 2023

  2. Why the OOM happens when there are only 4 spot jobs

The OOM issue might be related to this issue: https://discuss.ray.io/t/how-to-get-gcs-server-momery-distribution-to-debug-memory-continued-increasement/10030/4 and ray-project/ray#34619

We can consider reducing the value of RAY_maximum_gcs_destroyed_actor_cached_count to less than 100k.
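A minimal sketch of how that knob could be applied, assuming the env var is exported before the controller's ray start; the launch path and the value below are assumptions, not from this thread:

# Hypothetical sketch: start the Ray head with a smaller GCS cache of destroyed
# tasks/actors, so dead entries from finished spot jobs take less GCS memory.
import os
import subprocess

env = os.environ.copy()
# Assumed value; the suggestion above is only to go below 100k.
env['RAY_maximum_gcs_destroyed_actor_cached_count'] = '10000'
subprocess.run(['ray', 'start', '--head'], env=env, check=True)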

Based on our current way of launching the SkyPilot job, each spot job will have more than 16 (one per node for setup) + 16 (one per node for run) + 1 (ray job supervisor) = 33 ray tasks. According to the reference above, each dead ray task can cause ~15MB of memory consumption, leading to roughly 50 * 30 * 15 / 1024 ≈ 22GB for the dead ray tasks. That is close to the 32GB memory upper bound of the spot controller, considering there are also some ray up workloads that can take a lot of memory.
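As a quick sanity check of that estimate (the numbers are the rough ones above, not measurements from the controller):

# Back-of-the-envelope check of the dead-ray-task memory estimate.
finished_jobs = 50        # jobs already finished on this controller
tasks_per_job = 30        # ~16 setup + ~16 run + 1 supervisor, rounded
mb_per_dead_task = 15     # per the linked Ray discussion
total_gb = finished_jobs * tasks_per_job * mb_per_dead_task / 1024
print(f'{total_gb:.1f} GB')  # ~22.0 GB, against a 32GB controller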

That said, we still need to figure out why the top memory consumer is the multiprocessing process.

@Michaelvll
Collaborator Author

Michaelvll commented Oct 6, 2023

Additional logging:

(<task-name>, pid=1598212) I 10-06 15:35:09 cloud_vm_ray_backend.py:1807] Launching on AWS us-west-2 (us-west-2b)
Traceback (most recent call last):
  File "/home/ubuntu/.sky/sky_app/sky_job_55", line 452, in <module>
    returncodes = ray.get(futures)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2537, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: <ip>, ID: 752666e00e5bf1cca24c49c6a40e0a68b0ba099d9f8b04ce6f5929a3) where the task (task ID: 6aae942a0d68c18ef28b98818a2417ea163325070a000000, name=<task-name>,, pid=1598212, memory used=0.09GB) was running was 29.39GB / 30.84GB (0.952786), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: cc44e735f5d501996315691dcc97f6e752974ba935bd9ab27ac20060) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip <ip>`. To see the logs of the worker, use `ray logs worker-cc44e735f5d501996315691dcc97f6e752974ba935bd9ab27ac20060*out -ip <ip>. Top 10 memory users:
PID	MEM(GB)	COMMAND
950070	15.12	/opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_...
1598333	5.11	/opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_...
1611919	5.06	/opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_...
1732	0.54	/opt/conda/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray_skypilot/...
1883	0.15	/opt/conda/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray...
1977	0.12	/opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/dashboard/agent.py --node-i...
1952252	0.11	/opt/conda/bin/python3 /tmp/skypilot_ray_up_j5y2hven.py
1766	0.11	/opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor...
949935	0.09	python3 -u /home/ubuntu/.sky/sky_app/sky_job_53
1598164	0.09	python3 -u /home/ubuntu/.sky/sky_app/sky_job_54
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
> sudo env "PATH=$PATH" py-spy dump --pid 950070
Process 950070: /opt/conda/bin/python3 -Wignore -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=10, pipe_handle=12) --multiprocessing-fork
Python v3.10.6 (/opt/conda/bin/python3.10)

Thread 950070 (idle): "MainThread"
    _run_one_task (sky/spot/controller.py:209)
    run (sky/spot/controller.py:338)
    _run_controller (sky/spot/controller.py:406)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:129)
    spawn_main (multiprocessing/spawn.py:116)
    <module> (<string>:1)

@Michaelvll
Collaborator Author

Michaelvll commented Oct 6, 2023

It seems the while loop in our controller increases memory consumption by ~5MB per iteration (every 20 seconds). After a day (24 hours, i.e. ~4,320 iterations), that adds up to ~21GB of memory. A rough way to confirm the per-iteration growth is sketched below, after the repro command.

time.sleep(spot_utils.JOB_STATUS_CHECK_GAP_SECONDS)
# Check the network connection to avoid false alarm for job failure.
# Network glitch was observed even in the VM.
try:
    backend_utils.check_network_connection()
except exceptions.NetworkError:
    logger.info(
        'Network is not available. Retrying again in '
        f'{spot_utils.JOB_STATUS_CHECK_GAP_SECONDS} seconds.')
    continue
# NOTE: we do not check cluster status first because race condition
# can occur, i.e. cluster can be down during the job status check.
job_status = spot_utils.get_job_status(self._backend, cluster_name)
if job_status == job_lib.JobStatus.SUCCEEDED:
    end_time = spot_utils.get_job_timestamp(self._backend,
                                            cluster_name,
                                            get_end_time=True)
    # The job is done.
    spot_state.set_succeeded(self._job_id,
                             task_id,
                             end_time=end_time,
                             callback_func=callback_func)
    logger.info(
        f'Spot job {self._job_id} (task: {task_id}) SUCCEEDED. '
        f'Cleaning up the spot cluster {cluster_name}.')
    # Only clean up the spot cluster, not the storages, because
    # tasks may share storages.
    recovery_strategy.terminate_cluster(cluster_name=cluster_name)
    return True
# For single-node jobs, nonterminated job_status indicates a
# healthy cluster. We can safely continue monitoring.
# For multi-node jobs, since the job may not be set to FAILED
# immediately (depending on user program) when only some of the
# nodes are preempted, need to check the actual cluster status.
if (job_status is not None and not job_status.is_terminal() and
        task.num_nodes == 1):
    continue
if job_status in [
        job_lib.JobStatus.FAILED, job_lib.JobStatus.FAILED_SETUP
]:
    # Add a grace period before the check of preemption to avoid
    # false alarm for job failure.
    time.sleep(5)
# Pull the actual cluster status from the cloud provider to
# determine whether the cluster is preempted.
(cluster_status,
 handle) = backend_utils.refresh_cluster_status_handle(
     cluster_name,
     force_refresh_statuses=set(status_lib.ClusterStatus))
if cluster_status != status_lib.ClusterStatus.UP:
    # The cluster is (partially) preempted. It can be down, INIT
    # or STOPPED, based on the interruption behavior of the cloud.
    # Spot recovery is needed (will be done later in the code).
    cluster_status_str = ('' if cluster_status is None else
                          f' (status: {cluster_status.value})')
    logger.info(
        f'Cluster is preempted{cluster_status_str}. Recovering...')
else:
    if job_status is not None and not job_status.is_terminal():
        # The multi-node job is still running, continue monitoring.
        continue

Reproduction command:

sky spot launch -n test-long-running-job --cloud aws --cpus 2 --num-nodes 16 sleep 1000000000000000
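A rough way to confirm the suspected per-iteration growth would be to log the controller process's RSS around each monitoring iteration. This is only an illustrative sketch, not how the ~5MB number was obtained; psutil and the bare loop are assumptions:

# Illustrative sketch: watch the controller's RSS once per monitoring iteration.
# At ~5MB growth per 20-second iteration, ~4320 iterations/day would add ~21GB.
import time
import psutil

proc = psutil.Process()
while True:
    rss_mb = proc.memory_info().rss / 1024 / 1024
    print(f'controller RSS: {rss_mb:.1f} MB')
    time.sleep(20)  # stand-in for spot_utils.JOB_STATUS_CHECK_GAP_SECONDS
    # ... job status check / recovery logic from the loop above ...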

@Michaelvll
Collaborator Author

Bisected the git history and found the commit that causes this OOM issue: #2288
