PyTorch recover after CUDA OOM with restart does not work with CUDA #1489

Closed
albertz opened this issue Jan 6, 2024 · 3 comments

albertz commented Jan 6, 2024

RETURNN starting up, version 1.20240105.140136+git.7844bc91, date/time 2024-01-05-22-15-32 (UTC+0000), pid 2681751, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/output/returnn.config']
Hostname: cn-504
...
...
...
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/frontend/_backend.py", line 1502, in TorchBackend.masked_select
    line: out_raw = torch.masked_select(in_raw, mask_raw)
    locals:
      out_raw = <not found>
      torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
      torch.masked_select = <global> <built-in method masked_select of type object at 0x7fe9049aeaa0>
      in_raw = <local> tensor[19, 75, 10025] n=14285625 (54Mb) x∈[-10.679, 17.053] μ=-0.066 σ=1.119 grad CloneBackward0 cuda:0
      mask_raw = <local> tensor[19, 75, 1] bool n=1425 (1.4Kb) x∈[False, True] μ=0.735 σ=0.442 cuda:0 
OutOfMemoryError: CUDA out of memory. Tried to allocate 242.00 MiB. GPU 0 has a total capacty of 22.03 GiB of which 66.88 MiB is free. Including non-PyTorch memory, this process has 21.97 GiB memory in use. Of the allocated memory 19.44 GiB is allocated by PyTorch, and 1.16 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Module call stack:
(No module call frames.)

Retry after OOM now.
Restart RETURNN in train epoch 222, global train step 107953.





WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py'). 
RETURNN starting up, version 1.20240105.140136+git.7844bc91, date/time 2024-01-06-00-53-33 (UTC+0000), pid 2681751, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/output/returnn.config', '++restart_after_train_exception', '222,1']
Hostname: cn-504
Installed native_signal_handler.so.
...
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
CUDA_VISIBLE_DEVICES is set to '1'.
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
(CUDA not available)
...
EXCEPTION
Traceback (most recent call last):
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/util/diagnose_gpu.py", line 97, in diagnose_no_gpu
    line: torch.cuda.init()
    locals:
      torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
      torch.cuda = <global> <module 'torch.cuda' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/cuda/__init__.py'>
      torch.cuda.init = <global> <function init at 0x7f20ad764ea0>
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 265, in init
    line: _lazy_init()
    locals:
      _lazy_init = <global> <function _lazy_init at 0x7f20ad764fe0>
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    line: torch._C._cuda_init()
    locals:
      torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
      torch._C = <global> <module 'torch._C' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/_C.cpython-311-x86_64-linux-gnu.so'>
      torch._C._cuda_init = <global> <built-in function _cuda_init>
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS
Sat Jan  6 01:53:47 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10                      Off| 00000000:17:00.0 Off |                    0 |
|  0%   69C    P0              139W / 150W|  18356MiB / 23028MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10                      Off| 00000000:65:00.0 Off |                    0 |
|  0%   54C    P0               63W / 150W|  22496MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10                      Off| 00000000:CA:00.0 Off |                    0 |
|  0%   36C    P0               55W / 150W|  21906MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10                      Off| 00000000:E3:00.0 Off |                    0 |
|  0%   61C    P0              154W / 150W|  16908MiB / 23028MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2622486      C   /usr/bin/python3                          18354MiB |
|    1   N/A  N/A   2681751      C   ...envs/py3.11-torch2.1/bin/python3.11    22494MiB |
|    2   N/A  N/A   2665596      C   ...nn/envs/fairseq_newtorch/bin/python    21904MiB |
|    3   N/A  N/A   2678568      C   /usr/bin/python3                          16906MiB |
+---------------------------------------------------------------------------------------+
CUDA_VISIBLE_DEVICES: 1
LD_LIBRARY_PATH: /usr/lib/x86_64-linux-gnu:/usr/local/cudnn-11.X-v8.4/lib64:/usr/local/cuda-11.7/lib64:/usr/local/cuda-11.7/extras/CUPTI/lib64:/usr/local/cudnn-10.1-v7.6/lib64:/usr/local/cudnn-9.1-v7.1/lib64:/usr/local/cudnn-8.0-v7.0/lib64:/usr/local/cudnn-8.0-v6.0/lib64:/usr/local/cudnn-8.0-v5.1/lib64:/usr/local/cuda-9.1/lib64:/usr/local/cuda-9.1/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/lib:/usr/lib/atlas-base:/usr/local/cuda-7.5/lib64
torch.cuda.init() failed: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS
Unhandled exception <class 'Exception'> in thread <_MainThread(MainThread, started 139780166780736)>, proc 2681751.
...
Exception: No GPU device found, but config requested 'gpu' device.
torch.cuda.init() failed: RuntimeError Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS

So the error: torch.cuda.init() failed: RuntimeError Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS.

Our RETURNN restart logic basically just calls execv. We should probably do some CUDA shutdown/cleanup before that, maybe by calling the C atexit handlers.
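
For illustration, a minimal sketch of what such a restart path could look like, with a hypothetical CUDA cleanup step before the execv. This is not the actual RETURNN code, and it is unclear whether any Python-level cleanup is sufficient here:

```python
# Rough sketch, not the actual RETURNN code: restart the current process via execv,
# but try to release CUDA state first. Whether these cleanup calls are enough is
# exactly the open question here; execv skips the C atexit handlers, so the driver
# context may still end up in a bad state (error 304) after the restart.
import os
import sys

import torch


def restart_via_execv():
    if torch.cuda.is_available() and torch.cuda.is_initialized():
        torch.cuda.synchronize()  # wait for all pending kernels
        torch.cuda.empty_cache()  # release cached allocator blocks back to the driver
        torch.cuda.ipc_collect()  # drop CUDA IPC handles
    sys.stdout.flush()
    sys.stderr.flush()
    # Replaces the current process image in-place; Python and C atexit handlers do not run.
    os.execv(sys.executable, [sys.executable] + sys.argv)
```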

Related CPython issue: python/cpython#61026

Related other issues:
nedbat/coveragepy#43
qtile/qtile#2043

albertz commented Jan 6, 2024

One option would be to make sure there is a parent process trampoline (which could be RETURNN itself) that would:

  • Start a subprocess with the main task.
  • The subprocess might signal in some way that it wants a restart.
  • If there was no special signal, the parent would just stop once the subprocess dies.
    This could maybe even be extended to directly cover some failure cases (e.g. CPU OOM) and to restart then if there was progress (see the sketch after this list).
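
A minimal sketch of such a trampoline, assuming a made-up exit-code convention for the restart signal (this is not the actual implementation):

```python
# Hypothetical parent-process trampoline. The exit-code convention (RESTART_EXIT_CODE)
# is an assumption for illustration; any other signaling mechanism would work as well.
import subprocess
import sys

RESTART_EXIT_CODE = 42  # assumed: the child requests a restart by exiting with this code


def run_with_trampoline(cmd):
    while True:
        ret = subprocess.call(cmd)
        if ret == RESTART_EXIT_CODE:
            # The child asked for a restart: launch it again as a completely fresh
            # process, so no CUDA context or other C-level state is carried over.
            continue
        # Any other exit (success or failure): just propagate the exit code.
        return ret


if __name__ == "__main__":
    # E.g. wrap the real training command given on the command line.
    sys.exit(run_with_trampoline(sys.argv[1:]))
```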

albertz changed the title from "PyTorch recover after OOM with restart does not work with CUDA" to "PyTorch recover after CUDA OOM with restart does not work with CUDA" on Jan 6, 2024

albertz commented Jan 17, 2024

We have a solution now. See the use_train_proc_manager option.
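
For users, this amounts to enabling it in the RETURNN config, e.g. (assuming the option is a plain boolean flag, as the name suggests):

```python
# In the RETURNN config:
use_train_proc_manager = True
```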

albertz closed this as completed Jan 17, 2024

albertz commented Jan 17, 2024

Note: We could still also add an option like forward_auto_split_batch_on_oom but for training, even though it might not be totally correct in all cases due to batch norm.
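
A rough sketch of what such a training-time auto-split could look like (hypothetical, not an existing option; the names and the recursive halving are made up for illustration, and as noted above, batch-norm statistics would differ from a single full-batch step):

```python
import torch


def train_step_with_oom_split(model, loss_fn, batch, min_batch_size=1):
    """Run one backward pass; on CUDA OOM, split the batch in half and retry.

    Gradients accumulate across the backward() calls on the halves, which is only
    approximately equivalent to a single full-batch step (e.g. batch norm differs).
    """
    try:
        loss = loss_fn(model, batch)
        loss.backward()
    except torch.cuda.OutOfMemoryError:
        if len(batch) <= min_batch_size:
            raise  # cannot split further
        torch.cuda.empty_cache()  # give back cached blocks before retrying
        half = len(batch) // 2
        train_step_with_oom_split(model, loss_fn, batch[:half], min_batch_size)
        train_step_with_oom_split(model, loss_fn, batch[half:], min_batch_size)
```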
