PyTorch recover after CUDA OOM with restart does not work with CUDA #1489

Closed
albertz opened this issue Jan 6, 2024 · 3 comments

albertz commented Jan 6, 2024

RETURNN starting up, version 1.20240105.140136+git.7844bc91, date/time 2024-01-05-22-15-32 (UTC+0000), pid 2681751, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/output/returnn.config']
Hostname: cn-504
...
...
...
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/frontend/_backend.py", line 1502, in TorchBackend.masked_select
    line: out_raw = torch.masked_select(in_raw, mask_raw)
    locals:
      out_raw = <not found>
      torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
      torch.masked_select = <global> <built-in method masked_select of type object at 0x7fe9049aeaa0>
      in_raw = <local> tensor[19, 75, 10025] n=14285625 (54Mb) x∈[-10.679, 17.053] μ=-0.066 σ=1.119 grad CloneBackward0 cuda:0
      mask_raw = <local> tensor[19, 75, 1] bool n=1425 (1.4Kb) x∈[False, True] μ=0.735 σ=0.442 cuda:0 
OutOfMemoryError: CUDA out of memory. Tried to allocate 242.00 MiB. GPU 0 has a total capacty of 22.03 GiB of which 66.88 MiB is free. Including non-PyTorch memory, this process has 21.97 GiB memory in use. Of the allocated memory 19.44 GiB is allocated by PyTorch, and 1.16 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Module call stack:
(No module call frames.)

Retry after OOM now.
Restart RETURNN in train epoch 222, global train step 107953.





WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py'). 
RETURNN starting up, version 1.20240105.140136+git.7844bc91, date/time 2024-01-06-00-53-33 (UTC+0000), pid 2681751, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/output/returnn.config', '++restart_after_train_exception', '222,1']
Hostname: cn-504
Installed native_signal_handler.so.
...
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
CUDA_VISIBLE_DEVICES is set to '1'.
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
(CUDA not available)
...
EXCEPTION
Traceback (most recent call last):
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/util/diagnose_gpu.py", line 97, in diagnose_no_gpu
    line: torch.cuda.init()
    locals:
      torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
      torch.cuda = <global> <module 'torch.cuda' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/cuda/__init__.py'>
      torch.cuda.init = <global> <function init at 0x7f20ad764ea0>
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 265, in init
    line: _lazy_init()
    locals:
      _lazy_init = <global> <function _lazy_init at 0x7f20ad764fe0>
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    line: torch._C._cuda_init()
    locals:
      torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
      torch._C = <global> <module 'torch._C' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/_C.cpython-311-x86_64-linux-gnu.so'>
      torch._C._cuda_init = <global> <built-in function _cuda_init>
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS
Sat Jan  6 01:53:47 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10                      Off| 00000000:17:00.0 Off |                    0 |
|  0%   69C    P0              139W / 150W|  18356MiB / 23028MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10                      Off| 00000000:65:00.0 Off |                    0 |
|  0%   54C    P0               63W / 150W|  22496MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10                      Off| 00000000:CA:00.0 Off |                    0 |
|  0%   36C    P0               55W / 150W|  21906MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10                      Off| 00000000:E3:00.0 Off |                    0 |
|  0%   61C    P0              154W / 150W|  16908MiB / 23028MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2622486      C   /usr/bin/python3                          18354MiB |
|    1   N/A  N/A   2681751      C   ...envs/py3.11-torch2.1/bin/python3.11    22494MiB |
|    2   N/A  N/A   2665596      C   ...nn/envs/fairseq_newtorch/bin/python    21904MiB |
|    3   N/A  N/A   2678568      C   /usr/bin/python3                          16906MiB |
+---------------------------------------------------------------------------------------+
CUDA_VISIBLE_DEVICES: 1
LD_LIBRARY_PATH: /usr/lib/x86_64-linux-gnu:/usr/local/cudnn-11.X-v8.4/lib64:/usr/local/cuda-11.7/lib64:/usr/local/cuda-11.7/extras/CUPTI/lib64:/usr/local/cudnn-10.1-v7.6/lib64:/usr/local/cudnn-9.1-v7.1/lib64:/usr/local/cudnn-8.0-v7.0/lib64:/usr/local/cudnn-8.0-v6.0/lib64:/usr/local/cudnn-8.0-v5.1/lib64:/usr/local/cuda-9.1/lib64:/usr/local/cuda-9.1/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/lib:/usr/lib/atlas-base:/usr/local/cuda-7.5/lib64
torch.cuda.init() failed: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS
Unhandled exception <class 'Exception'> in thread <_MainThread(MainThread, started 139780166780736)>, proc 2681751.
...
Exception: No GPU device found, but config requested 'gpu' device.
torch.cuda.init() failed: RuntimeError Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS

So the error: torch.cuda.init() failed: RuntimeError Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS.

Our RETURNN restart logic basically just calls execv. We should probably do some CUDA shutdown/cleanup before that, maybe by calling the C atexit handlers.
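
For illustration, a minimal sketch of what such a restart path could look like, with a hypothetical CUDA cleanup step before the execv. This is not the actual RETURNN code, and it is unclear whether any Python-level cleanup is sufficient here:

```python
# Rough sketch, not the actual RETURNN code: restart the current process via execv,
# but try to release CUDA state first. Whether these cleanup calls are enough is
# exactly the open question here; execv skips the C atexit handlers, so the driver
# context may still end up in a bad state (error 304) after the restart.
import os
import sys

import torch


def restart_via_execv():
    if torch.cuda.is_available() and torch.cuda.is_initialized():
        torch.cuda.synchronize()  # wait for all pending kernels
        torch.cuda.empty_cache()  # release cached allocator blocks back to the driver
        torch.cuda.ipc_collect()  # drop CUDA IPC handles
    sys.stdout.flush()
    sys.stderr.flush()
    # Replaces the current process image in-place; Python and C atexit handlers do not run.
    os.execv(sys.executable, [sys.executable] + sys.argv)
```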

Related CPython issue: python/cpython#61026

Related other issues:
nedbat/coveragepy#43
qtile/qtile#2043

albertz commented Jan 6, 2024

One option would be to make sure there is a parent process trampoline (which could be RETURNN itself) that would:

  • Start a subprocess with the main task.
  • The subprocess might signal in some way that it wants a restart.
  • If there was no special signal, the parent would just stop once the subprocess dies.
    This could maybe even be extended to directly cover some failure cases (e.g. CPU OOM) and to restart then if there was progress (see the sketch after this list).
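
A minimal sketch of such a trampoline, assuming a made-up exit-code convention for the restart signal (this is not the actual implementation):

```python
# Hypothetical parent-process trampoline. The exit-code convention (RESTART_EXIT_CODE)
# is an assumption for illustration; any other signaling mechanism would work as well.
import subprocess
import sys

RESTART_EXIT_CODE = 42  # assumed: the child requests a restart by exiting with this code


def run_with_trampoline(cmd):
    while True:
        ret = subprocess.call(cmd)
        if ret == RESTART_EXIT_CODE:
            # The child asked for a restart: launch it again as a completely fresh
            # process, so no CUDA context or other C-level state is carried over.
            continue
        # Any other exit (success or failure): just propagate the exit code.
        return ret


if __name__ == "__main__":
    # E.g. wrap the real training command given on the command line.
    sys.exit(run_with_trampoline(sys.argv[1:]))
```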

albertz changed the title from "PyTorch recover after OOM with restart does not work with CUDA" to "PyTorch recover after CUDA OOM with restart does not work with CUDA" on Jan 6, 2024

albertz commented Jan 17, 2024

We have a solution now. See the use_train_proc_manager option.
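
For users, this amounts to enabling it in the RETURNN config, e.g. (assuming the option is a plain boolean flag, as the name suggests):

```python
# In the RETURNN config:
use_train_proc_manager = True
```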

albertz closed this as completed Jan 17, 2024

albertz commented Jan 17, 2024

Note: We could still also add an option like forward_auto_split_batch_on_oom but for training, even though it might not be totally correct in all cases due to batch norm.
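
A rough sketch of what such a training-time auto-split could look like (hypothetical, not an existing option; the names and the recursive halving are made up for illustration, and as noted above, batch-norm statistics would differ from a single full-batch step):

```python
import torch


def train_step_with_oom_split(model, loss_fn, batch, min_batch_size=1):
    """Run one backward pass; on CUDA OOM, split the batch in half and retry.

    Gradients accumulate across the backward() calls on the halves, which is only
    approximately equivalent to a single full-batch step (e.g. batch norm differs).
    """
    try:
        loss = loss_fn(model, batch)
        loss.backward()
    except torch.cuda.OutOfMemoryError:
        if len(batch) <= min_batch_size:
            raise  # cannot split further
        torch.cuda.empty_cache()  # give back cached blocks before retrying
        half = len(batch) // 2
        train_step_with_oom_split(model, loss_fn, batch[:half], min_batch_size)
        train_step_with_oom_split(model, loss_fn, batch[half:], min_batch_size)
```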
