RETURNN starting up, version 1.20240105.140136+git.7844bc91, date/time 2024-01-05-22-15-32 (UTC+0000), pid 2681751, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/output/returnn.config']
Hostname: cn-504
...
...
...
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/frontend/_backend.py", line 1502, in TorchBackend.masked_select
line: out_raw = torch.masked_select(in_raw, mask_raw)
locals:
out_raw = <not found>
torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
torch.masked_select = <global> <built-in method masked_select of type object at 0x7fe9049aeaa0>
in_raw = <local> tensor[19, 75, 10025] n=14285625 (54Mb) x∈[-10.679, 17.053] μ=-0.066 σ=1.119 grad CloneBackward0 cuda:0
mask_raw = <local> tensor[19, 75, 1] bool n=1425 (1.4Kb) x∈[False, True] μ=0.735 σ=0.442 cuda:0
OutOfMemoryError: CUDA out of memory. Tried to allocate 242.00 MiB. GPU 0 has a total capacty of 22.03 GiB of which 66.88 MiB is free. Including non-PyTorch memory, this process has 21.97 GiB memory in use. Of the allocated memory 19.44 GiB is allocated by PyTorch, and 1.16 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Module call stack:
(No module call frames.)
Retry after OOM now.
Restart RETURNN in train epoch 222, global train step 107953.
WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py').
RETURNN starting up, version 1.20240105.140136+git.7844bc91, date/time 2024-01-06-00-53-33 (UTC+0000), pid 2681751, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/output/returnn.config', '++restart_after_train_exception', '222,1']
Hostname: cn-504
Installed native_signal_handler.so.
...
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
CUDA_VISIBLE_DEVICES is set to '1'.
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
(CUDA not available)
...
EXCEPTION
Traceback (most recent call last):
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/util/diagnose_gpu.py", line 97, in diagnose_no_gpu
line: torch.cuda.init()
locals:
torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
torch.cuda = <global> <module 'torch.cuda' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/cuda/__init__.py'>
torch.cuda.init = <global> <function init at 0x7f20ad764ea0>
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 265, in init
line: _lazy_init()
locals:
_lazy_init = <global> <function _lazy_init at 0x7f20ad764fe0>
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
line: torch._C._cuda_init()
locals:
torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
torch._C = <global> <module 'torch._C' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/_C.cpython-311-x86_64-linux-gnu.so'>
torch._C._cuda_init = <global> <built-in function _cuda_init>
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS
Sat Jan 6 01:53:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03 Driver Version: 530.41.03 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10 Off| 00000000:17:00.0 Off | 0 |
| 0% 69C P0 139W / 150W| 18356MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10 Off| 00000000:65:00.0 Off | 0 |
| 0% 54C P0 63W / 150W| 22496MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10 Off| 00000000:CA:00.0 Off | 0 |
| 0% 36C P0 55W / 150W| 21906MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10 Off| 00000000:E3:00.0 Off | 0 |
| 0% 61C P0 154W / 150W| 16908MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2622486 C /usr/bin/python3 18354MiB |
| 1 N/A N/A 2681751 C ...envs/py3.11-torch2.1/bin/python3.11 22494MiB |
| 2 N/A N/A 2665596 C ...nn/envs/fairseq_newtorch/bin/python 21904MiB |
| 3 N/A N/A 2678568 C /usr/bin/python3 16906MiB |
+---------------------------------------------------------------------------------------+
CUDA_VISIBLE_DEVICES: 1
LD_LIBRARY_PATH: /usr/lib/x86_64-linux-gnu:/usr/local/cudnn-11.X-v8.4/lib64:/usr/local/cuda-11.7/lib64:/usr/local/cuda-11.7/extras/CUPTI/lib64:/usr/local/cudnn-10.1-v7.6/lib64:/usr/local/cudnn-9.1-v7.1/lib64:/usr/local/cudnn-8.0-v7.0/lib64:/usr/local/cudnn-8.0-v6.0/lib64:/usr/local/cudnn-8.0-v5.1/lib64:/usr/local/cuda-9.1/lib64:/usr/local/cuda-9.1/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/lib:/usr/lib/atlas-base:/usr/local/cuda-7.5/lib64
torch.cuda.init() failed: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS
Unhandled exception <class 'Exception'> in thread <_MainThread(MainThread, started 139780166780736)>, proc 2681751.
...
Exception: No GPU device found, but config requested 'gpu' device.
torch.cuda.init() failed: RuntimeError Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS
So the error is: torch.cuda.init() failed: RuntimeError Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS.
Our RETURNN restart logic simply calls execv. We should probably do some CUDA shutdown/cleanup before that, maybe by calling the C atexit handlers.
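For illustration, a minimal sketch of such a cleanup-before-restart; the function name and the exact cleanup steps are assumptions here, not RETURNN's actual restart code:

```python
import os
import sys

import torch


def restart_with_cleanup(argv):
    """Hypothetical sketch: try to release CUDA state before re-executing ourselves."""
    if torch.cuda.is_available() and torch.cuda.is_initialized():
        # Wait for pending kernels and drop cached allocations, so the driver is in a
        # cleaner state when the new process initializes CUDA again.
        torch.cuda.synchronize()
        torch.cuda.empty_cache()
    # os.execv replaces the current process image in place; C atexit handlers are not
    # run on execv, which is likely part of the problem described above.
    os.execv(sys.executable, [sys.executable] + argv)
```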
One option would be to make sure there is a parent-process trampoline (which could be RETURNN itself), which would:
- Start a subprocess with the main task.
- Let the subprocess signal in some way that it wants a restart (see the sketch below).
- If there was no special signal, just stop once the subprocess dies.

This could maybe even be extended to directly cover some failure cases (e.g. CPU OOM) and to restart then only if there was progress.
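A rough sketch of that trampoline idea, assuming a hypothetical convention where the child requests a restart via a dedicated exit code:

```python
import subprocess
import sys

# Hypothetical convention: the child exits with this code to request a restart.
RESTART_EXIT_CODE = 42


def trampoline(child_argv):
    """Sketch of a parent-process trampoline that itself never touches CUDA."""
    while True:
        ret = subprocess.run([sys.executable] + child_argv).returncode
        if ret == RESTART_EXIT_CODE:
            # The child (e.g. after CUDA OOM) asked for a restart: start a fresh
            # process, which gets a completely fresh CUDA context.
            continue
        # No special signal: just stop once the subprocess dies.
        return ret
```

Because the parent never initializes CUDA, every restart runs in a new process with a clean CUDA context, which avoids the broken-context state seen in the log above.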
Note: We can still also have an option like forward_auto_split_batch_on_oom but for training, even though it might not be totally correct in all cases due to batch norm; see the sketch below.
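As a rough illustration of what such a training variant could look like (not RETURNN's forward_auto_split_batch_on_oom implementation; the batch slicing and the per-half optimizer steps are simplifying assumptions, and as noted, batch norm statistics would differ from a single full-batch step):

```python
import torch


def train_step_auto_split(model, loss_fn, optimizer, batch, depth=0, max_depth=3):
    """Hypothetical sketch: retry a training step on halved batches after CUDA OOM.

    ``batch`` is assumed to be sliceable along the batch dimension.
    """
    try:
        optimizer.zero_grad()
        loss = loss_fn(model, batch)
        loss.backward()
        optimizer.step()
    except torch.cuda.OutOfMemoryError:
        if depth >= max_depth or len(batch) < 2:
            raise  # cannot split further, give up
        torch.cuda.empty_cache()
        half = len(batch) // 2
        for sub_batch in (batch[:half], batch[half:]):
            train_step_auto_split(model, loss_fn, optimizer, sub_batch, depth + 1, max_depth)
```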
Related CPython issue: python/cpython#61026
Related other issues:
nedbat/coveragepy#43
qtile/qtile#2043