Description
🐛 Bug
Running the test suite fails on our system. The failures appear to start in TestTorchDeviceTypeCUDA: beginning with test_blas_alpha_beta_empty_cuda_float16, every subsequent test fails with
RuntimeError: CUDA error: an illegal memory access was encountered
To Reproduce
Steps to reproduce the behavior:
- python run_tests.py
One of the tracebacks is:
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 241, in instantiated_test
    result = test(self, device_arg, dtype)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 411, in dep_fn
    return fn(slf, device, *args, **kwargs)
  File "test_torch.py", line 13909, in test_blas_alpha_beta_empty
    torch.addmv(input=input, mat=mat, vec=vec, alpha=alpha, beta=beta))
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1080, in assertEqual
    exact_device=exact_device)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 971, in _compareTensors
    return _compare_tensors_internal(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/__init__.py", line 122, in _compare_tensors_internal
    if torch.allclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan):
RuntimeError: CUDA error: an illegal memory access was encountered
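Since CUDA kernel launches are asynchronous, the frame reported in the traceback above (torch.allclose) is not necessarily the call that faulted. A minimal sketch for narrowing this down: run only the first failing test in a fresh process with CUDA_LAUNCH_BLOCKING=1, so errors surface at the launching call. The helper names (build_command, build_env, run_isolated) are hypothetical, and the sketch assumes test_torch.py accepts a unittest-style test id on the command line.

```python
import os
import subprocess
import sys

# Test id from the report above; the file name comes from the traceback.
FIRST_FAILING_TEST = (
    "TestTorchDeviceTypeCUDA.test_blas_alpha_beta_empty_cuda_float16"
)


def build_command(test_id=FIRST_FAILING_TEST, test_file="test_torch.py"):
    """Build a command that runs a single test in a fresh interpreter.

    Assumption: the test file accepts a unittest-style test id argument.
    """
    return [sys.executable, test_file, test_id]


def build_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches."""
    env = dict(os.environ if base_env is None else base_env)
    # Real CUDA setting: makes kernel launches block until completion,
    # so an error is reported at the call that triggered it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_isolated(test_dir="."):
    """Hypothetical helper: run the test from the directory holding test_torch.py."""
    result = subprocess.run(build_command(), cwd=test_dir, env=build_env())
    return result.returncode
```

Calling run_isolated() from the directory that contains test_torch.py returns the exit code of the test process. If the isolated test passes, the illegal access may originate in an earlier test and only surface here.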
Maybe related to #21819 or #36722
Environment
PyTorch version: 1.6.0-rc2
Is debug build: N/A
CUDA used to build PyTorch: N/A
OS: Red Hat Enterprise Linux Server release 7.8 (Maipo)
GCC version: (GCC) 8.3.0
CMake version: version 3.15.3
Python version: 3.7
Is CUDA available: N/A
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80
GPU 2: Tesla K80
GPU 3: Tesla K80
Nvidia driver version: 450.36.06
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.17.3