Error in test suite: an illegal memory access was encountered #41340

@Flamefire

🐛 Bug

Running the test suite fails on our system. The failures appear to originate in TestTorchDeviceTypeCUDA: starting with test_blas_alpha_beta_empty_cuda_float16, every subsequent test fails with RuntimeError: CUDA error: an illegal memory access was encountered

To Reproduce

Steps to reproduce the behavior:

  1. python run_tests.py

One of the tracebacks:

Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 241, in instantiated_test
    result = test(self, device_arg, dtype)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 411, in dep_fn
    return fn(slf, device, *args, **kwargs)
  File "test_torch.py", line 13909, in test_blas_alpha_beta_empty
    torch.addmv(input=input, mat=mat, vec=vec, alpha=alpha, beta=beta))
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1080, in assertEqual
    exact_device=exact_device)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 971, in _compareTensors
    return _compare_tensors_internal(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/__init__.py", line 122, in _compare_tensors_internal
    if torch.allclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan):
RuntimeError: CUDA error: an illegal memory access was encountered
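
CUDA errors are generally reported asynchronously, so the torch.allclose frame above may only be where the error surfaced rather than the call that caused it. A minimal sketch for narrowing this down is to rerun only the first failing test with the CUDA_LAUNCH_BLOCKING environment variable set to 1, which makes kernel launches synchronous. The helper names below are hypothetical, and the sketch assumes it is run against the PyTorch test directory containing test_torch.py; it has not been run on the reported system.

```python
import os
import subprocess
import sys

# Test id taken from the report; the script name comes from the traceback.
TEST_ID = "TestTorchDeviceTypeCUDA.test_blas_alpha_beta_empty_cuda_float16"


def blocking_test_command(test_id=TEST_ID, script="test_torch.py"):
    """Build the command and environment for one synchronous test run.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    env = dict(os.environ)
    # Make each CUDA kernel launch finish before control returns to Python,
    # so the traceback points at the call that actually failed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # unittest-style invocation: verbose flag, then the single test id.
    cmd = [sys.executable, script, "-v", test_id]
    return cmd, env


def run_blocking(test_id=TEST_ID, script="test_torch.py", cwd="test"):
    """Run the test and return its exit code.

    `cwd` is an assumption: the directory that contains `script`.
    """
    cmd, env = blocking_test_command(test_id, script)
    return subprocess.run(cmd, env=env, cwd=cwd).returncode
```

Calling run_blocking() from the root of a PyTorch checkout would run just that one test. If the traceback then ends inside torch.addmv rather than torch.allclose, that would point at the BLAS call itself.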

Maybe related to #21819 or #36722

Environment

PyTorch version: 1.6.0-rc2
Is debug build: N/A
CUDA used to build PyTorch: N/A

OS: Red Hat Enterprise Linux Server release 7.8 (Maipo)
GCC version: (GCC) 8.3.0
CMake version: version 3.15.3

Python version: 3.7
Is CUDA available: N/A
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80
GPU 2: Tesla K80
GPU 3: Tesla K80

Nvidia driver version: 450.36.06
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.17.3

cc @ezyang @gchanan @zou3519 @csarofeen @ptrblck @ngimel

    Labels

    high priority
    module: cublas (Problem related to cublas support)
    module: cuda (Related to torch.cuda, and CUDA support in general)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
