
Update hardcoded WIN-CUDA from 11.1 to 11.3 #5451


Merged · 3 commits · Feb 24, 2022

Conversation

malfet
Contributor

@malfet malfet commented Feb 21, 2022

Also, remove the reference to conda-forge; the whole CUDA toolchain should be
available from the NVIDIA channel.

Install h5py from pip on Windows and skip failing gaussian_blur tests if Win+CUDA11.3 is used
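The skip condition described in the summary could be sketched roughly as follows (a hypothetical helper; the actual test suite may express the check differently):

```python
from typing import Optional


def should_skip_gaussian_blur(platform: str, cuda_version: Optional[str]) -> bool:
    """Sketch of the condition described above: skip only on Windows
    combined with a CUDA 11.3 toolkit. CPU-only runs (cuda_version is
    None) and other platforms keep running the tests."""
    is_windows = platform.startswith("win") or platform == "cygwin"
    return is_windows and cuda_version is not None and cuda_version.startswith("11.3")


# In a pytest suite the helper could back a skipif marker, e.g.:
#   @pytest.mark.skipif(
#       should_skip_gaussian_blur(sys.platform, torch.version.cuda),
#       reason="gaussian_blur crashes on Windows + CUDA 11.3",
#   )
```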

@facebook-github-bot

facebook-github-bot commented Feb 21, 2022

💊 CI failures summary and remediations

As of commit 860566b (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


Contributor

@datumbox datumbox left a comment


LGTM, thanks @malfet. Let's merge after the CI passes.

@datumbox
Contributor

@malfet The failing test is due to an issue on a dependency: pyreadline/pyreadline#65

@pmeier Any thoughts on how we could get around it until this is patched upstream?

@malfet
Contributor Author

malfet commented Feb 21, 2022

@malfet The failing test is due to an issue on a dependency: pyreadline/pyreadline#65

[Edit] OK, for some reason h5py depends on pyreadline only on Windows. Added a constraint so that it is only installed on Python 3.9 or older on Windows.
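In pip's environment-marker syntax, such a constraint might look roughly like this (a sketch, not the actual requirements file from the PR):

```
# Install h5py everywhere except Windows + Python 3.10+,
# where its transitive pyreadline dependency is broken.
h5py; sys_platform != "win32" or python_version <= "3.9"
```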

@pmeier
Collaborator

pmeier commented Feb 21, 2022

This seems to be a problem with h5py from the defaults channel. The one from conda-forge should work fine; see conda-forge/h5py-feedstock#93. Is there a reason we use the defaults channel over conda-forge in the environments?

@malfet
Contributor Author

malfet commented Feb 22, 2022

This seems to be a problem with h5py from the defaults channel. The one from conda-forge should work fine; see conda-forge/h5py-feedstock#93. Is there a reason we use the defaults channel over conda-forge in the environments?

conda-forge environments are too often broken (in my mind, conda-forge is akin to Fedora Core: there are lots of cool new features, but half of the time the basic ones do not work). Another reason is that packages built against the defaults channel can be installed into conda-forge, but not the other way around. Moreover, both pytorch and torchvision are already present in conda-forge, so what's the point of duplicating community work?

@pmeier
Collaborator

pmeier commented Feb 23, 2022

Moreover, both pytorch and torchvision are already present in conda-forge, so what's the point of duplicating community work?

I'm not sure how this relates to the question. To elaborate, I was asking why we are using the defaults channel in our unittest (not build) environments. As pointed out, the h5py version from there is outdated, whereas the now-failing behavior has long been fixed on conda-forge.

I have personally never had issues with packages from conda-forge over defaults for development or test environments, but I also don't have that much experience.

@datumbox
Contributor

@malfet What is the recommended way to move forward? As you know, the release is near and we need to fix the CI jobs to avoid a bodged release. I'm also concerned that the unittest_windows_gpu_py3.8 job is failing with memory-access issues on this branch. Given that it doesn't fail on other branches, I believe it's due to the switch of CUDA versions.

The constraint added above didn't seem to work and the unittest_windows_cpu_py3.10 job still fails. If we can't use conda-forge, what should we use?
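The direction eventually taken, per the PR summary, was to install h5py from pip on Windows. In a conda environment file, pulling a single package from PyPI while keeping the rest on conda can be sketched like this (a hypothetical fragment; package names other than h5py are illustrative):

```
# Sketch: most packages come from the conda channel, but h5py is
# taken from PyPI so the broken conda pyreadline pin is bypassed.
dependencies:
  - pip
  - pip:
      - h5py
```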

Also, remove reference to conda-forge, all CUDA toolchain should be
available from NVIDIA channel
@malfet malfet force-pushed the malfet/move_win_gpu-to-11.3 branch from e23189d to 46f1697 Compare February 24, 2022 00:39
@malfet
Contributor Author

malfet commented Feb 24, 2022

OK, looks like gaussian blur fails with kernel sizes of 23x23, which sounds suspiciously familiar.
Here is a run of the test under cuda-memcheck:

(C:\Users\circleci\project\env) C:\Users\circleci\project\test>"c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin\cuda-memcheck.exe" pytest test_transforms_tensor.py -k test_gaussian_blur[1-meth_kwargs1
========= CUDA-MEMCHECK
================================================================================== test session starts ===================================================================================
platform win32 -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: C:\Users\circleci\project, configfile: pytest.ini
plugins: cov-3.0.0, mock-3.6.1
collected 3066 items / 3064 deselected / 2 selected

test_transforms_tensor.py .FE                                                                                                                                                       [100%]

========================================================================================= ERRORS =========================================================================================
______________________________________________________________ ERROR at teardown of test_gaussian_blur[1-meth_kwargs1-cuda] ______________________________________________________________
Traceback (most recent call last):
  File "C:\Users\circleci\project\test\conftest.py", line 104, in prevent_leaking_rng
    torch.cuda.set_rng_state(cuda_rng_state)
  File "C:\Users\circleci\project\env\lib\site-packages\torch\cuda\random.py", line 64, in set_rng_state
    _lazy_call(cb)
  File "C:\Users\circleci\project\env\lib\site-packages\torch\cuda\__init__.py", line 155, in _lazy_call
    callable()
  File "C:\Users\circleci\project\env\lib\site-packages\torch\cuda\random.py", line 62, in cb
    default_generator.set_state(new_state_copy)
RuntimeError: CUDA error: unspecified launch failure
======================================================================================== FAILURES ========================================================================================
________________________________________________________________________ test_gaussian_blur[1-meth_kwargs1-cuda] _________________________________________________________________________
Traceback (most recent call last):
  File "C:\Users\circleci\project\test\test_transforms_tensor.py", line 963, in test_gaussian_blur
    _test_class_op(
  File "C:\Users\circleci\project\test\test_transforms_tensor.py", line 85, in _test_class_op
    _test_transform_vs_scripted_on_batch(f, scripted_fn, batch_tensors)
  File "C:\Users\circleci\project\test\test_transforms_tensor.py", line 36, in _test_transform_vs_scripted_on_batch
    transformed_batch = transform(batch_tensors)
  File "C:\Users\circleci\project\env\lib\site-packages\torch\nn\modules\module.py", line 1111, in _call_impl
    return forward_call(*input, **kwargs)
  File "c:\users\circleci\project\torchvision\transforms\transforms.py", line 1817, in forward
    return F.gaussian_blur(img, self.kernel_size, [sigma, sigma])
  File "c:\users\circleci\project\torchvision\transforms\functional.py", line 1326, in gaussian_blur
    output = F_t.gaussian_blur(t_img, kernel_size, sigma)
  File "c:\users\circleci\project\torchvision\transforms\functional_tensor.py", line 774, in gaussian_blur
    img = conv2d(img, kernel, groups=img.shape[-3])
RuntimeError: CUDA error: unspecified launch failure
================================================================================ short test summary info =================================================================================
ERROR test_transforms_tensor.py::test_gaussian_blur[1-meth_kwargs1-cuda] - RuntimeError: CUDA error: unspecified launch failure
FAILED test_transforms_tensor.py::test_gaussian_blur[1-meth_kwargs1-cuda] - RuntimeError: CUDA error: unspecified launch failure
================================================================= 1 failed, 1 passed, 3064 deselected, 1 error in 35.57s =================================================================
========= Invalid __shared__ read of size 4
=========     at 0x00001d10 in volta_scudnn_128x32_3dconv_fprop_xregs_large_nn_v1
=========     by thread (95,0,0) in block (24,0,0)
=========     Address 0x0000250c is out of bounds
=========     Device Frame:volta_scudnn_128x32_3dconv_fprop_xregs_large_nn_v1 (volta_scudnn_128x32_3dconv_fprop_xregs_large_nn_v1 : 0x1d10)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x76888]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x76bb1]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x7b0da]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll (cuProfilerStop + 0x11cc6a) [0x33d9ea]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x17069d]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll (cuProfilerStop + 0xf0c72) [0x3119f2]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x38bdb]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x390af]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x39394]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll (cuLaunchKernel + 0x234) [0x20fc44]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll [0x3896]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll [0x26fd]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cask_cudnn::getPlatform + 0xe9) [0x1d54529]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cask_cudnn::transformTensor + 0x1bc1) [0x1dc0651]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cask_cudnn::transformTensor + 0xbe6a) [0x1dca8fa]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cask_cudnn::ConvDgradShader::isSplitK + 0x49b) [0x1ddcd9b]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::backend::Descriptor::initialize_internal + 0x618e) [0x5c67ce]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::backend::Descriptor::initialize_internal + 0x6eb1) [0x5c74f1]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::cnn::EngineInterface::execute + 0x7e) [0x4e163e]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::cnn::EngineContainer<1012,113664>::execute_internal_impl + 0x2a) [0x54f27a]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::cnn::EngineInterface::execute + 0x7e) [0x4e163e]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cask_cudnn::TensorDesc::operator== + 0x2d2) [0x544612]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::cnn::EngineContainer<1,4096>::execute_internal_impl + 0xd241) [0x55c4d1]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::cnn::EngineInterface::execute + 0x7e) [0x4e163e]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::backend::execute + 0x103f) [0x54eebf]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::backend::Tensor::Tensor + 0x18b6) [0x5ab246]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::backend::Tensor::Tensor + 0xbe1) [0x5aa571]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::cnn::convolutionForward + 0x10b) [0x65609b]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnnConvolutionForward + 0x331) [0x657081]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution_transpose + 0x4263) [0x48863]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution_transpose + 0x83a7) [0x4c9a7]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution_transpose + 0x7736) [0x4bd36]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution_transpose + 0x1ae5) [0x460e5]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution_transpose + 0x752f) [0x4bb2f]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution_add_relu + 0x16ec) [0x43fec]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution + 0xc5) [0x428e5]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cu.dll (at::cuda::view_as_real + 0x14adc) [0x456680c]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cu.dll (at::cuda::bucketize_outf + 0x3df7a) [0x450361a]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::_ops::cudnn_convolution::call + 0x242) [0x70175b2]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::native::_convolution + 0xf5e) [0x692064e]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::compositeexplicitautograd::xlogy_ + 0x40e) [0x72d1bee]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::compositeexplicitautograd::bmm + 0x1a1ed) [0x72975bd]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::_ops::_convolution::call + 0x2d6) [0x6d5a226]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::native::convolution + 0x164) [0x6928914]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::compositeexplicitautograd::xlogy_ + 0xc6b) [0x72d244b]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::compositeexplicitautograd::bmm + 0x1a2ca) [0x729769a]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::TensorMaker::make_tensor + 0x88e49) [0x6d40779]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::_ops::convolution::redispatch + 0x123) [0x6dc39a3]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (torch::autograd::GraphRoot::apply + 0x157b1) [0x7bfd851]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (torch::autograd::GraphRoot::apply + 0xc6c8) [0x7bf4768]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::_ops::convolution::call + 0x26f) [0x6d71b6f]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::native::conv2d + 0x1be) [0x69277be]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::compositeimplicitautograd::where + 0x1db4) [0x73ac8e4]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::compositeimplicitautograd::broadcast_to + 0x2a7a3) [0x738d953]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::_ops::conv2d::call + 0x219) [0x70ba239]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::conv2d + 0x64) [0x67106d4]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_python.dll (torch::FunctionSignature::operator= + 0x1096fc) [0x14d77c]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_python.dll (torch::FunctionSignature::operator= + 0x12f7ab) [0x17382b]
=========     Host Frame:C:\Users\circleci\project\env\python38.dll (PyMethodDef_RawFastCallKeywords + 0x410) [0x126fe0]
=========     Host Frame:C:\Users\circleci\project\env\python38.dll (PyObject_MakeTpCall + 0x106) [0x125fa6]
=========     Host Frame:C:\Users\circleci\project\env\python38.dll (PyEval_GetFuncDesc + 0x408) [0x2036b8]
=========
...

@IvanYashchuk, does this sound familiar?

@IvanYashchuk

No, I don't know the source of the problem.

@malfet malfet force-pushed the malfet/move_win_gpu-to-11.3 branch from b0890f5 to 737af02 Compare February 24, 2022 07:27
@malfet malfet force-pushed the malfet/move_win_gpu-to-11.3 branch from 737af02 to 860566b Compare February 24, 2022 07:56
@malfet malfet merged commit e5a5f0b into main Feb 24, 2022
@malfet malfet deleted the malfet/move_win_gpu-to-11.3 branch February 24, 2022 08:27
jdsgomes pushed a commit to jdsgomes/vision that referenced this pull request Feb 24, 2022
Also, remove reference to conda-forge, all CUDA toolchain should be
available from NVIDIA channel

Install h5py from pip on Windows and skip failing gaussian_blur tests if Win+CUDA11.3 is used
datumbox pushed a commit that referenced this pull request Feb 24, 2022
Also, remove reference to conda-forge, all CUDA toolchain should be
available from NVIDIA channel

Install h5py from pip on Windows and skip failing gaussian_blur tests if Win+CUDA11.3 is used

Co-authored-by: Nikita Shulga <[email protected]>
facebook-github-bot pushed a commit that referenced this pull request Feb 25, 2022
Summary:
Also, remove reference to conda-forge, all CUDA toolchain should be
available from NVIDIA channel

Install h5py from pip on Windows and skip failing gaussian_blur tests if Win+CUDA11.3 is used

Reviewed By: jdsgomes

Differential Revision: D34475316

fbshipit-source-id: 463ef028d315942efae956e8a5b314f8868a2975