
Update hardcoded WIN-CUDA from 11.1 to 11.3 #5451


Merged · 3 commits · Feb 24, 2022

Conversation

malfet
Contributor

@malfet malfet commented Feb 21, 2022

Also, remove the reference to conda-forge; the whole CUDA toolchain should be
available from the NVIDIA channel.

Install h5py from pip on Windows and skip failing gaussian_blur tests if Win+CUDA11.3 is used
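The skip condition described in the summary could be sketched roughly as follows (a hypothetical helper; the actual test suite may express the check differently):

```python
from typing import Optional


def should_skip_gaussian_blur(platform: str, cuda_version: Optional[str]) -> bool:
    """Sketch of the condition described above: skip only on Windows
    combined with a CUDA 11.3 toolkit. CPU-only runs (cuda_version is
    None) and other platforms keep running the tests."""
    is_windows = platform.startswith("win") or platform == "cygwin"
    return is_windows and cuda_version is not None and cuda_version.startswith("11.3")


# In a pytest suite the helper could back a skipif marker, e.g.:
#   @pytest.mark.skipif(
#       should_skip_gaussian_blur(sys.platform, torch.version.cuda),
#       reason="gaussian_blur crashes on Windows + CUDA 11.3",
#   )
```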

@facebook-github-bot

facebook-github-bot commented Feb 21, 2022

💊 CI failures summary and remediations

As of commit 860566b (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


Contributor

@datumbox datumbox left a comment


LGTM, thanks @malfet. Let's merge after the CI passes.

@datumbox
Contributor

@malfet The failing test is due to an issue on a dependency: pyreadline/pyreadline#65

@pmeier Any thoughts on how we could get around it until this is patched upstream?

@malfet
Contributor Author

malfet commented Feb 21, 2022

@malfet The failing test is due to an issue on a dependency: pyreadline/pyreadline#65

[Edit] OK, for some reason h5py depends on pyreadline only on Windows. Added a constraint so that it is only installed on Python 3.9 or older on Windows.
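In pip's environment-marker syntax, such a constraint might look roughly like this (a sketch, not the actual requirements file from the PR):

```
# Install h5py everywhere except Windows + Python 3.10+,
# where its transitive pyreadline dependency is broken.
h5py; sys_platform != "win32" or python_version <= "3.9"
```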

@pmeier
Collaborator

pmeier commented Feb 21, 2022

This seems to be a problem with h5py from the defaults channel. The one from conda-forge should work fine; see conda-forge/h5py-feedstock#93. Is there a reason we use the defaults channel over conda-forge in the environments?

@malfet
Contributor Author

malfet commented Feb 22, 2022

This seems to be a problem with h5py from the defaults channel. The one from conda-forge should work fine; see conda-forge/h5py-feedstock#93. Is there a reason we use the defaults channel over conda-forge in the environments?

conda-forge environments are too often broken (in my mind, conda-forge is akin to Fedora Core: there are lots of cool new features, but half of the time the basic ones do not work). Another reason is that packages built against the defaults channel can be installed into conda-forge, but not the other way around. Moreover, both pytorch and torchvision are already present in conda-forge, so what's the point of duplicating community work?

@pmeier
Collaborator

pmeier commented Feb 23, 2022

Moreover, both pytorch and torchvision are already present in conda-forge, so what's the point of duplicating community work?

I'm not sure how this relates to the question. To elaborate, I was asking why we are using the defaults channel in our unittest (not build) environments. As pointed out, the h5py version from there is outdated, whereas the now-failing behavior has long been fixed on conda-forge.

I have personally never had issues with packages from conda-forge over defaults for development or test environments, but I also don't have that much experience.

@datumbox
Contributor

@malfet What is the recommended way to move forward? As you know, the release is near and we need to fix the CI jobs to avoid a bodged release. I'm also concerned that the unittest_windows_gpu_py3.8 job is failing with memory-access issues on this branch. Given that it doesn't fail on other branches, I believe it's due to the switch of CUDA versions.

The constraint added above didn't seem to work and the unittest_windows_cpu_py3.10 job still fails. If we can't use conda-forge, what should we use?
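The direction eventually taken, per the PR summary, was to install h5py from pip on Windows. In a conda environment file, pulling a single package from PyPI while keeping the rest on conda can be sketched like this (a hypothetical fragment; package names other than h5py are illustrative):

```
# Sketch: most packages come from the conda channel, but h5py is
# taken from PyPI so the broken conda pyreadline pin is bypassed.
dependencies:
  - pip
  - pip:
      - h5py
```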

Also, remove reference to conda-forge, all CUDA toolchain should be
available from NVIDIA channel
@malfet malfet force-pushed the malfet/move_win_gpu-to-11.3 branch from e23189d to 46f1697 Compare February 24, 2022 00:39
@malfet
Contributor Author

malfet commented Feb 24, 2022

OK, looks like gaussian blur fails with kernel sizes of 23x23, which sounds suspiciously familiar.
Here is a run of the test under cuda-memcheck:

(C:\Users\circleci\project\env) C:\Users\circleci\project\test>"c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin\cuda-memcheck.exe" pytest test_transforms_tensor.py -k test_gaussian_blur[1-meth_kwargs1
========= CUDA-MEMCHECK
================================================================================== test session starts ===================================================================================
platform win32 -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: C:\Users\circleci\project, configfile: pytest.ini
plugins: cov-3.0.0, mock-3.6.1
collected 3066 items / 3064 deselected / 2 selected

test_transforms_tensor.py .FE                                                                                                                                                       [100%]

========================================================================================= ERRORS =========================================================================================
______________________________________________________________ ERROR at teardown of test_gaussian_blur[1-meth_kwargs1-cuda] ______________________________________________________________
Traceback (most recent call last):
  File "C:\Users\circleci\project\test\conftest.py", line 104, in prevent_leaking_rng
    torch.cuda.set_rng_state(cuda_rng_state)
  File "C:\Users\circleci\project\env\lib\site-packages\torch\cuda\random.py", line 64, in set_rng_state
    _lazy_call(cb)
  File "C:\Users\circleci\project\env\lib\site-packages\torch\cuda\__init__.py", line 155, in _lazy_call
    callable()
  File "C:\Users\circleci\project\env\lib\site-packages\torch\cuda\random.py", line 62, in cb
    default_generator.set_state(new_state_copy)
RuntimeError: CUDA error: unspecified launch failure
======================================================================================== FAILURES ========================================================================================
________________________________________________________________________ test_gaussian_blur[1-meth_kwargs1-cuda] _________________________________________________________________________
Traceback (most recent call last):
  File "C:\Users\circleci\project\test\test_transforms_tensor.py", line 963, in test_gaussian_blur
    _test_class_op(
  File "C:\Users\circleci\project\test\test_transforms_tensor.py", line 85, in _test_class_op
    _test_transform_vs_scripted_on_batch(f, scripted_fn, batch_tensors)
  File "C:\Users\circleci\project\test\test_transforms_tensor.py", line 36, in _test_transform_vs_scripted_on_batch
    transformed_batch = transform(batch_tensors)
  File "C:\Users\circleci\project\env\lib\site-packages\torch\nn\modules\module.py", line 1111, in _call_impl
    return forward_call(*input, **kwargs)
  File "c:\users\circleci\project\torchvision\transforms\transforms.py", line 1817, in forward
    return F.gaussian_blur(img, self.kernel_size, [sigma, sigma])
  File "c:\users\circleci\project\torchvision\transforms\functional.py", line 1326, in gaussian_blur
    output = F_t.gaussian_blur(t_img, kernel_size, sigma)
  File "c:\users\circleci\project\torchvision\transforms\functional_tensor.py", line 774, in gaussian_blur
    img = conv2d(img, kernel, groups=img.shape[-3])
RuntimeError: CUDA error: unspecified launch failure
================================================================================ short test summary info =================================================================================
ERROR test_transforms_tensor.py::test_gaussian_blur[1-meth_kwargs1-cuda] - RuntimeError: CUDA error: unspecified launch failure
FAILED test_transforms_tensor.py::test_gaussian_blur[1-meth_kwargs1-cuda] - RuntimeError: CUDA error: unspecified launch failure
================================================================= 1 failed, 1 passed, 3064 deselected, 1 error in 35.57s =================================================================
========= Invalid __shared__ read of size 4
=========     at 0x00001d10 in volta_scudnn_128x32_3dconv_fprop_xregs_large_nn_v1
=========     by thread (95,0,0) in block (24,0,0)
=========     Address 0x0000250c is out of bounds
=========     Device Frame:volta_scudnn_128x32_3dconv_fprop_xregs_large_nn_v1 (volta_scudnn_128x32_3dconv_fprop_xregs_large_nn_v1 : 0x1d10)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x76888]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x76bb1]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x7b0da]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll (cuProfilerStop + 0x11cc6a) [0x33d9ea]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x17069d]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll (cuProfilerStop + 0xf0c72) [0x3119f2]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x38bdb]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x390af]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll [0x39394]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nv_dispswi.inf_amd64_8fb2f986cb3224d8\nvcuda64.dll (cuLaunchKernel + 0x234) [0x20fc44]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll [0x3896]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll [0x26fd]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cask_cudnn::getPlatform + 0xe9) [0x1d54529]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cask_cudnn::transformTensor + 0x1bc1) [0x1dc0651]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cask_cudnn::transformTensor + 0xbe6a) [0x1dca8fa]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cask_cudnn::ConvDgradShader::isSplitK + 0x49b) [0x1ddcd9b]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::backend::Descriptor::initialize_internal + 0x618e) [0x5c67ce]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::backend::Descriptor::initialize_internal + 0x6eb1) [0x5c74f1]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::cnn::EngineInterface::execute + 0x7e) [0x4e163e]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::cnn::EngineContainer<1012,113664>::execute_internal_impl + 0x2a) [0x54f27a]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::cnn::EngineInterface::execute + 0x7e) [0x4e163e]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cask_cudnn::TensorDesc::operator== + 0x2d2) [0x544612]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::cnn::EngineContainer<1,4096>::execute_internal_impl + 0xd241) [0x55c4d1]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::cnn::EngineInterface::execute + 0x7e) [0x4e163e]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::backend::execute + 0x103f) [0x54eebf]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::backend::Tensor::Tensor + 0x18b6) [0x5ab246]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::backend::Tensor::Tensor + 0xbe1) [0x5aa571]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnn::cnn::convolutionForward + 0x10b) [0x65609b]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll (cudnnConvolutionForward + 0x331) [0x657081]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution_transpose + 0x4263) [0x48863]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution_transpose + 0x83a7) [0x4c9a7]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution_transpose + 0x7736) [0x4bd36]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution_transpose + 0x1ae5) [0x460e5]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution_transpose + 0x752f) [0x4bb2f]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution_add_relu + 0x16ec) [0x43fec]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::cudnn_convolution + 0xc5) [0x428e5]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cu.dll (at::cuda::view_as_real + 0x14adc) [0x456680c]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cuda_cu.dll (at::cuda::bucketize_outf + 0x3df7a) [0x450361a]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::_ops::cudnn_convolution::call + 0x242) [0x70175b2]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::native::_convolution + 0xf5e) [0x692064e]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::compositeexplicitautograd::xlogy_ + 0x40e) [0x72d1bee]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::compositeexplicitautograd::bmm + 0x1a1ed) [0x72975bd]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::_ops::_convolution::call + 0x2d6) [0x6d5a226]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::native::convolution + 0x164) [0x6928914]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::compositeexplicitautograd::xlogy_ + 0xc6b) [0x72d244b]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::compositeexplicitautograd::bmm + 0x1a2ca) [0x729769a]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::TensorMaker::make_tensor + 0x88e49) [0x6d40779]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::_ops::convolution::redispatch + 0x123) [0x6dc39a3]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (torch::autograd::GraphRoot::apply + 0x157b1) [0x7bfd851]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (torch::autograd::GraphRoot::apply + 0xc6c8) [0x7bf4768]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::_ops::convolution::call + 0x26f) [0x6d71b6f]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::native::conv2d + 0x1be) [0x69277be]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::compositeimplicitautograd::where + 0x1db4) [0x73ac8e4]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::compositeimplicitautograd::broadcast_to + 0x2a7a3) [0x738d953]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::_ops::conv2d::call + 0x219) [0x70ba239]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_cpu.dll (at::conv2d + 0x64) [0x67106d4]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_python.dll (torch::FunctionSignature::operator= + 0x1096fc) [0x14d77c]
=========     Host Frame:C:\Users\circleci\project\env\lib\site-packages\torch\lib\torch_python.dll (torch::FunctionSignature::operator= + 0x12f7ab) [0x17382b]
=========     Host Frame:C:\Users\circleci\project\env\python38.dll (PyMethodDef_RawFastCallKeywords + 0x410) [0x126fe0]
=========     Host Frame:C:\Users\circleci\project\env\python38.dll (PyObject_MakeTpCall + 0x106) [0x125fa6]
=========     Host Frame:C:\Users\circleci\project\env\python38.dll (PyEval_GetFuncDesc + 0x408) [0x2036b8]
=========
...

@IvanYashchuk, does this sound familiar?

@IvanYashchuk

No, I don't know the source of the problem.

@malfet malfet force-pushed the malfet/move_win_gpu-to-11.3 branch from b0890f5 to 737af02 Compare February 24, 2022 07:27
@malfet malfet force-pushed the malfet/move_win_gpu-to-11.3 branch from 737af02 to 860566b Compare February 24, 2022 07:56
@malfet malfet merged commit e5a5f0b into main Feb 24, 2022
@malfet malfet deleted the malfet/move_win_gpu-to-11.3 branch February 24, 2022 08:27
jdsgomes pushed a commit to jdsgomes/vision that referenced this pull request Feb 24, 2022
Also, remove reference to conda-forge, all CUDA toolchain should be
available from NVIDIA channel

Install h5py from pip on Windows and skip failing gaussian_blur tests if Win+CUDA11.3 is used
datumbox pushed a commit that referenced this pull request Feb 24, 2022
Also, remove reference to conda-forge, all CUDA toolchain should be
available from NVIDIA channel

Install h5py from pip on Windows and skip failing gaussian_blur tests if Win+CUDA11.3 is used

Co-authored-by: Nikita Shulga <[email protected]>
facebook-github-bot pushed a commit that referenced this pull request Feb 25, 2022
Summary:
Also, remove reference to conda-forge, all CUDA toolchain should be
available from NVIDIA channel

Install h5py from pip on Windows and skip failing gaussian_blur tests if Win+CUDA11.3 is used

Reviewed By: jdsgomes

Differential Revision: D34475316

fbshipit-source-id: 463ef028d315942efae956e8a5b314f8868a2975