Stop using ctypes to interface with CUDA libraries. #31160


Closed · wants to merge 4 commits

Conversation

Contributor
@ezyang ezyang commented Dec 12, 2019

Stack from ghstack:

This has the upside of no longer forcing us to hardcode the enum values
in Python, and also should let us remove the ungodly hacks we use to
load the libraries on Windows.

Also, it turns out that most of our cuDNN interface in Python is dead
now so I have removed it.

Signed-off-by: Edward Z. Yang <[email protected]>

Member
kostmo commented Dec 12, 2019

CircleCI build failures summary

As of commit 80a5bcd:

  • 1/1 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

1 failure not recognized by patterns:

Job: CircleCI pytorch_windows_build · Step: Build · Status: New in PR

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 10 times.


# NB: This has to match the condition under which the JIT test directory
# is included (at the time of writing that's in caffe2/CMakeLists.txt).
if (BUILD_TEST AND NOT MSVC AND NOT USE_ROCM)
Collaborator

Why is this disabled for MSVC?

Contributor Author

@apaszke would know best, but judging from the comment above, I think he was channeling the fact that JIT tests are already disabled on Windows:


  if (BUILD_TEST AND NOT MSVC AND NOT USE_ROCM)
    add_subdirectory(${TORCH_ROOT}/test/cpp/jit ${CMAKE_BINARY_DIR}/test_jit)
    if (USE_DISTRIBUTED)
      add_subdirectory(${TORCH_ROOT}/test/cpp/rpc ${CMAKE_BINARY_DIR}/test_cpp_rpc)
    endif()
  endif()

But why is this disabled on Windows? The condition was added in #8792; we can't easily ask Anders why he added it, but probably at the time it was failing on Windows and so he disabled it instead of fixing it.

th_dll_path = os.path.join(os.path.dirname(
    os.path.dirname(os.path.dirname(__file__))), 'lib')
test_env['PATH'] = ';'.join([th_dll_path, old_path])
proc = Popen(['where', 'cudnn64*.dll'], stdout=PIPE,
Collaborator

There's no RPATH for Windows, so we will have to depend on the user's environment. If a user compiled PyTorch themselves and the CUDA libraries are not in PATH, they will encounter a DLL load failure.

Contributor Author

@peterjc123 I suppose what I don't understand is: prior to hitting the code here, we will have loaded _C.dll into the process, which indirectly depends on cudnn64.dll. So how come this code works today? (Even if later, when we load the libraries with ctypes, we have to do all this faffing about, it doesn't seem like it would apply here.)

Collaborator

@peterjc123 peterjc123 Dec 13, 2019

Yes, I'm wrong here. Currently, it is using PATH and we didn't add any path logic before loading _C.dll. But in Python 3.8, they changed this behaviour. Please see https://github.com/pytorch/pytorch/pull/28536/files#r357489428 for more details.

Contributor Author

Taking a look at the binary packaging for 1.3.1, it looks like cudnn64.dll is distributed with PyTorch, and my guess is that you're assumed to have put the CUDA runtime into your PATH (certainly, conda activate + cudatoolkit will get it into your PATH).

Collaborator

Yes, that's true. For binary builds, it gets into PATH through:

  1. Conda package: relies on cudatoolkit, which is added to PATH by conda itself.
  2. Pip package: the CUDA runtime DLLs are copied into [PY_LIB_DIR]/torch/lib, and then we add that dir to PATH.

But for developers who build PyTorch on their own, it depends directly on their PATH setting. Generally speaking, though, it should be added during the installation of CUDA.

th_dll_path = os.path.join(os.path.dirname(
    os.path.dirname(__file__)), 'lib')
test_env['PATH'] = ';'.join([th_dll_path, py_dll_path, old_path])
proc = Popen(['where', 'cudart64*.dll'], stdout=PIPE,
Collaborator

The same story for cudart under Windows.

WINDOWS_HOME = 'C:/Program Files/NVIDIA Corporation/NvToolsExt'
NVTOOLEXT_HOME = os.getenv('NVTOOLSEXT_PATH', WINDOWS_HOME)
if os.path.exists(NVTOOLEXT_HOME):
    lib_paths = glob.glob(NVTOOLEXT_HOME + '/bin/x64/nvToolsExt*.dll')
Collaborator

Same story for nvtoolsext. What's worse is that the CUDA installation scripts don't put this one in PATH at all.

Contributor Author
ezyang commented Dec 13, 2019

Test failure here is pretty exciting:

Dec 12 08:23:16 [ RUN      ] JitTest.ADFormulas
Dec 12 08:23:16 unknown file: Failure
Dec 12 08:23:16 C++ exception with description "a[i].allclose(b[i]) INTERNAL ASSERT FAILED at /var/lib/jenkins/workspace/test/cpp/jit/test_utils.cpp:18, please report a bug to PyTorch.  (assertAllClose at /var/lib/jenkins/workspace/test/cpp/jit/test_utils.cpp:18)
Dec 12 08:23:16 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7f3d1ea504fa in /var/lib/jenkins/workspace/build/lib/libc10.so)
Dec 12 08:23:16 frame #1: torch::jit::assertAllClose(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x5fb (0x4c826b in build/bin/test_jit)
Dec 12 08:23:16 frame #2: torch::jit::testADFormulas() + 0x1dae (0x5549ae in build/bin/test_jit)
Dec 12 08:23:16 frame #3: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x43 (0x588a13 in build/bin/test_jit)
Dec 12 08:23:16 frame #4: build/bin/test_jit() [0x57ebbb]
Dec 12 08:23:16 frame #5: build/bin/test_jit() [0x57eef5]
Dec 12 08:23:16 frame #6: build/bin/test_jit() [0x57f14d]
Dec 12 08:23:16 frame #7: testing::internal::UnitTestImpl::RunAllTests() + 0xbc2 (0x580112 in build/bin/test_jit)
Dec 12 08:23:16 frame #8: testing::UnitTest::Run() + 0x8b (0x58042b in build/bin/test_jit)
Dec 12 08:23:16 frame #9: main + 0xcb (0x446b2b in build/bin/test_jit)
Dec 12 08:23:16 frame #10: __libc_start_main + 0xf0 (0x7f3d1e149830 in /lib/x86_64-linux-gnu/libc.so.6)
Dec 12 08:23:16 frame #11: _start + 0x29 (0x44c359 in build/bin/test_jit)
Dec 12 08:23:16 " thrown in the test body.

Contributor Author
ezyang commented Dec 13, 2019

After studying the original patchset and thinking about RTLD_GLOBAL, I don't think this patchset is actually needed in order to get rid of RTLD_GLOBAL. So I'm going to drop it and see if removing RTLD_GLOBAL still works.

@ezyang ezyang closed this Jan 27, 2020
xxtEchjovs44 pushed a commit to xxtEchjovs44/pytorch that referenced this pull request Jan 29, 2020

ghstack-source-id: 9062867
Pull Request resolved: pytorch/pytorch#31160
@facebook-github-bot facebook-github-bot deleted the gh/ezyang/578/head branch February 27, 2020 15:19
facebook-github-bot pushed a commit that referenced this pull request Mar 11, 2020
Summary:
Fixes #33016, Continuation of #31160
Pull Request resolved: #33678

Differential Revision: D20249187

Pulled By: ezyang

fbshipit-source-id: 172ce4a0fee7fbe01436a421d1af22ef6173b6ed
Labels
oncall: jit Add this issue/PR to JIT oncall triage queue
4 participants