Stop using ctypes to interface with CUDA libraries. #31160


Closed · wants to merge 4 commits

Conversation

Contributor
@ezyang ezyang commented Dec 12, 2019

Stack from ghstack:

This has the upside of no longer forcing us to hardcode the enum values
in Python, and also should let us remove the ungodly hacks we use to
load the libraries on Windows.

Also, it turns out that most of our cuDNN interface in Python is dead
now so I have removed it.

Signed-off-by: Edward Z. Yang <[email protected]>

Member
kostmo commented Dec 12, 2019

CircleCI build failures summary

As of commit 80a5bcd:

  • 1/1 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

1 failure not recognized by patterns:

Job: CircleCI pytorch_windows_build · Step: Build · Status: New in PR

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 10 times.


# NB: This has to match the condition under which the JIT test directory
# is included (at the time of writing that's in caffe2/CMakeLists.txt).
if (BUILD_TEST AND NOT MSVC AND NOT USE_ROCM)
Collaborator

Why is this disabled for MSVC?

Contributor Author

@apaszke would know best, but judging from the comment above, I think he was channeling the fact that JIT tests are already disabled on Windows:


  if (BUILD_TEST AND NOT MSVC AND NOT USE_ROCM)
    add_subdirectory(${TORCH_ROOT}/test/cpp/jit ${CMAKE_BINARY_DIR}/test_jit)
    if (USE_DISTRIBUTED)
      add_subdirectory(${TORCH_ROOT}/test/cpp/rpc ${CMAKE_BINARY_DIR}/test_cpp_rpc)
    endif()
  endif()

But why is this disabled on Windows? The condition was added in #8792; we can't easily ask Anders why he added it, but probably at the time it was failing on Windows and so he disabled it instead of fixing it.

th_dll_path = os.path.join(os.path.dirname(
    os.path.dirname(os.path.dirname(__file__))), 'lib')
test_env['PATH'] = ';'.join([th_dll_path, old_path])
proc = Popen(['where', 'cudnn64*.dll'], stdout=PIPE,
Collaborator

There's no RPATH for Windows, so we will have to depend on the user's environment. If a user compiled PyTorch themselves and the CUDA libraries are not in PATH, they will encounter a DLL load failure.

Contributor Author

@peterjc123 I suppose what I don't understand is: prior to hitting the code here, we will have loaded _C.dll into the process, which indirectly depends on cudnn64.dll. So how come this code works today? (Even if later, when we load the libraries with ctypes, we have to do all this faffing about, it doesn't seem like it would apply here.)

Collaborator

@peterjc123 peterjc123 Dec 13, 2019

Yes, I'm wrong here. Currently, it is using PATH and we didn't add any path logic before loading _C.dll. But in Python 3.8, they changed this behaviour. Please see https://github.com/pytorch/pytorch/pull/28536/files#r357489428 for more details.

Contributor Author

Taking a look at the binary packaging for 1.3.1, it looks like cudnn64.dll is distributed with PyTorch, and my guess is that you're assumed to have put the CUDA runtime into your PATH (certainly, conda activate + cudatoolkit will get it into your PATH).

Collaborator

Yes, that's true. For binary builds, it gets into PATH through:

  1. Conda package: relies on cudatoolkit, which is added to PATH by conda itself.
  2. Pip package: the CUDA runtime DLLs are copied into [PY_LIB_DIR]/torch/lib, and then we add that dir to PATH.

But for developers who build PyTorch on their own, it depends directly on their PATH setting. Generally speaking, though, it should be added during the installation of CUDA.

th_dll_path = os.path.join(os.path.dirname(
    os.path.dirname(__file__)), 'lib')
test_env['PATH'] = ';'.join([th_dll_path, py_dll_path, old_path])
proc = Popen(['where', 'cudart64*.dll'], stdout=PIPE,
Collaborator

The same story for cudart under Windows.

WINDOWS_HOME = 'C:/Program Files/NVIDIA Corporation/NvToolsExt'
NVTOOLEXT_HOME = os.getenv('NVTOOLSEXT_PATH', WINDOWS_HOME)
if os.path.exists(NVTOOLEXT_HOME):
    lib_paths = glob.glob(NVTOOLEXT_HOME + '/bin/x64/nvToolsExt*.dll')
Collaborator

Same story for nvtoolsext. What's worse is that the CUDA installation scripts don't put this one in PATH at all.

Contributor Author
ezyang commented Dec 13, 2019

Test failure here is pretty exciting:

Dec 12 08:23:16 [ RUN      ] JitTest.ADFormulas
Dec 12 08:23:16 unknown file: Failure
Dec 12 08:23:16 C++ exception with description "a[i].allclose(b[i]) INTERNAL ASSERT FAILED at /var/lib/jenkins/workspace/test/cpp/jit/test_utils.cpp:18, please report a bug to PyTorch.  (assertAllClose at /var/lib/jenkins/workspace/test/cpp/jit/test_utils.cpp:18)
Dec 12 08:23:16 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7f3d1ea504fa in /var/lib/jenkins/workspace/build/lib/libc10.so)
Dec 12 08:23:16 frame #1: torch::jit::assertAllClose(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x5fb (0x4c826b in build/bin/test_jit)
Dec 12 08:23:16 frame #2: torch::jit::testADFormulas() + 0x1dae (0x5549ae in build/bin/test_jit)
Dec 12 08:23:16 frame #3: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x43 (0x588a13 in build/bin/test_jit)
Dec 12 08:23:16 frame #4: build/bin/test_jit() [0x57ebbb]
Dec 12 08:23:16 frame #5: build/bin/test_jit() [0x57eef5]
Dec 12 08:23:16 frame #6: build/bin/test_jit() [0x57f14d]
Dec 12 08:23:16 frame #7: testing::internal::UnitTestImpl::RunAllTests() + 0xbc2 (0x580112 in build/bin/test_jit)
Dec 12 08:23:16 frame #8: testing::UnitTest::Run() + 0x8b (0x58042b in build/bin/test_jit)
Dec 12 08:23:16 frame #9: main + 0xcb (0x446b2b in build/bin/test_jit)
Dec 12 08:23:16 frame #10: __libc_start_main + 0xf0 (0x7f3d1e149830 in /lib/x86_64-linux-gnu/libc.so.6)
Dec 12 08:23:16 frame #11: _start + 0x29 (0x44c359 in build/bin/test_jit)
Dec 12 08:23:16 " thrown in the test body.

Contributor Author
ezyang commented Dec 13, 2019

After studying the original patchset and thinking about RTLD_GLOBAL, I don't think this patchset is actually needed in order to get rid of RTLD_GLOBAL. So I'm going to drop it and see if removing RTLD_GLOBAL still works.

@ezyang ezyang closed this Jan 27, 2020
xxtEchjovs44 pushed a commit to xxtEchjovs44/pytorch that referenced this pull request Jan 29, 2020

ghstack-source-id: 9062867
Pull Request resolved: pytorch/pytorch#31160
@facebook-github-bot facebook-github-bot deleted the gh/ezyang/578/head branch February 27, 2020 15:19
facebook-github-bot pushed a commit that referenced this pull request Mar 11, 2020
Summary:
Fixes #33016, Continuation of #31160
Pull Request resolved: #33678

Differential Revision: D20249187

Pulled By: ezyang

fbshipit-source-id: 172ce4a0fee7fbe01436a421d1af22ef6173b6ed
Labels
oncall: jit Add this issue/PR to JIT oncall triage queue
4 participants