Stop using ctypes to interface with CUDA libraries. #31160
Conversation
This has the upside of no longer forcing us to hardcode the enum values in Python, and also should let us remove the ungodly hacks we use to load the libraries on Windows.

Also, it turns out that most of our cuDNN interface in Python is dead now, so I have removed it.

Signed-off-by: Edward Z. Yang <[email protected]>
[ghstack-poisoned]
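The "hardcoded enum values" being eliminated look roughly like the sketch below. This is a minimal illustration, not PyTorch's actual code: the constant names and values are mirrored from cudnn.h, and the loader simply returns None when no cuDNN library is installed.

```python
import ctypes
import ctypes.util

# Enum values copied by hand from cudnn.h -- the fragile pattern this PR
# removes by binding the C API directly instead of going through ctypes.
CUDNN_DATA_FLOAT = 0
CUDNN_DATA_DOUBLE = 1
CUDNN_DATA_HALF = 2

def load_cudnn():
    """Locate and dlopen a cuDNN-like shared library, ctypes-style.
    Returns None when the library is not installed (e.g. CPU-only machines)."""
    path = ctypes.util.find_library('cudnn')
    return ctypes.CDLL(path) if path else None
```

The problem with this approach is that nothing checks the hardcoded values against the installed header, so they can silently drift out of sync across cuDNN releases.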
CircleCI build failures summary (as of commit 80a5bcd):

Detailed failure analysis: one may explore the probable reasons each build failed interactively on the Dr. CI website. 1 failure not recognized by patterns.

This comment was automatically generated by Dr. CI. Please report bugs/suggestions on the GitHub issue tracker. This comment has been revised 10 times.
# NB: This has to match the condition under which the JIT test directory
# is included (at the time of writing that's in caffe2/CMakeLists.txt).
if (BUILD_TEST AND NOT MSVC AND NOT USE_ROCM)
Why is this disabled for MSVC?
@apaszke would know best, but judging from the comment above, I think he was channeling the fact that JIT tests are already disabled on Windows:
if (BUILD_TEST AND NOT MSVC AND NOT USE_ROCM)
add_subdirectory(${TORCH_ROOT}/test/cpp/jit ${CMAKE_BINARY_DIR}/test_jit)
if (USE_DISTRIBUTED)
add_subdirectory(${TORCH_ROOT}/test/cpp/rpc ${CMAKE_BINARY_DIR}/test_cpp_rpc)
endif()
endif()
But why is this disabled on Windows? The condition was added in #8792; we can't easily ask Anders why he added it, but probably at the time it was failing on Windows and so he disabled it instead of fixing it.
th_dll_path = os.path.join(os.path.dirname(
    os.path.dirname(os.path.dirname(__file__))), 'lib')
test_env['PATH'] = ';'.join([th_dll_path, old_path])
proc = Popen(['where', 'cudnn64*.dll'], stdout=PIPE,
There's no RPATH on Windows, so we have to depend on the user's environment. If users compiled PyTorch themselves and the CUDA libraries are not in PATH, they will encounter a DLL load failure.
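A minimal sketch of the mitigation being discussed, with illustrative names rather than the actual PyTorch loader: build a Windows-style search path with the package's lib directory in front, so DLLs shipped next to the package are found before anything else.

```python
import os

def make_dll_search_path(lib_dir, old_path):
    """Return a PATH-style string with lib_dir prepended, so DLLs bundled
    with the package win over (or stand in for) copies elsewhere on PATH."""
    parts = [lib_dir] + [p for p in old_path.split(os.pathsep) if p]
    return os.pathsep.join(parts)
```

On Windows the result would be assigned back to `os.environ['PATH']` before the first native-module import.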
@peterjc123 I suppose what I don't understand is: prior to hitting the code here, we will have loaded _C.dll into the process, which indirectly depends on cudnn64.dll. So how come this code works today? (Even if later, when we load the libraries with ctypes, we have to do all this faffing about, it doesn't seem like it would apply here.)
Yes, I'm wrong here. Currently it is using PATH, and we didn't add any path logic before loading _C.dll. But in Python 3.8, they changed this behaviour. Please see https://github.com/pytorch/pytorch/pull/28536/files#r357489428 for more details.
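The 3.8 change referenced here is that on Windows, extension modules no longer resolve dependent DLLs via PATH; directories must be registered explicitly. A hedged sketch of the new mechanism (a no-op on older Pythons and on non-Windows platforms):

```python
import os

def register_dll_dirs(dirs):
    """On Python 3.8+ Windows, register each existing directory with the
    DLL search path via os.add_dll_directory; elsewhere, do nothing.
    Returns the cookies so callers could later .close() them."""
    cookies = []
    if hasattr(os, 'add_dll_directory'):  # only exists on Windows, 3.8+
        for d in dirs:
            if os.path.isdir(d):
                cookies.append(os.add_dll_directory(d))
    return cookies
```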
Taking a look at the binary packaging for 1.3.1, it looks like cudnn64.dll is distributed with PyTorch, and I guess (though I'm not sure) that you're assumed to have put the CUDA runtime into your PATH (certainly, conda activate plus cudatoolkit will get it into your PATH).
Yes, that's true. For binary builds, it gets into PATH through:

- Conda package: relies on cudatoolkit, which conda itself adds to PATH.
- Pip package: the CUDA runtime DLLs are copied into [PY_LIB_DIR]/torch/lib, and we then add that directory to PATH.

But for developers who build PyTorch on their own, it depends directly on their PATH setting. Generally speaking, though, it should be added during the installation of CUDA.
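The `where cudnn64*.dll` / `where cudart64*.dll` checks in the tests above can be approximated in pure Python. This sketch (illustrative, not the code under review) scans PATH-style entries for a DLL name pattern:

```python
import fnmatch
import os

def find_dlls(pattern, search_path):
    """Return full paths of files matching a glob pattern (for example
    'cudart64*.dll') in each directory of a PATH-style string, roughly
    what `where` does on Windows."""
    hits = []
    for d in search_path.split(os.pathsep):
        try:
            names = os.listdir(d)
        except OSError:
            continue  # skip missing or unreadable entries, as `where` would
        for name in names:
            if fnmatch.fnmatch(name.lower(), pattern.lower()):
                hits.append(os.path.join(d, name))
    return hits
```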
th_dll_path = os.path.join(os.path.dirname(
    os.path.dirname(__file__)), 'lib')
test_env['PATH'] = ';'.join([th_dll_path, py_dll_path, old_path])
proc = Popen(['where', 'cudart64*.dll'], stdout=PIPE,
The same story for cudart under Windows.
WINDOWS_HOME = 'C:/Program Files/NVIDIA Corporation/NvToolsExt'
NVTOOLEXT_HOME = os.getenv('NVTOOLSEXT_PATH', WINDOWS_HOME)
if os.path.exists(NVTOOLEXT_HOME):
    lib_paths = glob.glob(NVTOOLEXT_HOME + '/bin/x64/nvToolsExt*.dll')
Same story for nvtoolsext. What's worse is that CUDA installation scripts don't put this one in PATH.
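The snippet above encodes the workaround: honor NVTOOLSEXT_PATH if set, otherwise fall back to the default install directory. A self-contained sketch of that lookup (the default path and DLL glob mirror the diff; the function name and env-dict parameter are illustrative, and the result is an empty list when nothing is installed):

```python
import glob
import os

DEFAULT_NVTOOLSEXT_HOME = 'C:/Program Files/NVIDIA Corporation/NvToolsExt'

def locate_nvtoolsext(env=os.environ):
    """Find nvToolsExt DLLs under NVTOOLSEXT_PATH (or the default install
    dir), since the CUDA installer does not put them on PATH."""
    home = env.get('NVTOOLSEXT_PATH', DEFAULT_NVTOOLSEXT_HOME)
    if not os.path.exists(home):
        return []
    return glob.glob(os.path.join(home, 'bin', 'x64', 'nvToolsExt*.dll'))
```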
Test failure here is pretty exciting:
After studying the original patchset, and thinking about RTLD_GLOBAL, I think there isn't actually any reason for this patchset to exist just to get rid of RTLD_GLOBAL. So I'm going to get rid of it and see if RTLD_GLOBAL still works.
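For context, the RTLD_GLOBAL behaviour in question is a dlopen flag: loading with RTLD_GLOBAL makes a library's symbols visible to every library loaded afterwards, while the default RTLD_LOCAL keeps them private to that handle. A minimal POSIX-only sketch, where libm stands in for the real libraries (this is not PyTorch's loader):

```python
import ctypes
import ctypes.util
import os

def load_global(libname):
    """dlopen a library with RTLD_GLOBAL so that libraries loaded later
    can resolve its symbols; returns None if the library isn't found."""
    path = ctypes.util.find_library(libname)
    if path is None:
        return None
    return ctypes.CDLL(path, mode=os.RTLD_NOW | os.RTLD_GLOBAL)

libm = load_global('m')  # the C math library as a harmless stand-in
```

With ctypes out of the picture, whether this flag is still needed depends entirely on how the native extension itself is linked and loaded.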
ghstack-source-id: 9062867
Pull Request resolved: pytorch/pytorch#31160