Pytorch v0.4.1 fails to build on Nvidia Drive PX2 #11518

Closed
jeff-hawke opened this issue Sep 11, 2018 · 2 comments
Issue description

Building PyTorch 0.4.1 from source on an Nvidia Drive PX2 (Driveworks 0.6) currently does not work, due to an odd Nvidia print statement that breaks the CUDA architecture detection: running any CUDA process writes the following line to stdout:
nvrm_gpu: Bug 200215060 workaround enabled.
Unfortunately there's nothing I can do (or rather, nothing I've found) to work around or suppress it; it's part of the CUDA 9.0 install that ships with the Driveworks SDK for these PX2s.

This breaks the CUDA architecture detection in the CUDA_DETECT_INSTALLED_GPUS function in <pytorch_root>/cmake/Modules_CUDA_fix/upstream/FindCUDA/select_compute_arch.cmake:

  • This function writes and compiles a short cpp program which prints the CUDA device architectures, and caches the program output in CMakeCache.
  • Instead of printing 6.1 6.2 to stdout as expected on this device, this additional print statement results in CUDA_GPU_DETECT_OUTPUT being set to: nvrm_gpu: Bug 200215060 workaround enabled.\n6.1 6.2
  • This newline breaks CMakeCache, which doesn't handle newlines in cached variables.
  • In addition, the output triggers a number of message(SEND_ERROR <>) calls, one per parsed token (stating that 'nvrm_gpu:', 'Bug', '200215060', ... aren't valid architectures, understandably).
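The failure mode above can be reproduced off-device with a few lines of Python that mimic the CMake parsing. This is only an illustration of the behaviour described in the bullets; the real parsing happens in select_compute_arch.cmake, and the token-classification logic here is an assumption about how the output ends up split:

```python
# Simulate the output of the detection program on a Drive PX2:
# the spurious nvrm_gpu log line lands on stdout ahead of the
# real architecture list.
detect_output = "nvrm_gpu: Bug 200215060 workaround enabled.\n6.1 6.2"

# The .cmake code effectively splits this output on whitespace and
# treats every token as a candidate architecture version.
tokens = detect_output.split()

# Only <digits>.<digits> tokens are plausible architectures.
valid = [t for t in tokens if "." in t and t.replace(".", "", 1).isdigit()]
invalid = [t for t in tokens if t not in valid]

print(invalid)  # the tokens that trigger message(SEND_ERROR ...)
print(valid)    # the architectures that were actually reported
```

Running this shows five bogus tokens ('nvrm_gpu:', 'Bug', '200215060', 'workaround', 'enabled.') alongside the two real architectures, which matches the SEND_ERROR spam seen during the build.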

This can be fixed with a one-line addition at line 90 of this .cmake file, filtering the program output so that the compute_capabilities variable contains only plausible architecture versions (floats, e.g. 6.1, 6.2, etc.):
string(REGEX MATCHALL "[0-9]+\\.[0-9]+" compute_capabilities "${compute_capabilities}")
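The effect of that regex can be checked in Python with the equivalent pattern (a sketch for illustration only; the actual fix is the CMake string(REGEX MATCHALL ...) line above):

```python
import re

# Output as captured from the detection program on the PX2,
# including the spurious log line and its embedded newline.
compute_capabilities = "nvrm_gpu: Bug 200215060 workaround enabled.\n6.1 6.2"

# Python equivalent of:
#   string(REGEX MATCHALL "[0-9]+\\.[0-9]+" compute_capabilities ...)
# Keeps only <major>.<minor> version tokens and drops everything else,
# including the newline that broke CMakeCache.
matches = re.findall(r"[0-9]+\.[0-9]+", compute_capabilities)
print(matches)  # ['6.1', '6.2']
```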

With this patch, PyTorch 0.4.1 builds happily.

If you have any other suggestions for a fix or workaround, I'd be happy to try them.

Code example

Reproducible on multiple PX2s with this version of Driveworks, Python 3.5, a fresh checkout of 0.4.1, and the following install command:

MAX_JOBS=1 python3 setup.py install --user

System Info

CUDA used to build PyTorch: 9.0.225
OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1
Python version: 3.5
CUDA runtime version: 9.0.225
cuDNN version: Probably one of the following:
/usr/lib/aarch64-linux-gnu/libcudnn.so.7.0.4
/usr/lib/aarch64-linux-gnu/libcudnn_static_v7.a

soumith (Member) commented Sep 11, 2018

@jeff-hawke as a workaround, you can do this:

TORCH_CUDA_ARCH_LIST="6.1;6.2" python setup.py install

That skips detection entirely and uses the provided architecture values.
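For context, the semicolon-separated list is ultimately translated by the CMake module into per-architecture nvcc code-generation flags. A hypothetical sketch of that translation (the function name and exact flag construction here are illustrative, not the actual .cmake logic):

```python
def arch_list_to_gencode(arch_list):
    """Translate a TORCH_CUDA_ARCH_LIST-style string such as "6.1;6.2"
    into nvcc -gencode flags. Illustrative sketch only; the real
    translation lives in select_compute_arch.cmake."""
    flags = []
    for arch in arch_list.split(";"):
        sm = arch.replace(".", "")  # "6.1" -> "61"
        flags.append(f"-gencode arch=compute_{sm},code=sm_{sm}")
    return flags

print(arch_list_to_gencode("6.1;6.2"))
```

Because the list is supplied directly, the broken detection program is never compiled or run, which is why the nvrm_gpu log line no longer matters.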

jeff-hawke (Author) commented

Thanks @soumith - that works perfectly as well!
If possible, it'd be great to add this to the CMake module itself, so that the default behaviour handles this case. I've attached a quick diff of a patch which resolves it.
px2_build_fix.diff.zip

@soumith soumith self-assigned this Sep 17, 2018
kwrobot pushed a commit to Kitware/CMake that referenced this issue Sep 20, 2018
Working around CUDA-level nvrm_gpu log statements to stdout on some
embedded platforms (ex. Drive PX2).

See-also: pytorch/pytorch#11518 (comment)