Pytorch v0.4.1 fails to build on Nvidia Drive PX2 #11518

Closed
jeff-hawke opened this issue Sep 11, 2018 · 2 comments
Issue description

Building PyTorch 0.4.1 from source on an Nvidia Drive PX2 (Driveworks 0.6) currently does not work, due to an odd Nvidia print statement that breaks the CUDA architecture detection: running any CUDA process writes the following line to stdout:
nvrm_gpu: Bug 200215060 workaround enabled.
Unfortunately there's nothing I can do (or rather, nothing I've found) to work around or suppress it; it's part of the CUDA 9.0 install that ships with the Driveworks SDK for these PX2s.

This breaks the CUDA architecture detection in the CUDA_DETECT_INSTALLED_GPUS function in <pytorch_root>/cmake/Modules_CUDA_fix/upstream/FindCUDA/select_compute_arch.cmake:

  • This function writes and compiles a short cpp program which prints the CUDA device architectures, and caches the program output in CMakeCache.
  • Instead of printing 6.1 6.2 to stdout as expected on this device, this additional print statement results in CUDA_GPU_DETECT_OUTPUT being set to: nvrm_gpu: Bug 200215060 workaround enabled.\n6.1 6.2
  • This newline breaks CMakeCache, which doesn't handle newlines in cached variables.
  • In addition, the output triggers a number of message(SEND_ERROR <>) calls, one per parsed token (stating that 'nvrm_gpu:', 'Bug', '200215060', ... aren't valid architectures, understandably).
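The failure mode above can be reproduced off-device with a few lines of Python that mimic the CMake parsing. This is only an illustration of the behaviour described in the bullets; the real parsing happens in select_compute_arch.cmake, and the token-classification logic here is an assumption about how the output ends up split:

```python
# Simulate the output of the detection program on a Drive PX2:
# the spurious nvrm_gpu log line lands on stdout ahead of the
# real architecture list.
detect_output = "nvrm_gpu: Bug 200215060 workaround enabled.\n6.1 6.2"

# The .cmake code effectively splits this output on whitespace and
# treats every token as a candidate architecture version.
tokens = detect_output.split()

# Only <digits>.<digits> tokens are plausible architectures.
valid = [t for t in tokens if "." in t and t.replace(".", "", 1).isdigit()]
invalid = [t for t in tokens if t not in valid]

print(invalid)  # the tokens that trigger message(SEND_ERROR ...)
print(valid)    # the architectures that were actually reported
```

Running this shows five bogus tokens ('nvrm_gpu:', 'Bug', '200215060', 'workaround', 'enabled.') alongside the two real architectures, which matches the SEND_ERROR spam seen during the build.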

This can be fixed with a one-line addition at line 90 of this .cmake file, filtering the program output so that the compute_capabilities variable contains only plausible architecture versions (floats, e.g. 6.1, 6.2, etc.):
string(REGEX MATCHALL "[0-9]+\\.[0-9]+" compute_capabilities "${compute_capabilities}")
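The effect of that regex can be checked in Python with the equivalent pattern (a sketch for illustration only; the actual fix is the CMake string(REGEX MATCHALL ...) line above):

```python
import re

# Output as captured from the detection program on the PX2,
# including the spurious log line and its embedded newline.
compute_capabilities = "nvrm_gpu: Bug 200215060 workaround enabled.\n6.1 6.2"

# Python equivalent of:
#   string(REGEX MATCHALL "[0-9]+\\.[0-9]+" compute_capabilities ...)
# Keeps only <major>.<minor> version tokens and drops everything else,
# including the newline that broke CMakeCache.
matches = re.findall(r"[0-9]+\.[0-9]+", compute_capabilities)
print(matches)  # ['6.1', '6.2']
```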

With this patch, PyTorch 0.4.1 builds happily.

If you have any other suggestions for a fix or workaround, I'd be happy to try them.

Code example

Reproducible on multiple PX2s with this version of Driveworks, Python 3.5, a fresh checkout of 0.4.1, and the following install command:

MAX_JOBS=1 python3 setup.py install --user

System Info

CUDA used to build PyTorch: 9.0.225
OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1
Python version: 3.5
CUDA runtime version: 9.0.225
cuDNN version: Probably one of the following:
/usr/lib/aarch64-linux-gnu/libcudnn.so.7.0.4
/usr/lib/aarch64-linux-gnu/libcudnn_static_v7.a

soumith (Member) commented Sep 11, 2018

@jeff-hawke as a workaround, you can do this:

TORCH_CUDA_ARCH_LIST="6.1;6.2" python setup.py install

That skips detection entirely and uses the provided architecture values.
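For context, the semicolon-separated list is ultimately translated by the CMake module into per-architecture nvcc code-generation flags. A hypothetical sketch of that translation (the function name and exact flag construction here are illustrative, not the actual .cmake logic):

```python
def arch_list_to_gencode(arch_list):
    """Translate a TORCH_CUDA_ARCH_LIST-style string such as "6.1;6.2"
    into nvcc -gencode flags. Illustrative sketch only; the real
    translation lives in select_compute_arch.cmake."""
    flags = []
    for arch in arch_list.split(";"):
        sm = arch.replace(".", "")  # "6.1" -> "61"
        flags.append(f"-gencode arch=compute_{sm},code=sm_{sm}")
    return flags

print(arch_list_to_gencode("6.1;6.2"))
```

Because the list is supplied directly, the broken detection program is never compiled or run, which is why the nvrm_gpu log line no longer matters.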

jeff-hawke (Author) commented

Thanks @soumith - that works perfectly as well!
If possible, it'd be great to add this to the CMake module itself, so that the default behaviour handles this case. I've attached a quick diff of a patch which resolves it.
px2_build_fix.diff.zip

@soumith soumith self-assigned this Sep 17, 2018
kwrobot pushed a commit to Kitware/CMake that referenced this issue Sep 20, 2018
Working around CUDA-level nvrm_gpu log statements to stdout on some
embedded platforms (ex. Drive PX2).

See-also: pytorch/pytorch#11518 (comment)