
Building PyTorch with ROCm #258


Closed
briansp2020 opened this issue Oct 10, 2018 · 37 comments

@briansp2020

❓ Questions and Help

Please note that this issue tracker is not a help form and this issue will be closed.

I'm trying to build PyTorch to run on ROCm (Ubuntu 18.04) and am having issues. I tried the following.

  1. I followed https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm, but it seems to have failed at pyyaml (https://gist.github.com/briansp2020/114bd75ff0182197cf7efc7af265e89c).
    I got past that error by installing wheel. However, the build still failed later (https://gist.github.com/briansp2020/2719353d626968082410011dc36608cf).

  2. I tried building it in the TensorFlow docker image and got https://gist.github.com/briansp2020/2a109c0f1d40b45299cb73a76a255767

It seems the wiki is out of date, and I needed to get the latest rocSPARSE (https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases) to get past the CMake phase. Unfortunately, the build still failed (https://gist.github.com/briansp2020/52047cf73d8d59ddd72f730d779b952c)...
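
For reference, pulling a rocSPARSE release into the container looks roughly like this (the tag and .deb name below are placeholders, not the exact files I used):

    # Placeholder example: install a rocSPARSE release .deb inside the container;
    # substitute the actual tag and asset name from the releases page above.
    wget https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/download/<tag>/<rocsparse-package.deb>
    dpkg -i <rocsparse-package.deb>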

Do you have up-to-date instructions on how to build PyTorch with ROCm? My goal is to run fast.ai on a Vega FE with ROCm.

Thanks!

@iotamudelta

We are in the process of releasing a new version of ROCm for PyTorch soon. The wiki and docker file will be updated then. Stay tuned!

@jithunnair-amd
Collaborator

@briansp2020 Please don't forget to run Step 6 of https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm before you start the build process. It seems you didn't "hipify" using that step, which is why your code is still referencing CUDA headers.
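
Roughly, that step amounts to running the hipify script from the repository root; a minimal sketch (see the wiki for the exact sequence):

    # Run from the root of the PyTorch checkout, before invoking the build.
    # build_pytorch_amd.py rewrites CUDA sources/headers to their HIP
    # equivalents; skipping it leaves the build referencing CUDA headers.
    cd pytorch
    python tools/amd_build/build_pytorch_amd.py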

@briansp2020
Author

I was able to make some more progress. I did run hipify, but it failed because I had not set the language environment variables. After setting them and installing hipSPARSE & rocSPARSE into the ROCm/TensorFlow docker image, the build got further, reaching 87% (https://gist.github.com/briansp2020/159742c61da5bb205b7214ac980ff092).

Some errors I still get are

In file included from /root/pytorch/aten/src/THC/THCTensorMathReduce.cu:1:
/root/pytorch/aten/src/THC/THCTensorMathReduce.cuh:121:61: error: '__builtin_inff': no overloaded function has restriction specifiers that are compatible with the ambient context 'THCTensor_kernel_renorm'
if (THCNumerics::eq(value, scalar_cast<AccT, float>(INFINITY))) {
^
/usr/include/x86_64-linux-gnu/bits/inf.h:26:34: note: expanded from macro 'INFINITY'

LLVM ERROR: Cannot select: 0xb7325640: ch = store<(store 8 into %ir.dense.coerce.fca.0.gep16, addrspace 5), trunc to i64> 0xb773aa58, 0xb5979530, FrameIndex:i32<0>, undef:i32
0xb5979530: i32 = srl 0xb622f9f0, Constant:i32<16>
0xb622f9f0: i32,ch = load<(dereferenceable invariant load 4 from i16 addrspace(4)* undef, align 16, addrspace 4)> 0x3e0d508, 0xb5bfbd00, undef:i64
0xb5bfbd00: i64,ch = CopyFromReg 0x3e0d508, Register:i64 %47
0x668b4f50: i64 = Register %47
0xb575d268: i64 = undef
0xb7fbc7c0: i32 = Constant<16>
0xb5b29038: i32 = FrameIndex<0>
0xb5c0d4b8: i32 = undef
In function: ZN2at6native5apply23sparseElementwiseKernelI12TensorCAddOpINS_4HalfEEmS4_EEvT_NS_4cuda6detail10TensorInfoIT1_T0_EENS9_IlSB_EESC_SB
Generating AMD GCN kernel failed in llc for target: gfx906
LLVM ERROR: Cannot select: 0x6688ad48: ch = store<(store 8 into %ir.dense.coerce.fca.0.gep16, addrspace 5), trunc to i64> 0xb405a170, 0x66b1be88, FrameIndex:i32<0>, undef:i32
0x66b1be88: i32 = srl 0xb54e5028, Constant:i32<16>
0xb54e5028: i32,ch = load<(dereferenceable invariant load 4 from i16 addrspace(4)* undef, align 16, addrspace 4)> 0x3eb5498, 0x6b130040, undef:i64
0x6b130040: i64,ch = CopyFromReg 0x3eb5498, Register:i64 %47
0x6aecdfc8: i64 = Register %47
0xb6fbef08: i64 = undef
0x6b71b7b8: i32 = Constant<16>
0xb3fde388: i32 = FrameIndex<0>
0x66924668: i32 = undef
In function: ZN2at6native5apply23sparseElementwiseKernelI12TensorCAddOpINS_4HalfEEmS4_EEvT_NS_4cuda6detail10TensorInfoIT1_T0_EENS9_IlSB_EESC_SB
Generating AMD GCN kernel failed in llc for target: gfx900
LLVM ERROR: Cannot select: 0xb91ced38: ch = store<(store 8 into %ir.dense.coerce.fca.0.gep16, addrspace 5), trunc to i64> 0xb6d397d8, 0xb55c15d0, FrameIndex:i32<0>, undef:i32
0xb55c15d0: i32 = srl 0x67a21678, Constant:i32<16>
0x67a21678: i32,ch = load<(dereferenceable invariant load 4 from i16 addrspace(4)* undef, align 16, addrspace 4)> 0x3ef94f8, 0xb6d5a770, undef:i64
0xb6d5a770: i64,ch = CopyFromReg 0x3ef94f8, Register:i64 %47
0xb6a24280: i64 = Register %47
0xb53ad3f0: i64 = undef
0x66205d58: i32 = Constant<16>
0xb91dacc8: i32 = FrameIndex<0>
0x6c344ed8: i32 = undef
In function: ZN2at6native5apply23sparseElementwiseKernelI12TensorCAddOpINS_4HalfEEmS4_EEvT_NS_4cuda6detail10TensorInfoIT1_T0_EENS9_IlSB_EESC_SB
Generating AMD GCN kernel failed in llc for target: gfx803
clang-7: error: linker command failed with exit code 7 (use -v to see invocation)

Any advice would be appreciated. I'll keep looking.

Thanks!

@briansp2020
Author

FYI: I saw that the Dockerfile was updated and tried to build pytorch & fast.ai. I was able to build them eventually, but noticed that the rocm_agent_enumerator installed in the docker image does not match the latest version (https://github.com/RadeonOpenCompute/rocminfo/blob/master/rocm_agent_enumerator) and fails to run under Python 3.6.

@briansp2020
Author

briansp2020 commented Oct 21, 2018

After building the docker image using the updated docker file (https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm), I needed to uninstall hip-thrust and get the latest Thrust from git (https://github.com/ROCmSoftwarePlatform/Thrust.git) before building PyTorch. I summarized the steps I took to build PyTorch with Python 3.6 here (https://gist.github.com/briansp2020/717f5cab006889f056b36eacee7dd5d7).
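
Roughly, the Thrust swap looked like this (a sketch; see the gist above for the exact steps, and note that the /opt/rocm include path here is an assumption about the image layout):

    # Remove the packaged hip-thrust and use the headers from the
    # ROCmSoftwarePlatform/Thrust repository instead.
    apt-get remove -y hip-thrust
    git clone https://github.com/ROCmSoftwarePlatform/Thrust.git
    cp -r Thrust/thrust /opt/rocm/include/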

After building PyTorch successfully with Python 3.6, I tried fast.ai but ran into issues. When running the lesson1 notebook, the fine-tuning and differential learning rate annealing step shows very low accuracy. This screenshot compares the ROCm output with the output from the notebook that ships with the fast.ai repo: https://imgur.com/a/g4f5InS The ROCm version shows 0.55 accuracy when it should be 0.99.

Any ideas?

@iotamudelta

Could you elaborate on what you mean by / what you observe when you say:

  • rocm_agent_enumerator failing under python 3.6
  • hip-thrust package not working

It's hard to comment on the convergence/accuracy issue you observe directly without knowing what model/kernels are running.

@briansp2020
Author

The rocm_agent_enumerator file in the docker image has no parentheses around the print statement at line 141; the latest file in git has them. It works fine under Python 2 but generates an error under Python 3.6.

File "/opt/rocm/bin/rocm_agent_enumerator", line 141
print "gfx000"
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("gfx000")?
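
A quick workaround sketch (patching the single print in place; replacing the script with the current version from the rocminfo repo linked above is the cleaner fix):

    # Patch the Python-2-style print on line 141 so the script also runs under Python 3.
    sed -i 's/print "gfx000"/print("gfx000")/' /opt/rocm/bin/rocm_agent_enumerator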

I tried building pytorch again without updating thrust and it worked this time...

@briansp2020
Author

After building PyTorch, how do I make sure it is working OK? I ran some scripts under the test directory and got a bunch of errors (https://gist.github.com/briansp2020/c532898bb4a63a65d37281536df850e2). I'm not sure what the expected behavior is.

@briansp2020
Author

Figured out what the issue was with Thrust. I tried both https://github.com/pytorch/pytorch.git and https://github.com/ROCmSoftwarePlatform/pytorch.git. The code from the pytorch repo requires the updated Thrust.

@briansp2020
Author

It looks like https://github.com/pytorch/pytorch.git works better for me. When using ROCmSoftwarePlatform/pytorch.git, all training in fast.ai lesson1 fails. Using pytorch/pytorch.git, training seems to work until all the layers are unfrozen as shown here (https://imgur.com/a/g4f5InS).

@iotamudelta

iotamudelta commented Oct 23, 2018

@briansp2020 before moving on to more complex tests, could you run
PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py --verbose (or python3 if that is what you are running)? If your compilation is good, there will be no errors.
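
That is, from the PyTorch source tree:

    cd pytorch
    # PYTORCH_TEST_WITH_ROCM=1 filters out tests known not to work on ROCm yet;
    # a clean build should report no failures on what remains.
    PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py --verbose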

@briansp2020
Author

Test output from pytorch/pytorch at https://gist.github.com/briansp2020/8ddfdce9b1bc7da335cd5a09f166a8a9

@jithunnair-amd
Collaborator

@iotamudelta the failing test in Brian's run is skipped for python2, which is what our CI runs on. @briansp2020 can you rerun with python2 and report the results?

@briansp2020
Author

Test output from pytorch/pytorch using Python2.
https://gist.github.com/briansp2020/15c5792174c3cee11dfc03c8d999225f

It ran more tests before failing.

@briansp2020
Author

briansp2020 commented Oct 24, 2018

Test and build output from ROCmSoftwarePlatform/pytorch using Python3.
https://gist.github.com/briansp2020/af9610196f5cb36c1b6dcec126aa2658

I see that more tests run and pass. However, test_multinomial_invalid_probs_cuda still fails. Since test_multinomial is skipped on ROCm, test_multinomial_invalid_probs_cuda should be skipped as well.

I also ran into a build error. It seems the compiler crashed while compiling unique_ops_hip.cc.

  1. parser at end of file
  2. Per-module optimization passes
  3. Running pass 'Function Pass Manager' on module '/root/pytorch/caffe2/operators/hip/unique_ops_hip.cc'.
  4. Running pass 'Combine redundant instructions' on function '@_ZN6thrust6system4cuda6detail4cub_28DeviceRadixSortUpsweepKernelINS3_21DeviceRadixSortPolicyIliiE9Policy700ELb0ELb0EliEEvPKT2_PT3_SB_iiNS3_13GridEvenShareISB_EE'
    /opt/rocm/hcc/bin/clang-7(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x2a)[0x17678fa]
    /opt/rocm/hcc/bin/clang-7(_ZN4llvm3sys17RunSignalHandlersEv+0x4c)[0x1765c2c]
    /opt/rocm/hcc/bin/clang-7[0x1765d97]
    /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f2c56230390]
    /opt/rocm/hcc/bin/clang-7[0x1378daa]
    /opt/rocm/hcc/bin/clang-7[0x1379a08]
    /opt/rocm/hcc/bin/clang-7(_ZN4llvm24InstructionCombiningPass13runOnFunctionERNS_8FunctionE+0x232)[0x137a042]
    /opt/rocm/hcc/bin/clang-7(_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE+0x2ca)[0x125d39a]
    /opt/rocm/hcc/bin/clang-7(_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE+0x33)[0x125d463]
    /opt/rocm/hcc/bin/clang-7(_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE+0x32c)[0x125cf0c]
    /opt/rocm/hcc/bin/clang-7(_ZN5clang17EmitBackendOutputERNS_17DiagnosticsEngineERKNS_19HeaderSearchOptionsERKNS_14CodeGenOptionsERKNS_13TargetOptionsERKNS_11LangOptionsERKN4llvm10DataLayoutEPNSE_6ModuleENS_13BackendActionESt10unique_ptrINSE_17raw_pwrite_streamESt14default_deleteISM_EEb+0xc47)[0x195cbb7]
    /opt/rocm/hcc/bin/clang-7[0x2088602]
    /opt/rocm/hcc/bin/clang-7(_ZN5clang8ParseASTERNS_4SemaEbb+0x370)[0x2814310]
    /opt/rocm/hcc/bin/clang-7(_ZN5clang13CodeGenAction13ExecuteActionEv+0x37)[0x2087ba7]
    /opt/rocm/hcc/bin/clang-7(_ZN5clang14FrontendAction7ExecuteEv+0x126)[0x1d5e7c6]
    /opt/rocm/hcc/bin/clang-7(_ZN5clang16CompilerInstance13ExecuteActionERNS_14FrontendActionE+0x146)[0x1d281c6]
    /opt/rocm/hcc/bin/clang-7(_ZN5clang25ExecuteCompilerInvocationEPNS_16CompilerInstanceE+0x96c)[0x1df9c0c]
    /opt/rocm/hcc/bin/clang-7(_Z8cc1_mainN4llvm8ArrayRefIPKcEES2_Pv+0xa18)[0x8f9248]
    /opt/rocm/hcc/bin/clang-7(main+0x17e3)[0x8974c3]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f2c54f8d830]
    /opt/rocm/hcc/bin/clang-7(_start+0x29)[0x8f6a19]
    clang-7: error: unable to execute command: Segmentation fault (core dumped)
    clang-7: error: clang frontend command failed due to signal (use -v to see invocation)
    HCC clang version 7.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 37396a09268c7b90c63b91dbd66fe6c3b765bfee) (ssh://gerritgit/compute/ec/hcc-tot/llvm 4144ab076c5720d34d0f36c158ef3b094fe0d339) (based on HCC 1.2.18383-7685003-37396a0-4144ab0 )
    Target: x86_64-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /opt/rocm/hcc/bin
    clang-7: note: diagnostic msg: PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
    [ 83%] Building HIPCC object caffe2/CMakeFiles/caffe2_hip.dir/sgd/hip/caffe2_hip_generated_fp16_momentum_sgd_op_hip.cc.o
    [ 83%] Building HIPCC object caffe2/CMakeFiles/caffe2_hip.dir/sgd/hip/caffe2_hip_generated_momentum_sgd_op_hip.cc.o
    clang-7: note: diagnostic msg:

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang-7: note: diagnostic msg: /tmp/unique_ops_hip-b16c04.cpp
clang-7: note: diagnostic msg: /tmp/unique_ops_hip-efcfe7.cpp
clang-7: note: diagnostic msg: /tmp/unique_ops_hip-b16c04.sh
clang-7: note: diagnostic msg:


The full build output up to the error message is in the gist as well. Do I need an updated compiler to build it properly?

Edit: fixed the link to gist

@briansp2020
Author

Made a mistake in the previous post. The correct link to the gist is:
https://gist.github.com/briansp2020/af9610196f5cb36c1b6dcec126aa2658

@briansp2020
Author

Rebuilt ROCmSoftwarePlatform/pytorch using Python 3 and, this time, it compiled without crashing.
Skipping test_multinomial_invalid_probs_cuda allowed the test suite to proceed further; tests now fail at test_nn.py.
https://gist.github.com/briansp2020/fcbcd0cb1d962cc39d4eecd4f1a80f6d

@iotamudelta

If you are using our docker file and compiling with Python 2 (note that your host system must be on ROCm 1.9), all unit tests remaining after filtering with PYTORCH_TEST_WITH_ROCM=1 will pass on either pytorch/pytorch or this fork. I suspect you are not using our docker file? Which ROCm repository are you using?

@briansp2020
Author

For Python 2, I used the docker file as is to build ROCmSoftwarePlatform/pytorch; pytorch/pytorch won't compile because of Thrust.
For Python 3, I modified it to install Python 3. The modified file is in this gist: https://gist.github.com/briansp2020/7a7debad0d7f7aabecce1f7c8d39858a
I had to fix rocm_agent_enumerator. I'm using the Thrust from git to build ROCmSoftwarePlatform/pytorch, since that is what was needed to build pytorch/pytorch.

Host is ROCm 1.9.1.

$ uname -a
Linux Ryzen1800X 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ dpkg --list | grep -i rocm
ii hsa-ext-rocr-dev 1.1.9-9-ge4ab040 amd64 AMD Heterogeneous System Architecture HSA - Linux HSA Runtime extensions for ROCm platforms
ii hsa-rocr-dev 1.1.9-9-ge4ab040 amd64 AMD Heterogeneous System Architecture HSA - Linux HSA Runtime for ROCm platforms
ii rocm-clang-ocl 0.3.0-7997136 amd64 OpenCL compilation with clang compiler.
ii rocm-dev 1.9.211 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocm-device-libs 0.0.1 amd64 Radeon Open Compute - device libraries
ii rocm-dkms 1.9.211 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocm-libs 1.9.211 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocm-opencl 1.2.0-2018090737 amd64 OpenCL/ROCm
ii rocm-opencl-dev 1.2.0-2018090737 amd64 OpenCL/ROCm
ii rocm-smi 1.0.0-72-gec1da05 amd64 System Management Interface for ROCm
ii rocm-utils 1.9.211 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocminfo 1.0.0 amd64 Radeon Open Compute (ROCm) Runtime rocminfo tool
ii rocr_debug_agent 1.0.0 amd64 Radeon Open Compute (ROCm) Runtime debug agent

@jithunnair-amd
Collaborator

Test output from pytorch/pytorch using Python2.
https://gist.github.com/briansp2020/15c5792174c3cee11dfc03c8d999225f

It ran more tests before failing.

This particular run failed because you didn't have the "hypothesis" python module installed in your system. If you see similar import errors from python in the future, look for the matching python module to install. Sorry, as Johannes mentioned already, this is a WIP. :)
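
For example (use the pip matching the Python you built with):

    pip install hypothesis    # provides the property-based tests used by the caffe2/operator tests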

@briansp2020
Author

@jithunnair-amd Yes. I ran the test again with hypothesis and was able to go further.

I'm currently fixing up the test script to run the tests in a Python 3 environment. I'm keeping this thread up to date in case I find something useful for the development team.

It seems the test script does not quite work in the Python 3.6 environment that I'm interested in using. I was able to run test_nn.py; the result is in this gist:
https://gist.github.com/briansp2020/0fb92b6ef2b927386b89904a88974f8f

@jithunnair-amd
Collaborator

Unfortunately, Python 2 is the only one thoroughly tested at the moment, so that's the one we have stronger confidence in. Does Python 2 not work for fast.ai?

Your Python 3 log above looks good! I don't see any failures. Why do you say "the test script does not quite work in the Python 3.6 environment that I'm interested in using"?

@jithunnair-amd jithunnair-amd changed the title Builiding PyTorch with ROCm Building PyTorch with ROCm Oct 24, 2018
@briansp2020
Author

Fast.ai requires Python 3.6.

I fixed up the test script to get that result. I can send a pull request if you like; it's just some minor changes in conditions.

Also, I installed librosa, and some tests fail looking for hipFFT. Is hipFFT part of a separate package? It seems it should be part of rocFFT...
https://gist.github.com/briansp2020/fe1605a5aa8fde77448d76032b9e2703

@briansp2020
Author

My pull request to get the tests running on Python 3.6. :)
#291

@iotamudelta

The hipfft API is part of rocFFT. So it's interesting that there are missing symbols. We haven't tried with librosa, so we'll need to see if we can reproduce and fix.

Thanks for the PR!

@briansp2020
Author

Just wondering: do you have a target release date for PyTorch for ROCm?

@briansp2020
Author

I was looking into why fast.ai does not converge when running PyTorch on ROCm and noticed that the loss vs. learning rate curve looks very different between the ROCm build and the version from the fast.ai repo. The ROCm version is much narrower, and the minimum does not seem as low either. Is this expected behavior? It seems I'd have to adjust the learning rate to make the model converge. See the screenshot below for comparison.
https://imgur.com/a/oOWvwoo

@briansp2020
Author

FYI: I noticed that packages were updated early this morning. However, using these latest packages in the docker container causes issues compiling PyTorch. Output for Python 2 (https://gist.github.com/briansp2020/2b149154eaac6da7073ad559951301b6) and Python 3 (https://gist.github.com/briansp2020/58bba440fd7925787a73a966d75847b7).

The linker seems to crash:
Stack dump:
0. Program arguments: /opt/rocm/hcc/bin/llc -O2 -mtriple amdgcn--amdhsa-amdgiz -mcpu=gfx900 -filetype=obj -o /tmp/tmp.RG3wASxAM4/caffe2_hip_generated_ReduceOpsKernel.cu.kernel.bc-gfx900.isabin /tmp/tmp.RG3wASxAM4/caffe2_hip_generated_ReduceOpsKernel.cu.kernel.bc-gfx900.isabin.opt.bc

  1. Running pass 'CallGraph Pass Manager' on module '/tmp/tmp.RG3wASxAM4/caffe2_hip_generated_ReduceOpsKernel.cu.kernel.bc-gfx900.isabin.opt.bc'.
  2. Running pass 'Simple Register Coalescing' on function '@ZN2at6native13reduce_kernelILi512ENS0_8ReduceOpIsZNS0_15sum_kernel_implIssEEvRNS_14TensorIteratorEEUlssE_EEEEvT0'
    /opt/rocm/hcc/bin/llc(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x2a)[0x15554ba]
    /opt/rocm/hcc/bin/llc(_ZN4llvm3sys17RunSignalHandlersEv+0x4c)[0x15537ec]
    /opt/rocm/hcc/bin/llc[0x1553957]
    /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fac40cfb390]
    /opt/rocm/hcc/bin/llc[0xdfad58]
    /opt/rocm/hcc/bin/llc[0xdff624]
    /opt/rocm/hcc/bin/llc(_ZN4llvm12LiveInterval15refineSubRangesERNS_20BumpPtrAllocatorImplINS_15MallocAllocatorELm4096ELm4096EEENS_11LaneBitmaskESt8functionIFvRNS0_8SubRangeEEE+0x2c4)[0xca17e4]
    /opt/rocm/hcc/bin/llc[0xdfe127]
    /opt/rocm/hcc/bin/llc[0xe00f77]
    /opt/rocm/hcc/bin/llc[0xe04066]
    /opt/rocm/hcc/bin/llc(_ZN4llvm19MachineFunctionPass13runOnFunctionERNS_8FunctionE+0x91)[0xd17d11]
    /opt/rocm/hcc/bin/llc(_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE+0x2ca)[0x10129ea]
    /opt/rocm/hcc/bin/llc[0xaa22de]
    /opt/rocm/hcc/bin/llc(_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE+0x32c)[0x101255c]
    /opt/rocm/hcc/bin/llc[0x672105]
    /opt/rocm/hcc/bin/llc(main+0x2f6)[0x613f86]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fac3fa58830]
    /opt/rocm/hcc/bin/llc(_start+0x29)[0x666939]
    /opt/rocm/hcc/bin/clamp-device: line 231: 16263 Segmentation fault (core dumped) $LLC $KMOPTLLC -mtriple amdgcn--amdhsa-amdgiz -mcpu=$AMDGPU_TARGET -filetype=obj -o $2 $2.opt.bc
    Generating AMD GCN kernel failed in llc for target: gfx900
    clang-7: error: linker command failed with exit code 139 (use -v to see invocation)

@iotamudelta

Thanks! We are aware of this; it'll be fixed by the release. Do you still have a working docker image around?

@briansp2020
Author

Yes. I do have a working docker image.

@Citronnade

I'm running into the same issue with the provided dockerfile and building for gfx900. Is there a quick fix I can use to allow it to compile?

@iotamudelta

@Citronnade adjust the repository to http://repo.radeon.com/rocm/misc/facebook/apt/.apt_1.9.white_rabbit/debian/
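
Roughly, inside the docker file or container (the sources.list file name and the 'xenial main' suite/component follow the usual ROCm apt layout and are assumptions, so adjust as needed):

    # Point apt at the white_rabbit repository above, then refresh the package lists.
    echo 'deb [arch=amd64] http://repo.radeon.com/rocm/misc/facebook/apt/.apt_1.9.white_rabbit/debian/ xenial main' \
      > /etc/apt/sources.list.d/rocm.list
    apt-get update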

@matthewmax09

matthewmax09 commented Nov 12, 2018

Hello,

Just want to check whether this is a concern: when I hipify the source code, I get an error.

(py35) root@de18c9004c48:/data/pytorch# python tools/amd_build/build_pytorch_amd.py
error: patch failed: torch/cuda/init.py:123
error: torch/cuda/init.py: patch does not apply

The reason for asking is that after building pytorch and running the test, 12 tests fail with this summary:

Ran 2048 tests in 58.064s

FAILED (failures=12, skipped=271)
Traceback (most recent call last):
File "test/run_test.py", line 394, in
main()
File "test/run_test.py", line 386, in main
raise RuntimeError(message)
RuntimeError: test_cuda failed!

@jithunnair-amd
Collaborator

@matthewmax09 Yes, normally you should not get an error during hipification. This usually occurs when you have a partially or fully hipified directory and run the hipification script again. You might want to try resetting your repo or starting with a fresh one.
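
Roughly, on a checkout with no local changes you want to keep:

    cd pytorch
    git checkout -- .      # discard in-place hipification edits to tracked files
    git clean -fd          # remove generated, untracked HIP sources
    python tools/amd_build/build_pytorch_amd.py   # hipify once, on the clean tree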

@matthewmax09

@jithunnair-amd

Ok, I recloned https://github.com/ROCmSoftwarePlatform/pytorch and ran hipify again. No errors this time! Thank you for that advice! But I am still getting the same errors during testing; I pasted them in this gist:

https://gist.github.com/matthewmax09/265139c78b27100dd75d95bacc9298af

It seems to me that the tests were expecting the tensor to be less than or equal to 1e-05, which wasn't the case, resulting in the failure.

@odellus

odellus commented Nov 25, 2018

I'm seeing the same error.

@iotamudelta iotamudelta self-assigned this Jan 4, 2019
@iotamudelta

Closing this issue as we are now at ROCm 2.3 and tests should have stabilized. @odellus @matthewmax09 please update and re-open if this is not the case, thank you!
