
Building PyTorch with ROCm #258


Closed
briansp2020 opened this issue Oct 10, 2018 · 37 comments

@briansp2020

❓ Questions and Help

Please note that this issue tracker is not a help form and this issue will be closed.

I'm trying to build PyTorch to run on ROCm (Ubuntu 18.04) and am having issues. I tried the following.

  1. I followed https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm, but it seems to have failed at pyyaml (https://gist.github.com/briansp2020/114bd75ff0182197cf7efc7af265e89c).
    I got past that error by installing wheel. However, the build still failed later (https://gist.github.com/briansp2020/2719353d626968082410011dc36608cf).

  2. I tried building it in the TensorFlow docker image and got https://gist.github.com/briansp2020/2a109c0f1d40b45299cb73a76a255767

It seems the wiki is out of date, and I needed to get the latest rocSPARSE (https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases) to get past the CMake phase. Unfortunately, the build still failed (https://gist.github.com/briansp2020/52047cf73d8d59ddd72f730d779b952c)...
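
For reference, pulling a rocSPARSE release into the container looks roughly like this (the tag and .deb name below are placeholders, not the exact files I used):

    # Placeholder example: install a rocSPARSE release .deb inside the container;
    # substitute the actual tag and asset name from the releases page above.
    wget https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/download/<tag>/<rocsparse-package.deb>
    dpkg -i <rocsparse-package.deb>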

Do you have up-to-date instructions on how to build PyTorch with ROCm? My goal is to run fast.ai on a Vega FE with ROCm.

Thanks!

@iotamudelta

We are in the process of releasing a new version of ROCm for PyTorch soon. The wiki and docker file will be updated then. Stay tuned!

@jithunnair-amd
Collaborator

@briansp2020 Please don't forget to run Step 6 of https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm before you start the build process. It seems you didn't "hipify" using that step, which is why your code is still referencing CUDA headers.
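
Roughly, that step amounts to running the hipify script from the repository root; a minimal sketch (see the wiki for the exact sequence):

    # Run from the root of the PyTorch checkout, before invoking the build.
    # build_pytorch_amd.py rewrites CUDA sources/headers to their HIP
    # equivalents; skipping it leaves the build referencing CUDA headers.
    cd pytorch
    python tools/amd_build/build_pytorch_amd.py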

@briansp2020
Author

I was able to make some more progress. I did run hipify, but it failed because I had not set the language environment variables. After setting them and installing hipSPARSE & rocSPARSE into the ROCm/TensorFlow docker image, the build got further, reaching 87% (https://gist.github.com/briansp2020/159742c61da5bb205b7214ac980ff092).

Some errors I still get are

In file included from /root/pytorch/aten/src/THC/THCTensorMathReduce.cu:1:
/root/pytorch/aten/src/THC/THCTensorMathReduce.cuh:121:61: error: '__builtin_inff': no overloaded function has restriction specifiers that are compatible with the ambient context 'THCTensor_kernel_renorm'
if (THCNumerics::eq(value, scalar_cast<AccT, float>(INFINITY))) {
^
/usr/include/x86_64-linux-gnu/bits/inf.h:26:34: note: expanded from macro 'INFINITY'

LLVM ERROR: Cannot select: 0xb7325640: ch = store<(store 8 into %ir.dense.coerce.fca.0.gep16, addrspace 5), trunc to i64> 0xb773aa58, 0xb5979530, FrameIndex:i32<0>, undef:i32
0xb5979530: i32 = srl 0xb622f9f0, Constant:i32<16>
0xb622f9f0: i32,ch = load<(dereferenceable invariant load 4 from i16 addrspace(4)* undef, align 16, addrspace 4)> 0x3e0d508, 0xb5bfbd00, undef:i64
0xb5bfbd00: i64,ch = CopyFromReg 0x3e0d508, Register:i64 %47
0x668b4f50: i64 = Register %47
0xb575d268: i64 = undef
0xb7fbc7c0: i32 = Constant<16>
0xb5b29038: i32 = FrameIndex<0>
0xb5c0d4b8: i32 = undef
In function: ZN2at6native5apply23sparseElementwiseKernelI12TensorCAddOpINS_4HalfEEmS4_EEvT_NS_4cuda6detail10TensorInfoIT1_T0_EENS9_IlSB_EESC_SB
Generating AMD GCN kernel failed in llc for target: gfx906
LLVM ERROR: Cannot select: 0x6688ad48: ch = store<(store 8 into %ir.dense.coerce.fca.0.gep16, addrspace 5), trunc to i64> 0xb405a170, 0x66b1be88, FrameIndex:i32<0>, undef:i32
0x66b1be88: i32 = srl 0xb54e5028, Constant:i32<16>
0xb54e5028: i32,ch = load<(dereferenceable invariant load 4 from i16 addrspace(4)* undef, align 16, addrspace 4)> 0x3eb5498, 0x6b130040, undef:i64
0x6b130040: i64,ch = CopyFromReg 0x3eb5498, Register:i64 %47
0x6aecdfc8: i64 = Register %47
0xb6fbef08: i64 = undef
0x6b71b7b8: i32 = Constant<16>
0xb3fde388: i32 = FrameIndex<0>
0x66924668: i32 = undef
In function: ZN2at6native5apply23sparseElementwiseKernelI12TensorCAddOpINS_4HalfEEmS4_EEvT_NS_4cuda6detail10TensorInfoIT1_T0_EENS9_IlSB_EESC_SB
Generating AMD GCN kernel failed in llc for target: gfx900
LLVM ERROR: Cannot select: 0xb91ced38: ch = store<(store 8 into %ir.dense.coerce.fca.0.gep16, addrspace 5), trunc to i64> 0xb6d397d8, 0xb55c15d0, FrameIndex:i32<0>, undef:i32
0xb55c15d0: i32 = srl 0x67a21678, Constant:i32<16>
0x67a21678: i32,ch = load<(dereferenceable invariant load 4 from i16 addrspace(4)* undef, align 16, addrspace 4)> 0x3ef94f8, 0xb6d5a770, undef:i64
0xb6d5a770: i64,ch = CopyFromReg 0x3ef94f8, Register:i64 %47
0xb6a24280: i64 = Register %47
0xb53ad3f0: i64 = undef
0x66205d58: i32 = Constant<16>
0xb91dacc8: i32 = FrameIndex<0>
0x6c344ed8: i32 = undef
In function: ZN2at6native5apply23sparseElementwiseKernelI12TensorCAddOpINS_4HalfEEmS4_EEvT_NS_4cuda6detail10TensorInfoIT1_T0_EENS9_IlSB_EESC_SB
Generating AMD GCN kernel failed in llc for target: gfx803
clang-7: error: linker command failed with exit code 7 (use -v to see invocation)

Any advice would be appreciated. I'll keep looking.

Thanks!

@briansp2020
Author

FYI: I saw that the Dockerfile was updated and tried to build pytorch & fast.ai. I was able to build them eventually, but noticed that the rocm_agent_enumerator installed in the docker image does not match the latest version (https://github.com/RadeonOpenCompute/rocminfo/blob/master/rocm_agent_enumerator) and fails to run under Python 3.6.

@briansp2020
Author

briansp2020 commented Oct 21, 2018

After building the docker image using the updated docker file (https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm), I needed to uninstall hip-thrust and get the latest Thrust from git (https://github.com/ROCmSoftwarePlatform/Thrust.git) before building PyTorch. I summarized the steps I took to build PyTorch with Python 3.6 here (https://gist.github.com/briansp2020/717f5cab006889f056b36eacee7dd5d7).
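
Roughly, the Thrust swap looked like this (a sketch; see the gist above for the exact steps, and note that the /opt/rocm include path here is an assumption about the image layout):

    # Remove the packaged hip-thrust and use the headers from the
    # ROCmSoftwarePlatform/Thrust repository instead.
    apt-get remove -y hip-thrust
    git clone https://github.com/ROCmSoftwarePlatform/Thrust.git
    cp -r Thrust/thrust /opt/rocm/include/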

After building PyTorch successfully with Python 3.6, I tried fast.ai but ran into issues. When running the lesson1 notebook, the fine-tuning and differential learning rate annealing step shows very low accuracy. This screenshot compares the ROCm output with the output from the notebook that ships with the fast.ai repo: https://imgur.com/a/g4f5InS The ROCm version shows 0.55 accuracy when it should be 0.99.

Any ideas?

@iotamudelta

Could you elaborate on what you mean by / what you observe when you say:

  • rocm_agent_enumerator failing under python 3.6
  • hip-thrust package not working

It's hard to comment on the convergence/accuracy issue you observe directly without knowing what model/kernels are running.

@briansp2020
Author

The rocm_agent_enumerator file in the docker image has no parentheses around the print statement at line 141; the latest file in git has them. It works fine under Python 2 but generates an error under Python 3.6.

File "/opt/rocm/bin/rocm_agent_enumerator", line 141
print "gfx000"
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("gfx000")?
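
A quick workaround sketch (patching the single print in place; replacing the script with the current version from the rocminfo repo linked above is the cleaner fix):

    # Patch the Python-2-style print on line 141 so the script also runs under Python 3.
    sed -i 's/print "gfx000"/print("gfx000")/' /opt/rocm/bin/rocm_agent_enumerator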

I tried building pytorch again without updating thrust and it worked this time...

@briansp2020
Author

After building PyTorch, how do I make sure it is working OK? I ran some scripts under the test directory and got a bunch of errors (https://gist.github.com/briansp2020/c532898bb4a63a65d37281536df850e2). I'm not sure what the expected behavior is.

@briansp2020
Author

Figured out what the issue was with Thrust. I tried both https://github.com/pytorch/pytorch.git and https://github.com/ROCmSoftwarePlatform/pytorch.git. The code from the pytorch repo requires the updated Thrust.

@briansp2020
Author

It looks like https://github.com/pytorch/pytorch.git works better for me. When using ROCmSoftwarePlatform/pytorch.git, all training in fast.ai lesson1 fails. Using pytorch/pytorch.git, training seems to work until all the layers are unfrozen as shown here (https://imgur.com/a/g4f5InS).

@iotamudelta

iotamudelta commented Oct 23, 2018

@briansp2020 before moving on to more complex tests, could you run
PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py --verbose (or python3 if that is what you are running)? If your compilation is good, there will be no errors.
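
That is, from the PyTorch source tree:

    cd pytorch
    # PYTORCH_TEST_WITH_ROCM=1 filters out tests known not to work on ROCm yet;
    # a clean build should report no failures on what remains.
    PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py --verbose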

@briansp2020
Author

Test output from pytorch/pytorch at https://gist.github.com/briansp2020/8ddfdce9b1bc7da335cd5a09f166a8a9

@jithunnair-amd
Collaborator

@iotamudelta the failing test in Brian's run is skipped for python2, which is what our CI runs on. @briansp2020 can you rerun with python2 and report the results?

@briansp2020
Author

Test output from pytorch/pytorch using Python2.
https://gist.github.com/briansp2020/15c5792174c3cee11dfc03c8d999225f

It ran more tests before failing.

@briansp2020
Author

briansp2020 commented Oct 24, 2018

Test and build output from ROCmSoftwarePlatform/pytorch using Python3.
https://gist.github.com/briansp2020/af9610196f5cb36c1b6dcec126aa2658

I see that more tests run and pass. However, test_multinomial_invalid_probs_cuda still fails. Since test_multinomial is skipped on ROCm, test_multinomial_invalid_probs_cuda should be skipped as well.

I also ran into a build error. It seems the compiler crashed while compiling unique_ops_hip.cc.

  1. parser at end of file
  2. Per-module optimization passes
  3. Running pass 'Function Pass Manager' on module '/root/pytorch/caffe2/operators/hip/unique_ops_hip.cc'.
  4. Running pass 'Combine redundant instructions' on function '@_ZN6thrust6system4cuda6detail4cub_28DeviceRadixSortUpsweepKernelINS3_21DeviceRadixSortPolicyIliiE9Policy700ELb0ELb0EliEEvPKT2_PT3_SB_iiNS3_13GridEvenShareISB_EE'
    /opt/rocm/hcc/bin/clang-7(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x2a)[0x17678fa]
    /opt/rocm/hcc/bin/clang-7(_ZN4llvm3sys17RunSignalHandlersEv+0x4c)[0x1765c2c]
    /opt/rocm/hcc/bin/clang-7[0x1765d97]
    /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f2c56230390]
    /opt/rocm/hcc/bin/clang-7[0x1378daa]
    /opt/rocm/hcc/bin/clang-7[0x1379a08]
    /opt/rocm/hcc/bin/clang-7(_ZN4llvm24InstructionCombiningPass13runOnFunctionERNS_8FunctionE+0x232)[0x137a042]
    /opt/rocm/hcc/bin/clang-7(_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE+0x2ca)[0x125d39a]
    /opt/rocm/hcc/bin/clang-7(_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE+0x33)[0x125d463]
    /opt/rocm/hcc/bin/clang-7(_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE+0x32c)[0x125cf0c]
    /opt/rocm/hcc/bin/clang-7(_ZN5clang17EmitBackendOutputERNS_17DiagnosticsEngineERKNS_19HeaderSearchOptionsERKNS_14CodeGenOptionsERKNS_13TargetOptionsERKNS_11LangOptionsERKN4llvm10DataLayoutEPNSE_6ModuleENS_13BackendActionESt10unique_ptrINSE_17raw_pwrite_streamESt14default_deleteISM_EEb+0xc47)[0x195cbb7]
    /opt/rocm/hcc/bin/clang-7[0x2088602]
    /opt/rocm/hcc/bin/clang-7(_ZN5clang8ParseASTERNS_4SemaEbb+0x370)[0x2814310]
    /opt/rocm/hcc/bin/clang-7(_ZN5clang13CodeGenAction13ExecuteActionEv+0x37)[0x2087ba7]
    /opt/rocm/hcc/bin/clang-7(_ZN5clang14FrontendAction7ExecuteEv+0x126)[0x1d5e7c6]
    /opt/rocm/hcc/bin/clang-7(_ZN5clang16CompilerInstance13ExecuteActionERNS_14FrontendActionE+0x146)[0x1d281c6]
    /opt/rocm/hcc/bin/clang-7(_ZN5clang25ExecuteCompilerInvocationEPNS_16CompilerInstanceE+0x96c)[0x1df9c0c]
    /opt/rocm/hcc/bin/clang-7(_Z8cc1_mainN4llvm8ArrayRefIPKcEES2_Pv+0xa18)[0x8f9248]
    /opt/rocm/hcc/bin/clang-7(main+0x17e3)[0x8974c3]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f2c54f8d830]
    /opt/rocm/hcc/bin/clang-7(_start+0x29)[0x8f6a19]
    clang-7: error: unable to execute command: Segmentation fault (core dumped)
    clang-7: error: clang frontend command failed due to signal (use -v to see invocation)
    HCC clang version 7.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 37396a09268c7b90c63b91dbd66fe6c3b765bfee) (ssh://gerritgit/compute/ec/hcc-tot/llvm 4144ab076c5720d34d0f36c158ef3b094fe0d339) (based on HCC 1.2.18383-7685003-37396a0-4144ab0 )
    Target: x86_64-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /opt/rocm/hcc/bin
    clang-7: note: diagnostic msg: PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
    [ 83%] Building HIPCC object caffe2/CMakeFiles/caffe2_hip.dir/sgd/hip/caffe2_hip_generated_fp16_momentum_sgd_op_hip.cc.o
    [ 83%] Building HIPCC object caffe2/CMakeFiles/caffe2_hip.dir/sgd/hip/caffe2_hip_generated_momentum_sgd_op_hip.cc.o
    clang-7: note: diagnostic msg:

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang-7: note: diagnostic msg: /tmp/unique_ops_hip-b16c04.cpp
clang-7: note: diagnostic msg: /tmp/unique_ops_hip-efcfe7.cpp
clang-7: note: diagnostic msg: /tmp/unique_ops_hip-b16c04.sh
clang-7: note: diagnostic msg:


The full build output up to the error message is in the gist as well. Do I need an updated compiler to build it properly?

Edit: fixed the link to gist

@briansp2020
Author

Made a mistake in the previous post. The correct link to the gist is:
https://gist.github.com/briansp2020/af9610196f5cb36c1b6dcec126aa2658

@briansp2020
Author

Rebuilt ROCmSoftwarePlatform/pytorch using Python 3 and, this time, it compiled without crashing.
Skipping test_multinomial_invalid_probs_cuda allowed the test suite to proceed further; tests now fail at test_nn.py.
https://gist.github.com/briansp2020/fcbcd0cb1d962cc39d4eecd4f1a80f6d

@iotamudelta

If you are using our docker file and compiling with Python 2 (note that your host system must be on ROCm 1.9), all unit tests remaining after filtering with PYTORCH_TEST_WITH_ROCM=1 will pass on either pytorch/pytorch or this fork. I suspect you are not using our docker file? Which ROCm repository are you using?

@briansp2020
Author

For Python 2, I used the docker file as is to build ROCmSoftwarePlatform/pytorch; pytorch/pytorch won't compile because of Thrust.
For Python 3, I modified it to install Python 3. The modified file is in this gist: https://gist.github.com/briansp2020/7a7debad0d7f7aabecce1f7c8d39858a
I had to fix rocm_agent_enumerator. I'm using the Thrust from git to build ROCmSoftwarePlatform/pytorch, since that is what was needed to build pytorch/pytorch.

Host is ROCm 1.9.1.

$ uname -a
Linux Ryzen1800X 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ dpkg --list | grep -i rocm
ii hsa-ext-rocr-dev 1.1.9-9-ge4ab040 amd64 AMD Heterogeneous System Architecture HSA - Linux HSA Runtime extensions for ROCm platforms
ii hsa-rocr-dev 1.1.9-9-ge4ab040 amd64 AMD Heterogeneous System Architecture HSA - Linux HSA Runtime for ROCm platforms
ii rocm-clang-ocl 0.3.0-7997136 amd64 OpenCL compilation with clang compiler.
ii rocm-dev 1.9.211 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocm-device-libs 0.0.1 amd64 Radeon Open Compute - device libraries
ii rocm-dkms 1.9.211 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocm-libs 1.9.211 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocm-opencl 1.2.0-2018090737 amd64 OpenCL/ROCm
ii rocm-opencl-dev 1.2.0-2018090737 amd64 OpenCL/ROCm
ii rocm-smi 1.0.0-72-gec1da05 amd64 System Management Interface for ROCm
ii rocm-utils 1.9.211 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocminfo 1.0.0 amd64 Radeon Open Compute (ROCm) Runtime rocminfo tool
ii rocr_debug_agent 1.0.0 amd64 Radeon Open Compute (ROCm) Runtime debug agent

@jithunnair-amd
Collaborator

Test output from pytorch/pytorch using Python2.
https://gist.github.com/briansp2020/15c5792174c3cee11dfc03c8d999225f

It ran more tests before failing.

This particular run failed because you didn't have the "hypothesis" python module installed in your system. If you see similar import errors from python in the future, look for the matching python module to install. Sorry, as Johannes mentioned already, this is a WIP. :)
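
For example (use the pip matching the Python you built with):

    pip install hypothesis    # provides the property-based tests used by the caffe2/operator tests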

@briansp2020
Author

@jithunnair-amd Yes. I ran the test again with hypothesis and was able to go further.

I'm currently fixing up the test script to run the tests in a Python 3 environment. I'm keeping this thread up to date in case I find something useful for the development team.

It seems the test script does not quite work in the Python 3.6 environment that I'm interested in using. I was able to run test_nn.py; the result is in this gist:
https://gist.github.com/briansp2020/0fb92b6ef2b927386b89904a88974f8f

@jithunnair-amd
Collaborator

Unfortunately, Python 2 is the only one thoroughly tested at the moment, so that's the one we have stronger confidence in. Does Python 2 not work for fast.ai?

Your Python 3 log above looks good! I don't see any failures. Why do you say "the test script does not quite work in the Python 3.6 environment that I'm interested in using"?

@jithunnair-amd jithunnair-amd changed the title Builiding PyTorch with ROCm Building PyTorch with ROCm Oct 24, 2018
@briansp2020
Author

Fast.ai requires Python 3.6.

I fixed up the test script to get that result. I can send a pull request if you like; it's just some minor changes in conditions.

Also, I installed librosa, and some tests fail looking for hipFFT. Is hipFFT part of a separate package? It seems it should be part of rocFFT...
https://gist.github.com/briansp2020/fe1605a5aa8fde77448d76032b9e2703

@briansp2020
Author

My pull request to get the tests running on Python 3.6. :)
#291

@iotamudelta

The hipfft API is part of rocFFT. So it's interesting that there are missing symbols. We haven't tried with librosa, so we'll need to see if we can reproduce and fix.

Thanks for the PR!

@briansp2020
Author

Just wondering: do you have a target release date for PyTorch for ROCm?

@briansp2020
Author

I was looking into why fast.ai does not converge when running PyTorch on ROCm and noticed that the loss vs. learning rate curve looks very different between the ROCm build and the version from the fast.ai repo. The ROCm version is much narrower, and the minimum does not seem as low either. Is this expected behavior? It seems I'd have to adjust the learning rate to make the model converge. See the screenshot below for comparison.
https://imgur.com/a/oOWvwoo

@briansp2020
Author

FYI: I noticed that packages were updated early this morning. However, using these latest packages in the docker container causes issues compiling PyTorch. Output for Python 2 (https://gist.github.com/briansp2020/2b149154eaac6da7073ad559951301b6) and Python 3 (https://gist.github.com/briansp2020/58bba440fd7925787a73a966d75847b7).

The linker seems to crash:
Stack dump:
0. Program arguments: /opt/rocm/hcc/bin/llc -O2 -mtriple amdgcn--amdhsa-amdgiz -mcpu=gfx900 -filetype=obj -o /tmp/tmp.RG3wASxAM4/caffe2_hip_generated_ReduceOpsKernel.cu.kernel.bc-gfx900.isabin /tmp/tmp.RG3wASxAM4/caffe2_hip_generated_ReduceOpsKernel.cu.kernel.bc-gfx900.isabin.opt.bc

  1. Running pass 'CallGraph Pass Manager' on module '/tmp/tmp.RG3wASxAM4/caffe2_hip_generated_ReduceOpsKernel.cu.kernel.bc-gfx900.isabin.opt.bc'.
  2. Running pass 'Simple Register Coalescing' on function '@ZN2at6native13reduce_kernelILi512ENS0_8ReduceOpIsZNS0_15sum_kernel_implIssEEvRNS_14TensorIteratorEEUlssE_EEEEvT0'
    /opt/rocm/hcc/bin/llc(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x2a)[0x15554ba]
    /opt/rocm/hcc/bin/llc(_ZN4llvm3sys17RunSignalHandlersEv+0x4c)[0x15537ec]
    /opt/rocm/hcc/bin/llc[0x1553957]
    /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fac40cfb390]
    /opt/rocm/hcc/bin/llc[0xdfad58]
    /opt/rocm/hcc/bin/llc[0xdff624]
    /opt/rocm/hcc/bin/llc(_ZN4llvm12LiveInterval15refineSubRangesERNS_20BumpPtrAllocatorImplINS_15MallocAllocatorELm4096ELm4096EEENS_11LaneBitmaskESt8functionIFvRNS0_8SubRangeEEE+0x2c4)[0xca17e4]
    /opt/rocm/hcc/bin/llc[0xdfe127]
    /opt/rocm/hcc/bin/llc[0xe00f77]
    /opt/rocm/hcc/bin/llc[0xe04066]
    /opt/rocm/hcc/bin/llc(_ZN4llvm19MachineFunctionPass13runOnFunctionERNS_8FunctionE+0x91)[0xd17d11]
    /opt/rocm/hcc/bin/llc(_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE+0x2ca)[0x10129ea]
    /opt/rocm/hcc/bin/llc[0xaa22de]
    /opt/rocm/hcc/bin/llc(_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE+0x32c)[0x101255c]
    /opt/rocm/hcc/bin/llc[0x672105]
    /opt/rocm/hcc/bin/llc(main+0x2f6)[0x613f86]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fac3fa58830]
    /opt/rocm/hcc/bin/llc(_start+0x29)[0x666939]
    /opt/rocm/hcc/bin/clamp-device: line 231: 16263 Segmentation fault (core dumped) $LLC $KMOPTLLC -mtriple amdgcn--amdhsa-amdgiz -mcpu=$AMDGPU_TARGET -filetype=obj -o $2 $2.opt.bc
    Generating AMD GCN kernel failed in llc for target: gfx900
    clang-7: error: linker command failed with exit code 139 (use -v to see invocation)

@iotamudelta

Thanks! We are aware of this; it'll be fixed by the release. Do you still have a working docker image around?

@briansp2020
Author

Yes. I do have a working docker image.

@Citronnade

I'm running into the same issue with the provided dockerfile and building for gfx900. Is there a quick fix I can use to allow it to compile?

@iotamudelta

@Citronnade adjust the repository to http://repo.radeon.com/rocm/misc/facebook/apt/.apt_1.9.white_rabbit/debian/
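
Roughly, inside the docker file or container (the sources.list file name and the 'xenial main' suite/component follow the usual ROCm apt layout and are assumptions, so adjust as needed):

    # Point apt at the white_rabbit repository above, then refresh the package lists.
    echo 'deb [arch=amd64] http://repo.radeon.com/rocm/misc/facebook/apt/.apt_1.9.white_rabbit/debian/ xenial main' \
      > /etc/apt/sources.list.d/rocm.list
    apt-get update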

@matthewmax09

matthewmax09 commented Nov 12, 2018

Hello,

Just want to check whether this is a concern: when I hipify the source code, I get an error.

(py35) root@de18c9004c48:/data/pytorch# python tools/amd_build/build_pytorch_amd.py
error: patch failed: torch/cuda/init.py:123
error: torch/cuda/init.py: patch does not apply

The reason for asking is that after building pytorch and running the test, 12 tests fail with this summary:

Ran 2048 tests in 58.064s

FAILED (failures=12, skipped=271)
Traceback (most recent call last):
File "test/run_test.py", line 394, in
main()
File "test/run_test.py", line 386, in main
raise RuntimeError(message)
RuntimeError: test_cuda failed!

@jithunnair-amd
Collaborator

@matthewmax09 Yes, normally you should not get an error during hipification. This usually occurs when you have a partially or fully hipified directory and run the hipification script again. You might want to try resetting your repo or starting with a fresh one.
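
Roughly, on a checkout with no local changes you want to keep:

    cd pytorch
    git checkout -- .      # discard in-place hipification edits to tracked files
    git clean -fd          # remove generated, untracked HIP sources
    python tools/amd_build/build_pytorch_amd.py   # hipify once, on the clean tree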

@matthewmax09

@jithunnair-amd

Ok, I recloned https://github.com/ROCmSoftwarePlatform/pytorch and ran hipify again. No errors this time! Thank you for that advice! But I am still getting the same errors during testing; I pasted them in this gist:

https://gist.github.com/matthewmax09/265139c78b27100dd75d95bacc9298af

It seems to me that the tests were expecting the tensor to be less than or equal to 1e-05, which wasn't the case, resulting in the failure.

@odellus

odellus commented Nov 25, 2018

I'm seeing the same error.

@iotamudelta iotamudelta self-assigned this Jan 4, 2019
@iotamudelta

Closing this issue as we are now at ROCm 2.3 and tests should have stabilized. @odellus @matthewmax09 please update and re-open if this is not the case, thank you!
