Building PyTorch with ROCm #258
Comments
We are in the process of releasing a new version of ROCm for PyTorch soon. The wiki and docker file will be updated then. Stay tuned! |
@briansp2020 Please don't forget to run Step 6 of https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm before you start the build process. It seems you didn't "hipify" using that step, which is why your code is still referencing CUDA headers. |
I was able to make some more progress. I did run hipify, but it failed because I had not set the language environment variables. After setting them and installing hipSPARSE and rocSPARSE in the ROCm/TensorFlow docker image, the build reached 87% (https://gist.github.com/briansp2020/159742c61da5bb205b7214ac980ff092). Some errors I still get are
Any advice would be appreciated. I'll keep looking. Thanks! |
FYI. I saw that the Dockerfile was updated and tried to build pytorch & fast.ai. I was able to build them eventually but noticed that rocm_agent_enumerator installed in the docker image does not match the latest (https://github.com/RadeonOpenCompute/rocminfo/blob/master/rocm_agent_enumerator) and fails to run under Python 3.6. |
After building the docker image using the updated docker file (https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm), I needed to uninstall hip-thrust and get the latest hip-thrust from git (https://github.com/ROCmSoftwarePlatform/Thrust.git) before building PyTorch. I summarized the steps I took to build PyTorch using Python 3.6 here (https://gist.github.com/briansp2020/717f5cab006889f056b36eacee7dd5d7). After building PyTorch successfully with Python 3.6, I tried fast.ai but ran into issues. In the lesson1 notebook, the "Fine-tuning and differential learning rate annealing" step shows very low accuracy. This screenshot compares the ROCm output with the output from the notebook that came with the fast.ai repo: https://imgur.com/a/g4f5InS The ROCm version shows .55 accuracy when it should be .99. Any ideas? |
Could you elaborate what you mean by/observe when you say:
It's hard to comment on the convergence/accuracy issue you observe directly without knowing what model/kernels are running. |
The rocm_agent_enumerator file in the docker image has no parentheses around the print statement at line 141. The latest file in git has them. It seems to work fine in Python 2, but it generates an error when using Python 3.6.
I tried building pytorch again without updating thrust and it worked this time... |
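For anyone else hitting the print issue described above, here is a minimal sketch of why a Python 2 style print statement breaks under Python 3. The two source strings are made up for illustration and are not copied from rocm_agent_enumerator.

```python
# Illustrative source strings only; not the actual contents of rocm_agent_enumerator.
py2_style = 'print "gfx900"'
py3_style = 'print("gfx900")'


def compiles(source):
    """Return True if the running interpreter can compile `source`."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles(py2_style))  # False under Python 3
print(compiles(py3_style))  # True under both Python 2.7 and Python 3
```

The parenthesized form is accepted by both interpreters, which is presumably why the newer file in git works in either environment.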
After building PyTorch, how do I make sure it is working OK? I ran some scripts under the test directory and got a bunch of errors (https://gist.github.com/briansp2020/c532898bb4a63a65d37281536df850e2). I'm not sure what the expected behavior is. |
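One possible quick sanity check before running the full test suite is sketched below. It is not an official procedure from the maintainers; the `smoke_test` helper and the tolerance are assumptions made for this example. It relies on the fact that ROCm builds of PyTorch expose the GPU through the `torch.cuda` namespace.

```python
import importlib.util


def smoke_test():
    """Collect a few basic facts about the installed PyTorch build (hypothetical helper)."""
    report = {"torch_importable": importlib.util.find_spec("torch") is not None}
    if not report["torch_importable"]:
        return report

    import torch

    report["version"] = torch.__version__
    # ROCm builds expose HIP devices through the torch.cuda namespace.
    report["gpu_available"] = torch.cuda.is_available()
    if report["gpu_available"]:
        a = torch.randn(64, 64)
        b = torch.randn(64, 64)
        cpu_result = a @ b
        gpu_result = (a.cuda() @ b.cuda()).cpu()
        # Loose tolerance (an assumption): float32 matmul results differ slightly per backend.
        report["matmul_matches_cpu"] = torch.allclose(cpu_result, gpu_result, atol=1e-3)
    return report


print(smoke_test())
```

If the GPU result does not match the CPU result even for a small matrix multiply, that points at the build or driver stack rather than at a specific model.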
Figured out what the issue was with thrust. I tried both https://github.com/pytorch/pytorch.git and https://github.com/ROCmSoftwarePlatform/pytorch.git. Code from pytorch repo requires updated thrust. |
It looks like https://github.com/pytorch/pytorch.git works better for me. When using ROCmSoftwarePlatform/pytorch.git, all training in fast.ai lesson1 fails. Using pytorch/pytorch.git, training seems to work until all the layers are unfrozen as shown here (https://imgur.com/a/g4f5InS). |
@briansp2020 before going to more complex tests, could you run |
Test output from pytorch/pytorch at https://gist.github.com/briansp2020/8ddfdce9b1bc7da335cd5a09f166a8a9 |
@iotamudelta the failing test in Brian's run is skipped for python2, which is what our CI runs on. @briansp2020 can you rerun with python2 and report the results? |
Test output from pytorch/pytorch using Python2. It ran more tests before failing. |
Test and build output from ROCmSoftwarePlatform/pytorch using Python3. I see that more tests are run and passed. However, test_multinomial_invalid_probs_cuda still fails. Since test_multinomial is skipped on ROCm, test_multinomial_invalid_probs_cuda should be skipped as well. I also ran into a build error. It seems the compiler crashed while compiling unique_ops_hip.cc.
The full build output up to the error message is in the gist as well. Do I need an updated compiler to build it properly? Edit: fixed the link to the gist |
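The suggestion above (skip the dependent test whenever the base test is skipped) could look roughly like the sketch below. It uses only the standard `unittest` module; the `on_rocm` flag, the factory function, and the placeholder test bodies are assumptions for illustration and do not reflect how the PyTorch test suite actually detects ROCm.

```python
import unittest


def make_test_case(on_rocm):
    """Build a test case whose multinomial tests are skipped together (hypothetical)."""

    class MultinomialTests(unittest.TestCase):
        @unittest.skipIf(on_rocm, "multinomial is skipped on ROCm")
        def test_multinomial(self):
            self.assertTrue(True)  # placeholder body

        @unittest.skipIf(on_rocm, "depends on multinomial, so skip it as well")
        def test_multinomial_invalid_probs(self):
            self.assertTrue(True)  # placeholder body

    return MultinomialTests


def run_suite(on_rocm):
    """Run the sketch test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_test_case(on_rocm))
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite(on_rocm=True)
print(len(result.skipped))
```

Using the same condition on both decorators keeps the two tests in sync, so a platform that cannot run the base test never reports the dependent one as a failure.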
I made a mistake in my previous post. The correct link to the gist is |
Rebuilt ROCmSoftwarePlatform/pytorch using Python3 and, this time, it compiled without crashing. |
If you are using our docker file and compile with python2 (note that your host system must be ROCm 1.9), all unit tests after filtering with |
For Python2, I used the docker file as is to build ROCmSoftwarePlatform/pytorch. pytorch/pytorch won't compile because of thrust. Host is ROCm1.9.1
|
This particular run failed because you didn't have the "hypothesis" python module installed in your system. If you see similar import errors from python in the future, look for the matching python module to install. Sorry, as Johannes mentioned already, this is a WIP. :) |
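To catch missing modules like this before starting a long test run, something along these lines may help. It is only a sketch: the `missing_modules` helper is hypothetical, and the module list is an illustrative guess rather than the test suite's real requirements.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Illustrative list only; adjust it to whatever the import errors mention.
print(missing_modules(["hypothesis", "yaml", "numpy"]))
```

Any name printed here needs its matching package installed before the tests can import it.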
@jithunnair-amd Yes. I ran the test again with hypothesis installed and was able to get further. At the moment I'm fixing up the test script so that it runs in a Python3 environment. I'm keeping this thread up to date in case I find something useful for the development team. It seems the test script does not quite work in the Python 3.6 environment that I'm interested in using. I was able to run test_nn.py and the result is in this gist. |
Unfortunately, python2 is the only one thoroughly tested at the moment, so that's the one we are more confident will work. Does python2 not work for fast.ai? Your python3 log above looks good! I don't see any failures. Why do you say "the test script does not quite work in Python 3.6 environment that I'm interested in using"? |
Fast.ai requires Python3.6. I fixed up the test script to get that result. I can send a pull request if you like. It's just some minor changes in conditions. Also, I installed librosa and some tests fail looking for hipFFT. Is hipFFT part of a separate package? It seems it should be part of rocFFT... |
My pull request to get the tests running on Python 3.6. :) |
The hipfft API is part of rocFFT. So it's interesting that there are missing symbols. We haven't tried with librosa, so we'll need to see if we can reproduce and fix. Thanks for the PR! |
Just wondering. Do you have a target release date for PyTorch for ROCm? |
I was looking into why Fast.ai does not converge when using PyTorch on ROCm and noticed that the loss vs learning rate curve looks very different between the version I get from ROCm and the version from the Fast.ai repo. The ROCm curve is much narrower and the minimum does not seem as low either. Is this expected behavior? It seems I'd have to adjust the learning rate to make the model converge. See the screenshot below for comparison. |
FYI Linker seems to crash
|
Thanks! We are aware of this, it'll be fixed by release. Do you have a working docker image around still? |
Yes. I do have a working docker image. |
I'm running into the same issue with the provided dockerfile and building for gfx900. Is there a quick fix I can use to allow it to compile? |
@Citronnade adjust the repository to |
Hello, I just want to check whether this is a concern: when I hipify the source code I get an error.
(py35) root@de18c9004c48:/data/pytorch# python tools/amd_build/build_pytorch_amd.py
The reason for asking is that after building pytorch and running the test, 12 tests fail with this summary:
Ran 2048 tests in 58.064s
FAILED (failures=12, skipped=271) |
@matthewmax09 Yes, you should not get an error during hipification normally. This usually occurs when you have a partially or fully hipified directory and you run the hipification script again. You might want to try resetting your repo or starting on a fresh one. |
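A rough way to check whether a checkout is still clean before running the hipify script is sketched below. It assumes that files touched or created by an earlier hipify run show up in `git status --porcelain` (they would not if they are covered by `.gitignore`); both helper names are hypothetical.

```python
import subprocess


def changed_paths(porcelain_output):
    """Parse `git status --porcelain` (v1) output into a list of paths."""
    # Each line is two status characters, a space, then the path.
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def working_tree_changes(repo_dir="."):
    """Return modified or untracked paths in the checkout at `repo_dir`."""
    completed = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return changed_paths(completed.stdout)
```

Calling `working_tree_changes("/data/pytorch")` on a fresh clone should return an empty list; a non-empty list suggests the tree has already been modified, in which case resetting or recloning as suggested above is the safer route.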
OK, I recloned https://github.com/ROCmSoftwarePlatform/pytorch and ran hipify again. No errors this time! Thank you for that advice! But I am still getting the same errors during testing; I pasted them in this gist: https://gist.github.com/matthewmax09/265139c78b27100dd75d95bacc9298af It seems to me that the tests expected the difference between tensors to be less than or equal to 1e-05, which wasn't the case, resulting in the failures. |
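As an illustration of the kind of check described above, the sketch below compares two numbers against an absolute tolerance of 1e-05. It is a simplified stand-in written for this thread, not the actual assertion used by the PyTorch tests, and the sample values are made up.

```python
def within_tolerance(actual, expected, tol=1e-5):
    """Return True if `actual` is within `tol` of `expected` (absolute difference)."""
    return abs(actual - expected) <= tol


print(within_tolerance(1.000004, 1.0))  # True: the difference is about 4e-06
print(within_tolerance(1.0002, 1.0))    # False: the difference is about 2e-04
```

A failure of this kind means the GPU result is close to, but not within, the tolerance the test was written for, which is different from a crash or a completely wrong answer.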
I'm seeing the same error. |
Closing this issue as we are now at ROCm 2.3 and tests should have stabilized. @odellus @matthewmax09 please update and re-open if this is not the case, thank you! |
❓ Questions and Help
Please note that this issue tracker is not a help form and this issue will be closed.
I'm trying to build PyTorch to run on ROCm (Ubuntu 18.04) and am having issues. I tried the following.
I followed https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm but it seems to have failed at pyyaml (https://gist.github.com/briansp2020/114bd75ff0182197cf7efc7af265e89c)
I got over the error by installing wheel. However, the build still failed later (https://gist.github.com/briansp2020/2719353d626968082410011dc36608cf)
I tried building it in the tensorflow docker image and I get https://gist.github.com/briansp2020/2a109c0f1d40b45299cb73a76a255767
It seems the wiki is old and I needed to get the latest rocSPARSE (https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases) to get past the CMake phase. Unfortunately, the build still failed (https://gist.github.com/briansp2020/52047cf73d8d59ddd72f730d779b952c)...
Do you have up-to-date instructions on how to build PyTorch with ROCm? My goal is to run fast.ai on Vega FE with ROCm.
Thanks!