
Building PyTorch for ROCm

Jithun Nair edited this page Jan 26, 2021 · 78 revisions

General remarks

This is a quick guide to setting up PyTorch with ROCm support inside a docker container. It assumes a .deb-based system. See ROCm install for supported operating systems and general information on the ROCm software stack. If your host system doesn't have docker installed, please refer to docker install. Adding your user to the docker group is recommended so that docker can be run as a non-root user (see the Docker documentation).

An install of the latest released ROCm version is recommended.

  1. Follow the instructions on the ROCm installation page to install the baseline ROCm driver:
    https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

Option 1 (Recommended): Use docker image with PyTorch pre-installed

Pull the latest public PyTorch docker container

This option provides a docker image with PyTorch for ROCm pre-installed. Users can launch the docker container and train or run deep learning models directly. The image runs on gfx900 (Vega10-type GPUs: MI25, Vega56, Vega64, ...), gfx906 (Vega20-type GPUs: MI50, MI60) and gfx908 (MI100).

docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
This will automatically download the image if it does not exist on the host. You can also pass the -v argument to mount data directories into the container.
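
As a minimal sketch of the -v option, the helper below prints the Option 1 command with an optional host directory mounted into the container. The function name rocm_pytorch_cmd is hypothetical, and the default mount point /data is an assumption borrowed from Options 2 and 3; neither is part of the image itself.

```shell
#!/bin/sh
# Print the "docker run" command for the prebuilt image.
#   $1 (optional): host directory to mount (must not contain spaces)
#   $2 (optional): mount point inside the container, defaults to /data
rocm_pytorch_cmd() {
  host_dir="${1:-}"
  mount_point="${2:-/data}"
  cmd="docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G"
  if [ -n "$host_dir" ]; then
    # -v HOST:CONTAINER bind-mounts the host directory into the container.
    cmd="$cmd -v $host_dir:$mount_point"
  fi
  printf '%s rocm/pytorch:latest\n' "$cmd"
}
```

For example, rocm_pytorch_cmd "$HOME/datasets" prints a command that mounts that directory on /data; review the printed command and then run it.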

Option 2: Install PyTorch using PyTorch ROCm base docker image

  1. Obtain docker image:
    docker pull rocm/pytorch:latest-base

  2. Clone PyTorch repository on the host:
    cd ~
    git clone https://github.com/pytorch/pytorch.git
    cd pytorch
    git submodule update --init --recursive

  3. Start a docker container using the downloaded image:
    sudo docker run -it -v $HOME:/data --privileged --rm --device=/dev/kfd --device=/dev/dri --group-add video rocm/pytorch:latest-base
    Note: This will mount your host home directory on /data in the container.

  4. Change to the PyTorch checkout from within the running container:
    cd /data/pytorch

  5. Build PyTorch for ROCm:
    By default, PyTorch builds for gfx803, gfx900, gfx906 and gfx908 simultaneously. To see which AMD uarch you have, run /opt/rocm/bin/rocm_agent_enumerator (you might need to install the rocminfo package). To compile only for your uarch, export PYTORCH_ROCM_ARCH=<uarch>, where <uarch> is one of gfx803, gfx900, gfx906 or gfx908. Then build with
    .jenkins/pytorch/build.sh
    This first hipifies the PyTorch sources and then compiles them. The build needs 16 GB of RAM available to the docker container.

  6. Confirm working installation:
    .jenkins/pytorch/test.sh
    This runs all CI unit tests, skipping tests as appropriate for your system based on ROCm and, e.g., single- or multi-GPU configuration. No tests will fail if the compilation and installation are correct. Additionally, this step installs torchvision, which most PyTorch scripts use to load models. For example, running the PyTorch examples requires torchvision.
    Individual test sets can be run with
    PYTORCH_TEST_WITH_ROCM=1 python test/test_nn.py --verbose
    where test_nn.py can be replaced with any other test set.

  7. Commit the container to preserve the pytorch install (from the host, while the container is still running, because --rm removes it on exit):
    sudo docker commit -m 'pytorch installed' <container_id>
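
The single-uarch build from step 5 can be sketched as follows. This assumes that rocm_agent_enumerator prints one target per line and reports the CPU agent as gfx000; the helper name first_gpu_arch is hypothetical.

```shell
#!/bin/sh
# Read rocm_agent_enumerator-style output on stdin and print the first GPU
# target. Assumption: the CPU agent is listed as gfx000 and is skipped.
first_gpu_arch() {
  grep -E '^gfx[0-9a-f]+$' | grep -v '^gfx000$' | head -n 1
}

# Usage inside the container, from /data/pytorch:
#   PYTORCH_ROCM_ARCH="$(/opt/rocm/bin/rocm_agent_enumerator | first_gpu_arch)"
#   export PYTORCH_ROCM_ARCH
#   .jenkins/pytorch/build.sh
```

If the helper prints nothing, no GPU target was detected; in that case leave PYTORCH_ROCM_ARCH unset rather than exporting an empty value, so the default multi-target build applies.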

Option 3: Install using PyTorch upstream docker file

  1. Clone PyTorch repository on the host:
    cd ~
    git clone https://github.com/pytorch/pytorch.git
    cd pytorch
    git submodule update --init --recursive

  2. Build PyTorch docker image:
    cd .circleci/docker
    ./build.sh pytorch-linux-bionic-rocm<version>-py3.6 (e.g. ./build.sh pytorch-linux-bionic-rocm3.10-py3.6)
    This should complete with a message "Successfully built <image_id>".
    Note that other software versions may be chosen, but such setups are currently not tested.

  3. Start a docker container using the new image:
    sudo docker run -it -v $HOME:/data --privileged --rm --device=/dev/kfd --device=/dev/dri --group-add video <image_id>
    Note: This will mount your host home directory on /data in the container.

Follow steps 4-7 of Option 2 from here on.
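
To illustrate how the <version> placeholder in step 2 is filled in, here is a small sketch that composes the image name from a ROCm version string. The helper name rocm_image_name is hypothetical, and the py3.6 suffix is taken from the example above; whether a given version is actually supported by build.sh is not checked.

```shell
#!/bin/sh
# Print the build.sh image name for a ROCm version such as 3.10.
# Returns non-zero if the version is empty or contains anything other than
# digits and dots.
rocm_image_name() {
  case "${1:-}" in
    '' | *[!0-9.]*) return 1 ;;
  esac
  printf 'pytorch-linux-bionic-rocm%s-py3.6\n' "$1"
}

# Usage from .circleci/docker:
#   ./build.sh "$(rocm_image_name 3.10)"
```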

Try PyTorch examples

  1. Clone the PyTorch examples repository:
    git clone https://github.com/pytorch/examples.git

  2. Run individual example: MNIST
    cd examples/mnist
    Follow instructions in README.md, in this case:
    pip3 install -r requirements.txt
    python3 main.py

  3. Run individual example: ImageNet training
    cd ../imagenet
    Follow instructions in README.md.
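
Before running the examples, it can help to confirm that PyTorch actually sees the GPU, since a script may otherwise fall back to the CPU. The sketch below relies on ROCm builds of PyTorch exposing AMD GPUs through the torch.cuda API; the helper name torch_sees_gpu and the PYTHON override variable are hypothetical.

```shell
#!/bin/sh
# Print "True" if PyTorch reports a usable GPU, "False" otherwise.
# PYTHON optionally names the interpreter to use; it defaults to python3.
torch_sees_gpu() {
  "${PYTHON:-python3}" -c 'import torch; print(torch.cuda.is_available())'
}

# Usage inside the container:
#   torch_sees_gpu
```

If this prints False, check that the container was started with the --device=/dev/kfd and --device=/dev/dri options shown above.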
