
[Doc]: Steps to run vLLM on your RTX5080 or 5090! #14452

@pavanimajety


📚 The doc issue

Let's take a look at the steps required to run vLLM on your RTX 5080/5090!

  1. Initial Setup: To start with, we need a container with CUDA 12.8 and PyTorch 2.6 so that nvcc can compile for Blackwell.
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -it nvcr.io/nvidia/pytorch:25.02-py3 /bin/bash
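Once inside the container, it is worth a quick sanity check that the toolchain matches what we need (an optional verification step, not part of the original instructions):

# should report CUDA 12.8 and PyTorch 2.6.x, and list your RTX 5080/5090
nvcc --version
python -c "import torch; print(torch.__version__, torch.version.cuda)"
nvidia-smi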
  2. Clone vLLM Repository: Let's clone top-of-tree vLLM. If you have an existing clone or working directory, ensure that you are at or above commit ed6ea06.
git clone https://github.com/vllm-project/vllm.git && cd vllm
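If you are re-using an existing checkout, one quick way to confirm that ed6ea06 is already in your history (an optional check, not part of the original steps) is:

git fetch origin
git merge-base --is-ancestor ed6ea06 HEAD && echo "ed6ea06 is an ancestor of HEAD"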
  3. Build vLLM in the container: Now we build vLLM. Note that we can't use precompiled vLLM wheels because vllm-project/vllm has not yet moved to the required PyTorch and CUDA versions, so we rely on the PyTorch and CUDA that ship with the NGC container. The following steps are the standard build-from-source instructions, with the caveat of running use_existing_torch.py first.
python use_existing_torch.py
pip install -r requirements/build.txt
pip install setuptools_scm

# optionally create a ccache directory if you don't already have a regular CCACHE_DIR
mkdir <path/to/ccache/dir>

CCACHE_DIR=<path/to/ccache/dir> python setup.py develop
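If you want to verify that ccache is actually being hit across rebuilds, you can inspect its statistics afterwards (optional):

CCACHE_DIR=<path/to/ccache/dir> ccache -s   # shows cache hits/misses after a build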

Notes:

  • If ccache is not already installed, install it with apt-get update && apt-get install -y ccache.
  • The following packages may also be needed, depending on your environment.
apt-get update && apt-get install -y --no-install-recommends \
    kmod \
    git \
    python3-pip \
    && apt-get clean && rm -rf /var/lib/apt/lists/*
  • To speed up the build, you can use the MAX_JOBS flag. Check the number of cores on your CPU using nproc and pick a value below it. For example, if your machine has 16 cores, MAX_JOBS=10 is a good number that won't overload your CPU. Set it to 1 if you want a single-threaded build or if you run into any issues with the parallel build.
MAX_JOBS=<number> CCACHE_DIR=<path/to/ccache/dir> python setup.py develop
  • Swap steps 1 and 2 depending on whether you want to re-use your repository for development purposes. If you clone first and then start the container, you may have to grant additional permissions to modify the vLLM source inside the container; a sketch of mounting an existing clone into the container is shown below.
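As a sketch of that workflow (the paths are placeholders, adjust them to your setup): clone vLLM on the host first, then mount the checkout into the container and build from there.

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v </path/to/host/vllm>:/workspace/vllm \
    -it nvcr.io/nvidia/pytorch:25.02-py3 /bin/bash
# inside the container
cd /workspace/vllm   # then continue with step 3 above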
  4. Test vLLM: Once your build succeeds, run the following to check your installation.
python -c "import vllm; print(vllm.__version__)"

You should see a compiled version string for vLLM, 0.7.4+ at the time of writing.
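If you also want a quick end-to-end generation check, a minimal sketch using vLLM's offline LLM API looks like this (facebook/opt-125m is just a small example model and assumes you can download it from the Hugging Face Hub):

from vllm import LLM, SamplingParams

# load a small model and generate one completion as a smoke test
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)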

Congratulations, your RTX 5080/5090 is now ready to run vLLM!

Note: the Flash Attention 3 backend doesn't work on Blackwell yet; set VLLM_FLASH_ATTN_VERSION=2 if you run into any issues.
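For example, when starting the OpenAI-compatible server (the model name here is a placeholder):

VLLM_FLASH_ATTN_VERSION=2 vllm serve <your-model>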

Thanks @ywang96 for testing this out! Thanks to @kushanam, @kaixih for all the Blackwell support PRs!

