📚 The doc issue
Let's take a look at the steps required to run vLLM on your RTX 5080/5090!
- Initial Setup: To start with, we need a container that has CUDA 12.8 and PyTorch 2.6 so that we have nvcc that can compile for Blackwell.
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-it nvcr.io/nvidia/pytorch:25.02-py3 /bin/bash
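Before proceeding, it may be worth a quick sanity check inside the container to confirm that the CUDA toolkit and your GPU are visible (the exact versions printed depend on your host driver and the container image):
# check the CUDA compiler, the driver/GPU, and the bundled PyTorch build
nvcc --version
nvidia-smi
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"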
- Clone vLLM Repository: Let's clone top-of-tree vLLM. If you have an existing clone or working directory, ensure that it is at or above commit ed6ea06.
git clone https://github.com/vllm-project/vllm.git && cd vllm
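If you are reusing an existing clone, one way to verify that your checkout already contains that commit (a small check of my own, assuming origin points at vllm-project/vllm) is:
# succeeds if ed6ea06 is an ancestor of your current HEAD
git fetch origin && git merge-base --is-ancestor ed6ea06 HEAD && echo "commit present" || echo "please update your checkout"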
- Build vLLM in the container: Now, we start building vLLM. Please note that we can't use precompiled vLLM because vllm-project/vllm has not moved to the required torch and CUDA versions yet, so we leverage the torch and CUDA versions that come with the NGC container. The following steps are the standard build-from-source instructions, with the caveat of running use_existing_torch.py first.
python use_existing_torch.py
pip install -r requirements/build.txt
pip install setuptools_scm
# optionally create a ccache directory if you don't already have a regular CCACHE_DIR
mkdir <path/to/ccache/dir>
CCACHE_DIR=<path/to/ccache/dir> python setup.py develop
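Once the build has run at least once, you can check that ccache is actually being used (this check is my own suggestion, not part of the original steps):
# print ccache hit/miss statistics
ccache -s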
Notes:
- If ccache is not already installed, install it with apt-get update && apt-get install ccache.
- The following may also be needed, depending on your environment.
apt-get update && apt-get install -y --no-install-recommends \
kmod \
git \
python3-pip \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
- To speed up your build, you can leverage the MAX_JOBS flag. Check the number of cores on your CPU using nproc and use it while running your build. For example, if your machine has 16 cores, MAX_JOBS=10 may be a good number that does not overload your CPU with the build. Set it to 1 if you want a single-threaded build or if you run into any issues with your parallel build.
MAX_JOBS=<number> CCACHE_DIR=<path/to/ccache/dir> python setup.py develop
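If you prefer to derive MAX_JOBS from the core count automatically, a rough heuristic (my own suggestion, not from the original instructions) is to leave a few cores free:
# e.g. use all but 4 cores, but never fewer than 1
NPROC=$(nproc)
MAX_JOBS=$(( NPROC > 4 ? NPROC - 4 : 1 )) CCACHE_DIR=<path/to/ccache/dir> python setup.py develop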
- Swap steps 1 and 2 depending on whether you want to reuse your repository for development purposes. If you clone first and then start the container, you may have to grant additional permissions to make changes to the vLLM source from inside the container.
- Test vLLM: Once your build succeeds, run the following to check your installation.
python -c "import vllm; print(vllm.__version__)"
You should see a compiled version of vLLM printed, e.g. 0.7.4 (possibly followed by a + local build suffix).
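For an end-to-end smoke test you can also run a small generation; the model below is just an example of a tiny checkpoint, so substitute any model you have locally or can download:
# minimal offline generation sketch using the vLLM Python API
python - <<'EOF'
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
EOF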
Congratulations, your RTX 5080/5090 is now ready to run vLLM!
Note: the Flash Attention 3 backend doesn't work with Blackwell yet; please use VLLM_FLASH_ATTN_VERSION=2 if you run into any issues.
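For example, when launching the OpenAI-compatible server (the model name below is just a placeholder):
VLLM_FLASH_ATTN_VERSION=2 vllm serve <your-model>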
Thanks @ywang96 for testing this out! Thanks to @kushanam, @kaixih for all the Blackwell support PRs!
Suggest a potential alternative/fix
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.