[RFC]: copy pynvml code into vllm codebase

### Motivation.

We have recently run into repeated problems with the `pynvml` module; see https://github.com/vllm-project/vllm/issues/12847 for an example.

`libnvml.so` is the library behind `nvidia-smi`, and `pynvml` is a Python wrapper around it. We use it to query GPU status without initializing a CUDA context in the current process.
Historically, there are two packages that provide a module named `pynvml`:
- `nvidia-ml-py` (https://pypi.org/project/nvidia-ml-py/): The official wrapper. It is a dependency of vLLM, and is installed when users install vLLM. It provides a Python module named `pynvml`.
- `pynvml` (https://pypi.org/project/pynvml/): An unofficial wrapper. Prior to version 12.0, it also provided a Python module named `pynvml`, and therefore conflicted with the official one. What's worse, its module is a Python package, which takes priority over the official one, a standalone Python file. This causes errors when both are installed. Starting with version 12.0, it moved to a new module named `pynvml_utils` to avoid the conflict.
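The shadowing described above can be reproduced without either NVML wrapper installed. The following is a minimal sketch: the module name `demo_nvml_shadow` and both helper functions are hypothetical, made up for illustration. It places a single-file module and a package with the same name in one directory and asks the import system which one it would pick.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path


def describe_module(name: str) -> str:
    """Classify a top-level module name as seen by the import system."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return "not found"
    # Packages carry a list of submodule search locations; plain modules do not.
    if spec.submodule_search_locations is not None:
        return "package"
    return "single-file module"


def demo_shadowing() -> str:
    """Put `demo_nvml_shadow.py` and `demo_nvml_shadow/` side by side and resolve the name."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Stand-in for the official wrapper: a standalone file.
        (root / "demo_nvml_shadow.py").write_text("KIND = 'file'\n")
        # Stand-in for the unofficial wrapper (< 12.0): a package directory.
        pkg = root / "demo_nvml_shadow"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("KIND = 'package'\n")

        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            return describe_module("demo_nvml_shadow")
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_nvml_shadow", None)


if __name__ == "__main__":
    print(demo_shadowing())
```

Calling `describe_module("pynvml")` in an affected environment should give a quick hint about which wrapper is actually being imported, assuming both land in the same `site-packages` directory.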

For vLLM to work, we have to make sure that either no `pynvml` package is installed, or the installed `pynvml` package is version 12.0 or higher. However, neither is a workable solution:
- vLLM is itself a Python package, so we cannot ask people to uninstall `pynvml` just to make vLLM work.
- If we pin `pynvml==12.0` as a vLLM dependency, vLLM will work, but other libraries will break. Notably, deepspeed depends on `pynvml==11.5.0`: https://github.com/ray-project/ray/blob/9e3ec5972cd952d2b50f3b20abc24ced5abb8b54/python/requirements_compiled.txt#L1611  The naming is so confusing that many community libraries don't know `nvidia-ml-py` is the official one, and many of them depend on `pynvml`, e.g. https://github.com/Sygil-Dev/sygil-webui/blob/d88fa9e8c4d9cefbbfb0b445ad79d4ddb85c8e36/requirements.txt#L17 . What's worse, even the official NVIDIA container `nvcr.io/nvidia/pytorch:25.01-py3` uses the unofficial `pynvml<12.0`.
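For reference, the condition above (no `pynvml` distribution, or one at version 12.0 or higher) can be checked at runtime. This is only a sketch of such a check, not something vLLM does: the function names are hypothetical, and it assumes the version string starts with a plain integer major version.

```python
from importlib import metadata


def is_conflicting_version(version: str) -> bool:
    """Return True for `pynvml` releases that still ship a `pynvml` module (< 12.0)."""
    # Assumption: the version string starts with an integer major component.
    major = int(version.split(".")[0])
    return major < 12


def conflicting_pynvml_installed() -> bool:
    """Return True if the unofficial `pynvml` distribution older than 12.0 is installed."""
    try:
        version = metadata.version("pynvml")
    except metadata.PackageNotFoundError:
        # The unofficial distribution is not installed, so there is no conflict.
        return False
    return is_conflicting_version(version)


if __name__ == "__main__":
    print("conflict detected:", conflicting_pynvml_installed())
```

Such a check can only report the problem; it cannot resolve it, which is why the constraints listed above still apply.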

To summarize, we are in [dependency hell](https://en.wikipedia.org/wiki/Dependency_hell) because of these historically confusing packages.

### Proposed Change.

To solve the problem, I propose copying the code from `nvidia-ml-py` into vLLM and importing it as `vllm.third_party.pynvml`. See https://github.com/vllm-project/vllm/pull/12963 for the prototype.
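As a rough illustration of what call sites could look like after the change, here is a sketch that loads the vendored module by its proposed path and counts GPUs. The helpers `load_nvml` and `gpu_count` are hypothetical; `nvmlInit`, `nvmlDeviceGetCount`, and `nvmlShutdown` are existing functions in the `nvidia-ml-py` wrapper, and the sketch assumes the vendored copy exposes them unchanged.

```python
import importlib
from types import ModuleType


def load_nvml(module_name: str = "vllm.third_party.pynvml") -> ModuleType:
    """Import the NVML wrapper from the vendored location proposed in this RFC."""
    return importlib.import_module(module_name)


def gpu_count(nvml: ModuleType) -> int:
    """Query the number of GPUs through NVML, without creating a CUDA context."""
    nvml.nvmlInit()
    try:
        return nvml.nvmlDeviceGetCount()
    finally:
        # Always release the NVML handle, even if the query fails.
        nvml.nvmlShutdown()


if __name__ == "__main__":
    # Requires a vLLM build containing the vendored module and an NVIDIA driver.
    print(gpu_count(load_nvml()))
```

Because the import path is fully qualified under `vllm`, whatever top-level `pynvml` happens to be installed in the environment no longer matters to vLLM.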

This solution only serves to rescue us from dependency hell. We don't need to maintain the code ourselves. If `nvidia-ml-py` receives bug fixes in the future, we can periodically sync the code.

This is the first time we are copying a whole package into vLLM, so I'm creating a separate directory, `vllm/third_party`, to hold the code.

This RFC also serves as a reference for whenever we need to copy code into `vllm/third_party` in the future.

### Feedback Period.

_No response_

### CC List.

_No response_

### Any Other Things.

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
