[V1] AsyncLLM data parallel #13923
Conversation
How do we test this? I mean, how do we run the server? I think we need two commands, right?
This pull request has merge conflicts that must be resolved before it can be merged.
The DP-related part looks good to me.
cc @robertgshaw2-redhat: I'm not familiar with the frontend processing part; maybe Robert can take a look?
@v-lmn No, for a single node you can run a single command, with …
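For reference, the single-node form is one command, e.g. (model path and port are placeholders; the DP flag matches the one used later in this thread):

vllm serve /path/to/model --port 8000 --data-parallel-size 2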
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks @tlrmchlsmth! I've addressed those comments. I also had to make some additional adjustments to ensure compatibility with @youkaichao's offline multi-node scenario added in #15484.
This pull request has merge conflicts that must be resolved before it can be merged.
local_dp_rank = vllm_config.parallel_config.data_parallel_rank_local

assert dp_size > 1
assert 0 <= local_dp_rank <= dp_rank < dp_size
why do we need this check?
It's not strictly needed, I just thought it might be good here to verify that the config is in a coherent state.
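For intuition, a minimal sketch of why the invariant holds, assuming global DP ranks are assigned contiguously per node (my reading, not stated in this diff):

dp_size = 8          # total DP ranks across all nodes
ranks_per_node = 4
node_index = 1
local_dp_rank = 2    # rank within this node
dp_rank = node_index * ranks_per_node + local_dp_rank  # global rank 6

# The asserted invariant then holds by construction:
assert dp_size > 1
assert 0 <= local_dp_rank <= dp_rank < dp_size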
from vllm.platforms import current_platform
if current_platform.is_cuda_alike():
    from vllm.platforms.cuda import device_id_to_physical_device_id
tp_size = vllm_config.parallel_config.tensor_parallel_size
You can use world_size to be general, not just tp_size.
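To illustrate the suggestion, a self-contained sketch of deriving the per-rank device block from world_size (tensor parallel size times pipeline parallel size) rather than tp_size alone; all values below are placeholders:

tensor_parallel_size = 2
pipeline_parallel_size = 2
world_size = tensor_parallel_size * pipeline_parallel_size
local_dp_rank = 1
# Assumed contiguous device assignment per local DP rank:
first_device = local_dp_rank * world_size
device_ids = list(range(first_device, first_device + world_size))  # [4, 5, 6, 7]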
@njhill @youkaichao Hi, I tried to use DP as shown above with the latest dev version of vLLM, but the error below occurred. Do you have any clue about this? In the same shell environment I can run without DP, e.g. with tp=2, and there are 8 usable GPUs on the machine. Thanks!

$ uv run vllm serve /path/to/model --port 8088 --served-model-name xxx --data-parallel-size 2
...
myhost:2292596:2292596 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 53000
myhost:2292597:2292597 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 53000
...
(EngineCore_1 pid=2292597) File "/ephnvme/colin/code/prizetrain/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
(EngineCore_0 pid=2292596) self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_1 pid=2292597) self.pynccl_comm = PyNcclCommunicator(
(EngineCore_0 pid=2292596) File "/ephnvme/colin/code/prizetrain/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
(EngineCore_1 pid=2292597) ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2292596) raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_1 pid=2292597) File "/ephnvme/colin/code/prizetrain/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
(EngineCore_1 pid=2292597) self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_1 pid=2292597) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2292596) RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_1 pid=2292597) File "/ephnvme/colin/code/prizetrain/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
(EngineCore_1 pid=2292597) self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_1 pid=2292597) File "/ephnvme/colin/code/prizetrain/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
(EngineCore_1 pid=2292597) raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_1 pid=2292597) RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
@Co1lin I tested this; can you create a separate issue with detailed environment info?
local_unfinished_reqs = self.scheduler.has_unfinished_requests()

if local_unfinished_reqs:
    # 2) Step the engine core.
Considering the presence of WAITING_FOR_REMOTE_KVS and WAITING_FOR_FSM, the condition local_unfinished_reqs = True does not necessarily imply that scheduler_output.total_num_scheduled_tokens > 0. This means that a forward pass may not actually be executed in _process_engine_step -> step -> execute_model, while other cores might still execute a real or dummy forward. @njhill
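To make the distinction concrete, a minimal self-contained sketch (SchedulerOutput here is a stand-in for illustration, not the real class):

from dataclasses import dataclass

@dataclass
class SchedulerOutput:  # minimal stand-in for illustration
    total_num_scheduled_tokens: int

# A request can be unfinished (e.g. WAITING_FOR_REMOTE_KVS or WAITING_FOR_FSM)
# while zero tokens are scheduled this step, so the two signals can disagree:
local_unfinished_reqs = True
scheduler_output = SchedulerOutput(total_num_scheduled_tokens=0)

will_run_forward = scheduler_output.total_num_scheduled_tokens > 0
assert local_unfinished_reqs and not will_run_forward
# This rank skips execute_model while other DP ranks may still run a real
# or dummy forward, which is the mismatch raised above.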
Fixed in #18559
The engine core client starts an engine core process per DP rank and load balances requests between them. A dummy request is sent to idle ranks when the global request count goes from 0 to 1, and when an engine finishes all of its requests it continues in an idle forward loop.
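As a rough sketch of that behavior (names like dp_group.any and dummy_step are hypothetical stand-ins, not this PR's actual API):

def dp_engine_loop(engine, dp_group):
    # Illustrative only: each DP rank steps while any rank in the group
    # still has work, so collective ops stay in lockstep across ranks.
    while True:
        local_busy = engine.has_unfinished_requests()
        global_busy = dp_group.any(local_busy)  # hypothetical group reduction
        if not global_busy:
            break                # the whole DP group is idle
        if local_busy:
            engine.step()        # real forward pass
        else:
            engine.dummy_step()  # idle rank runs a dummy forward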
Working for single node:
I aimed to keep the data parallel logic as isolated as possible (in subclasses of the core engine and client) to avoid adding complexity or overhead to the more common default dp=1 case.
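A sketch of that isolation on the client side (class and attribute names are hypothetical; the point is that DP behavior lives in an override rather than branching in the common path):

class CoreClient:
    # Default dp=1 client: exactly one engine core, no balancing logic.
    def __init__(self, engine):
        self.engine = engine

    def add_request(self, request):
        self.engine.submit(request)


class DPCoreClient(CoreClient):
    # DP client: one engine core per DP rank, least-loaded dispatch.
    def __init__(self, engines):
        self.engines = engines

    def add_request(self, request):
        # Pick the engine core with the fewest in-flight requests.
        target = min(self.engines, key=lambda e: e.num_in_flight)
        target.submit(request)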
Follow-on after this PR: