Add CUDA graph-based all reduce launcher #26

WoosukKwon · 2023-04-05T02:45:37Z

Related to #22

This PR uses CUDA graphs to reduce the CPU overhead of launching the NCCL all-reduce operation.
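For readers unfamiliar with the technique, here is a minimal sketch of how a graph-based all-reduce launcher could look. It is not the code from this PR: the class name `GraphAllReduce`, the `step`, `device`, and `use_graph` parameters, and the eager fallback are assumptions made for illustration. The idea is to capture one `torch.distributed.all_reduce` call per token count into a `torch.cuda.CUDAGraph` over a preallocated buffer, then replay the captured graph instead of issuing the collective from Python each time.

```python
import torch
import torch.distributed as dist


class GraphAllReduce:
    """Illustrative sketch only; all names and parameters here are hypothetical."""

    def __init__(self, max_num_tokens, hidden_size, dtype=torch.half,
                 group=None, device="cuda", use_graph=True, step=8):
        self.group = group
        # A CUDA graph replays fixed memory addresses, so every all-reduce
        # goes through this one preallocated buffer.
        self.buffer = torch.empty(
            (max_num_tokens, hidden_size), dtype=dtype, device=device)
        self.graphs = {}
        if use_graph:
            # Assumption: capture one graph per multiple of `step` tokens.
            for n in range(step, max_num_tokens + 1, step):
                self.graphs[n] = self._capture(n)

    def _all_reduce(self, tensor):
        # Without an initialized process group there is a single rank,
        # so the reduction is the identity and nothing needs to run.
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(tensor, group=self.group)

    def _capture(self, num_tokens):
        view = self.buffer[:num_tokens]
        self._all_reduce(view)  # warm-up call outside of capture
        torch.cuda.synchronize()
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            self._all_reduce(view)
        return graph

    def launch(self, x):
        """All-reduce `x` in place and return it."""
        num_tokens = x.shape[0]
        graph = self.graphs.get(num_tokens)
        if graph is None:
            # No captured graph for this size: fall back to an eager call.
            self._all_reduce(x)
            return x
        self.buffer[:num_tokens].copy_(x)
        graph.replay()
        x.copy_(self.buffer[:num_tokens])
        return x
```

In this sketch the `dtype` is a constructor argument rather than a hardcoded `torch.half`. Graph capture presumes CUDA and an initialized NCCL process group; with `use_graph=False` the launcher simply issues eager all-reduce calls.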

zhuohan123

LGTM!

zhuohan123 · 2023-04-05T17:22:53Z

cacheflow/parallel_utils/parallel_state.py

+        self.group = get_tensor_model_parallel_group()
+        self.buffer = torch.empty(
+            size=(max_num_tokens, hidden_size),
+            dtype=torch.half, # FIXME: hardcoded dtype


Add a dtype argument for this class?

WoosukKwon added 4 commits April 5, 2023 01:07

Add -tp and -pp

0f86522

Add graph-based all reduce launcher

8f4c648

max_batch_size -> max_num_batched_tokens

8077445

max_batch_size -> max_num_batched_tokens

1cfdb00

WoosukKwon requested a review from zhuohan123 April 5, 2023 09:31

zhuohan123 approved these changes Apr 5, 2023

Address comments & Code cleaning

d406199

WoosukKwon merged commit 12659a0 into main Apr 5, 2023

WoosukKwon deleted the graph branch April 5, 2023 18:17

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Add CUDA graph-based all reduce launcher (vllm-project#26)

6376304

slyalin pushed a commit to slyalin/vllm that referenced this pull request Apr 4, 2024

Merge pull request vllm-project#26 from ilya-lavrenov/disable-npu

818e384

Disable NPU merged to OV master recently

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024

Merge pull request vllm-project#26 from dtrifiro/bump-deps

255735f

deps: bump fastapi to >= 0.109.1

fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request May 31, 2024

Merge pull request vllm-project#26 from ROCm/cl/updates-pag-shomy

fa75cba

Update max_context_len for custom paged attention.

tianyil1 pushed a commit to tianyil1/vllm that referenced this pull request Jun 5, 2024

Merge pull request vllm-project#26 from HabanaAI/habana_main_rebase_c…

ae3d612

…c466a3 Rebase habana_main up to cc466a3

bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Jun 25, 2024

Merge pull request vllm-project#26 from intel-sandbox/jianan/enable_l…

dddd40f

…inear_fusion_and_prepack Enable linear fusion/prepack and MOE AWQ fusion

alixiaodi mentioned this pull request Aug 2, 2024

[Bug]: #7072

Closed

hao-cold mentioned this pull request May 13, 2025

[Bug]: CUDA error: an illegal instruction was encountered #18045

Closed

1 task

markmc mentioned this pull request May 21, 2025

[Bug][Failing Test]: Distributed Comm Ops - distributed/test_shm_broadcast.py #18492

Closed

1 task

zerosurplus mentioned this pull request Jun 16, 2025

[Bug]: torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to (172.17.0.9, 46229). #19670

Open

1 task

xiaomofang mentioned this pull request Jul 31, 2025

[Bug]: There is an issue with speculative inference in Eagle mode, where the context length of vLLM inference is constrained by the draft model. #21986

Open

1 task
