Support tensor parallel #2

zhuohan123 · 2023-02-28T08:40:38Z

TODOs:

In another PR:

Merge QKV into one.

WoosukKwon

Fantastic! Left minor comments.

BTW, the sampling results were different when using TP:

Current master (python server.py --model facebook/opt-13b)

# GPU blocks: 1826, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would look at it."
Seq 5: 'UC Berkeley is about to get some more tree-hugging support from the University of Washington'
Seq 6: "UC Berkeley is the university of utah\nNot even close\nYeah I'd say it's"
Seq 7: 'The future of cloud computing is React\n\n6 Avril, 2016 | By Maxime Boklan\n\n'

4-way TP (python server.py --model facebook/opt-13b --tensor-parallel-size 4)

# GPU blocks: 4970, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would've been too much"
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the seats, as the school�'
Seq 6: 'UC Berkeley is the university of weed.\n*school of vape\nNot everyone who vapes'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"

8-way TP (python server.py --model facebook/opt-13b --tensor-parallel-size 8)

# GPU blocks: 5464, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would put a limit."
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the stands, as the school is'
Seq 6: 'UC Berkeley is the university of weed.\n*school of anarchy\nAll respect to the academics'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"
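A quick way to see where two runs diverge is to compare the outputs character by character. The sketch below is only an illustration: `first_divergence` is a hypothetical helper, not part of cacheflow, and it is applied to two of the sequences pasted above (master vs. 4-way TP).

```python
def first_divergence(a: str, b: str) -> int:
    """Return the index of the first differing character, or -1 if equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    # One string may be a strict prefix of the other.
    return -1 if len(a) == len(b) else min(len(a), len(b))


# Outputs copied from the runs above (Seq 3 and Seq 4 only, to keep this short).
master = {
    3: "Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The",
    4: "UC Berkeley is a very liberal school, but I don't think they would look at it.",
}
tp4 = {
    3: "Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The",
    4: "UC Berkeley is a very liberal school, but I don't think they would've been too much",
}

for seq_id in sorted(master):
    idx = first_divergence(master[seq_id], tp4[seq_id])
    if idx < 0:
        print(f"Seq {seq_id}: identical")
    else:
        print(f"Seq {seq_id}: shared prefix {master[seq_id][:idx]!r}")
```

In the pasted outputs, Seq 0-3 match across all three configurations, while Seq 4-7 share only a prefix with master.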

cacheflow/utils.py

cacheflow/models/memory_analyzer.py

cacheflow/models/model_utils.py

cacheflow/models/opt.py

server.py

cacheflow/models/opt.py

cacheflow/worker/controller.py

cacheflow/worker/worker.py

zhuohan123 · 2023-03-21T09:36:06Z

@WoosukKwon Thanks again for the review! All comments are resolved. Regarding the different sampling results: I think it's too hard to get identical sampling results across different tensor-parallel configs. Adding more GPUs changes how the model is partitioned and the execution flow on each GPU, which can change the random process here and there. I can't guarantee that the sampling results match, and I don't think it's necessary for them to.
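One plausible contributor to this divergence, offered as an assumption rather than something confirmed in this PR, is floating-point reduction order. Tensor parallelism computes partial sums per shard and then adds them together, and floating-point addition is not associative, so the result can differ in the last bits from the single-GPU sum. The sketch below uses plain Python floats and hypothetical helper names (`dot`, `sharded_dot`), not the actual cacheflow or Megatron code.

```python
def dot(xs, ws):
    """Sequential dot product, as a single device would accumulate it."""
    total = 0.0
    for x, w in zip(xs, ws):
        total += x * w
    return total


def sharded_dot(xs, ws, num_shards):
    """Split the inputs into equal shards, reduce each shard, then sum the partials.

    This mimics (very loosely) per-GPU partial results followed by an all-reduce.
    """
    shard = len(xs) // num_shards
    partials = [
        dot(xs[i * shard:(i + 1) * shard], ws[i * shard:(i + 1) * shard])
        for i in range(num_shards)
    ]
    total = 0.0
    for p in partials:
        total += p
    return total


activations = [0.0, 0.1, 0.2, 0.3]
weights = [1.0, 1.0, 1.0, 1.0]

single = dot(activations, weights)
two_way = sharded_dot(activations, weights, num_shards=2)

print(single, two_way, single == two_way)
```

The two results agree to within rounding error but are not bit-identical. A difference this small in a logit can be enough to change which token a random sampler picks, after which the sequences diverge.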

WoosukKwon

Thanks a lot @zhuohan123 for your huge effort! This is fantastic!

cacheflow/models/model_utils.py

[vllm] v1 tracing with fixes based on vllm-project#20372

epd clean code

Bug vllm-project#1 (CRITICAL): Add missing begin() and stage() methods to KVWriteRouter
- Flash attention backend calls router.begin() and router.stage()
- KVWriteRouter only had write() and commit() methods
- Added begin() to store slot_mapping and initialize shadow buffer
- Added stage() to extract per-timestep slot and stage KV pairs
- Without these, no tokens were being staged → 0% acceptance rate

Bug vllm-project#2 (MODERATE): Fix bonus token counting in accepted_lens
- valid_sampled_token_ids includes [accepted_draft_tokens..., bonus_token]
- Previous: len([bonus]) = 1, incorrectly counted as 1 accepted draft token
- Fixed: Use max(0, len(seq) - 1) to exclude bonus token from count
- Now correctly reports 0 accepted when only bonus token is present

Files modified:
- vllm/v1/kv_cache/write_router.py: Added begin() and stage() methods
- vllm/v1/worker/gpu_model_runner.py: Fixed accepted_lens calculation
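The bonus-token fix described in that commit message can be sketched as follows. This is a minimal illustration based only on the message text: the function name `accepted_draft_lens` is hypothetical, and the real code in `gpu_model_runner.py` may be structured differently.

```python
def accepted_draft_lens(valid_sampled_token_ids):
    """Count accepted draft tokens per sequence.

    Each inner list is assumed to be [accepted_draft_tokens..., bonus_token],
    so the last element is excluded from the count. max(0, ...) guards
    against an empty list.
    """
    return [max(0, len(seq) - 1) for seq in valid_sampled_token_ids]


# Only a bonus token, two accepted drafts plus a bonus token, and an empty list.
print(accepted_draft_lens([[7], [1, 2, 3], []]))
```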

Bug vllm-project#1: EAGLE tree proposal returned zeros for draft_logprobs
- Root cause: When using topk for tree branching, code set draft_logp_list=None, then created zeros tensor as fallback (lines 850-851)
- Fix: Compute actual log-probs from logits using log_softmax + gather
- Applied at 2 locations: root level (lines 698-704) and tree levels (lines 839-846)

Bug vllm-project#2: Added diagnostic logging in rejection sampler
- Log draft_p (nonzero) min/med/max to detect zeros
- Log p_target min/med/max to detect degenerate softmax
- Helps identify if target logits are masked/filtered before sampling

Expected results after fix:
- draft_logp: -3.2/-1.6/-0.0 (real log-probs, all ≤ 0) instead of 0/0/0
- p_target: 1e-6/1e-3/0.7 (realistic distribution) instead of 1/1/1
- Acceptance rate: 30-70% instead of 0%

Files changed:
- vllm/v1/spec_decode/eagle.py: Fix draft_logp computation
- vllm/v1/sample/rejection_sampler.py: Add sanity logging
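For readers unfamiliar with the "log_softmax + gather" step named in that commit message, here is a dependency-free sketch of the idea. It uses plain Python lists instead of tensors, and the helper names are hypothetical; it does not reproduce the code in `eagle.py`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def gather_logprobs(logits_rows, token_ids):
    """Pick the log-probability of the chosen token in each row."""
    return [log_softmax(row)[tok] for row, tok in zip(logits_rows, token_ids)]


rows = [[2.0, 1.0, 0.1], [0.0, 0.0]]
chosen = [0, 1]
draft_logp = gather_logprobs(rows, chosen)
print(draft_logp)
```

Log-probabilities computed this way are always less than or equal to zero, which matches the "all ≤ 0" sanity check listed under the expected results.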

…thing diagnostics

Bug vllm-project#4 fix: Change nucleus top_p fallback from 1.0 to 0.95, add [NUCLEUS_DEBUG] diagnostic logging. This ensures nucleus runs even if config attribute is missing, preventing 32000 survivors (full vocab).

Bug vllm-project#5 fix: Add [SMOOTH_DEBUG] diagnostic logging for smoothing lambda.

These fixes were accidentally removed during the bug vllm-project#2 draft-anchored rewrite (commit 595a371). Restoring them does not affect bug vllm-project#2's core algorithm - they only improve fallback behavior and diagnostics.

…r to 2.0

ROOT CAUSE: draft_q_soft_temp=0.50 was SHARPENING the distribution instead of softening it (dividing by tau<1.0 doubles logit magnitudes). This caused nucleus to collapse to 1-2 survivors → q≈1.0 → acceptance stuck at ~0.7038 (average p_target).

FIXES:

1. Config defaults (config.py, arg_utils.py):
   - draft_q_temp_offset: 0.15 → 0.25 (better dynamic range)
   - draft_q_soft_temp: 0.50 → 2.0 (SOFTENS instead of sharpens)
   At draft_temp=0.05:
   - Before: tau_q = max(0.05+0.15, 0.50) = 0.50 (2x sharper!)
   - After: tau_q = max(0.05+0.25, 2.0) = 2.0 (2x softer)
2. Force min_keep=2 in nucleus (eagle.py line 271):
   - Added keep_sorted[..., :2] = True
   - Prevents survivors=1 by construction (defensive programming)
3. Fix smoothing to uniform over kept set (eagle.py lines 275-287):
   - Before: Mixed with untempered baseline (wrong approach)
   - After: Uniform distribution over survivors only (correct)
   - Prevents q from reaching exactly 1.0 in corner cases
4. Remove dead code (eagle.py line 322):
   - Deleted unused self._current_sampling_metadata assignment
   - No longer needed with draft-anchored approach (bug vllm-project#2 fix)

Expected results:
- tau_q ≥ 2.0 at ultracold temps → softer distribution
- NUC_DEBUG: survivors = hundreds/thousands (not 1-2)
- Q_DEBUG: q ∈ [0.5, 0.8] (not 0.98-1.0)
- Accept rate: dynamic range restored across temp sweep
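The temperature arithmetic in that commit message can be checked with a small sketch. The `tau_q` formula is taken from the message itself; `softmax_with_temperature` and the example logits are assumptions added purely to illustrate why dividing by a temperature below 1.0 sharpens the distribution and a temperature above 1.0 softens it.

```python
import math


def tau_q(draft_temp, temp_offset, soft_temp):
    """Effective temperature as stated in the message: max(draft_temp + offset, soft_temp)."""
    return max(draft_temp + temp_offset, soft_temp)


def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau (numerically stable)."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative values only

before = tau_q(0.05, 0.15, 0.50)  # old defaults
after = tau_q(0.05, 0.25, 2.0)    # new defaults

for label, tau in [("before", before), ("untempered", 1.0), ("after", after)]:
    top = max(softmax_with_temperature(logits, tau))
    print(f"{label:>10}: tau={tau:.2f}, top probability={top:.3f}")
```

With the old defaults the top token gets more probability mass than in the untempered distribution, and with the new defaults it gets less, which is consistent with the sharpening versus softening behavior the message describes.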

zhuohan123 added 9 commits February 28, 2023 01:30

copy code from fairseq

e8d661c

remove files from fairscale

827f85f

copy files from megatron

76ed019

[WIP] add distributed init

55e5d86

Parallelize the Transformer layers

7100db2

Load weight on a single GPU

1e86393

support multi-gpu tensor parallelism

90970e1

support tensor parallelism on multiple gpus

88960f7

fix correctness

900eace

zhuohan123 changed the title ~~[WIP] Support tensor parallel~~ Support tensor parallel Mar 9, 2023

zhuohan123 added 6 commits March 17, 2023 14:05

Merge branch 'main' into tensor_parallel

6a6f7cc

fix merging errors

d5a70ab

add filelock

60bf11e

support parallel decoding

a7be5b8

update readme

538d067

remove unused files

893d4b3

zhuohan123 requested a review from WoosukKwon March 19, 2023 02:51

fix loading for large models

e0f9f48

WoosukKwon reviewed Mar 21, 2023

zhuohan123 added 5 commits March 21, 2023 03:24

Fix some smaller issues raised by Woosuk first.

6ef5111

Fix more review issues

6727083

remove duplicate set_seed

ddc1ab0

Support the case where embedding_size != hidden_size

1d532c5

Resolve comments on weight loading and device id comments.

64e3950

WoosukKwon approved these changes Mar 21, 2023

cacheflow/models/model_utils.py (resolved)

WoosukKwon merged commit 2f49f15 into main Mar 21, 2023

zhuohan123 deleted the tensor_parallel branch June 18, 2023 07:22

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

Danielkinz mentioned this pull request Aug 15, 2023

[Feature | CI] Added a github action to build wheels #746

Merged

markmc mentioned this pull request May 21, 2025

[Bug][Failing Test]: Distributed Comm Ops - distributed/test_shm_broadcast.py #18492

Closed

1 task

yarongmu-google mentioned this pull request Jun 11, 2025

[RFC]: A Strategic Framework for Extensibility and Innovation in vLLM #19376

Open

1 task

cyc00518 mentioned this pull request Jun 12, 2025

[Bug] Mistral Tool-Call via Jinja Template: Missing parallel_tool_prompt Injection and Incorrect tool_response Handling #19545

Closed

1 task

zerosurplus mentioned this pull request Jun 16, 2025

[Bug]: torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to (172.17.0.9, 46229). #19670

Open

1 task

xiaocode337317439 mentioned this pull request Jun 27, 2025

[Bug]:RuntimeError: CUDA error: an illegal memory access was encountered #20170

Open

1 task

Chris113113 mentioned this pull request Jul 10, 2025

[Bug]: [V1][gpu_model_runner.py] CUDA memory error #19415

Open

1 task

shrijayan mentioned this pull request Jul 12, 2025

vLLM hangs after 10 minutes without any error message #1492

Closed

aarondou mentioned this pull request Jul 16, 2025

[RFC]: Neuron Support for V1 Engine #21082

Closed

1 task

tyxiong23 mentioned this pull request Jul 30, 2025

[Bug]: GLM-4.1V-Thinking ValueError #21811

Closed

1 task

sfeng33 mentioned this pull request Jul 30, 2025

[Feature]: Add support for multi-lora and single lora for classification tasks #19623

Open

1 task

xiaomofang mentioned this pull request Jul 31, 2025

[Bug]: There is an issue with speculative inference in Eagle mode, where the context length of vLLM inference is constrained by the draft model. #21986

Open

1 task

devops724 mentioned this pull request Aug 3, 2025

[Bug]: vLLM engine crashes then restarts and loads the model on sleep if a chat request is made #15483

Open

1 task

fernandaspets mentioned this pull request Aug 8, 2025

[Bug]: --tensor-parallel-size 2 seems broken for Blackwell 6000 pro since version 10 #22479

Open

AlpinDale added a commit to AlpinDale/vllm that referenced this pull request Aug 11, 2025

address gemini comments vllm-project#2

d64560a

crischeng mentioned this pull request Aug 12, 2025

[Bug]: CUDA error during nsys profile : unspecified launch failure #22746

Closed

1 task

bbartels pushed a commit to bbartels/vllm that referenced this pull request Aug 14, 2025

Merge pull request vllm-project#2 from RichardoMrMu/main-ttft-fix

a30adc7

[vllm] v1 tracing with fixes based on vllm-project#20372

JeffreyWong20 mentioned this pull request Aug 19, 2025

[Bug]: [TPU] profiling_tpu/profiling.py example crashed when runs on vllm_tpu docker #23194

Closed

1 task

ruisearch42 mentioned this pull request Aug 22, 2025

[Bug]: VLLM_ALL2ALL_BACKEND=naive hangs/crashes on multi nodes when serving DeepSeekV3 #23448

Open

1 task

Tar-ive mentioned this pull request Aug 24, 2025

feat: Add TPU v6e architecture-adaptive attention backend #23507

Open

16 tasks

shaamil101-etched mentioned this pull request Aug 25, 2025

[Bug]: vLLM server timeout due to multiprocessing communication error #23582

Open

1 task

ZJY0516 mentioned this pull request Aug 31, 2025

[Bug]: CUDA error when serving MiniCPM-V model #23954

Closed

wyn1015 mentioned this pull request Sep 19, 2025

[Bug]: assortment of warnings / errors coming out of vllm basic python inference script #18634

Open

1 task

RobotSail mentioned this pull request Sep 21, 2025

[Bug]: vLLM chat breaks during multi-turn chat #25108

Open

1 task

Bounty-hunter pushed a commit to Bounty-hunter/vllm that referenced this pull request Sep 23, 2025

Merge pull request vllm-project#2 from wuhang2014/why_epd_v_0_9_1

e862b3c

epd clean code

zhanghb55 mentioned this pull request Sep 25, 2025

[Bug]: Pipeline parallel (pp>1) crashes with CUDA illegal memory access #25650

Open

1 task
