Member

@zhuohan123 zhuohan123 commented Feb 28, 2023

TODOs:

  • Parallel embedding and softmax.
  • Merge with the main branch.
  • Modify README.
  • Remove unused code.
  • Fix the bug that downloads the weights twice.
  • Test with larger models.

In another PR:

  • Merge QKV into one.

@zhuohan123 changed the title from "[WIP] Support tensor parallel" to "Support tensor parallel" on Mar 9, 2023
@zhuohan123 requested a review from WoosukKwon on March 19, 2023 02:51
Collaborator

@WoosukKwon WoosukKwon left a comment

Fantastic! Left minor comments.

BTW, the sampling results were different when using TP:

  • Current master (python server.py --model facebook/opt-13b)
# GPU blocks: 1826, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would look at it."
Seq 5: 'UC Berkeley is about to get some more tree-hugging support from the University of Washington'
Seq 6: "UC Berkeley is the university of utah\nNot even close\nYeah I'd say it's"
Seq 7: 'The future of cloud computing is React\n\n6 Avril, 2016 | By Maxime Boklan\n\n'
  • 4-way TP (python server.py --model facebook/opt-13b --tensor-parallel-size 4)
# GPU blocks: 4970, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would've been too much"
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the seats, as the school�'
Seq 6: 'UC Berkeley is the university of weed.\n*school of vape\nNot everyone who vapes'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"
  • 8-way TP (python server.py --model facebook/opt-13b --tensor-parallel-size 8)
# GPU blocks: 5464, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would put a limit."
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the stands, as the school is'
Seq 6: 'UC Berkeley is the university of weed.\n*school of anarchy\nAll respect to the academics'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"

@zhuohan123
Member Author

@WoosukKwon Thanks again for the review! All comments are resolved. Regarding the different sampling results, I think it's too hard to get identical results across different tensor parallel configs. Adding more GPUs changes how the model is partitioned and the execution flow on each GPU, which can change the random process here and there. I can't guarantee identical sampling results across configs, and I don't think it's necessary.
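
One plausible contributor, offered as an illustration rather than a confirmed diagnosis for this PR: tensor parallelism splits each reduction across GPUs, and floating-point addition is not associative, so the same dot product summed in a different grouping can differ in its last bits. Differences that small can still shift logits enough to flip a sampled token. The sketch below uses plain Python floats; `sharded_dot` and its contiguous sharding scheme are hypothetical and are not vLLM code.

```python
import math

def sharded_dot(x, w, num_shards):
    """Dot product computed as per-shard partial sums, then a final sum.

    Loosely mimics a row-parallel matmul followed by an all-reduce.
    Explicit loops keep the accumulation order fully visible.
    """
    n = len(x)
    partials = []
    for shard in range(num_shards):
        # Contiguous slice owned by this (hypothetical) shard.
        start = shard * n // num_shards
        end = (shard + 1) * n // num_shards
        acc = 0.0
        for i in range(start, end):
            acc += x[i] * w[i]
        partials.append(acc)
    # "All-reduce": add the per-shard partial sums together.
    total = 0.0
    for p in partials:
        total += p
    return total

x = [0.1, 0.2, 0.3]
w = [1.0, 1.0, 1.0]
single = sharded_dot(x, w, 1)    # grouped as (0.1 + 0.2) + 0.3
two_way = sharded_dot(x, w, 2)   # grouped as 0.1 + (0.2 + 0.3)
print(single == two_way)              # False: the last bits differ
print(math.isclose(single, two_way))  # True: numerically equivalent
```

The two results agree to within rounding, which is why the greedy-looking sequences (Seq 0 to 3) match across configs while the sampled ones can diverge.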

Collaborator

@WoosukKwon WoosukKwon left a comment

Thanks a lot @zhuohan123 for your huge effort! This is fantastic!

AlpinDale added a commit to AlpinDale/vllm that referenced this pull request Aug 11, 2025
bbartels pushed a commit to bbartels/vllm that referenced this pull request Aug 14, 2025
Bounty-hunter pushed a commit to Bounty-hunter/vllm that referenced this pull request Sep 23, 2025
yuz207 added a commit to IluvatarLabs/vllm that referenced this pull request Sep 27, 2025
Bug #1 (CRITICAL): Add missing begin() and stage() methods to KVWriteRouter
- Flash attention backend calls router.begin() and router.stage()
- KVWriteRouter only had write() and commit() methods
- Added begin() to store slot_mapping and initialize shadow buffer
- Added stage() to extract per-timestep slot and stage KV pairs
- Without these, no tokens were being staged → 0% acceptance rate
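
To make the begin/stage flow described above concrete, here is a toy sketch. It is not the real `KVWriteRouter`: the class name, the dict-based buffers, and the `commit(num_accepted)` signature are all assumptions made for illustration.

```python
class ToyKVWriteRouter:
    """Hypothetical stand-in for the router described in the commit message."""

    def __init__(self):
        self.slot_mapping = []
        self.shadow = {}  # staged (key, value) pairs, keyed by cache slot
        self.cache = {}   # committed (key, value) pairs, keyed by cache slot

    def begin(self, slot_mapping):
        # Store the slot mapping and start from an empty shadow buffer.
        self.slot_mapping = list(slot_mapping)
        self.shadow = {}

    def stage(self, timestep, key, value):
        # Look up this timestep's slot and stage the KV pair for it.
        slot = self.slot_mapping[timestep]
        self.shadow[slot] = (key, value)

    def commit(self, num_accepted):
        # Assumed semantics: only the first `num_accepted` staged slots
        # reach the cache; everything else is discarded.
        for slot in self.slot_mapping[:num_accepted]:
            if slot in self.shadow:
                self.cache[slot] = self.shadow[slot]
        self.shadow = {}

router = ToyKVWriteRouter()
router.begin([7, 8, 9])
router.stage(0, "k0", "v0")
router.stage(1, "k1", "v1")
router.commit(num_accepted=1)
print(router.cache)  # {7: ('k0', 'v0')}
```

In this toy model, skipping `begin()` and `stage()` leaves the shadow buffer empty, so `commit()` has nothing to write. That mirrors the "no tokens were being staged" symptom.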

Bug #2 (MODERATE): Fix bonus token counting in accepted_lens
- valid_sampled_token_ids includes [accepted_draft_tokens..., bonus_token]
- Previous: len([bonus]) = 1, incorrectly counted as 1 accepted draft token
- Fixed: Use max(0, len(seq) - 1) to exclude bonus token from count
- Now correctly reports 0 accepted when only bonus token is present
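
The counting rule above can be sketched as follows. The function name `accepted_draft_lens` is hypothetical; only the `max(0, len(seq) - 1)` expression comes from the commit message.

```python
def accepted_draft_lens(valid_sampled_token_ids):
    """Count accepted draft tokens per sequence.

    Each sequence is assumed to hold [accepted_draft_tokens..., bonus_token],
    so the trailing bonus token is excluded. max(0, ...) guards empty lists.
    """
    return [max(0, len(seq) - 1) for seq in valid_sampled_token_ids]

batch = [
    [11, 12, 13, 99],  # three accepted draft tokens plus the bonus token
    [99],              # bonus token only: zero accepted drafts
    [],                # nothing sampled
]
print(accepted_draft_lens(batch))  # [3, 0, 0]
```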

Files modified:
- vllm/v1/kv_cache/write_router.py: Added begin() and stage() methods
- vllm/v1/worker/gpu_model_runner.py: Fixed accepted_lens calculation
yuz207 added a commit to IluvatarLabs/vllm that referenced this pull request Sep 27, 2025
Bug #1: EAGLE tree proposal returned zeros for draft_logprobs
- Root cause: When using topk for tree branching, code set draft_logp_list=None,
  then created zeros tensor as fallback (lines 850-851)
- Fix: Compute actual log-probs from logits using log_softmax + gather
- Applied at 2 locations: root level (lines 698-704) and tree levels (lines 839-846)
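
As a rough illustration of the "log_softmax + gather" fix, the sketch below computes per-token log-probabilities in plain Python. The helper names are hypothetical. In PyTorch the same idea is usually written with `torch.log_softmax` followed by `Tensor.gather`, but the exact code in eagle.py is not shown in this thread.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def gather_logprobs(logits_rows, token_ids):
    """Pick, for each row, the log-probability of the chosen token id."""
    return [log_softmax(row)[tok] for row, tok in zip(logits_rows, token_ids)]

logits_rows = [[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
chosen = [0, 2]
draft_logp = gather_logprobs(logits_rows, chosen)
# Real log-probs are <= 0 and are only zero for a probability-1 token.
print(all(lp < 0.0 for lp in draft_logp))  # True
```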

Bug #2: Added diagnostic logging in rejection sampler
- Log draft_p (nonzero) min/med/max to detect zeros
- Log p_target min/med/max to detect degenerate softmax
- Helps identify if target logits are masked/filtered before sampling
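
A minimal sketch of the min/med/max summary described above, assuming the probabilities are available as a flat Python list. The helper name `min_med_max` and its `nonzero_only` flag are hypothetical.

```python
import statistics

def min_med_max(probs, nonzero_only=False):
    """Return (min, median, max) of probs, optionally ignoring exact zeros.

    Returns None when no values remain, which itself signals a problem
    (for example, every draft probability was zero).
    """
    vals = [p for p in probs if p != 0.0] if nonzero_only else list(probs)
    if not vals:
        return None
    return (min(vals), statistics.median(vals), max(vals))

draft_p = [0.0, 0.2, 0.5, 0.3]
print(min_med_max(draft_p, nonzero_only=True))  # (0.2, 0.3, 0.5)
print(min_med_max([1.0, 1.0, 1.0]))             # (1.0, 1.0, 1.0)
```

The second call shows the degenerate 1/1/1 pattern.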

Expected results after fix:
- draft_logp: -3.2/-1.6/-0.0 (real log-probs, all ≤ 0) instead of 0/0/0
- p_target: 1e-6/1e-3/0.7 (realistic distribution) instead of 1/1/1
- Acceptance rate: 30-70% instead of 0%

Files changed:
- vllm/v1/spec_decode/eagle.py: Fix draft_logp computation
- vllm/v1/sample/rejection_sampler.py: Add sanity logging
yuz207 added a commit to IluvatarLabs/vllm that referenced this pull request Sep 30, 2025
…thing diagnostics

Bug #4 fix: Change nucleus top_p fallback from 1.0 to 0.95, add
[NUCLEUS_DEBUG] diagnostic logging. This ensures nucleus runs even if
config attribute is missing, preventing 32000 survivors (full vocab).
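
The fallback behavior can be sketched with `getattr`. The attribute name `draft_top_p` and the helper `resolve_top_p` are hypothetical; the commit message does not show the real config field.

```python
from types import SimpleNamespace

def resolve_top_p(config, default=0.95):
    """Read top_p from the config, falling back to `default` if it is absent.

    With a fallback of 1.0 the nucleus filter keeps the full vocabulary,
    so a value below 1.0 is needed for the filter to prune anything.
    """
    return getattr(config, "draft_top_p", default)

print(resolve_top_p(SimpleNamespace()))                 # 0.95
print(resolve_top_p(SimpleNamespace(draft_top_p=0.8)))  # 0.8
```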

Bug #5 fix: Add [SMOOTH_DEBUG] diagnostic logging for smoothing lambda.

These fixes were accidentally removed during the bug #2 draft-anchored
rewrite (commit 595a371). Restoring them does not affect bug #2's
core algorithm - they only improve fallback behavior and diagnostics.
yuz207 added a commit to IluvatarLabs/vllm that referenced this pull request Sep 30, 2025
…r to 2.0

ROOT CAUSE: draft_q_soft_temp=0.50 was SHARPENING the distribution
instead of softening it (dividing by tau<1.0 doubles logit magnitudes).
This caused nucleus to collapse to 1-2 survivors → q≈1.0 → acceptance
stuck at ~0.7038 (average p_target).
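
To see why a temperature below 1.0 sharpens rather than softens, here is a small sketch using the commit message's rule tau_q = max(draft_temp + offset, soft_temp). The function names and the example logits are illustrative assumptions.

```python
import math

def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau. tau < 1 sharpens, tau > 1 flattens."""
    scaled = [v / tau for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def tau_q(draft_temp, offset, soft_temp):
    """Effective temperature: max(draft_temp + offset, soft_temp)."""
    return max(draft_temp + offset, soft_temp)

logits = [2.0, 1.0, 0.0]
baseline = softmax_with_temperature(logits, 1.0)
before = softmax_with_temperature(logits, tau_q(0.05, 0.15, 0.50))  # tau = 0.5
after = softmax_with_temperature(logits, tau_q(0.05, 0.25, 2.0))    # tau = 2.0

# The top token's probability grows under tau = 0.5 and shrinks under tau = 2.0.
print(max(before) > max(baseline) > max(after))  # True
```

With the top token's mass pushed toward 1.0, a top-p filter needs very few tokens to reach its threshold, which is consistent with the 1-2 survivors reported above.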

FIXES:

1. Config defaults (config.py, arg_utils.py):
   - draft_q_temp_offset: 0.15 → 0.25 (better dynamic range)
   - draft_q_soft_temp: 0.50 → 2.0 (SOFTENS instead of sharpens)

   At draft_temp=0.05:
   - Before: tau_q = max(0.05+0.15, 0.50) = 0.50 (2x sharper!)
   - After:  tau_q = max(0.05+0.25, 2.0)  = 2.0  (2x softer)

2. Force min_keep=2 in nucleus (eagle.py line 271):
   - Added keep_sorted[..., :2] = True
   - Prevents survivors=1 by construction (defensive programming)

3. Fix smoothing to uniform over kept set (eagle.py lines 275-287):
   - Before: Mixed with untempered baseline (wrong approach)
   - After:  Uniform distribution over survivors only (correct)
   - Prevents q from reaching exactly 1.0 in corner cases

4. Remove dead code (eagle.py line 322):
   - Deleted unused self._current_sampling_metadata assignment
   - No longer needed with draft-anchored approach (bug #2 fix)
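
Fixes 2 and 3 can be sketched together as below. This is one interpretation, not the code in eagle.py: the function names, the list-based representation, and the mixing weight `lam` are assumptions. The smoothing here blends the renormalized kept probabilities with a uniform distribution over the survivors.

```python
def nucleus_keep_mask(probs, top_p, min_keep=2):
    """Top-p mask that always keeps at least `min_keep` top-ranked tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = [False] * len(probs)
    cumulative = 0.0
    for rank, idx in enumerate(order):
        # Keep while the mass accumulated so far is below top_p,
        # or while we are still within the forced minimum.
        if cumulative < top_p or rank < min_keep:
            keep[idx] = True
        cumulative += probs[idx]
    return keep

def smooth_over_kept(probs, keep, lam):
    """Mix renormalized kept probs with a uniform over the kept set only."""
    kept_mass = sum(p for p, k in zip(probs, keep) if k)
    num_kept = sum(keep)
    return [
        (1.0 - lam) * (p / kept_mass) + lam / num_kept if k else 0.0
        for p, k in zip(probs, keep)
    ]

probs = [0.97, 0.02, 0.01]
keep = nucleus_keep_mask(probs, top_p=0.9, min_keep=2)
q = smooth_over_kept(probs, keep, lam=0.1)
print(keep)          # [True, True, False]
print(max(q) < 1.0)  # True: two survivors keep q away from exactly 1.0
```

With `min_keep=1` the same input leaves a single survivor, and any mix with a uniform over that one token gives q = 1.0, which is the corner case the forced minimum avoids.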

Expected results:
- tau_q ≥ 2.0 at ultracold temps → softer distribution
- NUC_DEBUG: survivors = hundreds/thousands (not 1-2)
- Q_DEBUG: q ∈ [0.5, 0.8] (not 0.98-1.0)
- Accept rate: dynamic range restored across temp sweep

2 participants