
Conversation

Collaborator

@wili-65535 wili-65535 commented Jun 27, 2025

Description

Improve iteration statistics to enable more detailed performance tests for NGram speculative decoding.

We will add an experimental branch of NGram in overlap-scheduler mode once this PR is merged, and run performance tests to decide whether to enable the overlap scheduler for NGram.

Previous PR: PR4569.

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
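For example, a typical invocation that disables fail-fast and restricts testing to a single GPU type combines the flags documented above:

/bot run --disable-fail-fast --gpu-type "A30"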

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

Summary by CodeRabbit

  • New Features

    • Added tracking and reporting of iteration statistics during draft token preparation, including latency and token counts, for improved transparency in speculative decoding.
  • Refactor

    • Refactored draft token preparation logic to improve maintainability and enable statistical tracking.
  • Bug Fixes

    • Improved handling of cases where no draft tokens are generated, ensuring accurate length reporting.

@wili-65535 wili-65535 force-pushed the user/wili/ngram-iter_state branch from 886ffaf to 9834dc8 Compare June 28, 2025 02:47
@wili-65535 wili-65535 marked this pull request as ready for review June 29, 2025 13:27
@wili-65535 wili-65535 requested review from a team as code owners June 29, 2025 13:27
@wili-65535 wili-65535 requested review from lfr-0531 and QiJune June 29, 2025 13:27
@wili-65535 wili-65535 changed the title [DON'T MERGE] NGram with iter_stats [TRTLLM-5846] NGram with iter_stats Jun 29, 2025
@NVIDIA NVIDIA deleted a comment from tensorrt-cicd Jun 29, 2025
@NVIDIA NVIDIA deleted a comment from tensorrt-cicd Jun 29, 2025
@wili-65535 wili-65535 force-pushed the user/wili/ngram-iter_state branch 2 times, most recently from 9066237 to 872b390 Compare July 1, 2025 05:58
for request in scheduled_requests.generation_requests:
    num_draft_tokens = 0 if request.py_last_draft_tokens is None else len(
        request.py_last_draft_tokens)
    num_accepted_tokens = request.py_num_accepted_draft_tokens
Collaborator

nit: this access can be done within the if block below

Collaborator Author

Done! We simplified it as shown here.
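A minimal sketch of the simplified form (the enclosing if block is inferred from the later snippets in this thread, not quoted directly):

for request in scheduled_requests.generation_requests:
    if request.py_last_draft_tokens is not None:
        # Touch the per-request counters only when the request actually carried drafts.
        num_draft_tokens = len(request.py_last_draft_tokens)
        num_accepted_tokens = request.py_num_accepted_draft_tokens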

@wili-65535 wili-65535 force-pushed the user/wili/ngram-iter_state branch 2 times, most recently from 11bd865 to 95fa535 Compare July 4, 2025 03:48
@NVIDIA NVIDIA deleted a comment from tensorrt-cicd Jul 4, 2025
@wili-65535
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #10918 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #10918 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8070 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

num_requests_with_draft_tokens = 0
for request in scheduled_requests.generation_requests:
    if request.py_last_draft_tokens is not None:
        total_num_draft_tokens += len(request.py_last_draft_tokens)
Collaborator

Does py_last_draft_tokens include padding? Should we filter it out and count only the "real" drafts?

Collaborator Author

You're right, we should exclude the padding length here.
Now we record the length of the real draft tokens on the request and add it to total_num_draft_tokens here.
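Roughly, the fix amounts to recording the unpadded length when the drafts are attached to the request and summing that instead. A sketch, assuming py_draft_length is set where the drafts are built:

# When attaching drafts to the request (before any padding is applied):
request.py_draft_tokens = draft_tokens
request.py_draft_length = len(draft_tokens)  # real drafts only, padding excluded

# When accumulating iteration statistics:
if request.py_last_draft_tokens is not None:
    total_num_draft_tokens += request.py_draft_length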

Collaborator

@SimengLiu-nv SimengLiu-nv left a comment

Left comments.

@wili-65535 wili-65535 force-pushed the user/wili/ngram-iter_state branch 2 times, most recently from e799556 to e92e16b Compare July 8, 2025 11:26
num_requests_with_draft_tokens = 0
for request in scheduled_requests.generation_requests:
    if request.py_last_draft_tokens is not None:
        total_num_draft_tokens += request.draft_length
Collaborator

Should it be request.py_draft_length?

Collaborator Author

Yes, fixed.
Thank you very much!

@wili-65535 wili-65535 force-pushed the user/wili/ngram-iter_state branch 2 times, most recently from ad106cc to db3ac00 Compare July 11, 2025 00:07
@wili-65535
Collaborator Author

/bot run --disable-fail-fast

@NVIDIA NVIDIA deleted a comment from tensorrt-cicd Jul 11, 2025
@tensorrt-cicd
Collaborator

PR_Github #11578 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11578 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8573 completed with status: 'SUCCESS'

@wili-65535 wili-65535 force-pushed the user/wili/ngram-iter_state branch from db3ac00 to 3866ad4 Compare July 24, 2025 03:09
Contributor

coderabbitai bot commented Jul 24, 2025

📝 Walkthrough

Walkthrough

The changes update the prepare_draft_tokens method in the drafter abstraction and its NGramDrafter and ModelDrafter implementations to accept an optional IterationStats argument. The method now tracks and records iteration-level statistics, such as timing and token counts, during draft token preparation. Corresponding call sites and method signatures are updated.

Changes

File(s) and change summary:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py: Updated the call to prepare_draft_tokens to pass iter_stats as an additional argument.
  • tensorrt_llm/_torch/speculative/drafter.py: Modified Drafter.prepare_draft_tokens signature to accept iter_stats; added import for IterationStats.
  • tensorrt_llm/_torch/speculative/ngram.py: Refactored NGramDrafter.prepare_draft_tokens to accept iter_stats, track timing, and update iteration stats. Added private methods for draft token preparation and stats update.
  • tensorrt_llm/_torch/speculative/model_drafter.py: Modified ModelDrafter.prepare_draft_tokens signature to accept iter_stats; added import for IterationStats.
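A condensed sketch of the resulting NGramDrafter entry point (signature and helper split follow the summary above; everything else is a simplification, not the exact implementation):

import time
from typing import Optional

class NGramDrafter(Drafter):

    def prepare_draft_tokens(
        self,
        scheduled_requests: ScheduledRequests,
        resource_manager=None,
        iter_stats: Optional[IterationStats] = None,
    ) -> None:
        start_time = time.time()
        # Core NGram pool matching, unchanged in behavior.
        self._prepare_draft_tokens(scheduled_requests)
        if iter_stats is not None:
            # Record spec-dec latency and draft/accepted token counts for this iteration.
            self._update_ngram_iter_stats(scheduled_requests, iter_stats, start_time)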

Sequence Diagram(s)

sequenceDiagram
    participant Executor as PyExecutor
    participant Drafter as Drafter/NGramDrafter/ModelDrafter
    participant Stats as IterationStats

    Executor->>Drafter: prepare_draft_tokens(scheduled_batch, resource_manager, iter_stats)
    alt NGramDrafter
        Drafter->>Drafter: _prepare_draft_tokens(scheduled_batch)
        Drafter->>Drafter: _update_ngram_iter_stats(scheduled_batch, iter_stats, start_time)
        Drafter->>Stats: Update specdec_stats (latency, token counts)
    else ModelDrafter or base Drafter
        Drafter->>Drafter: prepare_draft_tokens(scheduled_batch, resource_manager, iter_stats)
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Suggested reviewers

  • nv-guomingz
  • HuiGao-NV
  • pcastonguay


@coderabbitai coderabbitai bot added the Community want to contribute PRs initiated from Community label Jul 24, 2025
fix by Joyjit suggestions

fix by Simeng suggestions

Signed-off-by: wili-65535 <[email protected]>
Signed-off-by: wili-65535 <[email protected]>
@wili-65535 wili-65535 force-pushed the user/wili/ngram-iter_state branch from 3866ad4 to d43d144 Compare July 24, 2025 06:43
@coderabbitai coderabbitai bot requested a review from pcastonguay July 24, 2025 06:44
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3866ad4 and d43d144.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
  • tensorrt_llm/_torch/speculative/drafter.py (2 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (2 hunks)
  • tensorrt_llm/_torch/speculative/ngram.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tensorrt_llm/_torch/speculative/model_drafter.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/speculative/drafter.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/speculative/ngram.py (2)

Learnt from: yiqingy0
PR: #5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/speculative/ngram.py

229-229: LlmRequest may be undefined, or defined from star imports

(F405)

🔇 Additional comments (3)
tensorrt_llm/_torch/speculative/ngram.py (3)

1-1: LGTM: Import additions support new iteration statistics functionality.

The added imports (time, Dict, Tuple, IterationStats) are all necessary for the new timing measurements and statistics collection features.

Also applies to: 3-3, 7-7


176-191: LGTM: Clean implementation of optional iteration statistics tracking.

The method signature change maintains backward compatibility with the optional parameter, and the timing wrapper pattern cleanly separates concerns between core functionality and statistics collection.


192-223: LGTM: Refactored method correctly preserves functionality and adds draft length tracking.

The extraction of the core logic into a private method is clean, and the addition of py_draft_length tracking properly addresses the need to distinguish actual draft tokens from padding for accurate statistics collection.

Comment on lines +224 to +260
    def _update_ngram_iter_stats(
        self,
        scheduled_requests: ScheduledRequests,
        iter_stats: IterationStats,
        start_time: float,
    ) -> Tuple[ScheduledRequests, Dict[int, LlmRequest]]:
        """
        Get statistic information from the draft tokens in NGram drafter
        """
        now_time = time.time()

        total_num_draft_tokens = 0
        total_num_accepted_tokens = 0
        num_requests_with_draft_tokens = 0
        for request in scheduled_requests.generation_requests:
            if request.py_last_draft_tokens is not None:
                total_num_draft_tokens += request.py_draft_length
                total_num_accepted_tokens += request.py_num_accepted_draft_tokens
                num_requests_with_draft_tokens += 1

        if num_requests_with_draft_tokens > 0:
            iter_stats.specdec_stats.iter_latency_ms = (now_time -
                                                        start_time) * 1e3
            iter_stats.specdec_stats.num_draft_tokens = total_num_draft_tokens
            iter_stats.specdec_stats.num_accepted_tokens = total_num_accepted_tokens
            iter_stats.specdec_stats.num_requests_with_draft_tokens = num_requests_with_draft_tokens
            iter_stats.specdec_stats.acceptance_length = (
                total_num_accepted_tokens +
                num_requests_with_draft_tokens) / num_requests_with_draft_tokens
        else:
            iter_stats.specdec_stats.iter_latency_ms = 0.0
            iter_stats.specdec_stats.num_draft_tokens = 0
            iter_stats.specdec_stats.num_accepted_tokens = 0
            iter_stats.specdec_stats.num_requests_with_draft_tokens = 0
            iter_stats.specdec_stats.acceptance_length = 1.0

        return
Contributor

⚠️ Potential issue

Fix return type annotation and approve statistics collection logic.

The statistics collection implementation is correct and properly uses py_draft_length to exclude padding. However, the return type annotation doesn't match the actual return value.

Apply this diff to fix the return type annotation:

-    ) -> Tuple[ScheduledRequests, Dict[int, LlmRequest]]:
+    ) -> None:

The rest of the implementation looks good - the acceptance length calculation and latency measurement are correct, and the conditional handling for cases with/without draft tokens is appropriate.
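As a quick worked example of the acceptance-length formula: if 4 requests carried real draft tokens and 12 of those drafts were accepted in total, acceptance_length = (12 + 4) / 4 = 4.0, i.e. each such request advanced by 4 tokens that iteration (3 accepted drafts plus the 1 token the target model produces regardless).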

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def _update_ngram_iter_stats(
        self,
        scheduled_requests: ScheduledRequests,
        iter_stats: IterationStats,
        start_time: float,
    ) -> None:
🧰 Tools
🪛 Ruff (0.12.2)

229-229: LlmRequest may be undefined, or defined from star imports

(F405)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/speculative/ngram.py lines 224 to 260, the method
_update_ngram_iter_stats currently has a return type annotation of
Tuple[ScheduledRequests, Dict[int, LlmRequest]] but does not return any value.
To fix this, update the return type annotation to None to match the actual
behavior of the method. The statistics collection logic is correct and does not
require changes.

@wili-65535
Collaborator Author

/bot run --disable-fail-fast

@NVIDIA NVIDIA deleted a comment from tensorrt-cicd Jul 24, 2025
@NVIDIA NVIDIA deleted a comment from tensorrt-cicd Jul 24, 2025
@tensorrt-cicd
Collaborator

PR_Github #12825 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12825 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9559 completed with status: 'SUCCESS'

if request.py_last_draft_tokens is not None:
    total_num_draft_tokens += request.py_draft_length
    total_num_accepted_tokens += request.py_num_accepted_draft_tokens
    num_requests_with_draft_tokens += 1
Collaborator

I think this number will always equal the number of scheduled generation requests for one iteration.
If num_requests_with_draft_tokens should exclude requests that carry only padded tokens, then maybe we should count requests with py_draft_length > 0.
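A small sketch of that suggestion (hypothetical; it only tightens the existing condition):

for request in scheduled_requests.generation_requests:
    # Count only requests that proposed at least one real (non-padding) draft token.
    if request.py_last_draft_tokens is not None and request.py_draft_length > 0:
        total_num_draft_tokens += request.py_draft_length
        total_num_accepted_tokens += request.py_num_accepted_draft_tokens
        num_requests_with_draft_tokens += 1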

request.py_draft_tokens = draft_tokens
request.py_draft_length = draft_length

def _update_ngram_iter_stats(
Collaborator

Maybe this function can be moved to Drafter class.
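For illustration, the hoist could look roughly like this (a sketch only; the method name and placement are assumptions, not part of this PR):

class Drafter(ABC):

    def update_spec_dec_iter_stats(
        self,
        scheduled_requests: ScheduledRequests,
        iter_stats: IterationStats,
        start_time: float,
    ) -> None:
        # Nothing below is NGram-specific, so ModelDrafter (and future drafters)
        # could reuse it as-is. Behavior mirrors _update_ngram_iter_stats above.
        stats = iter_stats.specdec_stats
        stats.num_draft_tokens = 0
        stats.num_accepted_tokens = 0
        stats.num_requests_with_draft_tokens = 0
        for request in scheduled_requests.generation_requests:
            if request.py_last_draft_tokens is not None:
                stats.num_draft_tokens += request.py_draft_length
                stats.num_accepted_tokens += request.py_num_accepted_draft_tokens
                stats.num_requests_with_draft_tokens += 1
        n = stats.num_requests_with_draft_tokens
        stats.iter_latency_ms = (time.time() - start_time) * 1e3 if n else 0.0
        stats.acceptance_length = (stats.num_accepted_tokens + n) / n if n else 1.0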

@pcastonguay
Collaborator

@tijyojwad @SimengLiu-nv @wili-65535 there have been no updates since Jul 24th. Should we close this PR? If not, can the addressed comments be resolved? @wili-65535 I see some comments from @SimengLiu-nv have not been addressed yet.

@SimengLiu-nv
Collaborator

I am good with closing the PR.

Labels
Community want to contribute PRs initiated from Community