[TRTLLM-5846] NGram with iter_stats #5542
Conversation
Force-pushed from 886ffaf to 9834dc8
Force-pushed from 9066237 to 872b390
```python
for request in scheduled_requests.generation_requests:
    num_draft_tokens = 0 if request.py_last_draft_tokens is None else len(
        request.py_last_draft_tokens)
    num_accepted_tokens = request.py_num_accepted_draft_tokens
```
nit: this access can be done within the if block below
Done! We simplified it as shown here.
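For context, a hedged sketch of what the simplified loop might look like (field names as in the snippet above; illustrative, not the exact committed code):

```python
for request in scheduled_requests.generation_requests:
    if request.py_last_draft_tokens is not None:
        # Both values are only meaningful when drafts were attached,
        # so the accepted-token access can sit inside the same branch.
        num_draft_tokens = len(request.py_last_draft_tokens)
        num_accepted_tokens = request.py_num_accepted_draft_tokens
```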
Force-pushed from 11bd865 to 95fa535
/bot run
PR_Github #10918 [ run ] triggered by Bot
PR_Github #10918 [ run ] completed with state
```python
num_requests_with_draft_tokens = 0
for request in scheduled_requests.generation_requests:
    if request.py_last_draft_tokens is not None:
        total_num_draft_tokens += len(request.py_last_draft_tokens)
```
Does py_last_draft_tokens include padding? Should we filter it so that only the "real" drafts are counted?
You're right, we should exclude the padding length here.
We now record the length of the real draft tokens in the request and add that to total_num_draft_tokens here.
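A small self-contained illustration of the point, using a hypothetical stand-in for LlmRequest (only the py_last_draft_tokens / py_draft_length field names come from this PR):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FakeRequest:
    # Stand-in for LlmRequest, for illustration only.
    py_last_draft_tokens: Optional[List[int]] = None  # may be padded to the max draft length
    py_draft_length: int = 0                          # number of real (unpadded) draft tokens

requests = [
    FakeRequest(py_last_draft_tokens=[7, 8, 0, 0], py_draft_length=2),  # 2 real + 2 padding
    FakeRequest(py_last_draft_tokens=None),                             # no drafts at all
]

total_num_draft_tokens = 0
for request in requests:
    if request.py_last_draft_tokens is not None:
        # Sum the unpadded length instead of len(py_last_draft_tokens).
        total_num_draft_tokens += request.py_draft_length

print(total_num_draft_tokens)  # -> 2, padding excluded
```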
Left comments.
Force-pushed from e799556 to e92e16b
```python
num_requests_with_draft_tokens = 0
for request in scheduled_requests.generation_requests:
    if request.py_last_draft_tokens is not None:
        total_num_draft_tokens += request.draft_length
```
Should it be request.py_draft_length?
Yes, fixed.
Thank you very much!
Force-pushed from ad106cc to db3ac00
/bot run --disable-fail-fast
PR_Github #11578 [ run ] triggered by Bot
PR_Github #11578 [ run ] completed with state
Force-pushed from db3ac00 to 3866ad4
📝 Walkthrough

The changes update the drafter interfaces so that prepare_draft_tokens accepts an optional IterationStats argument, and add iteration-statistics collection (draft and accepted token counts, latency, acceptance length) to the NGram drafter.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Executor as PyExecutor
    participant Drafter as Drafter/NGramDrafter/ModelDrafter
    participant Stats as IterationStats
    Executor->>Drafter: prepare_draft_tokens(scheduled_batch, resource_manager, iter_stats)
    alt NGramDrafter
        Drafter->>Drafter: _prepare_draft_tokens(scheduled_batch)
        Drafter->>Drafter: _update_ngram_iter_stats(scheduled_batch, iter_stats, start_time)
        Drafter->>Stats: Update specdec_stats (latency, token counts)
    else ModelDrafter or base Drafter
        Drafter->>Drafter: prepare_draft_tokens(scheduled_batch, resource_manager, iter_stats)
    end
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
fix by Joyjit suggestions
fix by Simeng suggestions
Signed-off-by: wili-65535 <[email protected]>
Signed-off-by: wili-65535 <[email protected]>
Force-pushed from 3866ad4 to d43d144
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (2 hunks)
- tensorrt_llm/_torch/speculative/ngram.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
- tensorrt_llm/_torch/speculative/model_drafter.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/speculative/drafter.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/speculative/ngram.py
229-229: LlmRequest may be undefined, or defined from star imports (F405)
🔇 Additional comments (3)

tensorrt_llm/_torch/speculative/ngram.py (3)

1-1: LGTM: Import additions support new iteration statistics functionality.
The added imports (time, Dict, Tuple, IterationStats) are all necessary for the new timing measurements and statistics collection features. Also applies to: 3-3, 7-7.

176-191: LGTM: Clean implementation of optional iteration statistics tracking.
The method signature change maintains backward compatibility with the optional parameter, and the timing wrapper pattern cleanly separates concerns between core functionality and statistics collection.

192-223: LGTM: Refactored method correctly preserves functionality and adds draft length tracking.
The extraction of the core logic into a private method is clean, and the addition of py_draft_length tracking properly addresses the need to distinguish actual draft tokens from padding for accurate statistics collection.
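To make the backward-compatibility point concrete, a caller-side sketch (variable names are illustrative; only the optional iter_stats argument comes from this PR):

```python
# Existing call sites keep working because the new argument is optional:
drafter.prepare_draft_tokens(scheduled_batch, resource_manager)

# The executor can opt in to statistics collection by passing the current stats object:
drafter.prepare_draft_tokens(scheduled_batch, resource_manager, iter_stats=iter_stats)
```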
```python
def _update_ngram_iter_stats(
    self,
    scheduled_requests: ScheduledRequests,
    iter_stats: IterationStats,
    start_time: float,
) -> Tuple[ScheduledRequests, Dict[int, LlmRequest]]:
    """
    Get statistic information from the draft tokens in NGram drafter
    """
    now_time = time.time()

    total_num_draft_tokens = 0
    total_num_accepted_tokens = 0
    num_requests_with_draft_tokens = 0
    for request in scheduled_requests.generation_requests:
        if request.py_last_draft_tokens is not None:
            total_num_draft_tokens += request.py_draft_length
            total_num_accepted_tokens += request.py_num_accepted_draft_tokens
            num_requests_with_draft_tokens += 1

    if num_requests_with_draft_tokens > 0:
        iter_stats.specdec_stats.iter_latency_ms = (now_time -
                                                    start_time) * 1e3
        iter_stats.specdec_stats.num_draft_tokens = total_num_draft_tokens
        iter_stats.specdec_stats.num_accepted_tokens = total_num_accepted_tokens
        iter_stats.specdec_stats.num_requests_with_draft_tokens = num_requests_with_draft_tokens
        iter_stats.specdec_stats.acceptance_length = (
            total_num_accepted_tokens +
            num_requests_with_draft_tokens) / num_requests_with_draft_tokens
    else:
        iter_stats.specdec_stats.iter_latency_ms = 0.0
        iter_stats.specdec_stats.num_draft_tokens = 0
        iter_stats.specdec_stats.num_accepted_tokens = 0
        iter_stats.specdec_stats.num_requests_with_draft_tokens = 0
        iter_stats.specdec_stats.acceptance_length = 1.0

    return
```
Fix return type annotation and approve statistics collection logic.
The statistics collection implementation is correct and properly uses py_draft_length
to exclude padding. However, the return type annotation doesn't match the actual return value.
Apply this diff to fix the return type annotation:
```diff
- ) -> Tuple[ScheduledRequests, Dict[int, LlmRequest]]:
+ ) -> None:
```
The rest of the implementation looks good - the acceptance length calculation and latency measurement are correct, and the conditional handling for cases with/without draft tokens is appropriate.
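For intuition, a tiny worked example of the acceptance-length formula (my reading: each request with drafts also emits the one token sampled by the target model, hence the extra num_requests_with_draft_tokens in the numerator):

```python
# Two requests carried drafts in this step; 3 and 2 of their draft tokens were accepted.
total_num_accepted_tokens = 3 + 2
num_requests_with_draft_tokens = 2

acceptance_length = (total_num_accepted_tokens +
                     num_requests_with_draft_tokens) / num_requests_with_draft_tokens
print(acceptance_length)  # 3.5 tokens produced per drafted request in this step, on average
```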
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def _update_ngram_iter_stats(
    self,
    scheduled_requests: ScheduledRequests,
    iter_stats: IterationStats,
    start_time: float,
) -> None:
    """
    Get statistic information from the draft tokens in NGram drafter
    """
    now_time = time.time()

    total_num_draft_tokens = 0
    total_num_accepted_tokens = 0
    num_requests_with_draft_tokens = 0
    for request in scheduled_requests.generation_requests:
        if request.py_last_draft_tokens is not None:
            total_num_draft_tokens += request.py_draft_length
            total_num_accepted_tokens += request.py_num_accepted_draft_tokens
            num_requests_with_draft_tokens += 1

    if num_requests_with_draft_tokens > 0:
        iter_stats.specdec_stats.iter_latency_ms = (now_time -
                                                    start_time) * 1e3
        iter_stats.specdec_stats.num_draft_tokens = total_num_draft_tokens
        iter_stats.specdec_stats.num_accepted_tokens = total_num_accepted_tokens
        iter_stats.specdec_stats.num_requests_with_draft_tokens = num_requests_with_draft_tokens
        iter_stats.specdec_stats.acceptance_length = (
            total_num_accepted_tokens +
            num_requests_with_draft_tokens) / num_requests_with_draft_tokens
    else:
        iter_stats.specdec_stats.iter_latency_ms = 0.0
        iter_stats.specdec_stats.num_draft_tokens = 0
        iter_stats.specdec_stats.num_accepted_tokens = 0
        iter_stats.specdec_stats.num_requests_with_draft_tokens = 0
        iter_stats.specdec_stats.acceptance_length = 1.0

    return
```
🧰 Tools
🪛 Ruff (0.12.2)
229-229: LlmRequest may be undefined, or defined from star imports (F405)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/speculative/ngram.py lines 224 to 260, the method
_update_ngram_iter_stats currently has a return type annotation of
Tuple[ScheduledRequests, Dict[int, LlmRequest]] but does not return any value.
To fix this, update the return type annotation to None to match the actual
behavior of the method. The statistics collection logic is correct and does not
require changes.
/bot run --disable-fail-fast
PR_Github #12825 [ run ] triggered by Bot
PR_Github #12825 [ run ] completed with state
```python
if request.py_last_draft_tokens is not None:
    total_num_draft_tokens += request.py_draft_length
    total_num_accepted_tokens += request.py_num_accepted_draft_tokens
    num_requests_with_draft_tokens += 1
```
I think this number will always equal the number of scheduled generation requests for one iteration.
If num_requests_with_draft_tokens should exclude requests whose draft tokens are padding only, then maybe we should count requests with py_draft_length > 0.
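A hedged sketch of that suggestion against the loop above (the extra py_draft_length > 0 condition is the reviewer's proposal, not code from the PR):

```python
for request in scheduled_requests.generation_requests:
    # Count a request only if it carried at least one real (non-padding) draft token.
    if request.py_last_draft_tokens is not None and request.py_draft_length > 0:
        total_num_draft_tokens += request.py_draft_length
        total_num_accepted_tokens += request.py_num_accepted_draft_tokens
        num_requests_with_draft_tokens += 1
```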
```python
request.py_draft_tokens = draft_tokens
request.py_draft_length = draft_length

def _update_ngram_iter_stats(
```
Maybe this function can be moved to the Drafter class.
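A hypothetical sketch of that refactor; the helper name _update_draft_iter_stats and the exact signatures are assumptions, only the idea of hoisting the shared statistics logic into the base Drafter comes from the comment:

```python
import time


class Drafter:
    """Base class sketch: statistics helper shared by all drafters."""

    def _update_draft_iter_stats(self, scheduled_requests, iter_stats, start_time) -> None:
        # Same statistics collection as _update_ngram_iter_stats above,
        # just no longer NGram-specific; body elided in this sketch.
        ...


class NGramDrafter(Drafter):
    def prepare_draft_tokens(self, scheduled_requests, resource_manager=None,
                             iter_stats=None) -> None:
        start_time = time.time()
        self._prepare_draft_tokens(scheduled_requests)  # core NGram drafting
        if iter_stats is not None:
            self._update_draft_iter_stats(scheduled_requests, iter_stats, start_time)
```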
@tijyojwad @SimengLiu-nv @wili-65535 there have been no updates since Jul 24th. Should we close? If not, can the addressed comments be resolved? @wili-65535 I see some comments from @SimengLiu-nv are not addressed yet.
I am good with closing the PR.
Description
Improve iteration statistics to enable more detailed performance tests for NGram speculative decoding.
We will add an experimental branch of NGram in overlap-scheduler mode once this PR is merged, and run performance tests to decide whether to enable the overlap scheduler for NGram.
Previous PR: PR4569.
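As a rough sketch of what the richer statistics enable, something like the helper below could log the new per-iteration fields (field names follow the code in this PR; the helper itself is illustrative):

```python
def log_specdec_stats(iter_stats) -> None:
    # Summarize the per-iteration speculative-decoding counters populated by the drafter.
    s = iter_stats.specdec_stats
    print(f"draft latency: {s.iter_latency_ms:.2f} ms | "
          f"drafted: {s.num_draft_tokens} | accepted: {s.num_accepted_tokens} | "
          f"requests with drafts: {s.num_requests_with_draft_tokens} | "
          f"acceptance length: {s.acceptance_length:.2f}")
```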
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
kill

kill
Kill all running builds associated with pull request.

skip

skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.