[TRTLLM-6903][feat] Support chunked prefill for multimodal models #6843
base: main
Conversation
📝 Walkthrough

Refactors multimodal runtime and embedding utilities to support KV-cache reuse and chunked prefill: adds get_multimodal_embeddings and caching helpers, changes find_input_mm_embeds to return per-parameter slices, updates MultimodalRuntimeData fields/logic, adjusts model/engine call-sites, and replaces/extends unit tests and integration parameters.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Model as Qwen2VL.forward
    participant Utils as get_multimodal_embeddings
    participant Sel as _get_active_multimodal_params
    participant Enc as encoder_forward_fn
    participant Cache as _cache_multimodal_embeddings
    participant Slice as find_input_mm_embeds
    Model->>Utils: get_multimodal_embeddings(encoder_forward_fn, multimodal_params)
    Utils->>Sel: select active multimodal_params (skip non-content / already cached)
    alt active params exist
        Utils->>Enc: encoder_forward_fn(active_params)
        Enc-->>Utils: concatenated embeddings
        Utils->>Cache: split & store per-param multimodal_embedding
    end
    Utils-->>Model: list of per-param embeddings
    Model->>Slice: find_input_mm_embeds(mm_embeds_list, multimodal_params)
    Slice-->>Model: per-param slices for KV-cache reuse / current chunk
```
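To make the flow concrete, here is a minimal caller-side sketch of the sequence above. It is illustrative only: the function names come from this PR, but the argument lists are simplified and vision_encoder stands in for the model's actual encoder call.

```python
# Illustrative sketch only; the real signatures in modeling_multimodal_utils.py
# and modeling_qwen2vl.py are richer than shown here.
def forward_multimodal_context(multimodal_params, input_ids, embed_tokens,
                               vision_encoder):
    # 1) Run the encoder only for params that are not already cached;
    #    cached per-param embeddings are reused from multimodal_data.
    mm_embeds = get_multimodal_embeddings(
        encoder_forward_fn=vision_encoder,  # stand-in for the real encoder call
        multimodal_params=multimodal_params,
    )
    # 2) Keep only the slices needed for this step (KV-cache reuse / current chunk).
    mm_embeds = find_input_mm_embeds(mm_embeds, multimodal_params)
    # 3) Fuse text embeddings with the selected multimodal embeddings.
    return fuse_input_embeds(embed_tokens, input_ids, mm_embeds)
```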
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 1
🧹 Nitpick comments (4)
tests/integration/defs/test_e2e.py (1)
2124-2125: Consider clarifying the chunked prefill support scope in code comments.

While the gating logic correctly restricts chunked prefill to Qwen2-VL models for now, this limitation and the image-only modality restriction should be documented in the code comment or docstring to help future maintainers understand the constraints.
```diff
+    # NOTE: chunked prefill is currently only supported for:
+    #   - Qwen2-VL and Qwen2.5-VL models
+    #   - Image modality only (not video/mixed)
     if model_name not in ["qwen2-vl-7b-instruct", "qwen2.5-vl-7b-instruct"
                           ] and enable_chunked_prefill:
         pytest.skip(
             "Only Qwen2-VL and Qwen2-5-VL support chunked prefill for now")
     if modality != "image" and enable_chunked_prefill:
         pytest.skip("Chunked prefill is only supported for image modality")
```

Also applies to: 2143-2148
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (3)
219-225: Clarify comments: "unseen" vs "cached" terminology is inconsistent.

In this module, num_unseen_mm_tokens represents already-seen/cached tokens to skip. The comments here say "All unseen," which is misleading. Use "Fully cached" (or "All cached") for clarity and consistency with the runtime data semantics.
Apply this diff:
```diff
-            self.create_multimodal_params(
-                8, 0, [8]),  # All unseen - should be skipped
+            self.create_multimodal_params(
+                8, 0, [8]),  # Fully cached - should be skipped
-            self.create_multimodal_params(
-                3, 6, [6, 3]),  # 3 unseen, 6 in current chunk
+            self.create_multimodal_params(
+                3, 6, [6, 3]),  # 3 cached, 6 in current chunk
-            self.create_multimodal_params(8, 0,
-                                          [8])  # All unseen - should be skipped
+            self.create_multimodal_params(8, 0,
+                                          [8])  # Fully cached - should be skipped
```

```diff
-            self.create_multimodal_params(10, 0, [10]),  # All unseen
-            self.create_multimodal_params(10, 0, [10]),  # All unseen
-            self.create_multimodal_params(10, 0, [10])  # All unseen
+            self.create_multimodal_params(10, 0, [10]),  # Fully cached
+            self.create_multimodal_params(10, 0, [10]),  # Fully cached
+            self.create_multimodal_params(10, 0, [10])  # Fully cached
```

Also applies to: 243-246
56-63: Fix minor docstring typos for readability.

Short, clear docstrings help future maintenance; tighten these two.
Apply this diff:
- """Test case chunk around chunk boundaries.""" + """Test case around chunk boundaries."""- """Test test chunk end is very large.""" + """Test when chunk_end_pos is very large."""Also applies to: 72-80
145-176: Consider an extra pass-through test for find_input_mm_embeds when runtime is None.

The implementation returns the full mm_embeds when multimodal_params is empty or when multimodal_runtime is None. Adding a direct unit test for that path would lock in behavior and prevent regressions.
If helpful, I can draft a small test case that asserts pass-through when multimodal_runtime is None for one or more params.
Also applies to: 177-212, 213-235, 236-272, 273-306, 307-321, 323-344, 345-363
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2 hunks)
- tensorrt_llm/_torch/models/modeling_qwen2vl.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tensorrt_llm/inputs/multimodal.py (2 hunks)
- tests/integration/defs/test_e2e.py (3 hunks)
- tests/unittest/_torch/multimodal/test_kvcache_reuse.py (0 hunks)
- tests/unittest/_torch/multimodal/test_multimodal_runtime.py (1 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/_torch/multimodal/test_kvcache_reuse.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
tensorrt_llm/_torch/models/modeling_multimodal_utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
tensorrt_llm/inputs/multimodal.py
tests/integration/defs/test_e2e.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
tensorrt_llm/_torch/models/modeling_multimodal_utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
tensorrt_llm/inputs/multimodal.py
tests/integration/defs/test_e2e.py
🧬 Code Graph Analysis (3)
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (2)
- tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2)
  - find_input_mm_embeds (153-222)
  - get_multimodal_embeddings (96-150)
- tensorrt_llm/inputs/multimodal.py (2)
  - MultimodalParams (168-426)
  - MultimodalRuntimeData (86-164)
tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2)
- tensorrt_llm/inputs/multimodal.py (2)
  - MultimodalParams (168-426)
  - has_content (424-426)
- tensorrt_llm/logger.py (2)
  - debug (143-144)
  - warning (131-132)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
- tensorrt_llm/_torch/models/modeling_multimodal_utils.py (3)
  - find_input_mm_embeds (153-222)
  - fuse_input_embeds (225-284)
  - get_multimodal_embeddings (96-150)
🪛 Ruff (0.12.2)
tensorrt_llm/inputs/multimodal.py
118-118: Line too long (148 > 120)
(E501)
163-163: Line too long (193 > 120)
(E501)
🔇 Additional comments (12)
tests/integration/defs/test_e2e.py (1)
2269-2277: LGTM! Consistent application of chunked prefill configuration.

The chunked prefill logic is properly handled with appropriate max_num_tokens settings based on the model and modality combination. The special handling for video modality with larger token counts (16384) is correct for Qwen2-VL models.
tensorrt_llm/inputs/multimodal.py (2)
86-100: Well-designed data structure for chunked prefill support.

The updated MultimodalRuntimeData correctly tracks multimodal token processing state with:
- Renamed past_seen_token_num for clarity
- New chunk_end_pos for chunked prefill boundaries
- Computed fields for unseen and current chunk tokens

The docstring clearly explains the purpose of each field.
113-164: Complex but correct chunked prefill token counting logic.

The __post_init__ implementation correctly handles:
- Automatic total_mm_tokens calculation when not provided
- Complex overlap scenarios between cached tokens and current chunk
- Proper validation that combined counts don't exceed total tokens

The nested conditionals correctly handle all edge cases for partial overlaps between cached/chunk boundaries.
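To illustrate the kind of counting being described, here is a minimal sketch; field names follow this review, and the actual post-init logic in tensorrt_llm/inputs/multimodal.py may differ in structure and validation.

```python
def count_mm_tokens_sketch(mm_token_positions, mm_token_lengths,
                           past_seen_token_num, chunk_end_pos):
    """Per-span overlap counting between cached tokens and the current chunk (sketch)."""
    num_unseen_mm_tokens = 0  # mm tokens already covered by the KV cache
    num_mm_tokens = 0         # mm tokens falling inside the current chunk
    for start, length in zip(mm_token_positions, mm_token_lengths):
        end = start + length
        # Portion of the span before past_seen_token_num: already cached.
        num_unseen_mm_tokens += max(0, min(end, past_seen_token_num) - start)
        # Portion inside [past_seen_token_num, chunk_end_pos): current chunk.
        num_mm_tokens += max(
            0, min(end, chunk_end_pos) - max(start, past_seen_token_num))
    # Combined counts can never exceed the total number of mm tokens.
    assert num_unseen_mm_tokens + num_mm_tokens <= sum(mm_token_lengths)
    return num_unseen_mm_tokens, num_mm_tokens
```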
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1239-1244: Correctly updated MultimodalRuntimeData construction for chunked prefill.

The changes properly pass:
- past_seen_token_num (renamed from num_cached_tokens)
- New chunk_end_pos parameter set to end_compute

This aligns with the updated MultimodalRuntimeData structure for chunked prefill support.

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (4)
33-58: Well-structured helper for identifying active multimodal parameters.

The _get_active_multimodal_params function cleanly filters parameters that need encoder processing by:
- Skipping parameters without content
- Skipping already-cached embeddings
- Providing debug logging for cache hits

This efficiently avoids redundant encoder computations.
60-94: Efficient embedding caching with proper validation.

The _cache_multimodal_embeddings function:
- Uses torch.split for efficient tensor splitting
- Validates total length matches expected
- Stores embeddings back to multimodal_data for reuse
- Includes helpful debug logging

The use of torch.split is more efficient than manual indexing.
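A hedged sketch of that caching pattern, assuming a per-param total_mm_tokens count and a multimodal_data dict keyed by "multimodal_embedding" (both named in this review); the real helper adds device handling, validation, and logging.

```python
import torch

def cache_multimodal_embeddings_sketch(multimodal_params, encoder_output):
    # Split the concatenated encoder output back into per-parameter chunks.
    lengths = [p.multimodal_runtime.total_mm_tokens for p in multimodal_params]
    assert sum(lengths) == encoder_output.shape[0], (
        "encoder output rows must match the sum of per-param mm token counts")
    per_param_embeds = torch.split(encoder_output, lengths, dim=0)
    # Store each slice on its param so later chunks can skip the encoder.
    for param, embed in zip(multimodal_params, per_param_embeds):
        param.multimodal_data["multimodal_embedding"] = embed
```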
96-151: Well-orchestrated multimodal embedding management.

The get_multimodal_embeddings function provides a clean high-level API that:
- Identifies active parameters needing processing
- Runs encoder only on uncached parameters
- Caches results when runtime data is available
- Assembles all embeddings for the batch

The early returns and validation checks handle edge cases appropriately.
153-222: Comprehensive slicing logic for chunked prefill scenarios.

The updated find_input_mm_embeds now returns List[torch.Tensor] and correctly handles:
- Both KV cache reuse and chunked prefill
- Two batching modes (individual vs pre-concatenated)
- Complex slicing based on num_unseen_mm_tokens and num_mm_tokens
- Empty list return when no tokens need processing

The documentation clearly explains the behavior for each scenario.
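For intuition, a simplified sketch of the per-parameter slicing for the non-concatenated mode (one tensor per param); the real find_input_mm_embeds also covers the pre-concatenated single-tensor mode and the empty-result case.

```python
from typing import List

import torch

def slice_current_chunk_sketch(mm_embeds: List[torch.Tensor],
                               multimodal_params) -> List[torch.Tensor]:
    """Keep only the embedding rows needed for the current step (sketch)."""
    sliced = []
    for embed, param in zip(mm_embeds, multimodal_params):
        runtime = param.multimodal_runtime
        if runtime is None:
            sliced.append(embed)  # no runtime info: pass embeddings through
            continue
        # Skip rows already covered by the KV cache, take rows in this chunk.
        start = runtime.num_unseen_mm_tokens
        end = start + runtime.num_mm_tokens
        if end > start:
            sliced.append(embed[start:end])
    return sliced
```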
tensorrt_llm/_torch/models/modeling_qwen2vl.py (2)
22-23: Import changes align with the new multimodal utilities API.

The imports correctly reflect the refactored API:
- Removed find_uncached_mm_embeds
- Added get_multimodal_embeddings for the new embedding orchestration
- Kept find_input_mm_embeds and fuse_input_embeds
616-630: Clean integration with the new multimodal embedding API.

The forward method now properly uses:
- get_multimodal_embeddings with encoder_forward_fn for clean separation of concerns
- find_input_mm_embeds (renamed from find_uncached_mm_embeds) for slicing
- Proper handling of context requests only

The DISAGG path comment correctly notes it's a dead path.
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (2)
396-659: Strong coverage of multimodal embedding orchestration and caching paths.

The tests for get_multimodal_embeddings comprehensively cover empty inputs, all-uncached, all-cached, mixed caching, missing runtime/total_mm_tokens, multi-modality early return, correct splitting via torch.split, and CUDA device handling. This is solid and aligns with the updated API behavior.
1-6: Add SPDX copyright header to test_multimodal_runtime.py

Please prepend the standard NVIDIA SPDX header for 2025 to the top of tests/unittest/_torch/multimodal/test_multimodal_runtime.py. For consistency with existing Python tests (e.g., tests/unittest/_torch/test_group_rmn_norm.py), it should read:

```diff
--- a/tests/unittest/_torch/multimodal/test_multimodal_runtime.py
+++ b/tests/unittest/_torch/multimodal/test_multimodal_runtime.py
@@ -1,2 +1,4 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
 from typing import List
 from unittest.mock import Mock
```

⛔ Skipped due to learnings

Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (2)
13-85: Clarify "unseen" semantics in comments to match slicing behavior.

Across these tests, num_unseen_mm_tokens is used as the prefix-skip count (i.e., tokens already seen/cached before the current chunk). The term "unseen" can be read as the opposite. Consider tightening comments/docstrings to avoid confusion for future readers.
Example rewording in comments:
- “X unseen” → “skip X cached (prefix) tokens”
- “all unseen” → “all tokens cached (0 tokens in current chunk)”
396-404: Optional: remove pytest entrypoint or keep consistently across test modules.

pytest auto-discovers tests; the explicit main entrypoint is not required. Keep it only if this repository follows that pattern consistently.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/unittest/_torch/multimodal/test_multimodal_runtime.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (1)
127-135: Resolved: batched slicing now sets total_mm_tokens on runtime mock.

Thanks for addressing the earlier review note; setting runtime.total_mm_tokens = sum(mm_token_lengths) unblocks batched slicing in find_input_mm_embeds when len(mm_embeds) == 1.
5c437a2 to bfcf14a (Compare)
/bot run
PR_Github #15030 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (4)
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (4)
1-1: Missing NVIDIA copyright header (2025).

Per the coding guidelines, prepend the current-year NVIDIA copyright header to all source files.

Apply this diff at the top of the file:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 from typing import List
```
127-135: Add runtime.total_mm_tokens to the mock for concatenated mode slicing.

The find_input_mm_embeds function expects runtime.total_mm_tokens when processing pre-concatenated embeddings (len(mm_embeds) == 1), but the mock currently doesn't set it. This will cause errors during concatenated mode tests.

The mock already sets total_mm_tokens on line 133, which correctly addresses this issue.
269-272: Fix direct list comparison of tensors to avoid ambiguous boolean evaluation.

Comparing lists of tensors with == triggers element-wise tensor comparison, which raises "Boolean value of Tensor is ambiguous" errors in PyTorch.

Apply this diff:

```diff
     result = find_input_mm_embeds(mm_embeds, multimodal_params)
     # Should return the full embeddings
-    assert result == mm_embeds
+    assert len(result) == 1
+    torch.testing.assert_close(result[0], mm_embeds[0])
```
329-333: Fix tensor equality check in single-batch scenario.

Same issue as above: direct list comparison will fail with PyTorch tensors.

Apply this diff:

```diff
     result = find_input_mm_embeds(mm_embeds, multimodal_params)
-    assert result == mm_embeds
+    assert len(result) == 1
+    torch.testing.assert_close(result[0], mm_embeds[0])
```
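For context, a tiny standalone example of why the list equality is ambiguous and why the element-wise check is the safer pattern:

```python
import torch

a = [torch.zeros(2, 4)]
b = [torch.zeros(2, 4)]

# `a == b` compares the lists element-wise; each tensor comparison yields a
# (2, 4) boolean tensor whose truth value is ambiguous, so `assert a == b`
# raises "Boolean value of Tensor ... is ambiguous" at runtime.
assert len(a) == len(b)
torch.testing.assert_close(a[0], b[0])  # explicit, unambiguous comparison
```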
🧹 Nitpick comments (2)
tensorrt_llm/inputs/multimodal.py (2)
118-118: Consider wrapping long error message for readability.

The error message exceeds the 120-character line limit. While functional, consider wrapping for better readability.

Apply this diff:

```diff
-            f"mm_token_positions ({len(self.mm_token_positions)}) and mm_token_lengths ({len(self.mm_token_lengths)}) must have the same length"
+            f"mm_token_positions ({len(self.mm_token_positions)}) and "
+            f"mm_token_lengths ({len(self.mm_token_lengths)}) must have the same length"
```
163-164: Consider wrapping long error message for readability.

The error message significantly exceeds the 120-character line limit.

Apply this diff:

```diff
-            f"num_unseen_mm_tokens ({self.num_unseen_mm_tokens}) + num_mm_tokens ({self.num_mm_tokens}) must be less than or equal to sum of mm_token_lengths ({sum(self.mm_token_lengths)})"
+            f"num_unseen_mm_tokens ({self.num_unseen_mm_tokens}) + num_mm_tokens ({self.num_mm_tokens}) "
+            f"must be less than or equal to sum of mm_token_lengths ({sum(self.mm_token_lengths)})"
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2 hunks)
- tensorrt_llm/_torch/models/modeling_qwen2vl.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tensorrt_llm/inputs/multimodal.py (2 hunks)
- tests/integration/defs/test_e2e.py (3 hunks)
- tests/unittest/_torch/multimodal/test_kvcache_reuse.py (0 hunks)
- tests/unittest/_torch/multimodal/test_multimodal_runtime.py (1 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/_torch/multimodal/test_kvcache_reuse.py
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/models/modeling_qwen2vl.py
- tensorrt_llm/_torch/models/modeling_multimodal_utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/inputs/multimodal.py
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
tests/integration/defs/test_e2e.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/inputs/multimodal.py
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
tests/integration/defs/test_e2e.py
🧠 Learnings (1)
📚 Learning: 2025-08-12T10:28:57.288Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.288Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files
Applied to files:
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
🪛 Ruff (0.12.2)
tensorrt_llm/inputs/multimodal.py
118-118: Line too long (148 > 120)
(E501)
163-163: Line too long (193 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/defs/test_e2e.py (3)
2124-2124: LGTM! Parameterization for chunked prefill testing added.

The addition of the enable_chunked_prefill parameter enables comprehensive testing of the new chunked prefill feature for multimodal models.
2143-2149: Good gating logic for unsupported models and modalities.

The implementation correctly restricts chunked prefill to only Qwen2-VL models and image modality, preventing incorrect usage with unsupported configurations.
2273-2277: LGTM! Conditional max_num_tokens configuration for chunked prefill.

The logic correctly sets max_num_tokens to 256 when chunked prefill is enabled, aligning with the expected behavior for chunked processing.
tensorrt_llm/inputs/multimodal.py (4)
93-97: LGTM! Clear documentation of KV cache reuse and chunked prefill attributes.

The updated docstring clearly explains the new fields and their purpose in both KV cache reuse and chunked prefill scenarios.
114-115: Logical auto-calculation of total_mm_tokens.

Good defensive programming: automatically calculating total_mm_tokens when not provided ensures consistency.
121-124: Appropriate validation for past_seen_token_num.

The non-negative check ensures the field maintains a valid state.
136-159: Complex but correct token counting logic for chunked prefill.

The implementation properly handles all overlap scenarios between multimodal token spans and chunk boundaries:
- Fully cached tokens (before past_seen_token_num)
- Partially cached tokens (spanning past_seen_token_num)
- Tokens in current chunk (between past_seen_token_num and chunk_end_pos)
- Tokens beyond current chunk
The logic correctly accumulates num_unseen_mm_tokens for cached tokens and num_mm_tokens for tokens in the current chunk.
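As a concrete illustration of these cases (numbers invented for this example, not taken from the tests): with mm_token_positions=[4], mm_token_lengths=[10], past_seen_token_num=8, and chunk_end_pos=12, the single span covers positions 4-13; positions 4-7 are already cached and add 4 to num_unseen_mm_tokens, positions 8-11 fall inside the current chunk and add 4 to num_mm_tokens, and positions 12-13 are deferred to a later chunk, so the combined count 4 + 4 stays within the total of 10.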
PR_Github #15030 [ run ] completed with state
Added some minor comments.
Signing off on pyexecutor changes on behalf of trt-llm-torch-runtime-devs. I did not review modeling changes, will leave that to the qwen devs.
Thx for the work.
Thanks for supporting the chunked prefill for MM models. This will be an essential feature for improving MM performance.
920c384 to a6befdc (Compare)
/bot run
PR_Github #15580 [ run ] triggered by Bot
PR_Github #15580 [ run ] completed with state
/bot run
PR_Github #15587 [ run ] triggered by Bot
PR_Github #15587 [ run ] completed with state
216a3a7 to 1b6d4e6 (Compare)
PR_Github #18308 [ run ] completed with state
c1533a5 to a56beac (Compare)
/bot run
PR_Github #18411 [ run ] triggered by Bot
PR_Github #18411 [ run ] completed with state
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
…lama4) Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
def8e67 to 5bcf58a (Compare)
/bot run
PR_Github #18445 [ run ] triggered by Bot
PR_Github #18445 [ run ] completed with state
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
/bot run
PR_Github #18457 [ run ] triggered by Bot
PR_Github #18457 [ run ] completed with state
/bot run
PR_Github #18462 [ run ] triggered by Bot
PR_Github #18462 [ run ] completed with state
/bot run --reuse-test
PR_Github #18477 [ run ] triggered by Bot
Note this PR only enables the chunked prefill support for llava-next and qwen2 series models.
Summary by CodeRabbit
New Features
Refactor
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional) pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.