[TRTLLM-6903][feat] Support chunked prefill for multimodal models #6843
base: main
Conversation
📝 Walkthrough

Refactors multimodal runtime and embedding utilities to support KV-cache reuse and chunked prefill: adds get_multimodal_embeddings and caching helpers, changes find_input_mm_embeds to return per-parameter slices, updates MultimodalRuntimeData fields/logic, adjusts model/engine call-sites, and replaces/extends unit tests and integration parameters.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Model as Qwen2VL.forward
    participant Utils as get_multimodal_embeddings
    participant Sel as _get_active_multimodal_params
    participant Enc as encoder_forward_fn
    participant Cache as _cache_multimodal_embeddings
    participant Slice as find_input_mm_embeds
    Model->>Utils: get_multimodal_embeddings(encoder_forward_fn, multimodal_params)
    Utils->>Sel: select active multimodal_params (skip non-content / already cached)
    alt active params exist
        Utils->>Enc: encoder_forward_fn(active_params)
        Enc-->>Utils: concatenated embeddings
        Utils->>Cache: split & store per-param multimodal_embedding
    end
    Utils-->>Model: list of per-param embeddings
    Model->>Slice: find_input_mm_embeds(mm_embeds_list, multimodal_params)
    Slice-->>Model: per-param slices for KV-cache reuse / current chunk
```
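To make the flow concrete, here is a minimal caller-side sketch of the sequence above. It is illustrative only: the function names come from this PR, but the argument lists are simplified and vision_encoder stands in for the model's actual encoder call.

```python
# Illustrative sketch only; the real signatures in modeling_multimodal_utils.py
# and modeling_qwen2vl.py are richer than shown here.
def forward_multimodal_context(multimodal_params, input_ids, embed_tokens,
                               vision_encoder):
    # 1) Run the encoder only for params that are not already cached;
    #    cached per-param embeddings are reused from multimodal_data.
    mm_embeds = get_multimodal_embeddings(
        encoder_forward_fn=vision_encoder,  # stand-in for the real encoder call
        multimodal_params=multimodal_params,
    )
    # 2) Keep only the slices needed for this step (KV-cache reuse / current chunk).
    mm_embeds = find_input_mm_embeds(mm_embeds, multimodal_params)
    # 3) Fuse text embeddings with the selected multimodal embeddings.
    return fuse_input_embeds(embed_tokens, input_ids, mm_embeds)
```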
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 1
🧹 Nitpick comments (4)
tests/integration/defs/test_e2e.py (1)
2124-2125: Consider clarifying the chunked prefill support scope in code comments.

While the gating logic correctly restricts chunked prefill to Qwen2-VL models for now, this limitation and the image-only modality restriction should be documented in the code comment or docstring to help future maintainers understand the constraints.
```diff
+    # NOTE: chunked prefill is currently only supported for:
+    #   - Qwen2-VL and Qwen2.5-VL models
+    #   - Image modality only (not video/mixed)
     if model_name not in ["qwen2-vl-7b-instruct", "qwen2.5-vl-7b-instruct"
                           ] and enable_chunked_prefill:
         pytest.skip(
             "Only Qwen2-VL and Qwen2-5-VL support chunked prefill for now")
     if modality != "image" and enable_chunked_prefill:
         pytest.skip("Chunked prefill is only supported for image modality")
```

Also applies to: 2143-2148
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (3)
219-225: Clarify comments: "unseen" vs "cached" terminology is inconsistent.

In this module, num_unseen_mm_tokens represents already-seen/cached tokens to skip. The comments here say "All unseen," which is misleading. Use "Fully cached" (or "All cached") for clarity and consistency with the runtime data semantics.
Apply this diff:
```diff
-            self.create_multimodal_params(
-                8, 0, [8]),  # All unseen - should be skipped
+            self.create_multimodal_params(
+                8, 0, [8]),  # Fully cached - should be skipped
-            self.create_multimodal_params(
-                3, 6, [6, 3]),  # 3 unseen, 6 in current chunk
+            self.create_multimodal_params(
+                3, 6, [6, 3]),  # 3 cached, 6 in current chunk
-            self.create_multimodal_params(8, 0,
-                                          [8])  # All unseen - should be skipped
+            self.create_multimodal_params(8, 0,
+                                          [8])  # Fully cached - should be skipped
```

```diff
-            self.create_multimodal_params(10, 0, [10]),  # All unseen
-            self.create_multimodal_params(10, 0, [10]),  # All unseen
-            self.create_multimodal_params(10, 0, [10])  # All unseen
+            self.create_multimodal_params(10, 0, [10]),  # Fully cached
+            self.create_multimodal_params(10, 0, [10]),  # Fully cached
+            self.create_multimodal_params(10, 0, [10])  # Fully cached
```

Also applies to: 243-246
56-63: Fix minor docstring typos for readability.

Short, clear docstrings help future maintenance; tighten these two.
Apply this diff:
- """Test case chunk around chunk boundaries.""" + """Test case around chunk boundaries."""- """Test test chunk end is very large.""" + """Test when chunk_end_pos is very large."""Also applies to: 72-80
145-176: Consider an extra pass-through test for find_input_mm_embeds when runtime is None.

The implementation returns the full mm_embeds when multimodal_params is empty or when multimodal_runtime is None. Adding a direct unit test for that path would lock in behavior and prevent regressions.
If helpful, I can draft a small test case that asserts pass-through when multimodal_runtime is None for one or more params.
Also applies to: 177-212, 213-235, 236-272, 273-306, 307-321, 323-344, 345-363
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2 hunks)
- tensorrt_llm/_torch/models/modeling_qwen2vl.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tensorrt_llm/inputs/multimodal.py (2 hunks)
- tests/integration/defs/test_e2e.py (3 hunks)
- tests/unittest/_torch/multimodal/test_kvcache_reuse.py (0 hunks)
- tests/unittest/_torch/multimodal/test_multimodal_runtime.py (1 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/_torch/multimodal/test_kvcache_reuse.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
tensorrt_llm/_torch/models/modeling_multimodal_utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
tensorrt_llm/inputs/multimodal.py
tests/integration/defs/test_e2e.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
tensorrt_llm/_torch/models/modeling_multimodal_utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
tensorrt_llm/inputs/multimodal.py
tests/integration/defs/test_e2e.py
🧬 Code Graph Analysis (3)
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (2)
- tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2)
  - find_input_mm_embeds (153-222)
  - get_multimodal_embeddings (96-150)
- tensorrt_llm/inputs/multimodal.py (2)
  - MultimodalParams (168-426)
  - MultimodalRuntimeData (86-164)
tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2)
- tensorrt_llm/inputs/multimodal.py (2)
  - MultimodalParams (168-426)
  - has_content (424-426)
- tensorrt_llm/logger.py (2)
  - debug (143-144)
  - warning (131-132)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
- tensorrt_llm/_torch/models/modeling_multimodal_utils.py (3)
  - find_input_mm_embeds (153-222)
  - fuse_input_embeds (225-284)
  - get_multimodal_embeddings (96-150)
🪛 Ruff (0.12.2)
tensorrt_llm/inputs/multimodal.py
118-118: Line too long (148 > 120)
(E501)
163-163: Line too long (193 > 120)
(E501)
🔇 Additional comments (12)
tests/integration/defs/test_e2e.py (1)
2269-2277: LGTM! Consistent application of chunked prefill configuration.

The chunked prefill logic is properly handled with appropriate max_num_tokens settings based on the model and modality combination. The special handling for video modality with larger token counts (16384) is correct for Qwen2-VL models.
tensorrt_llm/inputs/multimodal.py (2)
86-100: Well-designed data structure for chunked prefill support.

The updated MultimodalRuntimeData correctly tracks multimodal token processing state with:
- Renamed past_seen_token_num for clarity
- New chunk_end_pos for chunked prefill boundaries
- Computed fields for unseen and current chunk tokens

The docstring clearly explains the purpose of each field.
113-164: Complex but correct chunked prefill token counting logic.

The __post_init__ implementation correctly handles:
- Automatic total_mm_tokens calculation when not provided
- Complex overlap scenarios between cached tokens and current chunk
- Proper validation that combined counts don't exceed total tokens

The nested conditionals correctly handle all edge cases for partial overlaps between cached/chunk boundaries.
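To illustrate the kind of counting being described, here is a minimal sketch; field names follow this review, and the actual post-init logic in tensorrt_llm/inputs/multimodal.py may differ in structure and validation.

```python
def count_mm_tokens_sketch(mm_token_positions, mm_token_lengths,
                           past_seen_token_num, chunk_end_pos):
    """Per-span overlap counting between cached tokens and the current chunk (sketch)."""
    num_unseen_mm_tokens = 0  # mm tokens already covered by the KV cache
    num_mm_tokens = 0         # mm tokens falling inside the current chunk
    for start, length in zip(mm_token_positions, mm_token_lengths):
        end = start + length
        # Portion of the span before past_seen_token_num: already cached.
        num_unseen_mm_tokens += max(0, min(end, past_seen_token_num) - start)
        # Portion inside [past_seen_token_num, chunk_end_pos): current chunk.
        num_mm_tokens += max(
            0, min(end, chunk_end_pos) - max(start, past_seen_token_num))
    # Combined counts can never exceed the total number of mm tokens.
    assert num_unseen_mm_tokens + num_mm_tokens <= sum(mm_token_lengths)
    return num_unseen_mm_tokens, num_mm_tokens
```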
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1239-1244: Correctly updated MultimodalRuntimeData construction for chunked prefill.

The changes properly pass:
- past_seen_token_num (renamed from num_cached_tokens)
- New chunk_end_pos parameter set to end_compute

This aligns with the updated MultimodalRuntimeData structure for chunked prefill support.

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (4)
33-58: Well-structured helper for identifying active multimodal parameters.

The _get_active_multimodal_params function cleanly filters parameters that need encoder processing by:
- Skipping parameters without content
- Skipping already-cached embeddings
- Providing debug logging for cache hits

This efficiently avoids redundant encoder computations.
60-94: Efficient embedding caching with proper validation.

The _cache_multimodal_embeddings function:
- Uses torch.split for efficient tensor splitting
- Validates total length matches expected
- Stores embeddings back to multimodal_data for reuse
- Includes helpful debug logging

The use of torch.split is more efficient than manual indexing.
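A hedged sketch of that caching pattern, assuming a per-param total_mm_tokens count and a multimodal_data dict keyed by "multimodal_embedding" (both named in this review); the real helper adds device handling, validation, and logging.

```python
import torch

def cache_multimodal_embeddings_sketch(multimodal_params, encoder_output):
    # Split the concatenated encoder output back into per-parameter chunks.
    lengths = [p.multimodal_runtime.total_mm_tokens for p in multimodal_params]
    assert sum(lengths) == encoder_output.shape[0], (
        "encoder output rows must match the sum of per-param mm token counts")
    per_param_embeds = torch.split(encoder_output, lengths, dim=0)
    # Store each slice on its param so later chunks can skip the encoder.
    for param, embed in zip(multimodal_params, per_param_embeds):
        param.multimodal_data["multimodal_embedding"] = embed
```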
96-151: Well-orchestrated multimodal embedding management.

The get_multimodal_embeddings function provides a clean high-level API that:
- Identifies active parameters needing processing
- Runs encoder only on uncached parameters
- Caches results when runtime data is available
- Assembles all embeddings for the batch

The early returns and validation checks handle edge cases appropriately.
153-222: Comprehensive slicing logic for chunked prefill scenarios.

The updated find_input_mm_embeds now returns List[torch.Tensor] and correctly handles:
- Both KV cache reuse and chunked prefill
- Two batching modes (individual vs pre-concatenated)
- Complex slicing based on num_unseen_mm_tokens and num_mm_tokens
- Empty list return when no tokens need processing

The documentation clearly explains the behavior for each scenario.
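For intuition, a simplified sketch of the per-parameter slicing for the non-concatenated mode (one tensor per param); the real find_input_mm_embeds also covers the pre-concatenated single-tensor mode and the empty-result case.

```python
from typing import List

import torch

def slice_current_chunk_sketch(mm_embeds: List[torch.Tensor],
                               multimodal_params) -> List[torch.Tensor]:
    """Keep only the embedding rows needed for the current step (sketch)."""
    sliced = []
    for embed, param in zip(mm_embeds, multimodal_params):
        runtime = param.multimodal_runtime
        if runtime is None:
            sliced.append(embed)  # no runtime info: pass embeddings through
            continue
        # Skip rows already covered by the KV cache, take rows in this chunk.
        start = runtime.num_unseen_mm_tokens
        end = start + runtime.num_mm_tokens
        if end > start:
            sliced.append(embed[start:end])
    return sliced
```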
tensorrt_llm/_torch/models/modeling_qwen2vl.py (2)
22-23: Import changes align with the new multimodal utilities API.

The imports correctly reflect the refactored API:
- Removed find_uncached_mm_embeds
- Added get_multimodal_embeddings for the new embedding orchestration
- Kept find_input_mm_embeds and fuse_input_embeds
616-630: Clean integration with the new multimodal embedding API.

The forward method now properly uses:
- get_multimodal_embeddings with encoder_forward_fn for clean separation of concerns
- find_input_mm_embeds (renamed from find_uncached_mm_embeds) for slicing
- Proper handling of context requests only

The DISAGG path comment correctly notes it's a dead path.
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (2)
396-659: Strong coverage of multimodal embedding orchestration and caching paths.

The tests for get_multimodal_embeddings comprehensively cover empty inputs, all-uncached, all-cached, mixed caching, missing runtime/total_mm_tokens, multi-modality early return, correct splitting via torch.split, and CUDA device handling. This is solid and aligns with the updated API behavior.
1-6: Add SPDX copyright header to test_multimodal_runtime.py

Please prepend the standard NVIDIA SPDX header for 2025 to the top of tests/unittest/_torch/multimodal/test_multimodal_runtime.py. For consistency with existing Python tests (e.g., tests/unittest/_torch/test_group_rmn_norm.py), it should read:

```diff
--- a/tests/unittest/_torch/multimodal/test_multimodal_runtime.py
+++ b/tests/unittest/_torch/multimodal/test_multimodal_runtime.py
@@ -1,2 +1,4 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
 from typing import List
 from unittest.mock import Mock
```

⛔ Skipped due to learnings

Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (2)
13-85: Clarify "unseen" semantics in comments to match slicing behavior.

Across these tests, num_unseen_mm_tokens is used as the prefix-skip count (i.e., tokens already seen/cached before the current chunk). The term "unseen" can be read as the opposite. Consider tightening comments/docstrings to avoid confusion for future readers.
Example rewording in comments:
- “X unseen” → “skip X cached (prefix) tokens”
- “all unseen” → “all tokens cached (0 tokens in current chunk)”
396-404: Optional: remove pytest entrypoint or keep consistently across test modules.

pytest auto-discovers tests; the explicit main entrypoint is not required. Keep it only if this repository follows that pattern consistently.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/unittest/_torch/multimodal/test_multimodal_runtime.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (1)
127-135: Resolved: batched slicing now sets total_mm_tokens on runtime mock.

Thanks for addressing the earlier review note; setting runtime.total_mm_tokens = sum(mm_token_lengths) unblocks batched slicing in find_input_mm_embeds when len(mm_embeds) == 1.
5c437a2 to bfcf14a (Compare)
/bot run
PR_Github #15030 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (4)
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (4)
1-1: Missing NVIDIA copyright header (2025).

Per the coding guidelines, prepend the current-year NVIDIA copyright header to all source files.

Apply this diff at the top of the file:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 from typing import List
```
127-135: Add runtime.total_mm_tokens to the mock for concatenated mode slicing.

The find_input_mm_embeds function expects runtime.total_mm_tokens when processing pre-concatenated embeddings (len(mm_embeds) == 1), but the mock currently doesn't set it. This will cause errors during concatenated mode tests.

The mock already sets total_mm_tokens on line 133, which correctly addresses this issue.
269-272: Fix direct list comparison of tensors to avoid ambiguous boolean evaluation.

Comparing lists of tensors with == triggers element-wise tensor comparison, which raises "Boolean value of Tensor is ambiguous" errors in PyTorch.

Apply this diff:

```diff
     result = find_input_mm_embeds(mm_embeds, multimodal_params)
     # Should return the full embeddings
-    assert result == mm_embeds
+    assert len(result) == 1
+    torch.testing.assert_close(result[0], mm_embeds[0])
```
329-333: Fix tensor equality check in single-batch scenario.

Same issue as above: direct list comparison will fail with PyTorch tensors.

Apply this diff:

```diff
     result = find_input_mm_embeds(mm_embeds, multimodal_params)
-    assert result == mm_embeds
+    assert len(result) == 1
+    torch.testing.assert_close(result[0], mm_embeds[0])
```
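For context, a tiny standalone example of why the list equality is ambiguous and why the element-wise check is the safer pattern:

```python
import torch

a = [torch.zeros(2, 4)]
b = [torch.zeros(2, 4)]

# `a == b` compares the lists element-wise; each tensor comparison yields a
# (2, 4) boolean tensor whose truth value is ambiguous, so `assert a == b`
# raises "Boolean value of Tensor ... is ambiguous" at runtime.
assert len(a) == len(b)
torch.testing.assert_close(a[0], b[0])  # explicit, unambiguous comparison
```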
🧹 Nitpick comments (2)
tensorrt_llm/inputs/multimodal.py (2)
118-118: Consider wrapping long error message for readability.

The error message exceeds the 120-character line limit. While functional, consider wrapping for better readability.

Apply this diff:

```diff
-            f"mm_token_positions ({len(self.mm_token_positions)}) and mm_token_lengths ({len(self.mm_token_lengths)}) must have the same length"
+            f"mm_token_positions ({len(self.mm_token_positions)}) and "
+            f"mm_token_lengths ({len(self.mm_token_lengths)}) must have the same length"
```
163-164: Consider wrapping long error message for readability.

The error message significantly exceeds the 120-character line limit.

Apply this diff:

```diff
-            f"num_unseen_mm_tokens ({self.num_unseen_mm_tokens}) + num_mm_tokens ({self.num_mm_tokens}) must be less than or equal to sum of mm_token_lengths ({sum(self.mm_token_lengths)})"
+            f"num_unseen_mm_tokens ({self.num_unseen_mm_tokens}) + num_mm_tokens ({self.num_mm_tokens}) "
+            f"must be less than or equal to sum of mm_token_lengths ({sum(self.mm_token_lengths)})"
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2 hunks)
- tensorrt_llm/_torch/models/modeling_qwen2vl.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tensorrt_llm/inputs/multimodal.py (2 hunks)
- tests/integration/defs/test_e2e.py (3 hunks)
- tests/unittest/_torch/multimodal/test_kvcache_reuse.py (0 hunks)
- tests/unittest/_torch/multimodal/test_multimodal_runtime.py (1 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/_torch/multimodal/test_kvcache_reuse.py
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/models/modeling_qwen2vl.py
- tensorrt_llm/_torch/models/modeling_multimodal_utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/inputs/multimodal.py
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
tests/integration/defs/test_e2e.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/inputs/multimodal.py
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
tests/integration/defs/test_e2e.py
🧠 Learnings (1)
📚 Learning: 2025-08-12T10:28:57.288Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.288Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files
Applied to files:
tests/unittest/_torch/multimodal/test_multimodal_runtime.py
🪛 Ruff (0.12.2)
tensorrt_llm/inputs/multimodal.py
118-118: Line too long (148 > 120)
(E501)
163-163: Line too long (193 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/defs/test_e2e.py (3)
2124-2124: LGTM! Parameterization for chunked prefill testing added.

The addition of the enable_chunked_prefill parameter enables comprehensive testing of the new chunked prefill feature for multimodal models.
2143-2149: Good gating logic for unsupported models and modalities.

The implementation correctly restricts chunked prefill to only Qwen2-VL models and image modality, preventing incorrect usage with unsupported configurations.
2273-2277: LGTM! Conditional max_num_tokens configuration for chunked prefill.

The logic correctly sets max_num_tokens to 256 when chunked prefill is enabled, aligning with the expected behavior for chunked processing.
tensorrt_llm/inputs/multimodal.py (4)
93-97: LGTM! Clear documentation of KV cache reuse and chunked prefill attributes.

The updated docstring clearly explains the new fields and their purpose in both KV cache reuse and chunked prefill scenarios.
114-115: Logical auto-calculation of total_mm_tokens.

Good defensive programming: automatically calculating total_mm_tokens when not provided ensures consistency.
121-124: Appropriate validation for past_seen_token_num.

The non-negative check ensures the field maintains a valid state.
136-159: Complex but correct token counting logic for chunked prefill.

The implementation properly handles all overlap scenarios between multimodal token spans and chunk boundaries:
- Fully cached tokens (before past_seen_token_num)
- Partially cached tokens (spanning past_seen_token_num)
- Tokens in current chunk (between past_seen_token_num and chunk_end_pos)
- Tokens beyond current chunk
The logic correctly accumulates num_unseen_mm_tokens for cached tokens and num_mm_tokens for tokens in the current chunk.
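As a concrete illustration of these cases (numbers invented for this example, not taken from the tests): with mm_token_positions=[4], mm_token_lengths=[10], past_seen_token_num=8, and chunk_end_pos=12, the single span covers positions 4-13; positions 4-7 are already cached and add 4 to num_unseen_mm_tokens, positions 8-11 fall inside the current chunk and add 4 to num_mm_tokens, and positions 12-13 are deferred to a later chunk, so the combined count 4 + 4 stays within the total of 10.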
PR_Github #15030 [ run ] completed with state
Added some minor comments.
Signing off on pyexecutor changes on behalf of trt-llm-torch-runtime-devs. I did not review modeling changes, will leave that to the qwen devs.
Thx for the work.
Thanks for supporting the chunked prefill for MM models. This will be an essential feature for improving MM performance.
920c384 to a6befdc (Compare)
/bot run
PR_Github #15580 [ run ] triggered by Bot
PR_Github #15580 [ run ] completed with state
/bot run
PR_Github #15587 [ run ] triggered by Bot
PR_Github #15587 [ run ] completed with state
216a3a7 to 1b6d4e6 (Compare)
PR_Github #18308 [ run ] completed with state
c1533a5 to a56beac (Compare)
/bot run
PR_Github #18411 [ run ] triggered by Bot
PR_Github #18411 [ run ] completed with state
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
…lama4) Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
def8e67 to 5bcf58a (Compare)
/bot run
PR_Github #18445 [ run ] triggered by Bot
PR_Github #18445 [ run ] completed with state
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
/bot run
PR_Github #18457 [ run ] triggered by Bot
PR_Github #18457 [ run ] completed with state
/bot run
PR_Github #18462 [ run ] triggered by Bot
PR_Github #18462 [ run ] completed with state
/bot run --reuse-test
PR_Github #18477 [ run ] triggered by Bot
Note this PR only enables the chunked prefill support for llava-next and qwen2 series models.
Summary by CodeRabbit
New Features
Refactor
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional) pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.