
Conversation

chang-l
Collaborator

@chang-l chang-l commented Aug 12, 2025

Note: this PR only enables chunked prefill support for the llava-next and qwen2 series models.

Summary by CodeRabbit

  • New Features

    • Chunked prefill for multimodal inputs: selective encoder execution, per-item embedding outputs, and embedding caching/reuse; integrated into the Qwen2-VL flow.
  • Refactor

    • Centralized orchestration for computing and caching multimodal embeddings; runtime metadata now tracks chunk boundaries and seen/unseen token counts.
  • Tests

    • Added end-to-end option to exercise chunked prefill and comprehensive unit tests for runtime data, embedding selection, and orchestration; removed legacy KV-cache reuse tests.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
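
For example, typical invocations composed only from the flags documented above might look like this:

/bot run
/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast
/bot run --reuse-test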

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

@chang-l chang-l requested review from a team as code owners August 12, 2025 23:21
@chang-l chang-l requested review from byshiue and 2ez4bz August 12, 2025 23:21
Contributor

coderabbitai bot commented Aug 12, 2025

Caution

Review failed

The head commit changed during the review from 920c384 to a6befdc.

📝 Walkthrough

Refactors multimodal runtime and embedding utilities to support KV-cache reuse and chunked prefill: adds get_multimodal_embeddings and caching helpers, changes find_input_mm_embeds to return per-parameter slices, updates MultimodalRuntimeData fields/logic, adjusts model/engine call-sites, and replaces/extends unit tests and integration parameters.

Changes

Cohort / File(s) | Summary

  • Multimodal utils refactor (tensorrt_llm/_torch/models/modeling_multimodal_utils.py): Adds _get_active_multimodal_params, _cache_multimodal_embeddings, get_multimodal_embeddings; updates find_input_mm_embeds to return List[torch.Tensor] and handle chunked prefill and KV-cache reuse; removes find_uncached_mm_embeds; updates docs/comments.
  • Qwen2VL model updates (tensorrt_llm/_torch/models/modeling_qwen2vl.py): Replaces direct encoder.forward usage with get_multimodal_embeddings(encoder_forward_fn, multimodal_params); imports/uses find_input_mm_embeds instead of find_uncached_mm_embeds; preserves the DISAGG legacy path comment.
  • Runtime dataclass refactor (tensorrt_llm/inputs/multimodal.py): MultimodalRuntimeData API changes: renames num_cached_tokens to past_seen_token_num; adds chunk_end_pos, num_unseen_mm_tokens, num_mm_tokens; removes prompt_tokens, num_cached_mm_tokens; updates __post_init__ logic and validations to compute unseen/current-chunk counts per chunk.
  • Engine call-site update (tensorrt_llm/_torch/pyexecutor/model_engine.py): _prepare_tp_inputs now constructs MultimodalRuntimeData with past_seen_token_num and chunk_end_pos=end_compute (replacing the prior num_cached_tokens use).
  • Integration tests param (tests/integration/defs/test_e2e.py): Adds enable_chunked_prefill test parameterization; gates the quickstart multimodal test for supported models and modalities; adjusts CLI flags and max_num_tokens when chunked prefill is enabled.
  • Unit tests removed (tests/unittest/_torch/multimodal/test_kvcache_reuse.py): Removes legacy KV-cache reuse unit tests tied to the old runtime API and helpers.
  • Unit tests added (tests/unittest/_torch/multimodal/test_multimodal_runtime.py): Adds comprehensive tests for MultimodalRuntimeData, find_input_mm_embeds, and get_multimodal_embeddings covering chunked prefill, caching behavior, batching modes, error conditions, and cross-device scenarios.

Sequence Diagram(s)

sequenceDiagram
  participant Model as Qwen2VL.forward
  participant Utils as get_multimodal_embeddings
  participant Sel as _get_active_multimodal_params
  participant Enc as encoder_forward_fn
  participant Cache as _cache_multimodal_embeddings
  participant Slice as find_input_mm_embeds

  Model->>Utils: get_multimodal_embeddings(encoder_forward_fn, multimodal_params)
  Utils->>Sel: select active multimodal_params (skip non-content / already cached)
  alt active params exist
    Utils->>Enc: encoder_forward_fn(active_params)
    Enc-->>Utils: concatenated embeddings
    Utils->>Cache: split & store per-param multimodal_embedding
  end
  Utils-->>Model: list of per-param embeddings
  Model->>Slice: find_input_mm_embeds(mm_embeds_list, multimodal_params)
  Slice-->>Model: per-param slices for KV-cache reuse / current chunk
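
For readers unfamiliar with the new API, the flow above roughly corresponds to the following Python sketch. It is an illustration only, not the TensorRT-LLM implementation: the imported helpers and their arguments come from the walkthrough and diagram, while the wrapper function and its return handling are assumptions.

from tensorrt_llm._torch.models.modeling_multimodal_utils import (
    find_input_mm_embeds, get_multimodal_embeddings)

def context_phase_sketch(encoder_forward_fn, multimodal_params):
    # Run the encoder only for items that have content and whose embeddings are
    # not already cached; per-item results are cached for later chunks / reuse.
    mm_embeds = get_multimodal_embeddings(encoder_forward_fn, multimodal_params)

    # Keep only the slices of each embedding that belong to the current chunk,
    # skipping tokens already covered by KV-cache reuse or earlier chunks.
    mm_embeds = find_input_mm_embeds(mm_embeds, multimodal_params)

    # The model then fuses these slices with the text embeddings
    # (fuse_input_embeds) before running the decoder layers.
    return mm_embeds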

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

Community want to contribute

Suggested reviewers

  • symphonylyh
  • rakib-hasan
  • yilin-void
  • QiJune
  • lfr-0531

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
tests/integration/defs/test_e2e.py (1)

2124-2125: Consider clarifying the chunked prefill support scope in code comments.

While the gating logic correctly restricts chunked prefill to Qwen2-VL models for now, this limitation and the image-only modality restriction should be documented in the code comment or docstring to help future maintainers understand the constraints.

+    # NOTE: chunked prefill is currently only supported for:
+    # - Qwen2-VL and Qwen2.5-VL models  
+    # - Image modality only (not video/mixed)
     if model_name not in ["qwen2-vl-7b-instruct", "qwen2.5-vl-7b-instruct"
                           ] and enable_chunked_prefill:
         pytest.skip(
             "Only Qwen2-VL and Qwen2-5-VL support chunked prefill for now")
     if modality != "image" and enable_chunked_prefill:
         pytest.skip("Chunked prefill is only supported for image modality")

Also applies to: 2143-2148

tests/unittest/_torch/multimodal/test_multimodal_runtime.py (3)

219-225: Clarify comments: “unseen” vs “cached” terminology is inconsistent.

In this module, num_unseen_mm_tokens represents already-seen/cached tokens to skip. The comments here say “All unseen,” which is misleading. Use “Fully cached” (or “All cached”) for clarity and consistency with the runtime data semantics.

Apply this diff:

-            self.create_multimodal_params(
-                8, 0, [8]),  # All unseen - should be skipped
+            self.create_multimodal_params(
+                8, 0, [8]),  # Fully cached - should be skipped
-            self.create_multimodal_params(
-                3, 6, [6, 3]),  # 3 unseen, 6 in current chunk
+            self.create_multimodal_params(
+                3, 6, [6, 3]),  # 3 cached, 6 in current chunk
-            self.create_multimodal_params(8, 0,
-                                          [8])  # All unseen - should be skipped
+            self.create_multimodal_params(8, 0,
+                                          [8])  # Fully cached - should be skipped
-            self.create_multimodal_params(10, 0, [10]),  # All unseen
-            self.create_multimodal_params(10, 0, [10]),  # All unseen
-            self.create_multimodal_params(10, 0, [10])  # All unseen
+            self.create_multimodal_params(10, 0, [10]),  # Fully cached
+            self.create_multimodal_params(10, 0, [10]),  # Fully cached
+            self.create_multimodal_params(10, 0, [10])  # Fully cached

Also applies to: 243-246


56-63: Fix minor docstring typos for readability.

Short, clear docstrings help future maintenance; tighten these two.

Apply this diff:

-        """Test case chunk around chunk boundaries."""
+        """Test case around chunk boundaries."""
-        """Test test chunk end is very large."""
+        """Test when chunk_end_pos is very large."""

Also applies to: 72-80


145-176: Consider an extra pass-through test for find_input_mm_embeds when runtime is None.

The implementation returns the full mm_embeds when multimodal_params is empty or when multimodal_runtime is None. Adding a direct unit test for that path would lock in behavior and prevent regressions.

If helpful, I can draft a small test case that asserts pass-through when multimodal_runtime is None for one or more params.

Also applies to: 177-212, 213-235, 236-272, 273-306, 307-321, 323-344, 345-363

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd9a6dd and 183d3f0.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
  • tensorrt_llm/inputs/multimodal.py (2 hunks)
  • tests/integration/defs/test_e2e.py (3 hunks)
  • tests/unittest/_torch/multimodal/test_kvcache_reuse.py (0 hunks)
  • tests/unittest/_torch/multimodal/test_multimodal_runtime.py (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/multimodal/test_kvcache_reuse.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/unittest/_torch/multimodal/test_multimodal_runtime.py
  • tensorrt_llm/_torch/models/modeling_multimodal_utils.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tensorrt_llm/inputs/multimodal.py
  • tests/integration/defs/test_e2e.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/unittest/_torch/multimodal/test_multimodal_runtime.py
  • tensorrt_llm/_torch/models/modeling_multimodal_utils.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tensorrt_llm/inputs/multimodal.py
  • tests/integration/defs/test_e2e.py
🧬 Code Graph Analysis (3)
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (2)
tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2)
  • find_input_mm_embeds (153-222)
  • get_multimodal_embeddings (96-150)
tensorrt_llm/inputs/multimodal.py (2)
  • MultimodalParams (168-426)
  • MultimodalRuntimeData (86-164)
tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2)
tensorrt_llm/inputs/multimodal.py (2)
  • MultimodalParams (168-426)
  • has_content (424-426)
tensorrt_llm/logger.py (2)
  • debug (143-144)
  • warning (131-132)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
tensorrt_llm/_torch/models/modeling_multimodal_utils.py (3)
  • find_input_mm_embeds (153-222)
  • fuse_input_embeds (225-284)
  • get_multimodal_embeddings (96-150)
🪛 Ruff (0.12.2)
tensorrt_llm/inputs/multimodal.py

118-118: Line too long (148 > 120)

(E501)


163-163: Line too long (193 > 120)

(E501)

🔇 Additional comments (12)
tests/integration/defs/test_e2e.py (1)

2269-2277: LGTM! Consistent application of chunked prefill configuration.

The chunked prefill logic is properly handled with appropriate max_num_tokens settings based on the model and modality combination. The special handling for video modality with larger token counts (16384) is correct for Qwen2-VL models.

tensorrt_llm/inputs/multimodal.py (2)

86-100: Well-designed data structure for chunked prefill support.

The updated MultimodalRuntimeData correctly tracks multimodal token processing state with:

  • Renamed past_seen_token_num for clarity
  • New chunk_end_pos for chunked prefill boundaries
  • Computed fields for unseen and current chunk tokens

The docstring clearly explains the purpose of each field.

113-164: Complex but correct chunked prefill token counting logic.

The __post_init__ implementation correctly handles:

  1. Automatic total_mm_tokens calculation when not provided
  2. Complex overlap scenarios between cached tokens and current chunk
  3. Proper validation that combined counts don't exceed total tokens

The nested conditionals correctly handle all edge cases for partial overlaps between cached/chunk boundaries.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

1239-1244: Correctly updated MultimodalRuntimeData construction for chunked prefill.

The changes properly pass:

  • past_seen_token_num (renamed from num_cached_tokens)
  • New chunk_end_pos parameter set to end_compute

This aligns with the updated MultimodalRuntimeData structure for chunked prefill support.
tensorrt_llm/_torch/models/modeling_multimodal_utils.py (4)

33-58: Well-structured helper for identifying active multimodal parameters.

The _get_active_multimodal_params function cleanly filters parameters that need encoder processing by:

  1. Skipping parameters without content
  2. Skipping already-cached embeddings
  3. Providing debug logging for cache hits

This efficiently avoids redundant encoder computations.


60-94: Efficient embedding caching with proper validation.

The _cache_multimodal_embeddings function:

  1. Uses torch.split for efficient tensor splitting
  2. Validates total length matches expected
  3. Stores embeddings back to multimodal_data for reuse
  4. Includes helpful debug logging

The use of torch.split is more efficient than manual indexing.
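
As a tiny, self-contained illustration of that torch.split pattern (the shapes and variable names here are invented for the example; only the splitting-and-validation idea mirrors the helper):

import torch

# Concatenated encoder output for a batch with three multimodal items.
encoder_output = torch.randn(18, 4096)   # assumed: 8 + 6 + 4 tokens, hidden size 4096
mm_token_lengths = [8, 6, 4]             # per-item multimodal token counts

assert encoder_output.shape[0] == sum(mm_token_lengths)  # total-length validation
per_item_embeddings = torch.split(encoder_output, mm_token_lengths, dim=0)
# Each per-item tensor would then be stored back on its multimodal_params entry
# so later chunks (or KV-cache reuse) can skip re-running the encoder.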


96-151: Well-orchestrated multimodal embedding management.

The get_multimodal_embeddings function provides a clean high-level API that:

  1. Identifies active parameters needing processing
  2. Runs encoder only on uncached parameters
  3. Caches results when runtime data is available
  4. Assembles all embeddings for the batch

The early returns and validation checks handle edge cases appropriately.


153-222: Comprehensive slicing logic for chunked prefill scenarios.

The updated find_input_mm_embeds now returns List[torch.Tensor] and correctly handles:

  1. Both KV cache reuse and chunked prefill
  2. Two batching modes (individual vs pre-concatenated)
  3. Complex slicing based on num_unseen_mm_tokens and num_mm_tokens
  4. Empty list return when no tokens need processing

The documentation clearly explains the behavior for each scenario.
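
To make the slicing concrete, here is a hedged sketch of the per-item (non-concatenated) batching mode only; it is not the actual helper. The runtime field names follow the walkthrough and review comments (note that, despite its name, num_unseen_mm_tokens counts tokens already seen/cached), while the function name and loop structure are assumptions:

from typing import List

import torch

def slice_mm_embeds_for_chunk(mm_embeds: List[torch.Tensor],
                              multimodal_params) -> List[torch.Tensor]:
    sliced = []
    for embeds, param in zip(mm_embeds, multimodal_params):
        runtime = param.multimodal_runtime
        if runtime.num_mm_tokens == 0:
            continue  # item is fully cached or outside this chunk; nothing to do
        start = runtime.num_unseen_mm_tokens      # cached prefix to skip
        end = start + runtime.num_mm_tokens       # tokens belonging to this chunk
        sliced.append(embeds[start:end])
    return sliced  # empty list when no multimodal tokens need processing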

tensorrt_llm/_torch/models/modeling_qwen2vl.py (2)

22-23: Import changes align with the new multimodal utilities API.

The imports correctly reflect the refactored API:

  • Removed find_uncached_mm_embeds
  • Added get_multimodal_embeddings for the new embedding orchestration
  • Kept find_input_mm_embeds and fuse_input_embeds

616-630: Clean integration with the new multimodal embedding API.

The forward method now properly uses:

  1. get_multimodal_embeddings with encoder_forward_fn for clean separation of concerns
  2. find_input_mm_embeds (renamed from find_uncached_mm_embeds) for slicing
  3. Proper handling of context requests only

The DISAGG path comment correctly notes it's a dead path.

tests/unittest/_torch/multimodal/test_multimodal_runtime.py (2)

396-659: Strong coverage of multimodal embedding orchestration and caching paths.

The tests for get_multimodal_embeddings comprehensively cover empty inputs, all-uncached, all-cached, mixed caching, missing runtime/total_mm_tokens, multi-modality early return, correct splitting via torch.split, and CUDA device handling. This is solid and aligns with the updated API behavior.


1-6: Add SPDX copyright header to test_multimodal_runtime.py

Please prepend the standard NVIDIA SPDX header for 2025 to the top of tests/unittest/_torch/multimodal/test_multimodal_runtime.py. For consistency with existing Python tests (e.g., tests/unittest/_torch/test_group_rmn_norm.py), it should read:

--- a/tests/unittest/_torch/multimodal/test_multimodal_runtime.py
+++ b/tests/unittest/_torch/multimodal/test_multimodal_runtime.py
@@ -1,2 +1,4 @@
+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
  from typing import List
  from unittest.mock import Mock
⛔ Skipped due to learnings
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (2)

13-85: Clarify “unseen” semantics in comments to match slicing behavior.

Across these tests, “num_unseen_mm_tokens” is used as the prefix-skip count (i.e., tokens already seen/cached before the current chunk). The term “unseen” can be read as the opposite. Consider tightening comments/docstrings to avoid confusion for future readers.

Example rewording in comments:

  • “X unseen” → “skip X cached (prefix) tokens”
  • “all unseen” → “all tokens cached (0 tokens in current chunk)”

396-404: Optional: remove pytest entrypoint or keep consistently across test modules.

pytest auto-discovers tests; the explicit main entrypoint is not required. Keep it only if this repository follows that pattern consistently.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 183d3f0 and 5c437a2.

📒 Files selected for processing (1)
  • tests/unittest/_torch/multimodal/test_multimodal_runtime.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/unittest/_torch/multimodal/test_multimodal_runtime.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/unittest/_torch/multimodal/test_multimodal_runtime.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (1)

127-135: Resolved: batched slicing now sets total_mm_tokens on runtime mock.

Thanks for addressing the earlier review note; setting runtime.total_mm_tokens = sum(mm_token_lengths) unblocks batched slicing in find_input_mm_embeds when len(mm_embeds) == 1.

@chang-l chang-l changed the title to [TRTLLM-6903][feat] Support chunked prefill for multimodal models (Qwen2) within pytorch backend Aug 12, 2025
@chang-l chang-l force-pushed the mm_chunked_prefill branch from 5c437a2 to bfcf14a Compare August 12, 2025 23:43
@chang-l
Collaborator Author

chang-l commented Aug 12, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15030 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
tests/unittest/_torch/multimodal/test_multimodal_runtime.py (4)

1-1: Missing NVIDIA copyright header (2025).

Per the coding guidelines, prepend the current-year NVIDIA copyright header to all source files.

Apply this diff at the top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 from typing import List

127-135: Add runtime.total_mm_tokens to the mock for concatenated mode slicing.

The find_input_mm_embeds function expects runtime.total_mm_tokens when processing pre-concatenated embeddings (len(mm_embeds) == 1), but the mock currently doesn't set it. This will cause errors during concatenated mode tests.

The mock already sets total_mm_tokens on line 133, which correctly addresses this issue.


269-272: Fix direct list comparison of tensors to avoid ambiguous boolean evaluation.

Comparing lists of tensors with == triggers element-wise tensor comparison, which raises "Boolean value of Tensor is ambiguous" errors in PyTorch.

Apply this diff:

         result = find_input_mm_embeds(mm_embeds, multimodal_params)
 
         # Should return the full embeddings
-        assert result == mm_embeds
+        assert len(result) == 1
+        torch.testing.assert_close(result[0], mm_embeds[0])

329-333: Fix tensor equality check in single-batch scenario.

Same issue as above - direct list comparison will fail with PyTorch tensors.

Apply this diff:

         result = find_input_mm_embeds(mm_embeds, multimodal_params)
-        assert result == mm_embeds
+        assert len(result) == 1
+        torch.testing.assert_close(result[0], mm_embeds[0])
🧹 Nitpick comments (2)
tensorrt_llm/inputs/multimodal.py (2)

118-118: Consider wrapping long error message for readability.

The error message exceeds the 120-character line limit. While functional, consider wrapping for better readability.

Apply this diff:

-                f"mm_token_positions ({len(self.mm_token_positions)}) and mm_token_lengths ({len(self.mm_token_lengths)}) must have the same length"
+                f"mm_token_positions ({len(self.mm_token_positions)}) and "
+                f"mm_token_lengths ({len(self.mm_token_lengths)}) must have the same length"

163-164: Consider wrapping long error message for readability.

The error message significantly exceeds the 120-character line limit.

Apply this diff:

-                f"num_unseen_mm_tokens ({self.num_unseen_mm_tokens}) + num_mm_tokens ({self.num_mm_tokens}) must be less than or equal to sum of mm_token_lengths ({sum(self.mm_token_lengths)})"
+                f"num_unseen_mm_tokens ({self.num_unseen_mm_tokens}) + num_mm_tokens ({self.num_mm_tokens}) "
+                f"must be less than or equal to sum of mm_token_lengths ({sum(self.mm_token_lengths)})"
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5c437a2 and bfcf14a.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
  • tensorrt_llm/inputs/multimodal.py (2 hunks)
  • tests/integration/defs/test_e2e.py (3 hunks)
  • tests/unittest/_torch/multimodal/test_kvcache_reuse.py (0 hunks)
  • tests/unittest/_torch/multimodal/test_multimodal_runtime.py (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/multimodal/test_kvcache_reuse.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tensorrt_llm/_torch/models/modeling_multimodal_utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/inputs/multimodal.py
  • tests/unittest/_torch/multimodal/test_multimodal_runtime.py
  • tests/integration/defs/test_e2e.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/inputs/multimodal.py
  • tests/unittest/_torch/multimodal/test_multimodal_runtime.py
  • tests/integration/defs/test_e2e.py
🧠 Learnings (1)
📚 Learning: 2025-08-12T10:28:57.288Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.288Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files

Applied to files:

  • tests/unittest/_torch/multimodal/test_multimodal_runtime.py
🪛 Ruff (0.12.2)
tensorrt_llm/inputs/multimodal.py

118-118: Line too long (148 > 120)

(E501)


163-163: Line too long (193 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/defs/test_e2e.py (3)

2124-2124: LGTM! Parameterization for chunked prefill testing added.

The addition of the enable_chunked_prefill parameter enables comprehensive testing of the new chunked prefill feature for multimodal models.


2143-2149: Good gating logic for unsupported models and modalities.

The implementation correctly restricts chunked prefill to only Qwen2-VL models and image modality, preventing incorrect usage with unsupported configurations.


2273-2277: LGTM! Conditional max_num_tokens configuration for chunked prefill.

The logic correctly sets max_num_tokens to 256 when chunked prefill is enabled, aligning with the expected behavior for chunked processing.

tensorrt_llm/inputs/multimodal.py (4)

93-97: LGTM! Clear documentation of KV cache reuse and chunked prefill attributes.

The updated docstring clearly explains the new fields and their purpose in both KV cache reuse and chunked prefill scenarios.


114-115: Logical auto-calculation of total_mm_tokens.

Good defensive programming - automatically calculating total_mm_tokens when not provided ensures consistency.


121-124: Appropriate validation for past_seen_token_num.

The non-negative check ensures the field maintains a valid state.


136-159: Complex but correct token counting logic for chunked prefill.

The implementation properly handles all overlap scenarios between multimodal token spans and chunk boundaries:

  • Fully cached tokens (before past_seen_token_num)
  • Partially cached tokens (spanning past_seen_token_num)
  • Tokens in current chunk (between past_seen_token_num and chunk_end_pos)
  • Tokens beyond current chunk

The logic correctly accumulates num_unseen_mm_tokens for cached tokens and num_mm_tokens for tokens in the current chunk.
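
As a standalone illustration of that overlap accounting (this is not the actual __post_init__; the attribute names follow the dataclass, while the helper function itself is hypothetical):

def count_mm_tokens_for_chunk(mm_token_positions, mm_token_lengths,
                              past_seen_token_num, chunk_end_pos):
    """Count cached vs. current-chunk multimodal tokens for one request."""
    num_unseen_mm_tokens = 0  # despite the name: tokens already seen/cached
    num_mm_tokens = 0         # tokens that land inside the current chunk
    for pos, length in zip(mm_token_positions, mm_token_lengths):
        item_end = pos + length
        # Overlap with the already-processed prefix [0, past_seen_token_num).
        num_unseen_mm_tokens += max(0, min(item_end, past_seen_token_num) - pos)
        # Overlap with the current chunk [past_seen_token_num, chunk_end_pos).
        overlap = min(item_end, chunk_end_pos) - max(pos, past_seen_token_num)
        num_mm_tokens += max(0, overlap)
    return num_unseen_mm_tokens, num_mm_tokens

# Example: an 8-token image starting at position 5, with past_seen_token_num=9
# and chunk_end_pos=12, yields 4 cached tokens and 3 tokens in the current chunk;
# the remaining token is handled by a later chunk.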

@tensorrt-cicd
Collaborator

PR_Github #15030 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11352 completed with status: 'FAILURE'

Collaborator

@rakib-hasan rakib-hasan left a comment

Added some minor comments.

Collaborator

@mikeiovine mikeiovine left a comment

Signing off on pyexecutor changes on behalf of trt-llm-torch-runtime-devs. I did not review modeling changes, will leave that to the qwen devs.

Collaborator

@yechank-nvidia yechank-nvidia left a comment

Thx for the work.

Collaborator

@jaedeok-nvidia jaedeok-nvidia left a comment

Thanks for supporting the chunked prefill for MM models. This will be an essential feature for improving MM performance.

@chang-l chang-l force-pushed the mm_chunked_prefill branch from 920c384 to a6befdc Compare August 18, 2025 04:28
@chang-l
Collaborator Author

chang-l commented Aug 18, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15580 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15580 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11731 completed with status: 'FAILURE'

@chang-l
Collaborator Author

chang-l commented Aug 18, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15587 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15587 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11736 completed with status: 'FAILURE'

@chang-l chang-l force-pushed the mm_chunked_prefill branch 2 times, most recently from 216a3a7 to 1b6d4e6 Compare September 10, 2025 04:16
@chang-l chang-l requested a review from a team as a code owner September 10, 2025 04:16
@tensorrt-cicd
Collaborator

PR_Github #18308 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13727 completed with status: 'FAILURE'

@chang-l
Collaborator Author

chang-l commented Sep 10, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18411 [ run ] triggered by Bot

@chang-l chang-l changed the title from "[TRTLLM-6903][feat] Support chunked prefill for multimodal models (Qwen2) within pytorch backend" to "[TRTLLM-6903][feat] Support chunked prefill for multimodal models" Sep 10, 2025
@tensorrt-cicd
Collaborator

PR_Github #18411 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13810 completed with status: 'FAILURE'

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
…lama4)

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
@chang-l
Collaborator Author

chang-l commented Sep 11, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18445 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18445 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13838 completed with status: 'FAILURE'

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
@chang-l
Collaborator Author

chang-l commented Sep 12, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18457 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18457 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13847 completed with status: 'FAILURE'

@chang-l
Collaborator Author

chang-l commented Sep 12, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18462 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18462 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13851 completed with status: 'FAILURE'

@chang-l
Collaborator Author

chang-l commented Sep 12, 2025

/bot run --reuse-test

@tensorrt-cicd
Collaborator

PR_Github #18477 [ run ] triggered by Bot
