[TRTLLM-7440][fix] Split `fused_input_embed` to separate out host sync #7280

chang-l · 2025-08-27T05:45:03Z

Summary by CodeRabbit

New Features
- Token filtering for multimodal inputs and an orchestrated embedding-fusion step.
- Models now forward extra kwargs to the fusion step; callers can supply precomputed indices.
Tests
- Added CPU/CUDA unit tests covering token filtering, fusion success/failure, and kwargs precedence.
Documentation
- Clarified text/mm token indices usage and host-device synchronization considerations.
Refactor
- Skip compilation overhead for embedding path on single-GPU runs.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai · 2025-08-27T05:45:10Z

📝 Walkthrough

Walkthrough

Adds multimodal token filtering and refactors embedding fusion to accept precomputed indices and **kwargs; propagates **kwargs to fuse_input_embeds across multiple model forwards; integrates multimodal-index preparation into the executor; adds unit tests; conditions torch.compile usage in Embedding.forward on tp_size.

Changes

Cohort / File(s)	Change summary
Multimodal fusion utilities `tensorrt_llm/_torch/models/modeling_multimodal_utils.py`	Adds `filter_mm_token_from_input_ids`; refactors `fuse_input_embeds` to accept/derive `text_token_indices` and `mm_token_indices`, accept `**kwargs`, validate multimodal counts, and build fused `input_embeds` (returns either `(input_ids, None)` or `(None, input_embeds)`).
Model forwards kwargs propagation `tensorrt_llm/_torch/models/modeling_mistral.py`, `.../modeling_gemma3vl.py`, `.../modeling_hyperclovax.py`, `.../modeling_llama.py`, `.../modeling_llava_next.py`, `.../modeling_phi4mm.py`, `.../modeling_qwen2vl.py`, `.../modeling_vila.py`	Forward `**kwargs` into `fuse_input_embeds` calls inside `forward()` methods; no other logic changes.
Executor integration `tensorrt_llm/_torch/pyexecutor/model_engine.py`	Imports `filter_mm_token_from_input_ids`; adds `_prepare_multimodal_indices(self, input_ids)`; prepares and forwards `text_token_indices`/`mm_token_indices` during `_prepare_tp_inputs`, pinning and CUDA-transferring when multimodal present.
Embedding compile gating `tensorrt_llm/_torch/modules/embedding.py`	Use `torch.compile` for `pre_comm_embedding_ops` only when `tp_size > 1`; for `tp_size == 1` call the pre-comm ops directly to avoid compile overhead.
Unit tests `tests/unittest/_torch/multimodal/test_fuse_input_embeds.py`	Adds tests for `filter_mm_token_from_input_ids` and `fuse_input_embeds`: OOV path, explicit IDs, empty-mm path, mismatch error, kwargs-precedence; parametrized for CPU and CUDA (if available).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller as Model.forward(...)
  participant Fuser as fuse_input_embeds
  participant Filter as filter_mm_token_from_input_ids
  participant Emb as Embedding Layer

  Caller->>Fuser: fuse_input_embeds(embedding_layer, input_ids, mm_embeds, mm_token_ids, **kwargs)
  alt Precomputed indices provided
    Fuser->>Fuser: use provided text_token_indices & mm_token_indices
  else No indices provided
    Fuser->>Filter: filter_mm_token_from_input_ids(input_ids, vocab_size, mm_token_ids)
    Filter-->>Fuser: text_token_indices, mm_token_indices
  end
  Fuser->>Fuser: validate mm count == mm_embeds rows
  Fuser->>Emb: embed(input_ids[text_token_indices])
  Emb-->>Fuser: text_embeds
  Fuser->>Fuser: assemble input_embeds (text at text indices, mm at mm indices)
  alt multimodal present
    Fuser-->>Caller: (None, input_embeds)
  else no multimodal
    Fuser-->>Caller: (input_ids, None)
  end

sequenceDiagram
  autonumber
  actor Host as PyExecutor
  participant Prep as _prepare_tp_inputs
  participant MM as _prepare_multimodal_indices
  participant Filter as filter_mm_token_from_input_ids
  participant Model as Model.forward
  participant Fuser as fuse_input_embeds

  Host->>Prep: _prepare_tp_inputs(...)
  Prep->>MM: if multimodal present, call _prepare_multimodal_indices(input_ids)
  MM->>Filter: filter_mm_token_from_input_ids(cpu_input_ids, vocab_size, mm_token_ids?)
  Filter-->>MM: text_token_indices, mm_token_indices
  MM-->>Prep: indices
  Prep->>Model: inputs + pinned/moved indices
  Model->>Fuser: fuse_input_embeds(..., **kwargs incl. indices)
  Fuser-->>Model: (ids_or_none, embeds_or_none)
  Model-->>Host: outputs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

[TRTLLM-7326][feat] Add standalone multimodal encoder #6743 — related multimodal input handling and executor/model coordination changes (similar index preparation and fusion concerns).

Suggested labels

SW Architecture

Suggested reviewers

Wanli-Jiang
byshiue
QiJune
yechank-nvidia
brb-nv

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (1)

134-176: Robust CUDA-based embedding fusion implementation.

The function correctly handles:

Early return for empty multimodal embeddings
Token count validation with clear error message
Efficient tensor allocation and population
Proper device and dtype handling for embeddings

The implementation eliminates host synchronization from torch.where operations by accepting precomputed indices.

Consider breaking up the long lines (142, 145, 150, 151) to improve readability and comply with the 120-character line limit mentioned in static analysis.

def fuse_input_embeds_cuda(
    embedding_layer: Embedding,
    input_ids: torch.IntTensor,
    text_token_indices: torch.IntTensor,
    mm_token_indices: torch.IntTensor,
    mm_embeds: List[torch.Tensor],
) -> Tuple[Optional[torch.FloatTensor], Optional[torch.FloatTensor]]:
    """
-    Fuse text and multimodal embeddings. input_ids is [text_total_length + mm_total_length] and mm_embed is [mm_total_length, hidden_dim]. We just need to fuse them into [text_total_length + mm_total_length, hidden_dim] by slice-and-assign to the corresponding entries.
+    Fuse text and multimodal embeddings. input_ids is [text_total_length + mm_total_length] 
+    and mm_embed is [mm_total_length, hidden_dim]. We just need to fuse them into 
+    [text_total_length + mm_total_length, hidden_dim] by slice-and-assign to the corresponding entries.

    Args:
-        input_ids: shape [text_total_length + mm_total_length], flattened from List[(text_length1 + mm_total_length1), ..., (text_lengthi + mm_total_lengthi)]. For LLM model, the requests are inflight batched together, but the input_ids are flattened with padding removed. By the slice condition < vocab_size, we can easily separate text / multimodal tokens and naturally batched the LLM embedding lookup
+        input_ids: shape [text_total_length + mm_total_length], flattened from 
+            List[(text_length1 + mm_total_length1), ..., (text_lengthi + mm_total_lengthi)]. 
+            For LLM model, the requests are inflight batched together, but the input_ids are 
+            flattened with padding removed.
        text_token_indices: indices of text tokens in the input_ids
        mm_token_indices: indices of multimodal tokens in the input_ids
-        mm_embeds: List[(mm_total_length1, hidden_dim), ..., (mm_total_lengthi, hidden_dim)].
+        mm_embeds: List[(mm_total_length1, hidden_dim), ..., (mm_total_lengthi, hidden_dim)].
    Returns:
-        - If (1) JIT test run, (2) non-multimodal run, i.e. all text-only requests, either context or generation phase (3) multimodal run, all requests in generation phase --> there is no multimodal data, return only the input_ids
-        - If (4) multimodal run, mixed batch of context and generation requests, each context request has a multimodal feature --> return only the fused input_embeds of shape [total length, hidden_dim]. For text tokens, LLM embedding layer has already run.
+        - If (1) JIT test run, (2) non-multimodal run, i.e. all text-only requests, 
+            either context or generation phase (3) multimodal run, all requests in generation phase 
+            --> there is no multimodal data, return only the input_ids
+        - If (4) multimodal run, mixed batch of context and generation requests, 
+            each context request has a multimodal feature --> return only the fused input_embeds 
+            of shape [total length, hidden_dim]. For text tokens, LLM embedding layer has already run.
    """

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between abdb273 and e4fead7.

📒 Files selected for processing (4)

tensorrt_llm/_torch/models/modeling_mistral.py (2 hunks)
tensorrt_llm/_torch/models/modeling_multimodal_utils.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4 hunks)
tensorrt_llm/inputs/multimodal.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

tensorrt_llm/inputs/multimodal.py
tensorrt_llm/_torch/models/modeling_multimodal_utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_mistral.py

**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

tensorrt_llm/inputs/multimodal.py
tensorrt_llm/_torch/models/modeling_multimodal_utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_mistral.py

🧬 Code graph analysis (3)

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (1)

tensorrt_llm/_torch/modules/embedding.py (1)

Embedding (164-242)

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (1)

filter_mm_token_from_input_ids (108-131)

tensorrt_llm/_torch/models/modeling_mistral.py (1)

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2)

fuse_input_embeds (179-238)

fuse_input_embeds_cuda (134-176)

🪛 Ruff (0.12.2)

tensorrt_llm/_torch/models/modeling_multimodal_utils.py

142-142: Line too long (269 > 120)

(E501)

145-145: Line too long (404 > 120)

(E501)

150-150: Line too long (230 > 120)

(E501)

151-151: Line too long (256 > 120)

(E501)

🔇 Additional comments (8)

tensorrt_llm/inputs/multimodal.py (1)

191-192: LGTM - Adding indices for CUDA-based embedding fusion.

The new fields text_token_indices and mm_token_indices enable the CUDA-based embedding fusion path by carrying precomputed indices through the pipeline. The fields are properly typed as Optional[torch.Tensor] with default values of None, maintaining backward compatibility.

tensorrt_llm/_torch/pyexecutor/model_engine.py (4)

50-50: LGTM - Import aligns with multimodal enhancement.

The import of filter_mm_token_from_input_ids from the modeling utils supports the new CUDA-based multimodal embedding fusion feature.

1134-1144: Clean implementation for multimodal token index preparation.

The method correctly:

Converts input_ids to CPU tensor for processing

Retrieves vocab_size from model config

Uses model-specific image token IDs when available

Delegates actual filtering to the utility function

1209-1212: Properly conditionalizes multimodal index computation.

The logic correctly computes multimodal indices only when multimodal parameters are present, avoiding unnecessary computation for text-only requests.

1581-1587: Efficient CUDA tensor preparation with proper memory management.

The code correctly:

Creates a boolean mask to identify text tokens

Uses pin_memory() and non_blocking=True for efficient CPU-to-GPU transfers

Maintains the same tensor preparation pattern as other inputs

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (1)

108-131: Well-designed function for multimodal token filtering.

The function provides an efficient way to separate text and multimodal tokens with:

Performance optimization when mm_token_ids is None by using vocab_size comparison instead of torch.isin()

Clear documentation explaining the performance trade-off

Proper device handling for mm_token_ids

Clean return of both text and multimodal token indices

tensorrt_llm/_torch/models/modeling_mistral.py (2)

18-18: LGTM - Import supports new CUDA embedding fusion path.

The additional import of fuse_input_embeds_cuda enables the new CUDA-based embedding fusion functionality while maintaining backward compatibility with the existing fuse_input_embeds function.

395-413: Well-implemented conditional CUDA embedding fusion with fallback.

The implementation correctly:

Checks for the presence of both text_token_indices and mm_token_indices to determine which fusion path to use

Uses the new CUDA-based fuse_input_embeds_cuda when indices are available, avoiding host synchronization

Falls back to the existing fuse_input_embeds with mm_token_ids for backward compatibility

Maintains the same interface and return values for both paths

The TODO comment appropriately indicates this is a transitional implementation.

tensorrt_llm/_torch/models/modeling_multimodal_utils.py

tensorrt_llm/_torch/pyexecutor/model_engine.py

tensorrt_llm/_torch/models/modeling_multimodal_utils.py

tensorrt_llm/_torch/pyexecutor/model_engine.py

tensorrt_llm/_torch/models/modeling_multimodal_utils.py

rakib-hasan

LGTM.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2)

108-145: Fix token dtype annotations, add input validation, and clarify host-sync note

Indices in PyTorch are torch.long by convention; annotating with torch.IntTensor is misleading and may trip static/type checks.
Add lightweight shape/dtype guards to fail fast.
The note about torch.where “requiring allocation on host” is inaccurate; it returns a tensor on the same device. There can be sync, but not mandatory host allocation. Rephrase.

Apply:

-def filter_mm_token_from_input_ids(
-    input_ids: torch.IntTensor,
-    vocab_size: int,
-    mm_token_ids: Optional[torch.IntTensor] = None,
-) -> Tuple[torch.IntTensor, torch.IntTensor]:
+def filter_mm_token_from_input_ids(
+    input_ids: torch.Tensor,
+    vocab_size: int,
+    mm_token_ids: Optional[torch.Tensor] = None,
+) -> Tuple[torch.Tensor, torch.Tensor]:
@@
-    Note:
-        This function involves host device sync due to the use of torch.where() (= torch.nonzero) which requires allocation on host.
-        The output indices reside on the same device as input_ids.
+    Note:
+        The outputs reside on the same device as `input_ids`. `torch.where` may introduce sync, but it does not require host allocation.
@@
-    if mm_token_ids is None:
+    if input_ids.dim() != 1:
+        raise ValueError("input_ids must be 1D (flattened).")
+    if input_ids.dtype != torch.long:
+        raise TypeError(f"input_ids dtype must be torch.long, got {input_ids.dtype}.")
+    if mm_token_ids is None:
@@
-    else:
-        mm_token_ids = mm_token_ids.to(input_ids.device)
+    else:
+        if mm_token_ids.dim() != 1:
+            raise ValueError("mm_token_ids must be 1D.")
+        mm_token_ids = mm_token_ids.to(device=input_ids.device, dtype=torch.long)

147-189: Correct return types and add safety checks (count coverage and hidden size)

The function returns (input_ids, None) in the text-only path, but the annotation claims FloatTensor; fix to Tensor.
Add coverage check to ensure text_token_indices ∪ mm_token_indices spans the whole input_ids and dims match the embedding layer.

-def fuse_input_embeds_cuda(
+def fuse_input_embeds_cuda(
     embedding_layer: Embedding,
-    input_ids: torch.IntTensor,
-    text_token_indices: torch.IntTensor,
-    mm_token_indices: torch.IntTensor,
+    input_ids: torch.Tensor,
+    text_token_indices: torch.Tensor,
+    mm_token_indices: torch.Tensor,
     mm_embeds: List[torch.Tensor],
-) -> Tuple[Optional[torch.FloatTensor], Optional[torch.FloatTensor]]:
+) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor]]:
@@
-    if len(mm_embeds) == 0:
-        return input_ids, None
+    if len(mm_embeds) == 0:
+        if mm_token_indices.numel() != 0:
+            raise ValueError("mm_token_indices non-empty but no mm_embeds provided.")
+        return input_ids, None
@@
-    if mm_token_indices.shape[0] != mm_embed.shape[0]:
+    if mm_token_indices.shape[0] != mm_embed.shape[0]:
         raise ValueError(
@@
-    text_embed = embedding_layer(input_ids[text_token_indices])
+    if text_token_indices.numel() + mm_token_indices.numel() != input_ids.shape[0]:
+        raise ValueError("text_token_indices and mm_token_indices do not cover all tokens in input_ids.")
+    if mm_embed.shape[-1] != embedding_layer.embedding_dim:
+        raise ValueError(
+            f"mm_embed hidden size ({mm_embed.shape[-1]}) != embedding_dim ({embedding_layer.embedding_dim})."
+        )
+    text_embed = embedding_layer(input_ids[text_token_indices])
@@
-    input_embeds[mm_token_indices, :] = mm_embed.to(dtype=input_embeds.dtype,
-                                                    device=input_embeds.device)
+    input_embeds[mm_token_indices, :] = mm_embed.to(
+        dtype=input_embeds.dtype, device=input_embeds.device
+    )

Also, the docstring mentions separating tokens via < vocab_size, which this function no longer does—consider removing that sentence.

🧹 Nitpick comments (1)

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (1)

147-189: Optional: rename to better reflect behavior, not backend

fuse_input_embeds_cuda works with tensors on any device; consider fuse_input_embeds_with_indices to avoid implying CUDA-only behavior.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 93314a0 and 594371f.

📒 Files selected for processing (5)

tensorrt_llm/_torch/models/modeling_mistral.py (2 hunks)
tensorrt_llm/_torch/models/modeling_multimodal_utils.py (3 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4 hunks)
tensorrt_llm/inputs/multimodal.py (1 hunks)
tests/unittest/_torch/multimodal/test_fuse_input_embeds.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (4)

tensorrt_llm/inputs/multimodal.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tests/unittest/_torch/multimodal/test_fuse_input_embeds.py
tensorrt_llm/_torch/models/modeling_mistral.py

🧰 Additional context used

📓 Path-based instructions (4)

**/*

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Filenames compiled into a target must be case-insensitively unique

Files:

tensorrt_llm/_torch/models/modeling_multimodal_utils.py

**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces, not tabs; indent 4 spaces

Files:

tensorrt_llm/_torch/models/modeling_multimodal_utils.py

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs (Python)
Maintain module namespace on import: prefer from package.subpackage import foo; use foo.Symbol()
Python filenames use snake_case
Python class names use PascalCase
Python functions and methods use snake_case
Python local variables use snake_case; if starting with a number concept, prefix with k (e.g., k_99th_percentile)
Python global variables use G_ prefix with UPPER_SNAKE_CASE
Python constants use UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes
Initialize all externally visible class members in init
For public interfaces, prefer docstrings over comments; comments should be for in-function or file-local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes and variables inline with docstrings immediately after assignment
Avoid reflection when a non-reflective approach suffices
Limit except clauses to specific exceptions where possible
When using try/except for duck-typing, keep try body minimal and move logic to else

Files:

tensorrt_llm/_torch/models/modeling_multimodal_utils.py

**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

tensorrt_llm/_torch/models/modeling_multimodal_utils.py

🧬 Code graph analysis (1)

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (1)

tensorrt_llm/_torch/modules/embedding.py (1)

Embedding (164-242)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

tensorrt_llm/_torch/models/modeling_multimodal_utils.py

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/models/modeling_phi4mm.py (1)

592-599: Document fuse_input_embeds kwargs and expected formats
Add/update the docstring for fuse_input_embeds to enumerate all supported keyword arguments and their required shapes/dtypes—e.g.

mm_embeds: List[Tensor] or Tensor of per‐modal embeddings

mm_token_ids (and text_token_indices if applicable): 1D torch.LongTensor matching input_ids positions, on the same device
Ensure these names don’t collide with any kwargs passed later to llm.forward. All multimodal model callsites (hyperclovax, qwen2vl, vila, phi4mm, llava_next, llama) already forward **kwargs; pure‐LLM calls omit them by design.

♻️ Duplicate comments (1)

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (1)

148-155: Unify annotations and return types with torch.Tensor

Align with upstream style and previous feedback; annotate inputs/outputs as Tensor.

-def fuse_input_embeds(
-    embedding_layer: Embedding,
-    input_ids: torch.IntTensor,
-    mm_embeds: List[torch.Tensor],
-    mm_token_ids: Optional[torch.IntTensor] = None,
-    **kwargs,
-) -> Tuple[Optional[torch.FloatTensor], Optional[torch.FloatTensor]]:
+def fuse_input_embeds(
+    embedding_layer: Embedding,
+    input_ids: torch.Tensor,
+    mm_embeds: List[torch.Tensor],
+    mm_token_ids: Optional[torch.Tensor] = None,
+    **kwargs,
+) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor]]:

🧹 Nitpick comments (13)

tensorrt_llm/_torch/modules/embedding.py (1)

210-217: Cache the (compiled) embedding ops to avoid per-forward wrapper creation

Recreating a torch.compile wrapper on every forward adds overhead (even when disabled). Cache once and reuse.

Apply within this hunk:

-        if self.tp_size > 1:
-            embedding_ops_func = torch.compile(
-                pre_comm_embedding_ops,
-                options={"max-autotune": True},
-                disable=not self.enable_torch_compile_for_embedding)
-        else:
-            # Skip torch.compile when TP size is 1 to avoid unnecessary host overhead
-            embedding_ops_func = pre_comm_embedding_ops
+        # Avoid recreating a (compiled) wrapper on each forward; use the cached function from __init__.
+        embedding_ops_func = self._embedding_ops_func

And initialize once in init (outside this hunk):

# after setting vocab_start_index/vocab_end_index
self._embedding_ops_func = pre_comm_embedding_ops
if self.tp_size > 1:
    self._embedding_ops_func = torch.compile(
        pre_comm_embedding_ops,
        options={"max-autotune": True},
        disable=not self.enable_torch_compile_for_embedding,
    )

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (3)

108-113: Generalize type hints to torch.Tensor

Prefer torch.Tensor over torch.IntTensor for annotations; improves compatibility with type checkers and mixed dtypes.

-def filter_mm_token_from_input_ids(
-    input_ids: torch.IntTensor,
-    vocab_size: int,
-    mm_token_ids: Optional[torch.IntTensor] = None,
-) -> Tuple[torch.IntTensor, torch.IntTensor]:
+def filter_mm_token_from_input_ids(
+    input_ids: torch.Tensor,
+    vocab_size: int,
+    mm_token_ids: Optional[torch.Tensor] = None,
+) -> Tuple[torch.Tensor, torch.Tensor]:

114-125: Docstring/inline note: avoid asserting host sync; phrase as “may sync”

torch.where/torch.nonzero return GPU tensors; they may cause device sync depending on backend, but do not imply host allocation.

-    Note:
-        This function involves host-device synchronization due to torch.where() (= torch.nonzero) requiring
-        host allocation. The output indices reside on the same device as input_ids.
+    Note:
+        May introduce device synchronization depending on backend implementation of where/nonzero.
+        Outputs remain on the same device as `input_ids`.
@@
-    # NOTE: torch.where() enforces a host sync
+    # NOTE: where()/nonzero may trigger a device sync on some backends

Also applies to: 142-145

166-168: Tighten “may sync” wording in fuse docstring

Keep the note accurate and concise.

-    Note:
-        This function may involve host-device synchronization if text_token_indices (cuda tensor) and mm_token_indices (cuda tensor) are not provided from kwargs. See filter_mm_token_from_input_ids for more details.
+    Note:
+        May trigger device synchronization when computing indices on CUDA.
+        Provide precomputed `text_token_indices` and `mm_token_indices` via kwargs to avoid extra sync.

tensorrt_llm/_torch/models/modeling_mistral.py (1)

330-333: Prefer torch.long for image token ids to match input_ids

Avoid downstream dtype conversions and surprises; most tokenizers produce torch.long input_ids.

-        self._image_token_ids = torch.tensor([config.image_token_index],
-                                             dtype=torch.int32,
-                                             device=self._device)
+        # Align dtype with typical `input_ids` dtype
+        self._image_token_ids = torch.tensor([config.image_token_index],
+                                             dtype=torch.long,
+                                             device=self._device)

tensorrt_llm/_torch/models/modeling_gemma3vl.py (1)

262-269: kwargs and mm_token_ids now forwarded to fusion: ensure dtype consistency

Good to pass mm_token_ids and kwargs. To avoid implicit type promotion in torch.isin/fusion paths, prefer torch.long for token IDs.

Apply:

-        self.image_token_ids = torch.tensor([config.image_token_index],
-                                            dtype=torch.int32,
-                                            device=self._device)
+        self.image_token_ids = torch.tensor(
+            [config.image_token_index], dtype=torch.long, device=self._device
+        )

tensorrt_llm/_torch/models/modeling_llama.py (2)

1212-1213: Forwarding kwargs into fuse_input_embeds is fine; consider narrowing to expected keys.

This keeps the call flexible, but it also forwards large/irrelevant objects (e.g., attn_metadata) that fuse_input_embeds doesn’t need. Passing only whitelisted args avoids accidental coupling and future TypeErrors if the callee signature tightens.
-        input_ids, inputs_embeds = fuse_input_embeds(self.model.embed_tokens,
-                                                     input_ids, mm_embeds,
-                                                     **kwargs)
+        allowed = ("mm_token_indices", "text_token_indices", "mm_token_ids")
+        fuse_kwargs = {k: kwargs[k] for k in allowed if k in kwargs}
+        input_ids, inputs_embeds = fuse_input_embeds(
+            self.model.embed_tokens, input_ids, mm_embeds, **fuse_kwargs
+        )
1-1: Missing NVIDIA copyright header.

Per repo guidelines, prepend the NVIDIA copyright header (2025) to all source files.

tensorrt_llm/_torch/models/modeling_hyperclovax.py (2)

1055-1057: Same kwargs-forwarding concern as in Llama: whitelist the ones fuse_input_embeds actually uses.

-        input_ids, input_embeds = fuse_input_embeds(self.llm.model.embed_tokens,
-                                                    input_ids, mm_embeds,
-                                                    **kwargs)
+        allowed = ("mm_token_indices", "text_token_indices", "mm_token_ids")
+        fuse_kwargs = {k: kwargs[k] for k in allowed if k in kwargs}
+        input_ids, input_embeds = fuse_input_embeds(
+            self.llm.model.embed_tokens, input_ids, mm_embeds, **fuse_kwargs
+        )

1-1: Missing NVIDIA copyright header.

Please add the standard NVIDIA header at the top of this file.

tensorrt_llm/_torch/pyexecutor/model_engine.py (3)

1193-1201: Document intent and harden mm_token_ids type before calling filter.

Add a short docstring to clarify why this runs on CPU (to avoid GPU host sync).
Ensure mm_token_ids is a Tensor; filter_mm_token_from_input_ids calls .to(), which will fail on lists.

-    def _prepare_multimodal_indices(self, input_ids: list[int]):
-        input_ids = torch.tensor(input_ids, dtype=torch.int, device="cpu")
-        vocab_size = self.model.config.vocab_size
-        # TODO: unify naming of mm_token_ids across models
-        mm_token_ids = getattr(self.model, "_image_token_ids", None)
-
-        text_token_indices, mm_token_indices = filter_mm_token_from_input_ids(
-            input_ids, vocab_size=vocab_size, mm_token_ids=mm_token_ids)
-        return text_token_indices, mm_token_indices
+    def _prepare_multimodal_indices(self, input_ids: list[int]):
+        """Compute text/mm token positions on CPU to avoid GPU host sync in torch.where/nonzero."""
+        input_ids = torch.tensor(input_ids, dtype=torch.int, device="cpu")
+        vocab_size = self.model.config.vocab_size
+        # NOTE: unify naming of mm_token_ids across models in future
+        mm_token_ids = getattr(self.model, "_image_token_ids", None)
+        if mm_token_ids is not None and not torch.is_tensor(mm_token_ids):
+            mm_token_ids = torch.as_tensor(mm_token_ids, dtype=torch.int, device="cpu")
+
+        text_token_indices, mm_token_indices = filter_mm_token_from_input_ids(
+            input_ids, vocab_size=vocab_size, mm_token_ids=mm_token_ids
+        )
+        return text_token_indices, mm_token_indices

1643-1650: Good: stage indices on host then transfer to CUDA; minor clarity nit.

Consider naming mask text_token_mask for readability, since you invert the mm positions to derive text indices.

-        if mm_token_indices is not None:
-            mask = torch.ones(total_num_tokens, dtype=torch.bool)
-            mask[mm_token_indices] = False
+        if mm_token_indices is not None:
+            text_token_mask = torch.ones(total_num_tokens, dtype=torch.bool)
+            text_token_mask[mm_token_indices] = False
             inputs['mm_token_indices'] = mm_token_indices.pin_memory().to(
                 "cuda", non_blocking=True)
-            inputs['text_token_indices'] = torch.where(mask)[0].pin_memory().to(
+            inputs['text_token_indices'] = torch.where(text_token_mask)[0].pin_memory().to(
                 "cuda", non_blocking=True)

1-1: Missing NVIDIA copyright header.

Please add the standard header at the file top.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 594371f and 580d330.

📒 Files selected for processing (11)

tensorrt_llm/_torch/models/modeling_gemma3vl.py (1 hunks)
tensorrt_llm/_torch/models/modeling_hyperclovax.py (1 hunks)
tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
tensorrt_llm/_torch/models/modeling_llava_next.py (1 hunks)
tensorrt_llm/_torch/models/modeling_mistral.py (1 hunks)
tensorrt_llm/_torch/models/modeling_multimodal_utils.py (2 hunks)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1 hunks)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1 hunks)
tensorrt_llm/_torch/models/modeling_vila.py (1 hunks)
tensorrt_llm/_torch/modules/embedding.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4 hunks)

🧰 Additional context used

📓 Path-based instructions (4)

**/*

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Filenames compiled into a target must be case-insensitively unique

Files:

tensorrt_llm/_torch/models/modeling_vila.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/_torch/modules/embedding.py
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/models/modeling_gemma3vl.py
tensorrt_llm/_torch/models/modeling_hyperclovax.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
tensorrt_llm/_torch/models/modeling_phi4mm.py
tensorrt_llm/_torch/models/modeling_mistral.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_multimodal_utils.py

**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces, not tabs; indent 4 spaces

Files:

tensorrt_llm/_torch/models/modeling_vila.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/_torch/modules/embedding.py
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/models/modeling_gemma3vl.py
tensorrt_llm/_torch/models/modeling_hyperclovax.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
tensorrt_llm/_torch/models/modeling_phi4mm.py
tensorrt_llm/_torch/models/modeling_mistral.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_multimodal_utils.py

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs (Python)
Maintain module namespace on import: prefer from package.subpackage import foo; use foo.Symbol()
Python filenames use snake_case
Python class names use PascalCase
Python functions and methods use snake_case
Python local variables use snake_case; if starting with a number concept, prefix with k (e.g., k_99th_percentile)
Python global variables use G_ prefix with UPPER_SNAKE_CASE
Python constants use UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes
Initialize all externally visible class members in init
For public interfaces, prefer docstrings over comments; comments should be for in-function or file-local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes and variables inline with docstrings immediately after assignment
Avoid reflection when a non-reflective approach suffices
Limit except clauses to specific exceptions where possible
When using try/except for duck-typing, keep try body minimal and move logic to else

Files:

tensorrt_llm/_torch/models/modeling_vila.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/_torch/modules/embedding.py
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/models/modeling_gemma3vl.py
tensorrt_llm/_torch/models/modeling_hyperclovax.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
tensorrt_llm/_torch/models/modeling_phi4mm.py
tensorrt_llm/_torch/models/modeling_mistral.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_multimodal_utils.py

**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

tensorrt_llm/_torch/models/modeling_vila.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/_torch/modules/embedding.py
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/models/modeling_gemma3vl.py
tensorrt_llm/_torch/models/modeling_hyperclovax.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
tensorrt_llm/_torch/models/modeling_phi4mm.py
tensorrt_llm/_torch/models/modeling_mistral.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_multimodal_utils.py

🧬 Code graph analysis (3)

tensorrt_llm/_torch/modules/embedding.py (1)

tensorrt_llm/_torch/distributed/communicator.py (1)

tp_size (46-47)

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (1)

filter_mm_token_from_input_ids (108-145)

tensorrt_llm/_torch/models/modeling_multimodal_utils.py (1)

tensorrt_llm/_torch/modules/embedding.py (1)

Embedding (164-246)

🔇 Additional comments (5)

tensorrt_llm/_torch/models/modeling_mistral.py (1)

400-400: LGTM: forwarding kwargs to fusion path

Passing through **kwargs enables precomputed indices without changing public signatures.

tensorrt_llm/_torch/models/modeling_llava_next.py (1)

487-489: kwargs isolation confirmed

fuse_input_embeds correctly consumes only text_token_indices/mm_token_indices from **kwargs, and llm.forward is called with explicit parameters (attn_metadata, input_ids, position_ids, inputs_embeds, return_context_logits), so no unexpected kwargs leak.

tensorrt_llm/_torch/pyexecutor/model_engine.py (3)

51-51: LGTM: imports updated for filter_mm_token_from_input_ids.

1266-1270: Computing indices only when multimodal content exists is appropriate.

This avoids unnecessary CPU work on pure-text batches.

1193-1201: No action needed: _image_token_ids is consistently a Tensor
Verified in tensorrt_llm/_torch/models/modeling_mistral.py that _image_token_ids is set via torch.tensor(...), and no other definitions exist.

tensorrt_llm/_torch/models/modeling_multimodal_utils.py

tensorrt_llm/_torch/models/modeling_qwen2vl.py

tensorrt_llm/_torch/models/modeling_vila.py

2ez4bz

LGTM, but I would personally prefer if the new arguments to fuse_input_embeds are made explicit.

tensorrt_llm/_torch/models/modeling_multimodal_utils.py

tensorrt_llm/_torch/modules/embedding.py

tests/unittest/_torch/multimodal/test_fuse_input_embeds.py

tensorrt-cicd · 2025-09-04T04:14:11Z

PR_Github #17613 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-04T10:08:53Z

PR_Github #17613 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13240 completed with status: 'FAILURE'

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

chang-l · 2025-09-04T17:59:36Z

/bot run

tensorrt-cicd · 2025-09-04T18:05:38Z

PR_Github #17705 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-04T18:05:39Z

PR_Github #17705 [ run ] completed with state DISABLED
L0 testing is limited to prioritized users. User chang-l is not in the prioritized list. L0 testing cannot be triggered.

chang-l · 2025-09-05T16:29:54Z

/bot run

tensorrt-cicd · 2025-09-05T16:34:55Z

PR_Github #17801 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-05T19:30:12Z

PR_Github #17801 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13328 completed with status: 'FAILURE'

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

chang-l · 2025-09-05T20:21:15Z

/bot run

tensorrt-cicd · 2025-09-05T20:26:38Z

PR_Github #17818 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-05T22:14:43Z

PR_Github #17818 [ run ] completed with state ABORTED

chang-l · 2025-09-05T22:15:50Z

/bot run

tensorrt-cicd · 2025-09-05T22:21:10Z

PR_Github #17825 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-06T05:34:27Z

PR_Github #17825 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13343 completed with status: 'FAILURE'

yechank-nvidia

LGTM

chang-l · 2025-09-06T21:44:57Z

/bot run --reuse-test

tensorrt-cicd · 2025-09-06T21:50:30Z

PR_Github #17899 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-07T03:11:28Z

PR_Github #17899 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13407 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

NVIDIA#7280) Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

chang-l requested review from a team as code owners August 27, 2025 05:45

chang-l requested review from hypdeb, Wanli-Jiang and shaharmor98 August 27, 2025 05:45

chang-l requested review from 2ez4bz, jaedeok-nvidia and yechank-nvidia August 27, 2025 05:45

coderabbitai bot reviewed Aug 27, 2025

View reviewed changes

2ez4bz reviewed Aug 27, 2025

View reviewed changes

chang-l requested a review from rakib-hasan August 27, 2025 18:31

2ez4bz reviewed Aug 27, 2025

View reviewed changes

tensorrt_llm/_torch/models/modeling_multimodal_utils.py Outdated Show resolved Hide resolved

rakib-hasan approved these changes Aug 28, 2025

View reviewed changes

chang-l force-pushed the avoid_sync_fused_embed branch from 93314a0 to 594371f Compare September 2, 2025 22:15

coderabbitai bot reviewed Sep 2, 2025

View reviewed changes

tensorrt_llm/_torch/models/modeling_multimodal_utils.py Outdated Show resolved Hide resolved

chang-l requested review from a team as code owners September 3, 2025 01:20

chang-l requested review from byshiue, amukkara, brb-nv and QiJune September 3, 2025 01:20

coderabbitai bot reviewed Sep 3, 2025

View reviewed changes

tensorrt_llm/_torch/models/modeling_multimodal_utils.py Outdated Show resolved Hide resolved

tensorrt_llm/_torch/models/modeling_qwen2vl.py Show resolved Hide resolved

tensorrt_llm/_torch/models/modeling_vila.py Show resolved Hide resolved

2ez4bz approved these changes Sep 3, 2025

View reviewed changes

tensorrt_llm/_torch/models/modeling_multimodal_utils.py Outdated Show resolved Hide resolved

tensorrt_llm/_torch/modules/embedding.py Show resolved Hide resolved

tests/unittest/_torch/multimodal/test_fuse_input_embeds.py Show resolved Hide resolved

amukkara approved these changes Sep 3, 2025

View reviewed changes

chang-l added 8 commits September 4, 2025 10:34

Split fused_input_embed to separate cpu op

c27b29a

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

Address comments and add some unit tests

4157ac6

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

Simplify and make it more backward compatible

bc85533

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

Address comments and fix some tests

d9e7852

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

Minor update

c7214ed

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

Fix CI

eb304c0

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

Fix CI

7155b28

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

Fix test to avoid randomness

b2f2637

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

chang-l force-pushed the avoid_sync_fused_embed branch from 6250a0c to b2f2637 Compare September 4, 2025 17:58

Resolve merge conflict

024c972

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

yechank-nvidia approved these changes Sep 6, 2025

View reviewed changes

chang-l merged commit 99b98f1 into NVIDIA:main Sep 7, 2025
5 checks passed

Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025

[TRTLLM-7440][fix] Split fused_input_embed to separate out host sync (

25aba9e

NVIDIA#7280) Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

[TRTLLM-7440][fix] Split fused_input_embed to separate out host sync #7280

[TRTLLM-7440][fix] Split fused_input_embed to separate out host sync #7280

Uh oh!

Conversation

chang-l commented Aug 27, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

GitHub Bot Help

kill

skip

reuse-pipeline

Uh oh!

coderabbitai bot commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

Status, Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rakib-hasan left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

2ez4bz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tensorrt-cicd commented Sep 4, 2025

Uh oh!

tensorrt-cicd commented Sep 4, 2025

Uh oh!

chang-l commented Sep 4, 2025

Uh oh!

tensorrt-cicd commented Sep 4, 2025

Uh oh!

tensorrt-cicd commented Sep 4, 2025

Uh oh!

chang-l commented Sep 5, 2025

Uh oh!

tensorrt-cicd commented Sep 5, 2025

Uh oh!

tensorrt-cicd commented Sep 5, 2025

Uh oh!

chang-l commented Sep 5, 2025

Uh oh!

tensorrt-cicd commented Sep 5, 2025

Uh oh!

tensorrt-cicd commented Sep 5, 2025

Uh oh!

chang-l commented Sep 5, 2025

Uh oh!

[TRTLLM-7440][fix] Split `fused_input_embed` to separate out host sync #7280

[TRTLLM-7440][fix] Split `fused_input_embed` to separate out host sync #7280

chang-l commented Aug 27, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 27, 2025 •

edited

Loading