
Conversation

chang-l
Collaborator

@chang-l chang-l commented Aug 8, 2025


This is the second PR split out from #5000. It introduces a standalone multimodal encoder that runs only the ViT and returns the multimodal embedding (via a handle).
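A minimal usage sketch of the intended flow (model path, prompt construction, and exact constructor arguments here are illustrative and may differ from the final API):

from tensorrt_llm import MultimodalEncoder

# Illustrative model path.
encoder = MultimodalEncoder(model="llava-hf/llava-v1.6-mistral-7b-hf")

# Placeholder: multimodal PromptInputs (text + image), built as for LLM.generate.
prompts = [...]

# Encoder-only results expose a serialized shared-tensor handle to the
# vision-encoder embeddings instead of generated text.
outputs = encoder.generate(prompts)
handle = outputs[0].mm_embedding_handle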

Summary by CodeRabbit

  • New Features

    • Introduced support for serving and running standalone multimodal encoder models, enabling extraction of multimodal embeddings separately from text generation.
    • Added a new CLI command to launch a multimodal encoder server.
    • Enhanced API and server responses to include multimodal embedding handles alongside standard outputs.
    • Enabled external integration and handling of precomputed multimodal embeddings in model workflows.
    • Added vision encoder registration mechanism for modular model support.
  • Bug Fixes

    • None.
  • Tests

    • Added comprehensive tests verifying disaggregated multimodal encoder outputs match standard multimodal generation results.
    • Added tests for external embedding integration and token count correctness for image inputs.
  • Refactor

    • Improved configuration handling and model initialization for multimodal encoder-only mode.
    • Modularized vision encoder registration and updated model class inheritance for consistency.
  • Chores

    • Expanded enums and configuration options to support multimodal encoder features.
    • Updated executor and sampler logic to handle multimodal embedding workflows.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
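For example, to run a single stage with fail-fast disabled, a PR comment could look like:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"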

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without careful validation can break the top of tree.

@chang-l chang-l requested review from a team as code owners August 8, 2025 06:54
Contributor

coderabbitai bot commented Aug 8, 2025

📝 Walkthrough


This change introduces a complete multimodal encoder-only execution and serving pathway. It adds a new MultimodalEncoder class, updates model configs, input processors, and model loading logic to support encoder-only operation, and registers vision encoder classes. The server, CLI, and protocol are extended for encoder-specific endpoints and outputs. Comprehensive tests are included.

Changes

  • Multimodal Encoder Class & API Exposure (tensorrt_llm/llmapi/mm_encoder.py, tensorrt_llm/llmapi/__init__.py, tensorrt_llm/__init__.py):
    Introduces the MultimodalEncoder class for encoder-only models and exports it in the public APIs.
  • Vision Encoder Registration & Model Loading (tensorrt_llm/_torch/models/modeling_utils.py, tensorrt_llm/_torch/models/modeling_auto.py, tensorrt_llm/_torch/models/modeling_llava_next.py, tensorrt_llm/_torch/models/modeling_qwen2vl.py):
    Adds a vision encoder registration decorator, a vision encoder mapping, and encoder-only model loading logic. Updates input processors and model classes for external embeddings and vision encoder handling.
  • Model Config and Executor Config Extensions (tensorrt_llm/_torch/model_config.py, tensorrt_llm/_torch/pyexecutor/config.py):
    Adds an mm_encoder_only flag to the model and executor configs and updates config propagation and logic.
  • PyExecutor & Sampler Encoder-Only Path (tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/sampler.py, tensorrt_llm/_torch/pyexecutor/llm_request.py, tensorrt_llm/_torch/pyexecutor/py_executor_creator.py):
    Implements the encoder-only execution path: a dedicated sampler, result handling, input preparation, and disabled scheduling where unnecessary. Adds multimodal embedding result handling to requests and samplers.
  • Serving, CLI, and Protocol Updates (tensorrt_llm/commands/serve.py, tensorrt_llm/serve/openai_server.py, tensorrt_llm/serve/openai_protocol.py, tensorrt_llm/llmapi/disagg_utils.py, tensorrt_llm/llmapi/llm.py, tensorrt_llm/llmapi/llm_args.py):
    Adds a CLI command, server role, and protocol fields for encoder-only serving. Updates the OpenAI server with encoder-specific endpoints and response structure.
  • Executor Result Handling (tensorrt_llm/executor/result.py):
    Adds an mm_embedding_handle property to generation results for multimodal embedding outputs.
  • Test Suite (tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py, tests/unittest/_torch/multimodal/test_external_embedding.py, tests/unittest/_torch/multimodal/test_find_num_image_tokens.py, tests/unittest/llmapi/apps/_test_openai_mmencoder.py, tests/integration/defs/test_e2e.py):
    Adds end-to-end tests validating equivalence between encoder-only/disaggregated and standard multimodal generation, external embedding handling, token count correctness for image inputs, and OpenAI protocol compliance.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant OpenAIServer
    participant MultimodalEncoder
    participant PyExecutor

    Client->>OpenAIServer: POST /v1/chat/completions (image/text)
    OpenAIServer->>MultimodalEncoder: generate_async(multimodal_inputs)
    MultimodalEncoder->>PyExecutor: run_encoder_only(inputs)
    PyExecutor-->>MultimodalEncoder: mm_embeddings
    MultimodalEncoder-->>OpenAIServer: mm_embedding_handle
    OpenAIServer-->>Client: ChatCompletionResponse (mm_embedding_handle)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • amukkara
  • litaotju
  • symphonylyh
  • rakib-hasan
  • nv-guomingz
  • juney-nvidia


chang-l added 5 commits August 7, 2025 23:55
Signed-off-by: Chang Liu <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 22

🔭 Outside diff range comments (12)
tensorrt_llm/llmapi/disagg_utils.py (1)

1-1: Missing NVIDIA copyright header

Per coding guidelines, add the NVIDIA copyright header with the current year at the top of this file.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import logging
tensorrt_llm/llmapi/llm_args.py (1)

1-1: Missing NVIDIA copyright header

Please add the standard NVIDIA header at the top of this file as required by the coding guidelines.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import copy
tensorrt_llm/llmapi/__init__.py (1)

1-1: Missing NVIDIA copyright header

Add the standard NVIDIA header at the top of this file.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 from ..disaggregated_params import DisaggregatedParams
tensorrt_llm/serve/openai_protocol.py (1)

1-3: Add NVIDIA SPDX header (production file)

Per project guidelines, production .py files must start with the NVIDIA copyright:

Add at file top:

# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
tensorrt_llm/_torch/pyexecutor/_util.py (1)

589-600: Ensure mm_encoder_only short-circuits before TRTLLMSampler to avoid wrong sampler selection

If enable_trtllm_sampler is True, the current order returns TRTLLMSampler before honoring mm_encoder_only. Encoder-only paths should bypass decoding/sampling entirely.

Apply this diff to short-circuit on mm_encoder_only before TRTLLMSampler:

@@
     if engine.spec_config is not None and engine.spec_config.spec_dec_mode.has_spec_decoder(
     ):
         return get_spec_decoder(sampler_args, engine.spec_config)
-    if pytorch_backend_config.enable_trtllm_sampler:
+    # Short-circuit for multimodal encoder-only execution: no decoding/sampling.
+    if executor_config.mm_encoder_only:
+        # NOTE: handle model outputs specially for mm encoder executor/engine
+        return EarlyStopWithMMResult()
+    if pytorch_backend_config.enable_trtllm_sampler:
         decoding_mode = get_decoding_mode(executor_config)
         return TRTLLMSampler(executor_config, engine.model, engine.dtype,
                              mapping, decoding_mode,
                              pytorch_backend_config.disable_overlap_scheduler)
-    if executor_config.mm_encoder_only:
-        # NOTE: handle model outputs specially for mm encoder executor/engine
-        return EarlyStopWithMMResult()

Optionally, add a warning if both mm_encoder_only and enable_trtllm_sampler are set.

tensorrt_llm/_torch/models/modeling_auto.py (1)

1-1: Missing required NVIDIA copyright header.

Add header per repo compliance.

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
# See LICENSE for licensing terms.
tensorrt_llm/executor/result.py (1)

1-1: Missing required NVIDIA copyright header.

Add header per repo compliance.

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
# See LICENSE for licensing terms.
tensorrt_llm/_torch/models/modeling_utils.py (1)

1-1: Missing required NVIDIA copyright header.

Add header per repo compliance.

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
# See LICENSE for licensing terms.
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

82-92: AttributeError: SampleStateTensors lacks mm_embeddings
EarlyStopSampler.update_requests accesses state.host.mm_embeddings, but SampleStateTensors defines only new_tokens, logits, log_probs. This will crash when py_return_mm_embeddings is true.

Minimal fix:

 class SampleStateTensors:
     new_tokens: torch.Tensor
     logits: torch.Tensor | None = None
     log_probs: torch.Tensor | None = None
+    mm_embeddings: list[torch.Tensor] | None = None

and populate it in sample_async:

-        host = SampleStateTensors(logits=model_outputs['logits'],
-                                  new_tokens=torch.empty(0))
+        host = SampleStateTensors(
+            logits=model_outputs.get('logits'),
+            new_tokens=torch.empty(0),
+            mm_embeddings=model_outputs.get('mm_embeddings'),
+        )

Alternatively, drop the mm_embeddings branch from this sampler if embeddings are guaranteed to be handled only by EarlyStopWithMMResult.

tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)

1-1: Missing NVIDIA copyright header (required).

All OSS source files must include the NVIDIA copyright header with the current year. Please add it.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/_torch/models/modeling_llava_next.py (2)

318-321: Detect GPU properties for the correct device index (not hard-coded 0).

Current code queries device 0, which may not match self.device (cuda:{rank}). Use the actual device index to avoid incorrect SM detection on multi-GPU systems.

-        prop = torch.cuda.get_device_properties(0)
+        device_idx = int(str(self.device).split(":")[1])
+        prop = torch.cuda.get_device_properties(device_idx)

486-489: self.dtype is undefined in LlavaNextModel; fix logging by defining it.

This will raise at runtime. Define self.dtype before logging or log model_dtype only.

-        self.model_dtype = getattr(config.text_config, "torch_dtype",
-                                   torch.float16)
-        logger.info(f"{self.dtype=} {self.model_dtype=}")
+        self.model_dtype = getattr(config.text_config, "torch_dtype", torch.float16)
+        self.dtype = self.model_dtype  # Keep a consistent dtype attribute for logging
+        logger.info(f"{self.dtype=} {self.model_dtype=}")
🧹 Nitpick comments (40)
tensorrt_llm/llmapi/disagg_utils.py (1)

35-39: Type mismatch: RouterConfig.server_role should be Optional[ServerRole]

The field is annotated as ServerRole but defaulted to None. Recommend Optional for correctness and clearer intent.

-    server_role: ServerRole = None
+    server_role: Optional[ServerRole] = None

(Add "from typing import Optional" at the top of the file if it is not already imported.)
tensorrt_llm/llmapi/llm_args.py (1)

1943-1945: Clarify allowed values in docs and CLI help

Consider updating the relevant docstrings/help text where load_format is surfaced (e.g., TorchLlmArgs.load_format description and CLI) to mention 'VISION_ONLY' explicitly and its scope (encoder-only multimodal mode).

tensorrt_llm/__init__.py (2)

54-54: Silence Ruff E402 for intentional late import order

This file imports mid-module by design (xgrammar must load before the rest). Ruff flags the new import with E402. Add a targeted noqa to avoid new lint noise without reordering.

-from .llmapi import MultimodalEncoder
+from .llmapi import MultimodalEncoder  # noqa: E402

53-56: Minor nit: duplicate import of LlmArgs

LlmArgs is imported both from .llmapi and .llmapi.llm_args. While pre-existing, it’s confusing. Consider keeping only one source (prefer the specific .llmapi.llm_args).

tensorrt_llm/_torch/model_config.py (1)

99-101: Add mm_encoder_only flag to ModelConfig — OK

This cleanly expresses encoder-only mode at config level.

Consider documenting this attribute in the class docstring for Sphinx (Google style) so it surfaces in generated docs.
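A minimal Google-style attribute docstring along these lines would cover it (wording illustrative):

class ModelConfig:
    """Model configuration.

    Attributes:
        mm_encoder_only: If True, load and run only the multimodal (vision)
            encoder and skip the language-model decoding path.
    """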

tensorrt_llm/_torch/pyexecutor/config.py (2)

102-104: Expose mm_encoder_only in PyTorchConfig — OK

Field addition aligns with the encoder-only pathway.

Add a short field docstring to clarify behavior for docs tooling.


133-135: update_executor_config signature extension — OK

Accepts mm_encoder_only and checkpoint_loader cleanly.

For consistency, if a PyTorchConfig is supplied, consider mirroring:

if executor_config.pytorch_backend_config is not None:
    executor_config.pytorch_backend_config.mm_encoder_only = bool(mm_encoder_only)

This avoids relying solely on later _mangle logic.

tensorrt_llm/_torch/pyexecutor/llm_request.py (3)

176-176: Type-annotate the new member to clarify intent and help static analysis.

Initialize with an explicit Optional type to match the property signature.

-        self._mm_embeddings = None
+        self._mm_embeddings: Optional[Dict[str, Any]] = None

192-194: Validate input tensor and avoid accidental CPU/GPU mismatches.

Guard against non-Tensor inputs and provide a clear error. Also document the expected device (CPU/GPU ok, SharedTensorContainer handles it) to reduce misuse.

     def append_mm_embeddings(self, mm_embeddings: torch.Tensor):
-        self._mm_embeddings = SharedTensorContainer.from_tensor(mm_embeddings).dump_to_dict()
+        if not isinstance(mm_embeddings, torch.Tensor):
+            raise TypeError(f"mm_embeddings must be a torch.Tensor, got {type(mm_embeddings)}")
+        # Note: SharedTensorContainer accepts both CPU and CUDA tensors.
+        self._mm_embeddings = (
+            tllm_shared_tensor.SharedTensorContainer
+            .from_tensor(mm_embeddings)
+            .dump_to_dict()
+        )

232-235: Add a short docstring to mm_embedding_handle for external consumers.

Keep API self-documenting; this property is part of the outward interface accessed via LlmResult.

     @property
     def mm_embedding_handle(self) -> Dict[str, Any] | None:
-        return self._mm_embeddings
+        """A serialized handle (dict) to multimodal embeddings stored in shared memory.
+
+        Consumers should recover the tensor from this handle using the appropriate
+        SharedTensorContainer recovery routine (context phase only). The handle is
+        opaque and lifetime-bound to the underlying shared resource.
+        """
+        return self._mm_embeddings
tensorrt_llm/_torch/models/modeling_auto.py (1)

48-69: Add docstring and explicit return type; minor simplification.

Clarify purpose and return type. Keep instantiation tight.

-class AutoModelForMultimodalEncoder(Generic[TModel, TConfig]):
+class AutoModelForMultimodalEncoder(Generic[TModel, TConfig]):
 
-    @staticmethod
-    def from_config(
-        config: ModelConfig[TConfig],
-    ):
+    @staticmethod
+    def from_config(
+        config: ModelConfig[TConfig],
+    ) -> nn.Module:
+        """Factory for vision-only encoder models registered via register_vision_encoder().
+
+        Args:
+            config: The unified ModelConfig carrying pretrained_config.architectures[0].
+        Returns:
+            Instantiated vision encoder nn.Module for the given architecture.
+        Raises:
+            ValueError: if the architecture is unknown/not registered.
+        """
         model_arch = config.pretrained_config.architectures[0]
 
         # Direct lookup using architecture name
         vision_encoder_info = MODEL_CLASS_VISION_ENCODER_MAPPING.get(model_arch)
         if vision_encoder_info is None:
             raise ValueError(
                 f"Unknown architecture for AutoModelForMultimodalEncoder: {model_arch}"
             )
 
         vision_encoder_cls, vlm_base_model = vision_encoder_info
-        if vlm_base_model is None:
-            model = vision_encoder_cls(config)
-        else:
-            model = vision_encoder_cls(config, vlm_base_model)
-
-        return model
+        return (vision_encoder_cls(config) if vlm_base_model is None
+                else vision_encoder_cls(config, vlm_base_model))

Note: add the following import at file top if not present:

from torch import nn
tensorrt_llm/executor/result.py (2)

201-204: Add a concise docstring for mm_embedding_handle.

Clarify that this is a handle for downstream recovery.

     @property
     def mm_embedding_handle(self) -> Optional[Dict[str, Any]]:
-        return self._mm_embedding_handle
+        """Serialized shared-tensor handle for multimodal embeddings (if present)."""
+        return self._mm_embedding_handle

567-568: Consider omitting mm_embedding_handle from repr to avoid noisy logs.

Handles may be verbose; including them in repr can bloat logs/telemetry.

-            "context_logits", "mm_embedding_handle"
+            "context_logits"

If you want it visible, consider summarized output (e.g., keys only) by customizing repr for this field.

tensorrt_llm/_torch/models/modeling_utils.py (1)

576-577: Add a precise type annotation for MODEL_CLASS_VISION_ENCODER_MAPPING.

Improves readability and static checks.

-MODEL_CLASS_VISION_ENCODER_MAPPING = {}
+from typing import Tuple
+MODEL_CLASS_VISION_ENCODER_MAPPING: Dict[str, Tuple[Type[nn.Module], Optional[Type[nn.Module]]]] = {}
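For reference, the registration pattern behind this mapping could look roughly like the sketch below; the actual decorator in modeling_utils.py may differ in signature and in how entries are keyed.

from typing import Dict, Optional, Tuple, Type

from torch import nn

MODEL_CLASS_VISION_ENCODER_MAPPING: Dict[str, Tuple[Type[nn.Module],
                                                    Optional[Type[nn.Module]]]] = {}


def register_vision_encoder(vision_encoder_cls: Type[nn.Module],
                            vlm_base_model: Optional[Type[nn.Module]] = None):
    """Illustrative decorator: map a VLM architecture to its vision encoder."""

    def decorator(model_cls: Type[nn.Module]) -> Type[nn.Module]:
        # Keyed by the decorated class name so AutoModelForMultimodalEncoder
        # can resolve it from pretrained_config.architectures[0].
        MODEL_CLASS_VISION_ENCODER_MAPPING[model_cls.__name__] = (
            vision_encoder_cls, vlm_base_model)
        return model_cls

    return decorator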
tensorrt_llm/commands/serve.py (1)

390-397: Docstring nit – add terminating period
The first line of the docstring should end with a period to satisfy D415.

-    """Running an OpenAI API compatible server
+    """Running an OpenAI-API-compatible server.
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py (3)

46-50: Unused variable model_name
model_name is assigned but never used – remove it or use it in log/skip messages to silence F841.


12-16: Tests fetch images from the Internet – flaky
Relying on remote URLs makes the unit test non-deterministic and slow/offline-unfriendly. Ship small test images with the repo or use pytest.skip when offline.
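One way to make such tests offline-friendly is a skip marker along these lines (host and helper name are illustrative, not taken from the repo):

import socket

import pytest


def _network_available(host: str = "huggingface.co", port: int = 443,
                       timeout: float = 2.0) -> bool:
    # Best-effort connectivity probe, used only to decide whether to skip.
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False


requires_network = pytest.mark.skipif(
    not _network_available(), reason="test downloads images from the Internet")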


136-170: Very long line & style issues
Several lines exceed 120 chars. Wrap them to keep within the guideline and improve readability.

tensorrt_llm/_torch/pyexecutor/sampler.py (1)

138-140: Line > 120 chars – wrap to satisfy project style.

tensorrt_llm/llmapi/mm_encoder.py (9)

5-5: Follow namespace import guideline.

Per coding guidelines, prefer keeping module namespace instead of importing individual symbols.

-from .llm import BaseLLM, _TorchLLM, RequestOutput
+from . import llm as llm_mod

Then update references:

  • class MultimodalEncoder(_TorchLLM):class MultimodalEncoder(llm_mod._TorchLLM):
  • BaseLLM._build_model(self)llm_mod.BaseLLM._build_model(self)
  • Type hints: RequestOutputllm_mod.RequestOutput

If the rest of the codebase commonly uses direct imports here, consider deferring to maintain consistency, but the guideline prefers namespaced imports.


16-18: Fix one-line docstring format (D200).

Keep one-line docstrings on a single line.

-class MultimodalEncoder(_TorchLLM):
-    """MultimodalEncoder class is the main class for running a multimodal encoder model using PyTorch backend.
-"""
+class MultimodalEncoder(llm_mod._TorchLLM):
+    """MultimodalEncoder runs a multimodal encoder model using the PyTorch backend."""

30-34: Style: argument spacing.

Nit: follow PEP8 spacing for kwargs.

-                         dtype = dtype,
+                         dtype=dtype,

40-46: Wrap long comments (E501) and clarify.

Break long lines to <=120 chars and keep comments concise.

-        # Tokenizer loading should be after calling model_loader(), since model_loader() may download the model from HF hub.
-        # It should also be before bindings ExecutorConfig, which may depend on tokenizer info.
+        # Load tokenizer after model_loader() (which may download from HF hub),
+        # and before binding ExecutorConfig (which may depend on tokenizer info).

58-59: Optional: line length and readability.

Break the long assignment.

-            kwargs[
-                "batching_type"] = self.args.batching_type or tllm.BatchingType.INFLIGHT
+            kwargs["batching_type"] = (
+                self.args.batching_type or tllm.BatchingType.INFLIGHT
+            )

83-93: Confirm is_llm_executor for mm-encoder-only.

Setting is_llm_executor=True may affect executor behavior (metrics, routing). For encoder-only, this might need to be False. Please confirm.

-            is_llm_executor=True, # TODO: check if this is correct or needed
+            # Encoder-only executor; verify whether LLM-specific paths should be disabled.
+            is_llm_executor=False,

If other code paths rely on this being True, keep it and document why.


95-99: Docstring spacing (D205).

Add a blank line between summary and description.

-        """Validate that users don't pass LLM-specific arguments when using MultimodalEncoder (PyTorch).
-        Placeholder for now.
-        """
+        """Validate that users don't pass LLM-specific arguments when using MultimodalEncoder (PyTorch).
+
+        Placeholder for now.
+        """

105-114: Docstring length and spacing (E501, D205).

Wrap long lines and add a blank line between summary and details.

-        """Generate output for the given prompts in the synchronous mode.
-        Synchronous generation accepts either single prompt or batched prompts.
+        """Generate output for the given prompts in synchronous mode.
+
+        Synchronous generation accepts either a single prompt or batched prompts.
@@
-            inputs (tensorrt_llm.inputs.data.PromptInputs, Sequence[tensorrt_llm.inputs.data.PromptInputs]): The prompt text or token ids.
-                It can be single prompt or batched prompts.
+            inputs (PromptInputs | Sequence[PromptInputs]): The prompt text or token ids.
+                It can be a single prompt or batched prompts.
@@
-            Union[tensorrt_llm.llmapi.RequestOutput, List[tensorrt_llm.llmapi.RequestOutput]]: The output data of the completion request to the LLM.
+            Union[llm_mod.RequestOutput, List[llm_mod.RequestOutput]]: The outputs of the encoder requests.

148-160: Docstring spacing (D205).

Add a blank line between summary and the rest. Also align the return description.

-        """Generate output for the given multimodal request in the asynchronous mode.
-        Asynchronous generation accepts single multimodal request only.
+        """Generate output for the given multimodal request in asynchronous mode.
+
+        Asynchronous generation accepts a single multimodal request only.
@@
-            Future that resolves to tensorrt_llm.llmapi.RequestOutput containing mm_embeddings
+            A future that resolves to llm_mod.RequestOutput containing mm_embeddings.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

1646-1656: Prefer MultimodalParams over raw embeddings; remove now-unused multi_modal_data

The new MultimodalParams flow is good. It appears multi_modal_data is now redundant in this function. Removing it avoids confusion and potential linter warnings.

Consider deleting the multi_modal_data list and related branches in this function.

tensorrt_llm/serve/openai_server.py (5)

10-10: Missing NVIDIA copyright header (required for OSS source files).

Per repository guidelines, add the NVIDIA copyright header with the current year.

Example header to add at the top of the file:

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.

45-45: Duplicate import of UsageInfo from openai_protocol.

UsageInfo is already imported on Line 39. Remove the duplicate to avoid shadowing/readability issues.

-from tensorrt_llm.serve.openai_protocol import ChatMessage, ChatCompletionResponseChoice, UsageInfo
+from tensorrt_llm.serve.openai_protocol import ChatMessage, ChatCompletionResponseChoice

160-170: MM-encoder route set looks right; consider parity and health_generate behavior.

  • Parity: Consider whether /kv_cache_events should be exposed for MM encoder mode (optional).
  • Health: As noted, ensure health_generate uses the MM encoder path to avoid calling text generation.

366-383: Fix long line and improve readability of parse_chat_messages_coroutines call.

Wrap to respect line length (Ruff E501).

-            conversation, mm_coroutines, mm_placeholder_counts = parse_chat_messages_coroutines(request.messages, self.model_config)
+            (
+                conversation,
+                mm_coroutines,
+                mm_placeholder_counts,
+            ) = parse_chat_messages_coroutines(
+                request.messages,
+                self.model_config,
+            )

343-344: Explicitly reject streaming for MM encoder endpoint (if unsupported).

If MM encoder doesn’t support streaming, return a 400 early when request.stream is True to avoid client confusion.

 async def openai_mm_encoder(self, request: ChatCompletionRequest, raw_request: Request) -> Response:
+        if getattr(request, "stream", False):
+            return self.create_error_response(
+                message="stream=True is not supported for multimodal encoder",
+                err_type="BadRequestError",
+                status_code=HTTPStatus.BAD_REQUEST,
+            )
tensorrt_llm/_torch/models/modeling_llava_next.py (6)

3-3: Missing NVIDIA copyright header (required for OSS source files).

Please add the NVIDIA copyright header with the current year at the top of this file.

Example header:

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.

132-133: Remove unused variable assignment (Ruff F841).

num_mm_tokens is never used. Keep num_medias only.

-        num_medias = num_mm_tokens = len(mm_token_positions)
+        num_medias = len(mm_token_positions)

194-196: Break long assertion message (Ruff E501) and improve readability.

-        ) == fused_length, f"Fused input_ids length {len(fused_input_ids)} should match the sum of text and multimodal embedding lengths {fused_length}"
+        ) == fused_length, (
+            f"Fused input_ids length {len(fused_input_ids)} should match the "
+            f"sum of text and multimodal embedding lengths {fused_length}"
+        )

237-242: Ensure multimodal embeddings dtype matches text model dtype.

Stacked mm_features should be cast to the text model dtype to avoid downstream dtype mismatches in fuse_input_embeds.

-        mm_features = torch.stack(multimodal_embedding['image'])
+        mm_features = torch.stack(multimodal_embedding['image'])
+        target_dtype = getattr(self.model_config.text_config, "torch_dtype", mm_features.dtype)
+        mm_features = mm_features.to(target_dtype)

521-527: Style nit: double space after 'if'.

Minor readability nit. Also consider an early return if multimodal_embedding is not provided.

-                if  multimodal_params[0].multimodal_data.get("multimodal_embedding", None) is not None:
+                if multimodal_params[0].multimodal_data.get("multimodal_embedding", None) is not None:

339-342: Docstrings for new helpers (post_config) and input processor methods.

Consider adding short Google-style docstrings for post_config, _postprocess, and get_num_tokens_per_image to aid maintainability and Sphinx docs.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9687bb4 and e7bc795.

📒 Files selected for processing (22)
  • tensorrt_llm/__init__.py (2 hunks)
  • tensorrt_llm/_torch/model_config.py (3 hunks)
  • tensorrt_llm/_torch/models/modeling_auto.py (3 hunks)
  • tensorrt_llm/_torch/models/modeling_llava_next.py (10 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py (5 hunks)
  • tensorrt_llm/_torch/models/modeling_utils.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (8 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
  • tensorrt_llm/commands/serve.py (4 hunks)
  • tensorrt_llm/executor/result.py (5 hunks)
  • tensorrt_llm/llmapi/__init__.py (2 hunks)
  • tensorrt_llm/llmapi/disagg_utils.py (1 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
  • tensorrt_llm/llmapi/mm_encoder.py (1 hunks)
  • tensorrt_llm/serve/openai_protocol.py (1 hunks)
  • tensorrt_llm/serve/openai_server.py (7 hunks)
  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/__init__.py
  • tensorrt_llm/llmapi/__init__.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tensorrt_llm/llmapi/mm_encoder.py
  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
  • tensorrt_llm/_torch/models/modeling_auto.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tensorrt_llm/serve/openai_server.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/__init__.py
  • tensorrt_llm/llmapi/__init__.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tensorrt_llm/llmapi/mm_encoder.py
  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
  • tensorrt_llm/_torch/models/modeling_auto.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tensorrt_llm/serve/openai_server.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tensorrt_llm/__init__.py
  • tensorrt_llm/llmapi/__init__.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/mm_encoder.py
  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
  • tensorrt_llm/_torch/models/modeling_auto.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/__init__.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/__init__.py
  • tensorrt_llm/llmapi/mm_encoder.py
  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tensorrt_llm/__init__.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🪛 Ruff (0.12.2)
tensorrt_llm/__init__.py

54-54: Module level import not at top of file

(E402)

tensorrt_llm/_torch/pyexecutor/sampler.py

138-138: Line too long (132 > 120)

(E501)

tensorrt_llm/llmapi/mm_encoder.py

17-18: One-line docstring should fit on one line

Reformat to one line

(D200)


40-40: Line too long (124 > 120)

(E501)


45-45: Line too long (140 > 120)

(E501)


46-46: Line too long (141 > 120)

(E501)


95-97: 1 blank line required between summary line and description

(D205)


105-114: 1 blank line required between summary line and description

(D205)


109-109: Line too long (138 > 120)

(E501)


113-113: Line too long (148 > 120)

(E501)


155-160: 1 blank line required between summary line and description

(D205)

tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py

37-37: Blank line contains whitespace

Remove whitespace from blank line

(W293)


46-46: Local variable model_name is assigned to but never used

Remove assignment to unused variable model_name

(F841)


136-136: Line too long (135 > 120)

(E501)


159-159: Line too long (130 > 120)

(E501)


164-164: Line too long (137 > 120)

(E501)

tensorrt_llm/commands/serve.py

390-393: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)


413-413: Undefined name server_role

(F821)


415-415: Undefined name server_role

(F821)

tensorrt_llm/_torch/pyexecutor/model_engine.py

1107-1107: Line too long (158 > 120)

(E501)


2275-2275: Line too long (127 > 120)

(E501)

tensorrt_llm/_torch/models/modeling_llava_next.py

27-27: Redefinition of unused ModelConfig from line 23

(F811)


132-132: Local variable num_mm_tokens is assigned to but never used

Remove assignment to unused variable num_mm_tokens

(F841)


195-195: Line too long (152 > 120)

(E501)

tensorrt_llm/serve/openai_server.py

354-354: Line too long (185 > 120)

(E501)


364-364: Local variable disaggregated_params is assigned to but never used

Remove assignment to unused variable disaggregated_params

(F841)


366-366: Line too long (132 > 120)

(E501)

🔇 Additional comments (24)
tensorrt_llm/llmapi/disagg_utils.py (1)

22-22: New ServerRole.MM_ENCODER: ensure all dispatch paths handle it

Addition looks correct (unique value). Please verify that all consumers (routing, serving, executor config) that switch on ServerRole have explicit handling for MM_ENCODER to avoid falling into default/legacy CONTEXT/GENERATION branches.
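An explicit-dispatch sketch of what such handling can look like (enum values and the returned labels are illustrative, not the repo's code):

from enum import IntEnum


class ServerRole(IntEnum):
    CONTEXT = 0
    GENERATION = 1
    MM_ENCODER = 2


def select_handler(role: ServerRole) -> str:
    # Unknown roles fail loudly instead of falling back to CONTEXT/GENERATION.
    if role is ServerRole.MM_ENCODER:
        return "mm_encoder"
    if role in (ServerRole.CONTEXT, ServerRole.GENERATION):
        return "llm"
    raise ValueError(f"Unhandled server role: {role!r}")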

tensorrt_llm/llmapi/llm_args.py (1)

1943-1945: Add LoadFormat.VISION_ONLY – integration looks good

VISION_ONLY integrates cleanly with TorchLlmArgs.convert_load_format (string-to-enum). Ensure all intended call sites accept this value and that non-PyTorch backends ignore or reject it as designed.

tensorrt_llm/llmapi/__init__.py (1)

24-25: Public API exposure looks good

Exporting MultimodalEncoder via __all__ is correct and aligns with the new feature.

tensorrt_llm/__init__.py (1)

107-107: Export looks good

Adding MultimodalEncoder to __all__ is correct and keeps the public API consistent.

tensorrt_llm/llmapi/llm.py (1)

56-56: Docstring addition looks good

Clear description of mm_embedding_handle and aligns with new MM encoder flow.

tensorrt_llm/_torch/pyexecutor/_util.py (1)

33-33: Import for multimodal encoder sampler result looks good

Importing EarlyStopWithMMResult is necessary for the encoder-only sampler flow.

tensorrt_llm/_torch/model_config.py (2)

121-121: Correctly plumbs mm_encoder_only into generation classification

Passing mm_encoder_only into is_generation_model ensures encoder-only models are treated as non-generation.


170-177: All is_generation_model calls updated with mm_encoder_only
I searched across the codebase; the only invocation, in tensorrt_llm/_torch/model_config.py, already passes the new parameter. No other call sites exist. Everything is correctly updated.

  • tensorrt_llm/_torch/model_config.py:120–121 now uses
    self.is_generation_model(self.pretrained_config.architectures, mm_encoder_only=self.mm_encoder_only)
tensorrt_llm/_torch/pyexecutor/config.py (2)

118-119: Include mm_encoder_only in dynamic executor fields — OK

Ensures dynamic assignment is permitted.


148-149: Propagating mm_encoder_only to executor_config — OK

This is required for downstream checks (e.g., sampler and engine behavior).

tensorrt_llm/_torch/pyexecutor/llm_request.py (2)

239-239: LGTM: LlmResult exposes mm_embedding_handle via py_result_properties.

This enables transparent access and aligns with executor/result plumbing.


1-1: Ignore outdated shared tensor recovery suggestion

After searching the entire codebase, there are no occurrences of from_shared_tensor()—only a single strip_for_generation() call in model_engine.py. The original comment about verifying shared‐tensor recovery alignment with the org learning no longer applies and can be disregarded.

Likely an incorrect or invalid review comment.

tensorrt_llm/executor/result.py (2)

164-165: LGTM: Store optional multimodal embedding handle on GenerationResultBase.

Initialization aligns with the new property; no functional issues.


339-341: LGTM: Propagate handle from LlmResult to GenerationResultBase.

This keeps the Python-side result in sync with executor results.

tensorrt_llm/_torch/models/modeling_qwen2vl.py (4)

6-6: Import aliasing is fine here.

Using import torch.nn as nn maintains the namespace and is consistent with guidelines.


25-25: Vision encoder registration import looks correct.

register_vision_encoder import aligns with the new registration framework and enables AutoModel to resolve the encoder correctly.


654-654: Decorator registration for Qwen2VL vision encoder looks correct.

This ties the VLM base model to the vision encoder via the registry, enabling encoder-only instantiation.


666-666: Decorator registration for Qwen2_5_VL vision encoder looks correct.

Registration aligns with the 2.5 variant as well.

tensorrt_llm/llmapi/mm_encoder.py (1)

60-82: Verify mm-encoder-only execution wiring

I’ve confirmed that:

  • mm_encoder_only=True is correctly passed into update_executor_config.
  • In _util.py (around lines 597–600), the sampler is switched to EarlyStopWithMMResult when mm_encoder_only is enabled.
  • In model_engine.py (around lines 1270–1274), multimodal_params.strip_for_generation() is invoked before the vision-encoder forward pass.

I was unable to locate any calls to from_shared_tensor() in the context path. Per the multimodal pipeline design, you should recover shared tensors with from_shared_tensor() during context requests, then use strip_for_generation() on subsequent generation requests to avoid redundant recovery.

Please verify that:

  • Context-phase code invokes MultimodalParams.from_shared_tensor() to recover shared tensors.
  • Generation-phase code continues to call strip_for_generation() as shown.
tensorrt_llm/_torch/pyexecutor/model_engine.py (5)

1014-1015: Propagate mm_encoder_only into load_config — LGTM

Passing mm_encoder_only down to the loader is correct and keeps config as the source of truth.


1629-1630: Initialize multimodal_params_list — LGTM

Switching to MultimodalParams as the transport is the right direction for no-cache path.


1657-1658: Set py_batch_idx for context requests — LGTM

Keeps batch mapping consistent across paths.


1710-1711: Inputs now carry "multimodal_params" — LGTM

Consistent with the cached path; downstream code can consume a uniform key.


2156-2161: Unconditional mm_encoder_only dispatch before KV cache check

The mm_encoder_only branch is currently gated by kv_cache_manager is None (lines 2151–2161 in model_engine.py). If a KV cache manager is ever provisioned for an encoder-only model, the multimodal encoder-only path will be skipped. We should pull the mm_encoder_only check to the top and always use the no-cache prep in this mode.

Locations to update:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py around line 2150

Proposed diff:

-        if kv_cache_manager is None:
-            inputs, gather_ids = self._prepare_tp_inputs_no_cache(
-                scheduled_requests, attn_metadata, spec_metadata)
-            with MoeLoadBalancerIterContext(moe_load_balancer):
-                if self.pytorch_backend_config.mm_encoder_only:
-                    return self._forward_step_mm_encoder_only(inputs, scheduled_requests)
-                else:
-                    return self._forward_step(inputs, gather_ids, gather_context_logits)
-        with self._maybe_pad_batch(scheduled_requests, kv_cache_manager,
-                                   spec_resource_manager) as scheduled_requests:
+        # Handle multimodal encoder-only mode first (always no-cache)
+        if self.pytorch_backend_config.mm_encoder_only:
+            inputs, gather_ids = self._prepare_tp_inputs_no_cache(
+                scheduled_requests, attn_metadata, spec_metadata)
+            with MoeLoadBalancerIterContext(moe_load_balancer):
+                return self._forward_step_mm_encoder_only(inputs, scheduled_requests)
+        if kv_cache_manager is None:
+            inputs, gather_ids = self._prepare_tp_inputs_no_cache(
+                scheduled_requests, attn_metadata, spec_metadata)
+        with self._maybe_pad_batch(scheduled_requests, kv_cache_manager,
+                                   spec_resource_manager) as scheduled_requests:

Please verify whether resource_manager.get_resource_manager(...) can ever return a non-None KV cache manager when mm_encoder_only is enabled, and adjust accordingly.

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
chang-l added 2 commits August 8, 2025 21:46
Signed-off-by: Chang Liu <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🔭 Outside diff range comments (2)
tensorrt_llm/__init__.py (1)

112-112: Exported ‘KvCacheConfig’ is not imported here ⇒ broken re-export.

__all__ lists KvCacheConfig, but this module doesn't import it, so from tensorrt_llm import KvCacheConfig will likely fail.

Fix by importing KvCacheConfig alongside the existing llmapi import:

-from .llmapi import LLM, MultimodalEncoder  # noqa: E402
+from .llmapi import LLM, MultimodalEncoder, KvCacheConfig  # noqa: E402
tensorrt_llm/serve/openai_server.py (1)

85-98: Fix potential None dereference for server_role in lifespan metadata.

server_role is Optional; accessing server_role.name can raise when None. Use self.server_role and guard.

-                    "server_role": server_role.name,
+                    "server_role": (self.server_role.name if self.server_role else None),
♻️ Duplicate comments (10)
tensorrt_llm/_torch/models/modeling_utils.py (1)

592-621: LGTM: Decorator docstring + fail-fast behavior implemented.

The for/else with a clear ValueError prevents silent misregistration. This aligns with prior review feedback.
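For reference, the fail-fast for/else lookup being praised here looks roughly like this (sketched against a generic mapping, not the exact modeling_utils code):

def resolve_registered_class(target_name: str, mapping: dict):
    """Return the entry registered under target_name or fail loudly (sketch)."""
    for registered_name, entry in mapping.items():
        if registered_name == target_name:
            break
    else:
        # The else branch runs only if the loop never hit `break`, i.e. no match.
        raise ValueError(
            f"{target_name} is not registered; known entries: {sorted(mapping)}")
    return entry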

tensorrt_llm/commands/serve.py (1)

413-421: Fix NameError: server_role is undefined in serve_encoder.

This block references server_role but it’s not defined in this scope. The role is fixed in launch_mm_encoder_server, so this validation is unnecessary.

Apply:

-    if metadata_server_cfg is not None:
-        assert server_role is not None, "server_role is required when metadata_server_cfg is provided"
-        try:
-            server_role = ServerRole[server_role.upper()]
-            assert server_role == ServerRole.MM_ENCODER, "server_role must be MM_ENCODER for multimodal encoder"
-        except ValueError:
-            raise ValueError(f"Invalid server role: {server_role}. " \
-                             f"Must be one of: {', '.join([role.name for role in ServerRole])}")
+    # Metadata server cfg is passed through; server role is enforced as MM_ENCODER in launch_mm_encoder_server.
tensorrt_llm/serve/openai_server.py (2)

175-207: health_generate should call openai_mm_encoder in MM_ENCODER mode.

In MM encoder role, /health_generate still calls openai_chat. Branch by role to exercise the correct path.

-            # Call the chat completion logic
-            response = await self.openai_chat(health_request, mock_request)
+            # Call the appropriate minimal-generation logic based on server role
+            if self.server_role is ServerRole.MM_ENCODER:
+                response = await self.openai_mm_encoder(health_request, mock_request)
+            else:
+                response = await self.openai_chat(health_request, mock_request)

Also applies to: 189-191


344-357: Guard mm_embedding_handle and reflow long lines (E501).

Avoid assuming mm_embedding_handle is always present; add checks and reformat for readability.

-        async def create_mm_embedding_response(
-                promise: RequestOutput):
+        async def create_mm_embedding_response(promise: RequestOutput):
             await promise.aresult()
-            mm_embedding_handle = promise.mm_embedding_handle
-            num_tokens = mm_embedding_handle["tensor_size"][0]
+            mm_embedding_handle = getattr(promise, "mm_embedding_handle", None)
+            if not mm_embedding_handle or "tensor_size" not in mm_embedding_handle:
+                return self.create_error_response(
+                    message="Multimodal embedding handle missing in response",
+                    err_type="InternalServerError",
+                    status_code=HTTPStatus.INTERNAL_SERVER_ERROR)
+            num_tokens = int(mm_embedding_handle["tensor_size"][0])
             # TODO: need to add more info like input_ids, specific_token_ids, mrope, mm_hashes, etc
-            return ChatCompletionResponse(
-                id=str(promise.request_id),
-                model=self.model,
-                choices=[ChatCompletionResponseChoice(index=0, message=ChatMessage(role="assistant", content="dummy"), mm_embedding_handle=mm_embedding_handle, finish_reason="length")],
-                usage=UsageInfo(prompt_tokens=num_tokens, completion_tokens=1, total_tokens=num_tokens+1))
+            return ChatCompletionResponse(
+                id=str(promise.request_id),
+                model=self.model,
+                choices=[
+                    ChatCompletionResponseChoice(
+                        index=0,
+                        message=ChatMessage(role="assistant", content="dummy"),
+                        mm_embedding_handle=mm_embedding_handle,
+                        finish_reason="length",
+                    )
+                ],
+                usage=UsageInfo(
+                    prompt_tokens=num_tokens,
+                    completion_tokens=1,
+                    total_tokens=num_tokens + 1,
+                ),
+            )
tensorrt_llm/llmapi/mm_encoder.py (3)

1-1: Add NVIDIA OSS copyright header.

Required for .py source files in OSS.

+ # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.

55-58: Tokenizer attribute consistency with _try_load_tokenizer.

Use self._tokenizer both when passing it into create_input_processor and when assigning the tokenizer returned by the input processor.

-        self.input_processor = create_input_processor(self._hf_model_dir,
-                                                      self.tokenizer)
-        self.tokenizer = self.input_processor.tokenizer
+        self.input_processor = create_input_processor(self._hf_model_dir,
+                                                      self._tokenizer)
+        self._tokenizer = self.input_processor.tokenizer

107-151: Bug: generate() returns unresolved futures instead of RequestOutput(s).

The docstring promises resolved results. Collect future.result() and return those; also remove the unused helper.

-        def _item_at(maybe_batched: Union[Any, Sequence[Any]], pos: int) -> Any:
-            if isinstance(maybe_batched, list):
-                return maybe_batched[pos]
-            else:
-                return maybe_batched
-
-        futures = []
+        futures: List[Any] = []
         for i, request_inputs in enumerate(inputs):
             future = self.generate_async(request_inputs)
             futures.append(future)
 
-        for future in tqdm(futures,
-                           desc="Processed requests",
-                           dynamic_ncols=True,
-                           disable=not use_tqdm):
-            future.result()
+        results: List[RequestOutput] = []
+        for future in tqdm(
+            futures,
+            desc="Processed requests",
+            dynamic_ncols=True,
+            disable=not use_tqdm,
+        ):
+            results.append(future.result())
 
         if unbatched:
-            futures = futures[0]
+            return results[0]
 
-        return futures
+        return results
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)

1077-1082: VISION_ONLY: add explicit guard to ensure weights truly preloaded.

Log is good; also warn (or error) if the model doesn’t advertise preloaded vision weights.

             elif load_format == LoadFormat.VISION_ONLY:
                 # Vision weights are already loaded within the model.
                 logger.info(
                     "LoadFormat.VISION_ONLY: skipping weight loading; using preloaded vision weights."
                 )
+                if not getattr(model, "vision_weights_preloaded", False) and not getattr(model, "is_vision_encoder", False):
+                    logger.warning(
+                        "VISION_ONLY format is used but the model does not advertise preloaded vision weights."
+                    )

1108-1113: Avoid hardcoded magic number for mm_encoder_only max_seq_len.

Derive it from the model or backend config with a safe fallback; this also shortens the long comment.

-        if self.pytorch_backend_config.mm_encoder_only:
-            # TODO: hardcoded for now, need to infer as: max_num_token_per_image * max_num_images (can be given as a parameter or defined same as max_seq_len)
-            inferred_max_seq_len = 32768
+        if self.pytorch_backend_config.mm_encoder_only:
+            # Derive from config; fallback to a safe default.
+            inferred_max_seq_len = (
+                getattr(self.model.config, "max_mm_tokens", None)
+                or getattr(self.pytorch_backend_config, "max_mm_seq_len", None)
+                or 32768
+            )

2256-2293: Make mm_encoder_only forward robust: no assert, stable returns, correct splitting, and use model_forward.

  • Always include 'logits': None, even when there is no multimodal data.
  • Replace assert with ValueError (asserts can be stripped with python -O).
  • Use model_forward to preserve tracing/compile behaviors.
  • Split evenly by len(multimodal_params) when per-request lengths are unavailable, not by the number of context_requests.
     @nvtx_range("_forward_step_mm_encoder_only")
     def _forward_step_mm_encoder_only(
             self, inputs: Dict[str, Any],
             scheduled_requests: ScheduledRequests) -> Dict[str, Any]:
-        """Forward step for multimodal encoder only mode - returns mm_embeddings instead of logits."""
+        """Run the vision encoder in encoder-only mode; return embeddings, no logits."""
         # Get multimodal parameters from inputs
         multimodal_params = inputs.get("multimodal_params", [])
-        if not multimodal_params or len(multimodal_params) == 0:
-            # Return empty embeddings if no multimodal data
-            return {'mm_embeddings': []}
+        if not multimodal_params:
+            return {'mm_embeddings': [], 'logits': None}
         if getattr(scheduled_requests.context_requests[0], 'multimodal_lengths',
                    None) is None:
             multimodal_chunks = None
         else:
             multimodal_chunks = [
                 sum(request.multimodal_lengths)
                 for request in scheduled_requests.context_requests
                 if request.multimodal_lengths is not None
             ]
         # For mm_encoder_only mode, we only run the vision encoder part
         # The model should be a vision encoder (e.g., Qwen2VisionModelBase)
-        mm_embeddings = self.model.forward(multimodal_params)
-        assert len(
-            mm_embeddings
-        ) == 1, "mm_embeddings should be a 1-element list, mix modality (video+image) is not supported"
+        mm_embeddings = self.model_forward(multimodal_params=multimodal_params)
+        if len(mm_embeddings) != 1:
+            raise ValueError("Expected a single embeddings tensor; mixed modalities not supported yet.")
 
-        if multimodal_chunks is None or len(multimodal_chunks) != len(
-                multimodal_params):
-            mm_embeddings = list(
-                torch.chunk(mm_embeddings[0],
-                            len(scheduled_requests.context_requests),
-                            dim=0))
+        if multimodal_chunks is None or len(multimodal_chunks) != len(multimodal_params):
+            mm_embeddings = list(torch.chunk(mm_embeddings[0], len(multimodal_params), dim=0))
         else:
             mm_embeddings = list(
                 torch.split(mm_embeddings[0], multimodal_chunks, dim=0))
 
-        return {'mm_embeddings': mm_embeddings, 'logits': None}
+        return {'mm_embeddings': mm_embeddings, 'logits': None}
🧹 Nitpick comments (14)
tensorrt_llm/__init__.py (1)

53-53: Silence E402 for this file’s special import ordering.

This module intentionally performs early side-effectful setup and non-top imports (e.g., xgrammar pre-import). Ruff flags E402 here. Either adjust tool config for this file or mark this import with noqa.

Example:

-from .llmapi import LLM, MultimodalEncoder
+from .llmapi import LLM, MultimodalEncoder  # noqa: E402
tests/unittest/_torch/multimodal/test_find_num_image_tokens.py (2)

98-116: Avoid redundant network downloads; reuse images or mark test as network/slow.

You download each image again in the inner loop despite already providing URLs. Either:

  • Reuse locally downloaded PIL images for both the loader and dimension checks, or
  • Mark the test as slow/external-data to reflect the cost.

Option A (reuse PIL images):

-# Prepare batch inputs with all 3 images
-prompts = ["dummy"] * len(example_images)
-media = example_images  # All image URLs
+prompts = ["dummy"] * len(example_images)
+media = [download_image(u) for u in example_images]

Then remove the per-iteration download (Lines 119-122) and instead use the already-loaded PIL image:

-# Get test image dimensions
-test_image = download_image(test_image_url)
-image_width, image_height = test_image.size
+image_width, image_height = media[image_idx].size

If default_multimodal_input_loader does not accept PIL Images, please confirm and we can instead cache the downloaded bytes and reuse them.


111-111: Fix line length to satisfy Ruff E501.

Ruff flags Line 111 as >120 chars. Wrap the call or break arguments to new lines consistently.

tensorrt_llm/_torch/models/modeling_utils.py (1)

576-577: Optional: Add explicit type annotations for registry dicts.

Improves readability and static tooling.

-MODEL_CLASS_VISION_ENCODER_MAPPING = {}
+MODEL_CLASS_VISION_ENCODER_MAPPING: Dict[str, Tuple[Type[nn.Module], Optional[Type[nn.Module]]]] = {}
tensorrt_llm/commands/serve.py (3)

391-394: Docstring punctuation (D415).

First line should end with a period.

-    """Running an OpenAI API compatible server
+    """Running an OpenAI API compatible server.

377-383: Clarify help text: not “trtllm-serve”.

This option applies to the encoder server. Adjust wording to avoid confusion.

-    "Path to a YAML file that overwrites the parameters specified by trtllm-serve."
+    "Path to a YAML file that overwrites parameters for the mm_embedding_serve command."

403-409: YAML parsing: fail fast with a clear error.

Wrap safe_load with a narrow except to surface config errors cleanly.

-    if extra_encoder_options is not None:
-        with open(extra_encoder_options, 'r') as f:
-            encoder_args_extra_dict = yaml.safe_load(f)
+    if extra_encoder_options is not None:
+        try:
+            with open(extra_encoder_options, 'r') as f:
+                encoder_args_extra_dict = yaml.safe_load(f) or {}
+        except (OSError, yaml.YAMLError) as e:
+            raise ValueError(f"Failed to load extra encoder options from {extra_encoder_options}: {e}") from e
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py (2)

68-75: Pass trust_remote_code to MultimodalEncoder for consistency.

You set trust_remote_code on LLM; do the same for the encoder to avoid model-specific import issues.

-    encoder = MultimodalEncoder(model=encoder_model_dir,
-                                max_batch_size=max_batch_size)
+    encoder = MultimodalEncoder(model=encoder_model_dir,
+                                max_batch_size=max_batch_size,
+                                trust_remote_code=True)

147-153: Wrap long assertion messages (E501).

Break into multiple lines or use parentheses to satisfy line length constraints.

Example:

-            assert ref_gen.text == test_gen.text, \
-                f"Generated text doesn't match for output {i}, generation {j}:\nReference: {ref_gen.text!r}\nTest: {test_gen.text!r}"
+            assert ref_gen.text == test_gen.text, (
+                f"Generated text mismatch for output {i}, gen {j}:\n"
+                f"Reference: {ref_gen.text!r}\nTest: {test_gen.text!r}"
+            )
tests/unittest/_torch/multimodal/test_external_embedding.py (1)

89-178: Add negative-path tests (mismatch and modality errors).

Consider adding:

  • Mismatch between number of tokens and provided image embeddings (expect ValueError or clear failure).
  • Passing multimodal_embedding without 'image' key (expect ValueError).

This hardens behavior around error handling.
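A possible skeleton for such tests, written against a stand-in fusion helper so the expected error shape is explicit (the helper name and signature are placeholders, not the real entry point):

import pytest


def _fuse_external_embeddings(num_image_tokens, image_embeds):
    """Stand-in that mimics the mismatch check a real fusion helper should perform."""
    if len(image_embeds) != num_image_tokens:
        raise ValueError("number of image embeddings does not match image tokens")
    return image_embeds


def test_embedding_count_mismatch_raises():
    with pytest.raises(ValueError):
        _fuse_external_embeddings(num_image_tokens=3, image_embeds=[0.0, 0.1])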

tensorrt_llm/serve/openai_server.py (1)

49-49: Nit: fix typo in yapf directive.

Enable is misspelled.

-# yapf: enale
+# yapf: enable
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

1628-1658: Remove unused multi_modal_data path in no-cache prep.

multi_modal_data list and associated assignments are unused after switching to multimodal_params; remove to reduce confusion.

-        multi_modal_data = []
+        # multi_modal_data path removed; using multimodal_params_list
@@
-            multimodal_embedding = request.multimodal_embedding
-            if multimodal_embedding is not None:
-                multi_modal_data.append(multimodal_embedding)
+            # multimodal_embedding path removed; external embeddings are carried via multimodal_params_list
@@
-            'inputs_embeds': None,
-            "multimodal_params": multimodal_params_list
+            'inputs_embeds': None,
+            "multimodal_params": multimodal_params_list

Also applies to: 1705-1711

tensorrt_llm/_torch/models/modeling_llava_next.py (2)

84-88: Remove unused variable num_mm_tokens (Ruff F841).

Tidy up the local to avoid warnings.

-        num_medias = num_mm_tokens = len(mm_token_positions)
+        num_medias = len(mm_token_positions)

141-148: Reflow long assert message (E501) for readability.

Split the long f-string assertion.

-        assert len(
-            fused_input_ids
-        ) == fused_length, f"Fused input_ids length {len(fused_input_ids)} should match the sum of text and multimodal embedding lengths {fused_length}"
+        assert len(fused_input_ids) == fused_length, (
+            f"Fused input_ids length {len(fused_input_ids)} should match "
+            f"the sum of text and multimodal embedding lengths {fused_length}"
+        )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7bc795 and b2cd0bb.

📒 Files selected for processing (21)
  • tensorrt_llm/__init__.py (2 hunks)
  • tensorrt_llm/_torch/model_config.py (3 hunks)
  • tensorrt_llm/_torch/models/modeling_auto.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_llava_next.py (10 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py (5 hunks)
  • tensorrt_llm/_torch/models/modeling_utils.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (8 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (2 hunks)
  • tensorrt_llm/commands/serve.py (4 hunks)
  • tensorrt_llm/executor/result.py (5 hunks)
  • tensorrt_llm/llmapi/__init__.py (1 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
  • tensorrt_llm/llmapi/mm_encoder.py (1 hunks)
  • tensorrt_llm/serve/openai_server.py (7 hunks)
  • tests/unittest/_torch/multimodal/test_external_embedding.py (1 hunks)
  • tests/unittest/_torch/multimodal/test_find_num_image_tokens.py (1 hunks)
  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
  • tensorrt_llm/llmapi/__init__.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/_torch/models/modeling_auto.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/unittest/_torch/multimodal/test_find_num_image_tokens.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tests/unittest/_torch/multimodal/test_external_embedding.py
  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/__init__.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/llmapi/mm_encoder.py
  • tensorrt_llm/serve/openai_server.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/unittest/_torch/multimodal/test_find_num_image_tokens.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tests/unittest/_torch/multimodal/test_external_embedding.py
  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/__init__.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/llmapi/mm_encoder.py
  • tensorrt_llm/serve/openai_server.py
🧠 Learnings (12)
📓 Common learnings
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/_torch/multimodal/test_find_num_image_tokens.py
  • tests/unittest/_torch/multimodal/test_external_embedding.py
  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/__init__.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/llmapi/mm_encoder.py
  • tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/unittest/_torch/multimodal/test_external_embedding.py
  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
  • tensorrt_llm/__init__.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/llmapi/mm_encoder.py
  • tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.py : When using try-except blocks in Python, limit the except to the smallest set of errors possible.

Applied to files:

  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.py : When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Applied to files:

  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/__init__.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/llmapi/mm_encoder.py
  • tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tensorrt_llm/__init__.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/llmapi/mm_encoder.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Applied to files:

  • tensorrt_llm/llmapi/mm_encoder.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{h,hpp} : Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Applied to files:

  • tensorrt_llm/llmapi/mm_encoder.py
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • tensorrt_llm/llmapi/mm_encoder.py
📚 Learning: 2025-08-09T02:04:49.580Z
Learnt from: Fridah-nv
PR: NVIDIA/TensorRT-LLM#6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.580Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.

Applied to files:

  • tensorrt_llm/llmapi/mm_encoder.py
🪛 Ruff (0.12.2)
tests/unittest/_torch/multimodal/test_find_num_image_tokens.py

111-111: Line too long (137 > 120)

(E501)

tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py

147-147: Line too long (126 > 120)

(E501)


153-153: Line too long (133 > 120)

(E501)

tensorrt_llm/_torch/models/modeling_llava_next.py

84-84: Local variable num_mm_tokens is assigned to but never used

Remove assignment to unused variable num_mm_tokens

(F841)


147-147: Line too long (152 > 120)

(E501)

tensorrt_llm/__init__.py

53-53: Module level import not at top of file

(E402)

tensorrt_llm/_torch/pyexecutor/model_engine.py

1109-1109: Line too long (158 > 120)

(E501)

tensorrt_llm/commands/serve.py

391-394: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)


414-414: Undefined name server_role

(F821)


416-416: Undefined name server_role

(F821)

tensorrt_llm/llmapi/mm_encoder.py

18-19: One-line docstring should fit on one line

Reformat to one line

(D200)


48-48: Line too long (124 > 120)

(E501)


53-53: Line too long (140 > 120)

(E501)


54-54: Line too long (141 > 120)

(E501)


103-105: 1 blank line required between summary line and description

(D205)


112-121: 1 blank line required between summary line and description

(D205)


116-116: Line too long (138 > 120)

(E501)


120-120: Line too long (148 > 120)

(E501)


162-167: 1 blank line required between summary line and description

(D205)

tensorrt_llm/serve/openai_server.py

355-355: Line too long (185 > 120)

(E501)


367-367: Line too long (132 > 120)

(E501)

🔇 Additional comments (5)
tensorrt_llm/_torch/models/modeling_utils.py (1)

576-576: LGTM: Vision encoder mapping added.

The new MODEL_CLASS_VISION_ENCODER_MAPPING is a clean extension point for encoder-only mode.

tests/unittest/_torch/multimodal/test_external_embedding.py (1)

89-178: Good coverage for external embedding fusion; assertions look solid.

The tests validate token expansion and embedding shapes for 1 and multiple images, and verify tokenizer calls. LGTM.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

1272-1279: Good: reuse recovered tensors during generation.

strip_for_generation() is used for gen requests, avoiding redundant recovery. Matches the shared-tensor learning.

tensorrt_llm/_torch/models/modeling_llava_next.py (2)

27-29: Import cleanup looks good.

Avoids ModelConfig shadowing; thanks for fixing.


466-478: Precomputed embedding fast-path is correct.

Conditionally using externally provided embeddings avoids redundant vision forward; aligns with disaggregated flow.

@chang-l
Copy link
Collaborator Author

chang-l commented Aug 9, 2025

@Superjomn @pcastonguay @Funatiq
Thanks a lot for the review! I believe I’ve addressed/responded to all comments — ready for the next round.

@chang-l
Copy link
Collaborator Author

chang-l commented Aug 9, 2025

/bot run

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
tensorrt_llm/serve/openai_server.py (1)

188-196: Health-check still calls text-path in MM ENCODER mode

health_generate() always invokes self.openai_chat, so a server started with server_role == MM_ENCODER never exercises the multimodal-encoder path.
Replace the call with openai_mm_encoder when the role is MM_ENCODER, or register a dedicated /health_generate route inside register_mm_encoder_routes.
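One way to wire the dedicated route, sketched against FastAPI's add_api_route; openai_mm_encoder and register_mm_encoder_routes come from the surrounding discussion, while the helper below is illustrative:

from fastapi import FastAPI, Request, Response


def register_mm_encoder_health_route(app: FastAPI, openai_mm_encoder,
                                     build_health_request) -> None:
    """Attach a /health_generate route that exercises the encoder path (sketch)."""

    async def health_generate(raw_request: Request) -> Response:
        # Run a minimal encoder request instead of the text-generation path.
        return await openai_mm_encoder(build_health_request(), raw_request)

    app.add_api_route("/health_generate", health_generate, methods=["GET"])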

♻️ Duplicate comments (2)
tensorrt_llm/serve/openai_server.py (2)

364-392: disaggregated_params computed then dropped

to_llm_disaggregated_params() is called but its return value is discarded, and the resulting params are not forwarded to generate_async.
Either remove the call or capture the value and pass it through:

-            # TODO: placeholder for multimodal disagg e2e
-            to_llm_disaggregated_params(request.disaggregated_params)
+            disaggregated_params = to_llm_disaggregated_params(
+                request.disaggregated_params)
 ...
-            promise = self.llm.generate_async(
-                inputs=prompt,
-            )
+            promise = self.llm.generate_async(
+                inputs=prompt,
+                disaggregated_params=disaggregated_params,
+            )

345-356: Guard against missing mm_embedding_handle and fix long line

create_mm_embedding_response() assumes promise.mm_embedding_handle is always present and well-formed, and the response construction exceeds the 120-char line limit.

+            mm_embedding_handle = getattr(promise, "mm_embedding_handle", None)
+            if not mm_embedding_handle or "tensor_size" not in mm_embedding_handle:
+                return self.create_error_response(
+                    message="Multimodal embedding handle missing in response",
+                    err_type="InternalServerError",
+                    status_code=HTTPStatus.INTERNAL_SERVER_ERROR)
             num_tokens = int(mm_embedding_handle["tensor_size"][0])
-            return ChatCompletionResponse(id=str(promise.request_id), model=self.model, choices=[ChatCompletionResponseChoice(index=0, message=ChatMessage(role="assistant", content="dummy"), mm_embedding_handle=mm_embedding_handle, finish_reason="length")], usage=UsageInfo(prompt_tokens=num_tokens, completion_tokens=1, total_tokens=num_tokens+1))
+            return ChatCompletionResponse(
+                id=str(promise.request_id),
+                model=self.model,
+                choices=[
+                    ChatCompletionResponseChoice(
+                        index=0,
+                        message=ChatMessage(role="assistant", content="dummy"),
+                        mm_embedding_handle=mm_embedding_handle,
+                        finish_reason="length",
+                    )
+                ],
+                usage=UsageInfo(
+                    prompt_tokens=num_tokens,
+                    completion_tokens=1,
+                    total_tokens=num_tokens + 1,
+                ),
+            )
🧹 Nitpick comments (3)
tests/unittest/llmapi/apps/_test_openai_mmencoder.py (1)

133-139: Hard-coded token and hidden-size may make the test brittle

tensor_size[0] == 324 and tensor_size[1] == 2048 are model-specific.
If the upstream model or tokenizer changes, the test will fail needlessly.
Consider asserting the shape rank and positivity instead, or reading the expected sizes from the model config.
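A shape-agnostic variant of the check could look like this (sketch; the tensor_size layout follows the handle used elsewhere in this PR):

def check_mm_embedding_handle(handle: dict) -> None:
    """Validate the handle shape without pinning model-specific sizes (sketch)."""
    tensor_size = handle["tensor_size"]
    assert len(tensor_size) == 2, "expected [num_tokens, hidden_size]"
    assert all(dim > 0 for dim in tensor_size), "dimensions must be positive"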

tensorrt_llm/_torch/models/modeling_llava_next.py (2)

80-83: num_mm_tokens computed but never used

num_mm_tokens = len(mm_token_positions) is dead code and trips Ruff F841.
Simply drop the assignment.


140-146: Line exceeds 120 chars

The assertion message at Line 144 is 152 chars long. Split for readability and to satisfy style checks.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2cd0bb and ca90390.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/models/modeling_llava_next.py (10 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py (6 hunks)
  • tensorrt_llm/serve/openai_server.py (7 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/unittest/_torch/multimodal/test_external_embedding.py (1 hunks)
  • tests/unittest/_torch/multimodal/test_find_num_image_tokens.py (1 hunks)
  • tests/unittest/llmapi/apps/_test_openai_mmencoder.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/unittest/_torch/multimodal/test_find_num_image_tokens.py
  • tests/unittest/_torch/multimodal/test_external_embedding.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/_test_openai_mmencoder.py
  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/serve/openai_server.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/_test_openai_mmencoder.py
  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/serve/openai_server.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/_test_openai_mmencoder.py
  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/defs/test_e2e.py
  • tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/integration/defs/test_e2e.py
  • tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/serve/openai_server.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_llava_next.py

81-81: Local variable num_mm_tokens is assigned to but never used

Remove assignment to unused variable num_mm_tokens

(F841)


144-144: Line too long (152 > 120)

(E501)

tensorrt_llm/serve/openai_server.py

354-354: Line too long (185 > 120)

(E501)


366-366: Line too long (132 > 120)

(E501)

🔇 Additional comments (1)
tests/integration/defs/test_e2e.py (1)

1512-1517: New test looks good and follows existing pattern

The added test_openai_mmencoder_example mirrors the structure of other Open-AI example tests and should integrate cleanly with the suite.

@chang-l chang-l requested a review from schetlur-nv August 9, 2025 07:28
@tensorrt-cicd
Copy link
Collaborator

PR_Github #15731 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: —post-merge

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15732 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: —post-merge

@chang-l
Copy link
Collaborator Author

chang-l commented Aug 19, 2025

/bot run --post-merge

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15735 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15735 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11827 completed with status: 'FAILURE'

@chang-l
Copy link
Collaborator Author

chang-l commented Aug 19, 2025

/bot run --post-merge

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15807 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15807 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11881 completed with status: 'FAILURE'

@chang-l
Copy link
Collaborator Author

chang-l commented Aug 19, 2025

/bot reuse-pipeline

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15822 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15822 [ reuse-pipeline ] completed with state SUCCESS
Can't reuse PR_Github #15807 with status: FAILED

@chang-l
Copy link
Collaborator Author

chang-l commented Aug 19, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15823 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15823 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11894 completed with status: 'FAILURE'

@chang-l
Copy link
Collaborator Author

chang-l commented Aug 19, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15829 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15829 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11898 completed with status: 'FAILURE'

@chang-l
Copy link
Collaborator Author

chang-l commented Aug 20, 2025

/bot skip --comment "Pre-merge CI was clean here: #6743 (comment) and there is not code changes except merging w/ main; Plus, post-merge CI runs all passed except a known triton llava timeout issue, which is not related to this PR"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15860 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15860 [ skip ] completed with state SUCCESS
Skipping testing for commit c9bedf2
