[TRTLLM-7326][feat] Add standalone multimodal encoder #6743
Conversation
📝 Walkthrough
This change introduces a complete multimodal encoder-only execution and serving pathway.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant OpenAIServer
participant MultimodalEncoder
participant PyExecutor
Client->>OpenAIServer: POST /v1/chat/completions (image/text)
OpenAIServer->>MultimodalEncoder: generate_async(multimodal_inputs)
MultimodalEncoder->>PyExecutor: run_encoder_only(inputs)
PyExecutor-->>MultimodalEncoder: mm_embeddings
MultimodalEncoder-->>OpenAIServer: mm_embedding_handle
OpenAIServer-->>Client: ChatCompletionResponse (mm_embedding_handle)
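For orientation, a minimal client-side sketch of the Client → OpenAIServer leg shown above (assuming an encoder server already listening on localhost:8000; the exact placement of mm_embedding_handle in the response follows the diagram and may differ in detail):

# Sketch of the request/response step in the diagram above.
# Assumptions: server at localhost:8000, OpenAI-compatible chat route, and a
# per-choice "mm_embedding_handle" field as sketched in this walkthrough.
import requests

payload = {
    "model": "multimodal-encoder",  # placeholder model name
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
choice = resp.json()["choices"][0]
# Encoder-only serving returns a handle to the shared embedding tensor instead of text.
print(choice.get("mm_embedding_handle"))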
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Signed-off-by: Chang Liu <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Actionable comments posted: 22
🔭 Outside diff range comments (12)
tensorrt_llm/llmapi/disagg_utils.py (1)
1-1
: Missing NVIDIA copyright header
Per coding guidelines, add the NVIDIA copyright header with the current year at the top of this file.
Apply:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+## Licensed under the Apache License, Version 2.0 (the "License");
+## you may not use this file except in compliance with the License.
+## You may obtain a copy of the License at
+##
+## http://www.apache.org/licenses/LICENSE-2.0
+##
+## Unless required by applicable law or agreed to in writing, software
+## distributed under the License is distributed on an "AS IS" BASIS,
+## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+## See the License for the specific language governing permissions and
+## limitations under the License.
+
 import logging
tensorrt_llm/llmapi/llm_args.py (1)
1-1
: Missing NVIDIA copyright header
Please add the standard NVIDIA header at the top of this file as required by the coding guidelines.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+## Licensed under the Apache License, Version 2.0 (the "License");
+## you may not use this file except in compliance with the License.
+## You may obtain a copy of the License at
+##
+## http://www.apache.org/licenses/LICENSE-2.0
+##
+## Unless required by applicable law or agreed to in writing, software
+## distributed under the License is distributed on an "AS IS" BASIS,
+## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+## See the License for the specific language governing permissions and
+## limitations under the License.
+
 import copy
tensorrt_llm/llmapi/__init__.py (1)
1-1
: Missing NVIDIA copyright header
Add the standard NVIDIA header at the top of this file.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+## Licensed under the Apache License, Version 2.0 (the "License");
+## you may not use this file except in compliance with the License.
+## You may obtain a copy of the License at
+##
+## http://www.apache.org/licenses/LICENSE-2.0
+##
+## Unless required by applicable law or agreed to in writing, software
+## distributed under the License is distributed on an "AS IS" BASIS,
+## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+## See the License for the specific language governing permissions and
+## limitations under the License.
+
 from ..disaggregated_params import DisaggregatedParams
tensorrt_llm/serve/openai_protocol.py (1)
1-3
: Add NVIDIA SPDX header (production file)
Per project guidelines, production .py files must start with the NVIDIA copyright:
Add at file top:
# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0
tensorrt_llm/_torch/pyexecutor/_util.py (1)
589-600
: Ensure mm_encoder_only short-circuits before TRTLLMSampler to avoid wrong sampler selection
If enable_trtllm_sampler is True, the current order returns TRTLLMSampler before honoring mm_encoder_only. Encoder-only paths should bypass decoding/sampling entirely.
Apply this diff to short-circuit on mm_encoder_only before TRTLLMSampler:
@@
     if engine.spec_config is not None and engine.spec_config.spec_dec_mode.has_spec_decoder():
         return get_spec_decoder(sampler_args, engine.spec_config)
-    if pytorch_backend_config.enable_trtllm_sampler:
+    # Short-circuit for multimodal encoder-only execution: no decoding/sampling.
+    if executor_config.mm_encoder_only:
+        # NOTE: handle model outputs specially for mm encoder executor/engine
+        return EarlyStopWithMMResult()
+    if pytorch_backend_config.enable_trtllm_sampler:
         decoding_mode = get_decoding_mode(executor_config)
         return TRTLLMSampler(executor_config, engine.model, engine.dtype,
                              mapping, decoding_mode,
                              pytorch_backend_config.disable_overlap_scheduler)
-    if executor_config.mm_encoder_only:
-        # NOTE: handle model outputs specially for mm encoder executor/engine
-        return EarlyStopWithMMResult()
Optionally, add a warning if both mm_encoder_only and enable_trtllm_sampler are set.
tensorrt_llm/_torch/models/modeling_auto.py (1)
1-1
: Missing required NVIDIA copyright header.
Add header per repo compliance.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. # See LICENSE for licensing terms.
tensorrt_llm/executor/result.py (1)
1-1
: Missing required NVIDIA copyright header.
Add header per repo compliance.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. # See LICENSE for licensing terms.
tensorrt_llm/_torch/models/modeling_utils.py (1)
1-1
: Missing required NVIDIA copyright header.
Add header per repo compliance.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. # See LICENSE for licensing terms.
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
82-92
: AttributeError: SampleStateTensors lacks mm_embeddings
EarlyStopSampler.update_requests accesses state.host.mm_embeddings, but SampleStateTensors defines only new_tokens, logits, and log_probs. This will crash when py_return_mm_embeddings is true.
Minimal fix:
class SampleStateTensors: new_tokens: torch.Tensor logits: torch.Tensor | None = None log_probs: torch.Tensor | None = None + mm_embeddings: list[torch.Tensor] | None = None
and populate it in sample_async:
- host = SampleStateTensors(logits=model_outputs['logits'], - new_tokens=torch.empty(0)) + host = SampleStateTensors( + logits=model_outputs.get('logits'), + new_tokens=torch.empty(0), + mm_embeddings=model_outputs.get('mm_embeddings'), + )
Alternatively, drop the mm_embeddings branch from this sampler if embeddings are guaranteed to be handled only by EarlyStopWithMMResult.
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
1-1
: Missing NVIDIA copyright header (required).
All OSS source files must include the NVIDIA copyright header with the current year. Please add it.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/_torch/models/modeling_llava_next.py (2)
318-321
: Detect GPU properties for the correct device index (not hard-coded 0).
Current code queries device 0, which may not match self.device (cuda:{rank}). Use the actual device index to avoid incorrect SM detection on multi-GPU systems.
- prop = torch.cuda.get_device_properties(0) + device_idx = int(str(self.device).split(":")[1]) + prop = torch.cuda.get_device_properties(device_idx)
486-489
: self.dtype is undefined in LlavaNextModel; fix logging by defining it.
This will raise at runtime. Define self.dtype before logging or log model_dtype only.
- self.model_dtype = getattr(config.text_config, "torch_dtype", - torch.float16) - logger.info(f"{self.dtype=} {self.model_dtype=}") + self.model_dtype = getattr(config.text_config, "torch_dtype", torch.float16) + self.dtype = self.model_dtype # Keep a consistent dtype attribute for logging + logger.info(f"{self.dtype=} {self.model_dtype=}")
🧹 Nitpick comments (40)
tensorrt_llm/llmapi/disagg_utils.py (1)
35-39
: Type mismatch: RouterConfig.server_role should be Optional[ServerRole]
The field is annotated as ServerRole but defaulted to None. Recommend Optional for correctness and clearer intent.
- server_role: ServerRole = None + from typing import Optional + server_role: Optional[ServerRole] = None
tensorrt_llm/llmapi/llm_args.py (1)
1943-1945
: Clarify allowed values in docs and CLI help
Consider updating the relevant docstrings/help text where load_format is surfaced (e.g., TorchLlmArgs.load_format description and CLI) to mention 'VISION_ONLY' explicitly and its scope (encoder-only multimodal mode).
tensorrt_llm/__init__.py (2)
54-54
: Silence Ruff E402 for intentional late import order
This file imports mid-module by design (xgrammar must load before the rest). Ruff flags the new import with E402. Add a targeted noqa to avoid new lint noise without reordering.
-from .llmapi import MultimodalEncoder +from .llmapi import MultimodalEncoder # noqa: E402
53-56
: Minor nit: duplicate import of LlmArgs
LlmArgs is imported both from .llmapi and .llmapi.llm_args. While pre-existing, it's confusing. Consider keeping only one source (prefer the specific .llmapi.llm_args).
tensorrt_llm/_torch/model_config.py (1)
99-101
: Add mm_encoder_only flag to ModelConfig — OK
This cleanly expresses encoder-only mode at config level.
Consider documenting this attribute in the class docstring for Sphinx (Google style) so it surfaces in generated docs.
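A sketch of the kind of inline attribute documentation meant here (the surrounding class is a trimmed stand-in, not the real ModelConfig):

# Illustrative stand-in showing a Google-style, Sphinx-friendly attribute doc.
from dataclasses import dataclass


@dataclass
class ModelConfigSketch:
    """Trimmed stand-in for tensorrt_llm._torch.model_config.ModelConfig.

    Attributes:
        mm_encoder_only: If True, load and run only the multimodal (vision)
            encoder; decoding/sampling paths are skipped.
    """

    mm_encoder_only: bool = False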
tensorrt_llm/_torch/pyexecutor/config.py (2)
102-104
: Expose mm_encoder_only in PyTorchConfig — OK
Field addition aligns with the encoder-only pathway.
Add a short field docstring to clarify behavior for docs tooling.
133-135
: update_executor_config signature extension — OK
Accepts mm_encoder_only and checkpoint_loader cleanly.
For consistency, if a PyTorchConfig is supplied, consider mirroring:
if executor_config.pytorch_backend_config is not None: executor_config.pytorch_backend_config.mm_encoder_only = bool(mm_encoder_only)
This avoids relying solely on later _mangle logic.
tensorrt_llm/_torch/pyexecutor/llm_request.py (3)
176-176
: Type-annotate the new member to clarify intent and help static analysis.
Initialize with an explicit Optional type to match the property signature.
- self._mm_embeddings = None + self._mm_embeddings: Optional[Dict[str, Any]] = None
192-194
: Validate input tensor and avoid accidental CPU/GPU mismatches.
Guard against non-Tensor inputs and provide a clear error. Also document the expected device (CPU/GPU ok, SharedTensorContainer handles it) to reduce misuse.
def append_mm_embeddings(self, mm_embeddings: torch.Tensor): - self._mm_embeddings = SharedTensorContainer.from_tensor(mm_embeddings).dump_to_dict() + if not isinstance(mm_embeddings, torch.Tensor): + raise TypeError(f"mm_embeddings must be a torch.Tensor, got {type(mm_embeddings)}") + # Note: SharedTensorContainer accepts both CPU and CUDA tensors. + self._mm_embeddings = ( + tllm_shared_tensor.SharedTensorContainer + .from_tensor(mm_embeddings) + .dump_to_dict() + )
232-235
: Add a short docstring to mm_embedding_handle for external consumers.
Keep API self-documenting; this property is part of the outward interface accessed via LlmResult.
@property def mm_embedding_handle(self) -> Dict[str, Any] | None: - return self._mm_embeddings + """A serialized handle (dict) to multimodal embeddings stored in shared memory. + + Consumers should recover the tensor from this handle using the appropriate + SharedTensorContainer recovery routine (context phase only). The handle is + opaque and lifetime-bound to the underlying shared resource. + """ + return self._mm_embeddingstensorrt_llm/_torch/models/modeling_auto.py (1)
48-69
: Add docstring and explicit return type; minor simplification.
Clarify purpose and return type. Keep instantiation tight.
-class AutoModelForMultimodalEncoder(Generic[TModel, TConfig]): +class AutoModelForMultimodalEncoder(Generic[TModel, TConfig]): - @staticmethod - def from_config( - config: ModelConfig[TConfig], - ): + @staticmethod + def from_config( + config: ModelConfig[TConfig], + ) -> nn.Module: + """Factory for vision-only encoder models registered via register_vision_encoder(). + + Args: + config: The unified ModelConfig carrying pretrained_config.architectures[0]. + Returns: + Instantiated vision encoder nn.Module for the given architecture. + Raises: + ValueError: if the architecture is unknown/not registered. + """ model_arch = config.pretrained_config.architectures[0] # Direct lookup using architecture name vision_encoder_info = MODEL_CLASS_VISION_ENCODER_MAPPING.get(model_arch) if vision_encoder_info is None: raise ValueError( f"Unknown architecture for AutoModelForMultimodalEncoder: {model_arch}" ) vision_encoder_cls, vlm_base_model = vision_encoder_info - if vlm_base_model is None: - model = vision_encoder_cls(config) - else: - model = vision_encoder_cls(config, vlm_base_model) - - return model + return (vision_encoder_cls(config) if vlm_base_model is None + else vision_encoder_cls(config, vlm_base_model))Note: add the following import at file top if not present:
from torch import nn
tensorrt_llm/executor/result.py (2)
201-204
: Add a concise docstring for mm_embedding_handle.
Clarify that this is a handle for downstream recovery.
@property def mm_embedding_handle(self) -> Optional[Dict[str, Any]]: - return self._mm_embedding_handle + """Serialized shared-tensor handle for multimodal embeddings (if present).""" + return self._mm_embedding_handle
567-568
: Consider omitting mm_embedding_handle from repr to avoid noisy logs.
Handles may be verbose; including them in repr can bloat logs/telemetry.
- "context_logits", "mm_embedding_handle" + "context_logits"If you want it visible, consider summarized output (e.g., keys only) by customizing repr for this field.
tensorrt_llm/_torch/models/modeling_utils.py (1)
576-577
: Add a precise type annotation for MODEL_CLASS_VISION_ENCODER_MAPPING.
Improves readability and static checks.
-MODEL_CLASS_VISION_ENCODER_MAPPING = {} +from typing import Tuple +MODEL_CLASS_VISION_ENCODER_MAPPING: Dict[str, Tuple[Type[nn.Module], Optional[Type[nn.Module]]]] = {}tensorrt_llm/commands/serve.py (1)
390-397
: Docstring nit – add terminating period
The first line of the docstring should end with a period to satisfy D415.- """Running an OpenAI API compatible server + """Running an OpenAI-API-compatible server.tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py (3)
46-50
: Unused variablemodel_name
model_name
is assigned but never used – remove it or use it in log/skip messages to silence F841.
12-16
: Tests fetch images from the Internet – flakey
Relying on remote URLs makes the unit test non-deterministic and slow/offline-unfriendly. Ship small test images with the repo or usepytest.skip
when offline.
136-170
: Very long line & style issues
Several lines exceed 120 chars. Wrap them to keep within the guideline and improve readability.tensorrt_llm/_torch/pyexecutor/sampler.py (1)
138-140
: Line > 120 chars – wrap to satisfy project style.tensorrt_llm/llmapi/mm_encoder.py (9)
5-5
: Follow namespace import guideline.Per coding guidelines, prefer keeping module namespace instead of importing individual symbols.
-from .llm import BaseLLM, _TorchLLM, RequestOutput +from . import llm as llm_modThen update references:
class MultimodalEncoder(_TorchLLM):
→class MultimodalEncoder(llm_mod._TorchLLM):
BaseLLM._build_model(self)
→llm_mod.BaseLLM._build_model(self)
- Type hints:
RequestOutput
→llm_mod.RequestOutput
If the rest of the codebase commonly uses direct imports here, consider deferring to maintain consistency, but the guideline prefers namespaced imports.
16-18
: Fix one-line docstring format (D200).Keep one-line docstrings on a single line.
-class MultimodalEncoder(_TorchLLM): - """MultimodalEncoder class is the main class for running a multimodal encoder model using PyTorch backend. -""" +class MultimodalEncoder(llm_mod._TorchLLM): + """MultimodalEncoder runs a multimodal encoder model using the PyTorch backend."""
30-34
: Style: argument spacing.Nit: follow PEP8 spacing for kwargs.
- dtype = dtype, + dtype=dtype,
40-46
: Wrap long comments (E501) and clarify.Break long lines to <=120 chars and keep comments concise.
- # Tokenizer loading should be after calling model_loader(), since model_loader() may download the model from HF hub. - # It should also be before bindings ExecutorConfig, which may depend on tokenizer info. + # Load tokenizer after model_loader() (which may download from HF hub), + # and before binding ExecutorConfig (which may depend on tokenizer info).
58-59
: Optional: line length and readability.Break the long assignment.
- kwargs[ - "batching_type"] = self.args.batching_type or tllm.BatchingType.INFLIGHT + kwargs["batching_type"] = ( + self.args.batching_type or tllm.BatchingType.INFLIGHT + )
83-93
: Confirmis_llm_executor
for mm-encoder-only.Setting
is_llm_executor=True
may affect executor behavior (metrics, routing). For encoder-only, this might need to beFalse
. Please confirm.- is_llm_executor=True, # TODO: check if this is correct or needed + # Encoder-only executor; verify whether LLM-specific paths should be disabled. + is_llm_executor=False,If other code paths rely on this being True, keep it and document why.
95-99
: Docstring spacing (D205).Add a blank line between summary and description.
- """Validate that users don't pass LLM-specific arguments when using MultimodalEncoder (PyTorch). - Placeholder for now. - """ + """Validate that users don't pass LLM-specific arguments when using MultimodalEncoder (PyTorch). + + Placeholder for now. + """
105-114
: Docstring length and spacing (E501, D205).Wrap long lines and add a blank line between summary and details.
- """Generate output for the given prompts in the synchronous mode. - Synchronous generation accepts either single prompt or batched prompts. + """Generate output for the given prompts in synchronous mode. + + Synchronous generation accepts either a single prompt or batched prompts. @@ - inputs (tensorrt_llm.inputs.data.PromptInputs, Sequence[tensorrt_llm.inputs.data.PromptInputs]): The prompt text or token ids. - It can be single prompt or batched prompts. + inputs (PromptInputs | Sequence[PromptInputs]): The prompt text or token ids. + It can be a single prompt or batched prompts. @@ - Union[tensorrt_llm.llmapi.RequestOutput, List[tensorrt_llm.llmapi.RequestOutput]]: The output data of the completion request to the LLM. + Union[llm_mod.RequestOutput, List[llm_mod.RequestOutput]]: The outputs of the encoder requests.
148-160
: Docstring spacing (D205).Add a blank line between summary and the rest. Also align the return description.
- """Generate output for the given multimodal request in the asynchronous mode. - Asynchronous generation accepts single multimodal request only. + """Generate output for the given multimodal request in asynchronous mode. + + Asynchronous generation accepts a single multimodal request only. @@ - Future that resolves to tensorrt_llm.llmapi.RequestOutput containing mm_embeddings + A future that resolves to llm_mod.RequestOutput containing mm_embeddings.tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1646-1656
: Prefer MultimodalParams over raw embeddings; remove now-unused multi_modal_dataThe new MultimodalParams flow is good. It appears multi_modal_data is now redundant in this function. Removing it avoids confusion and potential linter warnings.
Consider deleting the multi_modal_data list and related branches in this function.
tensorrt_llm/serve/openai_server.py (5)
10-10
: Missing NVIDIA copyright header (required for OSS source files).Per repository guidelines, add the NVIDIA copyright header with the current year.
Example header to add at the top of the file:
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. + # Licensed under the Apache License, Version 2.0 (the "License"); + # you may not use this file except in compliance with the License. + # You may obtain a copy of the License at + # http://www.apache.org/licenses/LICENSE-2.0 + # Unless required by applicable law or agreed to in writing, software + # distributed under the License is distributed on an "AS IS" BASIS, + # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + # See the License for the specific language governing permissions and + # limitations under the License.
45-45
: Duplicate import of UsageInfo from openai_protocol.UsageInfo is already imported on Line 39. Remove the duplicate to avoid shadowing/readability issues.
-from tensorrt_llm.serve.openai_protocol import ChatMessage, ChatCompletionResponseChoice, UsageInfo +from tensorrt_llm.serve.openai_protocol import ChatMessage, ChatCompletionResponseChoice
160-170
: MM-encoder route set looks right; consider parity and health_generate behavior.
- Parity: Consider whether /kv_cache_events should be exposed for MM encoder mode (optional).
- Health: As noted, ensure health_generate uses the MM encoder path to avoid calling text generation.
366-383
: Fix long line and improve readability of parse_chat_messages_coroutines call.Wrap to respect line length (Ruff E501).
- conversation, mm_coroutines, mm_placeholder_counts = parse_chat_messages_coroutines(request.messages, self.model_config) + ( + conversation, + mm_coroutines, + mm_placeholder_counts, + ) = parse_chat_messages_coroutines( + request.messages, + self.model_config, + )
343-344
: Explicitly reject streaming for MM encoder endpoint (if unsupported).If MM encoder doesn’t support streaming, return a 400 early when request.stream is True to avoid client confusion.
async def openai_mm_encoder(self, request: ChatCompletionRequest, raw_request: Request) -> Response: + if getattr(request, "stream", False): + return self.create_error_response( + message="stream=True is not supported for multimodal encoder", + err_type="BadRequestError", + status_code=HTTPStatus.BAD_REQUEST, + )tensorrt_llm/_torch/models/modeling_llava_next.py (6)
3-3
: Missing NVIDIA copyright header (required for OSS source files).Please add the NVIDIA copyright header with the current year at the top of this file.
Example header:
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. + # Licensed under the Apache License, Version 2.0 (the "License"); + # you may not use this file except in compliance with the License. + # You may obtain a copy of the License at + # http://www.apache.org/licenses/LICENSE-2.0 + # Unless required by applicable law or agreed to in writing, software + # distributed under the License is distributed on an "AS IS" BASIS, + # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + # See the License for the specific language governing permissions and + # limitations under the License.
132-133
: Remove unused variable assignment (Ruff F841).num_mm_tokens is never used. Keep num_medias only.
- num_medias = num_mm_tokens = len(mm_token_positions) + num_medias = len(mm_token_positions)
194-196
: Break long assertion message (Ruff E501) and improve readability.- ) == fused_length, f"Fused input_ids length {len(fused_input_ids)} should match the sum of text and multimodal embedding lengths {fused_length}" + ) == fused_length, ( + f"Fused input_ids length {len(fused_input_ids)} should match the " + f"sum of text and multimodal embedding lengths {fused_length}" + )
237-242
: Ensure multimodal embeddings dtype matches text model dtype.Stacked mm_features should be cast to the text model dtype to avoid downstream dtype mismatches in fuse_input_embeds.
- mm_features = torch.stack(multimodal_embedding['image']) + mm_features = torch.stack(multimodal_embedding['image']) + target_dtype = getattr(self.model_config.text_config, "torch_dtype", mm_features.dtype) + mm_features = mm_features.to(target_dtype)
521-527
: Style nit: double space after 'if'.Minor readability nit. Also consider an early return if multimodal_embedding is not provided.
- if multimodal_params[0].multimodal_data.get("multimodal_embedding", None) is not None: + if multimodal_params[0].multimodal_data.get("multimodal_embedding", None) is not None:
339-342
: Docstrings for new helpers (post_config) and input processor methods.Consider adding short Google-style docstrings for post_config, _postprocess, and get_num_tokens_per_image to aid maintainability and Sphinx docs.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
tensorrt_llm/__init__.py (2 hunks)
tensorrt_llm/_torch/model_config.py (3 hunks)
tensorrt_llm/_torch/models/modeling_auto.py (3 hunks)
tensorrt_llm/_torch/models/modeling_llava_next.py (10 hunks)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (5 hunks)
tensorrt_llm/_torch/models/modeling_utils.py (2 hunks)
tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
tensorrt_llm/_torch/pyexecutor/config.py (4 hunks)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (8 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
tensorrt_llm/commands/serve.py (4 hunks)
tensorrt_llm/executor/result.py (5 hunks)
tensorrt_llm/llmapi/__init__.py (2 hunks)
tensorrt_llm/llmapi/disagg_utils.py (1 hunks)
tensorrt_llm/llmapi/llm.py (2 hunks)
tensorrt_llm/llmapi/llm_args.py (1 hunks)
tensorrt_llm/llmapi/mm_encoder.py (1 hunks)
tensorrt_llm/serve/openai_protocol.py (1 hunks)
tensorrt_llm/serve/openai_server.py (7 hunks)
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/__init__.py
tensorrt_llm/llmapi/__init__.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/llmapi/llm.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/serve/openai_protocol.py
tensorrt_llm/executor/result.py
tensorrt_llm/_torch/pyexecutor/config.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/model_config.py
tensorrt_llm/_torch/models/modeling_utils.py
tensorrt_llm/llmapi/mm_encoder.py
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
tensorrt_llm/_torch/models/modeling_auto.py
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/commands/serve.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
tensorrt_llm/serve/openai_server.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/__init__.py
tensorrt_llm/llmapi/__init__.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/llmapi/llm.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/serve/openai_protocol.py
tensorrt_llm/executor/result.py
tensorrt_llm/_torch/pyexecutor/config.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/model_config.py
tensorrt_llm/_torch/models/modeling_utils.py
tensorrt_llm/llmapi/mm_encoder.py
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
tensorrt_llm/_torch/models/modeling_auto.py
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/commands/serve.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
tensorrt_llm/serve/openai_server.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/__init__.py
tensorrt_llm/llmapi/__init__.py
tensorrt_llm/llmapi/llm.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/model_config.py
tensorrt_llm/llmapi/mm_encoder.py
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
tensorrt_llm/_torch/models/modeling_auto.py
tensorrt_llm/commands/serve.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/__init__.py
tensorrt_llm/commands/serve.py
tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/__init__.py
tensorrt_llm/llmapi/mm_encoder.py
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
tensorrt_llm/commands/serve.py
tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tensorrt_llm/__init__.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/commands/serve.py
tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🪛 Ruff (0.12.2)
tensorrt_llm/__init__.py
54-54: Module level import not at top of file
(E402)
tensorrt_llm/_torch/pyexecutor/sampler.py
138-138: Line too long (132 > 120)
(E501)
tensorrt_llm/llmapi/mm_encoder.py
17-18: One-line docstring should fit on one line
Reformat to one line
(D200)
40-40: Line too long (124 > 120)
(E501)
45-45: Line too long (140 > 120)
(E501)
46-46: Line too long (141 > 120)
(E501)
95-97: 1 blank line required between summary line and description
(D205)
105-114: 1 blank line required between summary line and description
(D205)
109-109: Line too long (138 > 120)
(E501)
113-113: Line too long (148 > 120)
(E501)
155-160: 1 blank line required between summary line and description
(D205)
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
37-37: Blank line contains whitespace
Remove whitespace from blank line
(W293)
46-46: Local variable model_name
is assigned to but never used
Remove assignment to unused variable model_name
(F841)
136-136: Line too long (135 > 120)
(E501)
159-159: Line too long (130 > 120)
(E501)
164-164: Line too long (137 > 120)
(E501)
tensorrt_llm/commands/serve.py
390-393: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
413-413: Undefined name server_role
(F821)
415-415: Undefined name server_role
(F821)
tensorrt_llm/_torch/pyexecutor/model_engine.py
1107-1107: Line too long (158 > 120)
(E501)
2275-2275: Line too long (127 > 120)
(E501)
tensorrt_llm/_torch/models/modeling_llava_next.py
27-27: Redefinition of unused ModelConfig
from line 23
(F811)
132-132: Local variable num_mm_tokens
is assigned to but never used
Remove assignment to unused variable num_mm_tokens
(F841)
195-195: Line too long (152 > 120)
(E501)
tensorrt_llm/serve/openai_server.py
354-354: Line too long (185 > 120)
(E501)
364-364: Local variable disaggregated_params
is assigned to but never used
Remove assignment to unused variable disaggregated_params
(F841)
366-366: Line too long (132 > 120)
(E501)
🔇 Additional comments (24)
tensorrt_llm/llmapi/disagg_utils.py (1)
22-22
: New ServerRole.MM_ENCODER: ensure all dispatch paths handle it
Addition looks correct (unique value). Please verify that all consumers (routing, serving, executor config) that switch on ServerRole have explicit handling for MM_ENCODER to avoid falling into default/legacy CONTEXT/GENERATION branches.
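A hedged sketch of the explicit handling being asked for (CONTEXT/GENERATION members are referenced in this comment; the import path follows this PR's file layout and the handler names are placeholders):

# Sketch only: explicit per-role dispatch so MM_ENCODER never falls through to
# a CONTEXT/GENERATION default branch. Handler names are placeholders.
from tensorrt_llm.llmapi.disagg_utils import ServerRole


def pick_handler(role: ServerRole) -> str:
    if role is ServerRole.MM_ENCODER:
        return "mm_encoder_handler"   # placeholder
    if role is ServerRole.CONTEXT:
        return "context_handler"      # placeholder
    if role is ServerRole.GENERATION:
        return "generation_handler"   # placeholder
    raise ValueError(f"Unhandled server role: {role!r}")


print(pick_handler(ServerRole.MM_ENCODER))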
tensorrt_llm/llmapi/llm_args.py (1)
1943-1945
: Add LoadFormat.VISION_ONLY – integration looks good
VISION_ONLY integrates cleanly with TorchLlmArgs.convert_load_format (string-to-enum). Ensure all intended call sites accept this value and that non-PyTorch backends ignore or reject it as designed.
tensorrt_llm/llmapi/__init__.py (1)
24-25
: Public API exposure looks good
Exporting MultimodalEncoder via __all__ is correct and aligns with the new feature.
tensorrt_llm/__init__.py (1)
107-107
: Export looks good
Adding MultimodalEncoder to __all__ is correct and keeps the public API consistent.
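A hedged usage sketch of the exported class (constructor arguments mirror the tests in this PR; the multimodal input dict layout is assumed for illustration):

# Sketch of the Python-side API exported here.
from tensorrt_llm import MultimodalEncoder

encoder = MultimodalEncoder(model="/path/to/vlm_checkpoint",  # placeholder path
                            max_batch_size=4)
outputs = encoder.generate([{
    "prompt": "Describe the image.",
    "multi_modal_data": {"image": ["https://example.com/cat.png"]},  # assumed key names
}])
# Each output carries a handle to the encoder embeddings rather than generated text.
print(outputs[0].mm_embedding_handle)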
tensorrt_llm/llmapi/llm.py (1)
56-56
: Docstring addition looks good
Clear description of mm_embedding_handle and aligns with new MM encoder flow.
tensorrt_llm/_torch/pyexecutor/_util.py (1)
33-33
: Import for multimodal encoder sampler result looks good
Importing EarlyStopWithMMResult is necessary for the encoder-only sampler flow.
tensorrt_llm/_torch/model_config.py (2)
121-121
: Correctly plumbs mm_encoder_only into generation classification
Passing mm_encoder_only into is_generation_model ensures encoder-only models are treated as non-generation.
170-177
: All is_generation_model calls updated with mm_encoder_only
I searched across the codebase; the only invocation, in tensorrt_llm/_torch/model_config.py, already passes the new parameter. No other call sites exist. Everything is correctly updated.
- tensorrt_llm/_torch/model_config.py:120–121 now uses self.is_generation_model(self.pretrained_config.architectures, mm_encoder_only=self.mm_encoder_only)
tensorrt_llm/_torch/pyexecutor/config.py (2)
118-119
: Include mm_encoder_only in dynamic executor fields — OK
Ensures dynamic assignment is permitted.
148-149
: Propagating mm_encoder_only to executor_config — OK
This is required for downstream checks (e.g., sampler and engine behavior).
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
239-239
: LGTM: LlmResult exposes mm_embedding_handle via py_result_properties.
This enables transparent access and aligns with executor/result plumbing.
1-1
: Ignore outdated shared tensor recovery suggestion
After searching the entire codebase, there are no occurrences of from_shared_tensor() — only a single strip_for_generation() call in model_engine.py. The original comment about verifying shared-tensor recovery alignment with the org learning no longer applies and can be disregarded.
Likely an incorrect or invalid review comment.
tensorrt_llm/executor/result.py (2)
164-165
: LGTM: Store optional multimodal embedding handle on GenerationResultBase.Initialization aligns with the new property; no functional issues.
339-341
: LGTM: Propagate handle from LlmResult to GenerationResultBase.This keeps the Python-side result in sync with executor results.
tensorrt_llm/_torch/models/modeling_qwen2vl.py (4)
6-6
: Import aliasing is fine here.
Using import torch.nn as nn maintains the namespace and is consistent with guidelines.
25-25
: Vision encoder registration import looks correct.
The register_vision_encoder import aligns with the new registration framework and enables AutoModel to resolve the encoder correctly.
654-654
: Decorator registration for Qwen2VL vision encoder looks correct.This ties the VLM base model to the vision encoder via the registry, enabling encoder-only instantiation.
666-666
: Decorator registration for Qwen2_5_VL vision encoder looks correct.Registration aligns with the 2.5 variant as well.
tensorrt_llm/llmapi/mm_encoder.py (1)
60-82
: Verify mm-encoder-only execution wiring
I've confirmed that:
- mm_encoder_only=True is correctly passed into update_executor_config.
- In _util.py (around lines 597–600), the sampler is switched to EarlyStopWithMMResult when mm_encoder_only is enabled.
- In model_engine.py (around lines 1270–1274), multimodal_params.strip_for_generation() is invoked before the vision-encoder forward pass.
I was unable to locate any calls to from_shared_tensor() in the context path. Per the multimodal pipeline design, you should recover shared tensors with from_shared_tensor() during context requests, then use strip_for_generation() on subsequent generation requests to avoid redundant recovery. Please verify that:
- Context-phase code invokes MultimodalParams.from_shared_tensor() to recover shared tensors.
- Generation-phase code continues to call strip_for_generation() as shown; a minimal sketch of the recover-then-strip flow follows below.
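For reference, a minimal self-contained sketch of that recover-then-strip flow (the class below is a toy stand-in, not the real MultimodalParams/SharedTensorContainer):

# Simplified stand-in illustrating the context-phase recover / generation-phase
# strip protocol discussed above. Real code would use tensorrt_llm's
# SharedTensorContainer and MultimodalParams instead of this toy class.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

import torch


@dataclass
class ToyMultimodalParams:
    multimodal_data: Dict[str, Any] = field(default_factory=dict)
    _recovered: Optional[torch.Tensor] = None

    def from_shared_tensor(self) -> None:
        # Context phase: materialize the tensor from its serialized handle once.
        handle = self.multimodal_data.get("multimodal_embedding_handle")
        if handle is not None and self._recovered is None:
            self._recovered = torch.empty(handle["tensor_size"])  # placeholder recovery

    def strip_for_generation(self) -> None:
        # Generation phase: drop the bulky handle, keep the recovered tensor.
        self.multimodal_data.pop("multimodal_embedding_handle", None)


params = ToyMultimodalParams(
    multimodal_data={"multimodal_embedding_handle": {"tensor_size": [128, 4096]}})
params.from_shared_tensor()      # context request: recover once
params.strip_for_generation()    # generation requests: reuse, no re-recovery
print(params._recovered.shape)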
tensorrt_llm/_torch/pyexecutor/model_engine.py (5)
1014-1015
: Propagate mm_encoder_only into load_config — LGTMPassing mm_encoder_only down to the loader is correct and keeps config as the source of truth.
1629-1630
: Initialize multimodal_params_list — LGTMSwitching to MultimodalParams as the transport is the right direction for no-cache path.
1657-1658
: Set py_batch_idx for context requests — LGTMKeeps batch mapping consistent across paths.
1710-1711
: Inputs now carry "multimodal_params" — LGTMConsistent with the cached path; downstream code can consume a uniform key.
2156-2161
: Unconditional mm_encoder_only dispatch before KV cache check
The mm_encoder_only branch is currently gated by kv_cache_manager is None (lines 2151–2161 in model_engine.py). If a KV cache manager is ever provisioned for an encoder-only model, the multimodal encoder-only path will be skipped. We should pull the mm_encoder_only check to the top and always use the no-cache prep in this mode.
Locations to update:
- tensorrt_llm/_torch/pyexecutor/model_engine.py around line 2150
Proposed diff:
- if kv_cache_manager is None: - inputs, gather_ids = self._prepare_tp_inputs_no_cache( - scheduled_requests, attn_metadata, spec_metadata) - with MoeLoadBalancerIterContext(moe_load_balancer): - if self.pytorch_backend_config.mm_encoder_only: - return self._forward_step_mm_encoder_only(inputs, scheduled_requests) - else: - return self._forward_step(inputs, gather_ids, gather_context_logits) - with self._maybe_pad_batch(scheduled_requests, kv_cache_manager, - spec_resource_manager) as scheduled_requests: + # Handle multimodal encoder-only mode first (always no-cache) + if self.pytorch_backend_config.mm_encoder_only: + inputs, gather_ids = self._prepare_tp_inputs_no_cache( + scheduled_requests, attn_metadata, spec_metadata) + with MoeLoadBalancerIterContext(moe_load_balancer): + return self._forward_step_mm_encoder_only(inputs, scheduled_requests) + if kv_cache_manager is None: + inputs, gather_ids = self._prepare_tp_inputs_no_cache( + scheduled_requests, attn_metadata, spec_metadata) + with self._maybe_pad_batch(scheduled_requests, kv_cache_manager, + spec_resource_manager) as scheduled_requests:Please verify whether
resource_manager.get_resource_manager(...) can ever return a non-None KV cache manager when mm_encoder_only is enabled, and adjust accordingly.
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu <[email protected]>
Actionable comments posted: 6
🔭 Outside diff range comments (2)
tensorrt_llm/__init__.py (1)
112-112
: Exported ‘KvCacheConfig’ is not imported here ⇒ broken re-export.
__all__ lists KvCacheConfig, but this module doesn’t import it, so from tensorrt_llm import KvCacheConfig will likely fail.
Fix by importing KvCacheConfig alongside the existing llmapi import:
-from .llmapi import LLM, MultimodalEncoder # noqa: E402 +from .llmapi import LLM, MultimodalEncoder, KvCacheConfig # noqa: E402
tensorrt_llm/serve/openai_server.py (1)
85-98
: Fix potential None dereference for server_role in lifespan metadata.
server_role is Optional; accessing server_role.name can raise when None. Use self.server_role and guard.
- "server_role": server_role.name, + "server_role": (self.server_role.name if self.server_role else None),
♻️ Duplicate comments (10)
tensorrt_llm/_torch/models/modeling_utils.py (1)
592-621
: LGTM: Decorator docstring + fail-fast behavior implemented.The for/else with a clear ValueError prevents silent misregistration. This aligns with prior review feedback.
tensorrt_llm/commands/serve.py (1)
413-421
: Fix NameError: server_role is undefined in serve_encoder.
This block references server_role but it’s not defined in this scope. The role is fixed in launch_mm_encoder_server, so this validation is unnecessary.
Apply:
- if metadata_server_cfg is not None: - assert server_role is not None, "server_role is required when metadata_server_cfg is provided" - try: - server_role = ServerRole[server_role.upper()] - assert server_role == ServerRole.MM_ENCODER, "server_role must be MM_ENCODER for multimodal encoder" - except ValueError: - raise ValueError(f"Invalid server role: {server_role}. " \ - f"Must be one of: {', '.join([role.name for role in ServerRole])}") + # Metadata server cfg is passed through; server role is enforced as MM_ENCODER in launch_mm_encoder_server.tensorrt_llm/serve/openai_server.py (2)
175-207
: health_generate should call openai_mm_encoder in MM_ENCODER mode.In MM encoder role, /health_generate still calls openai_chat. Branch by role to exercise the correct path.
- # Call the chat completion logic - response = await self.openai_chat(health_request, mock_request) + # Call the appropriate minimal-generation logic based on server role + if self.server_role is ServerRole.MM_ENCODER: + response = await self.openai_mm_encoder(health_request, mock_request) + else: + response = await self.openai_chat(health_request, mock_request)Also applies to: 189-191
344-357
: Guard mm_embedding_handle and reflow long lines (E501).Avoid assuming mm_embedding_handle is always present; add checks and reformat for readability.
- async def create_mm_embedding_response( - promise: RequestOutput): + async def create_mm_embedding_response(promise: RequestOutput): await promise.aresult() - mm_embedding_handle = promise.mm_embedding_handle - num_tokens = mm_embedding_handle["tensor_size"][0] + mm_embedding_handle = getattr(promise, "mm_embedding_handle", None) + if not mm_embedding_handle or "tensor_size" not in mm_embedding_handle: + return self.create_error_response( + message="Multimodal embedding handle missing in response", + err_type="InternalServerError", + status_code=HTTPStatus.INTERNAL_SERVER_ERROR) + num_tokens = int(mm_embedding_handle["tensor_size"][0]) # TODO: need to add more info like input_ids, specific_token_ids, mrope, mm_hashes, etc - return ChatCompletionResponse( - id=str(promise.request_id), - model=self.model, - choices=[ChatCompletionResponseChoice(index=0, message=ChatMessage(role="assistant", content="dummy"), mm_embedding_handle=mm_embedding_handle, finish_reason="length")], - usage=UsageInfo(prompt_tokens=num_tokens, completion_tokens=1, total_tokens=num_tokens+1)) + return ChatCompletionResponse( + id=str(promise.request_id), + model=self.model, + choices=[ + ChatCompletionResponseChoice( + index=0, + message=ChatMessage(role="assistant", content="dummy"), + mm_embedding_handle=mm_embedding_handle, + finish_reason="length", + ) + ], + usage=UsageInfo( + prompt_tokens=num_tokens, + completion_tokens=1, + total_tokens=num_tokens + 1, + ), + )tensorrt_llm/llmapi/mm_encoder.py (3)
1-1
: Add NVIDIA OSS copyright header.Required for .py source files in OSS.
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. + # + # Licensed under the Apache License, Version 2.0 (the "License"); + # you may not use this file except in compliance with the License. + # You may obtain a copy of the License at + # http://www.apache.org/licenses/LICENSE-2.0 + # Unless required by applicable law or agreed to in writing, software + # distributed under the License is distributed on an "AS IS" BASIS, + # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + # See the License for the specific language governing permissions and + # limitations under the License.
55-58
: Tokenizer attribute consistency with _try_load_tokenizer.use self._tokenizer both for passing into input_processor and for assignment after creation.
- self.input_processor = create_input_processor(self._hf_model_dir, - self.tokenizer) - self.tokenizer = self.input_processor.tokenizer + self.input_processor = create_input_processor(self._hf_model_dir, + self._tokenizer) + self._tokenizer = self.input_processor.tokenizer
107-151
: Bug: generate() returns unresolved futures instead of RequestOutput(s).The docstring promises resolved results. Collect future.result() and return those; also remove the unused helper.
- def _item_at(maybe_batched: Union[Any, Sequence[Any]], pos: int) -> Any: - if isinstance(maybe_batched, list): - return maybe_batched[pos] - else: - return maybe_batched - - futures = [] + futures: List[Any] = [] for i, request_inputs in enumerate(inputs): future = self.generate_async(request_inputs) futures.append(future) - for future in tqdm(futures, - desc="Processed requests", - dynamic_ncols=True, - disable=not use_tqdm): - future.result() + results: List[RequestOutput] = [] + for future in tqdm( + futures, + desc="Processed requests", + dynamic_ncols=True, + disable=not use_tqdm, + ): + results.append(future.result()) if unbatched: - futures = futures[0] + return results[0] - return futures + return resultstensorrt_llm/_torch/pyexecutor/model_engine.py (3)
1077-1082
: VISION_ONLY: add explicit guard to ensure weights truly preloaded.Log is good; also warn (or error) if the model doesn’t advertise preloaded vision weights.
elif load_format == LoadFormat.VISION_ONLY: # Vision weights are already loaded within the model. logger.info( "LoadFormat.VISION_ONLY: skipping weight loading; using preloaded vision weights." ) + if not getattr(model, "vision_weights_preloaded", False) and not getattr(model, "is_vision_encoder", False): + logger.warning( + "VISION_ONLY format is used but the model does not advertise preloaded vision weights." + )
1108-1113
: Avoid hardcoded magic number for mm_encoder_only max_seq_len.Derive from model or backend config with a safe fallback; also shortens the long comment.
- if self.pytorch_backend_config.mm_encoder_only: - # TODO: hardcoded for now, need to infer as: max_num_token_per_image * max_num_images (can be given as a parameter or defined same as max_seq_len) - inferred_max_seq_len = 32768 + if self.pytorch_backend_config.mm_encoder_only: + # Derive from config; fallback to a safe default. + inferred_max_seq_len = ( + getattr(self.model.config, "max_mm_tokens", None) + or getattr(self.pytorch_backend_config, "max_mm_seq_len", None) + or 32768 + )
2256-2293
: Make mm_encoder_only forward robust: no assert, stable returns, correct splitting, and use model_forward.
- Always include 'logits': None, even when no multimodal data.
- Replace assert with ValueError (asserts can be stripped).
- Use model_forward to preserve tracing/compile behaviors.
- Split evenly by len(multimodal_params) when lengths unavailable (not by context_requests).
@nvtx_range("_forward_step_mm_encoder_only") def _forward_step_mm_encoder_only( self, inputs: Dict[str, Any], scheduled_requests: ScheduledRequests) -> Dict[str, Any]: - """Forward step for multimodal encoder only mode - returns mm_embeddings instead of logits.""" + """Run the vision encoder in encoder-only mode; return embeddings, no logits.""" # Get multimodal parameters from inputs multimodal_params = inputs.get("multimodal_params", []) - if not multimodal_params or len(multimodal_params) == 0: - # Return empty embeddings if no multimodal data - return {'mm_embeddings': []} + if not multimodal_params: + return {'mm_embeddings': [], 'logits': None} if getattr(scheduled_requests.context_requests[0], 'multimodal_lengths', None) is None: multimodal_chunks = None else: multimodal_chunks = [ sum(request.multimodal_lengths) for request in scheduled_requests.context_requests if request.multimodal_lengths is not None ] # For mm_encoder_only mode, we only run the vision encoder part # The model should be a vision encoder (e.g., Qwen2VisionModelBase) - mm_embeddings = self.model.forward(multimodal_params) - assert len( - mm_embeddings - ) == 1, "mm_embeddings should be a 1-element list, mix modality (video+image) is not supported" + mm_embeddings = self.model_forward(multimodal_params=multimodal_params) + if len(mm_embeddings) != 1: + raise ValueError("Expected a single embeddings tensor; mixed modalities not supported yet.") - if multimodal_chunks is None or len(multimodal_chunks) != len( - multimodal_params): - mm_embeddings = list( - torch.chunk(mm_embeddings[0], - len(scheduled_requests.context_requests), - dim=0)) + if multimodal_chunks is None or len(multimodal_chunks) != len(multimodal_params): + mm_embeddings = list(torch.chunk(mm_embeddings[0], len(multimodal_params), dim=0)) else: mm_embeddings = list( torch.split(mm_embeddings[0], multimodal_chunks, dim=0)) - return {'mm_embeddings': mm_embeddings, 'logits': None} + return {'mm_embeddings': mm_embeddings, 'logits': None}
🧹 Nitpick comments (14)
tensorrt_llm/__init__.py (1)
53-53
: Silence E402 for this file’s special import ordering.This module intentionally performs early side-effectful setup and non-top imports (e.g., xgrammar pre-import). Ruff flags E402 here. Either adjust tool config for this file or mark this import with noqa.
Example:
-from .llmapi import LLM, MultimodalEncoder +from .llmapi import LLM, MultimodalEncoder # noqa: E402tests/unittest/_torch/multimodal/test_find_num_image_tokens.py (2)
98-116
: Avoid redundant network downloads; reuse images or mark test as network/slow.You download each image again in the inner loop despite already providing URLs. Either:
- Reuse locally downloaded PIL images for both the loader and dimension checks, or
- Mark the test as slow/external-data to reflect the cost.
Option A (reuse PIL images):
```diff
-# Prepare batch inputs with all 3 images
-prompts = ["dummy"] * len(example_images)
-media = example_images  # All image URLs
+prompts = ["dummy"] * len(example_images)
+media = [download_image(u) for u in example_images]
```
Then remove the per-iteration download (Lines 119-122) and instead use the already-loaded PIL image:
```diff
-# Get test image dimensions
-test_image = download_image(test_image_url)
-image_width, image_height = test_image.size
+image_width, image_height = media[image_idx].size
```
If default_multimodal_input_loader does not accept PIL Images, please confirm and we can instead cache the downloaded bytes and reuse them.
111-111
: Fix line length to satisfy Ruff E501.
Ruff flags Line 111 as >120 chars. Wrap the call or break arguments to new lines consistently.
tensorrt_llm/_torch/models/modeling_utils.py (1)
576-577
: Optional: Add explicit type annotations for registry dicts.
Improves readability and static tooling.
```diff
-MODEL_CLASS_VISION_ENCODER_MAPPING = {}
+MODEL_CLASS_VISION_ENCODER_MAPPING: Dict[str, Tuple[Type[nn.Module], Optional[Type[nn.Module]]]] = {}
```
tensorrt_llm/commands/serve.py (3)
391-394
: Docstring punctuation (D415).
First line should end with a period.
```diff
-    """Running an OpenAI API compatible server
+    """Running an OpenAI API compatible server.
```
377-383
: Clarify help text: not “trtllm-serve”.
This option applies to the encoder server. Adjust wording to avoid confusion.
```diff
-    "Path to a YAML file that overwrites the parameters specified by trtllm-serve."
+    "Path to a YAML file that overwrites parameters for the mm_embedding_serve command."
```
403-409
: YAML parsing: fail fast with a clear error.
Wrap safe_load with a narrow except to surface config errors cleanly.
```diff
-    if extra_encoder_options is not None:
-        with open(extra_encoder_options, 'r') as f:
-            encoder_args_extra_dict = yaml.safe_load(f)
+    if extra_encoder_options is not None:
+        try:
+            with open(extra_encoder_options, 'r') as f:
+                encoder_args_extra_dict = yaml.safe_load(f) or {}
+        except (OSError, yaml.YAMLError) as e:
+            raise ValueError(
+                f"Failed to load extra encoder options from {extra_encoder_options}: {e}") from e
```
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py (2)
68-75
: Pass trust_remote_code to MultimodalEncoder for consistency.
You set trust_remote_code on LLM; do the same for the encoder to avoid model-specific import issues.
```diff
-    encoder = MultimodalEncoder(model=encoder_model_dir,
-                                max_batch_size=max_batch_size)
+    encoder = MultimodalEncoder(model=encoder_model_dir,
+                                max_batch_size=max_batch_size,
+                                trust_remote_code=True)
```
147-153
: Wrap long assertion messages (E501).
Break into multiple lines or use parentheses to satisfy line length constraints.
Example:
```diff
-    assert ref_gen.text == test_gen.text, \
-        f"Generated text doesn't match for output {i}, generation {j}:\nReference: {ref_gen.text!r}\nTest: {test_gen.text!r}"
+    assert ref_gen.text == test_gen.text, (
+        f"Generated text mismatch for output {i}, gen {j}:\n"
+        f"Reference: {ref_gen.text!r}\nTest: {test_gen.text!r}"
+    )
```
tests/unittest/_torch/multimodal/test_external_embedding.py (1)
89-178
: Add negative-path tests (mismatch and modality errors).
Consider adding:
- Mismatch between the number of image placeholder tokens and the provided image embeddings (expect ValueError or a clear failure).
- Passing multimodal_embedding without 'image' key (expect ValueError).
This hardens behavior around error handling.
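A rough sketch of what the first case could look like; the fixture and fusion callable below are placeholders, not the actual helpers of this test file — only the expected-failure pattern is the point:

```python
import pytest
import torch


def test_image_token_embedding_count_mismatch(llava_test_inputs):
    """Two image placeholders but only one embedding should fail loudly."""
    # `llava_test_inputs` is a hypothetical fixture standing in for the
    # prompt/tokenizer setup used by the positive-path tests in this file.
    input_ids, fuse_embeddings = llava_test_inputs(num_image_tokens=2)

    # Deliberately provide a single embedding for two placeholders.
    too_few = [torch.zeros(576, 4096)]

    with pytest.raises((ValueError, AssertionError)):
        fuse_embeddings(input_ids, {"image": too_few})
```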
tensorrt_llm/serve/openai_server.py (1)
49-49
: Nit: fix typo in yapf directive.
Enable is misspelled.
```diff
-# yapf: enale
+# yapf: enable
```
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1628-1658
: Remove unused multi_modal_data path in no-cache prep.
multi_modal_data list and associated assignments are unused after switching to multimodal_params; remove to reduce confusion.
```diff
-        multi_modal_data = []
+        # multi_modal_data path removed; using multimodal_params_list
@@
-            multimodal_embedding = request.multimodal_embedding
-            if multimodal_embedding is not None:
-                multi_modal_data.append(multimodal_embedding)
+            # multimodal_embedding path removed; external embeddings are carried via multimodal_params_list
@@
-            'inputs_embeds': None,
-            "multimodal_params": multimodal_params_list
+            'inputs_embeds': None,
+            "multimodal_params": multimodal_params_list
```
Also applies to: 1705-1711
tensorrt_llm/_torch/models/modeling_llava_next.py (2)
84-88
: Remove unused variable num_mm_tokens (Ruff F841).
Tidy up the local to avoid warnings.
```diff
-        num_medias = num_mm_tokens = len(mm_token_positions)
+        num_medias = len(mm_token_positions)
```
141-148
: Reflow long assert message (E501) for readability.
Split the long f-string assertion.
```diff
-        assert len(
-            fused_input_ids
-        ) == fused_length, f"Fused input_ids length {len(fused_input_ids)} should match the sum of text and multimodal embedding lengths {fused_length}"
+        assert len(fused_input_ids) == fused_length, (
+            f"Fused input_ids length {len(fused_input_ids)} should match "
+            f"the sum of text and multimodal embedding lengths {fused_length}"
+        )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)
- tensorrt_llm/__init__.py (2 hunks)
- tensorrt_llm/_torch/model_config.py (3 hunks)
- tensorrt_llm/_torch/models/modeling_auto.py (2 hunks)
- tensorrt_llm/_torch/models/modeling_llava_next.py (10 hunks)
- tensorrt_llm/_torch/models/modeling_qwen2vl.py (5 hunks)
- tensorrt_llm/_torch/models/modeling_utils.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (4 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (4 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (2 hunks)
- tensorrt_llm/commands/serve.py (4 hunks)
- tensorrt_llm/executor/result.py (5 hunks)
- tensorrt_llm/llmapi/__init__.py (1 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tensorrt_llm/llmapi/mm_encoder.py (1 hunks)
- tensorrt_llm/serve/openai_server.py (7 hunks)
- tests/unittest/_torch/multimodal/test_external_embedding.py (1 hunks)
- tests/unittest/_torch/multimodal/test_find_num_image_tokens.py (1 hunks)
- tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
- tensorrt_llm/llmapi/__init__.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/models/modeling_auto.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/executor/result.py
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
- tensorrt_llm/_torch/models/modeling_qwen2vl.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/unittest/_torch/multimodal/test_find_num_image_tokens.py
tensorrt_llm/_torch/models/modeling_utils.py
tests/unittest/_torch/multimodal/test_external_embedding.py
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/__init__.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/commands/serve.py
tensorrt_llm/llmapi/mm_encoder.py
tensorrt_llm/serve/openai_server.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/unittest/_torch/multimodal/test_find_num_image_tokens.py
tensorrt_llm/_torch/models/modeling_utils.py
tests/unittest/_torch/multimodal/test_external_embedding.py
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/__init__.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/commands/serve.py
tensorrt_llm/llmapi/mm_encoder.py
tensorrt_llm/serve/openai_server.py
🧠 Learnings (12)
📓 Common learnings
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/multimodal/test_find_num_image_tokens.py
tests/unittest/_torch/multimodal/test_external_embedding.py
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/__init__.py
tensorrt_llm/commands/serve.py
tensorrt_llm/llmapi/mm_encoder.py
tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/unittest/_torch/multimodal/test_external_embedding.py
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
tensorrt_llm/__init__.py
tensorrt_llm/commands/serve.py
tensorrt_llm/llmapi/mm_encoder.py
tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.py : When using try-except blocks in Python, limit the except to the smallest set of errors possible.
Applied to files:
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.py : When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Applied to files:
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/__init__.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/commands/serve.py
tensorrt_llm/llmapi/mm_encoder.py
tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tensorrt_llm/__init__.py
tensorrt_llm/commands/serve.py
tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/llmapi/mm_encoder.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Applied to files:
tensorrt_llm/llmapi/mm_encoder.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{h,hpp} : Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Applied to files:
tensorrt_llm/llmapi/mm_encoder.py
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
tensorrt_llm/llmapi/mm_encoder.py
📚 Learning: 2025-08-09T02:04:49.580Z
Learnt from: Fridah-nv
PR: NVIDIA/TensorRT-LLM#6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.580Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.
Applied to files:
tensorrt_llm/llmapi/mm_encoder.py
🪛 Ruff (0.12.2)
tests/unittest/_torch/multimodal/test_find_num_image_tokens.py
111-111: Line too long (137 > 120)
(E501)
tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py
147-147: Line too long (126 > 120)
(E501)
153-153: Line too long (133 > 120)
(E501)
tensorrt_llm/_torch/models/modeling_llava_next.py
84-84: Local variable num_mm_tokens
is assigned to but never used
Remove assignment to unused variable num_mm_tokens
(F841)
147-147: Line too long (152 > 120)
(E501)
tensorrt_llm/__init__.py
53-53: Module level import not at top of file
(E402)
tensorrt_llm/_torch/pyexecutor/model_engine.py
1109-1109: Line too long (158 > 120)
(E501)
tensorrt_llm/commands/serve.py
391-394: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
414-414: Undefined name server_role
(F821)
416-416: Undefined name server_role
(F821)
tensorrt_llm/llmapi/mm_encoder.py
18-19: One-line docstring should fit on one line
Reformat to one line
(D200)
48-48: Line too long (124 > 120)
(E501)
53-53: Line too long (140 > 120)
(E501)
54-54: Line too long (141 > 120)
(E501)
103-105: 1 blank line required between summary line and description
(D205)
112-121: 1 blank line required between summary line and description
(D205)
116-116: Line too long (138 > 120)
(E501)
120-120: Line too long (148 > 120)
(E501)
162-167: 1 blank line required between summary line and description
(D205)
tensorrt_llm/serve/openai_server.py
355-355: Line too long (185 > 120)
(E501)
367-367: Line too long (132 > 120)
(E501)
🔇 Additional comments (5)
tensorrt_llm/_torch/models/modeling_utils.py (1)
576-576
: LGTM: Vision encoder mapping added.
The new MODEL_CLASS_VISION_ENCODER_MAPPING is a clean extension point for encoder-only mode.
tests/unittest/_torch/multimodal/test_external_embedding.py (1)
89-178
: Good coverage for external embedding fusion; assertions look solid.
The tests validate token expansion and embedding shapes for 1 and multiple images, and verify tokenizer calls. LGTM.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1272-1279
: Good: reuse recovered tensors during generation.
`strip_for_generation()` is used for gen requests, avoiding redundant recovery. Matches the shared-tensor learning.
tensorrt_llm/_torch/models/modeling_llava_next.py (2)
27-29
: Import cleanup looks good.
Avoids ModelConfig shadowing; thanks for fixing.
466-478
: Precomputed embedding fast-path is correct.
Conditionally using externally provided embeddings avoids redundant vision forward; aligns with disaggregated flow.
Signed-off-by: Chang Liu <[email protected]>
@Superjomn @pcastonguay @Funatiq |
/bot run |
Actionable comments posted: 0
🔭 Outside diff range comments (1)
tensorrt_llm/serve/openai_server.py (1)
188-196
: Health-check still calls text-path in MM ENCODER mode
`health_generate()` always invokes `self.openai_chat`, so a server started with `server_role == MM_ENCODER` never exercises the multimodal-encoder path. Replace the call with `openai_mm_encoder` when the role is `MM_ENCODER`, or register a dedicated `/health_generate` route inside `register_mm_encoder_routes`.
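The shape of the fix, reduced to a self-contained FastAPI toy; the role enum and handler names below are stand-ins, not the actual openai_server.py symbols:

```python
from enum import Enum

from fastapi import FastAPI
from fastapi.responses import JSONResponse


class Role(Enum):
    TEXT = "text"
    MM_ENCODER = "mm_encoder"


def build_app(role: Role) -> FastAPI:
    app = FastAPI()

    async def health_via_chat() -> JSONResponse:
        # Stand-in for driving the text chat path end to end.
        return JSONResponse({"exercised": "openai_chat"})

    async def health_via_mm_encoder() -> JSONResponse:
        # Stand-in for driving the multimodal-encoder path end to end.
        return JSONResponse({"exercised": "openai_mm_encoder"})

    # Register the probe against the path this server actually serves, so an
    # MM_ENCODER deployment never depends on the text route being wired up.
    handler = health_via_mm_encoder if role is Role.MM_ENCODER else health_via_chat
    app.add_api_route("/health_generate", handler, methods=["GET"])
    return app
```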
♻️ Duplicate comments (2)
tensorrt_llm/serve/openai_server.py (2)
364-392
: `disaggregated_params` computed then dropped
`to_llm_disaggregated_params()` is called but its return value is discarded, and the resulting params are not forwarded to `generate_async`. Either remove the call or capture the value and pass it through:
```diff
-        # TODO: placeholder for multimodal disagg e2e
-        to_llm_disaggregated_params(request.disaggregated_params)
+        disaggregated_params = to_llm_disaggregated_params(
+            request.disaggregated_params)
 ...
-        promise = self.llm.generate_async(
-            inputs=prompt,
-        )
+        promise = self.llm.generate_async(
+            inputs=prompt,
+            disaggregated_params=disaggregated_params,
+        )
```
345-356
: Guard against missing `mm_embedding_handle` and fix long line
`create_mm_embedding_response()` assumes `promise.mm_embedding_handle` is always present and well-formed, and the response construction exceeds the 120-char line limit.
```diff
+        mm_embedding_handle = getattr(promise, "mm_embedding_handle", None)
+        if not mm_embedding_handle or "tensor_size" not in mm_embedding_handle:
+            return self.create_error_response(
+                message="Multimodal embedding handle missing in response",
+                err_type="InternalServerError",
+                status_code=HTTPStatus.INTERNAL_SERVER_ERROR)
         num_tokens = int(mm_embedding_handle["tensor_size"][0])
-        return ChatCompletionResponse(id=str(promise.request_id), model=self.model, choices=[ChatCompletionResponseChoice(index=0, message=ChatMessage(role="assistant", content="dummy"), mm_embedding_handle=mm_embedding_handle, finish_reason="length")], usage=UsageInfo(prompt_tokens=num_tokens, completion_tokens=1, total_tokens=num_tokens+1))
+        return ChatCompletionResponse(
+            id=str(promise.request_id),
+            model=self.model,
+            choices=[
+                ChatCompletionResponseChoice(
+                    index=0,
+                    message=ChatMessage(role="assistant", content="dummy"),
+                    mm_embedding_handle=mm_embedding_handle,
+                    finish_reason="length",
+                )
+            ],
+            usage=UsageInfo(
+                prompt_tokens=num_tokens,
+                completion_tokens=1,
+                total_tokens=num_tokens + 1,
+            ),
+        )
```
🧹 Nitpick comments (3)
tests/unittest/llmapi/apps/_test_openai_mmencoder.py (1)
133-139
: Hard-coded token and hidden-size may make the test brittle
`tensor_size[0] == 324` and `tensor_size[1] == 2048` are model-specific.
If the upstream model or tokenizer changes, the test will fail needlessly.
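One way to keep the check structural rather than magic-number based — a sketch only; the helper name below is made up, and the handle layout simply follows the tensor_size assertions quoted above:

```python
from typing import Optional


def check_mm_embedding_handle(handle: dict,
                              expected_hidden_size: Optional[int] = None) -> None:
    """Validate the embedding handle's shape without hard-coding token counts."""
    tensor_size = handle["tensor_size"]
    assert len(tensor_size) == 2, "expected [num_mm_tokens, hidden_size]"
    assert tensor_size[0] > 0, "number of multimodal tokens should be positive"
    if expected_hidden_size is not None:
        # e.g. read from the served model's config instead of hard-coding 2048
        assert tensor_size[1] == expected_hidden_size
```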
Consider asserting the shape rank and positivity instead, or reading the expected sizes from the model config.
tensorrt_llm/_torch/models/modeling_llava_next.py (2)
80-83
: `num_mm_tokens` computed but never used
`num_mm_tokens = len(mm_token_positions)` is dead code and trips Ruff F841. Simply drop the assignment.
140-146
: Line exceeds 120 chars
The assertion message at Line 144 is 152 chars long. Split for readability and to satisfy style checks.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- tensorrt_llm/_torch/models/modeling_llava_next.py (10 hunks)
- tensorrt_llm/_torch/models/modeling_qwen2vl.py (6 hunks)
- tensorrt_llm/serve/openai_server.py (7 hunks)
- tests/integration/defs/test_e2e.py (1 hunks)
- tests/unittest/_torch/multimodal/test_external_embedding.py (1 hunks)
- tests/unittest/_torch/multimodal/test_find_num_image_tokens.py (1 hunks)
- tests/unittest/llmapi/apps/_test_openai_mmencoder.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/unittest/_torch/multimodal/test_find_num_image_tokens.py
- tests/unittest/_torch/multimodal/test_external_embedding.py
- tensorrt_llm/_torch/models/modeling_qwen2vl.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/integration/defs/test_e2e.py
tests/unittest/llmapi/apps/_test_openai_mmencoder.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/serve/openai_server.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/integration/defs/test_e2e.py
tests/unittest/llmapi/apps/_test_openai_mmencoder.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/serve/openai_server.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/test_e2e.py
tests/unittest/llmapi/apps/_test_openai_mmencoder.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/defs/test_e2e.py
tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/integration/defs/test_e2e.py
tensorrt_llm/serve/openai_server.py
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/serve/openai_server.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_llava_next.py
81-81: Local variable num_mm_tokens
is assigned to but never used
Remove assignment to unused variable num_mm_tokens
(F841)
144-144: Line too long (152 > 120)
(E501)
tensorrt_llm/serve/openai_server.py
354-354: Line too long (185 > 120)
(E501)
366-366: Line too long (132 > 120)
(E501)
🔇 Additional comments (1)
tests/integration/defs/test_e2e.py (1)
1512-1517
: New test looks good and follows the existing pattern
The added `test_openai_mmencoder_example` mirrors the structure of other OpenAI example tests and should integrate cleanly with the suite.
Signed-off-by: Chang Liu <[email protected]>
PR_Github #15731 Bot args parsing error: usage: /bot [-h] |
PR_Github #15732 Bot args parsing error: usage: /bot [-h] |
/bot run --post-merge |
PR_Github #15735 [ run ] triggered by Bot |
PR_Github #15735 [ run ] completed with state |
/bot run --post-merge |
PR_Github #15807 [ run ] triggered by Bot |
PR_Github #15807 [ run ] completed with state |
/bot reuse-pipeline |
PR_Github #15822 [ reuse-pipeline ] triggered by Bot |
PR_Github #15822 [ reuse-pipeline ] completed with state |
/bot run |
PR_Github #15823 [ run ] triggered by Bot |
PR_Github #15823 [ run ] completed with state |
/bot run |
PR_Github #15829 [ run ] triggered by Bot |
PR_Github #15829 [ run ] completed with state |
/bot skip --comment "Pre-merge CI was clean here: #6743 (comment) and there is not code changes except merging w/ main; Plus, post-merge CI runs all passed except a known triton llava timeout issue, which is not related to this PR" |
PR_Github #15860 [ skip ] triggered by Bot |
PR_Github #15860 [ skip ] completed with state |
Summary by CodeRabbit
It's the second PR from #5000 that introduces a standalone multimodal encoder and runs only the ViT and returns the mm embedding (via handle).
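For orientation, usage is expected to look roughly like the sketch below. The constructor arguments mirror the unit tests in this PR; the input-loader call, the model_type string, and the generate/handle attribute access are assumptions and may differ from the final API.

```python
from tensorrt_llm import MultimodalEncoder  # top-level export added by this PR
from tensorrt_llm.inputs import default_multimodal_input_loader  # assumed import path

model_dir = "Qwen/Qwen2.5-VL-3B-Instruct"  # illustrative checkpoint

# Encoder-only engine: runs just the ViT, no decoder or text generation.
encoder = MultimodalEncoder(model=model_dir, max_batch_size=4)

# Build multimodal inputs; the loader's exact signature is an assumption here.
inputs = default_multimodal_input_loader(tokenizer=None,
                                         model_dir=model_dir,
                                         model_type="qwen2_5_vl",
                                         modality="image",
                                         prompts=["Describe the image."],
                                         media=["https://example.com/cat.png"])

# Each output is expected to carry an mm_embedding_handle instead of text;
# per the serving test, handle["tensor_size"] is [num_mm_tokens, hidden_size].
for output in encoder.generate(inputs):
    print(output.mm_embedding_handle["tensor_size"])
```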
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Refactor
Chores
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.