[TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance #7250
📝 Walkthrough

Adds mRoPE support and refactors Qwen2-VL weight loading and vision wiring. Introduces MRotaryEmbedding and mrope_section handling. Removes mrope_config from multiple model forward signatures. Updates the multimodal data flow through the executor (position_ids/multimodal_params). Adds an HF weight mapper and tests for Qwen2_5-VL and selective to_device.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant User
participant Model as Qwen2VLModel
participant WM as Qwen2VLHfWeightMapper
participant Vision as VisionEncoder
participant LLM as LLM
User->>Model: load_weights(weights, WM)
Model->>Model: filter_weights(prefix, weights)
Model->>Vision: load vision weights (if not DISAGG)
Model->>LLM: llm.load_weights(WM)
alt fuse_rope enabled
Model->>Model: prepare_mrope_config(...)
end
Model-->>User: load complete
```

```mermaid
sequenceDiagram
autonumber
participant Client
participant RM as ResourceManager
participant Engine as PyTorchModelEngine
participant CG as CUDAGraphRunner
participant Model as Model.forward
participant Attn as Attention/MRotaryEmbedding
Client->>RM: build dummy/real request (use_mrope?)
RM->>RM: attach py_multimodal_data.mrope_config (position_ids [+ deltas])
RM-->>Engine: LlmRequest
Engine->>Engine: _prepare_tp_inputs(use_mrope)
Engine->>CG: capture(inputs: position_ids, multimodal_params)
Client->>CG: run(inputs: position_ids, multimodal_params)
CG->>Model: forward(attn_metadata, position_ids, multimodal_params, ...)
alt rope_fusion disabled and is_mrope
Model->>Attn: MRotaryEmbedding.forward(position_ids, qkv)
else
Model->>Attn: RotaryEmbedding or fused path
end
Model-->>Client: logits
```

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
examples/llm-api/quickstart_multimodal.py (1)
1-1: Add standard NVIDIA multi-line header to Python example

The file examples/llm-api/quickstart_multimodal.py is missing the required NVIDIA copyright and license header. Based on the most common header in other Python files (e.g. triton_backend/all_models/.../fbank.py), please prepend this exact block at the very top of the file:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
```

File: examples/llm-api/quickstart_multimodal.py (insert before line 1)
tensorrt_llm/_torch/attention_backend/interface.py (1)
1-1: Missing NVIDIA copyright header (2025) and future-annotations import

Per coding guidelines, prepend the NVIDIA copyright header to all Python sources. Also, several modules in this PR use PEP 604/585 typing elsewhere; adding from __future__ import annotations early helps keep Python 3.8 compatibility if annotations evolve here later.

Apply at file top:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+from __future__ import annotations
 import copy
 import weakref
```

tensorrt_llm/_torch/models/modeling_llama.py (1)
1-1: Missing NVIDIA copyright header (2025) and future-annotations import

Header is required. This module uses PEP 604 unions (|) and PEP 585 generics (tuple[...]). Add from __future__ import annotations immediately after the header to keep Python 3.8 compatibility by deferring annotation evaluation.

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+from __future__ import annotations
 import copy
 import os
```

tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)
1-1: Missing NVIDIA copyright header (2025) and future-annotations import

This module uses PEP 604/585 typing in multiple places. Add the header and from __future__ import annotations at the top to ensure Python 3.8 compatibility and satisfy repository policy.

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+from __future__ import annotations
 from collections.abc import Callable
 from functools import partial
```

tensorrt_llm/inputs/multimodal.py (1)
1-1: Missing NVIDIA copyright header (2025)

Please prepend the required header.

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 """Multimodal utilities for handling images and other media types in TensorRT-LLM."""
```
tests/unittest/_torch/multimodal/test_share_multiparams.py (1)
1-1: Missing required NVIDIA copyright header (2025)

Per repository guidelines, prepend the standard NVIDIA copyright header (current year) to all Python files, including tests.
tensorrt_llm/_torch/models/checkpoints/__init__.py (1)
1-1: Missing required NVIDIA copyright header (2025)

Please add the standard NVIDIA header at the top of this file.
tensorrt_llm/_torch/modules/attention.py (1)
1-1: Missing required NVIDIA copyright header (2025)

Add the standard header to comply with repo policy.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
1-1: Missing required NVIDIA copyright header (2025)

Please prepend the standard NVIDIA header.
tensorrt_llm/_torch/models/modeling_qwen2vl.py (2)
244-265
: Incorrect temporal padding computation (under-pads for non-multiples).
padded_num_frames = num_frames + num_frames % temporal_patch_size
produces 12 for 10 frames with size 8; the next multiple is 16. Use ceil-to-multiple.

Apply:

```diff
-        padded_num_frames = num_frames + num_frames % temporal_patch_size
+        padded_num_frames = (
+            ((num_frames + temporal_patch_size - 1) // temporal_patch_size)
+            * temporal_patch_size
+        )
```
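For a quick sanity check of the two formulas, here is a standalone snippet (the example values are made up, not taken from the PR):

```python
def ceil_to_multiple(n: int, m: int) -> int:
    # Round n up to the next multiple of m.
    return ((n + m - 1) // m) * m

num_frames, temporal_patch_size = 10, 8
print(num_frames + num_frames % temporal_patch_size)      # 12 -> under-pads
print(ceil_to_multiple(num_frames, temporal_patch_size))  # 16 -> correct
```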
609-711
: Avoid padding to max_position_embeddings; vectorize and fix torch.concatenate usage.

Current approach pads to max_position_embeddings per sample, which is memory-heavy and risks OOM on GPU. You can directly index self.cos_ori/self.sin_ori with variable-length IDs and then reshape/split without padding. Also, prefer torch.cat (broader compatibility) and move deltas to the model device with correct dtype.

Apply:
- if mrope_position_ids: - concat_cos_sin_list = [] - for mrope_position_id in mrope_position_ids: - mrope_position_id = mrope_position_id.transpose(1, 0) - - # TODO: Test whether padding is needed - mrope_position_ids_padding = torch.zeros( - mrope_position_id.shape[:-1] + - (self.model_config.pretrained_config. - max_position_embeddings, ), - dtype=torch.int32, - device=mrope_position_id.device) - mrope_position_ids_padding[:, :, :mrope_position_id. - shape[-1]] = mrope_position_id - - mrope_position_ids_padding = mrope_position_ids_padding.to( - self.cos_ori.device) - cos = self.cos_ori[mrope_position_ids_padding] - sin = self.sin_ori[mrope_position_ids_padding] - - mrope_section = [16, 24, 24] - cos = torch.cat([ - m[:, i % 3] - for i, m in enumerate(cos.split(mrope_section, dim=-1)) - ], - dim=-1).unsqueeze(-1) - sin = torch.cat([ - m[:, i % 3] - for i, m in enumerate(sin.split(mrope_section, dim=-1)) - ], - dim=-1).unsqueeze(-1) - concat_cos_sin = torch.concatenate((cos, sin), axis=-1) - concat_cos_sin = concat_cos_sin.reshape(concat_cos_sin.shape[0], - -1) - concat_cos_sin_list.append(concat_cos_sin) - mrope_config['mrope_rotary_cos_sin'] = torch.cat( - concat_cos_sin_list, dim=0).to(self.device) + if mrope_position_ids: + concat_cos_sin_list = [] + sections = getattr(self.llm.config, "mrope_section", [16, 24, 24]) + for mrope_position_id in mrope_position_ids: + pos = mrope_position_id.transpose(1, 0).to(self.cos_ori.device, dtype=torch.long) + cos = self.cos_ori[pos] + sin = self.sin_ori[pos] + cos = torch.cat([chunk[:, i % 3] for i, chunk in enumerate(cos.split(sections, dim=-1))], dim=-1).unsqueeze(-1) + sin = torch.cat([chunk[:, i % 3] for i, chunk in enumerate(sin.split(sections, dim=-1))], dim=-1).unsqueeze(-1) + concat = torch.cat((cos, sin), dim=-1).reshape(cos.shape[0], -1) + concat_cos_sin_list.append(concat) + mrope_config['mrope_rotary_cos_sin'] = torch.cat(concat_cos_sin_list, dim=0).to(self.device) @@ - if mrope_position_deltas: - mrope_config['mrope_position_deltas'] = torch.cat( - mrope_position_deltas, dim=0) + if mrope_position_deltas: + mrope_config['mrope_position_deltas'] = torch.cat(mrope_position_deltas, dim=0).to(self.device, dtype=torch.int32)Note: consider making
sections configurable instead of hardcoding [16, 24, 24].
🧹 Nitpick comments (26)
examples/llm-api/quickstart_multimodal.py (1)
221-228
: Avoid dumping full logits to stdout; print shapes/summaries to prevent massive logs and slowdown

Raw context/generation logits can be extremely large (seq_len × vocab_size × beams), which can flood stdout, slow runs, and OOM terminals. Summarize instead and be resilient to missing attributes.
- if args.return_context_logits: - print(f"[{i}] Context logits: {output.context_logits}") - if args.return_generation_logits: - print( - f"[{i}] Generation logits: {output.outputs[0].generation_logits}" - ) - if args.logprobs: - print(f"[{i}] Logprobs: {output.outputs[0].logprobs}") + if getattr(args, "return_context_logits", False): + ctx = getattr(output, "context_logits", None) + if ctx is None: + print(f"[{i}] Context logits: None") + else: + shape = getattr(ctx, "shape", None) + dtype = getattr(ctx, "dtype", type(ctx)) + print(f"[{i}] Context logits: shape={tuple(shape) if shape is not None else 'unknown'}, dtype={dtype}") + + if getattr(args, "return_generation_logits", False): + gen = getattr(output.outputs[0], "generation_logits", None) + if gen is None: + print(f"[{i}] Generation logits: None") + else: + shape = getattr(gen, "shape", None) + dtype = getattr(gen, "dtype", type(gen)) + print(f"[{i}] Generation logits: shape={tuple(shape) if shape is not None else 'unknown'}, dtype={dtype}") + + if getattr(args, "logprobs", False): + lps = getattr(output.outputs[0], "logprobs", None) + if isinstance(lps, list): + head = ", ".join(f"{x:.4f}" for x in lps[:5]) + more = "..." if len(lps) > 5 else "" + print(f"[{i}] Logprobs: n={len(lps)}, head=[{head}{more}]") + else: + print(f"[{i}] Logprobs: {lps}")tensorrt_llm/_torch/attention_backend/interface.py (1)
506-508
: mRoPE section: add light validation and docstring clarity

Good addition. To make failures obvious when a wrong shape is passed from model wiring, add validation in __post_init__ for mrope_section when present (e.g., positive ints, non-empty, optionally sum <= rope.dim) and briefly document expected semantics (Qwen2/2.5-VL commonly uses three sections). Keep it optional to avoid over-constraining new models.

Example guard (illustrative; adapt location accordingly):

```python
if self.type.is_rope() and self.mrope_section is not None:
    assert all(isinstance(x, int) and x > 0 for x in self.mrope_section), \
        "mrope_section must be positive ints"
    if self.rope and isinstance(self.rope.dim, int):
        assert sum(self.mrope_section) <= self.rope.dim, \
            "sum(mrope_section) must not exceed rope.dim"
```

tensorrt_llm/_torch/models/modeling_llama.py (2)
182-183
: Forward call now passes fewer positional args; prefer keywords to avoid drift

You removed one positional None in the forward_impl(...) call. To make this resilient to future signature changes and to ensure correctness now, pass these trailing optionals by name (e.g., kv_cache_params=None, lora_params=None, skip_attn_scaling=..., attention_sinks=None). This avoids accidental misalignment if forward_impl's parameter order differs across backends.

Example pattern:

```python
attn_output = self.forward_impl(
    q,
    k,
    v,
    attn_metadata,
    attention_mask=attention_mask,
    all_reduce_params=all_reduce_params,
    skip_attn_scaling=skip_attn_scaling,
    attention_sinks=None,
)
```
433-435
: Min-latency NOPE path: keep call-site robust with keywordsSame concern as above: the parent
_forward_nope
changed and you removed themrope_config
positional. Pass the trailing parameters by name to guard against parameter drift and improve readability.return super()._forward_nope( position_ids, hidden_states, attn_metadata, attention_mask=attention_mask, all_reduce_params=all_reduce_params, skip_attn_scaling=skip_attn_scaling, )tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)
433-435
: Call into base _forward_nope: use keyword args for clarity

The removed mrope_config positional looks correct. To prevent subtle regressions if _forward_nope evolves, pass the last arguments by name.

```diff
-        return super()._forward_nope(position_ids, hidden_states, attn_metadata,
-                                     attention_mask, all_reduce_params,
-                                     skip_attn_scaling)
+        return super()._forward_nope(
+            position_ids,
+            hidden_states,
+            attn_metadata,
+            attention_mask=attention_mask,
+            all_reduce_params=all_reduce_params,
+            skip_attn_scaling=skip_attn_scaling,
+        )
```

tensorrt_llm/inputs/multimodal.py (2)
388-403
: Path-filtered device move: validate dict type; behavior on empty list

Good defensive check that
multimodal_data
is a dict whenkeyword
is provided. Consider treatingkeyword=[]
as a no-op explicitly (currently it is), and optionally log when a path is not found to ease debugging (silent skip is fine for “Chill”).
405-448
: Helper mutates input dict in place; document this or deep-copy
result = data
and subsequent assignments update the original dict. This is efficient, but note it in the docstring to avoid surprises for callers reusing the dict elsewhere. Alternatively, shallow-copy the parent dict(s) before mutation if you want isolation, at a small cost.Example shallow-copy at top:
- result = data + result = dict(data)And copy nested parents before mutation if needed.
tests/unittest/_torch/multimodal/test_share_multiparams.py (1)
149-167
: Selective to_device path filter works; consider minor robustness tweaks

The new test validates the dot-path filtering behavior and confirms non-targeted tensors remain on CPU. Looks good. Two optional hardening ideas:
- Add a skip guard for environments without CUDA to avoid false negatives on CPU-only CI.
- Consider adding one more assertion for a non-existent path (e.g., "image.not_there") to confirm it’s safely ignored.
Example (optional):
@unittest.skipIf(not torch.cuda.is_available(), "CUDA required for this test") def test_to_device_with_keyword(self): ...tensorrt_llm/_torch/models/checkpoints/__init__.py (1)
18-19
: Keep all tidy (optional)Appending the new symbol is fine. Consider keeping all sorted alphabetically to reduce churn in future diffs.
tensorrt_llm/_torch/modules/attention.py (1)
267-279
: Initialize MRotaryEmbedding defensively when mrope_section may be absent

The MRope path looks correct. To avoid AttributeError with older configs that don't carry pos_embd_params.mrope_section, construct MRotaryEmbedding without the parameter when it's missing (class has a sensible default). This preserves backward compatibility.
Apply this diff within the block:
- if self.pos_embd_params.type.is_mrope(): - self.rotary_emb = MRotaryEmbedding( - self.pos_embd_params.rope, - head_dim=self.head_dim, - is_neox=self.pos_embd_params.is_neox, - mrope_section=self.pos_embd_params.mrope_section, - ) + if self.pos_embd_params.type.is_mrope(): + if hasattr(self.pos_embd_params, "mrope_section"): + self.rotary_emb = MRotaryEmbedding( + self.pos_embd_params.rope, + head_dim=self.head_dim, + is_neox=self.pos_embd_params.is_neox, + mrope_section=self.pos_embd_params.mrope_section, + ) + else: + self.rotary_emb = MRotaryEmbedding( + self.pos_embd_params.rope, + head_dim=self.head_dim, + is_neox=self.pos_embd_params.is_neox, + )tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
484-484
: Fix typo and wrap long NOTE to satisfy Ruff E501

There's a minor typo ("mrop_config") and the line exceeds 120 chars. Reflow into two lines and fix the spelling.
Apply this diff:
- # NOTE: Planning to get dummy_data from each model. Before that, we need to add dummy mrop_config to the request here. + # NOTE: Planning to get dummy_data from each model. + # Before that, add a dummy mrope_config to the request here.
493-498
: Dummy mrope_position_deltas shape looks fine; consider setdefault if py_multimodal_data may exist later

For dummy requests it's safe to overwrite py_multimodal_data. If this function ever reuses existing requests, prefer setdefault to avoid clobbering other dummy modalities.
Alternative (for future-proofing):
req.py_multimodal_data = getattr(req, "py_multimodal_data", {}) or {} req.py_multimodal_data.setdefault("mrope_config", {})["mrope_position_ids"] = dummy_mrope_position_ids ...tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
61-69
: Consider using a factory function for MultimodalParams creation.

The multimodal_params initialization with default mrope_config could be extracted into a helper method for better readability and reusability.
Consider refactoring to improve readability:
+ def _create_mrope_multimodal_params(self, batch_size: int, device: str): + """Create MultimodalParams with default mrope configuration.""" + return [ + MultimodalParams( + multimodal_data={ + "mrope_config": { + "mrope_position_deltas": + torch.zeros((1, 1), device=device, dtype=torch.int32) + } + }) for _ in range(batch_size) + ] + def __init__(self, ...): ... - self.multimodal_params = [ - MultimodalParams( - multimodal_data={ - "mrope_config": { - "mrope_position_deltas": - torch.zeros((1, 1), device=device, dtype=torch.int32) - } - }) for _ in range(batch_size) - ] if self.use_mrope else [] + self.multimodal_params = self._create_mrope_multimodal_params( + batch_size, device) if self.use_mrope else []tensorrt_llm/_torch/modules/rotary_embedding.py (1)
167-170
: Consider adding a warning for unexpected 2D position_ids in mRoPE mode.

Since MRotaryEmbedding is specifically for mRoPE, receiving 2D position_ids might indicate a configuration issue.
Consider adding a warning when falling back to standard RoPE:
else: # Fallback to the original RoPE where position_ids is 2D for dummy requests + import warnings + warnings.warn( + "MRotaryEmbedding received 2D position_ids, falling back to standard RoPE. " + "This may indicate a configuration issue.", + RuntimeWarning, + stacklevel=2 + ) cos_sin = self.rotary_cos_sin[position_ids.view(-1)] cos, sin = cos_sin[:, 0, :], cos_sin[:, 1, :]tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
1287-1306
: Fix line length issues for better readability.

The code logic is correct, but several lines exceed the 120-character limit, affecting readability.
Apply this formatting fix:
if self.use_mrope: ctx_mrope_position_ids = multimodal_params.multimodal_data[ - 'mrope_config'][ - 'mrope_position_ids'][:, :, - begin_compute:begin_compute + - len(prompt_tokens)] + 'mrope_config']['mrope_position_ids'][ + :, :, begin_compute:begin_compute + len(prompt_tokens)] mrope_position_ids.append(ctx_mrope_position_ids) # TODO: Visit later to decide the appropriate position of sending multimodal data & selectively sending multimodal data multimodal_params.to_device( "multimodal_data", "cuda", pin_memory=True, - keyword=getattr(self.model, "multimodal_data_device_paths", - None)) + keyword=getattr( + self.model, "multimodal_data_device_paths", None))
1441-1465
: Consider extracting mRoPE position ID generation logic.

The mRoPE position ID generation and multimodal parameter handling could be extracted into a helper method for better maintainability.
Consider refactoring into a helper method:
+ def _prepare_generation_mrope_params( + self, past_seen_token_num: int, + multimodal_data: dict) -> Tuple[torch.Tensor, MultimodalParams]: + """Prepare mRoPE parameters for generation requests.""" + mrope_position_deltas = multimodal_data[ + 'mrope_config']['mrope_position_deltas'] + gen_mrope_position_ids = ( + past_seen_token_num + mrope_position_deltas + ).expand(3, 1, 1) + + multimodal_params = MultimodalParams(multimodal_data=multimodal_data) + multimodal_params.strip_for_generation() + if multimodal_params.has_content(): + multimodal_params.to_device( + "multimodal_data", "cuda", pin_memory=True, + keyword=["mrope_config.mrope_position_deltas"]) + return gen_mrope_position_ids, multimodal_params + # Multimodal - multimodal_params = MultimodalParams( - multimodal_data=request.py_multimodal_data) - multimodal_params.strip_for_generation() - if multimodal_params.has_content(): + if request.py_multimodal_data and self.use_mrope: + gen_mrope_position_ids, multimodal_params = ( + self._prepare_generation_mrope_params( + past_seen_token_num, request.py_multimodal_data)) + mrope_position_ids.append(gen_mrope_position_ids) + multimodal_params_list.append(multimodal_params)tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py (2)
101-102
: Fix line length in __repr__ method.

The return statement exceeds the 120-character limit.
Split the string for better readability:
def __repr__(self) -> str: - return f"modality:{self.modality.lower()}-use_cuda_graph:{self.use_cuda_graph}-disable_fuse_rope:{self.disable_fuse_rope}" + return (f"modality:{self.modality.lower()}-" + f"use_cuda_graph:{self.use_cuda_graph}-" + f"disable_fuse_rope:{self.disable_fuse_rope}")
134-136
: Consider adding error handling for tokenizer initialization.

The tokenizer loading could fail if the model path is invalid.
Add error handling:
def get_inputprocessor(self, model_path): - tokenizer = AutoTokenizer.from_pretrained(model_path) - return create_input_processor(model_path, tokenizer=tokenizer) + try: + tokenizer = AutoTokenizer.from_pretrained(model_path) + return create_input_processor(model_path, tokenizer=tokenizer) + except Exception as e: + raise RuntimeError(f"Failed to load tokenizer from {model_path}: {e}")tensorrt_llm/_torch/models/modeling_qwen2vl.py (8)
62-89
: Align docstring with return shape and ensure consistent dtype/shape.
- Docstring says
mrope_position_deltas
has shape(batch_size)
, but the code returns(B, 1)
. Keep them consistent.- Consider using
torch.int32
for deltas at the source to avoid later casting churn.Apply:
- Returns: - position_ids: A tensor of shape (3, batch_size, sequence_length) - mrope_position_deltas: A tensor of shape (batch_size) + Returns: + position_ids: torch.IntTensor of shape (3, batch_size, sequence_length) + mrope_position_deltas: torch.IntTensor of shape (batch_size, 1)And when returning in the no-vision path:
- mrope_position_deltas = torch.zeros( + mrope_position_deltas = torch.zeros( [input_ids.shape[0], 1], device=input_ids.device, - dtype=input_ids.dtype, + dtype=torch.int32, )Also applies to: 232-235, 296-313
135-146
: Avoid shadowinginput_ids
parameter inside loop.Reusing
input_ids
for the per-row tensor shadows the function parameter and risks confusion.Apply:
- for i, input_ids in enumerate(total_input_ids): - input_ids = input_ids[attention_mask[i] == 1] + for i, seq_input_ids in enumerate(total_input_ids): + seq_input_ids = seq_input_ids[attention_mask[i] == 1] image_nums, video_nums = 0, 0 vision_start_indices = torch.argwhere( - input_ids == vision_start_token_id).squeeze(1) - vision_tokens = input_ids[vision_start_indices + 1] + seq_input_ids == vision_start_token_id).squeeze(1) + vision_tokens = seq_input_ids[vision_start_indices + 1] - input_tokens = input_ids.tolist() + input_tokens = seq_input_ids.tolist()
303-313
: Make CPU clones explicit and unify dtypes.
- Keeping
mrope_position_ids
as int64 is fine for indexing later, but spell out intent.mrope_position_deltas
is cast to int32; consider casting IDs to int64 explicitly to avoid implicit conversions.Apply:
- mrope_config['mrope_position_ids'] = mrope_position_ids.to( - 'cpu').clone() + mrope_config['mrope_position_ids'] = mrope_position_ids.to('cpu', dtype=torch.long).clone()
349-358
: Remove commented debug prints and long lines (Ruff E501).Dead debug prints add noise and break line-length checks. Please delete them.
Apply:
- # print(f"fused_input_ids: {fused_input_ids}") - # print(f"multimodal_data['image']['pixel_values'].shape: {multimodal_data['image']['pixel_values'].shape}") - # print(f"multimodal_data['image']['image_grid_thw']: {multimodal_data['image']['image_grid_thw']}") - # print(f"multimodal_data['mrope_config']['mrope_position_ids'].shape: {multimodal_data['mrope_config']['mrope_position_ids'].shape}") - # print(f"multimodal_data['mrope_config']['mrope_position_deltas'].shape: {multimodal_data['mrope_config']['mrope_position_deltas'].shape}")
521-533
: ReplaceUse
logger.debug
and optionally gate behind a verbosity flag; dumping entireweights.keys()
is noisy.Apply:
- if not DISAGG: - print(f"weights.keys(): {weights.keys()}") - vision_encoder_weights = filter_weights("visual", weights) - print( - f"vision_encoder_weights.keys(): {vision_encoder_weights.keys()}" - ) + if not DISAGG: + vision_encoder_weights = filter_vision_weights("visual", weights) + logger.debug("Loaded %d vision weights", len(vision_encoder_weights))
32-34
: Global naming: preferG_
prefix for globals.Per guidelines, globals should be UPPER_SNAKE_CASE prefixed with
G_
. Consider renamingDISAGG
toG_DISAGG
for consistency.Sample:
-DISAGG = os.getenv('TLLM_MULTIMODAL_DISAGGREGATED', '0') == '1' +G_DISAGG = os.getenv('TLLM_MULTIMODAL_DISAGGREGATED', '0') == '1'Update references accordingly.
356-357
: Resolve Ruff E501: wrap or reflow long lines.A few lines exceed 120 characters. Please wrap them to satisfy linters.
You can break the long comment at Line 477 and remove the debug prints at Lines 356–357 as suggested earlier. For other occurrences, wrap strings/comments across lines.
Also applies to: 477-479, 566-569
296-313
: Add minimal docstrings for public helpers.
get_mrope_config
andprepare_mrope_config
are externally consumed viaMultimodalParams
; add short Google-style docstrings for clarity.Also applies to: 609-711
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (19)
examples/llm-api/quickstart_multimodal.py
(1 hunks)tensorrt_llm/_torch/attention_backend/interface.py
(1 hunks)tensorrt_llm/_torch/models/checkpoints/__init__.py
(2 hunks)tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py
(1 hunks)tensorrt_llm/_torch/models/modeling_gpt_oss.py
(1 hunks)tensorrt_llm/_torch/models/modeling_hunyuan_moe.py
(0 hunks)tensorrt_llm/_torch/models/modeling_llama.py
(1 hunks)tensorrt_llm/_torch/models/modeling_llama_min_latency.py
(1 hunks)tensorrt_llm/_torch/models/modeling_qwen.py
(2 hunks)tensorrt_llm/_torch/models/modeling_qwen2vl.py
(12 hunks)tensorrt_llm/_torch/models/modeling_qwen3.py
(0 hunks)tensorrt_llm/_torch/modules/attention.py
(2 hunks)tensorrt_llm/_torch/modules/rotary_embedding.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
(4 hunks)tensorrt_llm/_torch/pyexecutor/model_engine.py
(11 hunks)tensorrt_llm/_torch/pyexecutor/resource_manager.py
(1 hunks)tensorrt_llm/inputs/multimodal.py
(2 hunks)tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py
(1 hunks)tests/unittest/_torch/multimodal/test_share_multiparams.py
(1 hunks)
💤 Files with no reviewable changes (2)
- tensorrt_llm/_torch/models/modeling_qwen3.py
- tensorrt_llm/_torch/models/modeling_hunyuan_moe.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/attention_backend/interface.py
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py
examples/llm-api/quickstart_multimodal.py
tensorrt_llm/_torch/models/modeling_llama_min_latency.py
tests/unittest/_torch/multimodal/test_share_multiparams.py
tensorrt_llm/_torch/models/modeling_qwen.py
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/modules/attention.py
tensorrt_llm/_torch/modules/rotary_embedding.py
tensorrt_llm/_torch/models/checkpoints/__init__.py
tensorrt_llm/_torch/models/modeling_gpt_oss.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
tensorrt_llm/inputs/multimodal.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/attention_backend/interface.py
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py
examples/llm-api/quickstart_multimodal.py
tensorrt_llm/_torch/models/modeling_llama_min_latency.py
tests/unittest/_torch/multimodal/test_share_multiparams.py
tensorrt_llm/_torch/models/modeling_qwen.py
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/modules/attention.py
tensorrt_llm/_torch/modules/rotary_embedding.py
tensorrt_llm/_torch/models/checkpoints/__init__.py
tensorrt_llm/_torch/models/modeling_gpt_oss.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
tensorrt_llm/inputs/multimodal.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
🧠 Learnings (1)
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧬 Code graph analysis (12)
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (3)
tensorrt_llm/_torch/models/checkpoints/hf/weight_mapper.py (1)
HfWeightMapper
(10-99)tensorrt_llm/_torch/models/modeling_utils.py (1)
register_mapper
(597-608)tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
filter_weights
(35-42)
examples/llm-api/quickstart_multimodal.py (2)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
generation_logits
(208-217)tensorrt_llm/scaffolding/task.py (1)
logprobs
(99-100)
tests/unittest/_torch/multimodal/test_share_multiparams.py (1)
tensorrt_llm/inputs/multimodal.py (1)
to_device
(359-403)
tensorrt_llm/_torch/models/modeling_qwen.py (2)
tensorrt_llm/_torch/models/modeling_utils.py (1)
config
(500-501)tensorrt_llm/_torch/models/modeling_bert.py (1)
config
(266-285)
tensorrt_llm/_torch/modules/attention.py (2)
tensorrt_llm/_torch/modules/rotary_embedding.py (2)
MRotaryEmbedding
(130-193)RotaryEmbedding
(10-127)tensorrt_llm/functional.py (1)
is_mrope
(707-708)
tensorrt_llm/_torch/modules/rotary_embedding.py (1)
tensorrt_llm/_torch/attention_backend/interface.py (1)
RopeParams
(342-494)
tensorrt_llm/_torch/models/checkpoints/__init__.py (1)
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (1)
Qwen2VLHfWeightMapper
(7-21)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
use_mrope
(482-490)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
tensorrt_llm/inputs/multimodal.py (1)
MultimodalParams
(152-475)tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
use_mrope
(482-490)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
add_dummy_requests
(68-69)add_dummy_requests
(430-499)tensorrt_llm/inputs/multimodal.py (4)
to_device
(359-403)MultimodalParams
(152-475)strip_for_generation
(449-471)has_content
(473-475)
tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py (11)
tensorrt_llm/_torch/attention_backend/utils.py (1)
get_attention_backend
(10-24)tensorrt_llm/_torch/metadata.py (1)
KVCacheParams
(9-31)tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (1)
Qwen2VLHfWeightMapper
(7-21)tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
Qwen2_5_VLModel
(802-815)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
DecodingCUDAGraphRunner
(12-143)capture
(78-104)run
(109-143)tensorrt_llm/_torch/pyexecutor/resource_manager.py (6)
KVCacheManager
(141-943)add_dummy_requests
(68-69)add_dummy_requests
(430-499)shutdown
(80-81)shutdown
(355-356)shutdown
(1120-1121)tensorrt_llm/inputs/registry.py (1)
create_input_processor
(274-303)tensorrt_llm/inputs/utils.py (1)
default_multimodal_input_loader
(442-602)tensorrt_llm/inputs/data.py (1)
prompt_inputs
(55-69)tensorrt_llm/inputs/multimodal.py (2)
MultimodalParams
(152-475)to_device
(359-403)tensorrt_llm/_torch/model_config.py (1)
torch_dtype
(147-152)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (5)
tensorrt_llm/_torch/models/checkpoints/base_weight_mapper.py (1)
BaseWeightMapper
(10-165)tensorrt_llm/_torch/models/modeling_utils.py (2)
filter_weights
(710-716)config
(500-501)tensorrt_llm/runtime/multimodal_model_runner.py (1)
get_rope_index
(1802-1968)tensorrt_llm/_torch/model_config.py (1)
torch_dtype
(147-152)tensorrt_llm/inputs/multimodal.py (1)
MultimodalParams
(152-475)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
484-484: Line too long (130 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/model_engine.py
1296-1296: Line too long (135 > 120)
(E501)
1462-1462: Line too long (131 > 120)
(E501)
1583-1583: Line too long (145 > 120)
(E501)
1584-1584: Line too long (140 > 120)
(E501)
tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py
102-102: Line too long (130 > 120)
(E501)
tensorrt_llm/_torch/models/modeling_qwen2vl.py
356-356: Line too long (142 > 120)
(E501)
357-357: Line too long (148 > 120)
(E501)
477-477: Line too long (140 > 120)
(E501)
566-566: Line too long (126 > 120)
(E501)
🔇 Additional comments (22)
examples/llm-api/quickstart_multimodal.py (2)
223-226
: Double-check the correct accessor for generation logits and expected shape

The internal API snippet shows generation_logits shaped [beam_width, seq_length, vocab_size] after transpose; here you access output.outputs[0].generation_logits. Confirm that the sample-level accessor is intended (vs a top-level output.generation_logits) and that its shape semantics match your expectations before printing/summarizing.
If needed, we can adjust the print path to whichever accessor is canonical in the updated executor.
221-228
: No missing CLI flags—add_llm_args already defines them
The flags--return_context_logits
,--return_generation_logits
, and--logprobs
are all added inexamples/llm-api/quickstart_advanced.py
(lines 143–149) withdefault=False
andaction='store_true'
, so the checks inquickstart_multimodal.py
will never raise anAttributeError
.tensorrt_llm/inputs/multimodal.py (2)
360-364
: API extension looks good; positional compatibility preservedAdding
keyword: Optional[List[str]] = None
afterpin_memory
preserves existing positional callers. Docstring updates clearly explain behavior. LGTM.
452-454
: Generation strip behavior change: only keep mRoPE deltas

The narrowing to retain only
mrope_position_deltas
undermrope_config
is sensible and aligns with MRope-only needs during generation. Please confirm no downstream consumer expectsmrope_rotary_cos_sin
at generation time; if any path does, this will now drop it.

Also applies to: 458-471
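For illustration, the narrowing described above amounts to something like the following sketch (assuming the nested multimodal_data layout used in this PR; this is not the actual strip_for_generation implementation):

```python
def strip_for_generation_sketch(multimodal_data: dict) -> dict:
    # Keep only what generation needs: the mRoPE position deltas.
    mrope_config = multimodal_data.get("mrope_config") or {}
    deltas = mrope_config.get("mrope_position_deltas")
    if deltas is None:
        return {}
    return {"mrope_config": {"mrope_position_deltas": deltas}}
```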
tensorrt_llm/_torch/models/checkpoints/__init__.py (1)
9-9
: Expose Qwen2VL mapper in public API

Importing Qwen2VLHfWeightMapper here is correct and enables external access. LGTM.
tensorrt_llm/_torch/modules/attention.py (1)
27-27
: Publicly exporting MRotaryEmbedding is appropriate

Bringing MRotaryEmbedding into this module's namespace matches the new MRope path usage. No issues.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
486-488
: Bug: invalid use of Tensor.expand with -1; shape creation for dummy mrope_position_ids will raise at runtime

torch.Tensor.expand does not accept -1 as a size. The current code will throw. Create the (3, 1, token_num) shape by first viewing to 3D, then expanding with concrete sizes.
Apply this fix:
- dummy_mrope_position_ids = torch.arange( - 0, token_num, dtype=torch.int32).expand(3, 1, -1).clone() + base = torch.arange(token_num, dtype=torch.int32).view(1, 1, token_num) + dummy_mrope_position_ids = base.expand(3, 1, token_num).clone()Likely an incorrect or invalid review comment.
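A quick standalone check supports that note: for a 1-D tensor, the new leading dimensions must be given explicit sizes, but -1 is accepted for the existing trailing dimension, so the original expression is expected to run:

```python
import torch

token_num = 5
ids = torch.arange(0, token_num, dtype=torch.int32).expand(3, 1, -1).clone()
print(ids.shape)  # torch.Size([3, 1, 5])
```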
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (1)
6-21
: Mapper correctly normalizes 'model.language_model.' → 'model.' prefixes before delegating to HfWeightMapper

The transformation is correct and scoped, then reuses the base filter logic. Clean and minimal. LGTM.
tensorrt_llm/_torch/models/modeling_qwen.py (2)
35-35
: LGTM!The addition of
mrope_section
parameter retrieval from the config is properly handled with a default ofNone
.
49-49
: LGTM!The
rope_fusion
parameter is correctly derived from thedisable_fuse_rope
config option with proper negation logic.tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
48-59
: LGTM! Clean implementation of mRoPE position IDs handling.The conditional initialization of position IDs based on
use_mrope
is well-structured, using the appropriate 3D shape for mRoPE and 2D shape for standard RoPE.
129-139
: LGTM! Proper handling of mRoPE data transfer.

The non-blocking copy of mrope_position_deltas ensures efficient data transfer without blocking the compute stream.
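For context, the non-blocking update pattern looks roughly like the sketch below (illustrative only; the buffer name and shape are assumptions, not the exact code in cuda_graph_runner.py). The copy only overlaps with compute when the host tensor is pinned:

```python
import torch

# Persistent device buffer that the CUDA graph replays against (assumed shape).
mrope_position_deltas_cuda = torch.zeros((1, 1), device="cuda", dtype=torch.int32)

# Host-side value arriving with the request; pin it so the H2D copy can be async.
host_deltas = torch.tensor([[3]], dtype=torch.int32).pin_memory()

# Update the captured buffer without synchronizing the compute stream.
mrope_position_deltas_cuda.copy_(host_deltas, non_blocking=True)
```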
tensorrt_llm/_torch/modules/rotary_embedding.py (2)
130-142
: LGTM! Well-structured MRotaryEmbedding class initialization.

The MRotaryEmbedding class properly extends RotaryEmbedding with mRoPE-specific parameters. The default mrope_section values [16, 24, 24] are appropriate for Qwen2/2.5-VL models.
153-165
: LGTM! Elegant multi-section RoPE implementation.

The splitting and concatenation of cos/sin values based on mrope_section is well-implemented. The logic correctly handles the 3D position_ids for mRoPE while maintaining backward compatibility with 2D position_ids.
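As a rough illustration of that splitting (a standalone sketch, not the module's code; the tensor names and sizes are assumptions), each of the three position-id streams contributes its own slice of the rotary table according to mrope_section:

```python
import torch

mrope_section = [16, 24, 24]              # per-stream slice widths, summing to head_dim // 2
seq_len, half_dim = 7, sum(mrope_section)

# cos looked up with 3D position_ids: one row per (temporal, height, width) stream.
cos = torch.randn(3, seq_len, half_dim)

# Take slice i from stream i % 3 and stitch the slices back together.
cos_mrope = torch.cat(
    [chunk[i % 3] for i, chunk in enumerate(cos.split(mrope_section, dim=-1))],
    dim=-1,
)
print(cos_mrope.shape)  # torch.Size([7, 64])
```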
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
446-448
: LGTM! Proper initialization of mRoPE position IDs buffer.The 3D tensor allocation for mrope_position_ids_cuda with shape (3, 1, max_num_tokens) is correct for mRoPE.
1581-1600
: LGTM! Proper handling of mRoPE vs standard position IDs.The conditional logic correctly handles the 3D mRoPE position IDs when available and falls back to standard 2D position IDs otherwise.
tensorrt_llm/_torch/models/modeling_qwen2vl.py (6)
746-749
: Only compute mRoPE config when needed; ensure it’s complete.Good gating via
disable_fuse_rope
. Ensuremrope_config
includes bothmrope_rotary_cos_sin
andmrope_position_deltas
on the correct device/dtypes as in the previous refactor.
730-759
: End-to-end data flow looks solid. Minor consistency checks.
- mm_embeds computed from encoder or cached path, then fused—LGTM.
- Passing
mrope_config
intollm.forward(...)
is consistent with the new flow. After the above fixes, device/dtype mismatches should be resolved.
750-761
: Log level/shape logging looks appropriate.Context/generation split logging and final output shape debug are helpful and not noisy.
374-376
: Ensure stability by using the publicfrom_config
API or strict version pinningThe private method
_from_config
isn’t part of the guaranteed public API and may change in futuretransformers
releases. To avoid unexpected breakages:• Replace
self.visual = model_class._from_config(config).eval()with the public constructor:
self.visual = model_class.from_config(config).eval()• If
from_config
isn’t available or doesn’t meet your needs in Transformers 4.53.x, pin your dependency explicitly (e.g. insetup.py
orrequirements.txt
):transformers>=4.53.0,<4.54.0
and add a comment noting why the pin is required.
Please verify that
Qwen2VisionTransformerPretrainedModel.from_config
exists and behaves as expected in your target version.
96-117
: Filling masked positions with 1 is the established convention

A search across the codebase shows the exact same
position_ids.masked_fill_(attention_mask == 0, 1)
pattern in:
tensorrt_llm/runtime/multimodal_model_runner.py:1950
tensorrt_llm/runtime/generation.py:2898
and2984
This confirms that using
1
for masked positions is intentional and consistent throughout. No change is needed here.
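For reference, a toy version of that pattern (made-up mask; not code from the PR). Padded slots just need any valid index, and the attention mask keeps them from contributing:

```python
import torch

attention_mask = torch.tensor([[1, 1, 1, 0, 0]])
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
print(position_ids)  # tensor([[0, 1, 2, 1, 1]])
```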
490-493
: No unreachable early return in Qwen2VLModelBase.initThe
if hasattr(self, "llm"): return
guard in Qwen2VLModelBase.init can never fire under normal usage, because:
- Subclasses (Qwen2VLModel and Qwen2_5_VLModel) do not set
self.llm
before callingsuper().__init__
.- Within Qwen2VLModelBase.init,
self.llm
is only assigned after the mm_encoder setup (lines 487–491).Therefore, when
DISAGG
is False,self.mm_encoder
will always be initialized. You can safely ignore the original concern.Likely an incorrect or invalid review comment.
Force-pushed from 7095cdc to f7cd755
/bot run
PR_Github #17861 [ run ] triggered by Bot
PR_Github #17861 [ run ] completed with state
The fmha_v2 changes LGTM. I will let others review the rest of the code. Thanks!
Force-pushed from f7cd755 to 805b4f6
/bot run
Signed-off-by: yechank <[email protected]>
Force-pushed from c83cbaf to b92fbf6
/bot run
PR_Github #19335 [ run ] triggered by Bot
PR_Github #19335 [ run ] completed with state
Signed-off-by: yechank <[email protected]>
/bot run
PR_Github #19357 [ run ] triggered by Bot
/bot run
PR_Github #19384 [ run ] triggered by Bot
PR_Github #19384 [ run ] completed with state
/bot run
PR_Github #19410 [ run ] triggered by Bot
PR_Github #19410 [ run ] completed with state
/bot run
PR_Github #19442 [ run ] triggered by Bot
/bot kill
PR_Github #19492 [ kill ] triggered by Bot
PR_Github #19442 [ run ] completed with state
PR_Github #19492 [ kill ] completed with state
/bot run
PR_Github #19494 [ run ] triggered by Bot
/bot run
PR_Github #19534 [ run ] triggered by Bot
PR_Github #19494 [ run ] completed with state
PR_Github #19534 [ run ] completed with state
```python
# NOTE: self.use_mrope is enough for differentiating whether to use mrope_position_ids but
# `_create_dummy_context_requests` from `kv_cache_creater` makes an exception that I can not add multimodal_data to the dummy_request
# so that we only replace position_ids with mrope_position_ids when it is not a dummy request and for models who is using mrope.
mrope_position_ids = torch.cat(mrope_position_ids,
```
I've identified a potential device mismatch issue in the code. The tensor ctx_mrope_position_ids
might be on the CUDA device, while gen_mrope_position_ids
remains on the CPU device. This discrepancy will cause an error when attempting to concatenate these tensors.
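One way to guard against that, sketched under the assumption that the list can mix CPU and CUDA chunks (names mirror the comment, not the exact code in model_engine.py):

```python
import torch

def cat_position_id_chunks(chunks, device="cuda"):
    # Move every chunk to a common device before concatenating along the token dim.
    return torch.cat([chunk.to(device) for chunk in chunks], dim=-1)
```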
Signed-off-by: yechank <[email protected]>
Signed-off-by: yechank <[email protected]>
Summary by CodeRabbit
New Features
Performance
Improvements
Tests