[TRTLLM-6654][feat] Add support for external multimodal embeddings #6263
Conversation
Walkthrough

This update introduces support for directly attaching externally computed multimodal (e.g., image) embeddings into the text processing pipeline for Llama4 and LlavaNext models. It adds conversion utilities for multimodal data between tensor and handle representations, enhances multimodal input handling in the API and executor, and includes comprehensive unit tests for the new conversion logic.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant API as LLM API
    participant Processor as InputProcessor
    participant Executor
    participant Model
    User->>API: Submit prompt with multimodal embeddings
    API->>Processor: attach_multimodal_embeddings(prompt, embeddings)
    Processor->>Processor: Process prompt, merge embeddings, tokenize
    Processor-->>API: token_ids, extra_inputs
    API->>Executor: Prepare request (convert embeddings if needed)
    Executor->>Model: Forward pass with tokens and embeddings
    Model-->>Executor: Output
    Executor-->>API: Output
    API-->>User: Output
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
🧹 Nitpick comments (5)
tests/unittest/_torch/multimodal/test_share_multiparams.py (1)
1-95: Consider adding error condition tests and different tensor configurations.

The test suite provides excellent coverage of the happy path scenarios. Consider enhancing it with:
- Error condition tests (invalid keys, malformed data); see the sketch below
- Different tensor dtypes and devices
- Performance benchmarks for large tensors
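As a minimal sketch of the first suggestion, assuming a `MultimodalParams` class in `tensorrt_llm.inputs.multimodal` with a `multimodal_data` field and a `to_tensor()` method (names inferred from this review, not verified against the test file), such a test could look like:

```python
import unittest

from tensorrt_llm.inputs.multimodal import MultimodalParams  # assumed import path


class TestMalformedMultimodalData(unittest.TestCase):

    def test_to_tensor_rejects_malformed_handle(self):
        # A dict that looks like a shared-tensor handle (it has 'method_key')
        # but carries no usable payload should raise a clear ValueError rather
        # than being silently passed through.
        params = MultimodalParams(
            multimodal_data={"multimodal_embedding": [{"method_key": "bogus"}]})
        with self.assertRaises(ValueError):
            params.to_tensor("multimodal_embedding")


if __name__ == "__main__":
    unittest.main()
```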
tensorrt_llm/inputs/multimodal.py (2)
258-272: Consider handling nested dictionaries within lists.

The `_to_tensor_handle` function handles lists of tensors but doesn't recursively process dictionaries that might be inside lists. Consider enhancing the list handling to support nested structures:

```diff
 elif isinstance(v, list):
     for i, item in enumerate(v):
         if isinstance(item, torch.Tensor):
             handle = SharedTensorContainer.from_tensor(item).dump_to_dict()
             v[i] = handle
+        elif isinstance(item, dict):
+            _to_tensor_handle(item)
```
336-347: Consider handling nested dictionaries within lists (same as to_handle).

Similar to the `to_handle` method, the `_to_tensor` function should handle dictionaries within lists for consistency. Consider enhancing the list handling:

```diff
 elif isinstance(v, list):
     for i, item in enumerate(v):
         if isinstance(item, dict) and 'method_key' in item:
             try:
                 tensor = SharedTensorContainer.from_dict(item).get_local_view()
                 v[i] = tensor
             except Exception as e:
                 raise ValueError(
                     f"Failed to convert handle to tensor in list at index {i}: {e}"
                 )
+        elif isinstance(item, dict):
+            _to_tensor(item)
```

tensorrt_llm/_torch/models/modeling_llama.py (2)
808-812: Track the TODO for obtaining special tokens from the tokenizer.

The hardcoded special tokens should ideally come from the tokenizer to ensure consistency. Would you like me to help implement the logic to obtain these special tokens from the tokenizer, or open an issue to track this TODO?
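If it helps, here is a minimal sketch of resolving these ids from the tokenizer instead of hardcoding them; the model id and the token strings are assumptions about the Llama4 vocabulary, not verified values:

```python
from transformers import AutoTokenizer

# Hedged sketch: look the image special-token ids up at init time.
# Replace the model id and token strings with what the checkpoint actually uses.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
IMAGE_SPECIAL_TOKENS = [
    "<|image_start|>", "<|image|>", "<|image_end|>",
    "<|patch|>", "<|tile_x_separator|>", "<|tile_y_separator|>",
]
image_special_token_ids = tokenizer.convert_tokens_to_ids(IMAGE_SPECIAL_TOKENS)
```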
844-943: Consider refactoring this method for better maintainability.

This method is quite long (100+ lines) with complex logic. Consider breaking it down into smaller helper methods for better readability and maintainability.

Consider extracting helper methods:

```python
def _validate_multimodal_embedding(self, multimodal_embedding: Dict[str, List[Dict[str, Any]]]) -> None:
    """Validate the structure and content of multimodal embeddings."""
    if not isinstance(multimodal_embedding, dict):
        raise ValueError("multimodal_embedding must be a dictionary")
    if 'image' not in multimodal_embedding:
        raise ValueError("Only image modality is supported for now")
    mm_embedding_info = multimodal_embedding['image']
    if not mm_embedding_info or not isinstance(mm_embedding_info[0], dict):
        raise ValueError("Llama4 image embedding must contain special token information")


def _extract_embedding_components(self, mm_embedding_info: List[Dict[str, Any]]) -> Tuple[List, List, List]:
    """Extract embedding components from the embedding info."""
    try:
        mm_embeddings = [mm_embedding['mm_embeddings'] for mm_embedding in mm_embedding_info]
        mm_embedding_special_tokens = [mm_embedding['image_special_tokens'] for mm_embedding in mm_embedding_info]
        mm_embedding_special_offsets = [mm_embedding['image_special_token_offsets'] for mm_embedding in mm_embedding_info]
        return mm_embeddings, mm_embedding_special_tokens, mm_embedding_special_offsets
    except KeyError as e:
        raise ValueError(f"Missing required key in multimodal embedding: {e}")


def _validate_embedding_dimensions(self, mm_embeddings: List[torch.Tensor]) -> None:
    """Validate that embedding dimensions match model requirements."""
    model_hidden_size = self.model_config.text_config.hidden_size
    for i, embedding in enumerate(mm_embeddings):
        if embedding.shape[-1] != model_hidden_size:
            raise ValueError(
                f"Multimodal embedding {i} hidden size {embedding.shape[-1]} "
                f"must match model hidden size {model_hidden_size}"
            )
```

Then use these helpers in the main method to reduce its complexity.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
tensorrt_llm/_torch/models/modeling_llama.py (3 hunks)
tensorrt_llm/_torch/models/modeling_llava_next.py (2 hunks)
tensorrt_llm/executor/worker.py (1 hunks)
tensorrt_llm/inputs/multimodal.py (1 hunks)
tensorrt_llm/inputs/utils.py (4 hunks)
tensorrt_llm/llmapi/llm.py (3 hunks)
tests/unittest/_torch/multimodal/test_share_multiparams.py (1 hunks)
🧠 Learnings (7)
📓 Common learnings
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.703Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
tensorrt_llm/executor/worker.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.703Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
tensorrt_llm/inputs/multimodal.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.703Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
tensorrt_llm/_torch/models/modeling_llama.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.703Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
tensorrt_llm/llmapi/llm.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.703Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
tensorrt_llm/inputs/utils.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.703Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
tensorrt_llm/_torch/models/modeling_llava_next.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.703Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_llama.py
837-837: Line too long (122 > 120) (E501)
838-838: Line too long (121 > 120) (E501)
🔇 Additional comments (13)
tensorrt_llm/executor/worker.py (1)
500-505: LGTM: Tensor conversion logic is correctly implemented for multimodal embeddings.

The conditional tensor conversion properly handles the deserialization of multimodal embeddings from shared tensor handles back to PyTorch tensors for the PyTorch backend. The conditions appropriately check for PyTorch backend, multimodal parameters existence, and multimodal data presence before conversion.
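For readers skimming the diff, the guard amounts to roughly the following sketch; the attribute and method names are taken from this review's description, not copied from worker.py:

```python
def maybe_recover_multimodal_tensors(backend: str, request) -> None:
    """Hedged sketch: recover tensors from shared-tensor handles only when the
    PyTorch backend actually received multimodal data with the request."""
    if (backend == "pytorch"
            and getattr(request, "multimodal_params", None) is not None
            and request.multimodal_params.multimodal_data is not None):
        # Inverse of the to_handle("multimodal_embedding") call made on the API side.
        request.multimodal_params.to_tensor("multimodal_embedding")
```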
tensorrt_llm/llmapi/llm.py (3)
344-347: LGTM: Condition properly expanded to include multimodal embeddings.

The updated condition correctly triggers VLM reprocessing for both the existing `multi_modal_data` and the new `multi_modal_embeddings` scenarios, maintaining backward compatibility while supporting the new feature.
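In other words, the routing check now behaves like this sketch (the prompt-dict key names are assumptions based on the description above):

```python
def needs_vlm_reprocessing(prompt: dict) -> bool:
    """Hedged sketch: route through the VLM input processor when either raw
    multimodal data or precomputed multimodal embeddings are attached."""
    return (prompt.get("multi_modal_data") is not None
            or prompt.get("multi_modal_embeddings") is not None)
```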
397-399: LGTM: Handle conversion optimizes IPC for multimodal embeddings.

The conversion to a shared tensor handle with key `"multimodal_embedding"` correctly optimizes inter-process communication by serializing tensor data. This pairs with the inverse `to_tensor` operation in the worker for efficient multimodal data transfer.
380-384: `attch_multimodal_embeddings` implementation verified

The `attch_multimodal_embeddings` method is present in both model-specific input processors, matching the new branch's call site:
- tensorrt_llm/_torch/models/modeling_llama.py
- tensorrt_llm/_torch/models/modeling_llava_next.py
No further action required.
tests/unittest/_torch/multimodal/test_share_multiparams.py (4)
11-30: LGTM: Well-structured test setup with comprehensive multimodal data.

The setUp method creates realistic test fixtures covering various multimodal data types (embeddings, mrope config, image data) using CPU tensors, which is appropriate for unit testing since CUDA IPC requires separate processes.
31-50: LGTM: Thorough edge case testing for None and empty data.

The test properly validates behavior with None and empty multimodal data, ensuring the handle conversion methods are robust. The test also verifies that MultimodalInput objects are preserved correctly when they don't contain tensor data.
51-64: LGTM: Basic round-trip conversion test validates data integrity.

The test correctly verifies that tensor data survives the handle conversion round-trip (`to_handle` followed by `to_tensor`) while maintaining type and value integrity using `torch.allclose`.
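For reference, the invariant that test pins down can be written as a small sketch; the `multimodal_data` layout and the `"multimodal_embedding"` key are assumptions based on the surrounding comments:

```python
import torch


def assert_embedding_round_trip(params) -> None:
    """Hedged sketch: to_handle() followed by to_tensor() must preserve values."""
    original = params.multimodal_data["multimodal_embedding"].clone()
    params.to_handle("multimodal_embedding")   # tensor -> serializable handle
    params.to_tensor("multimodal_embedding")   # handle -> local tensor view
    recovered = params.multimodal_data["multimodal_embedding"]
    assert isinstance(recovered, torch.Tensor)
    assert torch.allclose(original, recovered)
```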
65-91: LGTM: Comprehensive nested data conversion test ensures full functionality.

The test validates that complex nested multimodal data structures with multiple tensor types (embeddings, mrope configs, image data) are correctly preserved through the full conversion cycle. The individual tensor comparisons ensure all data types are handled properly.
tensorrt_llm/_torch/models/modeling_llava_next.py (2)
3-3: LGTM!

The addition of the `Dict` import is appropriate for the new method's type annotations.
218-229: Implementation looks correct.

The method correctly:
- Validates the presence of 'image' key in multimodal_embedding
- Tokenizes the prompt
- Uses the existing _postprocess method to handle embedding expansion
- Returns the expected tuple format
tensorrt_llm/_torch/models/modeling_llama.py (1)
2-2: LGTM!

The addition of the `Dict` import is appropriate for the new method's type annotations.

tensorrt_llm/inputs/utils.py (2)
325-325: LGTM!

The expanded type annotation correctly supports multiple media data formats including embeddings.
483-613: Well-implemented support for multimodal embeddings.

The changes correctly:
- Add an optional `mm_embeddings` parameter with proper type hints
- Extend `convert_to_conversation_message` to handle embeddings
- Maintain backward compatibility with raw media inputs
- Use distinct dictionary keys to differentiate between embeddings and raw data
The implementation is clean and maintains the existing functionality while adding the new capability.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)
820-841: Fix the typo in the method name and address the line length issues.

The method name has a typo and there are line length violations.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)
935-935: Validate the out-of-vocabulary token index.

Using `self.vocab_size + 1` as a hardcoded out-of-vocabulary token index could lead to conflicts. Consider making this configurable or validating that this index is reserved for this purpose.

```diff
- token_ids[token_ids == self.image_token_index] = self.vocab_size + 1
+ # Use a well-defined OOV token index, potentially from model config
+ oov_token_index = getattr(self.model_config, 'oov_token_index', self.vocab_size + 1)
+ token_ids[token_ids == self.image_token_index] = oov_token_index
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/models/modeling_llama.py (3 hunks)
🧠 Learnings (2)

📓 Common learnings
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.703Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

The same learning is attached to tensorrt_llm/_torch/models/modeling_llama.py (1).
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_llama.py
835-835: Line too long (122 > 120) (E501)
836-836: Line too long (121 > 120) (E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/models/modeling_llama.py (2)
2-2: Import addition looks good.

The `Any` type import is necessary for the new method's type annotations.
808-811: Constants definition looks good.

The new image token constants are well-defined and provide clear abstractions for multimodal token processing.
Hi @yechank-nvidia , would it be okay to add trtllm-serve support in a follow-up PR? I plan to integrate this feature into trtllm-serve as part of the EPD integration work (#5000).
/bot run
PR_Github #12909 [ run ] triggered by Bot
PR_Github #12909 [ run ] completed with state
/bot run
1 similar comment
/bot run
PR_Github #13028 [ run ] triggered by Bot
PR_Github #13028 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #13056 [ run ] triggered by Bot
PR_Github #13056 [ run ] completed with state
Signed-off-by: Chang Liu <[email protected]>
/bot run --disable-fail-fast
PR_Github #13292 [ run ] triggered by Bot
/bot run --disable-fail-fast
PR_Github #13404 [ run ] triggered by Bot
PR_Github #13292 [ run ] completed with state
/bot run
/bot run
PR_Github #13475 [ run ] triggered by Bot
PR_Github #13404 [ run ] completed with state
PR_Github #13475 [ run ] completed with state
…VIDIA#6263) Signed-off-by: Chang Liu <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
…VIDIA#6263) Signed-off-by: Chang Liu <[email protected]>
[TRTLLM-6654][feat] Support external multimodal embeddings as the input to LLM decoder
This PR adds support for external embedding tensors as an additional attribute of PromptInput, which is then passed to the lead worker via the shared tensor utility (#5396). Currently, only LLaMA4 and LLaVA models are supported.
Note: Unlike typical multimodal models, Llama4's multimodal embeddings are not represented as a contiguous token block per image. Therefore, users must also provide `image_special_tokens` and `image_special_offsets` to correctly align the embeddings with the text input. See how the image token ids are generated in `input_ids` here: https://github.com/huggingface/transformers/blob/73869f2e81467db8422cbb4831cce9a7bdc85c4b/src/transformers/models/llama4/processing_llama4.py#L121-L135

Note: PR dependency: #6254
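As a rough illustration only, here is a minimal sketch of attaching external embeddings through the LLM API. The prompt field name (`multi_modal_embeddings`) and the per-image keys (`mm_embeddings`, `image_special_tokens`, `image_special_token_offsets`) follow this PR's review comments; the model id, shapes, and token ids are placeholders, so treat the whole snippet as an assumption rather than a verified end-to-end example:

```python
import torch
from tensorrt_llm import LLM

llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")  # assumed model id

# Precomputed image embeddings from an external vision encoder, already
# projected to the text model's hidden size. Values below are placeholders.
image_entry = {
    "mm_embeddings": torch.randn(144, 5120, dtype=torch.bfloat16),
    "image_special_tokens": [200080, 200081],      # assumed special-token ids
    "image_special_token_offsets": [0, 145],       # assumed positions in the image block
}

outputs = llm.generate([{
    "prompt": "<|image|> Describe this picture.",
    "multi_modal_embeddings": {"image": [image_entry]},
}])
print(outputs[0].outputs[0].text)
```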
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Description
Test Coverage
- `to_tensor` and `to_handle` functions

GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
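For example, a typical invocation that disables fail-fast and restricts the GPU types (using the flags documented below) would be posted as a PR comment:

/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"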
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.