[None][feat] Support SharedTensor on MultimodalParams #6254
Conversation
Walkthrough

The changes focus on improving the handling of multimodal tensor data throughout the codebase. They introduce new methods for recursively converting, validating, and transferring multimodal data between shared memory and device memory. Control flow in the model engine and API is updated to ensure shared tensor containers are properly managed and reassigned during data preparation. Additionally, runtime data support for multimodal token caching is added.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant API as LLM API
    participant Params as MultimodalParams
    participant Engine as ModelEngine
    participant Worker as Executor Worker
    API->>Params: to_shared_tensor("multimodal_data")
    Params->>Params: _apply_tensor_operation (recursive conversion)
    API->>Engine: Pass multimodal_params (shared tensor)
    Engine->>Params: from_shared_tensor("multimodal_data")
    Params->>Params: _apply_tensor_operation (recover container)
    Engine->>Params: to_device("multimodal_data", "cuda")
    Params->>Params: _apply_tensor_operation (move to device)
    Engine->>Worker: Pass request with multimodal_data
    Worker->>Params: from_shared_tensor("multimodal_data")
    Params->>Params: _apply_tensor_operation (recover container)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
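The lifecycle in the walkthrough and diagram hinges on a single recursive helper, `_apply_tensor_operation`, applied at each hop (convert, recover, move to device). Below is a minimal, runnable sketch of that recursion using plain Python stand-ins; the wrap/unwrap lambdas are hypothetical placeholders for shared-memory serialization, not the TensorRT-LLM implementation:

```python
def _apply_tensor_operation(data, op):
    """Recursively apply `op` to every leaf in nested dicts/lists."""
    if isinstance(data, dict):
        return {k: _apply_tensor_operation(v, op) for k, v in data.items()}
    if isinstance(data, list):
        return [_apply_tensor_operation(v, op) for v in data]
    return op(data)  # leaf: hand off to the per-tensor operation

# Round-trip example: wrap every leaf (stand-in for "move to shared memory"),
# then unwrap it (stand-in for recovering the tensor on the consumer side).
params = {"image": [1, 2], "meta": {"x": 3}}
shared = _apply_tensor_operation(params, lambda t: ("shared", t))
recovered = _apply_tensor_operation(shared, lambda t: t[1])
assert recovered == params
```

The same traversal serves all three operations in the diagram; only the per-leaf `op` changes.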
Actionable comments posted: 2
🧹 Nitpick comments (3)
tensorrt_llm/inputs/multimodal.py (3)
Lines 193-220: Add validation for unknown method_key values.

The method correctly validates method_key values 1 and 2, but returns False for any other value without validation. Consider explicitly handling unknown values to catch potential issues early.
```diff
  # Additional validation based on method_key
  method_key = obj.get('method_key')
  if method_key == 1:  # CUDA tensor
      cuda_keys = {'tensor_size', 'storage_handle', 'storage_device'}
      return cuda_keys.issubset(obj.keys())
  elif method_key == 2:  # CPU tensor
      cpu_keys = {'tensor_size', 'storage_handle', 'manager_handle'}
      return cpu_keys.issubset(obj.keys())
+ else:
+     # Log warning for unknown method_key but still return False
+     if method_key is not None:
+         logger.debug(f"Unknown method_key value: {method_key}")
  return False
```
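A standalone, runnable version of the reviewer's suggestion (the function name and logger setup are illustrative; only the key names come from the review):

```python
import logging

logger = logging.getLogger(__name__)

def is_valid_shared_tensor_dict(obj: dict) -> bool:
    """Validate a serialized-tensor dict by its method_key."""
    method_key = obj.get('method_key')
    if method_key == 1:  # CUDA tensor
        cuda_keys = {'tensor_size', 'storage_handle', 'storage_device'}
        return cuda_keys.issubset(obj.keys())
    if method_key == 2:  # CPU tensor
        cpu_keys = {'tensor_size', 'storage_handle', 'manager_handle'}
        return cpu_keys.issubset(obj.keys())
    if method_key is not None:
        # Surface unexpected values instead of failing silently
        logger.debug("Unknown method_key value: %s", method_key)
    return False
```

The debug log makes a malformed or future method_key visible in traces while preserving the existing return-False behavior.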
Line 259: Remove commented debug print statement.

```diff
- # print(f"input_data: {input_data}")
  return SharedTensorContainer.from_dict(
```
Lines 256-257: Consider moving imports to module level for better performance.

The method imports `SharedTensorContainer` inside the function body, which could impact performance if called frequently. Consider moving these imports to the module level:

```python
from tensorrt_llm._torch.shared_tensor import SharedTensorContainer
```

Then remove the import statements from within the method.

Also applies to: 278-279
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (2 hunks)
- tensorrt_llm/inputs/multimodal.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
🔇 Additional comments (6)
tensorrt_llm/llmapi/llm.py (1)
Lines 396-397: LGTM! SharedTensor conversion properly placed.

The addition of `multimodal_params.to_shared_tensor("multimodal_data")` is correctly positioned after validating that the multimodal parameters have content. This ensures efficient shared memory usage by only converting non-empty multimodal data, and aligns well with the PR objective to support SharedTensor on MultimodalParams.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
Lines 1161-1167: Approve shared tensor recovery flow.

I've checked the implementations of `from_shared_tensor` and `to_device` in `tensorrt_llm/inputs/multimodal.py`. Both methods perform input validation by raising `ValueError` (as documented), and there are no internal `try`/`except` blocks; errors are meant to propagate to higher-level handlers. No additional error handling is required here. All set!
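The validate-then-propagate pattern described above can be sketched as follows. The attribute set, function names, and signatures here are illustrative stand-ins, not the actual TensorRT-LLM API:

```python
VALID_ATTRS = {"multimodal_data", "multimodal_runtime"}

def _validate_attr(element: str) -> None:
    # Raise on bad input; callers (e.g. the model engine) see the ValueError.
    if element not in VALID_ATTRS:
        raise ValueError(f"Unsupported attribute {element!r}; "
                         f"expected one of {sorted(VALID_ATTRS)}")

def from_shared_tensor(params: dict, element: str) -> dict:
    _validate_attr(element)  # no try/except here: errors propagate upward
    # A real implementation would rebuild tensors from shared-memory handles.
    return params
```

Keeping the methods free of internal `try`/`except` means a bad attribute name fails loudly at the call site rather than being swallowed mid-pipeline.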
tensorrt_llm/inputs/multimodal.py (4)
Lines 305-321: LGTM! Clean validation helper.

The method provides clear error messages and proper validation.

Lines 322-342: LGTM! Well-implemented shared tensor conversion.

The method properly validates input and leverages the recursive helper for conversion.

Lines 368-395: LGTM! Excellent refactoring of to_device method.

The refactored implementation properly validates input and leverages the unified tensor operation framework with improved error handling.

Lines 396-414: LGTM! Correctly preserves entire mrope_config.

The updated implementation properly retains the complete mrope_config dictionary during generation, which aligns with the PR objectives.
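The mrope_config point amounts to copying the whole dictionary rather than cherry-picking sub-keys, so fields added later are not silently dropped. A hedged sketch (function name and key names are illustrative, not the actual code):

```python
def slice_for_generation(multimodal_data: dict) -> dict:
    """Select the data needed during generation."""
    out = {}
    for key, value in multimodal_data.items():
        if key == "mrope_config":
            # Keep the entire dict intact instead of copying selected sub-keys
            out[key] = value
        # other entries would be filtered/sliced here as needed
    return out
```

Copying the container wholesale is the robust choice when the set of sub-keys is owned by another component and may grow.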
Hi @yechank-nvidia, when you get a chance, can you benchmark your branch with
You only notice this if you go really high in the number of concurrent requests; at low concurrency it does not break.
/bot run
PR_Github #14347 [ run ] triggered by Bot
PR_Github #14347 [ run ] completed with state
6831b59 to ce42696
/bot run
PR_Github #14377 [ run ] triggered by Bot
PR_Github #14377 [ run ] completed with state
/bot run
PR_Github #14392 [ run ] triggered by Bot
PR_Github #14392 [ run ] completed with state
ce42696 to e041662
/bot run
PR_Github #14527 [ run ] triggered by Bot
PR_Github #14527 [ run ] completed with state
Signed-off-by: yechank <[email protected]>
e7e7a97 to 204c30a
/bot run
PR_Github #14554 [ run ] triggered by Bot
PR_Github #14554 [ run ] completed with state
This PR changes the tensor objects in MultimodalParams into SharedTensor containers.