[None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens #6259
Conversation
📝 Walkthrough
This update introduces a PyTorch-specific token length validation in tensorrt_llm/llmapi/llm.py and adds tests for the disaggregated server with different max_num_tokens settings between context and generation servers.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant TestRunner
    participant LLM (PyTorch)
    participant Validator
    TestRunner->>LLM (PyTorch): Submit prompt (long prompt, max_tokens=4)
    LLM (PyTorch)->>Validator: Check prompt and query token length
    Validator-->>LLM (PyTorch): Raise ValueError if token sum > max_num_tokens
    LLM (PyTorch)-->>TestRunner: Return error or continue
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
📜 Recent review details
Configuration used: .coderabbit.yaml
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py: 📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
**/*.{cpp,h,hpp,cc,cxx,cu,py}: 📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
🧠 Learnings (5)
📓 Common learnings
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Applied to files:
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Applied to files:
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Applied to files:
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Applied to files:
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py
574-574: Line too long (147 > 120) (E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🔇 Additional comments (1)
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm.py (1)
547-554: LGTM: Sound validation logic for PyTorch backend token limits. The new validation correctly:
- Scopes to PyTorch backend only
- Skips validation for appropriate cases (generation-only, chunked prefill)
- Accounts for context parallelism by dividing by cp_size
- Provides informative error messages
However, address the line length issue:
```diff
- if max_num_tokens and prompt_len / self.args.parallel_config.cp_size + query_len > max_num_tokens:
+ if (max_num_tokens and
+         prompt_len / self.args.parallel_config.cp_size + query_len > max_num_tokens):
```
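To make the reviewed condition concrete, here is a minimal, self-contained sketch of the check (variable names follow the diff above; the real implementation lives inside the LLM class in tensorrt_llm/llmapi/llm.py and differs in detail):

```python
from typing import Optional


def check_token_budget(prompt_len: int, query_len: int,
                       max_num_tokens: Optional[int], cp_size: int = 1) -> None:
    """Illustrative version of the PyTorch-backend token budget check.

    Raises ValueError when the per-rank share of the prompt plus the query
    tokens exceeds max_num_tokens; the skipping logic for generation-only
    requests and chunked prefill is omitted here.
    """
    if not max_num_tokens:
        return  # no limit configured
    # Context parallelism splits the prompt across cp_size ranks.
    if prompt_len / cp_size + query_len > max_num_tokens:
        raise ValueError(
            f"Prompt tokens ({prompt_len}) split over cp_size ({cp_size}) plus "
            f"query tokens ({query_len}) exceed max_num_tokens ({max_num_tokens}).")
```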
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- examples/disaggregated/clients/long_prompts.json (1 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_diff_max_tokens.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_disaggregated.py (5 hunks)
- tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
- tests/unittest/llmapi/test_llm.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (1 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧠 Learnings (3)
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.703Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🧬 Code Graph Analysis (6)
tests/unittest/llmapi/test_llm_pytorch.py (3)
  tests/unittest/llmapi/test_llm.py (4)
    get_model_path (63-67)
    llm_get_stats_async_test_harness (1971-2038)
    _test_llm_capture_request_error (2091-2116)
    test_llm_capture_request_error (2119-2120)
  tests/unittest/llmapi/test_llm_multi_gpu.py (1)
    test_llm_capture_request_error (468-469)
  tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (1)
    test_llm_capture_request_error (17-18)
tests/unittest/llmapi/test_llm_multi_gpu.py (1)
  tests/unittest/llmapi/test_llm.py (1)
    _test_llm_capture_request_error (2091-2116)
tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (3)
  tests/unittest/llmapi/test_llm.py (2)
    _test_llm_capture_request_error (2091-2116)
    test_llm_capture_request_error (2119-2120)
  tests/unittest/llmapi/test_llm_multi_gpu.py (1)
    test_llm_capture_request_error (468-469)
  tests/unittest/llmapi/test_llm_pytorch.py (1)
    test_llm_capture_request_error (72-73)
tensorrt_llm/llmapi/llm.py (2)
  tensorrt_llm/llmapi/llm_args.py (1)
    parallel_config (1151-1152)
  tensorrt_llm/_torch/distributed/communicator.py (1)
    cp_size (38-39)
tests/unittest/llmapi/test_llm.py (6)
  tensorrt_llm/llmapi/llm.py (1)
    LLM (1023-1039)
  tensorrt_llm/builder.py (1)
    BuildConfig (481-704)
  tensorrt_llm/executor/utils.py (1)
    RequestError (76-77)
  tests/unittest/llmapi/test_llm_multi_gpu.py (1)
    test_llm_capture_request_error (468-469)
  tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (1)
    test_llm_capture_request_error (17-18)
  tests/unittest/llmapi/test_llm_pytorch.py (1)
    test_llm_capture_request_error (72-73)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
  tests/integration/defs/conftest.py (4)
    disaggregated_test_root (2335-2340)
    disaggregated_example_root (270-275)
    llm_venv (707-723)
    llama_model_root (964-1039)
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py
552-552: Line too long (147 > 120)
(E501)
🔇 Additional comments (18)
examples/disaggregated/clients/long_prompts.json (1)
1-4: LGTM! Well-structured test data for token limit validation. The JSON file contains a single long prompt that effectively serves its purpose for testing token limit validation in disaggregated server scenarios. The content about "Place of birth" is appropriate and provides sufficient length to test max token enforcement.
tests/integration/defs/disaggregated/test_configs/disagg_config_diff_max_tokens.yaml (1)
1-24: LGTM! Configuration effectively tests different max token limits across server types. The YAML configuration properly sets up the disaggregated testing scenario with:
- Context servers: 512 max tokens
- Generation servers: 256 max tokens
This difference in max_num_tokens values is exactly what's needed to test the token validation logic mentioned in the PR objectives.

tests/integration/test_lists/test-db/l0_a10.yml (1)
24-24: LGTM! Test case properly added to the integration test suite. The new test_disaggregated_diff_max_tokens test case is appropriately placed in the PyTorch backend section for A10 GPU testing, following the same pattern as other disaggregated tests.

tests/unittest/llmapi/test_llm_pytorch.py (2)
9-13: LGTM! Import statement properly updated to include the error capture test function. The import statement correctly includes _test_llm_capture_request_error alongside other test utilities, maintaining proper organization.
72-74: LGTM! Test function correctly validates PyTorch backend error handling. The new test_llm_capture_request_error function properly tests token limit validation for the PyTorch backend with appropriate parameters (pytorch_backend=True, tp_size=1). This aligns with the enhanced token validation mentioned in the PR objectives.

tests/unittest/llmapi/test_llm_multi_gpu.py (1)
469-469: LGTM! Function call properly updated with required backend parameter. The explicit parameters pytorch_backend=False, tp_size=2 are appropriate for multi-GPU non-PyTorch testing, ensuring the correct error type (RequestError) is validated when token limits are exceeded.

tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (2)
10-10: LGTM: Appropriate import for test helper function. The import of _test_llm_capture_request_error is correctly added to support the new test function.
16-18: LGTM: Well-structured test for multi-GPU PyTorch backend. The test function correctly calls the helper with pytorch_backend=True and tp_size=2, providing appropriate test coverage for error handling in the multi-GPU PyTorch context. The @pytest.mark.gpu2 decorator properly indicates the hardware requirements.

tensorrt_llm/llmapi/llm.py (1)
571-571: LGTM: Improved error message accuracy. Using the dynamically loaded max_seq_len variable instead of the static build_config.max_seq_len provides more accurate error reporting by reflecting the actual limit from the built engine configuration.

tests/unittest/llmapi/test_llm.py (5)
2092-2092: Function signature update looks good. The addition of the required pytorch_backend parameter enables proper backend selection while maintaining the existing tp_size parameter with its default value.
2093-2102: Backend configuration logic is well-structured. The conditional logic properly handles the differences between PyTorch and TensorRT backends:
- Correct LLM class selection (LLM_torch vs LLM)
- Appropriate configuration methods (direct parameter vs BuildConfig)
- Consistent token limit (64) across both backends
2104-2108: LLM instantiation is correctly implemented. The dynamic class selection and parameter passing follow established patterns, with proper backend-specific configuration through the unpacked llm_args_extra.
2110-2117: Error handling logic correctly tests backend-specific behaviors. The test properly validates token limit enforcement (a simplified sketch follows below):
- Prompt with 65 tokens correctly exceeds the 64-token limit
- Backend-specific exception types (ValueError for PyTorch, RequestError for TensorRT) reflect implementation differences
- Test structure ensures proper error validation
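As a hedged illustration of this backend-specific error handling, here is a simplified pytest-style sketch. It does not build a real LLM; the helper name and prompt construction are stand-ins, and the RequestError class below is a local substitute for tensorrt_llm.executor.utils.RequestError.

```python
import pytest


class RequestError(RuntimeError):
    """Local stand-in for tensorrt_llm.executor.utils.RequestError in this sketch."""


def submit_overlong_prompt(pytorch_backend: bool, max_num_tokens: int = 64) -> None:
    """Model the failure mode described above without building a real LLM.

    The repository helper (_test_llm_capture_request_error) constructs an
    actual LLM; here the backend-specific exception is raised directly.
    """
    prompt_tokens = max_num_tokens + 1  # one token over the 64-token budget
    if prompt_tokens > max_num_tokens:
        if pytorch_backend:
            raise ValueError("prompt exceeds max_num_tokens")  # PyTorch backend path
        raise RequestError("prompt exceeds max_num_tokens")    # TensorRT backend path


@pytest.mark.parametrize("pytorch_backend, expected",
                         [(True, ValueError), (False, RequestError)])
def test_capture_request_error_sketch(pytorch_backend, expected):
    with pytest.raises(expected):
        submit_overlong_prompt(pytorch_backend)
```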
2120-2121: Test function update maintains backward compatibility. Explicitly passing pytorch_backend=False preserves the existing test behavior for the TensorRT backend while enabling the new parameterized functionality.

tests/integration/defs/disaggregated/test_disaggregated.py (4)
39-40: LGTM! Configuration entry follows established patterns. The new "2_ranks_diff_max_tokens" configuration entry is well-named and follows the existing pattern in the config_map. It appropriately uses 2 ranks for testing different max_num_tokens settings between context and generation servers.
139-140: Good backward-compatible enhancement. Adding the prompt_file parameter with a sensible default maintains backward compatibility while enabling the new functionality for testing with different prompt files.
185-187: Well-designed special handling for long prompts test. The conditional logic for "long_prompts.json" is well-implemented (see the sketch after this comment):
- Adding --max-tokens 4 appropriately reduces test execution time
- Skipping output verification makes sense since this test focuses on configuration validation rather than output correctness
The code is clean and well-commented.
Also applies to: 216-218
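A minimal sketch of how such special-casing could look; the helper names (extend_client_cmd, should_verify_outputs) are hypothetical, and only the --max-tokens value comes from the review above. The actual conditional lives in tests/integration/defs/disaggregated/test_disaggregated.py.

```python
from typing import List


def extend_client_cmd(client_cmd: List[str], prompt_file: str) -> List[str]:
    """Append prompt-file-specific options to an already-built client command.

    How the rest of the command is assembled is intentionally left out of
    this sketch.
    """
    if prompt_file == "long_prompts.json":
        # Keep generation short: this test validates max_num_tokens handling,
        # not output quality.
        return client_cmd + ["--max-tokens", "4"]
    return client_cmd


def should_verify_outputs(prompt_file: str) -> bool:
    """Output verification is skipped for the long-prompt configuration test."""
    return prompt_file != "long_prompts.json"
```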
263-282: Excellent test implementation following established patterns. The new test_disaggregated_diff_max_tokens function is well-implemented:
- Follows the established pattern used by other tests in the file
- Proper model parameterization and symbolic link setup
- Correct usage of the modified run_disaggregated_test function
- Uses appropriate configuration and prompt file for testing different max_num_tokens scenarios
The test effectively covers the scenario described in the PR objectives.
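For orientation, a rough, self-contained sketch of the assumed shape of that test. The run_disaggregated_test stub below only records its arguments; the real helper launches the context/generation servers from the named config and drives the example client, and its exact signature and default prompt file are assumptions.

```python
def run_disaggregated_test(example_root, test_name, prompt_file="prompts.json"):
    """Stub standing in for the real helper in test_disaggregated.py.

    Recording the call keeps the sketch runnable without any servers; the
    default prompt file name here is an assumption.
    """
    return {"example_root": example_root, "test": test_name,
            "prompt_file": prompt_file}


def test_disaggregated_diff_max_tokens_sketch(tmp_path):
    # Assumed shape of the new test: reuse the shared helper with the
    # "2_ranks_diff_max_tokens" config entry and the long-prompt client input.
    result = run_disaggregated_test(str(tmp_path), "2_ranks_diff_max_tokens",
                                    prompt_file="long_prompts.json")
    assert result["prompt_file"] == "long_prompts.json"
```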
PR_Github #12576 [ run ] triggered by Bot
PR_Github #12576 [ run ] completed with state
5954e17 to 077c97c (Compare)
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm.py (1)
547-554: Good token validation logic, but fix line length. The PyTorch-specific token length validation is well-implemented with appropriate conditions and clear error messaging. However, line 552 exceeds the 120-character limit.
Apply this diff to fix the line length issue:
```diff
- if max_num_tokens and prompt_len / self.args.parallel_config.cp_size + query_len > max_num_tokens:
+ if (max_num_tokens and
+         prompt_len / self.args.parallel_config.cp_size + query_len > max_num_tokens):
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- examples/disaggregated/clients/long_prompts.json (1 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_diff_max_tokens.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_disaggregated.py (5 hunks)
- tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
- tests/unittest/llmapi/test_llm.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (1 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧠 Learnings (1)
tensorrt_llm/llmapi/llm.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.703Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🧬 Code Graph Analysis (1)
tensorrt_llm/llmapi/llm.py (2)
  tensorrt_llm/llmapi/llm_args.py (1)
    parallel_config (1131-1132)
  tensorrt_llm/_torch/distributed/communicator.py (1)
    cp_size (38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py
552-552: Line too long (147 > 120)
(E501)
✅ Files skipped from review due to trivial changes (1)
- tests/unittest/llmapi/test_llm_pytorch.py
🚧 Files skipped from review as they are similar to previous changes (7)
- tests/integration/test_lists/test-db/l0_a10.yml
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
- examples/disaggregated/clients/long_prompts.json
- tests/integration/defs/disaggregated/test_configs/disagg_config_diff_max_tokens.yaml
- tests/unittest/llmapi/test_llm_multi_gpu.py
- tests/unittest/llmapi/test_llm.py
- tests/integration/defs/disaggregated/test_disaggregated.py
🔇 Additional comments (1)
tensorrt_llm/llmapi/llm.py (1)
571-571: Good consistency improvement in error message. Using the dynamically loaded max_seq_len value in the error message is consistent with the validation logic and provides more accurate error reporting.
PR_Github #12657 [ run ] triggered by Bot
PR_Github #12657 [ run ] completed with state
077c97c to 78a7425 (Compare)
/bot run
PR_Github #12665 [ run ] triggered by Bot
PR_Github #12665 [ run ] completed with state
78a7425 to 3cb7b91 (Compare)
/bot run
PR_Github #12706 [ run ] triggered by Bot
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm.py (1)
547-554: Fix line length violation and approve the validation logic. The PyTorch-specific token validation logic is correct and well-implemented. It properly accounts for context parallelism by dividing prompt length by cp_size and appropriately skips validation for chunked prefill and generation-only requests. However, there's a line length violation that needs to be addressed.
Apply this diff to fix the line length issue:
```diff
- if max_num_tokens and prompt_len / self.args.parallel_config.cp_size + query_len > max_num_tokens:
+ if (max_num_tokens and
+         prompt_len / self.args.parallel_config.cp_size + query_len > max_num_tokens):
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- examples/disaggregated/clients/long_prompts.json (1 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_diff_max_tokens.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_disaggregated.py (5 hunks)
- tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
- tests/unittest/llmapi/test_llm.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (1 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧠 Learnings (1)
tensorrt_llm/llmapi/llm.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.703Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py
552-552: Line too long (147 > 120)
(E501)
🚧 Files skipped from review as they are similar to previous changes (8)
- examples/disaggregated/clients/long_prompts.json
- tests/unittest/llmapi/test_llm_multi_gpu.py
- tests/integration/test_lists/test-db/l0_a10.yml
- tests/unittest/llmapi/test_llm_pytorch.py
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
- tests/integration/defs/disaggregated/test_configs/disagg_config_diff_max_tokens.yaml
- tests/unittest/llmapi/test_llm.py
- tests/integration/defs/disaggregated/test_disaggregated.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/llmapi/llm.py (1)
571-571: LGTM! Improved error message accuracy. The change to use the dynamically read max_seq_len value in the error message is correct and improves accuracy. This ensures the error message reflects the actual limit being enforced rather than the static build configuration value.
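To illustrate the point, a small hedged sketch of the pattern being praised; the helper names and the fallback rule are assumptions, not the code in llm.py:

```python
from typing import Optional


def resolve_max_seq_len(engine_max_seq_len: Optional[int],
                        build_config_max_seq_len: int) -> int:
    """Prefer the limit recorded in the built engine over the static build config."""
    if engine_max_seq_len is not None:
        return engine_max_seq_len
    return build_config_max_seq_len


def check_sequence_length(total_len: int, max_seq_len: int) -> None:
    # Report the same value that the check uses, so the message stays accurate.
    if total_len > max_seq_len:
        raise ValueError(f"Total token count ({total_len}) exceeds "
                         f"max_seq_len ({max_seq_len}).")
```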
PR_Github #12706 [ run ] completed with state
/bot run
3cb7b91 to c08f58d (Compare)
/bot run
Signed-off-by: Pengyun Lin <[email protected]>
b12a7dd to d6dd04d (Compare)
/bot run
PR_Github #13662 [ run ] triggered by Bot
PR_Github #13662 [ run ] completed with state
/bot run
PR_Github #13789 [ run ] triggered by Bot
PR_Github #13789 [ run ] completed with state
/bot skip -m "CI passed before"
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
/bot skip --comment "CI passed before"
PR_Github #13939 [ skip ] triggered by Bot
/bot reuse-pipeline
PR_Github #13953 [ reuse-pipeline ] triggered by Bot
PR_Github #13939 [ skip ] completed with state
PR_Github #13953 [ reuse-pipeline ] completed with state
…ifferent max_num_tokens (NVIDIA#6259) Signed-off-by: Pengyun Lin <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
…ifferent max_num_tokens (NVIDIA#6259) Signed-off-by: Pengyun Lin <[email protected]>
Description
Test Coverage
Add a test as described above.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Bug Fixes
Tests