[TRTLLM-5830][feat] Improve LoRA cache memory control #6220
Conversation
📝 Walkthrough

This update refactors LoRA and PEFT cache configuration management in both core and test code. Deprecated LoRA fields are removed, LoRA config handling is unified, and PEFT cache merging is improved. Type annotations are updated for flexibility, and new tests are added to verify cache sizing and config override behaviors.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant LLM
    participant EngineConfig
    participant LoraConfig
    participant PeftCacheConfig
    participant Executor
    User->>LLM: Build model (with/without lora_config)
    LLM->>EngineConfig: Load engine config
    alt lora_plugin enabled
        EngineConfig->>LoraConfig: Load lora_config
        alt User provides lora_config
            LLM->>LoraConfig: Override with user lora_config
        end
        LLM->>PeftCacheConfig: Merge existing config, update fields if lora_config present
    end
    LLM->>Executor: Create with merged lora_config and peft_cache_config
```
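To make the merge flow concrete, here is a hedged usage sketch of the path the diagram describes. The model and adapter paths are placeholders, and the exact import locations are assumptions based on the files touched in this PR:

```python
from tensorrt_llm.llmapi import LLM
from tensorrt_llm.llmapi.llm_args import PeftCacheConfig
from tensorrt_llm.lora_manager import LoraConfig

# A user-provided lora_config overrides the one loaded from the engine
# config; when its cache sizes are set, they take precedence over the
# corresponding peft_cache_config fields during the merge.
llm = LLM(
    model="/path/to/model",             # placeholder
    lora_config=LoraConfig(
        lora_dir=["/path/to/adapter"],  # placeholder
        max_loras=2,                    # device cache capacity, in adapters
        max_cpu_loras=2,                # host cache capacity, in adapters
    ),
    peft_cache_config=PeftCacheConfig(host_cache_size=2**30),
)
```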
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~45 minutes
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/unittest/llmapi/test_llm_pytorch.py (1)
331-344: Fix the line length violation.

The test logic is excellent - it validates that `lora_config` cache size parameters override conflicting `peft_cache_config` values by successfully running with small cache sizes in `peft_cache_config` but adequate sizes in `lora_config`. However, there's a line length issue that needs to be addressed.

Apply this diff to fix the line length violation:

```diff
-    """Tests that cache size args in lora_config LLM arg override the cache size parameters in peft_cache_config LLM arg."""
+    """Tests that cache size args in lora_config LLM arg override the cache size
+    parameters in peft_cache_config LLM arg."""
```

tensorrt_llm/llmapi/llm_args.py (1)
702-719: Good factory method implementation with minor formatting suggestion.

The `create_from_pybind` method correctly implements the factory pattern for converting pybind objects to Python objects, supporting the flexible PEFT cache configuration mentioned in the PR objectives.

Minor nitpick: Consider formatting the docstring as a single line per the static analysis hint.

```diff
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create PeftCacheConfig from pybind object."""
```

tests/unittest/llmapi/test_llm.py (1)
1438-1453: Fix line length and approve override testing logic

The test effectively validates that LoRA config cache size parameters override PEFT cache config by creating a scenario where PEFT config would fail but LoRA config succeeds.

However, line 1438 exceeds the 120-character limit:

```diff
-def test_llama_7b_lora_config_overrides_peft_cache_config():
-    """Tests that cache size args in lora_config LLM arg override the cache size parameters in peft_cache_config LLM arg."""
+def test_llama_7b_lora_config_overrides_peft_cache_config():
+    """Tests that cache size args in lora_config LLM arg override the cache size
+    parameters in peft_cache_config LLM arg."""
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (3 hunks)
- tensorrt_llm/llmapi/llm_args.py (3 hunks)
- tensorrt_llm/lora_manager.py (1 hunks)
- tests/unittest/llmapi/test_llm.py (3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/lora_manager.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/_util.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm_pytorch.py
332-332: Line too long (124 > 120)
(E501)
tests/unittest/llmapi/test_llm.py
1438-1438: Line too long (124 > 120)
(E501)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
🔇 Additional comments (12)
tensorrt_llm/lora_manager.py (1)

147-148: LGTM! Clean type annotation update for flexible cache configuration.

The change from `int = 4` to `int | None = None` for both `max_loras` and `max_cpu_loras` aligns perfectly with the PR objective of making LoRA cache sizes optional. This allows the system to determine appropriate defaults via `PeftCacheConfig` when these values are unset.
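For reference, a minimal sketch of what the annotation change looks like on the config class; the surrounding fields are omitted and `Optional` is used here for readability:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoraConfig:
    # Previously: max_loras: int = 4 (a hard-coded cache capacity).
    # None now means "unset", so PeftCacheConfig supplies the default.
    max_loras: Optional[int] = None
    max_cpu_loras: Optional[int] = None
```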
tests/unittest/llmapi/test_llm_pytorch.py (3)

4-4: LGTM! Proper import addition for test functionality.

The import of `PeftCacheConfig` is necessary for the new cache configuration tests.
9-11: LGTM! Required import for multi-LoRA test harness.

The import of `check_llama_7b_multi_lora_from_request_test_harness` is correctly added to support the new cache configuration tests.
297-329: LGTM! Well-designed test for PEFT cache configuration validation.

This test effectively verifies that `PeftCacheConfig` parameters directly impact cache sizing by intentionally setting values too small to hold a single adapter and expecting `RuntimeError` exceptions. The approach of testing failure cases is a solid strategy when direct cache size inspection isn't possible.

The test covers both `host_cache_size` and `device_cache_percent` parameters, providing comprehensive validation.
tensorrt_llm/_torch/pyexecutor/_util.py (2)

14-14: LGTM! Required import for PEFT cache configuration merging.

The import of `PeftCacheConfig` is necessary for the new cache configuration merging logic.
471-481: Excellent refactoring for flexible PEFT cache configuration.

The updated logic elegantly handles the merging of existing PEFT cache configuration with LoRA-specific parameters:

- Preserves existing config: Creates `PeftCacheConfig` from the existing pybind config when available, or uses defaults
- Conditional overrides: Only updates `num_device_module_layer` and `num_host_module_layer` when `lora_config.max_loras` and `lora_config.max_cpu_loras` are explicitly set (not `None`)
- Maintains backwards compatibility: Falls back to defaults when no existing config is present

This approach is much more flexible than the previous direct construction and aligns perfectly with the PR objective of making LoRA cache sizes optional while allowing overrides.
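A minimal sketch of that merge, under the assumption that the module-layer counts scale linearly with the number of adapters; `modules_per_model` is a hypothetical scaling factor, not a variable from the codebase:

```python
def merge_peft_cache_config(existing_pybind_config, lora_config, modules_per_model):
    # Preserve the existing config when present, otherwise start from defaults.
    if existing_pybind_config is not None:
        merged = PeftCacheConfig.create_from_pybind(existing_pybind_config)
    else:
        merged = PeftCacheConfig()

    # Override the module-layer counts only when the LoRA cache sizes were
    # explicitly set; None means "leave the existing value alone".
    if lora_config.max_loras is not None:
        merged.num_device_module_layer = lora_config.max_loras * modules_per_model
    if lora_config.max_cpu_loras is not None:
        merged.num_host_module_layer = lora_config.max_cpu_loras * modules_per_model
    return merged
```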
tensorrt_llm/llmapi/llm_args.py (2)
644-644: Good documentation improvement!

The clarifying suffixes help users understand that these parameters affect cache sizes and take precedence over other cache size settings, which aligns well with the PR's goal of improving LoRA cache memory control.
Also applies to: 649-649
1424-1428: Correctly implements optional LoraConfig handling.

The changes properly make `lora_config` optional by:

- Always enabling `lora_plugin` when `enable_lora` is true (for non-PyTorch backends), regardless of `lora_config` presence
- Adding a safety check before accessing `lora_config.max_lora_rank`

This aligns with the PR objective of making `LoraConfig.max_loras` and `LoraConfig.max_cpu_loras` optional parameters.
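As a rough illustration only (the attribute paths below are assumptions, not the actual code), the simplified logic reads:

```python
# Sketch of the updated validation flow; names are assumed.
if self.enable_lora and self.backend != 'pytorch':
    # lora_plugin no longer depends on lora_config being present.
    self.build_config.plugin_config.lora_plugin = 'auto'
    # Safety check before touching max_lora_rank.
    if self.lora_config is not None:
        max_lora_rank = self.lora_config.max_lora_rank
```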
tests/unittest/llmapi/test_llm.py (3)

38-39: LGTM: Import changes support new PEFT cache testing functionality

The new imports are correctly added to support the PEFT cache configuration testing and test harness usage.

Also applies to: 53-55
1393-1400: LGTM: Test harness integration improves consistency

The modification to use the test harness with additional parameters is a good refactoring that standardizes the LoRA testing approach while maintaining the original test logic.
1403-1436: LGTM: Effective negative testing approach for cache configuration validation

The test correctly validates that PEFT cache configuration parameters affect actual cache sizes by testing failure scenarios with inadequately small cache sizes. The approach of using `pytest.raises(RuntimeError)` is appropriate since direct cache size inspection isn't available.

tensorrt_llm/llmapi/llm.py (1)
34-35: LGTM! Import addition is appropriate.

The addition of the `PeftCacheConfig` import is necessary for the new PEFT cache configuration management logic.
Some minor comments, overall looks good
Force-pushed: 53208f5 to 0221b5b (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (2)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Fix docstring format and validate method implementation.

The new `create_from_pybind` factory method enables conversion from pybind objects to Python models, which supports the refactored PEFT cache configuration handling.

Apply this diff to fix the docstring format:

```diff
-    @staticmethod
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    @staticmethod
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create PeftCacheConfig from pybind object."""
```

tests/unittest/llmapi/test_llm.py (1)
437-452: LGTM - Test correctly validates config override behavior with minor formatting issue.

This test effectively validates that `LoraConfig` cache size parameters take precedence over `PeftCacheConfig` parameters, which aligns with the PR objectives. The test design demonstrates this by using small cache sizes in `PeftCacheConfig` that would normally cause failures, but providing adequate cache sizes in `LoraConfig` to ensure success.

Apply this diff to fix the line length issue:

```diff
-    """Tests that cache size args in lora_config LLM arg override the cache size parameters in peft_cache_config LLM arg."""
+    """Tests that cache size args in lora_config LLM arg override the cache
+    size parameters in peft_cache_config LLM arg."""
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (3 hunks)
- tensorrt_llm/llmapi/llm_args.py (3 hunks)
- tensorrt_llm/lora_manager.py (1 hunks)
- tests/unittest/llmapi/test_llm.py (3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧠 Learnings (4)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
tests/unittest/llmapi/test_llm.py
1437-1437: Line too long (124 > 120)
(E501)
tests/unittest/llmapi/test_llm_pytorch.py
329-329: Line too long (124 > 120)
(E501)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/lora_manager.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tests/unittest/llmapi/test_llm_pytorch.py (3)
4-4: LGTM: Import addition supports new test functionality.

The import of `PeftCacheConfig` is correctly added to support the new cache configuration tests.
9-11: LGTM: Import addition enables test harness usage.

The import of `check_llama_7b_multi_lora_from_request_test_harness` is properly added to support the new LoRA cache testing functionality.
294-326: LGTM: Comprehensive test validates PEFT cache size impact.

The test correctly validates that small cache sizes cause runtime failures when loading LoRA adapters. The approach of testing failure conditions is appropriate since direct cache size inspection isn't available.

Key strengths:

- Tests both `host_cache_size` and `device_cache_percent` parameters
- Uses minimal LoRA config without explicit cache size values
- Properly expects `RuntimeError` for insufficient cache sizes
- Includes appropriate CUDA graph disabling for known issues
tensorrt_llm/llmapi/llm_args.py (2)
644-644: LGTM: Field descriptions clarified for better understanding.

The updated descriptions for `num_host_module_layer` and `num_device_module_layer` provide clearer explanations of their impact on cache sizes and overriding behavior.

Also applies to: 648-650
1424-1428: LGTM: Simplified LoRA plugin configuration logic.

The updated logic correctly:

- Always sets `lora_plugin` to 'auto' when LoRA is enabled for non-pytorch backends
- Only assigns `max_lora_rank` when `lora_config` is present
- Removes previous conditional checks that are no longer needed

This aligns with the PR's goal of removing deprecated LoRA fields and simplifying configuration.
tests/unittest/llmapi/test_llm.py (4)
38-39: LGTM - Import addition supports new cache configuration tests.

The import of `PeftCacheConfig` is necessary for the new test functions that verify cache sizing behavior.
53-55: LGTM - Test utility import is correctly added.

The import of `check_llama_7b_multi_lora_from_request_test_harness` is necessary for the new test functions that verify cache behavior.
396-399: LGTM - Test function updated with proper parameter passing.

The modification correctly updates the test to use the standardized test harness with appropriate parameters for LoRA functionality.
402-434: LGTM - Well-designed test for cache size validation.

This test effectively validates that `PeftCacheConfig` parameters affect cache sizing by testing failure scenarios with extremely small cache sizes. The approach is appropriate given that actual cache sizes cannot be directly inspected.

The test covers both `host_cache_size` and `device_cache_percent` parameters, ensuring comprehensive validation of the cache sizing functionality.
Force-pushed: 0221b5b to 7021f17 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/unittest/llmapi/test_llm_pytorch.py (1)
329-329: Fix line length violation.

The docstring exceeds the project's line-length limit of 120 characters.

Apply this diff to split the docstring:

```diff
-    """Tests that cache size args in lora_config LLM arg override the cache size parameters in peft_cache_config LLM arg."""
+    """Tests that cache size args in lora_config LLM arg override the cache size
+    parameters in peft_cache_config LLM arg."""
```
🧹 Nitpick comments (2)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Fix docstring formatting and validate factory method logic.

The static factory method correctly copies all fields from the pybind object to create a Python `PeftCacheConfig` instance. However, there's a docstring formatting issue.

Apply this diff to fix the docstring format:

```diff
-    @staticmethod
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    @staticmethod
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create PeftCacheConfig from pybind object."""
```

The method implementation correctly maps all pybind fields to the Python model fields, supporting the conversion pattern used elsewhere in the codebase.
tests/unittest/llmapi/test_llm.py (1)
437-452: LGTM - Excellent test for configuration override behavior

This test perfectly validates the priority behavior where `lora_config` cache size parameters should override `peft_cache_config` parameters. The test design is clever and effective.

However, please fix the line length issue:

```diff
-        peft_cache_config=PeftCacheConfig(host_cache_size=1,
-                                          device_cache_percent=0.000001))
+        peft_cache_config=PeftCacheConfig(
+            host_cache_size=1, device_cache_percent=0.000001))
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (3 hunks)
- tensorrt_llm/llmapi/llm_args.py (3 hunks)
- tensorrt_llm/lora_manager.py (1 hunks)
- tests/unittest/llmapi/test_llm.py (3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧠 Learnings (4)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
tests/unittest/llmapi/test_llm.py
1437-1437: Line too long (124 > 120)
(E501)
tests/unittest/llmapi/test_llm_pytorch.py
329-329: Line too long (124 > 120)
(E501)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/lora_manager.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/llmapi/llm.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tests/unittest/llmapi/test_llm_pytorch.py (4)
4-4: LGTM: Import addition supports new test functionality.

The import of `PeftCacheConfig` is correctly added to support the new cache configuration tests.
9-11: LGTM: Test utility imports are appropriate.

The import of the LoRA test harness function is correctly added to support the new cache configuration tests.
294-326: LGTM: Test correctly validates PEFT cache size behavior.

The test effectively verifies that PEFT cache configuration parameters (`host_cache_size` and `device_cache_percent`) impact cache behavior by expecting `RuntimeError` when values are too small to contain a single adapter. This is a sound testing approach when direct cache size inspection isn't available.
330-342: LGTM: Test correctly validates cache parameter override behavior.

The test properly verifies that LoRA config cache parameters (`max_loras`, `max_cpu_loras`) override conflicting small cache sizes in `peft_cache_config`, ensuring the precedence hierarchy works as expected.

tensorrt_llm/llmapi/llm_args.py (2)
644-650: LGTM: Enhanced field descriptions improve clarity.

The updated descriptions for `num_host_module_layer` and `num_device_module_layer` effectively clarify their impact on cache sizes and their override behavior on related cache size parameters.
1424-1428: LGTM: Updated LoRA config logic removes deprecated field dependencies.

The logic correctly updates to use `lora_config` directly instead of deprecated fields, with proper null checking to prevent attribute access on `None`. This aligns with the PR objective to centralize LoRA configuration through the `lora_config` parameter.

tests/unittest/llmapi/test_llm.py (3)
38-39: LGTM - Necessary imports for new PEFT cache tests

The import additions are appropriate and necessary to support the new PEFT cache configuration test functions.
Also applies to: 53-55
392-399: LGTM - Updated test function call with explicit arguments

The addition of explicit keyword arguments (`LLM`, `enable_lora=True`, `build_config=build_config`, `fast_build=True`) appears to be updating the function call to match the expected signature of the test harness function.
402-434: LGTM - Well-designed test for PEFT cache size validation

This test effectively validates that PEFT cache configuration parameters affect cache behavior by testing failure scenarios with intentionally small cache sizes. The approach is sound given that actual cache sizes cannot be directly inspected. The test covers both host and device cache parameters and has clear documentation.
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/llmapi/llm_args.py (5)
644-645: Good documentation improvement.

The clarification about how `num_host_module_layer` affects host cache size and overrides `host_cache_size` helps users understand the parameter relationships better.
648-650: Good documentation improvement.

The clarification about how `num_device_module_layer` affects device cache size and overrides `device_cache_percent` provides important context for users configuring PEFT cache parameters.
685-686: Good documentation improvement.

Adding the note about the pytorch backend not supporting LoRA prefetch aligns well with the validation logic added later in the file and helps prevent user confusion.
702-719: Well-implemented factory method.

The `create_from_pybind` method correctly implements the standard pattern for converting pybind objects to Python objects by copying all fields. This supports the improved PEFT cache configuration management mentioned in the PR objectives.

Note: There's a static analysis hint about docstring formatting, but it appears to be a false positive as no docstring is visible in this segment.
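Sketched from the fields named throughout this review, the factory plausibly has this shape; the field list here is abridged to the four fields discussed, so it is not the complete signature:

```python
@staticmethod
def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
    # Copy each pybind field onto the Python model one-to-one.
    return PeftCacheConfig(
        num_host_module_layer=peft_cache_config.num_host_module_layer,
        num_device_module_layer=peft_cache_config.num_device_module_layer,
        host_cache_size=peft_cache_config.host_cache_size,
        device_cache_percent=peft_cache_config.device_cache_percent,
    )
```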
1424-1428: Good refactoring of LoRA configuration logic.

The changes improve the validation logic by:

- Simplifying the `lora_plugin` setting to always enable when LoRA is enabled (for non-pytorch backends)
- Adding a defensive check before accessing `lora_config.max_lora_rank` to prevent a potential AttributeError

These changes align well with the PR's goal of removing deprecated LoRA arguments and making configuration handling more robust.
PR_Github #12560 [ run ] triggered by Bot
PR_Github #12561 [ run ] triggered by Bot
/bot run
PR_Github #12560 [ run ] completed with state
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Approve the factory method implementation with a minor formatting fix.

The `create_from_pybind` method correctly copies all fields from the pybind object to create a Python `PeftCacheConfig` instance, supporting the cache configuration improvements in this PR.

Fix the docstring formatting issue:

```diff
-    @staticmethod
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    @staticmethod
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create a PeftCacheConfig instance from a pybind _PeftCacheConfig object."""
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/llmapi/llm_args.py (3)
644-644: Good documentation improvements.

The extended field descriptions clearly explain the cache sizing impact and override behavior, which helps users understand how these parameters affect memory allocation.
Also applies to: 648-649, 685-685
1424-1427: Good refactoring of LoRA configuration logic.

The simplified condition and added null check for `lora_config` improve robustness while aligning with the goal of removing deprecated LoRA fields.
1558-1564: Proper implementation of PEFT cache validation.

The validator correctly includes the null check for `peft_cache_config` before accessing `lora_prefetch_dir`, addressing the potential AttributeError concern from previous reviews. The warning appropriately informs users about PyTorch backend limitations.
PR_Github #12562 [ run ] triggered by Bot
PR_Github #12561 [ run ] completed with state
PR_Github #12562 [ run ] completed with state
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Fix docstring formatting for the new static method.

The method implementation is correct and provides a clean way to convert from pybind objects to Python objects. However, the docstring should be reformatted to one line.

Apply this fix to reformat the docstring:

```diff
-    @staticmethod
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    @staticmethod
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create a PeftCacheConfig from a pybind _PeftCacheConfig object."""
     return PeftCacheConfig(
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
- tests/unittest/llmapi/test_llm.py (3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/unittest/llmapi/test_llm_pytorch.py
- tests/unittest/llmapi/test_llm.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/llmapi/llm_args.py (3)
644-644: Good documentation improvements.

These description updates clarify the behavior and relationships between PEFT cache configuration fields, making it clearer to users how these parameters interact and what their current support status is.
Also applies to: 648-649, 684-684
1424-1427: Good improvement to LoRA configuration logic.

The changes correctly:

- Simplify the condition for setting `lora_plugin` to 'auto' - now it only depends on `enable_lora` and backend type
- Add proper null checking before accessing `lora_config.max_lora_rank`

This aligns with the PR objective of making LoRA config parameters optional.
: Well-implemented validator with proper null checking.This validator correctly warns users when they attempt to use the unsupported
lora_prefetch_dir
feature. The null checking is properly implemented, addressing the concern from previous reviews.
Force-pushed: d1dfbeb to b3c3e60 (Compare)
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: LGTM: Well-implemented factory method with minor formatting fix needed.

This factory method correctly creates `PeftCacheConfig` instances from pybind objects, supporting the PEFT cache merging functionality described in the PR objectives.

Address the docstring formatting issue flagged by static analysis:

```diff
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create PeftCacheConfig from pybind _PeftCacheConfig object."""
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (3 hunks)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
- tensorrt_llm/lora_manager.py (1 hunks)
- tests/unittest/llmapi/test_llm.py (4 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (6 hunks)
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
🚧 Files skipped from review as they are similar to previous changes (5)
- tensorrt_llm/_torch/pyexecutor/_util.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tensorrt_llm/lora_manager.py
- tests/unittest/llmapi/test_llm.py
- tensorrt_llm/llmapi/llm.py
🔇 Additional comments (4)
tensorrt_llm/llmapi/llm_args.py (4)
643-649: LGTM: Improved field descriptions for better clarity.

The enhanced descriptions clearly explain how these fields affect cache sizes and their overriding behavior, which aligns with the PR's goal of improving LoRA cache memory control.
684-684: LGTM: Clear indication of unsupported feature.

The updated description correctly indicates that LoRA prefetch is currently not supported, which helps set proper user expectations.
1424-1427: LGTM: Proper handling of optional lora_config.

The changes correctly:

- Always set `lora_plugin` to 'auto' when LoRA is enabled (removing dependency on `lora_config` being None)
- Conditionally access `lora_config.max_lora_rank` only when `lora_config` exists

This properly supports the optional nature of `lora_config` parameters as described in the PR objectives.

1558-1562: LGTM: Proper validation with null safety.

The validator correctly:

- Performs a null check on `peft_cache_config` before accessing its attributes
- Provides a clear warning about unsupported LoRA prefetch functionality
- Addresses the previous review comment about a potential `AttributeError`

This implementation aligns with the updated field documentation indicating LoRA prefetch is not supported.
PR_Github #12573 [ run ] triggered by Bot
PR_Github #12573 [ run ] completed with state
Force-pushed: b3c3e60 to 3fe1f26 (Compare)
…rrelevant to pytorch backend Signed-off-by: Amit Zuker <[email protected]>
… PybindMirror, updated its PeftCacheConfig tests accordingly, removed default values from description, raise exception when unused peft_cache_config.lora_prefetch_dir was set instead of writing a warning log message Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
…che sizes, fix incorrect lora request creation Signed-off-by: Amit Zuker <[email protected]>
Force-pushed: 50e940e to bce06ad (Compare)
/bot run --disable-fail-fast
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
604-658: Fix docstring formatting.

The implementation of the generic `from_pybind` method is well-designed and handles optional fields correctly. However, there's a minor docstring formatting issue.

Apply this fix for the docstring formatting:

```diff
-    """Construct an instance of the given class from the fields in the given
-    pybind class instance.
+    """Construct an instance of the given class from the fields in the given pybind class instance.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- examples/llm-api/llm_multilora.py (1 hunks)
- examples/llm-api/quickstart_multimodal.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_phi4mm.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (3 hunks)
- tensorrt_llm/llmapi/llm_args.py (7 hunks)
- tensorrt_llm/lora_manager.py (1 hunks)
- tests/unittest/llmapi/apps/_test_openai_lora.py (1 hunks)
- tests/unittest/llmapi/apps/_test_trtllm_serve_lora.py (1 hunks)
- tests/unittest/llmapi/test_llm.py (4 hunks)
- tests/unittest/llmapi/test_llm_args.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu.py (0 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (6 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/llmapi/test_llm_multi_gpu.py
✅ Files skipped from review due to trivial changes (1)
- examples/llm-api/quickstart_multimodal.py
🚧 Files skipped from review as they are similar to previous changes (8)
- tests/unittest/llmapi/apps/_test_trtllm_serve_lora.py
- tests/unittest/llmapi/apps/_test_openai_lora.py
- examples/llm-api/llm_multilora.py
- tensorrt_llm/lora_manager.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/models/modeling_phi4mm.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tensorrt_llm/llmapi/llm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/unittest/llmapi/test_llm.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/llmapi/test_llm_args.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/unittest/llmapi/test_llm.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/llmapi/test_llm_args.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm.py (4)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.350Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : In function calls where parameters are not obvious from inspection, use an inline C comment to document the parameter for readers.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.350Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
tensorrt_llm/llmapi/llm_args.py (3)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
657-658: One-line docstring should fit on one line — reformat to one line (D200)
tests/unittest/llmapi/test_llm_args.py
269-269: `PeftCacheConfig` may be undefined, or defined from star imports (F405)
299-299: `PeftCacheConfig` may be undefined, or defined from star imports (F405)
309-309: `PeftCacheConfig` may be undefined, or defined from star imports (F405)
311-311: `PeftCacheConfig` may be undefined, or defined from star imports (F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (13)
tests/unittest/llmapi/test_llm_args.py (3)
tests/unittest/llmapi/test_llm_args.py (3)
255-282: LGTM! Comprehensive test for pybind conversion. The test thoroughly validates the `from_pybind` class method by setting all fields explicitly and verifying they transfer correctly to the Python `PeftCacheConfig` instance.
284-314: Excellent test for default value handling. This test validates the critical behavior where Python-side defaults are applied when pybind fields are `None`. The use of `PeftCacheConfig.model_fields` to access expected defaults is the correct approach and ensures the test remains maintainable if default values change.
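For readers unfamiliar with that pattern, a minimal sketch of such an assertion (pydantic v2 API; the field name and import path are assumptions for illustration):

```python
from tensorrt_llm.llmapi import PeftCacheConfig

# Read the Python-side default straight off the model definition, so the
# test never hard-codes a value that may change later.
expected = PeftCacheConfig.model_fields["host_cache_size"].default

config = PeftCacheConfig()  # no value provided for this field
assert config.host_cache_size == expected
```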
14-14: Static analysis false positives on star imports. The Ruff warnings about `PeftCacheConfig` being undefined are false positives. The class is correctly imported via the star import and used consistently throughout the test file, including in existing tests like `test_PeftCacheConfig_declaration`.
tests/unittest/llmapi/test_llm.py (6)
38-39: LGTM! The import of `PeftCacheConfig` is necessary for the new PEFT cache configuration tests and follows the existing import pattern.
54-56: LGTM! The import of the test harness function is necessary for the new PEFT cache tests and follows the existing import pattern.
1430-1431: LGTM! The addition of explicit LoRA cache size parameters to the `BuildConfig` aligns with the PR objective to improve LoRA cache memory control. The parameter values are appropriate for testing.
1484-1487: LGTM! The additional arguments to the test harness function are appropriate and consistent with the LoRA testing requirements.
1490-1523: Well-designed test for PEFT cache size validation. The test approach of using intentionally small cache sizes to trigger failures is clever since the actual cache sizes aren't directly accessible. The test covers both host and device cache configurations effectively.
The extremely small values (1 byte for host cache and 0.0000001 percent for device cache) should be sufficient to trigger failures on any realistic system configuration.
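A rough sketch of that failure-based approach under this PR's semantics (paths are placeholders and the harness details differ; the real tests are listed under "Test Coverage" below):

```python
import pytest

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import PeftCacheConfig
from tensorrt_llm.lora_manager import LoraConfig  # assumed import path

def test_tiny_peft_caches_fail():
    # Cache sizes mirror the review comment: far too small for any adapter.
    # max_loras / max_cpu_loras are deliberately left unset so that
    # peft_cache_config alone determines the cache sizes.
    with pytest.raises(Exception):
        llm = LLM(
            model="/path/to/llama-7b",                              # placeholder
            lora_config=LoraConfig(lora_dir=["/path/to/adapter"]),  # placeholder
            peft_cache_config=PeftCacheConfig(
                host_cache_size=1,               # 1 byte of host cache
                device_cache_percent=0.0000001,  # near-zero device share
            ),
        )
        # the real tests issue a generation request with a LoRA adapter here
```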
1526-1544: Excellent test for configuration override behavior. The test effectively validates that `lora_config` parameters properly override `peft_cache_config` parameters. Using intentionally problematic values in `peft_cache_config` that are overridden by proper values in `lora_config` is a solid approach to verify the override mechanism.
tensorrt_llm/llmapi/llm_args.py (4)
6-6: LGTM! The new imports support the generic `from_pybind` method implementation and follow Python typing best practices.
Also applies to: 12-12, 65-65
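As a rough sketch of what such a generic pybind-to-pydantic conversion can look like (a simplified, self-contained stand-in, not the code under review):

```python
from pydantic import BaseModel

class PeftCacheConfigPy(BaseModel):
    """Illustrative pydantic mirror of a pybind config class."""

    device_cache_percent: float = 0.02  # 2% default, per this PR
    host_cache_size: int = 1 << 30      # 1 GiB default, per this PR

    @classmethod
    def from_pybind(cls, pybind_obj) -> "PeftCacheConfigPy":
        # Copy only the fields the pybind object actually set; fields left
        # as None fall back to the Python-side defaults declared above.
        kwargs = {
            name: getattr(pybind_obj, name)
            for name in cls.model_fields  # pydantic v2 field registry
            if getattr(pybind_obj, name, None) is not None
        }
        return cls(**kwargs)
```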
757-757: LGTM! The changes properly establish default values for cache configuration and clarify the override behavior between different cache size parameters. The explicit defaults (2% for device cache, 1 GiB for host cache) align with the C++ implementation and improve configuration transparency.
Also applies to: 761-762, 789-800
1539-1542: LGTM! The simplification correctly removes dependency on deprecated LoRA fields and adds proper null checking for `lora_config` before accessing its properties. This aligns with making LoRA configuration parameters optional.
1672-1678: LGTM! The validator correctly enforces that the unsupported `lora_prefetch_dir` feature cannot be used. The null check prevents AttributeError when `peft_cache_config` is None, addressing the concern from previous reviews.
PR_Github #13503 [ run ] triggered by Bot |
PR_Github #13484 [ run ] completed with state |
PR_Github #13503 [ run ] completed with state |
Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
Description

- Changed `LoraConfig.max_loras` and `LoraConfig.max_cpu_loras` to be optional. When they're not set, the cache size would be determined by the `PeftCacheConfig`.
- Defined the precedence between the `peft_cache_config` and `lora_config` LLM args - when cache size fields in `lora_config` have a value, they would take precedence over the relevant fields in `peft_cache_config`.
- Added the following fields to the `lora_config: LoraConfig` LLM arg: `max_lora_rank`, `max_loras`, `max_cpu_loras`.
- Set defaults in the `PeftCacheConfig` class for `device_cache_percent` to 2% and `host_cache_size` to 1 GiB, the same default values that the CPP code uses when these fields have no value.
- `lora_config` in LLM args would take precedence over `lora_config` from the engine build config.
- Raise an error when `peft_cache_config.lora_prefetch_dir` has a value, as currently it's not supported.
Summary by CodeRabbit
New Features
Tests
Examples
Documentation & Validation
Test Coverage
tests/unittest/llmapi/test_llm.py::test_llama_7b_peft_cache_config_affects_peft_cache_size
tests/unittest/llmapi/test_llm.py::test_llama_7b_lora_config_overrides_peft_cache_config
tests/unittest/llmapi/test_llm_pytorch.py::test_llama_7b_peft_cache_config_affects_peft_cache_size
tests/unittest/llmapi/test_llm_pytorch.py::test_llama_7b_lora_config_overrides_peft_cache_config
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
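For example, a typical invocation combining documented options (values are illustrative):
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"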
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.