[TRTLLM-5830][feat] Improve LoRA cache memory control #6220
Conversation
📝 Walkthrough

This update refactors LoRA and PEFT cache configuration management in both core and test code. Deprecated LoRA fields are removed, LoRA config handling is unified, and PEFT cache merging is improved. Type annotations are updated for flexibility, and new tests are added to verify cache sizing and config override behaviors.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant LLM
    participant EngineConfig
    participant LoraConfig
    participant PeftCacheConfig
    participant Executor
    User->>LLM: Build model (with/without lora_config)
    LLM->>EngineConfig: Load engine config
    alt lora_plugin enabled
        EngineConfig->>LoraConfig: Load lora_config
        alt User provides lora_config
            LLM->>LoraConfig: Override with user lora_config
        end
        LLM->>PeftCacheConfig: Merge existing config, update fields if lora_config present
    end
    LLM->>Executor: Create with merged lora_config and peft_cache_config
```
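To make the merge flow concrete, here is a hedged usage sketch of the path the diagram describes. The model and adapter paths are placeholders, and the exact import locations are assumptions based on the files touched in this PR:

```python
from tensorrt_llm.llmapi import LLM
from tensorrt_llm.llmapi.llm_args import PeftCacheConfig
from tensorrt_llm.lora_manager import LoraConfig

# A user-provided lora_config overrides the one loaded from the engine
# config; when its cache sizes are set, they take precedence over the
# corresponding peft_cache_config fields during the merge.
llm = LLM(
    model="/path/to/model",             # placeholder
    lora_config=LoraConfig(
        lora_dir=["/path/to/adapter"],  # placeholder
        max_loras=2,                    # device cache capacity, in adapters
        max_cpu_loras=2,                # host cache capacity, in adapters
    ),
    peft_cache_config=PeftCacheConfig(host_cache_size=2**30),
)
```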
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~45 minutes
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/unittest/llmapi/test_llm_pytorch.py (1)
331-344: Fix the line length violation.

The test logic is excellent - it validates that `lora_config` cache size parameters override conflicting `peft_cache_config` values by successfully running with small cache sizes in `peft_cache_config` but adequate sizes in `lora_config`. However, there's a line length issue that needs to be addressed.

Apply this diff to fix the line length violation:

```diff
-    """Tests that cache size args in lora_config LLM arg override the cache size parameters in peft_cache_config LLM arg."""
+    """Tests that cache size args in lora_config LLM arg override the cache size
+    parameters in peft_cache_config LLM arg."""
```

tensorrt_llm/llmapi/llm_args.py (1)
702-719: Good factory method implementation with minor formatting suggestion.

The `create_from_pybind` method correctly implements the factory pattern for converting pybind objects to Python objects, supporting the flexible PEFT cache configuration mentioned in the PR objectives.

Minor nitpick: Consider formatting the docstring as a single line per the static analysis hint.

```diff
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create PeftCacheConfig from pybind object."""
```

tests/unittest/llmapi/test_llm.py (1)
1438-1453: Fix line length and approve override testing logic

The test effectively validates that LoRA config cache size parameters override PEFT cache config by creating a scenario where PEFT config would fail but LoRA config succeeds.

However, line 1438 exceeds the 120-character limit:

```diff
-def test_llama_7b_lora_config_overrides_peft_cache_config():
-    """Tests that cache size args in lora_config LLM arg override the cache size parameters in peft_cache_config LLM arg."""
+def test_llama_7b_lora_config_overrides_peft_cache_config():
+    """Tests that cache size args in lora_config LLM arg override the cache size
+    parameters in peft_cache_config LLM arg."""
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (3 hunks)
- tensorrt_llm/llmapi/llm_args.py (3 hunks)
- tensorrt_llm/lora_manager.py (1 hunks)
- tests/unittest/llmapi/test_llm.py (3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/lora_manager.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/_util.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm_pytorch.py
332-332: Line too long (124 > 120)
(E501)
tests/unittest/llmapi/test_llm.py
1438-1438: Line too long (124 > 120)
(E501)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
🔇 Additional comments (12)
tensorrt_llm/lora_manager.py (1)

147-148: LGTM! Clean type annotation update for flexible cache configuration.

The change from `int = 4` to `int | None = None` for both `max_loras` and `max_cpu_loras` aligns perfectly with the PR objective of making LoRA cache sizes optional. This allows the system to determine appropriate defaults via `PeftCacheConfig` when these values are unset.
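For reference, a minimal sketch of what the annotation change looks like on the config class; the surrounding fields are omitted and `Optional` is used here for readability:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoraConfig:
    # Previously: max_loras: int = 4 (a hard-coded cache capacity).
    # None now means "unset", so PeftCacheConfig supplies the default.
    max_loras: Optional[int] = None
    max_cpu_loras: Optional[int] = None
```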
tests/unittest/llmapi/test_llm_pytorch.py (3)

4-4: LGTM! Proper import addition for test functionality.

The import of `PeftCacheConfig` is necessary for the new cache configuration tests.
9-11: LGTM! Required import for multi-LoRA test harness.

The import of `check_llama_7b_multi_lora_from_request_test_harness` is correctly added to support the new cache configuration tests.
297-329: LGTM! Well-designed test for PEFT cache configuration validation.

This test effectively verifies that `PeftCacheConfig` parameters directly impact cache sizing by intentionally setting values too small to hold a single adapter and expecting `RuntimeError` exceptions. The approach of testing failure cases is a solid strategy when direct cache size inspection isn't possible.

The test covers both `host_cache_size` and `device_cache_percent` parameters, providing comprehensive validation.
tensorrt_llm/_torch/pyexecutor/_util.py (2)

14-14: LGTM! Required import for PEFT cache configuration merging.

The import of `PeftCacheConfig` is necessary for the new cache configuration merging logic.
471-481: Excellent refactoring for flexible PEFT cache configuration.

The updated logic elegantly handles the merging of existing PEFT cache configuration with LoRA-specific parameters:

- Preserves existing config: Creates `PeftCacheConfig` from the existing pybind config when available, or uses defaults
- Conditional overrides: Only updates `num_device_module_layer` and `num_host_module_layer` when `lora_config.max_loras` and `lora_config.max_cpu_loras` are explicitly set (not `None`)
- Maintains backwards compatibility: Falls back to defaults when no existing config is present

This approach is much more flexible than the previous direct construction and aligns perfectly with the PR objective of making LoRA cache sizes optional while allowing overrides.
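A minimal sketch of that merge, under the assumption that the module-layer counts scale linearly with the number of adapters; `modules_per_model` is a hypothetical scaling factor, not a variable from the codebase:

```python
def merge_peft_cache_config(existing_pybind_config, lora_config, modules_per_model):
    # Preserve the existing config when present, otherwise start from defaults.
    if existing_pybind_config is not None:
        merged = PeftCacheConfig.create_from_pybind(existing_pybind_config)
    else:
        merged = PeftCacheConfig()

    # Override the module-layer counts only when the LoRA cache sizes were
    # explicitly set; None means "leave the existing value alone".
    if lora_config.max_loras is not None:
        merged.num_device_module_layer = lora_config.max_loras * modules_per_model
    if lora_config.max_cpu_loras is not None:
        merged.num_host_module_layer = lora_config.max_cpu_loras * modules_per_model
    return merged
```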
tensorrt_llm/llmapi/llm_args.py (2)
644-644: Good documentation improvement!

The clarifying suffixes help users understand that these parameters affect cache sizes and take precedence over other cache size settings, which aligns well with the PR's goal of improving LoRA cache memory control.
Also applies to: 649-649
1424-1428: Correctly implements optional LoraConfig handling.

The changes properly make `lora_config` optional by:

- Always enabling `lora_plugin` when `enable_lora` is true (for non-PyTorch backends), regardless of `lora_config` presence
- Adding a safety check before accessing `lora_config.max_lora_rank`

This aligns with the PR objective of making `LoraConfig.max_loras` and `LoraConfig.max_cpu_loras` optional parameters.
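As a rough illustration only (the attribute paths below are assumptions, not the actual code), the simplified logic reads:

```python
# Sketch of the updated validation flow; names are assumed.
if self.enable_lora and self.backend != 'pytorch':
    # lora_plugin no longer depends on lora_config being present.
    self.build_config.plugin_config.lora_plugin = 'auto'
    # Safety check before touching max_lora_rank.
    if self.lora_config is not None:
        max_lora_rank = self.lora_config.max_lora_rank
```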
tests/unittest/llmapi/test_llm.py (3)

38-39: LGTM: Import changes support new PEFT cache testing functionality

The new imports are correctly added to support the PEFT cache configuration testing and test harness usage.

Also applies to: 53-55
1393-1400: LGTM: Test harness integration improves consistency

The modification to use the test harness with additional parameters is a good refactoring that standardizes the LoRA testing approach while maintaining the original test logic.
1403-1436: LGTM: Effective negative testing approach for cache configuration validation

The test correctly validates that PEFT cache configuration parameters affect actual cache sizes by testing failure scenarios with inadequately small cache sizes. The approach of using `pytest.raises(RuntimeError)` is appropriate since direct cache size inspection isn't available.

tensorrt_llm/llmapi/llm.py (1)
34-35: LGTM! Import addition is appropriate.

The addition of the `PeftCacheConfig` import is necessary for the new PEFT cache configuration management logic.
Some minor comments, overall looks good
Force-pushed: 53208f5 to 0221b5b (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (2)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Fix docstring format and validate method implementation.

The new `create_from_pybind` factory method enables conversion from pybind objects to Python models, which supports the refactored PEFT cache configuration handling.

Apply this diff to fix the docstring format:

```diff
-    @staticmethod
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    @staticmethod
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create PeftCacheConfig from pybind object."""
```

tests/unittest/llmapi/test_llm.py (1)
437-452: LGTM - Test correctly validates config override behavior with minor formatting issue.

This test effectively validates that `LoraConfig` cache size parameters take precedence over `PeftCacheConfig` parameters, which aligns with the PR objectives. The test design demonstrates this by using small cache sizes in `PeftCacheConfig` that would normally cause failures, but providing adequate cache sizes in `LoraConfig` to ensure success.

Apply this diff to fix the line length issue:

```diff
-    """Tests that cache size args in lora_config LLM arg override the cache size parameters in peft_cache_config LLM arg."""
+    """Tests that cache size args in lora_config LLM arg override the cache
+    size parameters in peft_cache_config LLM arg."""
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (3 hunks)
- tensorrt_llm/llmapi/llm_args.py (3 hunks)
- tensorrt_llm/lora_manager.py (1 hunks)
- tests/unittest/llmapi/test_llm.py (3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧠 Learnings (4)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
tests/unittest/llmapi/test_llm.py
1437-1437: Line too long (124 > 120)
(E501)
tests/unittest/llmapi/test_llm_pytorch.py
329-329: Line too long (124 > 120)
(E501)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/lora_manager.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tests/unittest/llmapi/test_llm_pytorch.py (3)
4-4: LGTM: Import addition supports new test functionality.

The import of `PeftCacheConfig` is correctly added to support the new cache configuration tests.
9-11: LGTM: Import addition enables test harness usage.

The import of `check_llama_7b_multi_lora_from_request_test_harness` is properly added to support the new LoRA cache testing functionality.
294-326: LGTM: Comprehensive test validates PEFT cache size impact.

The test correctly validates that small cache sizes cause runtime failures when loading LoRA adapters. The approach of testing failure conditions is appropriate since direct cache size inspection isn't available.

Key strengths:

- Tests both `host_cache_size` and `device_cache_percent` parameters
- Uses minimal LoRA config without explicit cache size values
- Properly expects `RuntimeError` for insufficient cache sizes
- Includes appropriate CUDA graph disabling for known issues
tensorrt_llm/llmapi/llm_args.py (2)
644-644: LGTM: Field descriptions clarified for better understanding.

The updated descriptions for `num_host_module_layer` and `num_device_module_layer` provide clearer explanations of their impact on cache sizes and overriding behavior.

Also applies to: 648-650
1424-1428: LGTM: Simplified LoRA plugin configuration logic.

The updated logic correctly:

- Always sets `lora_plugin` to 'auto' when LoRA is enabled for non-pytorch backends
- Only assigns `max_lora_rank` when `lora_config` is present
- Removes previous conditional checks that are no longer needed

This aligns with the PR's goal of removing deprecated LoRA fields and simplifying configuration.
tests/unittest/llmapi/test_llm.py (4)
38-39: LGTM - Import addition supports new cache configuration tests.

The import of `PeftCacheConfig` is necessary for the new test functions that verify cache sizing behavior.
53-55: LGTM - Test utility import is correctly added.

The import of `check_llama_7b_multi_lora_from_request_test_harness` is necessary for the new test functions that verify cache behavior.
396-399: LGTM - Test function updated with proper parameter passing.

The modification correctly updates the test to use the standardized test harness with appropriate parameters for LoRA functionality.
402-434: LGTM - Well-designed test for cache size validation.

This test effectively validates that `PeftCacheConfig` parameters affect cache sizing by testing failure scenarios with extremely small cache sizes. The approach is appropriate given that actual cache sizes cannot be directly inspected.

The test covers both `host_cache_size` and `device_cache_percent` parameters, ensuring comprehensive validation of the cache sizing functionality.
Force-pushed: 0221b5b to 7021f17 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/unittest/llmapi/test_llm_pytorch.py (1)
329-329: Fix line length violation.

The docstring exceeds the project's line-length limit of 120 characters.

Apply this diff to split the docstring:

```diff
-    """Tests that cache size args in lora_config LLM arg override the cache size parameters in peft_cache_config LLM arg."""
+    """Tests that cache size args in lora_config LLM arg override the cache size
+    parameters in peft_cache_config LLM arg."""
```
🧹 Nitpick comments (2)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Fix docstring formatting and validate factory method logic.

The static factory method correctly copies all fields from the pybind object to create a Python `PeftCacheConfig` instance. However, there's a docstring formatting issue.

Apply this diff to fix the docstring format:

```diff
-    @staticmethod
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    @staticmethod
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create PeftCacheConfig from pybind object."""
```

The method implementation correctly maps all pybind fields to the Python model fields, supporting the conversion pattern used elsewhere in the codebase.
tests/unittest/llmapi/test_llm.py (1)
437-452: LGTM - Excellent test for configuration override behavior

This test perfectly validates the priority behavior where `lora_config` cache size parameters should override `peft_cache_config` parameters. The test design is clever and effective.

However, please fix the line length issue:

```diff
-        peft_cache_config=PeftCacheConfig(host_cache_size=1,
-                                          device_cache_percent=0.000001))
+        peft_cache_config=PeftCacheConfig(
+            host_cache_size=1, device_cache_percent=0.000001))
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (3 hunks)
- tensorrt_llm/llmapi/llm_args.py (3 hunks)
- tensorrt_llm/lora_manager.py (1 hunks)
- tests/unittest/llmapi/test_llm.py (3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧠 Learnings (4)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
tests/unittest/llmapi/test_llm.py
1437-1437: Line too long (124 > 120)
(E501)
tests/unittest/llmapi/test_llm_pytorch.py
329-329: Line too long (124 > 120)
(E501)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/lora_manager.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/llmapi/llm.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tests/unittest/llmapi/test_llm_pytorch.py (4)
4-4: LGTM: Import addition supports new test functionality.

The import of `PeftCacheConfig` is correctly added to support the new cache configuration tests.
9-11: LGTM: Test utility imports are appropriate.

The import of the LoRA test harness function is correctly added to support the new cache configuration tests.
294-326: LGTM: Test correctly validates PEFT cache size behavior.

The test effectively verifies that PEFT cache configuration parameters (`host_cache_size` and `device_cache_percent`) impact cache behavior by expecting `RuntimeError` when values are too small to contain a single adapter. This is a sound testing approach when direct cache size inspection isn't available.
330-342: LGTM: Test correctly validates cache parameter override behavior.

The test properly verifies that LoRA config cache parameters (`max_loras`, `max_cpu_loras`) override conflicting small cache sizes in `peft_cache_config`, ensuring the precedence hierarchy works as expected.

tensorrt_llm/llmapi/llm_args.py (2)
644-650: LGTM: Enhanced field descriptions improve clarity.

The updated descriptions for `num_host_module_layer` and `num_device_module_layer` effectively clarify their impact on cache sizes and their override behavior on related cache size parameters.
1424-1428: LGTM: Updated LoRA config logic removes deprecated field dependencies.

The logic correctly updates to use `lora_config` directly instead of deprecated fields, with proper null checking to prevent attribute access on `None`. This aligns with the PR objective to centralize LoRA configuration through the `lora_config` parameter.

tests/unittest/llmapi/test_llm.py (3)
38-39: LGTM - Necessary imports for new PEFT cache tests

The import additions are appropriate and necessary to support the new PEFT cache configuration test functions.
Also applies to: 53-55
392-399: LGTM - Updated test function call with explicit arguments

The addition of explicit keyword arguments (`LLM`, `enable_lora=True`, `build_config=build_config`, `fast_build=True`) appears to be updating the function call to match the expected signature of the test harness function.
402-434: LGTM - Well-designed test for PEFT cache size validation

This test effectively validates that PEFT cache configuration parameters affect cache behavior by testing failure scenarios with intentionally small cache sizes. The approach is sound given that actual cache sizes cannot be directly inspected. The test covers both host and device cache parameters and has clear documentation.
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/llmapi/llm_args.py (5)
644-645: Good documentation improvement.

The clarification about how `num_host_module_layer` affects host cache size and overrides `host_cache_size` helps users understand the parameter relationships better.
648-650: Good documentation improvement.

The clarification about how `num_device_module_layer` affects device cache size and overrides `device_cache_percent` provides important context for users configuring PEFT cache parameters.
685-686: Good documentation improvement.

Adding the note about the pytorch backend not supporting LoRA prefetch aligns well with the validation logic added later in the file and helps prevent user confusion.
702-719: Well-implemented factory method.

The `create_from_pybind` method correctly implements the standard pattern for converting pybind objects to Python objects by copying all fields. This supports the improved PEFT cache configuration management mentioned in the PR objectives.

Note: There's a static analysis hint about docstring formatting, but it appears to be a false positive as no docstring is visible in this segment.
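Sketched from the fields named throughout this review, the factory plausibly has this shape; the field list here is abridged to the four fields discussed, so it is not the complete signature:

```python
@staticmethod
def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
    # Copy each pybind field onto the Python model one-to-one.
    return PeftCacheConfig(
        num_host_module_layer=peft_cache_config.num_host_module_layer,
        num_device_module_layer=peft_cache_config.num_device_module_layer,
        host_cache_size=peft_cache_config.host_cache_size,
        device_cache_percent=peft_cache_config.device_cache_percent,
    )
```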
1424-1428: Good refactoring of LoRA configuration logic.

The changes improve the validation logic by:

- Simplifying the `lora_plugin` setting to always enable when LoRA is enabled (for non-pytorch backends)
- Adding a defensive check before accessing `lora_config.max_lora_rank` to prevent a potential AttributeError

These changes align well with the PR's goal of removing deprecated LoRA arguments and making configuration handling more robust.
PR_Github #12560 [ run ] triggered by Bot
PR_Github #12561 [ run ] triggered by Bot
/bot run
PR_Github #12560 [ run ] completed with state
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Approve the factory method implementation with a minor formatting fix.

The `create_from_pybind` method correctly copies all fields from the pybind object to create a Python `PeftCacheConfig` instance, supporting the cache configuration improvements in this PR.

Fix the docstring formatting issue:

```diff
-    @staticmethod
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    @staticmethod
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create a PeftCacheConfig instance from a pybind _PeftCacheConfig object."""
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/llmapi/llm_args.py (3)
644-644: Good documentation improvements.

The extended field descriptions clearly explain the cache sizing impact and override behavior, which helps users understand how these parameters affect memory allocation.
Also applies to: 648-649, 685-685
1424-1427: Good refactoring of LoRA configuration logic.

The simplified condition and added null check for `lora_config` improve robustness while aligning with the goal of removing deprecated LoRA fields.
1558-1564: Proper implementation of PEFT cache validation.

The validator correctly includes the null check for `peft_cache_config` before accessing `lora_prefetch_dir`, addressing the potential AttributeError concern from previous reviews. The warning appropriately informs users about PyTorch backend limitations.
PR_Github #12562 [ run ] triggered by Bot
PR_Github #12561 [ run ] completed with state
PR_Github #12562 [ run ] completed with state
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Fix docstring formatting for the new static method.

The method implementation is correct and provides a clean way to convert from pybind objects to Python objects. However, the docstring should be reformatted to one line.

Apply this fix to reformat the docstring:

```diff
-    @staticmethod
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    @staticmethod
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create a PeftCacheConfig from a pybind _PeftCacheConfig object."""
     return PeftCacheConfig(
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
- tests/unittest/llmapi/test_llm.py (3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/unittest/llmapi/test_llm_pytorch.py
- tests/unittest/llmapi/test_llm.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/llmapi/llm_args.py (3)
644-644: Good documentation improvements.

These description updates clarify the behavior and relationships between PEFT cache configuration fields, making it clearer to users how these parameters interact and what their current support status is.
Also applies to: 648-649, 684-684
1424-1427: Good improvement to LoRA configuration logic.

The changes correctly:

- Simplify the condition for setting `lora_plugin` to 'auto' - now it only depends on `enable_lora` and backend type
- Add proper null checking before accessing `lora_config.max_lora_rank`

This aligns with the PR objective of making LoRA config parameters optional.
: Well-implemented validator with proper null checking.This validator correctly warns users when they attempt to use the unsupported
lora_prefetch_dir
feature. The null checking is properly implemented, addressing the concern from previous reviews.
Force-pushed: d1dfbeb to b3c3e60 (Compare)
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: LGTM: Well-implemented factory method with minor formatting fix needed.

This factory method correctly creates `PeftCacheConfig` instances from pybind objects, supporting the PEFT cache merging functionality described in the PR objectives.

Address the docstring formatting issue flagged by static analysis:

```diff
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create PeftCacheConfig from pybind _PeftCacheConfig object."""
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (3 hunks)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
- tensorrt_llm/lora_manager.py (1 hunks)
- tests/unittest/llmapi/test_llm.py (4 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (6 hunks)
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
🚧 Files skipped from review as they are similar to previous changes (5)
- tensorrt_llm/_torch/pyexecutor/_util.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tensorrt_llm/lora_manager.py
- tests/unittest/llmapi/test_llm.py
- tensorrt_llm/llmapi/llm.py
🔇 Additional comments (4)
tensorrt_llm/llmapi/llm_args.py (4)
643-649: LGTM: Improved field descriptions for better clarity.

The enhanced descriptions clearly explain how these fields affect cache sizes and their overriding behavior, which aligns with the PR's goal of improving LoRA cache memory control.
684-684: LGTM: Clear indication of unsupported feature.

The updated description correctly indicates that LoRA prefetch is currently not supported, which helps set proper user expectations.
1424-1427: LGTM: Proper handling of optional lora_config.

The changes correctly:

- Always set `lora_plugin` to 'auto' when LoRA is enabled (removing dependency on `lora_config` being None)
- Conditionally access `lora_config.max_lora_rank` only when `lora_config` exists

This properly supports the optional nature of `lora_config` parameters as described in the PR objectives.

1558-1562: LGTM: Proper validation with null safety.

The validator correctly:

- Performs a null check on `peft_cache_config` before accessing its attributes
- Provides a clear warning about unsupported LoRA prefetch functionality
- Addresses the previous review comment about a potential `AttributeError`

This implementation aligns with the updated field documentation indicating LoRA prefetch is not supported.
PR_Github #12573 [ run ] triggered by Bot
PR_Github #12573 [ run ] completed with state
Force-pushed: b3c3e60 to 3fe1f26 (Compare)
…rrelevant to pytorch backend Signed-off-by: Amit Zuker <[email protected]>
… PybindMirror, updated its PeftCacheConfig tests accordingly, removed default values from description, raise exception when unused peft_cache_config.lora_prefetch_dir was set instead of writing a warning log message Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
…che sizes, fix incorrect lora request creation Signed-off-by: Amit Zuker <[email protected]>
Force-pushed: 50e940e to bce06ad (Compare)
/bot run --disable-fail-fast
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
604-658: Fix docstring formatting.

The implementation of the generic `from_pybind` method is well-designed and handles optional fields correctly. However, there's a minor docstring formatting issue.

Apply this fix for the docstring formatting:

```diff
-    """Construct an instance of the given class from the fields in the given
-    pybind class instance.
+    """Construct an instance of the given class from the fields in the given pybind class instance.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- examples/llm-api/llm_multilora.py (1 hunks)
- examples/llm-api/quickstart_multimodal.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_phi4mm.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (3 hunks)
- tensorrt_llm/llmapi/llm_args.py (7 hunks)
- tensorrt_llm/lora_manager.py (1 hunks)
- tests/unittest/llmapi/apps/_test_openai_lora.py (1 hunks)
- tests/unittest/llmapi/apps/_test_trtllm_serve_lora.py (1 hunks)
- tests/unittest/llmapi/test_llm.py (4 hunks)
- tests/unittest/llmapi/test_llm_args.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu.py (0 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (6 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/llmapi/test_llm_multi_gpu.py
✅ Files skipped from review due to trivial changes (1)
- examples/llm-api/quickstart_multimodal.py
🚧 Files skipped from review as they are similar to previous changes (8)
- tests/unittest/llmapi/apps/_test_trtllm_serve_lora.py
- tests/unittest/llmapi/apps/_test_openai_lora.py
- examples/llm-api/llm_multilora.py
- tensorrt_llm/lora_manager.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/models/modeling_phi4mm.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tensorrt_llm/llmapi/llm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/unittest/llmapi/test_llm.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/llmapi/test_llm_args.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/unittest/llmapi/test_llm.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/llmapi/test_llm_args.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm.py (4)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.350Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : In function calls where parameters are not obvious from inspection, use an inline C comment to document the parameter for readers.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.350Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
tensorrt_llm/llmapi/llm_args.py (3)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
657-658: One-line docstring should fit on one line — reformat to one line (D200)
tests/unittest/llmapi/test_llm_args.py
269-269: `PeftCacheConfig` may be undefined, or defined from star imports (F405)
299-299: `PeftCacheConfig` may be undefined, or defined from star imports (F405)
309-309: `PeftCacheConfig` may be undefined, or defined from star imports (F405)
311-311: `PeftCacheConfig` may be undefined, or defined from star imports (F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (13)
tests/unittest/llmapi/test_llm_args.py (3)
tests/unittest/llmapi/test_llm_args.py (3)
255-282: LGTM! Comprehensive test for pybind conversion. The test thoroughly validates the `from_pybind` class method by setting all fields explicitly and verifying they transfer correctly to the Python `PeftCacheConfig` instance.
284-314: Excellent test for default value handling. This test validates the critical behavior where Python-side defaults are applied when pybind fields are `None`. The use of `PeftCacheConfig.model_fields` to access expected defaults is the correct approach and ensures the test remains maintainable if default values change.
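For readers unfamiliar with that pattern, a minimal sketch of such an assertion (pydantic v2 API; the field name and import path are assumptions for illustration):

```python
from tensorrt_llm.llmapi import PeftCacheConfig

# Read the Python-side default straight off the model definition, so the
# test never hard-codes a value that may change later.
expected = PeftCacheConfig.model_fields["host_cache_size"].default

config = PeftCacheConfig()  # no value provided for this field
assert config.host_cache_size == expected
```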
14-14: Static analysis false positives on star imports. The Ruff warnings about `PeftCacheConfig` being undefined are false positives. The class is correctly imported via the star import and used consistently throughout the test file, including in existing tests like `test_PeftCacheConfig_declaration`.
tests/unittest/llmapi/test_llm.py (6)
38-39: LGTM! The import of `PeftCacheConfig` is necessary for the new PEFT cache configuration tests and follows the existing import pattern.
54-56: LGTM! The import of the test harness function is necessary for the new PEFT cache tests and follows the existing import pattern.
1430-1431: LGTM! The addition of explicit LoRA cache size parameters to the `BuildConfig` aligns with the PR objective to improve LoRA cache memory control. The parameter values are appropriate for testing.
1484-1487: LGTM! The additional arguments to the test harness function are appropriate and consistent with the LoRA testing requirements.
1490-1523: Well-designed test for PEFT cache size validation. The test approach of using intentionally small cache sizes to trigger failures is clever since the actual cache sizes aren't directly accessible. The test covers both host and device cache configurations effectively.
The extremely small values (1 byte for host cache and 0.0000001 percent for device cache) should be sufficient to trigger failures on any realistic system configuration.
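A rough sketch of that failure-based approach under this PR's semantics (paths are placeholders and the harness details differ; the real tests are listed under "Test Coverage" below):

```python
import pytest

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import PeftCacheConfig
from tensorrt_llm.lora_manager import LoraConfig  # assumed import path

def test_tiny_peft_caches_fail():
    # Cache sizes mirror the review comment: far too small for any adapter.
    # max_loras / max_cpu_loras are deliberately left unset so that
    # peft_cache_config alone determines the cache sizes.
    with pytest.raises(Exception):
        llm = LLM(
            model="/path/to/llama-7b",                              # placeholder
            lora_config=LoraConfig(lora_dir=["/path/to/adapter"]),  # placeholder
            peft_cache_config=PeftCacheConfig(
                host_cache_size=1,               # 1 byte of host cache
                device_cache_percent=0.0000001,  # near-zero device share
            ),
        )
        # the real tests issue a generation request with a LoRA adapter here
```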
1526-1544: Excellent test for configuration override behavior. The test effectively validates that `lora_config` parameters properly override `peft_cache_config` parameters. Using intentionally problematic values in `peft_cache_config` that are overridden by proper values in `lora_config` is a solid approach to verify the override mechanism.
tensorrt_llm/llmapi/llm_args.py (4)
6-6: LGTM! The new imports support the generic `from_pybind` method implementation and follow Python typing best practices.
Also applies to: 12-12, 65-65
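As a rough sketch of what such a generic pybind-to-pydantic conversion can look like (a simplified, self-contained stand-in, not the code under review):

```python
from pydantic import BaseModel

class PeftCacheConfigPy(BaseModel):
    """Illustrative pydantic mirror of a pybind config class."""

    device_cache_percent: float = 0.02  # 2% default, per this PR
    host_cache_size: int = 1 << 30      # 1 GiB default, per this PR

    @classmethod
    def from_pybind(cls, pybind_obj) -> "PeftCacheConfigPy":
        # Copy only the fields the pybind object actually set; fields left
        # as None fall back to the Python-side defaults declared above.
        kwargs = {
            name: getattr(pybind_obj, name)
            for name in cls.model_fields  # pydantic v2 field registry
            if getattr(pybind_obj, name, None) is not None
        }
        return cls(**kwargs)
```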
757-757: LGTM! The changes properly establish default values for cache configuration and clarify the override behavior between different cache size parameters. The explicit defaults (2% for device cache, 1 GiB for host cache) align with the C++ implementation and improve configuration transparency.
Also applies to: 761-762, 789-800
1539-1542: LGTM! The simplification correctly removes dependency on deprecated LoRA fields and adds proper null checking for `lora_config` before accessing its properties. This aligns with making LoRA configuration parameters optional.
1672-1678: LGTM! The validator correctly enforces that the unsupported `lora_prefetch_dir` feature cannot be used. The null check prevents AttributeError when `peft_cache_config` is None, addressing the concern from previous reviews.
PR_Github #13503 [ run ] triggered by Bot |
PR_Github #13484 [ run ] completed with state |
PR_Github #13503 [ run ] completed with state |
Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
Description

- Changed `LoraConfig.max_loras` and `LoraConfig.max_cpu_loras` to be optional. When they're not set, the cache size would be determined by the `PeftCacheConfig`.
- Defined the precedence between the `peft_cache_config` and `lora_config` LLM args - when cache size fields in `lora_config` have a value, they would take precedence over the relevant fields in `peft_cache_config`.
- Added the following fields to the `lora_config: LoraConfig` LLM arg: `max_lora_rank`, `max_loras`, `max_cpu_loras`.
- Set defaults in the `PeftCacheConfig` class for `device_cache_percent` to 2% and `host_cache_size` to 1 GiB, the same default values that the CPP code uses when these fields have no value.
- `lora_config` in LLM args would take precedence over `lora_config` from the engine build config.
- Raise an error when `peft_cache_config.lora_prefetch_dir` has a value, as currently it's not supported.
Summary by CodeRabbit
New Features
Tests
Examples
Documentation & Validation
Test Coverage
tests/unittest/llmapi/test_llm.py::test_llama_7b_peft_cache_config_affects_peft_cache_size
tests/unittest/llmapi/test_llm.py::test_llama_7b_lora_config_overrides_peft_cache_config
tests/unittest/llmapi/test_llm_pytorch.py::test_llama_7b_peft_cache_config_affects_peft_cache_size
tests/unittest/llmapi/test_llm_pytorch.py::test_llama_7b_lora_config_overrides_peft_cache_config
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
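For example, a typical invocation combining documented options (values are illustrative):
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"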
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.