
Conversation


@amitz-nv amitz-nv commented Jul 31, 2025

Description

Added support for reusing a LoRA adapter after it was evicted from the LoRA CPU cache:

  • Added an optional py_lora_path field to the Request and LlmRequest classes (Python only).
  • The Python PeftCacheManager.add_request_peft now loads the LoRA adapter from its checkpoint if it is not already in the cache (see the sketch after this list).
  • Refactored the parameterized LoRA test into separate test cases.
  • Added new LoRA test cases:
    • Test that a LoRA adapter can still be used after it was evicted from the LoRA CPU cache.
    • Test that LoRA adapters can be used after they were inserted into the LoRA cache.
  • Removed the "reuse after CPU eviction not supported" note from the C++ exception message, and removed the test case that verified the exception included that note.
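
A minimal sketch of this reload flow is shown below. It is illustrative only: the helper names is_task_cached and load_from_checkpoint and the exact request attribute names are assumptions, not the actual implementation of PeftCacheManager.add_request_peft.

def add_request_peft(self, request):
    # Sketch only -- helper and attribute names here are assumptions.
    task_id = request.lora_task_id
    if task_id is None:
        return  # the request does not use a LoRA adapter

    if self.is_task_cached(task_id):
        # The adapter already lives in the PEFT cache: drop the tensors from
        # the request so they are not transferred and validated again.
        request.remove_lora_tensors()
    elif request.lora_weights is None and request.py_lora_path is not None:
        # The adapter was evicted (or never loaded): reload it from its
        # checkpoint and re-attach the tensors, adding the leading dimension
        # the C++ backend expects.
        weights, config = self.load_from_checkpoint(request.py_lora_path)
        request.lora_weights = weights.unsqueeze(0)
        request.lora_config = config.unsqueeze(0)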

Test Coverage

  • tests/unittest/llmapi/test_llm_pytorch.py::test_llama_7b_multi_lora_evict_and_reload_evicted_adapters_in_cpu_and_gpu_cache
  • tests/unittest/llmapi/test_llm_pytorch.py::test_llama_7b_multi_lora_read_from_cache_after_insert

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
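
For example, an invocation combining a few of the options documented above (the stage name is just the placeholder from this help text) could look like:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast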

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

Summary by CodeRabbit

  • New Features

    • Improved management of LoRA adapters, including enhanced caching and support for loading LoRA weights from checkpoints when needed.
    • Added new attribute for specifying LoRA adapter paths in requests.
    • Introduced a method to remove LoRA tensors from requests to optimize caching.
  • Bug Fixes

    • Refined validation logic for LoRA configurations, allowing config-only input and providing clearer error messages for invalid cases.
  • Refactor

    • Centralized and streamlined LoRA tensor handling and caching logic for better maintainability.
    • Simplified error handling related to LoRA task caching.
    • Updated LoRA configuration validation for clearer separation of weights and config checks.
  • Tests

    • Updated and reorganized LoRA caching and eviction tests for greater clarity and coverage.
    • Removed unused subprocess and environment variable utilities from test helpers.

@amitz-nv amitz-nv requested a review from shaharmor98 July 31, 2025 08:48
@amitz-nv amitz-nv self-assigned this Jul 31, 2025
@amitz-nv amitz-nv requested review from a team as code owners July 31, 2025 08:48
@amitz-nv amitz-nv requested a review from Superjomn July 31, 2025 08:48
Contributor

coderabbitai bot commented Jul 31, 2025

📝 Walkthrough

This update refactors and enhances LoRA (Low-Rank Adaptation) adapter management across C++, Python, and test code. It introduces new methods and attributes for LoRA tensor handling, updates validation and resource management logic, and restructures related tests for clarity. Some obsolete subprocess utilities and decorators are removed.

Changes

Cohort / File(s) Change Summary
C++ LlmRequest and LoRA Tensor Management
cpp/include/tensorrt_llm/batch_manager/llmRequest.h, cpp/tensorrt_llm/batch_manager/llmRequest.cpp
Added removeLoraTensors() method to LlmRequest for clearing LoRA weights and config.
C++ PEFT Cache and LoRA Validation
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp, cpp/tensorrt_llm/executor/loraConfig.cpp
Refactored LoRA cache error handling and validation: direct use of LoRA task ID, simplified exception handling, and decoupled config/weights validation logic.
C++ Nanobind and Pybind Bindings for LlmRequest
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp, cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
Added remove_lora_tensors method binding to LlmRequest class for Python exposure via nanobind and pybind11.
Python Executor and LoRA Path Handling
tensorrt_llm/_torch/pyexecutor/llm_request.py, tensorrt_llm/executor/worker.py
Added py_lora_path attribute to LlmRequest and propagated it through request creation and enqueuing. Simplified config assignment logic.
Python PEFT Resource Management
tensorrt_llm/_torch/pyexecutor/resource_manager.py, tensorrt_llm/_torch/pyexecutor/_util.py
Enhanced PeftCacheManager to accept lora_config, centralize LoRA adapter caching/loading, and reshape tensors as needed. Updated instantiation to include lora_config.
Python Utilities
tensorrt_llm/_utils.py
Added _binding_to_str_dtype mapping and binding_to_str_dtype() function for dtype conversions.
Test Refactoring: Resource Manager
tests/unittest/_torch/test_resource_manager.py
Removed pytest usage, simplified test data handling, and updated PeftCacheManager instantiation to require lora_config.
Test Refactoring: LoRA Adapter Caching
tests/unittest/llmapi/test_llm.py, tests/unittest/llmapi/test_llm_pytorch.py
Replaced parameterized tests with explicit test functions and shared helpers for LoRA adapter eviction/loading scenarios. Removed subprocess-based failure test.
Test Utilities Cleanup
tests/unittest/utils/util.py
Removed subprocess execution and environment variable context manager utilities, including related imports.
C++ LoRA Config Test Update
cpp/tests/unit_tests/executor/loraConfigTest.cpp
Updated tests to allow config-only LoRA input and clarify validation for weights-only input.

Sequence Diagram(s)

sequenceDiagram
    participant PythonClient
    participant ExecutorWorker
    participant LlmRequest
    participant PeftCacheManager
    participant LoraManager
    participant CppBackend

    PythonClient->>ExecutorWorker: Submit request (may include LoRA adapter)
    ExecutorWorker->>LlmRequest: Create LlmRequest (with py_lora_path)
    ExecutorWorker->>PeftCacheManager: prepare_resources([LlmRequest])
    PeftCacheManager->>PeftCacheManager: add_request_peft(LlmRequest)
    alt LoRA task in C++ cache
        PeftCacheManager->>LlmRequest: removeLoraTensors()
    else Not cached, lora_path present
        PeftCacheManager->>LoraManager: load LoRA weights/config from checkpoint
        LoraManager-->>PeftCacheManager: Return weights/config
        PeftCacheManager->>LlmRequest: Set lora_weights/lora_config (reshaped)
    end
    PeftCacheManager->>CppBackend: Add LoRA tensors/config if needed
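
For orientation, here is a rough end-to-end usage sketch of the scenario above. Import paths, constructor arguments, and the cache-size fields are assumptions and may not match the actual LLM API exactly.

# Rough usage sketch only -- the names below are assumptions, not the verified tensorrt_llm API.
from tensorrt_llm import LLM
from tensorrt_llm.executor.request import LoRARequest  # assumed import path
from tensorrt_llm.lora_manager import LoraConfig

lora_config = LoraConfig(
    lora_target_modules=["attn_q", "attn_k", "attn_v"],  # assumed module names
    max_lora_rank=8,
    max_loras=1,      # assumed field: GPU cache holds a single adapter
    max_cpu_loras=1,  # assumed field: CPU cache holds a single adapter
)
llm = LLM(model="/path/to/llama-7b", lora_config=lora_config)

req_a = LoRARequest("adapter-a", 0, "/path/to/lora_a")
req_b = LoRARequest("adapter-b", 1, "/path/to/lora_b")

# Using adapter B evicts adapter A from the single-slot GPU/CPU caches ...
llm.generate(["prompt 1"], lora_request=[req_a])
llm.generate(["prompt 2"], lora_request=[req_b])
# ... and with this change, reusing adapter A reloads it from its checkpoint
# path instead of failing with the old "reuse after CPU eviction not
# supported" error.
llm.generate(["prompt 3"], lora_request=[req_a])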

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • shaharmor98
  • byshiue
  • Superjomn
  • litaotju

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #13642 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d38c26b and 4041c53.

📒 Files selected for processing (22)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (19 hunks)
  • cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/llmRequest.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/executor/loraConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (5 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (3 hunks)
  • cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (5 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/llmRequest.h (3 hunks)
  • cpp/tensorrt_llm/pybind/executor/request.cpp (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (4 hunks)
  • tensorrt_llm/_utils.py (2 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
  • tests/unittest/_torch/test_resource_manager.py (6 hunks)
  • tests/unittest/llmapi/test_llm.py (2 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (4 hunks)
  • tests/unittest/utils/util.py (1 hunks)
🧰 Additional context used
🧠 Learnings (19)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/_util.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/batch_manager/llmRequest.cpp (2)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp (2)

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

tensorrt_llm/executor/worker.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

tests/unittest/_torch/test_resource_manager.py (2)

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/executor/loraConfig.cpp (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/include/tensorrt_llm/executor/executor.h (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/nanobind/executor/request.cpp (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

cpp/tensorrt_llm/pybind/executor/request.cpp (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

tests/unittest/llmapi/test_llm.py (2)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

tests/unittest/llmapi/test_llm_pytorch.py (2)

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

🧬 Code Graph Analysis (10)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
  • lora_config (244-264)
tensorrt_llm/executor/worker.py (2)
tensorrt_llm/lora_manager.py (1)
  • cpp_lora_config (1157-1158)
tensorrt_llm/executor/request.py (1)
  • path (47-48)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
  • lora_config (244-264)
tensorrt_llm/executor/request.py (1)
  • path (47-48)
tests/unittest/_torch/test_resource_manager.py (3)
tensorrt_llm/lora_manager.py (2)
  • LoraConfig (200-216)
  • lora_weights (1145-1146)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
  • lora_config (244-264)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • PeftCacheManager (1165-1272)
cpp/tensorrt_llm/executor/loraConfig.cpp (1)
tensorrt_llm/executor/request.py (1)
  • path (47-48)
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)
cpp/tensorrt_llm/executor/loraConfig.cpp (8)
  • getTaskId (62-65)
  • getTaskId (62-62)
  • getWeights (67-70)
  • getWeights (67-67)
  • getConfig (72-75)
  • getConfig (72-72)
  • getPath (77-80)
  • getPath (77-77)
cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (3)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2)
  • setLoraAdapterPath (927-930)
  • updatePerfMetrics (1821-1838)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
  • lora_config (244-264)
cpp/tensorrt_llm/batch_manager/llmRequest.cpp (2)
  • removeLoraTensors (353-357)
  • removeLoraTensors (353-353)
cpp/tensorrt_llm/pybind/executor/request.cpp (1)
cpp/tensorrt_llm/executor/loraConfig.cpp (9)
  • LoraConfig (25-60)
  • getTaskId (62-65)
  • getTaskId (62-62)
  • getWeights (67-70)
  • getWeights (67-67)
  • getConfig (72-75)
  • getConfig (72-72)
  • getPath (77-80)
  • getPath (77-77)
tests/unittest/llmapi/test_llm.py (1)
tests/unittest/llmapi/test_llm_pytorch.py (3)
  • test_llama_7b_multi_lora_evict_and_reload_lora_gpu_cache (229-238)
  • _check_llama_7b_multi_lora_evict_load_new_adapters (207-225)
  • test_llama_7b_multi_lora_evict_and_load_new_adapters_in_cpu_and_gpu_cache (242-251)
tests/unittest/llmapi/test_llm_pytorch.py (4)
tests/unittest/utils/util.py (1)
  • similar (365-367)
tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (1)
  • similar (15-17)
tests/unittest/_torch/multi_gpu_modeling/test_llama4.py (1)
  • similar (84-85)
tests/unittest/llmapi/test_llm.py (3)
  • _check_llama_7b_multi_lora_evict_load_new_adapters (1455-1473)
  • test_llama_7b_multi_lora_evict_and_reload_lora_gpu_cache (1477-1486)
  • test_llama_7b_multi_lora_evict_and_load_new_adapters_in_cpu_and_gpu_cache (1490-1499)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (62)
tests/unittest/utils/util.py (1)

6-6: LGTM - Import cleanup aligns with removed functionality.

The reduction in typing imports correctly reflects the removal of subprocess utilities and environment variable context managers, while preserving imports needed for remaining functions.

cpp/tensorrt_llm/batch_manager/llmRequest.cpp (1)

353-357: LGTM - Clean implementation for LoRA tensor cleanup.

The method correctly clears LoRA tensors and configuration by resetting the optional members. This supports the LoRA adapter caching functionality where tensors can be removed from requests when cached elsewhere.

tensorrt_llm/_torch/pyexecutor/_util.py (1)

506-506: LGTM - Proper propagation of LoRA configuration.

The addition of the lora_config parameter to the PeftCacheManager constructor correctly propagates the LoRA adapter configuration (including the new adapter path) for proper cache management.

cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp (2)

78-79: LGTM - Comment accurately reflects parameter count increase.

The comment update correctly indicates the new parameter count after adding the LoRA adapter path parameter.


102-102: LGTM - LoRA adapter path parameter correctly added.

The addition of mLoraAdapterPath parameter is properly positioned with other LoRA-related parameters and supports the new LoRA adapter path functionality.

cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (2)

79-80: LGTM - Parameter count comment consistent with pybind version.

The comment correctly reflects the increased parameter count and maintains consistency between nanobind and pybind implementations.


103-103: LGTM - LoRA adapter path parameter addition consistent across bindings.

The addition of mLoraAdapterPath parameter maintains consistency with the pybind implementation, ensuring uniform LoRA adapter path support across both binding systems.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

451-452: LGTM!

The addition of the lora_path parameter correctly follows the established conditional pattern and integrates with the broader LoRA adapter path support being added across the codebase.

tensorrt_llm/_utils.py (2)

182-182: LGTM!

The inverse mapping dictionary follows the established pattern used elsewhere in the file and provides the necessary reverse lookup functionality.


197-201: LGTM!

The new utility function correctly implements the inverse conversion with proper error handling and follows the established pattern of other dtype conversion functions in this file.
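
For readers unfamiliar with the pattern, a generic sketch of such an inverse lookup follows; the enum values and dtype names are placeholders, not the actual tensorrt_llm binding dtypes.

# Illustrative sketch of an inverse dtype lookup with basic error handling.
_str_dtype_to_binding = {"float16": 0, "bfloat16": 1, "float32": 2}  # placeholder values
_binding_to_str_dtype = {v: k for k, v in _str_dtype_to_binding.items()}

def binding_to_str_dtype(binding_dtype) -> str:
    str_dtype = _binding_to_str_dtype.get(binding_dtype)
    assert str_dtype is not None, f"Unsupported binding dtype: {binding_dtype}"
    return str_dtype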

tensorrt_llm/executor/worker.py (1)

384-385: LGTM - Addresses known race condition issue

The changes correctly integrate the new LoRA adapter path parameter and appear to address the known race condition issue mentioned in the retrieved learning by always setting the config parameter unconditionally, rather than conditionally based on cache status.

The addition of the path parameter aligns with the broader LoRA adapter path support being implemented across the codebase.

tests/unittest/_torch/test_resource_manager.py (4)

19-19: LGTM! Import aligns with updated PeftCacheManager constructor.

The LoraConfig import is properly added to support the updated PeftCacheManager constructor that now requires a lora_config parameter.


264-271: LGTM! Method simplification improves clarity.

The removal of dimension expansion and updated docstring make the helper method more straightforward while maintaining its core functionality of loading LoRA test data.


276-280: LGTM! Constructor updated to match new signature.

The addition of lora_config=LoraConfig() correctly adapts to the updated PeftCacheManager constructor requirements while using appropriate default values for testing.


290-294: LGTM! Consistent constructor parameter updates.

All PeftCacheManager instantiations are consistently updated with the required lora_config=LoraConfig() parameter, ensuring compatibility with the new constructor signature across all test methods.

Also applies to: 308-312, 324-328, 358-362

cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1)

596-596: LGTM! Code simplification aligns with centralized LoRA handling.

The direct use of llmRequest->getLoraTaskId().value() eliminates the intermediate variable and aligns with the broader refactoring to centralize LoRA adapter handling in the Python resource manager layer. This change supports the resolution of the known race condition issue in LoRA adapter cache optimization.

cpp/include/tensorrt_llm/executor/executor.h (3)

328-329: LGTM! Clean API extension.

The constructor signature extension follows established patterns with proper optional parameter handling and maintains backward compatibility.


334-334: LGTM! Consistent getter implementation.

The getter method follows established patterns and naming conventions in the class.


345-346: LGTM! Proper member implementation.

The new private member follows established naming conventions and includes appropriate documentation.

cpp/tensorrt_llm/nanobind/executor/request.cpp (1)

293-299: LGTM! Constructor and property implementation is correct.

The constructor properly accepts the optional path parameter with appropriate defaults, and the read-only property accessor correctly exposes the getPath() method. The implementation follows established nanobind patterns.

cpp/tensorrt_llm/pybind/executor/request.cpp (4)

259-260: LGTM - Serialization updated correctly for new path member

The loraConfigGetstate function is properly extended to include the new path member as the 4th tuple element, consistent with the C++ LoraConfig::getPath() method.


261-269: LGTM - Deserialization correctly handles new path parameter

The loraConfigSetstate function is properly updated to:

  • Expect 4 tuple elements instead of 3
  • Pass the path parameter (state[3]) to the LoraConfig constructor
  • Maintain proper type casting for the optional string parameter

271-273: LGTM - Constructor binding updated correctly

The constructor binding properly reflects the new C++ LoraConfig signature with the optional path parameter, including appropriate default value of py::none().


277-277: LGTM - Path property binding added correctly

The read-only property binding for path follows the established pattern and provides appropriate access to the new getPath() method.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (5)

13-13: LGTM - Required imports added for LoRA functionality

The import of LoraConfig, LoraManager, and LoraModelConfig is necessary for the enhanced LoRA support in PeftCacheManager.


169-169: LGTM - Constructor parameter added for LoRA configuration

The addition of the lora_config: LoraConfig parameter enables the PeftCacheManager to support LoRA adapter functionality.


200-205: LGTM - LoRA components initialized correctly

The initialization of LoRA-related components is well-structured:

  • _lora_config stores the provided configuration
  • _lora_model_config is properly constructed with target modules, HF mappings, hidden size, and data type
  • _lora_manager provides the LoRA management functionality

207-232: Enhanced LoRA request handling with multiple scenarios covered

The add_request_peft method effectively handles different LoRA adapter scenarios:

  1. Cache hit: Removes LoRA tensors when adapter is already cached (lines 210-214)
  2. Cache miss with path: Loads adapter from checkpoint using LoraManager (lines 215-224)
  3. Tensor reshaping: Adds required leading dimension for CPP backend compatibility (lines 226-231)

The implementation correctly leverages the LoraManager for checkpoint loading and handles the CPP backend's tensor format requirements.

Note: The cache optimization logic is consistent with the known race condition limitation mentioned in the retrieved learnings, which requires a comprehensive solution beyond simple error handling.
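
To make the reshaping step concrete, a small stand-alone illustration follows; the shapes are arbitrary and the exact tensors handled in resource_manager.py may differ.

import torch

# The C++ PEFT backend expects LoRA weights/config with an extra leading
# dimension, so a per-adapter 2-D tensor is expanded to 3-D before being
# attached to the request.
weights = torch.zeros(8, 64)                    # e.g. (num_modules, flattened_weights)
config = torch.zeros(8, 3, dtype=torch.int32)   # e.g. (num_modules, config_fields)

weights_for_cpp = weights.unsqueeze(0)  # shape (1, 8, 64)
config_for_cpp = config.unsqueeze(0)    # shape (1, 8, 3)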


247-251: LGTM - Centralized LoRA processing for context requests

The updated prepare_resources method correctly calls add_request_peft for each context request, ensuring consistent LoRA adapter handling and tensor management across all requests.

cpp/tensorrt_llm/executor/loraConfig.cpp (3)

25-26: LGTM! Constructor signature correctly extended for LoRA path support.

The addition of the optional path parameter and proper initialization of mPath using move semantics is well-implemented and aligns with the PR objective to support LoRA reload from CPU cache evicted adapters.

Also applies to: 30-30


32-59: Excellent validation logic refactoring.

The independent validation of config and weights tensors improves code clarity and correctly handles the optional nature of both parameters. The conditional dimension matching check ensures consistency when both tensors are present while allowing either to be optional.


77-80: LGTM! Getter method follows consistent patterns.

The getPath() method implementation is correct and follows the same pattern as other getter methods in the class.

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (3)

122-122: LGTM! LoRA path property correctly added to Python bindings.

The lora_path property is properly exposed with appropriate getter and setter methods, and is logically placed near other LoRA-related properties.


275-275: Constructor correctly extended for LoRA path support.

The addition of the lora_path parameter to the LlmRequest constructor is properly implemented:

  • Parameter placement after lora_config is logical
  • Parameter count comment correctly updated to 50
  • Python argument binding is correctly added
  • Parameter is properly passed through to the underlying C++ constructor

Also applies to: 323-336, 347-347


384-384: LGTM! LoRA tensor cleanup method properly exposed.

The remove_lora_tensors() method is correctly bound to expose the underlying C++ functionality for clearing LoRA weights and configuration tensors from requests.

tests/unittest/llmapi/test_llm.py (3)

1456-1475: LGTM! Well-structured helper function.

The helper function properly consolidates the common LoRA adapter eviction and loading test logic. The LoraConfig is correctly configured with target modules and cache parameters, and the BuildConfig setup looks appropriate for the test scenarios.


1477-1488: LGTM! Clear and well-documented test function.

The test function is properly structured with appropriate GPU memory requirements decorator and clear documentation. The parameter combination effectively tests the eviction and reloading scenario for LoRA adapters in GPU cache.


1490-1501: LGTM! Well-designed test for new adapter loading.

The test function effectively validates the eviction and loading of new adapters across multiple generate calls. The parameter configuration properly exercises the scenario where GPU cache size is smaller than CPU cache size, testing the adapter management logic comprehensively.

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (4)

121-121: LGTM! Well-implemented property addition.

The lora_path property follows the established pattern of other LoRA-related properties and uses appropriate getter/setter methods.


272-272: LGTM! Constructor parameter properly integrated.

The lora_path parameter is correctly added as an optional string, properly passed to the underlying constructor, and given an appropriate default value.

Also applies to: 325-325, 342-342


319-319: Good documentation practice.

Updating the parameter count comment helps maintain clarity about the constructor's complexity.


378-378: LGTM! Method binding follows established patterns.

The remove_lora_tensors method is properly exposed using standard nanobind patterns and provides clear functionality for LoRA tensor management.

tests/unittest/llmapi/test_llm_pytorch.py (5)

19-21: Good import cleanup.

Removing unused imports after the test refactoring improves code maintainability.


207-226: Well-designed helper function.

The helper function effectively reduces code duplication while providing clear parameters for customizing different LoRA eviction/loading test scenarios. The LoRA configuration setup is appropriate for the test use cases.


229-238: Well-documented test with clear purpose.

The test function clearly describes its scenario and uses appropriate parameters to test LoRA GPU cache eviction and reloading behavior.


242-251: Clear test scenario with appropriate parameters.

The test effectively validates LoRA adapter loading behavior across multiple calls with different cache constraints. The docstring clearly explains the expected behavior.


255-273: Excellent comprehensive test with detailed documentation.

This test function covers the most complex LoRA eviction/reloading scenario with an exceptionally clear docstring that explains the expected cache behavior step-by-step. The parameters are well-chosen to create the intended test conditions.

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (11)

101-116: LGTM: LoRA adapter path parameter addition is correct.

The new loraAdapterPath parameter is properly added with the correct type, default value, and logical positioning after other LoRA-related parameters.


166-166: LGTM: Proper initialization of LoRA adapter path member.

The member variable initialization correctly uses std::move and is positioned logically in the initializer list.


204-213: LGTM: Consistent LoRA adapter path parameter addition.

The parameter addition maintains consistency with the first constructor, using the same type and default value.


245-245: LGTM: Consistent member initialization.

The initialization pattern matches the first constructor and correctly uses move semantics.


397-400: LGTM: Proper integration with executor::Request.

The code correctly extracts the LoRA adapter path from the executor::Request's LoRA config when available, maintaining proper null safety.


922-930: LGTM: Well-implemented accessor methods.

The getter and setter methods follow the established patterns with proper const correctness, [[nodiscard]] attribute, and move semantics.


1924-1924: LGTM: Proper member variable declaration.

The member variable follows the class naming conventions and is properly initialized with a default value.


2179-2194: LGTM: LlmRequest constructor properly updated.

The first LlmRequest constructor correctly adds the loraPath parameter and forwards it to the base class. The parameter count comment is accurate.


2231-2246: LGTM: Second LlmRequest constructor properly updated.

The second constructor maintains consistency with the first, correctly adding and forwarding the loraPath parameter.


2299-2308: LGTM: Third LlmRequest constructor properly updated.

The third constructor is consistently updated with the new parameter and maintains proper forwarding to the base class.


2358-2358: LGTM: Clear method declaration for LoRA tensor cleanup.

The removeLoraTensors() method declaration is appropriately named and positioned for clearing LoRA-related tensors from requests.

cpp/tensorrt_llm/pybind/batch_manager/llmRequest.h (2)

53-68: LGTM: Consistent pybind constructor parameter addition.

The loraPath parameter is correctly added with proper type, default value, and the parameter count comment is accurately updated.


118-118: LGTM: Proper parameter forwarding to base class.

The loraPath parameter is correctly forwarded to the base class constructor, maintaining the established pattern.

cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (2)

54-54: LGTM: Parameter count updated correctly.

The comment accurately reflects the addition of the new loraPath parameter, bringing the total to 50.


119-119: LGTM: Parameter correctly forwarded to base class.

The loraPath parameter is properly passed to the base class constructor in the correct position within the member initializer list.

@tensorrt-cicd
Collaborator

PR_Github #13642 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10237 completed with status: 'FAILURE'

@amitz-nv amitz-nv force-pushed the dev-support-torch-lora-reload-adapter-after-cpu-cache-eviction branch from 4041c53 to 51a77a1 on July 31, 2025 12:45
@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #13677 [ run ] triggered by Bot

@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #13677 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10269 completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #13678 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13678 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10270 completed with status: 'FAILURE'

@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #13699 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13699 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10290 completed with status: 'FAILURE'

@amitz-nv amitz-nv force-pushed the dev-support-torch-lora-reload-adapter-after-cpu-cache-eviction branch from db86a37 to 17a8ddf on August 3, 2025 08:10
@amitz-nv amitz-nv changed the title from "[6683][feat] Support LoRA reload CPU cache evicted adapter" to "[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter" on Aug 3, 2025
@amitz-nv
Collaborator Author

amitz-nv commented Aug 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #13864 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13864 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10431 completed with status: 'FAILURE'

@amitz-nv
Collaborator Author

amitz-nv commented Aug 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #13875 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13875 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10441 completed with status: 'FAILURE'

@amitz-nv amitz-nv force-pushed the dev-support-torch-lora-reload-adapter-after-cpu-cache-eviction branch from 0e93f46 to c7a3076 on August 6, 2025 07:09
@amitz-nv amitz-nv requested review from a team as code owners August 6, 2025 07:09
@amitz-nv
Collaborator Author

amitz-nv commented Aug 6, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #14265 [ run ] triggered by Bot

Collaborator

@tomeras91 tomeras91 left a comment


approving on behalf of nemotron-code-owners

Collaborator

@shaharmor98 shaharmor98 left a comment


LGTM
BoYang hasn't answered us yet; perhaps we should ping him (about the minimum cache requirement). Other than that, looks great.

@tensorrt-cicd
Collaborator

PR_Github #14265 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10771 completed with status: 'FAILURE'

…when it's not in LoRA CPU cache and it doesn't have weights attached, worker.py _enqueue_request now always passes the LoRA config tensor along with the request, added path to LoraConfig, removed 'reuse after cpu eviction not supported' exception message, refactored LoRA tests

Signed-off-by: Amit Zuker <[email protected]>
…lasses, revert lora path changes in CPP

Signed-off-by: Amit Zuker <[email protected]>
…rs, added a test that verifies the flow it's required in

Signed-off-by: Amit Zuker <[email protected]>
…lmRequest instance instead of bindings.internal.batch_manager.LlmRequest

Signed-off-by: Amit Zuker <[email protected]>
@amitz-nv amitz-nv force-pushed the dev-support-torch-lora-reload-adapter-after-cpu-cache-eviction branch from d2862c2 to 6173aab on August 6, 2025 12:35
@amitz-nv
Collaborator Author

amitz-nv commented Aug 6, 2025

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/unittest/llmapi/test_llm_pytorch.py (1)

430-430: Memory skip decorators are consistent but consider CI assignment

The @skip_gpu_memory_less_than_40gb decorators are consistent with existing test patterns and prevent failures on insufficient hardware. However, as noted in past review comments, consider ensuring these tests are assigned to appropriate CI machines to avoid them being consistently skipped.

This is more of an infrastructure/CI configuration concern than a code issue, but it's worth tracking to ensure the LoRA reload functionality gets proper test coverage in CI.

Also applies to: 394-394

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c7a3076 and 6173aab.

📒 Files selected for processing (16)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/llmRequest.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/executor/loraConfig.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1 hunks)
  • cpp/tests/unit_tests/executor/loraConfigTest.cpp (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (4 hunks)
  • tensorrt_llm/_utils.py (2 hunks)
  • tensorrt_llm/executor/worker.py (3 hunks)
  • tests/unittest/_torch/test_resource_manager.py (7 hunks)
  • tests/unittest/llmapi/test_llm.py (2 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (4 hunks)
  • tests/unittest/utils/util.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
  • tensorrt_llm/_utils.py
🚧 Files skipped from review as they are similar to previous changes (11)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tensorrt_llm/batch_manager/llmRequest.cpp
  • cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
  • cpp/tests/unit_tests/executor/loraConfigTest.cpp
  • tests/unittest/_torch/test_resource_manager.py
  • tensorrt_llm/executor/worker.py
  • tests/unittest/utils/util.py
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
  • tests/unittest/llmapi/test_llm.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is highly encouraged for clarity.
Enumerations, global constants, static constants at class-scope and fu...

Files:

  • cpp/tensorrt_llm/executor/loraConfig.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/executor/loraConfig.cpp
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tests/unittest/llmapi/test_llm_pytorch.py
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tests/unittest/llmapi/test_llm_pytorch.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • cpp/tensorrt_llm/executor/loraConfig.cpp
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tests/unittest/llmapi/test_llm_pytorch.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)

1174-1210: LGTM: Constructor enhancement for LoRA support

The constructor correctly initializes the LoRA-related components (LoraModelConfig and LoraManager) using the provided lora_config parameter. The initialization follows proper patterns and integrates well with the existing C++ backend setup.
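
A rough sketch of that wiring, with stand-in types (the class and field names below are assumptions for illustration; the real constructor builds LoraModelConfig and LoraManager from the lora_config argument):

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LoraConfigStub:
    """Stand-in for the lora_config argument; field names are assumptions."""
    lora_target_modules: List[str] = field(default_factory=list)
    max_lora_rank: int = 64


class PeftCacheManagerSketch:
    """Illustrative only: shows where the LoRA model config and manager are
    created relative to the existing C++ cache setup (elided here)."""

    def __init__(self, lora_config: Optional[LoraConfigStub] = None):
        # ... existing C++ peft cache / backend setup would happen here ...
        self._lora_config = lora_config
        self._lora_model_config = None
        self._lora_manager = None
        if lora_config is not None:
            # Real code: LoraModelConfig built from lora_config plus the model config.
            self._lora_model_config = {
                "target_modules": lora_config.lora_target_modules,
                "max_rank": lora_config.max_lora_rank,
            }
            # Real code: LoraManager(), used later to load adapters on demand.
            self._lora_manager = {}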


1212-1235: Review LoRA caching logic for potential edge cases

The implementation looks sound, but consider these potential issues:

  1. Race condition: Similar to the known issue in tensorrt_llm/executor/worker.py, there could be a race between checking is_task_cached() and the actual cache state when the request is processed.

  2. Error handling: No validation that lora_task_id exists when py_lora_path is provided, or that the path is valid before attempting to load.

  3. Memory management: The unsqueeze operations create new tensors - ensure the original tensors are properly cleaned up if needed.

Consider adding validation and error handling:

 def add_request_peft(self, request: LlmRequest):
     if request.lora_task_id is not None:
         is_task_cached = self.impl.is_task_cached(request.lora_task_id)
         if is_task_cached:
             request.remove_lora_tensors()
         elif request.lora_weights is None and request.py_lora_path:
+            if not request.py_lora_path.exists():
+                raise FileNotFoundError(f"LoRA adapter path not found: {request.py_lora_path}")
             self._lora_manager.load_from_ckpt(
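
If py_lora_path is a plain string rather than a pathlib.Path, the same guard can be written with os.path.exists. A self-contained sketch of the check-then-load flow (the request type and load_adapter callable below are stand-ins, not the real LlmRequest or LoraManager API):

import os
from typing import Callable, Optional, Set


class RequestStub:
    """Minimal stand-in for LlmRequest; only the fields used below."""

    def __init__(self, lora_task_id: Optional[int], py_lora_path: Optional[str],
                 lora_weights=None):
        self.lora_task_id = lora_task_id
        self.py_lora_path = py_lora_path
        self.lora_weights = lora_weights


def add_request_peft_sketch(request: RequestStub, cached_task_ids: Set[int],
                            load_adapter: Callable[[str, int], None]) -> None:
    """Skip loading when the task is already cached; otherwise load the adapter
    from py_lora_path after checking that the path exists."""
    if request.lora_task_id is None:
        return
    if request.lora_task_id in cached_task_ids:
        # Already cached: the weights are not needed again
        # (the real code also drops the request's LoRA tensors here).
        return
    if request.lora_weights is None and request.py_lora_path:
        if not os.path.exists(request.py_lora_path):
            raise FileNotFoundError(
                f"LoRA adapter path not found: {request.py_lora_path}")
        load_adapter(request.py_lora_path, request.lora_task_id)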

1254-1254: LGTM: Good refactoring to centralize LoRA handling

The change to call add_request_peft() centralizes the LoRA tensor handling logic and makes the code more maintainable. This follows the single responsibility principle well.

cpp/tensorrt_llm/executor/loraConfig.cpp (1)

30-56: LGTM: Improved validation logic allows config-only LoRA

The refactored validation correctly implements the requirement to allow LoRA config without weights while preventing weights without config. The separation into two conditional blocks improves readability and flexibility.

Key improvements:

  • Lines 30-40: Validate the config tensor independently
  • Lines 41-56: Validate the weights tensor and enforce the config requirement (line 45)
  • Maintains all existing tensor validation checks

This aligns with the past review comments and supports the new LoRA reload functionality.
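
The invariant this validation enforces, written out as a minimal Python sketch (this is not the loraConfig.cpp code; it only restates the config-only-allowed / weights-require-config rule):

from typing import Optional


def validate_lora_inputs(config: Optional[object], weights: Optional[object]) -> None:
    """Config without weights is allowed (the adapter can be reloaded by task id);
    weights without config are rejected."""
    if config is not None:
        # The real code validates the config tensor's shape, dtype and values here.
        pass
    if weights is not None:
        if config is None:
            raise ValueError("LoRA weights were provided without a LoRA config")
        # The real code validates the weights tensor and its consistency with the config here.


# Allowed: config-only request (weights already cached or loadable from py_lora_path).
validate_lora_inputs(config=object(), weights=None)

# Rejected: weights without config.
try:
    validate_lora_inputs(config=None, weights=object())
except ValueError as err:
    print(err)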

tests/unittest/llmapi/test_llm_pytorch.py (2)

314-333: LGTM: Good refactoring from parameterized to explicit tests

Converting the parameterized test to a helper function with explicit test cases improves test clarity and maintainability. Each test now has a descriptive name and clear intent.
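
The resulting structure looks roughly like this (test and helper names below are hypothetical, not the actual names in test_llm_pytorch.py):

def _check_llama_multi_lora(lora_adapter_count: int, max_loras: int,
                            max_cpu_loras: int) -> None:
    """Shared helper: builds the LLM, registers the adapters and checks the
    outputs (body elided in this sketch)."""
    ...


def test_multi_lora_evict_and_reload() -> None:
    # Two adapters with room for only one in each cache level forces eviction.
    _check_llama_multi_lora(lora_adapter_count=2, max_loras=1, max_cpu_loras=1)


def test_multi_lora_all_adapters_fit_in_cache() -> None:
    _check_llama_multi_lora(lora_adapter_count=2, max_loras=2, max_cpu_loras=2)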


336-391: Comprehensive test coverage for LoRA reload functionality

The four new test functions provide excellent coverage of the LoRA reload scenarios:

  1. GPU cache eviction/reload - Tests basic eviction and reuse within single call
  2. CPU/GPU cache interaction - Tests complex cache hierarchy behavior
  3. Cache hit after insert - Verifies successful caching and retrieval
  4. Complex eviction scenario - Tests the core PR functionality with detailed expected behavior

The detailed docstrings clearly explain each test's purpose and expected behavior, making the test suite maintainable. The parameters are well-chosen to exercise different cache configurations and eviction patterns.
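
The core eviction-then-reload flow these tests exercise, as a hedged sketch (make_lora_request, the single-adapter cache sizing, and the output layout are assumptions, not the actual test code):

def exercise_evict_and_reload(llm, make_lora_request, prompts):
    """Assumes the LoRA caches were configured to hold a single adapter, so
    using adapter 1 evicts adapter 0; adapter 0 must then reload transparently."""
    first = llm.generate(prompts, lora_request=make_lora_request(0))
    llm.generate(prompts, lora_request=make_lora_request(1))  # evicts adapter 0
    again = llm.generate(prompts, lora_request=make_lora_request(0))  # reload after eviction
    # The reloaded adapter should reproduce the original outputs.
    assert [o.outputs[0].text for o in first] == [o.outputs[0].text for o in again]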

@tensorrt-cicd
Collaborator

PR_Github #14307 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14307 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10808 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.
