[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter #6510
Conversation
📝 Walkthrough
This update refactors and enhances LoRA (Low-Rank Adaptation) adapter management across C++, Python, and test code. It introduces new methods and attributes for LoRA tensor handling, updates validation and resource management logic, and restructures related tests for clarity. Some obsolete subprocess utilities and decorators are removed.
Sequence Diagram(s)
sequenceDiagram
participant PythonClient
participant ExecutorWorker
participant LlmRequest
participant PeftCacheManager
participant LoraManager
participant CppBackend
PythonClient->>ExecutorWorker: Submit request (may include LoRA adapter)
ExecutorWorker->>LlmRequest: Create LlmRequest (with py_lora_path)
ExecutorWorker->>PeftCacheManager: prepare_resources([LlmRequest])
PeftCacheManager->>PeftCacheManager: add_request_peft(LlmRequest)
alt LoRA task in C++ cache
PeftCacheManager->>LlmRequest: removeLoraTensors()
else Not cached, lora_path present
PeftCacheManager->>LoraManager: load LoRA weights/config from checkpoint
LoraManager-->>PeftCacheManager: Return weights/config
PeftCacheManager->>LlmRequest: Set lora_weights/lora_config (reshaped)
end
PeftCacheManager->>CppBackend: Add LoRA tensors/config if needed
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested reviewers
/bot run |
PR_Github #13642 [ run ] triggered by Bot |
Actionable comments posted: 2
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (19 hunks)
- cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/llmRequest.cpp (1 hunk)
- cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1 hunk)
- cpp/tensorrt_llm/executor/loraConfig.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (5 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (3 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunk)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (5 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/llmRequest.h (3 hunks)
- cpp/tensorrt_llm/pybind/executor/request.cpp (1 hunk)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunk)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunk)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (4 hunks)
- tensorrt_llm/_utils.py (2 hunks)
- tensorrt_llm/executor/worker.py (1 hunk)
- tests/unittest/_torch/test_resource_manager.py (6 hunks)
- tests/unittest/llmapi/test_llm.py (2 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (4 hunks)
- tests/unittest/utils/util.py (1 hunk)
🧰 Additional context used
🧠 Learnings (19)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/_util.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/batch_manager/llmRequest.cpp (2)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp (2)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/executor/worker.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
tests/unittest/_torch/test_resource_manager.py (2)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/executor/loraConfig.cpp (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/include/tensorrt_llm/executor/executor.h (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
cpp/tensorrt_llm/pybind/executor/request.cpp (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm.py (2)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
tests/unittest/llmapi/test_llm_pytorch.py (2)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor()
is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation()
to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🧬 Code Graph Analysis (10)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
lora_config
(244-264)
tensorrt_llm/executor/worker.py (2)
tensorrt_llm/lora_manager.py (1)
cpp_lora_config
(1157-1158)tensorrt_llm/executor/request.py (1)
path
(47-48)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
lora_config
(244-264)tensorrt_llm/executor/request.py (1)
path
(47-48)
tests/unittest/_torch/test_resource_manager.py (3)
tensorrt_llm/lora_manager.py (2)
LoraConfig
(200-216)lora_weights
(1145-1146)tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
lora_config
(244-264)tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
PeftCacheManager
(1165-1272)
cpp/tensorrt_llm/executor/loraConfig.cpp (1)
tensorrt_llm/executor/request.py (1)
path
(47-48)
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)
cpp/tensorrt_llm/executor/loraConfig.cpp (8)
getTaskId
(62-65)getTaskId
(62-62)getWeights
(67-70)getWeights
(67-67)getConfig
(72-75)getConfig
(72-72)getPath
(77-80)getPath
(77-77)
cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (3)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2)
setLoraAdapterPath
(927-930)updatePerfMetrics
(1821-1838)tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
lora_config
(244-264)cpp/tensorrt_llm/batch_manager/llmRequest.cpp (2)
removeLoraTensors
(353-357)removeLoraTensors
(353-353)
cpp/tensorrt_llm/pybind/executor/request.cpp (1)
cpp/tensorrt_llm/executor/loraConfig.cpp (9)
LoraConfig
(25-60)getTaskId
(62-65)getTaskId
(62-62)getWeights
(67-70)getWeights
(67-67)getConfig
(72-75)getConfig
(72-72)getPath
(77-80)getPath
(77-77)
tests/unittest/llmapi/test_llm.py (1)
tests/unittest/llmapi/test_llm_pytorch.py (3)
test_llama_7b_multi_lora_evict_and_reload_lora_gpu_cache
(229-238)_check_llama_7b_multi_lora_evict_load_new_adapters
(207-225)test_llama_7b_multi_lora_evict_and_load_new_adapters_in_cpu_and_gpu_cache
(242-251)
tests/unittest/llmapi/test_llm_pytorch.py (4)
tests/unittest/utils/util.py (1)
similar
(365-367)tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (1)
similar
(15-17)tests/unittest/_torch/multi_gpu_modeling/test_llama4.py (1)
similar
(84-85)tests/unittest/llmapi/test_llm.py (3)
_check_llama_7b_multi_lora_evict_load_new_adapters
(1455-1473)test_llama_7b_multi_lora_evict_and_reload_lora_gpu_cache
(1477-1486)test_llama_7b_multi_lora_evict_and_load_new_adapters_in_cpu_and_gpu_cache
(1490-1499)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (62)
tests/unittest/utils/util.py (1)
6-6
: LGTM - Import cleanup aligns with removed functionality.The reduction in typing imports correctly reflects the removal of subprocess utilities and environment variable context managers, while preserving imports needed for remaining functions.
cpp/tensorrt_llm/batch_manager/llmRequest.cpp (1)
353-357
: LGTM - Clean implementation for LoRA tensor cleanup.The method correctly clears LoRA tensors and configuration by resetting the optional members. This supports the LoRA adapter caching functionality where tensors can be removed from requests when cached elsewhere.
tensorrt_llm/_torch/pyexecutor/_util.py (1)
506-506
: LGTM - Proper propagation of LoRA configuration.The addition of the
lora_config
parameter to thePeftCacheManager
constructor correctly propagates the LoRA adapter configuration (including the new adapter path) for proper cache management.cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp (2)
78-79
: LGTM - Comment accurately reflects parameter count increase.The comment update correctly indicates the new parameter count after adding the LoRA adapter path parameter.
102-102
: LGTM - LoRA adapter path parameter correctly added.The addition of
mLoraAdapterPath
parameter is properly positioned with other LoRA-related parameters and supports the new LoRA adapter path functionality.cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (2)
79-80
: LGTM - Parameter count comment consistent with pybind version.The comment correctly reflects the increased parameter count and maintains consistency between nanobind and pybind implementations.
103-103
: LGTM - LoRA adapter path parameter addition consistent across bindings.The addition of
mLoraAdapterPath
parameter maintains consistency with the pybind implementation, ensuring uniform LoRA adapter path support across both binding systems.tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
451-452: LGTM! The addition of the lora_path parameter correctly follows the established conditional pattern and integrates with the broader LoRA adapter path support being added across the codebase.
tensorrt_llm/_utils.py (2)
182-182: LGTM! The inverse mapping dictionary follows the established pattern used elsewhere in the file and provides the necessary reverse lookup functionality.
197-201: LGTM! The new utility function correctly implements the inverse conversion with proper error handling and follows the established pattern of other dtype conversion functions in this file.
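A minimal sketch of such an inverse lookup, assuming a forward str-to-torch-dtype table like the one _utils.py already keeps; the table contents and the name torch_dtype_to_str below are illustrative, not the actual identifiers in the file:

```python
# Illustrative sketch only; the real table and function names in
# tensorrt_llm/_utils.py may differ.
import torch

_str_to_torch_dtype = {
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
    "float32": torch.float32,
}

# Inverse mapping, built once from the forward table.
_torch_dtype_to_str = {v: k for k, v in _str_to_torch_dtype.items()}


def torch_dtype_to_str(dtype: torch.dtype) -> str:
    """Return the string name of a torch dtype, raising on unsupported dtypes."""
    try:
        return _torch_dtype_to_str[dtype]
    except KeyError as e:
        raise ValueError(f"Unsupported torch dtype: {dtype}") from e
```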
tensorrt_llm/executor/worker.py (1)
384-385: LGTM - Addresses known race condition issue. The changes correctly integrate the new LoRA adapter path parameter and appear to address the known race condition issue mentioned in the retrieved learning by always setting the config parameter unconditionally, rather than conditionally based on cache status. The addition of the path parameter aligns with the broader LoRA adapter path support being implemented across the codebase.
tests/unittest/_torch/test_resource_manager.py (4)
19-19
: LGTM! Import aligns with updated PeftCacheManager constructor.The LoraConfig import is properly added to support the updated PeftCacheManager constructor that now requires a lora_config parameter.
264-271
: LGTM! Method simplification improves clarity.The removal of dimension expansion and updated docstring make the helper method more straightforward while maintaining its core functionality of loading LoRA test data.
276-280
: LGTM! Constructor updated to match new signature.The addition of
lora_config=LoraConfig()
correctly adapts to the updated PeftCacheManager constructor requirements while using appropriate default values for testing.
290-294
: LGTM! Consistent constructor parameter updates.All PeftCacheManager instantiations are consistently updated with the required
lora_config=LoraConfig()
parameter, ensuring compatibility with the new constructor signature across all test methods.Also applies to: 308-312, 324-328, 358-362
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1)
596-596
: LGTM! Code simplification aligns with centralized LoRA handling.The direct use of
llmRequest->getLoraTaskId().value()
eliminates the intermediate variable and aligns with the broader refactoring to centralize LoRA adapter handling in the Python resource manager layer. This change supports the resolution of the known race condition issue in LoRA adapter cache optimization.cpp/include/tensorrt_llm/executor/executor.h (3)
328-329
: LGTM! Clean API extension.The constructor signature extension follows established patterns with proper optional parameter handling and maintains backward compatibility.
334-334
: LGTM! Consistent getter implementation.The getter method follows established patterns and naming conventions in the class.
345-346
: LGTM! Proper member implementation.The new private member follows established naming conventions and includes appropriate documentation.
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)
293-299
: LGTM! Constructor and property implementation is correct.The constructor properly accepts the optional
path
parameter with appropriate defaults, and the read-only property accessor correctly exposes thegetPath()
method. The implementation follows established nanobind patterns.cpp/tensorrt_llm/pybind/executor/request.cpp (4)
259-260
: LGTM - Serialization updated correctly for new path memberThe
loraConfigGetstate
function is properly extended to include the newpath
member as the 4th tuple element, consistent with the C++LoraConfig::getPath()
method.
261-269
: LGTM - Deserialization correctly handles new path parameterThe
loraConfigSetstate
function is properly updated to:
- Expect 4 tuple elements instead of 3
- Pass the path parameter (
state[3]
) to the LoraConfig constructor- Maintain proper type casting for the optional string parameter
271-273: LGTM - Constructor binding updated correctly. The constructor binding properly reflects the new C++ LoraConfig signature with the optional path parameter, including an appropriate default value of py::none().
277-277: LGTM - Path property binding added correctly. The read-only property binding for path follows the established pattern and provides appropriate access to the new getPath() method.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (5)
13-13: LGTM - Required imports added for LoRA functionality. The import of LoraConfig, LoraManager, and LoraModelConfig is necessary for the enhanced LoRA support in PeftCacheManager.
169-169: LGTM - Constructor parameter added for LoRA configuration. The addition of the lora_config: LoraConfig parameter enables the PeftCacheManager to support LoRA adapter functionality.
200-205: LGTM - LoRA components initialized correctly. The initialization of LoRA-related components is well-structured:
- _lora_config stores the provided configuration
- _lora_model_config is properly constructed with target modules, HF mappings, hidden size, and data type
- _lora_manager provides the LoRA management functionality
207-232: Enhanced LoRA request handling with multiple scenarios covered. The add_request_peft method effectively handles different LoRA adapter scenarios:
- Cache hit: Removes LoRA tensors when the adapter is already cached (lines 210-214)
- Cache miss with path: Loads the adapter from checkpoint using LoraManager (lines 215-224)
- Tensor reshaping: Adds the required leading dimension for CPP backend compatibility (lines 226-231)
The implementation correctly leverages the LoraManager for checkpoint loading and handles the CPP backend's tensor format requirements.
Note: The cache optimization logic is consistent with the known race condition limitation mentioned in the retrieved learnings, which requires a comprehensive solution beyond simple error handling.
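A rough sketch of that flow (method and attribute names such as is_task_cached, remove_lora_tensors, py_lora_path, and load_from_ckpt come from this review; the exact signatures and the cpp_lora_weights/cpp_lora_config accessors are assumptions):

```python
# Rough sketch of PeftCacheManager.add_request_peft as described above,
# not the actual implementation; accessor names and signatures are assumed.
def add_request_peft(self, request):
    if request.lora_task_id is None:
        return
    if self.impl.is_task_cached(request.lora_task_id):
        # Cache hit: the C++ PEFT cache already holds this adapter, so the
        # request no longer needs to carry its tensors.
        request.remove_lora_tensors()
    elif request.lora_weights is None and request.py_lora_path:
        # Cache miss with a checkpoint path: reload weights/config from disk.
        self._lora_manager.load_from_ckpt(
            [request.py_lora_path],
            model_config=self._lora_model_config,
            uids=[request.lora_task_id],
        )
        weights = self._lora_manager.cpp_lora_weights[request.lora_task_id]
        config = self._lora_manager.cpp_lora_config[request.lora_task_id]
        # The C++ backend expects an extra leading dimension on both tensors.
        request.lora_weights = weights.unsqueeze(0)
        request.lora_config = config.unsqueeze(0)
    # The request (with or without tensors) is then handed to the C++ cache.
```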
247-251: LGTM - Centralized LoRA processing for context requests. The updated prepare_resources method correctly calls add_request_peft for each context request, ensuring consistent LoRA adapter handling and tensor management across all requests.
cpp/tensorrt_llm/executor/loraConfig.cpp (3)
25-26
: LGTM! Constructor signature correctly extended for LoRA path support.The addition of the optional
path
parameter and proper initialization ofmPath
using move semantics is well-implemented and aligns with the PR objective to support LoRA reload from CPU cache evicted adapters.Also applies to: 30-30
32-59
: Excellent validation logic refactoring.The independent validation of
config
andweights
tensors improves code clarity and correctly handles the optional nature of both parameters. The conditional dimension matching check ensures consistency when both tensors are present while allowing either to be optional.
77-80
: LGTM! Getter method follows consistent patterns.The
getPath()
method implementation is correct and follows the same pattern as other getter methods in the class.cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (3)
122-122
: LGTM! LoRA path property correctly added to Python bindings.The
lora_path
property is properly exposed with appropriate getter and setter methods, and is logically placed near other LoRA-related properties.
275-275
: Constructor correctly extended for LoRA path support.The addition of the
lora_path
parameter to theLlmRequest
constructor is properly implemented:
- Parameter placement after
lora_config
is logical- Parameter count comment correctly updated to 50
- Python argument binding is correctly added
- Parameter is properly passed through to the underlying C++ constructor
Also applies to: 323-336, 347-347
384-384
: LGTM! LoRA tensor cleanup method properly exposed.The
remove_lora_tensors()
method is correctly bound to expose the underlying C++ functionality for clearing LoRA weights and configuration tensors from requests.tests/unittest/llmapi/test_llm.py (3)
1456-1475
: LGTM! Well-structured helper function.The helper function properly consolidates the common LoRA adapter eviction and loading test logic. The LoraConfig is correctly configured with target modules and cache parameters, and the BuildConfig setup looks appropriate for the test scenarios.
1477-1488
: LGTM! Clear and well-documented test function.The test function is properly structured with appropriate GPU memory requirements decorator and clear documentation. The parameter combination effectively tests the eviction and reloading scenario for LoRA adapters in GPU cache.
1490-1501
: LGTM! Well-designed test for new adapter loading.The test function effectively validates the eviction and loading of new adapters across multiple generate calls. The parameter configuration properly exercises the scenario where GPU cache size is smaller than CPU cache size, testing the adapter management logic comprehensively.
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (4)
121-121
: LGTM! Well-implemented property addition.The
lora_path
property follows the established pattern of other LoRA-related properties and uses appropriate getter/setter methods.
272-272
: LGTM! Constructor parameter properly integrated.The
lora_path
parameter is correctly added as an optional string, properly passed to the underlying constructor, and given an appropriate default value.Also applies to: 325-325, 342-342
319-319
: Good documentation practice.Updating the parameter count comment helps maintain clarity about the constructor's complexity.
378-378
: LGTM! Method binding follows established patterns.The
remove_lora_tensors
method is properly exposed using standard nanobind patterns and provides clear functionality for LoRA tensor management.tests/unittest/llmapi/test_llm_pytorch.py (5)
19-21: Good import cleanup. Removing unused imports after the test refactoring improves code maintainability.
207-226: Well-designed helper function. The helper function effectively reduces code duplication while providing clear parameters for customizing different LoRA eviction/loading test scenarios. The LoRA configuration setup is appropriate for the test use cases.
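A rough shape of that helper plus one of the explicit tests built on it (only the function names come from this PR; the parameter names, LoraConfig fields, and values are illustrative):

```python
# Illustrative sketch; only the helper/test names come from this PR, the
# parameters and LoraConfig fields below are assumptions.
from tensorrt_llm.lora_manager import LoraConfig


def _check_llama_7b_multi_lora_evict_load_new_adapters(
        lora_adapter_count_per_call, max_loras, max_cpu_loras):
    lora_config = LoraConfig(
        lora_target_modules=["attn_q", "attn_k", "attn_v"],  # illustrative
        max_lora_rank=8,
        max_loras=max_loras,          # GPU LoRA cache capacity, in adapters
        max_cpu_loras=max_cpu_loras,  # CPU LoRA cache capacity, in adapters
    )
    # The helper would build an LLM with lora_config, issue one generate()
    # call per entry in lora_adapter_count_per_call (each call using that
    # many distinct adapters), and compare outputs against reference texts.
    ...


def test_llama_7b_multi_lora_evict_and_reload_lora_gpu_cache():
    """GPU cache smaller than the adapters used per call, forcing eviction and reload."""
    _check_llama_7b_multi_lora_evict_load_new_adapters(
        lora_adapter_count_per_call=[2], max_loras=1, max_cpu_loras=2)
```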
229-238
: Well-documented test with clear purpose.The test function clearly describes its scenario and uses appropriate parameters to test LoRA GPU cache eviction and reloading behavior.
242-251
: Clear test scenario with appropriate parameters.The test effectively validates LoRA adapter loading behavior across multiple calls with different cache constraints. The docstring clearly explains the expected behavior.
255-273
: Excellent comprehensive test with detailed documentation.This test function covers the most complex LoRA eviction/reloading scenario with an exceptionally clear docstring that explains the expected cache behavior step-by-step. The parameters are well-chosen to create the intended test conditions.
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (11)
101-116
: LGTM: LoRA adapter path parameter addition is correct.The new
loraAdapterPath
parameter is properly added with the correct type, default value, and logical positioning after other LoRA-related parameters.
166-166
: LGTM: Proper initialization of LoRA adapter path member.The member variable initialization correctly uses
std::move
and is positioned logically in the initializer list.
204-213
: LGTM: Consistent LoRA adapter path parameter addition.The parameter addition maintains consistency with the first constructor, using the same type and default value.
245-245
: LGTM: Consistent member initialization.The initialization pattern matches the first constructor and correctly uses move semantics.
397-400
: LGTM: Proper integration with executor::Request.The code correctly extracts the LoRA adapter path from the executor::Request's LoRA config when available, maintaining proper null safety.
922-930
: LGTM: Well-implemented accessor methods.The getter and setter methods follow the established patterns with proper const correctness, [[nodiscard]] attribute, and move semantics.
1924-1924
: LGTM: Proper member variable declaration.The member variable follows the class naming conventions and is properly initialized with a default value.
2179-2194
: LGTM: LlmRequest constructor properly updated.The first LlmRequest constructor correctly adds the
loraPath
parameter and forwards it to the base class. The parameter count comment is accurate.
2231-2246
: LGTM: Second LlmRequest constructor properly updated.The second constructor maintains consistency with the first, correctly adding and forwarding the
loraPath
parameter.
2299-2308
: LGTM: Third LlmRequest constructor properly updated.The third constructor is consistently updated with the new parameter and maintains proper forwarding to the base class.
2358-2358
: LGTM: Clear method declaration for LoRA tensor cleanup.The
removeLoraTensors()
method declaration is appropriately named and positioned for clearing LoRA-related tensors from requests.cpp/tensorrt_llm/pybind/batch_manager/llmRequest.h (2)
53-68
: LGTM: Consistent pybind constructor parameter addition.The
loraPath
parameter is correctly added with proper type, default value, and the parameter count comment is accurately updated.
118-118
: LGTM: Proper parameter forwarding to base class.The
loraPath
parameter is correctly forwarded to the base class constructor, maintaining the established pattern.cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (2)
54-54
: LGTM: Parameter count updated correctly.The comment accurately reflects the addition of the new
loraPath
parameter, bringing the total to 50.
119-119
: LGTM: Parameter correctly forwarded to base class.The
loraPath
parameter is properly passed to the base class constructor in the correct position within the member initializer list.
PR_Github #13642 [ run ] completed with state |
4041c53
to
51a77a1
Compare
/bot run |
PR_Github #13677 [ run ] triggered by Bot |
/bot run |
PR_Github #13677 [ run ] completed with state |
PR_Github #13678 [ run ] triggered by Bot |
PR_Github #13678 [ run ] completed with state |
/bot run |
PR_Github #13699 [ run ] triggered by Bot |
PR_Github #13699 [ run ] completed with state |
db86a37
to
17a8ddf
Compare
/bot run |
PR_Github #13864 [ run ] triggered by Bot |
PR_Github #13864 [ run ] completed with state |
/bot run |
PR_Github #13875 [ run ] triggered by Bot |
PR_Github #13875 [ run ] completed with state |
0e93f46
to
c7a3076
Compare
/bot run |
PR_Github #14265 [ run ] triggered by Bot |
approving on behalf of nemotron-code-owners
LGTM
BoYang hasn't answered us yet, perhaps we should ping him (about the minimum cache requirement). Other than that, looks great.
PR_Github #14265 [ run ] completed with state |
…when it's not in LoRA CPU cache and it doesn't have weights attached, worker.py _enqueue_request now always passes the LoRA config tensor along with the request, Added path to LoraConfig, removed 'reuse after cpu veiction not supported' exception message, refactored LoRA tests Signed-off-by: Amit Zuker <[email protected]>
…gument Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
…th argument Signed-off-by: Amit Zuker <[email protected]>
…igTest accordingly Signed-off-by: Amit Zuker <[email protected]>
…lasses, revert lora path changes in CPP Signed-off-by: Amit Zuker <[email protected]>
…rs, added a test that verifies the flow it's required in Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
…lmRequest instance instead of bindings.internal.batch_manager.LlmRequest Signed-off-by: Amit Zuker <[email protected]>
d2862c2
to
6173aab
Compare
/bot run |
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/unittest/llmapi/test_llm_pytorch.py (1)
430-430: Memory skip decorators are consistent, but consider CI assignment. The @skip_gpu_memory_less_than_40gb decorators are consistent with existing test patterns and prevent failures on insufficient hardware. However, as noted in past review comments, consider ensuring these tests are assigned to appropriate CI machines to avoid them being consistently skipped. This is more of an infrastructure/CI configuration concern than a code issue, but it's worth tracking to ensure the LoRA reload functionality gets proper test coverage in CI.
Also applies to: 394-394
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1 hunk)
- cpp/tensorrt_llm/batch_manager/llmRequest.cpp (1 hunk)
- cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1 hunk)
- cpp/tensorrt_llm/executor/loraConfig.cpp (1 hunk)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunk)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1 hunk)
- cpp/tests/unit_tests/executor/loraConfigTest.cpp (1 hunk)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunk)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (4 hunks)
- tensorrt_llm/_utils.py (2 hunks)
- tensorrt_llm/executor/worker.py (3 hunks)
- tests/unittest/_torch/test_resource_manager.py (7 hunks)
- tests/unittest/llmapi/test_llm.py (2 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (4 hunks)
- tests/unittest/utils/util.py (1 hunk)
✅ Files skipped from review due to trivial changes (2)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
- tensorrt_llm/_utils.py
🚧 Files skipped from review as they are similar to previous changes (11)
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- tensorrt_llm/_torch/pyexecutor/_util.py
- cpp/tensorrt_llm/batch_manager/llmRequest.cpp
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- cpp/tests/unit_tests/executor/loraConfigTest.cpp
- tests/unittest/_torch/test_resource_manager.py
- tensorrt_llm/executor/worker.py
- tests/unittest/utils/util.py
- cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
- tests/unittest/llmapi/test_llm.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is highly encouraged for clarity.
Enumerations, global constants, static constants at class-scope and fu...
Files:
cpp/tensorrt_llm/executor/loraConfig.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/executor/loraConfig.cpp
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tests/unittest/llmapi/test_llm_pytorch.py
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tests/unittest/llmapi/test_llm_pytorch.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
cpp/tensorrt_llm/executor/loraConfig.cpp
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tests/unittest/llmapi/test_llm_pytorch.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
1174-1210: LGTM: Constructor enhancement for LoRA support. The constructor correctly initializes the LoRA-related components (LoraModelConfig and LoraManager) using the provided lora_config parameter. The initialization follows proper patterns and integrates well with the existing C++ backend setup.
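A minimal sketch of that initialization (the attribute names follow the review; the import path, the remaining constructor arguments, and the LoraModelConfig fields are assumptions):

```python
# Illustrative sketch only; argument order/names of LoraModelConfig and the
# other PeftCacheManager constructor arguments are assumptions.
from tensorrt_llm.lora_manager import LoraConfig, LoraManager, LoraModelConfig


class PeftCacheManagerSketch:
    def __init__(self, lora_config: LoraConfig, hidden_size: int, dtype: str):
        self._lora_config = lora_config
        # Model-level LoRA metadata: target modules, HF module-name mapping,
        # hidden size and dtype, as listed in the comment above.
        self._lora_model_config = LoraModelConfig(
            lora_config.lora_target_modules,
            lora_config.trtllm_modules_to_hf_modules,
            hidden_size,
            dtype,
        )
        self._lora_manager = LoraManager()
```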
1212-1235: Review LoRA caching logic for potential edge cases. The implementation looks sound, but consider these potential issues:
- Race condition: Similar to the known issue in tensorrt_llm/executor/worker.py, there could be a race between checking is_task_cached() and the actual cache state when the request is processed.
- Error handling: No validation that lora_task_id exists when py_lora_path is provided, or that the path is valid before attempting to load.
- Memory management: The unsqueeze operations create new tensors - ensure the original tensors are properly cleaned up if needed.
Consider adding validation and error handling:
 def add_request_peft(self, request: LlmRequest):
     if request.lora_task_id is not None:
         is_task_cached = self.impl.is_task_cached(request.lora_task_id)
         if is_task_cached:
             request.remove_lora_tensors()
-        elif request.lora_weights is None and request.py_lora_path:
+        elif request.lora_weights is None and request.py_lora_path:
+            if not request.py_lora_path.exists():
+                raise FileNotFoundError(f"LoRA adapter path not found: {request.py_lora_path}")
             self._lora_manager.load_from_ckpt(
1254-1254: LGTM: Good refactoring to centralize LoRA handling. The change to call add_request_peft() centralizes the LoRA tensor handling logic and makes the code more maintainable. This follows the single responsibility principle well.
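A minimal sketch of that call pattern (the scheduled_batch.context_requests attribute is an assumption; only the method names come from this review):

```python
# Illustrative sketch of the centralized call described above.
def prepare_resources(self, scheduled_batch):
    for req in scheduled_batch.context_requests:
        self.add_request_peft(req)
    # ... existing C++ cache bookkeeping would continue here.
```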
cpp/tensorrt_llm/executor/loraConfig.cpp (1)
30-56
: LGTM: Improved validation logic allows config-only LoRA. The refactored validation correctly implements the requirement to allow LoRA config without weights while preventing weights without config. The separation into two conditional blocks improves readability and flexibility.
Key improvements:
- Line 30-40: Validates config tensor independently
- Line 41-56: Validates weights tensor and enforces config requirement (line 45)
- Maintains all existing tensor validation checks
This aligns with the past review comments and supports the new LoRA reload functionality.
tests/unittest/llmapi/test_llm_pytorch.py (2)
314-333
: LGTM: Good refactoring from parameterized to explicit tests. Converting the parameterized test to a helper function with explicit test cases improves test clarity and maintainability. Each test now has a descriptive name and clear intent.
336-391
: Comprehensive test coverage for LoRA reload functionality. The four new test functions provide excellent coverage of the LoRA reload scenarios:
- GPU cache eviction/reload - Tests basic eviction and reuse within single call
- CPU/GPU cache interaction - Tests complex cache hierarchy behavior
- Cache hit after insert - Verifies successful caching and retrieval
- Complex eviction scenario - Tests the core PR functionality with detailed expected behavior
The detailed docstrings clearly explain each test's purpose and expected behavior, making the test suite maintainable. The parameters are well-chosen to exercise different cache configurations and eviction patterns.
PR_Github #14307 [ run ] triggered by Bot |
PR_Github #14307 [ run ] completed with state |
…IDIA#6510) Signed-off-by: Amit Zuker <[email protected]>
Description
Added support for reusing a LoRA adapter after it was evicted from the LoRA CPU cache:
- Added a py_lora_path optional field to the Request and LlmRequest classes (only in Python).
- PeftCacheManager.add_request_peft now loads the LoRA adapter if it is not already loaded in the cache.

Test Coverage
tests/unittest/llmapi/test_llm_pytorch.py::test_llama_7b_multi_lora_evict_and_reload_evicted_adapters_in_cpu_and_gpu_cache
tests/unittest/llmapi/test_llm_pytorch.py::test_llama_7b_multi_lora_read_from_cache_after_insert
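A hedged end-to-end sketch of the behavior this enables; the import paths, LoraConfig fields, and LoRARequest signature below are assumptions based on the files touched by this PR, and the model/adapter paths and prompts are placeholders:

```python
# Illustrative sketch, not taken from the PR: adapter-1 is used, evicted from
# both LoRA caches by adapter-2 (tiny cache sizes), then reused, which should
# now trigger a reload from its checkpoint path instead of an error.
from tensorrt_llm import LLM
from tensorrt_llm.executor.request import LoRARequest
from tensorrt_llm.lora_manager import LoraConfig

lora_config = LoraConfig(max_loras=1, max_cpu_loras=1)  # force eviction
llm = LLM(model="/path/to/llama-7b", lora_config=lora_config)

req_1 = LoRARequest("adapter-1", 1, "/path/to/adapter_1")
req_2 = LoRARequest("adapter-2", 2, "/path/to/adapter_2")

llm.generate(["prompt 1"], lora_request=[req_1])  # adapter-1 loaded
llm.generate(["prompt 2"], lora_request=[req_2])  # adapter-1 evicted
llm.generate(["prompt 3"], lora_request=[req_1])  # adapter-1 reloaded from its path
```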
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
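For example, a typical invocation that runs only one test stage with fail-fast disabled (the stage name is illustrative): /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast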
- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Tests