[TRTLLM-5826][feat] Support pytorch LoRA adapter eviction #5616
Merged: shaharmor98 merged 32 commits into NVIDIA:main from amitz-nv:dev-support-pytorch-lora-adapter-eviction on Jul 20, 2025
Commits (32 total; the diff below shows changes from 30 of them):
All commits are by amitz-nv.
b029c0e  Fixed BindCapacityScheduler to pass peft_cache_manager to the CPP bin…
c79b271  Removed unnecessary changes in test_llm.py
d193d30  Refactored LoRA eviction tests
28d86b1  Added type hint to peft_cache_manager
b87d58a  Add forgotten llm_args in test_llm.py, fix formatting in test_llm_pyt…
a7d6ea5  Pass peft_cache_manager=None to BindCapacityScheduler in create_autod…
0ebe6aa  Fix target name in test
728b32f  Changed GuaranteedNoEvictScheduler to try call peftCacheManager->dete…
26923b6  Format comments in tests
f4875fa  Remove debug prints from test
f5ca3b4  Update missingPeftTask CPP test to expect the error message starts wi…
82e18f9  Refactored shared lora test logic into lora_test_utils.py
9cc4cc4  PeftCacheManager::determineNumPages throws exception with 'not suppor…
e16aae7  Add docstring to check_multi_unique_lora_adapters_from_request
8758975  Fix imports of test_llm.py
bdfa780  Improved check_multi_unique_lora_adapters_from_request docstring
2c3b771  Fix imports in test_llm_multi_gpu.py and in test_llm_multi_gpu_pytorc…
6625682  Revert changes in _TrtLLM._build_model, move LLM creation to test so …
ed68f49  Change the 'should include adapter weights with request' to be based …
b1d0bf6  test_llm_pytorch.py - Minor docstring fix, readability improvement
b0f91f2  Update test_llm_multi_gpu_pytorch.py to also disable cuda_graph until…
10149b7  Fix formatting of lora_test_utils.py
9e9e02e  Improve test case documentation
f3c330c  Fix docstring of is_adapter_in_cpu_cache
abad5c4  Add 'is_task_cached' method binding to CPP PeftCacheManager class
b6e99e9  Improve comment over not supporting LoRA optimization in TRT-python flow
b9d6c9e  Change cpp_peft_cache_manager argument in LoraManager constructor to …
6189c47  Fix typo in lora test
e4ff01a  Revert added note in exception message in TRT flow, as the LoRA optim…
6e0b872  Fix LLM args in multi GPU LoRA tests
1085670  Improve resource release in test util function run_function_in_sub_pr…
352f429  Improve formatting - split long import line
New file (116 added lines). Per commits 82e18f9 and 10149b7, this is the shared lora_test_utils.py module:
@@ -0,0 +1,116 @@
from typing import OrderedDict, Type

from utils.llm_data import llm_models_root
from utils.util import duplicate_list_to_length, flatten_list, similar

from tensorrt_llm import SamplingParams
from tensorrt_llm.executor.request import LoRARequest
from tensorrt_llm.llmapi.llm import BaseLLM


def check_llama_7b_multi_unique_lora_adapters_from_request(
        lora_adapter_count_per_call: list[int], repeat_calls: int,
        repeats_per_call: int, llm_class: Type[BaseLLM], **llm_kwargs):
    """Calls llm.generate such that for each C in lora_adapter_count_per_call, llm.generate is called with C
    requests repeated 'repeats_per_call' times, where each request is configured with a unique LoRA adapter ID.
    This entire process is repeated 'repeat_calls' times with the same requests.
    Asserts that every output of each llm.generate call is similar to its expected reference.
    """  # noqa: D205

    total_lora_adapters = sum(lora_adapter_count_per_call)
    hf_model_dir = f"{llm_models_root()}/llama-models/llama-7b-hf"
    hf_lora_dirs = [
        f"{llm_models_root()}/llama-models/luotuo-lora-7b-0.1",
        f"{llm_models_root()}/llama-models/Japanese-Alpaca-LoRA-7b-v0"
    ]
    # Each prompt should have a reference for every LoRA adapter dir (in the same order as in hf_lora_dirs)
    prompt_to_references = OrderedDict({
        "美国的首都在哪里? \n答案:": [
            "美国的首都是华盛顿。\n\n美国的",
            "纽约\n\n### カンファレンスの",
        ],
        "アメリカ合衆国の首都はどこですか? \n答え:": [
            "华盛顿。\n\n英国の首都是什",
            "ワシントン\nQ1. アメリカ合衆国",
        ],
    })

    prompts_to_generate = duplicate_list_to_length(
        flatten_list([[prompt] * len(hf_lora_dirs)
                      for prompt in prompt_to_references.keys()]),
        total_lora_adapters)
    references = duplicate_list_to_length(
        flatten_list(list(prompt_to_references.values())), total_lora_adapters)
    # Give each request a unique adapter ID, cycling through the available LoRA dirs
    lora_requests = [
        LoRARequest(str(i), i, hf_lora_dirs[i % len(hf_lora_dirs)])
        for i in range(total_lora_adapters)
    ]

    llm = llm_class(hf_model_dir, **llm_kwargs)

    # Perform repeats of the same requests to test reuse and reload of adapters previously unloaded from cache
    try:
        for _ in range(repeat_calls):
            last_idx = 0
            for adapter_count in lora_adapter_count_per_call:
                sampling_params = SamplingParams(max_tokens=20)
                outputs = llm.generate(
                    prompts_to_generate[last_idx:last_idx + adapter_count] *
                    repeats_per_call,
                    sampling_params,
                    lora_request=lora_requests[last_idx:last_idx +
                                               adapter_count] *
                    repeats_per_call)
                for output, ref in zip(
                        outputs, references[last_idx:last_idx + adapter_count] *
                        repeats_per_call):
                    assert similar(output.outputs[0].text, ref)
                last_idx += adapter_count
    finally:
        llm.shutdown()
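The helper above is meant to be driven from the backend-specific pytest modules (see test_llm_pytorch.py and test_llm_multi_gpu_pytorch.py in the commit list). A minimal sketch of such a call follows; it is not part of this diff, and the LLM import path and the omitted LoRA/cache-sizing kwargs are assumptions for illustration:

    # Sketch only: a real eviction test must also pass LoRA-enabling and
    # cache-sizing kwargs to the LLM constructor (their exact names depend on
    # the backend configuration and are omitted here).
    from tensorrt_llm import LLM  # assumed import path for the LLM class

    def test_lora_adapter_eviction_sketch():
        # Two generate calls with 2 unique adapters each; running the whole
        # sequence twice forces adapters evicted after the first pass to be
        # reloaded on the second pass.
        check_llama_7b_multi_unique_lora_adapters_from_request(
            lora_adapter_count_per_call=[2, 2],
            repeat_calls=2,
            repeats_per_call=1,
            llm_class=LLM)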

def check_llama_7b_multi_lora_from_request_test_harness(
        llm_class: Type[BaseLLM], **llm_kwargs) -> None:
    hf_model_dir = f"{llm_models_root()}/llama-models/llama-7b-hf"
    hf_lora_dir1 = f"{llm_models_root()}/llama-models/luotuo-lora-7b-0.1"
    hf_lora_dir2 = f"{llm_models_root()}/llama-models/Japanese-Alpaca-LoRA-7b-v0"
    # Prompts ask for the capital of the USA, first in Chinese and then in Japanese
    prompts = [
        "美国的首都在哪里? \n答案:",
        "美国的首都在哪里? \n答案:",
        "美国的首都在哪里? \n答案:",
        "アメリカ合衆国の首都はどこですか? \n答え:",
        "アメリカ合衆国の首都はどこですか? \n答え:",
        "アメリカ合衆国の首都はどこですか? \n答え:",
    ]
    # Expected outputs per prompt, in request order: base model (no adapter),
    # luotuo adapter, Japanese-Alpaca adapter
    references = [
        "沃尔玛\n\n## 新闻\n\n* ",
        "美国的首都是华盛顿。\n\n美国的",
        "纽约\n\n### カンファレンスの",
        "Washington, D.C.\nWashington, D.C. is the capital of the United",
        "华盛顿。\n\n英国の首都是什",
        "ワシントン\nQ1. アメリカ合衆国",
    ]
    key_words = [
        "沃尔玛",
        "华盛顿",
        "纽约",
        "Washington",
        "华盛顿",
        "ワシントン",
    ]
    lora_req1 = LoRARequest("luotuo", 1, hf_lora_dir1)
    lora_req2 = LoRARequest("Japanese", 2, hf_lora_dir2)
    sampling_params = SamplingParams(max_tokens=20)

    llm = llm_class(hf_model_dir, **llm_kwargs)
    try:
        outputs = llm.generate(prompts,
                               sampling_params,
                               lora_request=[
                                   None, lora_req1, lora_req2, None, lora_req1,
                                   lora_req2
                               ])
    finally:
        llm.shutdown()
    for output, ref, key_word in zip(outputs, references, key_words):
        assert similar(output.outputs[0].text,
                       ref) or key_word in output.outputs[0].text
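Because this harness also receives the concrete LLM class plus arbitrary constructor kwargs, the TRT and PyTorch test modules can share it unchanged. A hedged sketch of a PyTorch-flow caller (the import path and the omitted LoRA kwargs are assumptions):

    from tensorrt_llm import LLM  # assumed import path

    def test_llama_7b_multi_lora_pytorch_sketch():
        # A real test would also pass LoRA-enabling kwargs (names omitted here).
        check_llama_7b_multi_lora_from_request_test_harness(LLM)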