# LoRA (Low-Rank Adaptation)

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adapts large language models to specific tasks without modifying the original model weights. Instead of fine-tuning all parameters, LoRA trains small low-rank decomposition matrices whose product is added to the frozen weights at inference time.
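
Concretely, following the original LoRA paper, a frozen weight matrix $W \in \mathbb{R}^{d \times k}$ is adapted by two trainable low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$:

$$
h = W x + \frac{\alpha}{r} B A x
$$

where $\alpha$ is a constant scaling hyperparameter. Only $A$ and $B$ are stored per adapter, which is what makes serving many task-specific adapters on one base model cheap.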

## Table of Contents

1. [Background](#background)
2. [Basic Usage](#basic-usage)
   - [Single LoRA Adapter](#single-lora-adapter)
   - [Multi-LoRA Support](#multi-lora-support)
3. [Advanced Usage](#advanced-usage)
   - [LoRA with Quantization](#lora-with-quantization)
   - [NeMo LoRA Format](#nemo-lora-format)
   - [Cache Management](#cache-management)
4. [TRTLLM serve with LoRA](#trtllm-serve-with-lora)
   - [YAML Configuration](#yaml-configuration)
   - [Starting the Server](#starting-the-server)
   - [Client Usage](#client-usage)
5. [TRTLLM bench with LoRA](#trtllm-bench-with-lora)
   - [YAML Configuration](#yaml-configuration-1)
   - [Run trtllm-bench](#run-trtllm-bench)

## Background

The PyTorch backend provides LoRA support, allowing you to:

- Load and apply multiple LoRA adapters simultaneously
- Switch between different adapters for different requests
- Use LoRA with quantized models
- Use both HuggingFace and NeMo LoRA checkpoint formats

## Basic Usage

### Single LoRA Adapter

```python
from tensorrt_llm import LLM
from tensorrt_llm.lora_manager import LoraConfig
from tensorrt_llm.executor.request import LoRARequest
from tensorrt_llm.sampling_params import SamplingParams

# Configure LoRA
lora_config = LoraConfig(
    lora_dir=["/path/to/lora/adapter"],
    max_lora_rank=8,
    max_loras=1,
    max_cpu_loras=1
)

# Initialize LLM with LoRA support
llm = LLM(
    model="/path/to/base/model",
    lora_config=lora_config
)

# Create LoRA request
lora_request = LoRARequest("my-lora-task", 0, "/path/to/lora/adapter")

# Generate with LoRA
prompts = ["Hello, how are you?"]
sampling_params = SamplingParams(max_tokens=50)

outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=[lora_request]
)
```
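
To inspect the results, each element of `outputs` exposes the original prompt and its completions, following the LLM API's usual conventions:

```python
for output in outputs:
    # Each output carries the prompt and a list of generated completions
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")
```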

### Multi-LoRA Support

```python
# Configure for multiple LoRA adapters
lora_config = LoraConfig(
    lora_target_modules=['attn_q', 'attn_k', 'attn_v'],
    max_lora_rank=8,
    max_loras=4,
    max_cpu_loras=8
)

llm = LLM(model="/path/to/base/model", lora_config=lora_config)

# Create multiple LoRA requests
lora_req1 = LoRARequest("task-1", 0, "/path/to/adapter1")
lora_req2 = LoRARequest("task-2", 1, "/path/to/adapter2")

prompts = [
    "Translate to French: Hello world",
    "Summarize: This is a long document..."
]

# Apply different LoRAs to different prompts
# (reuses the SamplingParams from the single-adapter example)
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=[lora_req1, lora_req2]
)
```
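
Note that the `lora_request` list is applied positionally: the adapter at index `i` is used for the prompt at index `i`.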

## Advanced Usage

### LoRA with Quantization

```python
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization.mode import QuantAlgo

# Configure quantization
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8
)

# LoRA works with quantized models
# (reuses the lora_config defined in the examples above)
llm = LLM(
    model="/path/to/model",
    quant_config=quant_config,
    lora_config=lora_config
)
```

### NeMo LoRA Format

```python
# For NeMo-format LoRA checkpoints
lora_config = LoraConfig(
    lora_dir=["/path/to/nemo/lora"],
    lora_ckpt_source="nemo",
    max_lora_rank=8
)

lora_request = LoRARequest(
    "nemo-task",
    0,
    "/path/to/nemo/lora",
    lora_ckpt_source="nemo"
)
```

### Cache Management

```python
from tensorrt_llm.llmapi.llm_args import PeftCacheConfig

# Fine-tune cache sizes
peft_cache_config = PeftCacheConfig(
    host_cache_size=1024*1024*1024,  # 1 GB CPU cache
    device_cache_percent=0.1  # 10% of GPU memory
)

llm = LLM(
    model="/path/to/model",
    lora_config=lora_config,
    peft_cache_config=peft_cache_config
)
```
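
Roughly speaking, `max_loras` and `max_cpu_loras` in `LoraConfig` bound how many adapters the GPU and CPU caches hold, while `PeftCacheConfig` sets the memory budget behind those caches.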

## TRTLLM serve with LoRA

### YAML Configuration

Create an `extra_llm_api_options.yaml` file:

```yaml
lora_config:
  lora_target_modules: ['attn_q', 'attn_k', 'attn_v']
  max_lora_rank: 8
```

### Starting the Server

```bash
python -m tensorrt_llm.commands.serve /path/to/model \
    --extra_llm_api_options extra_llm_api_options.yaml
```

### Client Usage

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.completions.create(
    model="/path/to/model",
    prompt="What is the capital city of France?",
    max_tokens=20,
    extra_body={
        "lora_request": {
            "lora_name": "lora-example-0",
            "lora_int_id": 0,
            "lora_path": "/path/to/lora_adapter"
        }
    },
)
```
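
The response is a standard OpenAI completion object, so the generated text is available as `response.choices[0].text`.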

## TRTLLM bench with LoRA

### YAML Configuration

Create an `extra_llm_api_options.yaml` file:

```yaml
lora_config:
  lora_dir:
    - /workspaces/tensorrt_llm/loras/0
  max_lora_rank: 64
  max_loras: 8
  max_cpu_loras: 8
  lora_target_modules:
    - attn_q
    - attn_k
    - attn_v
  trtllm_modules_to_hf_modules:
    attn_q: q_proj
    attn_k: k_proj
    attn_v: v_proj
```
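
The `trtllm_modules_to_hf_modules` mapping translates TRT-LLM's internal module names (`attn_q`, `attn_k`, `attn_v`) to the corresponding HuggingFace projection names (`q_proj`, `k_proj`, `v_proj`) found in the adapter checkpoint.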
219+
220+
### Run trtllm-bench
221+
222+
```bash
223+
trtllm-bench --model $model_path throughput --dataset $dataset_path --extra_llm_api_options extra-llm-api-options.yaml --num_requests 64 --concurrency 16
224+
```
