# LoRA (Low-Rank Adaptation)

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adapts large language models to specific tasks without changing the original model weights. Instead of fine-tuning all parameters, LoRA trains small low-rank decomposition matrices whose product is added to the frozen weights at inference time.

## Table of Contents
1. [Background](#background)
2. [Basic Usage](#basic-usage)
   - [Single LoRA Adapter](#single-lora-adapter)
   - [Multi-LoRA Support](#multi-lora-support)
3. [Advanced Usage](#advanced-usage)
   - [LoRA with Quantization](#lora-with-quantization)
   - [NeMo LoRA Format](#nemo-lora-format)
   - [Cache Management](#cache-management)
4. [TRTLLM serve with LoRA](#trtllm-serve-with-lora)
   - [YAML Configuration](#yaml-configuration)
   - [Starting the Server](#starting-the-server)
   - [Client Usage](#client-usage)
5. [TRTLLM bench with LoRA](#trtllm-bench-with-lora)
   - [YAML Configuration](#yaml-configuration-1)
   - [Run trtllm-bench](#run-trtllm-bench)

## Background

The PyTorch backend provides LoRA support, allowing you to:
- Load and apply multiple LoRA adapters simultaneously
- Switch between different adapters for different requests
- Use LoRA with quantized models
- Support both HuggingFace and NeMo LoRA formats
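
The core idea is easiest to see in isolation. The snippet below is a plain PyTorch sketch of the low-rank update, not TensorRT-LLM code; the layer size, rank, and scaling factor are illustrative assumptions.

```python
import torch

# Frozen pretrained weight W and a trainable low-rank update B @ A of rank r.
d_in, d_out, r, alpha = 4096, 4096, 8, 16   # illustrative sizes, rank, and LoRA alpha
W = torch.randn(d_out, d_in)                # frozen during fine-tuning
A = torch.randn(r, d_in) * 0.01             # trainable, rank r
B = torch.zeros(d_out, r)                   # trainable, initialized to zero

x = torch.randn(d_in)

# Forward pass: base projection plus the scaled low-rank correction.
y = W @ x + (alpha / r) * (B @ (A @ x))
```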

## Basic Usage

### Single LoRA Adapter

```python
from tensorrt_llm import LLM
from tensorrt_llm.lora_manager import LoraConfig
from tensorrt_llm.executor.request import LoRARequest
from tensorrt_llm.sampling_params import SamplingParams

# Configure LoRA
lora_config = LoraConfig(
    lora_dir=["/path/to/lora/adapter"],
    max_lora_rank=8,
    max_loras=1,
    max_cpu_loras=1
)

# Initialize LLM with LoRA support
llm = LLM(
    model="/path/to/base/model",
    lora_config=lora_config
)

# Create LoRA request
lora_request = LoRARequest("my-lora-task", 0, "/path/to/lora/adapter")

# Generate with LoRA
prompts = ["Hello, how are you?"]
sampling_params = SamplingParams(max_tokens=50)

outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=[lora_request]
)
```
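
The returned objects can be printed to inspect the completions; this assumes the usual `RequestOutput` layout from the LLM API, with one completion per prompt.

```python
# Print the generated text for each prompt.
for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Completion: {output.outputs[0].text!r}")
```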

### Multi-LoRA Support

```python
# Configure for multiple LoRA adapters
lora_config = LoraConfig(
    lora_target_modules=['attn_q', 'attn_k', 'attn_v'],
    max_lora_rank=8,
    max_loras=4,
    max_cpu_loras=8
)

llm = LLM(model="/path/to/base/model", lora_config=lora_config)

# Create multiple LoRA requests
lora_req1 = LoRARequest("task-1", 0, "/path/to/adapter1")
lora_req2 = LoRARequest("task-2", 1, "/path/to/adapter2")

prompts = [
    "Translate to French: Hello world",
    "Summarize: This is a long document..."
]

# Apply different LoRAs to different prompts
# (sampling_params as defined in the previous example)
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=[lora_req1, lora_req2]
)
```
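
Results come back in prompt order, so each completion can be paired with the adapter that produced it (the `lora_name` attribute is assumed to hold the name passed as the first `LoRARequest` argument):

```python
# Pair each completion with the adapter that served it.
for req, output in zip([lora_req1, lora_req2], outputs):
    print(f"[{req.lora_name}] {output.outputs[0].text!r}")
```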

## Advanced Usage

### LoRA with Quantization

```python
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization.mode import QuantAlgo

# Configure quantization
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8
)

# LoRA works with quantized models
llm = LLM(
    model="/path/to/model",
    quant_config=quant_config,
    lora_config=lora_config
)
```
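
Generation then works exactly as in the earlier examples; the same `LoRARequest` and `SamplingParams` classes imported above can be reused:

```python
# Generate with LoRA on top of the quantized base model.
outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(max_tokens=50),
    lora_request=[LoRARequest("my-lora-task", 0, "/path/to/lora/adapter")]
)
```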

### NeMo LoRA Format

```python
# For NeMo-format LoRA checkpoints
lora_config = LoraConfig(
    lora_dir=["/path/to/nemo/lora"],
    lora_ckpt_source="nemo",
    max_lora_rank=8
)

lora_request = LoRARequest(
    "nemo-task",
    0,
    "/path/to/nemo/lora",
    lora_ckpt_source="nemo"
)
```

### Cache Management

```python
from tensorrt_llm.llmapi.llm_args import PeftCacheConfig

# Fine-tune cache sizes
peft_cache_config = PeftCacheConfig(
    host_cache_size=1024*1024*1024,  # 1GB CPU cache
    device_cache_percent=0.1         # 10% of GPU memory
)

llm = LLM(
    model="/path/to/model",
    lora_config=lora_config,
    peft_cache_config=peft_cache_config
)
```
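
To pick reasonable cache sizes, it helps to estimate the footprint of a single adapter from its rank and the target-module shapes. The numbers below are illustrative assumptions rather than values read from any particular model; the host cache should comfortably hold `max_cpu_loras` adapters of roughly this size.

```python
# Back-of-the-envelope size of one LoRA adapter (illustrative numbers).
hidden_size = 4096        # assumed model hidden dimension
num_layers = 32           # assumed number of transformer layers
num_target_modules = 3    # e.g. attn_q, attn_k, attn_v
rank = 8
bytes_per_param = 2       # fp16 / bf16

# Each adapted module holds A (rank x hidden) and B (hidden x rank).
params_per_module = 2 * rank * hidden_size
adapter_bytes = params_per_module * num_target_modules * num_layers * bytes_per_param
print(f"~{adapter_bytes / 1024**2:.1f} MiB per adapter")  # ~12 MiB with these numbers
```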

## TRTLLM serve with LoRA

### YAML Configuration

Create an `extra_llm_api_options.yaml` file:

```yaml
lora_config:
  lora_target_modules: ['attn_q', 'attn_k', 'attn_v']
  max_lora_rank: 8
```

### Starting the Server

```bash
python -m tensorrt_llm.commands.serve \
    /path/to/model \
    --extra_llm_api_options extra_llm_api_options.yaml
```

### Client Usage

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.completions.create(
    model="/path/to/model",
    prompt="What is the capital city of France?",
    max_tokens=20,
    extra_body={
        "lora_request": {
            "lora_name": "lora-example-0",
            "lora_int_id": 0,
            "lora_path": "/path/to/lora_adapter"
        }
    },
)
```
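
A request that omits `lora_request` from `extra_body` is served by the base model, so the same endpoint can handle both adapted and unadapted traffic:

```python
# Without "lora_request", the completion runs against the base model.
base_response = client.completions.create(
    model="/path/to/model",
    prompt="What is the capital city of France?",
    max_tokens=20,
)
print(base_response.choices[0].text)
```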

## TRTLLM bench with LoRA

### YAML Configuration

Create an `extra_llm_api_options.yaml` file:

```yaml
lora_config:
  lora_dir:
    - /workspaces/tensorrt_llm/loras/0
  max_lora_rank: 64
  max_loras: 8
  max_cpu_loras: 8
  lora_target_modules:
    - attn_q
    - attn_k
    - attn_v
  trtllm_modules_to_hf_modules:
    attn_q: q_proj
    attn_k: k_proj
    attn_v: v_proj
```
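
Here `lora_target_modules` uses TensorRT-LLM's internal module names, and `trtllm_modules_to_hf_modules` maps each of them to the corresponding projection name in the HuggingFace LoRA checkpoint (`q_proj`, `k_proj`, `v_proj` in this example), so the adapter weights can be matched to the right layers.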

### Run trtllm-bench

```bash
trtllm-bench --model $model_path throughput \
    --dataset $dataset_path \
    --extra_llm_api_options extra_llm_api_options.yaml \
    --num_requests 64 \
    --concurrency 16
```