
Commit b3955bc

review comments + docs
1 parent 4f91d4a commit b3955bc

4 files changed: 80 additions & 2 deletions

docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# Serving with trtllm-serve

AutoDeploy integrates with the OpenAI-compatible `trtllm-serve` CLI so you can expose AutoDeploy-optimized models over HTTP without writing server code. This page shows how to launch the server with the AutoDeploy backend, configure it via YAML, and validate it with a simple request.

## Quick start

Launch `trtllm-serve` with the AutoDeploy backend by setting `--backend _autodeploy`:

```bash
trtllm-serve \
  meta-llama/Llama-3.1-8B-Instruct \
  --backend _autodeploy
```

- `model`: Hugging Face model name or local path
- `--backend _autodeploy`: selects the AutoDeploy runtime

Once the server is ready, test it with an OpenAI-compatible request:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "system", "content": "You are a helpful assistant."},
                 {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
    "max_tokens": 32
  }'
```
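
The same request can be issued programmatically. Below is a minimal sketch using the OpenAI Python client; it assumes the `openai` package is installed and the server is listening on the default `localhost:8000` used above:

```python
# Minimal sketch: query the AutoDeploy-backed server with the OpenAI Python client.
# Assumes `pip install openai` and a server on localhost:8000 as in the curl example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York? Tell me in a single sentence."},
    ],
    max_tokens=32,
)
print(response.choices[0].message.content)
```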

## Configuration via YAML

Use `--extra_llm_api_options` to supply a YAML file that augments or overrides server and runtime settings.

```bash
trtllm-serve \
  meta-llama/Llama-3.1-8B \
  --backend _autodeploy \
  --extra_llm_api_options autodeploy_config.yaml
```

Example `autodeploy_config.yaml`:

```yaml
# Compilation backend for AutoDeploy
compile_backend: torch-opt  # options: torch-simple, torch-compile, torch-cudagraph, torch-opt

# Runtime engine
runtime: trtllm  # options: trtllm, demollm

# Model loading
skip_loading_weights: false  # set to true for architecture-only perf runs

# KV cache memory
free_mem_ratio: 0.8  # fraction of free GPU memory to use for the KV cache

# CUDA graph optimization
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64]

# Attention backend
attn_backend: flashinfer  # recommended for best performance
```
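
The same launch can be scripted. The following is a minimal sketch that writes the options above to a YAML file and starts the server with the documented flags; it assumes PyYAML is installed and `trtllm-serve` is on `PATH`:

```python
# Minimal sketch: generate autodeploy_config.yaml and launch trtllm-serve with it.
# Assumes `pip install pyyaml` and that `trtllm-serve` is available on PATH.
import subprocess

import yaml

autodeploy_options = {
    "compile_backend": "torch-opt",
    "runtime": "trtllm",
    "skip_loading_weights": False,
    "free_mem_ratio": 0.8,
    "cuda_graph_batch_sizes": [1, 2, 4, 8, 16, 32, 64],
    "attn_backend": "flashinfer",
}

with open("autodeploy_config.yaml", "w") as f:
    yaml.safe_dump(autodeploy_options, f)

# Blocks until the server process exits; run it in the background for real deployments.
subprocess.run(
    [
        "trtllm-serve",
        "meta-llama/Llama-3.1-8B",
        "--backend", "_autodeploy",
        "--extra_llm_api_options", "autodeploy_config.yaml",
    ],
    check=True,
)
```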

## Limitations and tips

- KV cache block reuse is disabled automatically for the AutoDeploy backend
- The AutoDeploy backend does not yet support disaggregated serving (work in progress)
- For best performance:
  - Prefer `compile_backend: torch-opt`
  - Use `attn_backend: flashinfer`
  - Set realistic `cuda_graph_batch_sizes` that match your expected traffic (see the sketch after this list)
  - Tune `free_mem_ratio` to 0.8–0.9
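
To check that a candidate batch size reflects real traffic, you can measure latency at that concurrency before adding it to `cuda_graph_batch_sizes`. This is a minimal sketch rather than part of the documented workflow; it assumes the `openai` package and the quick-start server above:

```python
# Minimal sketch: fire a burst of concurrent requests at one target concurrency
# to sanity-check a value you plan to list in `cuda_graph_batch_sizes`.
# Assumes `pip install openai` and a running server on localhost:8000.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
TARGET_CONCURRENCY = 8  # pick a value you expect to see in production


def one_request(i: int) -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": f"Say hello, request {i}."}],
        max_tokens=16,
    )
    return time.perf_counter() - start


with ThreadPoolExecutor(max_workers=TARGET_CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(TARGET_CONCURRENCY)))

print(f"mean latency at concurrency {TARGET_CONCURRENCY}: {sum(latencies) / len(latencies):.2f}s")
```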

## See also

- [AutoDeploy overview](../auto-deploy.md)
- [Benchmarking with trtllm-bench](./benchmarking_with_trtllm_bench.md)

docs/source/torch/auto_deploy/auto-deploy.md

Lines changed: 1 addition & 0 deletions
@@ -59,6 +59,7 @@ The exported graph then undergoes a series of automated transformations, includi
 - [Incorporating AutoDeploy into Your Own Workflow](./advanced/workflow.md)
 - [Expert Configurations](./advanced/expert_configurations.md)
 - [Performance Benchmarking](./advanced/benchmarking_with_trtllm_bench.md)
+- [Serving with trtllm-serve](./advanced/serving_with_trtllm_serve.md)

 ## Roadmap

tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -198,7 +198,6 @@ def prepare_flashinfer_metadata(
198198
flashinfer.get_seq_lens(paged_kv_indptr, paged_kv_last_page_len, page_size),
199199
position_ids.numel(),
200200
)
201-
202201
# return metadata
203202
return (
204203
qo_indptr,

tensorrt_llm/commands/serve.py

Lines changed: 1 addition & 1 deletion
@@ -167,7 +167,7 @@ def launch_server(host: str,
         llm = PyTorchLLM(**llm_args)
     elif backend == '_autodeploy':
         # AutoDeploy does not support build_config
-        del llm_args["build_config"]
+        llm_args.pop("build_config", None)
         # TODO(https://github.com/NVIDIA/TensorRT-LLM/issues/7142):
         # AutoDeploy does not support cache reuse yet.
         llm_args["kv_cache_config"].enable_block_reuse = False

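The `serve.py` change above swaps `del` for `dict.pop` with a default, so the AutoDeploy branch no longer raises if `build_config` is absent from `llm_args`. A small illustration of the difference, using a hypothetical dict rather than the real `llm_args`:

```python
# Illustration only: `del` raises KeyError when the key is missing, while
# `dict.pop(key, None)` removes the key only if it is present.
llm_args = {"kv_cache_config": "..."}  # hypothetical args without "build_config"

llm_args.pop("build_config", None)  # no-op, no exception

try:
    del llm_args["build_config"]
except KeyError:
    print("del raises KeyError when 'build_config' is not set")
```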