# Serving with trtllm-serve

AutoDeploy integrates with the OpenAI-compatible `trtllm-serve` CLI so you can expose AutoDeploy-optimized models over HTTP without writing server code. This page shows how to launch the server with the AutoDeploy backend, configure it via YAML, and validate with a simple request.

## Quick start

Launch `trtllm-serve` with the AutoDeploy backend by setting `--backend _autodeploy`:

```bash
trtllm-serve \
  meta-llama/Llama-3.1-8B-Instruct \
  --backend _autodeploy
```

- `model`: Hugging Face model name or local checkpoint path (see the local-path example below)
- `--backend _autodeploy`: uses the AutoDeploy runtime
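
Because `model` accepts a local path as well as a Hugging Face name, you can point the server at a checkpoint on disk. The command below is a sketch; the directory path is a placeholder for wherever your Hugging Face-format checkpoint lives.

```bash
# Serve a local checkpoint directory (the path below is a placeholder)
trtllm-serve \
  ./checkpoints/Llama-3.1-8B-Instruct \
  --backend _autodeploy
```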

Once the server is ready, test with an OpenAI-compatible request:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages":[{"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
    "max_tokens": 32
  }'
```
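
If you script requests, it can help to wait until the server reports ready first. The snippet below is a sketch that assumes the default `localhost:8000` address and a `/health` readiness route; the `/v1/models` listing is part of the OpenAI-compatible API and shows the model id to use in requests.

```bash
# Poll until the server reports ready (assumes default localhost:8000 and a /health route)
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "waiting for trtllm-serve..."
  sleep 5
done

# List the served models; the returned id is the value to pass as "model" in requests
curl -s http://localhost:8000/v1/models
```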

## Configuration via YAML

Use `--extra_llm_api_options` to supply a YAML file that augments or overrides server/runtime settings.

```bash
trtllm-serve \
  meta-llama/Llama-3.1-8B \
  --backend _autodeploy \
  --extra_llm_api_options autodeploy_config.yaml
```

Example `autodeploy_config.yaml`:

```yaml
# Compilation backend for AutoDeploy
compile_backend: torch-opt # options: torch-simple, torch-compile, torch-cudagraph, torch-opt

# Runtime engine
runtime: trtllm # options: trtllm, demollm

# Model loading
skip_loading_weights: false # set true for architecture-only perf runs

# KV cache memory
free_mem_ratio: 0.8 # fraction of free GPU mem for KV cache

# CUDA graph optimization
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64]

# Attention backend
attn_backend: flashinfer # recommended for best performance
```
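
Because the file augments or overrides settings, it only needs to contain the keys you want to change. As an illustration, a stripped-down file for a quick architecture-only check might look like the sketch below; it reuses only keys from the example above, and the values are illustrative rather than recommendations.

```yaml
# Minimal sketch for a quick architecture-only check (illustrative values)
compile_backend: torch-simple # lighter compile option from the list above
runtime: demollm              # alternative runtime listed above
skip_loading_weights: true    # architecture-only perf runs, per the comment above
```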

## Limitations and tips

- KV cache block reuse is disabled automatically for the AutoDeploy backend
- Disaggregated serving is not yet supported with the AutoDeploy backend (work in progress)
- For best performance (a consolidated example follows this list):
  - Prefer `compile_backend: torch-opt`
  - Use `attn_backend: flashinfer`
  - Set realistic `cuda_graph_batch_sizes` that match your expected traffic
  - Tune `free_mem_ratio` to 0.8–0.9
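
Putting these tips together, a performance-oriented override file might look like the following sketch; the batch sizes are placeholders and should be matched to the batch sizes you actually serve.

```yaml
# Performance-oriented sketch combining the tips above
compile_backend: torch-opt
attn_backend: flashinfer
cuda_graph_batch_sizes: [1, 2, 4, 8] # placeholders; match to expected traffic
free_mem_ratio: 0.9                  # within the suggested 0.8–0.9 range
```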

## See also

- [AutoDeploy overview](../auto-deploy.md)
- [Benchmarking with trtllm-bench](./benchmarking_with_trtllm_bench.md)