# Serving with trtllm-serve

AutoDeploy integrates with the OpenAI-compatible `trtllm-serve` CLI so you can expose AutoDeploy-optimized models over HTTP without writing server code. This page shows how to launch the server with the AutoDeploy backend, configure it via YAML, and validate with a simple request.

## Quick start

Launch `trtllm-serve` with the AutoDeploy backend by setting `--backend _autodeploy`:

```bash
trtllm-serve \
  meta-llama/Llama-3.1-8B-Instruct \
  --backend _autodeploy
```

- `model`: Hugging Face model name or local checkpoint path (see the local-path example below)
- `--backend _autodeploy`: uses the AutoDeploy runtime
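
Because `model` accepts a local path as well as a Hugging Face name, you can point the server at a checkpoint on disk. The command below is a sketch; the directory path is a placeholder for wherever your Hugging Face-format checkpoint lives.

```bash
# Serve a local checkpoint directory (the path below is a placeholder)
trtllm-serve \
  ./checkpoints/Llama-3.1-8B-Instruct \
  --backend _autodeploy
```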

Once the server is ready, test with an OpenAI-compatible request:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages":[{"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
    "max_tokens": 32
  }'
```
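
If you script requests, it can help to wait until the server reports ready first. The snippet below is a sketch that assumes the default `localhost:8000` address and a `/health` readiness route; the `/v1/models` listing is part of the OpenAI-compatible API and shows the model id to use in requests.

```bash
# Poll until the server reports ready (assumes default localhost:8000 and a /health route)
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "waiting for trtllm-serve..."
  sleep 5
done

# List the served models; the returned id is the value to pass as "model" in requests
curl -s http://localhost:8000/v1/models
```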

## Configuration via YAML

Use `--extra_llm_api_options` to supply a YAML file that augments or overrides server/runtime settings.

```bash
trtllm-serve \
  meta-llama/Llama-3.1-8B \
  --backend _autodeploy \
  --extra_llm_api_options autodeploy_config.yaml
```

Example `autodeploy_config.yaml`:

```yaml
# Compilation backend for AutoDeploy
compile_backend: torch-opt # options: torch-simple, torch-compile, torch-cudagraph, torch-opt

# Runtime engine
runtime: trtllm # options: trtllm, demollm

# Model loading
skip_loading_weights: false # set true for architecture-only perf runs

# KV cache memory
free_mem_ratio: 0.8 # fraction of free GPU mem for KV cache

# CUDA graph optimization
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64]

# Attention backend
attn_backend: flashinfer # recommended for best performance
```
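
Because the file augments or overrides settings, it only needs to contain the keys you want to change. As an illustration, a stripped-down file for a quick architecture-only check might look like the sketch below; it reuses only keys from the example above, and the values are illustrative rather than recommendations.

```yaml
# Minimal sketch for a quick architecture-only check (illustrative values)
compile_backend: torch-simple # lighter compile option from the list above
runtime: demollm              # alternative runtime listed above
skip_loading_weights: true    # architecture-only perf runs, per the comment above
```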

## Limitations and tips

- KV cache block reuse is disabled automatically for the AutoDeploy backend
- Disaggregated serving is not yet supported with the AutoDeploy backend (work in progress)
- For best performance (a consolidated example follows this list):
  - Prefer `compile_backend: torch-opt`
  - Use `attn_backend: flashinfer`
  - Set realistic `cuda_graph_batch_sizes` that match your expected traffic
  - Tune `free_mem_ratio` to 0.8–0.9
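
Putting these tips together, a performance-oriented override file might look like the following sketch; the batch sizes are placeholders and should be matched to the batch sizes you actually serve.

```yaml
# Performance-oriented sketch combining the tips above
compile_backend: torch-opt
attn_backend: flashinfer
cuda_graph_batch_sizes: [1, 2, 4, 8] # placeholders; match to expected traffic
free_mem_ratio: 0.9                  # within the suggested 0.8–0.9 range
```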

## See also

- [AutoDeploy overview](../auto-deploy.md)
- [Benchmarking with trtllm-bench](./benchmarking_with_trtllm_bench.md)