From 1063758a3fd821e7b0b5cbb3fe5e1653aa4c2a1e Mon Sep 17 00:00:00 2001
From: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Date: Thu, 19 Jun 2025 10:06:52 +0000
Subject: [PATCH 1/2] doc: subsequent modifications of blog 5

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
---
 README.md                                          |  3 ++
 ...5_Disaggregated_Serving_in_TensorRT-LLM.md      | 28 +++++++++----------
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index ec50c063612..7d3bfc14d4c 100644
--- a/README.md
+++ b/README.md
@@ -18,6 +18,9 @@ TensorRT-LLM
 ## Tech Blogs

+* [06/19] Disaggregated Serving in TensorRT-LLM
+✨ [➡️ link](./docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md)
+
 * [06/05] Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP)
 ✨ [➡️ link](./docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)

diff --git a/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md b/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
index a808327660e..a73be0bf7f3 100644
--- a/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
+++ b/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
@@ -35,7 +35,7 @@ There are two ways of serving LLM inference requests:
-<img src="…">
+<img src="…">

Figure 1. The execution timeline of aggregated LLM serving

@@ -44,7 +44,7 @@ In aggregated LLM serving, both the context and generation phases share the same
-<img src="…">
+<img src="…">

Figure 2. The execution timeline of disaggregated LLM serving

@@ -65,7 +65,7 @@ The first approach to do disaggregated LLM inference with TensorRT-LLM involves
-<img src="…">
+<img src="…">

Figure 3. `trtllm-serve` integration with disaggregated service

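Since this hunk only retouches Figure 3, a concrete sense of the first approach may help. Below is a minimal illustrative sketch of a disaggregated launch with `trtllm-serve`, assuming the `disagg_config.yaml` layout used in the TensorRT-LLM disaggregated examples; the model name, hostnames, ports, and instance counts are placeholders, not part of this patch.

```bash
# Hypothetical single-node sketch: one context server, one generation server,
# and an orchestrator that routes each request through both phases.

# Context server (prefill phase).
trtllm-serve deepseek-ai/DeepSeek-V3 --host localhost --port 8001 &

# Generation server (decode phase).
trtllm-serve deepseek-ai/DeepSeek-V3 --host localhost --port 8002 &

# Orchestrator config: which endpoints serve which phase.
cat > disagg_config.yaml <<'EOF'
hostname: localhost
port: 8000
context_servers:
  num_instances: 1
  urls:
    - "localhost:8001"
generation_servers:
  num_instances: 1
  urls:
    - "localhost:8002"
EOF

# Front-end that exposes a single endpoint on port 8000.
trtllm-serve disaggregated -c disagg_config.yaml
```

Clients then talk to the orchestrator on port 8000 as if it were a single aggregated server, while prefill and decode run on separate engines behind it.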
@@ -113,7 +113,7 @@ The second approach involves the use of [Dynamo](https://github.com/ai-dynamo/dy
-<img src="…">
+<img src="…">

Figure 4. Dynamo integration with disaggregated service

@@ -130,7 +130,7 @@ The third approach to do disaggregated LLM inference with TensorRT-LLM utilizes
-<img src="…">
+<img src="…">

Figure 5. Triton integration with disaggregated service

@@ -143,7 +143,7 @@ In TensorRT-LLM, the KV cache exchange is modularly decoupled from the KV cache
-<img src="…">
+<img src="…">

Figure 6. KV cache exchange architecture

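The hunk above only touches Figure 6, but for orientation: the transceiver that performs the exchange is configured per server through the extra LLM API options. A hedged sketch follows, assuming the `cache_transceiver_config` section and `--extra_llm_api_options` flag found in recent TensorRT-LLM disaggregated examples; exact field names may differ between versions.

```bash
# Hypothetical per-server options enabling a UCX-based KV cache transceiver.
# Field names follow recent TensorRT-LLM examples and may vary by version.
cat > context_extra-llm-api-config.yml <<'EOF'
cache_transceiver_config:
  backend: UCX                # transport for context -> generation KV transfer
  max_tokens_in_buffer: 2048  # staging buffer size (in tokens) for in-flight KV blocks
EOF

trtllm-serve deepseek-ai/DeepSeek-V3 --host localhost --port 8001 \
  --extra_llm_api_options ./context_extra-llm-api-config.yml
```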
@@ -154,7 +154,7 @@ To optimize the overall performance of disaggregated serving, TensorRT-LLM overl
-<img src="…">
+<img src="…">

Figure 7. KV cache exchange timing diagram

@@ -165,7 +165,7 @@ To minimize KV cache transmission latency, TensorRT-LLM currently uses direct tr
-<img src="…">
+<img src="…">

Figure 8. KV cache layout conversion

@@ -200,7 +200,7 @@ We conducted performance testing on DeepSeek R1 based on datasets with different
-<img src="…">
+<img src="…">

Figure 9. “Rate-matched” Pareto curve for DeepSeek R1 without MTP

@@ -209,7 +209,7 @@ Figure 9 shows the rate-matched Pareto curve for DeepSeek R1 with MTP off. Confi
-<img src="…">
+<img src="…">

Figure 10. DeepSeek R1 with MTP Pareto curve

@@ -222,14 +222,14 @@ As shown in Figure 10, enabling MTP increases speedups of disaggregation over ag
-<img src="…">
+<img src="…">

Figure 11. DeepSeek R1 4-GPU Pareto curve. ctx/gen=4.5 denotes SOL rate matching between the context and generation phases, used only for collecting SOL perf results. c4dep4_g1dep4 means 4 DEP4 context instances plus 1 DEP4 generation instance form a full LLM serving instance.

-<img src="…">
+<img src="…">

Figure 12. DeepSeek R1 8-GPU Pareto curve

@@ -244,14 +244,14 @@ By comparing the disaggregated serving E2E results with the “rate-matched” c
-<img src="…">
+<img src="…">

Figure 13. DeepSeek R1 E2E Pareto curves with MTP = 1, 2, 3. In this figure, ctx1dep4-gen2dep4-mtp3 means 1 DEP4 context instance plus 2 DEP4 generation instances with MTP = 3.

-<img src="…">
+<img src="…">

Figure 14. DeepSeek R1 E2E Pareto curves without MTP.

From ff5ef60966323cc5f9bbc8d46936c7309fe283f9 Mon Sep 17 00:00:00 2001
From: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Date: Thu, 19 Jun 2025 10:11:19 +0000
Subject: [PATCH 2/2] fix an error

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
---
 .../tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md b/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
index a73be0bf7f3..decf503d5c8 100644
--- a/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
+++ b/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
@@ -234,7 +234,7 @@ As shown in Figure 10, enabling MTP increases speedups of disaggregation over ag

Figure 12. DeepSeek R1 8-GPU Pareto curve

-Figures 11 and 12 show the performance curves for the ISL8192-OSL256 dataset on DeepSeek R1 using 4 GPUs per generation instance (GEN4) and 8 GPUs per context instance (GEN8) respectively. With disaggregation, we plot both “rate-matched” results (based on perfect rate matching between context and generation phases) and E2E results (which can be directly reproduced by users in production deployment environments).
+Figures 11 and 12 show the performance curves for the ISL8192-OSL256 dataset on DeepSeek R1 using 4 GPUs per generation instance (GEN4) and 8 GPUs per generation instance (GEN8) respectively. With disaggregation, we plot both “rate-matched” results (based on perfect rate matching between context and generation phases) and E2E results (which can be directly reproduced by users in production deployment environments).
 
 The results show that for this ISL/OSL setting, disaggregated serving outperforms aggregated serving significantly—achieving up to **1.73x** speedup with GEN4 and up to **2x** with GEN8.