Commit 1e35be5

doc: subsequent modifications of blog 5 (#5366)
Signed-off-by: Shixiaowei02 <[email protected]>
1 parent: c7af650

File tree

2 files changed: +18 −15 lines


README.md

Lines changed: 3 additions & 0 deletions
@@ -18,6 +18,9 @@ TensorRT-LLM
 <div align="left">

 ## Tech Blogs
+* [06/19] Disaggregated Serving in TensorRT-LLM
+[➡️ link](./docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md)
+
 * [06/05] Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP)
 [➡️ link](./docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)


docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md

Lines changed: 15 additions & 15 deletions
@@ -35,7 +35,7 @@ There are two ways of serving LLM inference requests:

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture1.png" width="640" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture1.png" width="640" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 1. The execution timeline of aggregated LLM serving</em></sub></p>

@@ -44,7 +44,7 @@ In aggregated LLM serving, both the context and generation phases share the same

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture2.png" width="580" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture2.png" width="580" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 2. The execution timeline of dis-aggregated LLM serving</em></sub></p>

@@ -65,7 +65,7 @@ The first approach to do disaggregated LLM inference with TensorRT-LLM involves

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture3.png" width="800" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture3.png" width="800" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 3. `trtllm-serve` integration with disaggregated service</em></sub></p>

@@ -113,7 +113,7 @@ The second approach involves the use of [Dynamo](https://github.com/ai-dynamo/dy

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture4.png" width="800" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture4.png" width="800" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 4. Dynamo integration with disaggregated service</em></sub></p>

@@ -130,7 +130,7 @@ The third approach to do disaggregated LLM inference with TensorRT-LLM utilizes

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture5.png" width="800" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture5.png" width="800" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 5. Triton integration with disaggregated service</em></sub></p>

@@ -143,7 +143,7 @@ In TensorRT-LLM, the KV cache exchange is modularly decoupled from the KV cache

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture6.png" width="890" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture6.png" width="890" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 6. KV cache exchange architecture</em></sub></p>

@@ -154,7 +154,7 @@ To optimize the overall performance of disaggregated serving, TensorRT-LLM overl

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture7.png" width="800" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture7.png" width="800" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 7. KV cache exchange timing diagram</em></sub></p>

@@ -165,7 +165,7 @@ To minimize KV cache transmission latency, TensorRT-LLM currently uses direct tr

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture8.png" width="680" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture8.png" width="680" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 8. KV cache layout conversion</em></sub></p>

@@ -200,7 +200,7 @@ We conducted performance testing on DeepSeek R1 based on datasets with different

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture9.png" width="640" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture9.png" width="640" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 9. “Rate-matched” Pareto curve for DeepSeek R1 without MTP</em></sub></p>

@@ -209,7 +209,7 @@ Figure 9 shows the rate-matched Pareto curve for DeepSeek R1 with MTP off. Confi

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture10.png" width="640" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture10.png" width="640" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 10. DeepSeek R1 with MTP Pareto curve</em></sub></p>

@@ -222,19 +222,19 @@ As shown in Figure 10, enabling MTP increases speedups of disaggregation over ag

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture11.png" width="640" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture11.png" width="640" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 11. DeepSeek R1 4-GPU Pareto curve. ctx/gen=4.5 means SOL rate matching between context and generation phase, which is only used for SOL perf result collection purpose. c4dep4_g1dep4 means 4 DEP4 context instances plus 1 DEP4 generation instance form a full LLM serving instance.</em></sub></p>

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture12.png" width="640" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture12.png" width="640" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 12. DeepSeek R1 8-GPU Pareto curve</em></sub></p>

-Figures 11 and 12 show the performance curves for the ISL8192-OSL256 dataset on DeepSeek R1 using 4 GPUs per generation instance (GEN4) and 8 GPUs per context instance (GEN8) respectively. With disaggregation, we plot both “rate-matched” results (based on perfect rate matching between context and generation phases) and E2E results (which can be directly reproduced by users in production deployment environments).
+Figures 11 and 12 show the performance curves for the ISL8192-OSL256 dataset on DeepSeek R1 using 4 GPUs per generation instance (GEN4) and 8 GPUs per generation instance (GEN8) respectively. With disaggregation, we plot both “rate-matched” results (based on perfect rate matching between context and generation phases) and E2E results (which can be directly reproduced by users in production deployment environments).

 The results show that for this ISL/OSL setting, disaggregated serving outperforms aggregated serving significantly—achieving up to **1.73x** speedup with GEN4 and up to **2x** with GEN8.

@@ -244,14 +244,14 @@ By comparing the disaggregated serving E2E results with the “rate-matched” c

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture13.png" width="640" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture13.png" width="640" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 13. DeepSeek R1 E2E Pareto curves with MTP = 1, 2, 3. In this figure, ctx1dep4-gen2dep4-mtp3 means 1 DEP4 context instance plus 2 DEP4 generation instances with MTP = 3.</em></sub></p>

 <div align="center">
 <figure>
-<img src="../media/tech_blog5_Picture14.png" width="640" height="auto">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture14.png" width="640" height="auto">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 14. DeepSeek R1 E2E Pareto curves without MTP.</em></sub></p>
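The same substitution is applied to all fourteen figures: each relative `../media/*.png` path is rewritten to the absolute `https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/` prefix taken from the diff. Below is a minimal sketch of that rewrite, not part of this commit; the script, regular expression, and use of Python's `re` module are illustrative assumptions rather than the tooling the author used.

```python
import re
from pathlib import Path

# Absolute URL prefix used for every figure in the blog post (taken from the diff).
RAW_BASE = "https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/"

def absolutize_image_paths(markdown: str) -> str:
    """Rewrite src="../media/<file>" attributes to the absolute RAW_BASE URL."""
    return re.sub(
        r'src="\.\./media/([^"]+)"',
        lambda m: f'src="{RAW_BASE}{m.group(1)}"',
        markdown,
    )

if __name__ == "__main__":
    # Hypothetical one-shot update of the blog file in a repo checkout.
    blog = Path("docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md")
    blog.write_text(absolutize_image_paths(blog.read_text()))
```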
