@@ -35,7 +35,7 @@ There are two ways of serving LLM inference requests:

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture1.png" width="640" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture1.png" width="640" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 1. The execution timeline of aggregated LLM serving</em></sub></p>
@@ -44,7 +44,7 @@ In aggregated LLM serving, both the context and generation phases share the same

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture2.png" width="580" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture2.png" width="580" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 2. The execution timeline of dis-aggregated LLM serving</em></sub></p>
@@ -65,7 +65,7 @@ The first approach to do disaggregated LLM inference with TensorRT-LLM involves

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture3.png" width="800" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture3.png" width="800" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 3. `trtllm-serve` integration with disaggregated service</em></sub></p>
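The hunk above only touches Figure 3's embed, but for orientation: in this first approach, clients reach the disaggregated deployment through the OpenAI-compatible HTTP endpoint that `trtllm-serve` exposes, and the context/generation split stays invisible to them. A minimal client sketch follows, assuming a router listening on `localhost:8000` and a placeholder served-model name; neither value comes from this commit.

```python
# Minimal sketch: query a disaggregated trtllm-serve deployment through
# its OpenAI-compatible completions endpoint. Host, port, and model name
# are illustrative assumptions, not values taken from this commit.
import requests

payload = {
    "model": "deepseek-r1",  # hypothetical served-model name
    "prompt": "Explain disaggregated LLM serving in one sentence.",
    "max_tokens": 64,
}

# The context and generation phases run on separate instances behind this
# endpoint; the client never sees the KV cache hand-off between them.
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```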
@@ -113,7 +113,7 @@ The second approach involves the use of [Dynamo](https://github.com/ai-dynamo/dy

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture4.png" width="800" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture4.png" width="800" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 4. Dynamo integration with disaggregated service</em></sub></p>
@@ -130,7 +130,7 @@ The third approach to do disaggregated LLM inference with TensorRT-LLM utilizes

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture5.png" width="800" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture5.png" width="800" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 5. Triton integration with disaggregated service</em></sub></p>
@@ -143,7 +143,7 @@ In TensorRT-LLM, the KV cache exchange is modularly decoupled from the KV cache

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture6.png" width="890" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture6.png" width="890" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 6. KV cache exchange architecture</em></sub></p>
@@ -154,7 +154,7 @@ To optimize the overall performance of disaggregated serving, TensorRT-LLM overl

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture7.png" width="800" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture7.png" width="800" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 7. KV cache exchange timing diagram</em></sub></p>
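Figure 7's timing diagram is about hiding transfer latency: KV cache blocks of finished context requests are shipped while subsequent forward passes proceed. Below is a toy sketch of that general overlap pattern, using a background sender thread draining a queue; it illustrates the technique only and is not TensorRT-LLM's actual implementation.

```python
# Toy illustration of the overlap pattern in Figure 7: KV blocks of
# finished requests are transmitted by a background sender thread while
# the main loop keeps running forward passes. Schematic only.
import queue
import threading
import time

transfer_queue: queue.Queue = queue.Queue()

def send_over_network(block: bytes) -> None:
    # Placeholder for a real transport call (e.g. an RDMA-style send).
    time.sleep(0.01)

def kv_sender() -> None:
    # Drain finished KV blocks and transmit them off the critical path.
    while True:
        block = transfer_queue.get()
        if block is None:  # shutdown sentinel
            break
        send_over_network(block)

sender = threading.Thread(target=kv_sender)
sender.start()

for step in range(8):
    time.sleep(0.005)                # stand-in for one forward pass
    transfer_queue.put(b"kv-block")  # hand off; compute is not blocked

transfer_queue.put(None)
sender.join()
```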
@@ -165,7 +165,7 @@ To minimize KV cache transmission latency, TensorRT-LLM currently uses direct tr

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture8.png" width="680" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture8.png" width="680" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 8. KV cache layout conversion</em></sub></p>
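The layout conversion in Figure 8 is needed because context and generation instances may run with different parallelism mappings, so KV heads must be regrouped per rank during the exchange. A schematic numpy sketch follows, assuming illustrative tensor shapes and a TP=2 context / TP=4 generation split; the real kernels operate on paged cache blocks rather than whole tensors.

```python
# Schematic of Figure 8's layout conversion: re-shard KV heads produced
# under context TP=2 into the TP=4 layout a generation rank expects.
# Shapes and the TP split are illustrative assumptions.
import numpy as np

num_heads, seq_len, head_dim = 8, 16, 64
ctx_tp, gen_tp = 2, 4

# Each context rank holds a contiguous slice of the KV heads.
ctx_shards = [np.random.rand(num_heads // ctx_tp, seq_len, head_dim)
              for _ in range(ctx_tp)]

# Conversion: regroup the head slices so each generation rank gets its share.
full = np.concatenate(ctx_shards, axis=0)    # (8, 16, 64)
gen_shards = np.split(full, gen_tp, axis=0)  # 4 shards of (2, 16, 64)

assert all(s.shape == (num_heads // gen_tp, seq_len, head_dim) for s in gen_shards)
```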
@@ -200,7 +200,7 @@ We conducted performance testing on DeepSeek R1 based on datasets with different

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture9.png" width="640" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture9.png" width="640" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 9. “Rate-matched” Pareto curve for DeepSeek R1 without MTP</em></sub></p>
@@ -209,7 +209,7 @@ Figure 9 shows the rate-matched Pareto curve for DeepSeek R1 with MTP off. Confi

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture10.png" width="640" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture10.png" width="640" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 10. DeepSeek R1 with MTP Pareto curve</em></sub></p>
@@ -222,19 +222,19 @@ As shown in Figure 10, enabling MTP increases speedups of disaggregation over ag

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture11.png" width="640" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture11.png" width="640" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 11. DeepSeek R1 4-GPU Pareto curve. ctx/gen=4.5 means SOL rate matching between context and generation phase, which is only used for SOL perf result collection purpose. c4dep4_g1dep4 means 4 DEP4 context instances plus 1 DEP4 generation instance form a full LLM serving instance.</em></sub></p>

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture12.png" width="640" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture12.png" width="640" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 12. DeepSeek R1 8-GPU Pareto curve</em></sub></p>

- Figures 11 and 12 show the performance curves for the ISL8192-OSL256 dataset on DeepSeek R1 using 4 GPUs per generation instance (GEN4) and 8 GPUs per context instance (GEN8) respectively. With disaggregation, we plot both “rate-matched” results (based on perfect rate matching between context and generation phases) and E2E results (which can be directly reproduced by users in production deployment environments).
+ Figures 11 and 12 show the performance curves for the ISL8192-OSL256 dataset on DeepSeek R1 using 4 GPUs per generation instance (GEN4) and 8 GPUs per generation instance (GEN8) respectively. With disaggregation, we plot both “rate-matched” results (based on perfect rate matching between context and generation phases) and E2E results (which can be directly reproduced by users in production deployment environments).

The results show that for this ISL/OSL setting, disaggregated serving outperforms aggregated serving significantly—achieving up to **1.73x** speedup with GEN4 and up to **2x** with GEN8.
@@ -244,14 +244,14 @@ By comparing the disaggregated serving E2E results with the “rate-matched” c

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture13.png" width="640" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture13.png" width="640" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 13. DeepSeek R1 E2E Pareto curves with MTP = 1, 2, 3. In this figure, ctx1dep4-gen2dep4-mtp3 means 1 DEP4 context instance plus 2 DEP4 generation instances with MTP = 3.</em></sub></p>

<div align="center">
<figure>
- <img src="../media/tech_blog5_Picture14.png" width="640" height="auto">
+ <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture14.png" width="640" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 14. DeepSeek R1 E2E Pareto curves without MTP.</em></sub></p>