1 change: 1 addition & 0 deletions docs/source/installation/linux.md
@@ -32,6 +32,7 @@
```bash
pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
```
**This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.**

2. Sanity check the installation by running the following in Python (tested on Python 3.12):
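   For instance, a minimal check along these lines — a sketch, assuming only that the wheel was installed into the currently active Python environment — imports the package and prints its version:

   ```bash
   # Minimal sanity check: import the package and print its reported version.
   python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
   ```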

170 changes: 95 additions & 75 deletions docs/source/performance/perf-overview.md
@@ -28,101 +28,119 @@ nvidia/Llama-3.1-405B-Instruct-FP4

#### Llama 3.3 70B FP4

|                          | GPU     | B200      |           |           |           |
|:-------------------------|:--------|:----------|:----------|:----------|:----------|
|                          | TP Size | 1         | 2         | 4         | 8         |
| ISL, OSL                 |         |           |           |           |           |
|                          |         |           |           |           |           |
| 128, 128                 |         | 10,994.48 | 17,542.11 | 24,667.31 | 27,272.27 |
| 128, 2048                |         | 9,580.46  | 15,432.35 | 23,568.12 | 31,174.31 |
| 128, 4096                |         | 6,418.39  | 9,841.53  | 17,808.76 | 25,229.25 |
| 500, 2000                |         | 7,343.32  | 11,850.57 | 20,709.67 | 28,038.78 |
| 1000, 1000               |         | 6,752.53  | 10,815.88 | 16,413.04 | 20,060.66 |
| 1000, 2000               |         | 6,670.07  | 9,830.73  | 15,597.49 | 20,672.37 |
| 1024, 2048               |         | 6,636.75  | 9,807.13  | 15,519.23 | 20,617.28 |
| 2048, 128                |         | 1,342.17  | 1,989.41  | 3,033.14  | 4,035.64  |
| 5000, 500                |         | 1,429.67  | 2,419.67  | 3,686.84  | 5,182.96  |
| 20000, 2000              |         | 629.77    | 1,177.01  | 2,120.66  | 3,429.03  |

#### Llama 3.1 405B FP4
|                          | GPU     | B200     |           |
|:-------------------------|:--------|:---------|:----------|
|                          | TP Size | 4        | 8         |
| ISL, OSL                 |         |          |           |
|                          |         |          |           |
| 128, 128                 |         | 6,163.81 | 9,002.90  |
| 128, 2048                |         | 7,081.21 | 10,288.28 |
| 128, 4096                |         | 6,028.37 | 8,713.77  |
| 500, 2000                |         | 5,858.75 | 9,125.86  |
| 1000, 1000               |         | 4,848.00 | 7,582.97  |
| 1000, 2000               |         | 5,375.25 | 7,626.28  |
| 1024, 2048               |         | 5,345.70 | 7,464.03  |
| 2048, 128                |         | 693.55   | 1,086.56  |
| 5000, 500                |         | 947.49   | 1,532.45  |
| 20000, 2000              |         | 641.11   | 1,097.84  |

### FP8 Models:
```
nvidia/Llama-3.1-8B-Instruct-FP8
nvidia/Llama-3.1-70B-Instruct-FP8
nvidia/Llama-3.3-70B-Instruct-FP8
nvidia/Llama-3.1-405B-Instruct-FP8
nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
```

#### Llama 3.1 8B FP8
|                          | GPU     | H200 141GB HBM3 | H100 80GB HBM3 |
|:-------------------------|:--------|:----------------|:---------------|
|                          | TP Size | 1               | 1              |
| ISL, OSL                 |         |                 |                |
|                          |         |                 |                |
| 128, 128                 |         | 27,970.14       | 27,688.36      |
| 128, 2048                |         | 23,326.38       | 21,841.15      |
| 128, 4096                |         | 17,508.51       | 13,730.89      |
| 500, 2000                |         | 21,390.41       | 17,833.34      |
| 1000, 1000               |         | 17,366.89       | 15,270.62      |
| 1000, 2000               |         | 16,831.31       | 13,798.08      |
| 1024, 2048               |         | 16,737.03       | 13,385.50      |
| 2048, 128                |         | 3,488.03        | 3,414.67       |
| 5000, 500                |         | 3,813.69        | 3,394.54       |
| 20000, 2000              |         | 1,696.66        | 1,345.42       |

#### Llama 3.3 70B FP8

|                          | GPU     | H200 141GB HBM3 |          |           |           | H100 80GB HBM3 |          |           |           |
|:-------------------------|:--------|:----------------|:---------|:----------|:----------|:---------------|:---------|:----------|:----------|
|                          | TP Size | 1               | 2        | 4         | 8         | 1              | 2        | 4         | 8         |
| ISL, OSL                 |         |                 |          |           |           |                |          |           |           |
|                          |         |                 |          |           |           |                |          |           |           |
| 128, 128                 |         | 3,605.47        | 6,427.69 | 10,407.42 | 15,434.37 | 3,128.33       | 6,216.91 |           |           |
| 128, 2048                |         | 4,315.80        | 8,464.03 | 13,508.59 | 20,759.72 | 756.42         | 5,782.57 | 11,464.94 | 17,424.32 |
| 128, 4096                |         | 2,701.17        | 5,573.55 | 11,458.56 | 16,668.75 |                | 3,868.37 | 8,206.39  | 12,624.61 |
| 500, 2000                |         | 3,478.76        | 6,740.06 | 12,200.18 |           |                | 4,684.06 | 9,903.53  | 14,553.93 |
| 1000, 1000               |         | 2,744.32        | 5,119.72 | 8,685.44  | 12,744.51 | 742.14         | 4,247.19 | 7,435.65  | 11,018.81 |
| 1000, 2000               |         | 2,896.44        | 5,847.26 | 9,031.21  | 13,141.17 | 533.74         | 3,866.53 | 7,611.12  | 11,139.22 |
| 1024, 2048               |         | 2,874.18        | 5,568.61 | 8,946.71  | 13,082.62 | 530.16         | 3,796.68 | 7,575.24  | 11,004.31 |
| 2048, 128                |         | 435.90          | 772.67   | 1,264.76  |           |                | 736.89   | 1,213.33  | 1,839.22  |
| 2048, 2048               |         |                 |          |           | 10,412.85 |                |          |           |           |
| 5000, 500                |         | 545.96          | 997.15   | 1,698.22  | 2,655.28  | 204.94         | 862.91   | 1,552.68  | 2,369.84  |
| 20000, 2000              |         | 276.66          | 620.33   | 1,161.29  | 1,985.85  |                | 416.13   | 903.66    | 1,554.10  |

#### Llama 3.1 405B FP8
|                          | GPU     | H200 141GB HBM3 | H100 80GB HBM3 |
|:-------------------------|:--------|:----------------|:---------------|
|                          | TP Size | 8               | 8              |
| ISL, OSL                 |         |                 |                |
|                          |         |                 |                |
| 128, 128                 |         | 3,800.11        | 3,732.40       |
| 128, 2048                |         | 5,567.87        |                |
| 128, 4096                |         | 5,136.85        |                |
| 500, 2000                |         | 4,787.61        | 3,673.91       |
| 1000, 1000               |         | 3,286.30        | 3,012.22       |
| 1000, 2000               |         | 3,636.76        | 3,262.20       |
| 1024, 2048               |         | 3,618.66        | 3,109.70       |
| 2048, 128                |         | 443.10          | 449.02         |
| 5000, 500                |         | 645.46          |                |
| 20000, 2000              |         |                 | 372.12         |

#### Llama 4 Maverick FP8

| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
|:-----------------------------|:---|:------------------|:-----------------|
| | TP Size | 8 | 8 |
| ISL, OSL | | | |
| | | | |
| 128, 2048 | | 27,543.87 | |
| 128, 4096 | | 18,541.01 | 11,163.12 |
| 500, 2000 | | 21,117.34 | |
| 1000, 2000 | | | 10,556.00 |
| 1024, 2048 | | 16,859.45 | 11,584.33 |
| 2048, 128 | | 4,364.06 | 3,832.38 |
| 2048, 2048 | | 12,800.89 | |
| 5000, 500 | | 5,128.60 | |
| 20000, 2000 | | 1,764.27 | 1,400.79 |

## Reproducing Benchmarked Results

@@ -198,6 +216,8 @@ a model name (HuggingFace reference or path to a local model), a [generated data
```bash
trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
```
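For illustration, the placeholders in the command above could be filled in as follows; the model name is taken from the model lists above, while the dataset path is purely hypothetical:

```bash
# Illustrative values only; substitute your own model, dataset, and options file.
model_name="nvidia/Llama-3.3-70B-Instruct-FP8"   # any entry from the model lists above
dataset_file=/path/to/synthetic_dataset.txt      # hypothetical path to the generated dataset
llm_options=llm_options.yml                      # the options file shown below
```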

The data for the v0.20 benchmarks was collected with the following options file:

`llm_options.yml`
```yaml
cuda_graph_config:
  # ...
- 8192
```

In a majority of cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` if we hit an out-of-memory issue.
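Concretely, that amounts to appending the flag to the benchmark invocation shown earlier (a sketch using the same placeholder variables as above):

```bash
# Same invocation as above, with the KV cache memory fraction raised from the default 0.90.
trtllm-bench --model $model_name throughput \
  --dataset $dataset_file \
  --backend pytorch \
  --extra_llm_api_options $llm_options \
  --kv_cache_free_gpu_mem_fraction 0.95
```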

The results will be printed to the terminal upon benchmark completion. For example,

6 changes: 4 additions & 2 deletions docs/source/quick-start-guide.md
@@ -8,13 +8,15 @@ This is the starting point to try out TensorRT-LLM. Specifically, this Quick Sta

There are multiple ways to install and run TensorRT-LLM. For most users, the options below are ordered from simplest to most complex. All approaches are equivalent in terms of the features they support.

Note: **This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.**

1. [](installation/containers)

1. Pre-built release wheels on [PyPI](https://pypi.org/project/tensorrt-llm) (see [](installation/linux))

1. [Building from source](installation/build-from-source-linux)

The following examples can most easily be executed using the prebuilt [Docker release container available on NGC](https://registry.ngc.nvidia.com/orgs/nvstaging/teams/tensorrt-llm/containers/release) (see also [release.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/release.md) on GitHub). Be sure to run these commands as a user with the appropriate permissions, preferably `root`, to streamline the setup process.
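As a rough sketch, launching an interactive shell in that container looks like the following; the image repository and tag here are placeholders and should be replaced with the exact reference published on NGC:

```bash
# Placeholder image reference; use the repository and tag listed on the NGC page above.
docker run --rm -it --gpus all --ipc=host \
  nvcr.io/nvidia/tensorrt-llm/release:<version-tag> \
  bash
```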


## LLM API
@@ -92,7 +94,7 @@ For detailed examples and command syntax, refer to the [trtllm-serve](commands/t

2. Open a new terminal and use the following command to directly attach to the running container:

   ```bash
docker exec -it <container_id> bash
```

4 changes: 3 additions & 1 deletion docs/source/reference/support-matrix.md
@@ -25,6 +25,8 @@ TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B` | L |
| `Qwen2VLForConditionalGeneration` | Qwen2-VL | `Qwen/Qwen2-VL-7B-Instruct` | L + V |
| `Qwen2_5_VLForConditionalGeneration` | Qwen2.5-VL | `Qwen/Qwen2.5-VL-7B-Instruct` | L + V |
| `Qwen3ForCausalLM` | Qwen3 | `Qwen/Qwen3-8B` | L |
| `Qwen3MoeForCausalLM` | Qwen3MoE | `Qwen/Qwen3-30B-A3B` | L |

Note:
- L: Language only
@@ -72,7 +74,7 @@
- [mT5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
- [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/opt)
- [Phi-1.5/Phi-2/Phi-3](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/phi)
- [Qwen/Qwen1.5/Qwen2/Qwen3](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/qwen)
- [Qwen-VL](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/qwenvl)
- [RecurrentGemma](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/recurrentgemma)
- [Replit Code](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/mpt) [^replitcode]