1 change: 1 addition & 0 deletions docs/source/installation/linux.md
@@ -32,6 +32,7 @@
```bash
pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
```
**This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.**

2. Sanity check the installation by running the following in Python (tested on Python 3.12):
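   For instance, a minimal check along these lines — a sketch, assuming only that the wheel was installed into the currently active Python environment — imports the package and prints its version:

   ```bash
   # Minimal sanity check: import the package and print its reported version.
   python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
   ```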

170 changes: 95 additions & 75 deletions docs/source/performance/perf-overview.md
@@ -28,101 +28,119 @@ nvidia/Llama-3.1-405B-Instruct-FP4

#### Llama 3.3 70B FP4

|                          | GPU     | B200      |           |           |           |
|:-------------------------|:--------|:----------|:----------|:----------|:----------|
|                          | TP Size | 1         | 2         | 4         | 8         |
| ISL, OSL                 |         |           |           |           |           |
|                          |         |           |           |           |           |
| 128, 128                 |         | 10,994.48 | 17,542.11 | 24,667.31 | 27,272.27 |
| 128, 2048                |         | 9,580.46  | 15,432.35 | 23,568.12 | 31,174.31 |
| 128, 4096                |         | 6,418.39  | 9,841.53  | 17,808.76 | 25,229.25 |
| 500, 2000                |         | 7,343.32  | 11,850.57 | 20,709.67 | 28,038.78 |
| 1000, 1000               |         | 6,752.53  | 10,815.88 | 16,413.04 | 20,060.66 |
| 1000, 2000               |         | 6,670.07  | 9,830.73  | 15,597.49 | 20,672.37 |
| 1024, 2048               |         | 6,636.75  | 9,807.13  | 15,519.23 | 20,617.28 |
| 2048, 128                |         | 1,342.17  | 1,989.41  | 3,033.14  | 4,035.64  |
| 5000, 500                |         | 1,429.67  | 2,419.67  | 3,686.84  | 5,182.96  |
| 20000, 2000              |         | 629.77    | 1,177.01  | 2,120.66  | 3,429.03  |

#### Llama 3.1 405B FP4
|                          | GPU     | B200     |           |
|:-------------------------|:--------|:---------|:----------|
|                          | TP Size | 4        | 8         |
| ISL, OSL                 |         |          |           |
|                          |         |          |           |
| 128, 128                 |         | 6,163.81 | 9,002.90  |
| 128, 2048                |         | 7,081.21 | 10,288.28 |
| 128, 4096                |         | 6,028.37 | 8,713.77  |
| 500, 2000                |         | 5,858.75 | 9,125.86  |
| 1000, 1000               |         | 4,848.00 | 7,582.97  |
| 1000, 2000               |         | 5,375.25 | 7,626.28  |
| 1024, 2048               |         | 5,345.70 | 7,464.03  |
| 2048, 128                |         | 693.55   | 1,086.56  |
| 5000, 500                |         | 947.49   | 1,532.45  |
| 20000, 2000              |         | 641.11   | 1,097.84  |

### FP8 Models:
```
nvidia/Llama-3.1-8B-Instruct-FP8
nvidia/Llama-3.1-70B-Instruct-FP8
nvidia/Llama-3.3-70B-Instruct-FP8
nvidia/Llama-3.1-405B-Instruct-FP8
nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
```

#### Llama 3.1 8B FP8
|                          | GPU     | H200 141GB HBM3 | H100 80GB HBM3 |
|:-------------------------|:--------|:----------------|:---------------|
|                          | TP Size | 1               | 1              |
| ISL, OSL                 |         |                 |                |
|                          |         |                 |                |
| 128, 128                 |         | 27,970.14       | 27,688.36      |
| 128, 2048                |         | 23,326.38       | 21,841.15      |
| 128, 4096                |         | 17,508.51       | 13,730.89      |
| 500, 2000                |         | 21,390.41       | 17,833.34      |
| 1000, 1000               |         | 17,366.89       | 15,270.62      |
| 1000, 2000               |         | 16,831.31       | 13,798.08      |
| 1024, 2048               |         | 16,737.03       | 13,385.50      |
| 2048, 128                |         | 3,488.03        | 3,414.67       |
| 5000, 500                |         | 3,813.69        | 3,394.54       |
| 20000, 2000              |         | 1,696.66        | 1,345.42       |

#### Llama 3.3 70B FP8

|                          | GPU     | H200 141GB HBM3 |          |           |           | H100 80GB HBM3 |          |           |           |
|:-------------------------|:--------|:----------------|:---------|:----------|:----------|:---------------|:---------|:----------|:----------|
|                          | TP Size | 1               | 2        | 4         | 8         | 1              | 2        | 4         | 8         |
| ISL, OSL                 |         |                 |          |           |           |                |          |           |           |
|                          |         |                 |          |           |           |                |          |           |           |
| 128, 128                 |         | 3,605.47        | 6,427.69 | 10,407.42 | 15,434.37 | 3,128.33       | 6,216.91 |           |           |
| 128, 2048                |         | 4,315.80        | 8,464.03 | 13,508.59 | 20,759.72 | 756.42         | 5,782.57 | 11,464.94 | 17,424.32 |
| 128, 4096                |         | 2,701.17        | 5,573.55 | 11,458.56 | 16,668.75 |                | 3,868.37 | 8,206.39  | 12,624.61 |
| 500, 2000                |         | 3,478.76        | 6,740.06 | 12,200.18 |           |                | 4,684.06 | 9,903.53  | 14,553.93 |
| 1000, 1000               |         | 2,744.32        | 5,119.72 | 8,685.44  | 12,744.51 | 742.14         | 4,247.19 | 7,435.65  | 11,018.81 |
| 1000, 2000               |         | 2,896.44        | 5,847.26 | 9,031.21  | 13,141.17 | 533.74         | 3,866.53 | 7,611.12  | 11,139.22 |
| 1024, 2048               |         | 2,874.18        | 5,568.61 | 8,946.71  | 13,082.62 | 530.16         | 3,796.68 | 7,575.24  | 11,004.31 |
| 2048, 128                |         | 435.90          | 772.67   | 1,264.76  |           |                | 736.89   | 1,213.33  | 1,839.22  |
| 2048, 2048               |         |                 |          |           | 10,412.85 |                |          |           |           |
| 5000, 500                |         | 545.96          | 997.15   | 1,698.22  | 2,655.28  | 204.94         | 862.91   | 1,552.68  | 2,369.84  |
| 20000, 2000              |         | 276.66          | 620.33   | 1,161.29  | 1,985.85  |                | 416.13   | 903.66    | 1,554.10  |

#### Llama 3.1 405B FP8
|                          | GPU     | H200 141GB HBM3 | H100 80GB HBM3 |
|:-------------------------|:--------|:----------------|:---------------|
|                          | TP Size | 8               | 8              |
| ISL, OSL                 |         |                 |                |
|                          |         |                 |                |
| 128, 128                 |         | 3,800.11        | 3,732.40       |
| 128, 2048                |         | 5,567.87        |                |
| 128, 4096                |         | 5,136.85        |                |
| 500, 2000                |         | 4,787.61        | 3,673.91       |
| 1000, 1000               |         | 3,286.30        | 3,012.22       |
| 1000, 2000               |         | 3,636.76        | 3,262.20       |
| 1024, 2048               |         | 3,618.66        | 3,109.70       |
| 2048, 128                |         | 443.10          | 449.02         |
| 5000, 500                |         | 645.46          |                |
| 20000, 2000              |         |                 | 372.12         |

#### Llama 4 Maverick FP8

| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
|:-----------------------------|:---|:------------------|:-----------------|
| | TP Size | 8 | 8 |
| ISL, OSL | | | |
| | | | |
| 128, 2048 | | 27,543.87 | |
| 128, 4096 | | 18,541.01 | 11,163.12 |
| 500, 2000 | | 21,117.34 | |
| 1000, 2000 | | | 10,556.00 |
| 1024, 2048 | | 16,859.45 | 11,584.33 |
| 2048, 128 | | 4,364.06 | 3,832.38 |
| 2048, 2048 | | 12,800.89 | |
| 5000, 500 | | 5,128.60 | |
| 20000, 2000 | | 1,764.27 | 1,400.79 |

## Reproducing Benchmarked Results

@@ -198,6 +216,8 @@ a model name (HuggingFace reference or path to a local model), a [generated data
```bash
trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
```
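For illustration, the placeholders in the command above could be filled in as follows; the model name is taken from the model lists above, while the dataset path is purely hypothetical:

```bash
# Illustrative values only; substitute your own model, dataset, and options file.
model_name="nvidia/Llama-3.3-70B-Instruct-FP8"   # any entry from the model lists above
dataset_file=/path/to/synthetic_dataset.txt      # hypothetical path to the generated dataset
llm_options=llm_options.yml                      # the options file shown below
```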

The data for the v0.20 benchmarks was collected with the following options file:

`llm_options.yml`
```yaml
cuda_graph_config:
  # ...
- 8192
```

In a majority of cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` if we hit an out-of-memory issue.
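Concretely, that amounts to appending the flag to the benchmark invocation shown earlier (a sketch using the same placeholder variables as above):

```bash
# Same invocation as above, with the KV cache memory fraction raised from the default 0.90.
trtllm-bench --model $model_name throughput \
  --dataset $dataset_file \
  --backend pytorch \
  --extra_llm_api_options $llm_options \
  --kv_cache_free_gpu_mem_fraction 0.95
```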

The results will be printed to the terminal upon benchmark completion. For example,

6 changes: 4 additions & 2 deletions docs/source/quick-start-guide.md
@@ -8,13 +8,15 @@ This is the starting point to try out TensorRT-LLM. Specifically, this Quick Sta

There are multiple ways to install and run TensorRT-LLM. For most users, the options below are ordered from simplest to most complex. All approaches are equivalent in terms of the features they support.

Note: **This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.**

1. [](installation/containers)

1. Pre-built release wheels on [PyPI](https://pypi.org/project/tensorrt-llm) (see [](installation/linux))

1. [Building from source](installation/build-from-source-linux)

The following examples can most easily be executed using the prebuilt [Docker release container available on NGC](https://registry.ngc.nvidia.com/orgs/nvstaging/teams/tensorrt-llm/containers/release) (see also [release.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/release.md) on GitHub). Be sure to run these commands as a user with the appropriate permissions, preferably `root`, to streamline the setup process.
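As a rough sketch, launching an interactive shell in that container looks like the following; the image repository and tag here are placeholders and should be replaced with the exact reference published on NGC:

```bash
# Placeholder image reference; use the repository and tag listed on the NGC page above.
docker run --rm -it --gpus all --ipc=host \
  nvcr.io/nvidia/tensorrt-llm/release:<version-tag> \
  bash
```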


## LLM API
@@ -92,7 +94,7 @@ For detailed examples and command syntax, refer to the [trtllm-serve](commands/t

2. Open a new terminal and use the following command to directly attach to the running container:

   ```bash
docker exec -it <container_id> bash
```

4 changes: 3 additions & 1 deletion docs/source/reference/support-matrix.md
@@ -25,6 +25,8 @@ TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B` | L |
| `Qwen2VLForConditionalGeneration` | Qwen2-VL | `Qwen/Qwen2-VL-7B-Instruct` | L + V |
| `Qwen2_5_VLForConditionalGeneration` | Qwen2.5-VL | `Qwen/Qwen2.5-VL-7B-Instruct` | L + V |
| `Qwen3ForCausalLM` | Qwen3 | `Qwen/Qwen3-8B` | L |
| `Qwen3MoeForCausalLM` | Qwen3MoE | `Qwen/Qwen3-30B-A3B` | L |

Note:
- L: Language only
@@ -72,7 +74,7 @@
- [mT5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
- [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/opt)
- [Phi-1.5/Phi-2/Phi-3](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/phi)
- [Qwen/Qwen1.5/Qwen2/Qwen3](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/qwen)
- [Qwen-VL](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/qwenvl)
- [RecurrentGemma](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/recurrentgemma)
- [Replit Code](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/mpt) [^replitcode]