docker/README.md: 4 additions & 4 deletions
@@ -2,12 +2,12 @@
## Multi-stage Builds with Docker
-TensorRT-LLM can be compiled in Docker using a multi-stage build implemented in [`Dockerfile.multi`](Dockerfile.multi).
+TensorRTLLM can be compiled in Docker using a multi-stage build implemented in [`Dockerfile.multi`](Dockerfile.multi).
The following build stages are defined:
* `devel`: this image provides all dependencies for building TensorRT-LLM.
* `wheel`: this image contains the source code and the compiled binary distribution.
-* `release`: this image has the binaries installed and contains TensorRT-LLM examples in `/app/tensorrt_llm`.
+* `release`: this image has the binaries installed and contains TensorRTLLM examples in `/app/tensorrt_llm`.
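If useful, the three stages can also be built directly with `docker build`; a minimal sketch, assuming `Dockerfile.multi` is addressed from the repository root and that the image tags are purely illustrative (the `make` targets described in the next section are the supported entry point):

```bash
# Build each stage of the multi-stage Dockerfile explicitly (sketch only).
docker build -f docker/Dockerfile.multi --target devel   -t tensorrt_llm:devel .
docker build -f docker/Dockerfile.multi --target wheel   -t tensorrt_llm:wheel .
docker build -f docker/Dockerfile.multi --target release -t tensorrt_llm:release .
```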
## Building Docker Images with GNU `make`
@@ -19,7 +19,7 @@ separated by `_`. The following actions are available:
* `<stage>_push`: pushes the docker image for the stage to a docker registry (implies `<stage>_build`).
* `<stage>_run`: runs the docker image for the stage in a new container.
-For example, the `release` stage is built and pushed from the top-level directory of TensorRT-LLM as follows:
+For example, the `release` stage is built and pushed from the top-level directory of TensorRTLLM as follows:
```bash
make -C docker release_push
```

@@ -178,4 +178,4 @@ a corresponding message. The heuristics can be overridden by specifying
Since Docker rootless mode remaps the UID/GID and the remapped UIDs and GIDs
(typically configured in `/etc/subuid` and `/etc/subgid`) generally do not coincide
with the local UID/GID, both IDs need to be translated using a tool like `bindfs` in order to be able to smoothly share a local working directory with any containers
-started with `LOCAL_USER=1`. In this case, the `SOURCE_DIR` and `HOME_DIR` Makefile variables need to be set to the locations of the translated versions of the TensorRT-LLM working copy and the user home directory, respectively.
+started with `LOCAL_USER=1`. In this case, the `SOURCE_DIR` and `HOME_DIR` Makefile variables need to be set to the locations of the translated versions of the TensorRTLLM working copy and the user home directory, respectively.
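A minimal sketch of such a setup under rootless Docker; the working-copy path `~/TensorRT-LLM`, the mount points, the example UID/GID of 1000, and the remapped ID 101000 are assumptions, so substitute the ranges actually configured in `/etc/subuid` and `/etc/subgid`:

```bash
# Translate ownership so that files owned by UID/GID 1000 on the host appear
# under the remapped IDs (assumed here to start at 101000) inside the mounts.
sudo mkdir -p /mnt/trtllm_src /mnt/trtllm_home
sudo bindfs --map=1000/101000:@1000/@101000 ~/TensorRT-LLM /mnt/trtllm_src
sudo bindfs --map=1000/101000:@1000/@101000 "$HOME" /mnt/trtllm_home

# Point the Makefile at the translated locations when starting a container.
make -C docker devel_run LOCAL_USER=1 SOURCE_DIR=/mnt/trtllm_src HOME_DIR=/mnt/trtllm_home
```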
-Full instructions for cloning the TensorRT-LLM repository can be found in
-the [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
+Full instructions for cloning the TensorRTLLM repository can be found in
+the [TensorRTLLM Documentation](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
-### Running TensorRT-LLM Using Docker
+### Running TensorRTLLM Using Docker
-With the top-level directory of the TensorRT-LLM repository cloned to your local machine, you can run the following
+With the top-level directory of the TensorRTLLM repository cloned to your local machine, you can run the following
command to start the development container:
```bash
make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.y.z
```
-where `x.y.z` is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
+where `x.y.z` is the version of the TensorRTLLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
NVIDIA NGC registry, sets up the local user's account within the container, and launches it with full GPU support. The
-local source code of TensorRT-LLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
-integration. Ensure that the image version matches the version of TensorRT-LLM in your currently checked out local git branch. Not
+local source code of TensorRTLLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
+integration. Ensure that the image version matches the version of TensorRTLLM in your currently checked out local git branch. Not
specifying an `IMAGE_TAG` will attempt to resolve this automatically, but not every intermediate release might be
accompanied by a development container. In that case, use the latest version preceding the version of your development
@@ -78,7 +78,7 @@ The wheel will be built in the `build` directory and can be installed using `pip
pip install ./build/tensorrt_llm*.whl
```
-For additional information on building the TensorRT-LLM wheel, refer to
+For additional information on building the TensorRTLLM wheel, refer to
the [official documentation on building from source](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#option-1-full-build-with-c-compilation).
-where x.y.z is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:
+where x.y.z is the version of the TensorRTLLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:
```bash
python3 -c "import tensorrt_llm"
```
-This command will print the TensorRT-LLM version if everything is working correctly. After verification, you can explore
+This command will print the TensorRTLLM version if everything is working correctly. After verification, you can explore
and try the example scripts included in `/app/tensorrt_llm/examples`.
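The version string can also be queried explicitly; a small sketch, assuming the installed wheel exposes the conventional `__version__` attribute:

```bash
# Print the installed TensorRT-LLM version explicitly.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```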
-Alternatively, if you have already cloned the TensorRT-LLM repository, you can use the following convenient command to
+Alternatively, if you have already cloned the TensorRTLLM repository, you can use the following convenient command to
run the container:
```bash
@@ -43,8 +43,8 @@ container, and launches it with full GPU support.
For comprehensive information about TensorRT-LLM, including documentation, source code, examples, and installation
guidelines, visit the following official resources:
docs/source/blogs/Falcon180B-H200.md: 5 additions & 5 deletions
@@ -21,7 +21,7 @@ memory footprint, allows for great performance on Falcon-180B on a single GPU.
<sup>Preliminary measured Performance, subject to change. TP1 does not represent peak performance on H200. </sup>
<sup>
-TensorRT-LLM v0.7a |
+TensorRTLLM v0.7a |
Falcon-180B |
1xH200 TP1 |
INT4 AWQ |
@@ -38,7 +38,7 @@ while maintaining high accuracy.
<sup>Preliminary measured accuracy, subject to change. </sup>
<sup>
-TensorRT-LLM v0.7a |
+TensorRTLLM v0.7a |
Falcon-180B |
1xH200 TP1 |
INT4 AWQ
@@ -84,20 +84,20 @@ than A100.
<sup>Preliminary measured performance, subject to change. </sup>
<sup>
-TensorRT-LLM v0.7a |
+TensorRTLLM v0.7a |
Llama2-70B |
1xH200 = TP1, 8xH200 = max TP/PP/DP config |
FP8 |
BS: (in order) 960, 960, 192, 560, 96, 640 </sup>
-**TensorRT-LLM GQA now 2.4x faster on H200**
+**TensorRTLLM GQA now 2.4x faster on H200**
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/Falcon180B-H200_DecvOct.png?raw=true" alt="Llama-70B H200 December vs Oct." width="400" height="auto">
<sup>Preliminary measured performance, subject to change.</sup>
docs/source/blogs/XQA-kernel.md: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ Looking at the Throughput-Latency curves below, we see that the enabling of XQA
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/XQA_ThroughputvsLatency.png?raw=true" alt="XQA increased throughput within same latency budget" width="950" height="auto">
-<sub>Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT-LLM v0.8a</sub>
+<sub>Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRTLLM v0.8a</sub>
## Llama-70B on H200 up to 2.4x increased throughput with XQA within same latency budget
docs/source/blogs/quantization-in-TRT-LLM.md: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ The deployment and inference speed of LLMs are often impeded by limitations in m
In this blog, we provide an overview of the quantization features in TensorRT-LLM, share benchmarks, and offer best practices for selecting the appropriate quantization methods tailored to your specific use case.
## Quantization in TensorRT-LLM
-TensorRT-LLM offers a best-in-class unified quantization toolkit to significantly speed up DL/GenAI deployment on NVIDIA hardware while maintaining model accuracy. This toolkit is designed with ease of use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of code. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will expand to more model optimization techniques in the near future.
+TensorRTLLM offers a best-in-class unified quantization toolkit to significantly speed up DL/GenAI deployment on NVIDIA hardware while maintaining model accuracy. This toolkit is designed with ease of use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of code. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will expand to more model optimization techniques in the near future.
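As a rough illustration of that "few lines" workflow, the sketch below calls the quantization example script with FP8 weight and KV-cache settings; the flag names, model path, and output path are assumptions, so consult the user guide linked above for the exact interface:

```bash
# Hypothetical invocation of the PTQ example script; verify the flag names
# against examples/quantization in the repository before use.
python3 examples/quantization/quantize.py \
    --model_dir ./llama-2-70b-hf \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./llama-2-70b-fp8-ckpt
```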
## Benchmark
@@ -63,7 +63,7 @@ Based on specific use cases, users might have different tolerances on accuracy i
\* The performance and impact are measured on 10+ popular LLMs. We'll follow up with more data points.
** Calibration time is subject to the actual model size.
-We note that TensorRT-LLM also offers INT8 and FP8 quantization for the KV cache. The KV cache differs from normal activations because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using the KV cache on Hopper & Ada GPUs, we recommend FP8 KV cache over INT8 because the former has a lower accuracy impact than the latter in most tested cases. Switching from FP16 KV cache to FP8 KV cache also enables you to run a 2-3x larger batch size on an H100 machine for models like GPT-J, which brings a further performance benefit of about 1.5x.
+We note that TensorRTLLM also offers INT8 and FP8 quantization for the KV cache. The KV cache differs from normal activations because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using the KV cache on Hopper & Ada GPUs, we recommend FP8 KV cache over INT8 because the former has a lower accuracy impact than the latter in most tested cases. Switching from FP16 KV cache to FP8 KV cache also enables you to run a 2-3x larger batch size on an H100 machine for models like GPT-J, which brings a further performance benefit of about 1.5x.
## What’s coming next
-TensorRT-LLM continues to improve its quantization features, such as Int4-FP8 AWQ (W4A8) public examples and support for more models. Please stay tuned for our upcoming releases.
+TensorRTLLM continues to improve its quantization features, such as Int4-FP8 AWQ (W4A8) public examples and support for more models. Please stay tuned for our upcoming releases.
docs/source/dev-on-cloud/dev-on-runpod.md: 3 additions & 3 deletions
@@ -1,7 +1,7 @@
(dev-on-runpod)=
-# Develop TensorRT-LLM on Runpod
-[Runpod](https://runpod.io) is a popular cloud platform among many researchers. This doc describes how to develop TensorRT-LLM on Runpod.
+# Develop TensorRTLLM on Runpod
+[Runpod](https://runpod.io) is a popular cloud platform among many researchers. This doc describes how to develop TensorRTLLM on Runpod.
## Prepare
@@ -13,7 +13,7 @@ Please refer to the [Configure SSH Key](https://docs.runpod.io/pods/configuratio
Note that we can skip the step of "Start your Pod. Make sure of the following things" here as we will introduce it below.
-## Build the TensorRT-LLM Docker Image and Upload to DockerHub
+## Build the TensorRTLLM Docker Image and Upload to DockerHub
Please refer to the [Build Image to DockerHub](build-image-to-dockerhub.md).
Note that the docker image must enable ssh access. See [Enable ssh access to the container](build-image-to-dockerhub.md#enable-ssh-access-to-the-container).
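For orientation, a minimal sketch of the upload step; the local image name and the DockerHub namespace are placeholders, and the ssh-enabled image must have been built as described in the linked guide:

```bash
# Tag the locally built, ssh-enabled image and push it to your own DockerHub
# namespace (replace <dockerhub-user> with your account).
docker tag tensorrt_llm/release:latest <dockerhub-user>/tensorrt-llm-devel:latest
docker push <dockerhub-user>/tensorrt-llm-devel:latest
```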