
Commit ae1a389

nv-guomingz authored and dominicshanshan committed
[None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … (NVIDIA#7850)
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
1 parent 1907ab8 commit ae1a389

File tree

95 files changed (+842, −582 lines)


README.md

Lines changed: 35 additions & 35 deletions
Large diffs are not rendered by default.

cpp/CMakeLists.txt

Lines changed: 3 additions & 3 deletions
```diff
@@ -68,7 +68,7 @@ else()
   message(STATUS "NVTX is enabled")
 endif()
 
-# Add TensorRT-LLM Gen export interface and CUDA support
+# Add TensorRT LLM Gen export interface and CUDA support
 add_compile_definitions("TLLM_GEN_EXPORT_INTERFACE")
 add_compile_definitions("TLLM_ENABLE_CUDA")
 
@@ -138,9 +138,9 @@ execute_process(
   OUTPUT_STRIP_TRAILING_WHITESPACE)
 
 if(TRTLLM_VERSION_RESULT EQUAL 0)
-  message(STATUS "TensorRT-LLM version: ${TRTLLM_VERSION}")
+  message(STATUS "TensorRT LLM version: ${TRTLLM_VERSION}")
 else()
-  message(FATAL_ERROR "Failed to determine Tensorrt-LLM version")
+  message(FATAL_ERROR "Failed to determine TensorRT LLM version")
 endif()
 
 configure_file(
```

cpp/cmake/modules/cuda_configuration.cmake

Lines changed: 1 addition & 1 deletion
```diff
@@ -116,7 +116,7 @@ function(setup_cuda_architectures)
     unset(CMAKE_CUDA_ARCHITECTURES_RAW)
     message(
       STATUS
-      "Setting CMAKE_CUDA_ARCHITECTURES to all enables all architectures TensorRT-LLM optimized for, "
+      "Setting CMAKE_CUDA_ARCHITECTURES to all enables all architectures TensorRT LLM optimized for, "
       "not all architectures CUDA compiler supports.")
   elseif(CMAKE_CUDA_ARCHITECTURES_RAW STREQUAL "all-major")
     message(
```

docker/README.md

Lines changed: 4 additions & 4 deletions
````diff
@@ -2,12 +2,12 @@
 
 ## Multi-stage Builds with Docker
 
-TensorRT-LLM can be compiled in Docker using a multi-stage build implemented in [`Dockerfile.multi`](Dockerfile.multi).
+TensorRT LLM can be compiled in Docker using a multi-stage build implemented in [`Dockerfile.multi`](Dockerfile.multi).
 The following build stages are defined:
 
 * `devel`: this image provides all dependencies for building TensorRT-LLM.
 * `wheel`: this image contains the source code and the compiled binary distribution.
-* `release`: this image has the binaries installed and contains TensorRT-LLM examples in `/app/tensorrt_llm`.
+* `release`: this image has the binaries installed and contains TensorRT LLM examples in `/app/tensorrt_llm`.
 
 ## Building Docker Images with GNU `make`
 
@@ -19,7 +19,7 @@ separated by `_`. The following actions are available:
 * `<stage>_push`: pushes the docker image for the stage to a docker registry (implies `<stage>_build`).
 * `<stage>_run`: runs the docker image for the stage in a new container.
 
-For example, the `release` stage is built and pushed from the top-level directory of TensorRT-LLM as follows:
+For example, the `release` stage is built and pushed from the top-level directory of TensorRT LLM as follows:
 
 ```bash
 make -C docker release_push
@@ -178,4 +178,4 @@ a corresponding message. The heuristics can be overridden by specifying
 Since Docker rootless mode remaps the UID/GID and the remapped UIDs and GIDs
 (typically configured in `/etc/subuid` and `/etc/subgid`) generally do not coincide
 with the local UID/GID, both IDs need to be translated using a tool like `bindfs` in order to be able to smoothly share a local working directory with any containers
-started with `LOCAL_USER=1`. In this case, the `SOURCE_DIR` and `HOME_DIR` Makefile variables need to be set to the locations of the translated versions of the TensorRT-LLM working copy and the user home directory, respectively.
+started with `LOCAL_USER=1`. In this case, the `SOURCE_DIR` and `HOME_DIR` Makefile variables need to be set to the locations of the translated versions of the TensorRT LLM working copy and the user home directory, respectively.
````
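For orientation, the `make` targets mentioned in this diff follow a `<stage>_<action>` pattern. Below is a minimal sketch of the `release` stage workflow, assuming only the targets the file itself names (`release_push` implies `release_build`, and `release_run` starts a container):

```bash
# Run from the top-level directory of the TensorRT LLM working copy.
make -C docker release_build             # build the release image
make -C docker release_push              # push to the registry (implies release_build)
make -C docker release_run               # start a container from the release image

# LOCAL_USER=1 keeps files created in a shared working directory owned by the
# local user, as discussed in the rootless-mode paragraph above.
make -C docker release_run LOCAL_USER=1
```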

docker/develop.md

Lines changed: 16 additions & 16 deletions
````diff
@@ -1,37 +1,37 @@
 # Description
 
-TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
-state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to
+TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
+state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to
 create Python and C++ runtimes that orchestrate the inference execution in a performant way.
 
 # Overview
 
-## TensorRT-LLM Develop Container
+## TensorRT LLM Develop Container
 
-The TensorRT-LLM Develop container includes all necessary dependencies to build TensorRT-LLM from source. It is
-specifically designed to be used alongside the source code cloned from the official TensorRT-LLM repository:
+The TensorRT LLM Develop container includes all necessary dependencies to build TensorRT LLM from source. It is
+specifically designed to be used alongside the source code cloned from the official TensorRT LLM repository:
 
 [GitHub Repository - NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
 
-Full instructions for cloning the TensorRT-LLM repository can be found in
-the [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
+Full instructions for cloning the TensorRT LLM repository can be found in
+the [TensorRT LLM Documentation](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
 
 > **Note:**
-> This container does not contain a pre-built binary release of `TensorRT-LLM` or tools like `trtllm-serve`.
+> This container does not contain a pre-built binary release of `TensorRT LLM` or tools like `trtllm-serve`.
 
-### Running the TensorRT-LLM Develop Container Using Docker
+### Running the TensorRT LLM Develop Container Using Docker
 
-With the top-level directory of the TensorRT-LLM repository cloned to your local machine, you can run the following
+With the top-level directory of the TensorRT LLM repository cloned to your local machine, you can run the following
 command to start the development container:
 
 ```bash
 make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.y.z
 ```
 
-where `x.y.z` is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
+where `x.y.z` is the version of the TensorRT LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
 NVIDIA NGC registry, sets up the local user's account within the container, and launches it with full GPU support. The
-local source code of TensorRT-LLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
-integration. Ensure that the image version matches the version of TensorRT-LLM in your currently checked out local git branch. Not
+local source code of TensorRT LLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
+integration. Ensure that the image version matches the version of TensorRT LLM in your currently checked out local git branch. Not
 specifying a `IMAGE_TAG` will attempt to resolve this automatically, but not every intermediate release might be
 accompanied by a development container. In that case, use the latest version preceding the version of your development
 branch.
@@ -53,9 +53,9 @@ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
 Note that this will start the container with the user `root`, which may leave files with root ownership in your local
 checkout.
 
-### Building the TensorRT-LLM Wheel within the Container
+### Building the TensorRT LLM Wheel within the Container
 
-You can build the TensorRT-LLM Python wheel inside the development container using the following command:
+You can build the TensorRT LLM Python wheel inside the development container using the following command:
 
 ```bash
 ./scripts/build_wheel.py --clean --use_ccache --cuda_architectures=native
@@ -81,7 +81,7 @@ The wheel will be built in the `build` directory and can be installed using `pip
 pip install ./build/tensorrt_llm*.whl
 ```
 
-For additional information on building the TensorRT-LLM wheel, refer to
+For additional information on building the TensorRT LLM wheel, refer to
 the [official documentation on building from source](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#option-1-full-build-with-c-compilation).
 
 ### Security CVEs
````
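Taken together, the commands quoted in this file amount to the following develop-container workflow. This is a sketch built only from the commands shown above; `x.y.z` remains a placeholder tag, and the mount path is the one the file describes:

```bash
# On the host: start the develop container from the cloned repository.
make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.y.z

# Inside the container: the local checkout is mounted at /code/tensorrt_llm.
cd /code/tensorrt_llm

# Build the TensorRT LLM Python wheel.
./scripts/build_wheel.py --clean --use_ccache --cuda_architectures=native

# Install the wheel produced in the build directory.
pip install ./build/tensorrt_llm*.whl
```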

docker/release.md

Lines changed: 10 additions & 10 deletions
````diff
@@ -1,18 +1,18 @@
 # Description
 
-TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
-state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to
+TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
+state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to
 create Python and C++ runtimes that orchestrate the inference execution in a performant way.
 
 # Overview
 
-## TensorRT-LLM Release Container
+## TensorRT LLM Release Container
 
-The TensorRT-LLM Release container provides a pre-built environment for running TensorRT-LLM.
+The TensorRT LLM Release container provides a pre-built environment for running TensorRT-LLM.
 
 Visit the [official GitHub repository](https://github.com/NVIDIA/TensorRT-LLM) for more details.
 
-### Running TensorRT-LLM Using Docker
+### Running TensorRT LLM Using Docker
 
 A typical command to launch the container is:
 
@@ -21,16 +21,16 @@ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpu
 nvcr.io/nvidia/tensorrt-llm/release:x.y.z
 ```
 
-where x.y.z is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:
+where x.y.z is the version of the TensorRT LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:
 
 ```bash
 python3 -c "import tensorrt_llm"
 ```
 
-This command will print the TensorRT-LLM version if everything is working correctly. After verification, you can explore
+This command will print the TensorRT LLM version if everything is working correctly. After verification, you can explore
 and try the example scripts included in `/app/tensorrt_llm/examples`.
 
-Alternatively, if you have already cloned the TensorRT-LLM repository, you can use the following convenient command to
+Alternatively, if you have already cloned the TensorRT LLM repository, you can use the following convenient command to
 run the container:
 
 ```bash
@@ -43,8 +43,8 @@ container, and launches it with full GPU support.
 For comprehensive information about TensorRT-LLM, including documentation, source code, examples, and installation
 guidelines, visit the following official resources:
 
-- [TensorRT-LLM GitHub Repository](https://github.com/NVIDIA/TensorRT-LLM)
-- [TensorRT-LLM Online Documentation](https://nvidia.github.io/TensorRT-LLM/latest/index.html)
+- [TensorRT LLM GitHub Repository](https://github.com/NVIDIA/TensorRT-LLM)
+- [TensorRT LLM Online Documentation](https://nvidia.github.io/TensorRT-LLM/latest/index.html)
 
 ### Security CVEs
 
````
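For reference, the launch-and-verify flow quoted in this file looks roughly like the sketch below. The hunk header above truncates the full `docker run` command, so the GPU flag shown here is an assumption rather than the file's exact invocation:

```bash
# Launch the release container (the --gpus flag is assumed; the other flags
# come from the truncated hunk header above).
docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus all nvcr.io/nvidia/tensorrt-llm/release:x.y.z

# Inside the container: sanity check the installation; per the file, this
# prints the TensorRT LLM version when everything works.
python3 -c "import tensorrt_llm"

# Example scripts ship under /app/tensorrt_llm/examples.
ls /app/tensorrt_llm/examples
```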

docs/source/blogs/Falcon180B-H200.md

Lines changed: 5 additions & 5 deletions
```diff
@@ -21,7 +21,7 @@ memory footprint, allows for great performance on Falcon-180B on a single GPU.
 
 <sup>Preliminary measured Performance, subject to change. TP1 does not represent peak performance on H200. </sup>
 <sup>
-TensorRT-LLM v0.7a |
+TensorRT LLM v0.7a |
 Falcon-180B |
 1xH200 TP1 |
 INT4 AWQ |
@@ -38,7 +38,7 @@ while maintaining high accuracy.
 
 <sup>Preliminary measured accuracy, subject to change. </sup>
 <sup>
-TensorRT-LLM v0.7a |
+TensorRT LLM v0.7a |
 Falcon-180B |
 1xH200 TP1 |
 INT4 AWQ
@@ -84,20 +84,20 @@ than A100.
 
 <sup>Preliminary measured performance, subject to change. </sup>
 <sup>
-TensorRT-LLM v0.7a |
+TensorRT LLM v0.7a |
 Llama2-70B |
 1xH200 = TP1, 8xH200 = max TP/PP/DP config |
 FP8 |
 BS: (in order) 960, 960, 192, 560, 96, 640 </sup>
 
 
-**TensorRT-LLM GQA now 2.4x faster on H200**
+**TensorRT LLM GQA now 2.4x faster on H200**
 
 <img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/Falcon180B-H200_DecvOct.png?raw=true" alt="Llama-70B H200 December vs Oct." width="400" height="auto">
 
 <sup>Preliminary measured performance, subject to change.</sup>
 <sup>
-TensorRT-LLM v0.7a vs TensorRT-LLM v0.6a |
+TensorRT LLM v0.7a vs TensorRT LLM v0.6a |
 Llama2-70B |
 1xH200 TP1 |
 FP8 |
```

docs/source/blogs/XQA-kernel.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -10,7 +10,7 @@ Looking at the Throughput-Latency curves below, we see that the enabling of XQA
 
 <img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/XQA_ThroughputvsLatency.png?raw=true" alt="XQA increased throughput within same latency budget" width="950" height="auto">
 
-<sub>Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT-LLM v0.8a</sub>
+<sub>Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT LLM v0.8a</sub>
 
 
 ## Llama-70B on H200 up to 2.4x increased throughput with XQA within same latency budget
```

docs/source/blogs/quantization-in-TRT-LLM.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -5,7 +5,7 @@ The deployment and inference speed of LLMs are often impeded by limitations in m
 In this blog, we provide an overview of the quantization features in TensorRT-LLM, share benchmark, and offer best practices of selecting the appropriate quantization methods tailored to your specific use case.
 
 ## Quantization in TensorRT-LLM
-TensorRT-LLM offers a best-in-class unified quantization toolkit to significantly speedup DL/GenAI deployment on NVIDIA hardware, while maintaining model accuracy. This toolkit is designed with easy-of-use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of codes. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will soon expand to more model optimization techniques in the near future.
+TensorRT LLM offers a best-in-class unified quantization toolkit to significantly speedup DL/GenAI deployment on NVIDIA hardware, while maintaining model accuracy. This toolkit is designed with easy-of-use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of codes. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will soon expand to more model optimization techniques in the near future.
 
 ## Benchmark
 
@@ -63,7 +63,7 @@ Based on specific use cases, users might have different tolerances on accuracy i
 \* The performance and impact are measured on 10+ popular LLMs. We'll follow up with more data points.
 ** Calibration time is subject to the actual model size.
 
-We note that TensorRT-LLM also offers INT8 and FP8 quantization for KV cache. KV cache differs from normal activation because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using KV cache on Hopper & Ada GPUs, We recommend using FP8 KV cache over Int8 because the former has a lower accuracy impact than the latter in most tested cases. When switching from FP16 KV cache to FP8 KV cache, it also enables you to run 2-3x larger batch size on H100 machine for models like GPT-J which further brings about 1.5x performance benefit.
+We note that TensorRT LLM also offers INT8 and FP8 quantization for KV cache. KV cache differs from normal activation because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using KV cache on Hopper & Ada GPUs, We recommend using FP8 KV cache over Int8 because the former has a lower accuracy impact than the latter in most tested cases. When switching from FP16 KV cache to FP8 KV cache, it also enables you to run 2-3x larger batch size on H100 machine for models like GPT-J which further brings about 1.5x performance benefit.
 
 ## What’s coming next
-TensorRT-LLM continues to make improvements on our quantization features, such as Int4-FP8 AWQ (W4A8) public examples and more model supports. Please stay tuned for our upcoming releases.
+TensorRT LLM continues to make improvements on our quantization features, such as Int4-FP8 AWQ (W4A8) public examples and more model supports. Please stay tuned for our upcoming releases.
```

docs/source/developer-guide/dev-containers.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -1,10 +1,10 @@
 # Using Dev Containers
 
-The TensorRT-LLM repository contains a [Dev Containers](https://containers.dev/)
+The TensorRT LLM repository contains a [Dev Containers](https://containers.dev/)
 configuration in `.devcontainer`. These files are intended for
 use with [Visual Studio Code](https://code.visualstudio.com/).
 
-Due to the various container options supported by TensorRT-LLM (see
+Due to the various container options supported by TensorRT LLM (see
 [](/installation/build-from-source-linux.md) and
 <https://github.com/NVIDIA/TensorRT-LLM/tree/main/docker>), the Dev
 Container configuration also offers some degree of customization.
```
