
Commit 4d0e419

nv-guomingz authored and yuanjingx87 committed
[None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … (#7850)
Signed-off-by: nv-guomingz <[email protected]>
1 parent 6244acd commit 4d0e419

File tree

97 files changed: +580 -580 lines changed


README.md

Lines changed: 32 additions & 32 deletions
Large diffs are not rendered by default.

cpp/CMakeLists.txt

Lines changed: 3 additions & 3 deletions
@@ -64,7 +64,7 @@ else()
   message(STATUS "NVTX is enabled")
 endif()
 
-# Add TensorRT-LLM Gen export interface and CUDA support
+# Add TensorRT LLM Gen export interface and CUDA support
 add_compile_definitions("TLLM_GEN_EXPORT_INTERFACE")
 add_compile_definitions("TLLM_ENABLE_CUDA")
 
@@ -134,9 +134,9 @@ execute_process(
   OUTPUT_STRIP_TRAILING_WHITESPACE)
 
 if(TRTLLM_VERSION_RESULT EQUAL 0)
-  message(STATUS "TensorRT-LLM version: ${TRTLLM_VERSION}")
+  message(STATUS "TensorRT LLM version: ${TRTLLM_VERSION}")
 else()
-  message(FATAL_ERROR "Failed to determine Tensorrt-LLM version")
+  message(FATAL_ERROR "Failed to determine TensorRT LLM version")
 endif()
 
 configure_file(

cpp/cmake/modules/cuda_configuration.cmake

Lines changed: 1 addition & 1 deletion
@@ -116,7 +116,7 @@ function(setup_cuda_architectures)
 unset(CMAKE_CUDA_ARCHITECTURES_RAW)
 message(
   STATUS
-  "Setting CMAKE_CUDA_ARCHITECTURES to all enables all architectures TensorRT-LLM optimized for, "
+  "Setting CMAKE_CUDA_ARCHITECTURES to all enables all architectures TensorRT LLM optimized for, "
   "not all architectures CUDA compiler supports.")
 elseif(CMAKE_CUDA_ARCHITECTURES_RAW STREQUAL "all-major")
 message(
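For context, this hunk concerns how `CMAKE_CUDA_ARCHITECTURES` values such as `all` and `all-major` are interpreted. When building through the Python build script quoted later in this commit, the architecture list is normally passed via `--cuda_architectures`; a hedged sketch (the explicit list syntax is an assumption and may vary by release):

```bash
# Build only for the GPUs present on this machine (fastest configure/compile)
./scripts/build_wheel.py --cuda_architectures=native

# Or target specific architectures explicitly; "all" / "all-major" follow the
# CMake semantics discussed in the hunk above (assumed to be accepted here too)
./scripts/build_wheel.py --cuda_architectures="89-real;90-real"
```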

docker/README.md

Lines changed: 4 additions & 4 deletions
@@ -2,12 +2,12 @@
 
 ## Multi-stage Builds with Docker
 
-TensorRT-LLM can be compiled in Docker using a multi-stage build implemented in [`Dockerfile.multi`](Dockerfile.multi).
+TensorRT LLM can be compiled in Docker using a multi-stage build implemented in [`Dockerfile.multi`](Dockerfile.multi).
 The following build stages are defined:
 
 * `devel`: this image provides all dependencies for building TensorRT-LLM.
 * `wheel`: this image contains the source code and the compiled binary distribution.
-* `release`: this image has the binaries installed and contains TensorRT-LLM examples in `/app/tensorrt_llm`.
+* `release`: this image has the binaries installed and contains TensorRT LLM examples in `/app/tensorrt_llm`.
 
 ## Building Docker Images with GNU `make`
 
@@ -19,7 +19,7 @@ separated by `_`. The following actions are available:
 * `<stage>_push`: pushes the docker image for the stage to a docker registry (implies `<stage>_build`).
 * `<stage>_run`: runs the docker image for the stage in a new container.
 
-For example, the `release` stage is built and pushed from the top-level directory of TensorRT-LLM as follows:
+For example, the `release` stage is built and pushed from the top-level directory of TensorRT LLM as follows:
 
 ```bash
 make -C docker release_push
@@ -178,4 +178,4 @@ a corresponding message. The heuristics can be overridden by specifying
 Since Docker rootless mode remaps the UID/GID and the remapped UIDs and GIDs
 (typically configured in `/etc/subuid` and `/etc/subgid`) generally do not coincide
 with the local UID/GID, both IDs need to be translated using a tool like `bindfs` in order to be able to smoothly share a local working directory with any containers
-started with `LOCAL_USER=1`. In this case, the `SOURCE_DIR` and `HOME_DIR` Makefile variables need to be set to the locations of the translated versions of the TensorRT-LLM working copy and the user home directory, respectively.
+started with `LOCAL_USER=1`. In this case, the `SOURCE_DIR` and `HOME_DIR` Makefile variables need to be set to the locations of the translated versions of the TensorRT LLM working copy and the user home directory, respectively.
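For orientation, the `<stage>_<action>` targets documented in this file compose as follows; a hedged sketch based only on the stages and actions listed above (flag availability per stage may differ):

```bash
# Build the devel-stage image from the top-level directory of the repository
make -C docker devel_build

# Start a container from it, mapping the local user into the container
make -C docker devel_run LOCAL_USER=1
```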

docker/develop.md

Lines changed: 15 additions & 15 deletions
@@ -1,34 +1,34 @@
 # Description
 
-TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
-state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to
+TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
+state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to
 create Python and C++ runtimes that orchestrate the inference execution in a performant way.
 
 # Overview
 
-## TensorRT-LLM Develop Container
+## TensorRT LLM Develop Container
 
-The TensorRT-LLM Develop container includes all necessary dependencies to build TensorRT-LLM from source. It is
-specifically designed to be used alongside the source code cloned from the official TensorRT-LLM repository:
+The TensorRT LLM Develop container includes all necessary dependencies to build TensorRT LLM from source. It is
+specifically designed to be used alongside the source code cloned from the official TensorRT LLM repository:
 
 [GitHub Repository - NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
 
-Full instructions for cloning the TensorRT-LLM repository can be found in
-the [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
+Full instructions for cloning the TensorRT LLM repository can be found in
+the [TensorRT LLM Documentation](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
 
-### Running TensorRT-LLM Using Docker
+### Running TensorRT LLM Using Docker
 
-With the top-level directory of the TensorRT-LLM repository cloned to your local machine, you can run the following
+With the top-level directory of the TensorRT LLM repository cloned to your local machine, you can run the following
 command to start the development container:
 
 ```bash
 make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.y.z
 ```
 
-where `x.y.z` is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
+where `x.y.z` is the version of the TensorRT LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
 NVIDIA NGC registry, sets up the local user's account within the container, and launches it with full GPU support. The
-local source code of TensorRT-LLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
-integration. Ensure that the image version matches the version of TensorRT-LLM in your currently checked out local git branch. Not
+local source code of TensorRT LLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
+integration. Ensure that the image version matches the version of TensorRT LLM in your currently checked out local git branch. Not
 specifying a `IMAGE_TAG` will attempt to resolve this automatically, but not every intermediate release might be
 accompanied by a development container. In that case, use the latest version preceding the version of your development
 branch.
@@ -50,9 +50,9 @@ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
 Note that this will start the container with the user `root`, which may leave files with root ownership in your local
 checkout.
 
-### Building the TensorRT-LLM Wheel within the Container
+### Building the TensorRT LLM Wheel within the Container
 
-You can build the TensorRT-LLM Python wheel inside the development container using the following command:
+You can build the TensorRT LLM Python wheel inside the development container using the following command:
 
 ```bash
 ./scripts/build_wheel.py --clean --use_ccache --cuda_architectures=native
@@ -78,7 +78,7 @@ The wheel will be built in the `build` directory and can be installed using `pip
 pip install ./build/tensorrt_llm*.whl
 ```
 
-For additional information on building the TensorRT-LLM wheel, refer to
+For additional information on building the TensorRT LLM wheel, refer to
 the [official documentation on building from source](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#option-1-full-build-with-c-compilation).
 
 ### Security CVEs
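Taken together, the commands quoted in this file form the usual develop-container workflow; a consolidated sketch using only the commands shown above (`x.y.z` stands in for a real container tag):

```bash
# On the host: start the NGC develop container with the local user mapped in
make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.y.z

# Inside the container: build the TensorRT LLM wheel, then install it
./scripts/build_wheel.py --clean --use_ccache --cuda_architectures=native
pip install ./build/tensorrt_llm*.whl
```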

docker/release.md

Lines changed: 10 additions & 10 deletions
@@ -1,18 +1,18 @@
 # Description
 
-TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
-state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to
+TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
+state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to
 create Python and C++ runtimes that orchestrate the inference execution in a performant way.
 
 # Overview
 
-## TensorRT-LLM Release Container
+## TensorRT LLM Release Container
 
-The TensorRT-LLM Release container provides a pre-built environment for running TensorRT-LLM.
+The TensorRT LLM Release container provides a pre-built environment for running TensorRT-LLM.
 
 Visit the [official GitHub repository](https://github.com/NVIDIA/TensorRT-LLM) for more details.
 
-### Running TensorRT-LLM Using Docker
+### Running TensorRT LLM Using Docker
 
 A typical command to launch the container is:
 
@@ -21,16 +21,16 @@ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpu
 nvcr.io/nvidia/tensorrt-llm/release:x.y.z
 ```
 
-where x.y.z is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:
+where x.y.z is the version of the TensorRT LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:
 
 ```bash
 python3 -c "import tensorrt_llm"
 ```
 
-This command will print the TensorRT-LLM version if everything is working correctly. After verification, you can explore
+This command will print the TensorRT LLM version if everything is working correctly. After verification, you can explore
 and try the example scripts included in `/app/tensorrt_llm/examples`.
 
-Alternatively, if you have already cloned the TensorRT-LLM repository, you can use the following convenient command to
+Alternatively, if you have already cloned the TensorRT LLM repository, you can use the following convenient command to
 run the container:
 
 ```bash
@@ -43,8 +43,8 @@ container, and launches it with full GPU support.
 For comprehensive information about TensorRT-LLM, including documentation, source code, examples, and installation
 guidelines, visit the following official resources:
 
-- [TensorRT-LLM GitHub Repository](https://github.com/NVIDIA/TensorRT-LLM)
-- [TensorRT-LLM Online Documentation](https://nvidia.github.io/TensorRT-LLM/latest/index.html)
+- [TensorRT LLM GitHub Repository](https://github.com/NVIDIA/TensorRT-LLM)
+- [TensorRT LLM Online Documentation](https://nvidia.github.io/TensorRT-LLM/latest/index.html)
 
 ### Security CVEs
 
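Combining the launch command and sanity check quoted in this file, a quick smoke test of the release image might look like the following sketch (`--gpus all` is assumed here for GPU access, since the hunk context is truncated, and `x.y.z` stands for an actual release tag):

```bash
# Launch the release container with GPU access (flags as documented above;
# --gpus all is an assumption)
docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    nvcr.io/nvidia/tensorrt-llm/release:x.y.z

# Inside the container: confirm the installation imports cleanly
python3 -c "import tensorrt_llm"
```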

docs/source/blogs/Falcon180B-H200.md

Lines changed: 5 additions & 5 deletions
@@ -21,7 +21,7 @@ memory footprint, allows for great performance on Falcon-180B on a single GPU.
 
 <sup>Preliminary measured Performance, subject to change. TP1 does not represent peak performance on H200. </sup>
 <sup>
-TensorRT-LLM v0.7a |
+TensorRT LLM v0.7a |
 Falcon-180B |
 1xH200 TP1 |
 INT4 AWQ |
@@ -38,7 +38,7 @@ while maintaining high accuracy.
 
 <sup>Preliminary measured accuracy, subject to change. </sup>
 <sup>
-TensorRT-LLM v0.7a |
+TensorRT LLM v0.7a |
 Falcon-180B |
 1xH200 TP1 |
 INT4 AWQ
@@ -84,20 +84,20 @@ than A100.
 
 <sup>Preliminary measured performance, subject to change. </sup>
 <sup>
-TensorRT-LLM v0.7a |
+TensorRT LLM v0.7a |
 Llama2-70B |
 1xH200 = TP1, 8xH200 = max TP/PP/DP config |
 FP8 |
 BS: (in order) 960, 960, 192, 560, 96, 640 </sup>
 
 
-**TensorRT-LLM GQA now 2.4x faster on H200**
+**TensorRT LLM GQA now 2.4x faster on H200**
 
 <img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/Falcon180B-H200_DecvOct.png?raw=true" alt="Llama-70B H200 December vs Oct." width="400" height="auto">
 
 <sup>Preliminary measured performance, subject to change.</sup>
 <sup>
-TensorRT-LLM v0.7a vs TensorRT-LLM v0.6a |
+TensorRT LLM v0.7a vs TensorRT LLM v0.6a |
 Llama2-70B |
 1xH200 TP1 |
 FP8 |

docs/source/blogs/XQA-kernel.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ Looking at the Throughput-Latency curves below, we see that the enabling of XQA
 
 <img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/XQA_ThroughputvsLatency.png?raw=true" alt="XQA increased throughput within same latency budget" width="950" height="auto">
 
-<sub>Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT-LLM v0.8a</sub>
+<sub>Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT LLM v0.8a</sub>
 
 
 ## Llama-70B on H200 up to 2.4x increased throughput with XQA within same latency budget

docs/source/blogs/quantization-in-TRT-LLM.md

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ The deployment and inference speed of LLMs are often impeded by limitations in m
 In this blog, we provide an overview of the quantization features in TensorRT-LLM, share benchmark, and offer best practices of selecting the appropriate quantization methods tailored to your specific use case.
 
 ## Quantization in TensorRT-LLM
-TensorRT-LLM offers a best-in-class unified quantization toolkit to significantly speedup DL/GenAI deployment on NVIDIA hardware, while maintaining model accuracy. This toolkit is designed with easy-of-use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of codes. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will soon expand to more model optimization techniques in the near future.
+TensorRT LLM offers a best-in-class unified quantization toolkit to significantly speedup DL/GenAI deployment on NVIDIA hardware, while maintaining model accuracy. This toolkit is designed with easy-of-use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of codes. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will soon expand to more model optimization techniques in the near future.
 
 ## Benchmark
 
@@ -63,7 +63,7 @@ Based on specific use cases, users might have different tolerances on accuracy i
 \* The performance and impact are measured on 10+ popular LLMs. We'll follow up with more data points.
 ** Calibration time is subject to the actual model size.
 
-We note that TensorRT-LLM also offers INT8 and FP8 quantization for KV cache. KV cache differs from normal activation because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using KV cache on Hopper & Ada GPUs, We recommend using FP8 KV cache over Int8 because the former has a lower accuracy impact than the latter in most tested cases. When switching from FP16 KV cache to FP8 KV cache, it also enables you to run 2-3x larger batch size on H100 machine for models like GPT-J which further brings about 1.5x performance benefit.
+We note that TensorRT LLM also offers INT8 and FP8 quantization for KV cache. KV cache differs from normal activation because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using KV cache on Hopper & Ada GPUs, We recommend using FP8 KV cache over Int8 because the former has a lower accuracy impact than the latter in most tested cases. When switching from FP16 KV cache to FP8 KV cache, it also enables you to run 2-3x larger batch size on H100 machine for models like GPT-J which further brings about 1.5x performance benefit.
 
 ## What’s coming next
-TensorRT-LLM continues to make improvements on our quantization features, such as Int4-FP8 AWQ (W4A8) public examples and more model supports. Please stay tuned for our upcoming releases.
+TensorRT LLM continues to make improvements on our quantization features, such as Int4-FP8 AWQ (W4A8) public examples and more model supports. Please stay tuned for our upcoming releases.
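The user guide linked in this hunk drives post-training quantization through a quantization script in `examples/quantization`; an illustrative sketch is given below. The script name and flags are recalled from that guide and may differ between releases, so treat them as assumptions and defer to the linked guide for the authoritative interface.

```bash
# Quantize a Hugging Face checkpoint to FP8 weights and FP8 KV cache
# (illustrative paths and flags; replace with your own model and output dirs)
python examples/quantization/quantize.py \
    --model_dir ./Llama-2-7b-hf \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./llama-2-7b-fp8-ckpt
```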

docs/source/dev-on-cloud/dev-on-runpod.md

Lines changed: 3 additions & 3 deletions
@@ -1,7 +1,7 @@
 (dev-on-runpod)=
 
-# Develop TensorRT-LLM on Runpod
-[Runpod](https://runpod.io) is a popular cloud platform among many researchers. This doc describes how to develop TensorRT-LLM on Runpod.
+# Develop TensorRT LLM on Runpod
+[Runpod](https://runpod.io) is a popular cloud platform among many researchers. This doc describes how to develop TensorRT LLM on Runpod.
 
 ## Prepare
 
@@ -13,7 +13,7 @@ Please refer to the [Configure SSH Key](https://docs.runpod.io/pods/configuratio
 
 Note that we can skip the step of "Start your Pod. Make sure of the following things" here as we will introduce it below.
 
-## Build the TensorRT-LLM Docker Image and Upload to DockerHub
+## Build the TensorRT LLM Docker Image and Upload to DockerHub
 Please refer to the [Build Image to DockerHub](build-image-to-dockerhub.md).
 
 Note that the docker image must enable ssh access. See on [Enable ssh access to the container](build-image-to-dockerhub.md#enable-ssh-access-to-the-container).
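The DockerHub workflow referenced here boils down to tagging the locally built image and pushing it to a registry that Runpod can pull from; a sketch with placeholder names (the local image name and tags are assumptions, so use whatever your build actually produced):

```bash
# Tag the locally built TensorRT LLM image for your Docker Hub account
docker tag tensorrt_llm/release:latest <your-dockerhub-user>/tensorrt-llm:dev

# Push it so a Runpod pod can pull the image
docker push <your-dockerhub-user>/tensorrt-llm:dev
```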
