64 changes: 32 additions & 32 deletions README.md

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions cpp/CMakeLists.txt
@@ -64,7 +64,7 @@ else()
message(STATUS "NVTX is enabled")
endif()

# Add TensorRT-LLM Gen export interface and CUDA support
# Add TensorRT LLM Gen export interface and CUDA support
add_compile_definitions("TLLM_GEN_EXPORT_INTERFACE")
add_compile_definitions("TLLM_ENABLE_CUDA")

@@ -134,9 +134,9 @@ execute_process(
OUTPUT_STRIP_TRAILING_WHITESPACE)

if(TRTLLM_VERSION_RESULT EQUAL 0)
message(STATUS "TensorRT-LLM version: ${TRTLLM_VERSION}")
message(STATUS "TensorRT LLM version: ${TRTLLM_VERSION}")
else()
message(FATAL_ERROR "Failed to determine Tensorrt-LLM version")
message(FATAL_ERROR "Failed to determine TensorRT LLM version")
endif()

configure_file(
2 changes: 1 addition & 1 deletion cpp/cmake/modules/cuda_configuration.cmake
@@ -116,7 +116,7 @@ function(setup_cuda_architectures)
unset(CMAKE_CUDA_ARCHITECTURES_RAW)
message(
STATUS
"Setting CMAKE_CUDA_ARCHITECTURES to all enables all architectures TensorRT-LLM optimized for, "
"Setting CMAKE_CUDA_ARCHITECTURES to all enables all architectures TensorRT LLM optimized for, "
"not all architectures CUDA compiler supports.")
elseif(CMAKE_CUDA_ARCHITECTURES_RAW STREQUAL "all-major")
message(
8 changes: 4 additions & 4 deletions docker/README.md
@@ -2,12 +2,12 @@

## Multi-stage Builds with Docker

TensorRT-LLM can be compiled in Docker using a multi-stage build implemented in [`Dockerfile.multi`](Dockerfile.multi).
TensorRT LLM can be compiled in Docker using a multi-stage build implemented in [`Dockerfile.multi`](Dockerfile.multi).
The following build stages are defined:

* `devel`: this image provides all dependencies for building TensorRT-LLM.
* `wheel`: this image contains the source code and the compiled binary distribution.
* `release`: this image has the binaries installed and contains TensorRT-LLM examples in `/app/tensorrt_llm`.
* `release`: this image has the binaries installed and contains TensorRT LLM examples in `/app/tensorrt_llm`.
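
For illustration, here is a minimal sketch of building a single stage directly with `docker build`. The `--target` names follow the stage list above; the image tag and build context are assumptions, and the `make` targets described below remain the supported workflow:

```bash
# Sketch only: build one stage of Dockerfile.multi directly.
# Target names come from the stage list above; tag and context are illustrative.
docker build -f docker/Dockerfile.multi --target devel -t tensorrt_llm/devel:local .
```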

## Building Docker Images with GNU `make`

@@ -19,7 +19,7 @@ separated by `_`. The following actions are available:
* `<stage>_push`: pushes the docker image for the stage to a docker registry (implies `<stage>_build`).
* `<stage>_run`: runs the docker image for the stage in a new container.

For example, the `release` stage is built and pushed from the top-level directory of TensorRT-LLM as follows:
For example, the `release` stage is built and pushed from the top-level directory of TensorRT LLM as follows:

```bash
make -C docker release_push
```

@@ -178,4 +178,4 @@ a corresponding message. The heuristics can be overridden by specifying
Since Docker rootless mode remaps the UID/GID and the remapped UIDs and GIDs
(typically configured in `/etc/subuid` and `/etc/subgid`) generally do not coincide
with the local UID/GID, both IDs need to be translated using a tool like `bindfs` in order to be able to smoothly share a local working directory with any containers
started with `LOCAL_USER=1`. In this case, the `SOURCE_DIR` and `HOME_DIR` Makefile variables need to be set to the locations of the translated versions of the TensorRT-LLM working copy and the user home directory, respectively.
started with `LOCAL_USER=1`. In this case, the `SOURCE_DIR` and `HOME_DIR` Makefile variables need to be set to the locations of the translated versions of the TensorRT LLM working copy and the user home directory, respectively.
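
A hypothetical sketch of this setup, with `<subuid>`/`<subgid>` standing in for the values from `/etc/subuid` and `/etc/subgid`; the exact mapping and mount points depend on your system and are illustrative only:

```bash
# Hypothetical sketch: expose UID/GID-translated views of the working copy and
# home directory for rootless Docker, then point the Makefile at them.
# Replace <subuid>/<subgid> with your values from /etc/subuid and /etc/subgid.
mkdir -p /tmp/trtllm-code /tmp/trtllm-home
bindfs --map="$(id -u)/<subuid>:@$(id -g)/@<subgid>" "$PWD"  /tmp/trtllm-code
bindfs --map="$(id -u)/<subuid>:@$(id -g)/@<subgid>" "$HOME" /tmp/trtllm-home
make -C docker ngc-devel_run LOCAL_USER=1 \
  SOURCE_DIR=/tmp/trtllm-code HOME_DIR=/tmp/trtllm-home
```
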
30 changes: 15 additions & 15 deletions docker/develop.md
@@ -1,34 +1,34 @@
# Description

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to
create Python and C++ runtimes that orchestrate the inference execution in a performant way.

# Overview

## TensorRT-LLM Develop Container
## TensorRT LLM Develop Container

The TensorRT-LLM Develop container includes all necessary dependencies to build TensorRT-LLM from source. It is
specifically designed to be used alongside the source code cloned from the official TensorRT-LLM repository:
The TensorRT LLM Develop container includes all necessary dependencies to build TensorRT LLM from source. It is
specifically designed to be used alongside the source code cloned from the official TensorRT LLM repository:

[GitHub Repository - NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)

Full instructions for cloning the TensorRT-LLM repository can be found in
the [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
Full instructions for cloning the TensorRT LLM repository can be found in
the [TensorRT LLM Documentation](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).

### Running TensorRT-LLM Using Docker
### Running TensorRT LLM Using Docker

With the top-level directory of the TensorRT-LLM repository cloned to your local machine, you can run the following
With the top-level directory of the TensorRT LLM repository cloned to your local machine, you can run the following
command to start the development container:

```bash
make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.y.z
```

where `x.y.z` is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
where `x.y.z` is the version of the TensorRT LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
NVIDIA NGC registry, sets up the local user's account within the container, and launches it with full GPU support. The
local source code of TensorRT-LLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
integration. Ensure that the image version matches the version of TensorRT-LLM in your currently checked out local git branch. Not
local source code of TensorRT LLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
integration. Ensure that the image version matches the version of TensorRT LLM in your currently checked out local git branch. Not
specifying a `IMAGE_TAG` will attempt to resolve this automatically, but not every intermediate release might be
accompanied by a development container. In that case, use the latest version preceding the version of your development
branch.
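
As a sketch, assuming the checkout records its version in `tensorrt_llm/version.py` (the file location may change between branches), a matching tag can be derived from the sources:

```bash
# Sketch: derive an IMAGE_TAG matching the checked-out sources.
# Assumes the version string lives in tensorrt_llm/version.py.
TRTLLM_VERSION=$(sed -n 's/^__version__ = "\(.*\)"/\1/p' tensorrt_llm/version.py)
make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG="${TRTLLM_VERSION}"
```
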
@@ -50,9 +50,9 @@ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
Note that this will start the container with the user `root`, which may leave files with root ownership in your local
checkout.

### Building the TensorRT-LLM Wheel within the Container
### Building the TensorRT LLM Wheel within the Container

You can build the TensorRT-LLM Python wheel inside the development container using the following command:
You can build the TensorRT LLM Python wheel inside the development container using the following command:

```bash
./scripts/build_wheel.py --clean --use_ccache --cuda_architectures=native
```

@@ -78,7 +78,7 @@ The wheel will be built in the `build` directory and can be installed using `pip`:

```bash
pip install ./build/tensorrt_llm*.whl
```

For additional information on building the TensorRT-LLM wheel, refer to
For additional information on building the TensorRT LLM wheel, refer to
the [official documentation on building from source](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#option-1-full-build-with-c-compilation).
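
As a further illustration, the same build script can be pointed at an explicit CUDA architecture list instead of `native`; the semicolon-separated value below follows CMake conventions and is only an assumed example:

```bash
# Sketch: build for an explicit architecture list instead of `native`.
# The "80-real;90-real" value is illustrative; adjust to your target GPUs.
./scripts/build_wheel.py --clean --use_ccache --cuda_architectures="80-real;90-real"
```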

### Security CVEs
20 changes: 10 additions & 10 deletions docker/release.md
@@ -1,18 +1,18 @@
# Description

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to
create Python and C++ runtimes that orchestrate the inference execution in a performant way.

# Overview

## TensorRT-LLM Release Container
## TensorRT LLM Release Container

The TensorRT-LLM Release container provides a pre-built environment for running TensorRT-LLM.
The TensorRT LLM Release container provides a pre-built environment for running TensorRT-LLM.

Visit the [official GitHub repository](https://github.com/NVIDIA/TensorRT-LLM) for more details.

### Running TensorRT-LLM Using Docker
### Running TensorRT LLM Using Docker

A typical command to launch the container is:

@@ -21,16 +21,16 @@

```bash
docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpu
nvcr.io/nvidia/tensorrt-llm/release:x.y.z
```

where x.y.z is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:
where x.y.z is the version of the TensorRT LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:

```bash
python3 -c "import tensorrt_llm"
```

This command will print the TensorRT-LLM version if everything is working correctly. After verification, you can explore
This command will print the TensorRT LLM version if everything is working correctly. After verification, you can explore
and try the example scripts included in `/app/tensorrt_llm/examples`.
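
For example, a quick look around inside the running container; the examples path comes from the text above, and the `__version__` attribute is assumed to be exported by the package:

```bash
# Sanity check and a first look at the bundled examples (run inside the container).
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
ls /app/tensorrt_llm/examples
```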

Alternatively, if you have already cloned the TensorRT-LLM repository, you can use the following convenient command to
Alternatively, if you have already cloned the TensorRT LLM repository, you can use the following convenient command to
run the container:

@@ -43,8 +43,8 @@ container, and launches it with full GPU support.
For comprehensive information about TensorRT-LLM, including documentation, source code, examples, and installation
guidelines, visit the following official resources:

- [TensorRT-LLM GitHub Repository](https://github.com/NVIDIA/TensorRT-LLM)
- [TensorRT-LLM Online Documentation](https://nvidia.github.io/TensorRT-LLM/latest/index.html)
- [TensorRT LLM GitHub Repository](https://github.com/NVIDIA/TensorRT-LLM)
- [TensorRT LLM Online Documentation](https://nvidia.github.io/TensorRT-LLM/latest/index.html)

### Security CVEs

10 changes: 5 additions & 5 deletions docs/source/blogs/Falcon180B-H200.md
@@ -21,7 +21,7 @@ memory footprint, allows for great performance on Falcon-180B on a single GPU.

<sup>Preliminary measured Performance, subject to change. TP1 does not represent peak performance on H200. </sup>
<sup>
TensorRT-LLM v0.7a |
TensorRT LLM v0.7a |
Falcon-180B |
1xH200 TP1 |
INT4 AWQ |
@@ -38,7 +38,7 @@ while maintaining high accuracy.

<sup>Preliminary measured accuracy, subject to change. </sup>
<sup>
TensorRT-LLM v0.7a |
TensorRT LLM v0.7a |
Falcon-180B |
1xH200 TP1 |
INT4 AWQ
@@ -84,20 +84,20 @@ than A100.

<sup>Preliminary measured performance, subject to change. </sup>
<sup>
TensorRT-LLM v0.7a |
TensorRT LLM v0.7a |
Llama2-70B |
1xH200 = TP1, 8xH200 = max TP/PP/DP config |
FP8 |
BS: (in order) 960, 960, 192, 560, 96, 640 </sup>


**TensorRT-LLM GQA now 2.4x faster on H200**
**TensorRT LLM GQA now 2.4x faster on H200**

<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/Falcon180B-H200_DecvOct.png?raw=true" alt="Llama-70B H200 December vs Oct." width="400" height="auto">

<sup>Preliminary measured performance, subject to change.</sup>
<sup>
TensorRT-LLM v0.7a vs TensorRT-LLM v0.6a |
TensorRT LLM v0.7a vs TensorRT LLM v0.6a |
Llama2-70B |
1xH200 TP1 |
FP8 |
2 changes: 1 addition & 1 deletion docs/source/blogs/XQA-kernel.md
@@ -10,7 +10,7 @@ Looking at the Throughput-Latency curves below, we see that the enabling of XQA

<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/XQA_ThroughputvsLatency.png?raw=true" alt="XQA increased throughput within same latency budget" width="950" height="auto">

<sub>Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT-LLM v0.8a</sub>
<sub>Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT LLM v0.8a</sub>


## Llama-70B on H200 up to 2.4x increased throughput with XQA within same latency budget
6 changes: 3 additions & 3 deletions docs/source/blogs/quantization-in-TRT-LLM.md
@@ -5,7 +5,7 @@ The deployment and inference speed of LLMs are often impeded by limitations in m
In this blog, we provide an overview of the quantization features in TensorRT-LLM, share benchmark, and offer best practices of selecting the appropriate quantization methods tailored to your specific use case.

## Quantization in TensorRT-LLM
TensorRT-LLM offers a best-in-class unified quantization toolkit to significantly speedup DL/GenAI deployment on NVIDIA hardware, while maintaining model accuracy. This toolkit is designed with easy-of-use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of codes. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will soon expand to more model optimization techniques in the near future.
TensorRT LLM offers a best-in-class unified quantization toolkit to significantly speedup DL/GenAI deployment on NVIDIA hardware, while maintaining model accuracy. This toolkit is designed with easy-of-use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of codes. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will soon expand to more model optimization techniques in the near future.
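
As a rough sketch of what such an invocation can look like — the script path and flag names are taken from the linked quantization example and may differ between releases, so treat them as assumptions and consult the user guide:

```bash
# Hedged sketch of post-training quantization via the example referenced above.
# Paths and flag names are assumptions; verify against the linked user guide.
python examples/quantization/quantize.py \
  --model_dir ./Llama-3.1-8B \
  --qformat fp8 \
  --kv_cache_dtype fp8 \
  --output_dir ./llama-3.1-8b-fp8-ckpt
```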

## Benchmark

@@ -63,7 +63,7 @@ Based on specific use cases, users might have different tolerances on accuracy i
\* The performance and impact are measured on 10+ popular LLMs. We'll follow up with more data points.
** Calibration time is subject to the actual model size.

We note that TensorRT-LLM also offers INT8 and FP8 quantization for KV cache. KV cache differs from normal activation because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using KV cache on Hopper & Ada GPUs, We recommend using FP8 KV cache over Int8 because the former has a lower accuracy impact than the latter in most tested cases. When switching from FP16 KV cache to FP8 KV cache, it also enables you to run 2-3x larger batch size on H100 machine for models like GPT-J which further brings about 1.5x performance benefit.
We note that TensorRT LLM also offers INT8 and FP8 quantization for KV cache. KV cache differs from normal activation because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using KV cache on Hopper & Ada GPUs, We recommend using FP8 KV cache over Int8 because the former has a lower accuracy impact than the latter in most tested cases. When switching from FP16 KV cache to FP8 KV cache, it also enables you to run 2-3x larger batch size on H100 machine for models like GPT-J which further brings about 1.5x performance benefit.

## What’s coming next
TensorRT-LLM continues to make improvements on our quantization features, such as Int4-FP8 AWQ (W4A8) public examples and more model supports. Please stay tuned for our upcoming releases.
TensorRT LLM continues to make improvements on our quantization features, such as Int4-FP8 AWQ (W4A8) public examples and more model supports. Please stay tuned for our upcoming releases.
6 changes: 3 additions & 3 deletions docs/source/dev-on-cloud/dev-on-runpod.md
@@ -1,7 +1,7 @@
(dev-on-runpod)=

# Develop TensorRT-LLM on Runpod
[Runpod](https://runpod.io) is a popular cloud platform among many researchers. This doc describes how to develop TensorRT-LLM on Runpod.
# Develop TensorRT LLM on Runpod
[Runpod](https://runpod.io) is a popular cloud platform among many researchers. This doc describes how to develop TensorRT LLM on Runpod.

## Prepare

@@ -13,7 +13,7 @@ Please refer to the [Configure SSH Key](https://docs.runpod.io/pods/configuratio

Note that we can skip the step of "Start your Pod. Make sure of the following things" here as we will introduce it below.

## Build the TensorRT-LLM Docker Image and Upload to DockerHub
## Build the TensorRT LLM Docker Image and Upload to DockerHub
Please refer to the [Build Image to DockerHub](build-image-to-dockerhub.md).

Note that the docker image must enable ssh access. See on [Enable ssh access to the container](build-image-to-dockerhub.md#enable-ssh-access-to-the-container).
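
A hypothetical sketch of tagging and pushing such an image, with `<dockerhub-user>` and the image names as placeholders for your own account and naming scheme:

```bash
# Placeholder sketch: publish a locally built image to Docker Hub.
docker tag tensorrt_llm/devel:latest <dockerhub-user>/tensorrt_llm-devel:latest
docker push <dockerhub-user>/tensorrt_llm-devel:latest
```
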
4 changes: 2 additions & 2 deletions docs/source/developer-guide/dev-containers.md
@@ -1,10 +1,10 @@
# Using Dev Containers

The TensorRT-LLM repository contains a [Dev Containers](https://containers.dev/)
The TensorRT LLM repository contains a [Dev Containers](https://containers.dev/)
configuration in `.devcontainer`. These files are intended for
use with [Visual Studio Code](https://code.visualstudio.com/).

Due to the various container options supported by TensorRT-LLM (see
Due to the various container options supported by TensorRT LLM (see
[](/installation/build-from-source-linux.md) and
<https://github.com/NVIDIA/TensorRT-LLM/tree/main/docker>), the Dev
Container configuration also offers some degree of customization.
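
For instance, the reference Dev Containers CLI offers a way to bring the container up outside the VS Code UI; this is a sketch assuming Node.js/npm are available, not part of the repository's documented workflow:

```bash
# Sketch: start the Dev Container from the command line with the reference CLI.
npm install -g @devcontainers/cli
devcontainer up --workspace-folder .
```
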
2 changes: 1 addition & 1 deletion docs/source/developer-guide/perf-benchmarking.md
@@ -181,7 +181,7 @@ trtllm-bench --model meta-llama/Llama-3.1-8B \
===========================================================
Model: meta-llama/Llama-3.1-8B
Model Path: /Ckpt/Path/To/Llama-3.1-8B
TensorRT-LLM Version: 0.17.0
TensorRT LLM Version: 0.17.0
Dtype: bfloat16
KV Cache Dtype: None
Quantization: FP8
@@ -4,7 +4,7 @@ AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utilit

## Getting Started

Before benchmarking with AutoDeploy, review the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow) to familiarize yourself with the standard trtllm-bench workflow and best practices.
Before benchmarking with AutoDeploy, review the [TensorRT LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow) to familiarize yourself with the standard trtllm-bench workflow and best practices.

## Basic Usage

@@ -1,6 +1,6 @@
# Expert Configuration of LLM API

For advanced TensorRT-LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use at your own risk. The argument list may diverge from the standard TensorRT LLM argument list.
For advanced TensorRT LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use at your own risk. The argument list may diverge from the standard TensorRT LLM argument list.

- All configuration fields used by the AutoDeploy core pipeline, `InferenceOptimizer`, are exposed exclusively in `AutoDeployConfig` in `tensorrt_llm._torch.auto_deploy.llm_args`.
Please make sure to refer to those first.