4 changes: 2 additions & 2 deletions docs/source/advanced/lora.md
@@ -133,9 +133,9 @@ Next, consider this linear layer is a `RowLinear` layer. When we partition the w

#### DoRA

TRTLLM supports DoRA as described in https://arxiv.org/abs/2402.09353 . To enable DoRA, you must add the additional `--dora_plugin enable` flag to the `trtllm-build` command.
TensorRT LLM supports DoRA as described in https://arxiv.org/abs/2402.09353 . To enable DoRA, you must add the additional `--dora_plugin enable` flag to the `trtllm-build` command.

The DoRA scales must be normalized before they are submitted to TRTLLM in an inference request. The normalization requires the base model weights. To normalize your adapter you may use the script provided in `tensorrt_llm/examples/dora/normalize_weights.py`.
The DoRA scales must be normalized before they are submitted to TensorRT LLM in an inference request. The normalization requires the base model weights. To normalize your adapter you may use the script provided in `tensorrt_llm/examples/dora/normalize_weights.py`.

When using DoRA, the format of `LoraWeights` and `LoraConfig` changes slightly.
The shape of `LoraConfig` becomes `[module_id, layer_idx, adapter_size D (i.e. R value), is_dora]`, with `is_dora` a boolean flag that indicates whether the supplied adapter contains DoRA scales. If the old config shape is used, the adapter is assumed not to have DoRA scales.
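
As a minimal sketch of the new layout, the per-module rows of a `LoraConfig` tensor could be assembled as below; the module ID, layer count, and rank values are placeholder assumptions for illustration, not values prescribed by TensorRT LLM:

```python
import torch

# Placeholder values, chosen only for illustration.
ATTN_Q_MODULE_ID = 0   # hypothetical module id
NUM_LAYERS = 2
ADAPTER_SIZE = 8       # the adapter rank R

# One row per (module, layer): [module_id, layer_idx, adapter_size, is_dora]
lora_config = torch.tensor(
    [[ATTN_Q_MODULE_ID, layer_idx, ADAPTER_SIZE, 1]   # is_dora = 1: adapter carries DoRA scales
     for layer_idx in range(NUM_LAYERS)],
    dtype=torch.int32,
)
```
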
2 changes: 1 addition & 1 deletion docs/source/advanced/speculative-decoding.md
@@ -173,7 +173,7 @@ Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits

### Disaggregated Serving

[Disaggregated Serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/disaggregated-service.md) with EAGLE-3 using the two-model approach is supported in the PyTorch backend.
[Disaggregated Serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/features/disaggregated-service.md) with EAGLE3 using the two-model approach is supported in the PyTorch backend. Please refer to the following [Dynamo example](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/llama4_plus_eagle.md) on how to run EAGLE3 with Disaggregated Serving for Llama 4 Maverick.

## Lookahead Decoding

12 changes: 6 additions & 6 deletions docs/source/architecture/add-model.md
@@ -2,19 +2,19 @@

# Adding a Model

This document describes how to add a typical decoder-only model in TensorRT-LLM.
This document describes how to add a typical decoder-only model in TensorRT LLM.

## Step 1. Write Modeling Part

TensorRT-LLM provides different levels of APIs:
TensorRT LLM provides different levels of APIs:

- Low-level functions, for example, `concat`, `add`, and `sum`.
- Basic layers, such as `Linear` and `LayerNorm`.
- High-level layers, such as `MLP` and `Attention`.
- Base class for typical decoder-only models, such as `DecoderModelForCausalLM`.

1. Create a model directory in `tensorrt_llm/models`, for example `my_model`.
2. Write a `model.py` with TensorRT-LLM's APIs
2. Write a `model.py` with TensorRT LLM's APIs

```python
class MyDecoderLayer(Module):
@@ -52,7 +52,7 @@ class MyModelForCausalLM(DecoderModelForCausalLM):

## Step 2. Implement Weight Conversion

The weights from source framework need to be converted and bound to the new added TensorRT-LLM model. Here is an example of converting HuggingFace weights:
The weights from the source framework need to be converted and bound to the newly added TensorRT LLM model. Here is an example of converting HuggingFace weights:

```python
class MyModelForCausalLM(DecoderModelForCausalLM):
@@ -62,8 +62,8 @@ class MyModelForCausalLM(DecoderModelForCausalLM):
hf_model_dir,
dtype='float16',
mapping: Optional[Mapping] = None) -> MyModelForCausalLM
# create a TensorRT-LLM MyModelForCausalLM model object
# convert HuggingFace checkpoint to TensorRT-LLM expected weights dict
# create a TensorRT LLM MyModelForCausalLM model object
# convert HuggingFace checkpoint to TensorRT LLM expected weights dict
# load the weights to MyModelForCausalLM object
```
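
For orientation, a minimal sketch of what such a conversion step could look like, assuming a hypothetical `convert_hf_weights` helper and an illustrative name mapping; the real converters shipped with TensorRT LLM are considerably more involved:

```python
import torch

def convert_hf_weights(hf_state_dict: dict) -> dict:
    """Map HuggingFace tensor names to the names the TensorRT LLM model expects.

    The mapping below is illustrative only; each model defines its own.
    """
    name_map = {
        "model.embed_tokens.weight": "transformer.vocab_embedding.weight",
        "lm_head.weight": "lm_head.weight",
    }
    weights = {}
    for hf_name, tensor in hf_state_dict.items():
        trt_name = name_map.get(hf_name)
        if trt_name is not None:
            # Cast to the target dtype while copying into the new dict.
            weights[trt_name] = tensor.to(torch.float16)
    return weights
```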

24 changes: 12 additions & 12 deletions docs/source/architecture/checkpoint.md
@@ -1,36 +1,36 @@
# TensorRT-LLM Checkpoint
# TensorRT LLM Checkpoint

## Overview

The earlier versions (pre-0.8 version) of TensorRT-LLM were developed with a very aggressive timeline. For those versions, emphasis was not put on defining a unified workflow. Now that TensorRT-LLM has reached some level of feature richness, the development team has decided to put more effort into unifying the APIs and workflow of TensorRT-LLM. This file documents the workflow around TensorRT-LLM checkpoint and the set of CLI tools to generate checkpoint, build engines, and evaluate engines.
The earlier versions (pre-0.8 version) of TensorRT LLM were developed with a very aggressive timeline. For those versions, emphasis was not put on defining a unified workflow. Now that TensorRT LLM has reached some level of feature richness, the development team has decided to put more effort into unifying the APIs and workflow of TensorRT LLM. This file documents the workflow around TensorRT LLM checkpoint and the set of CLI tools to generate checkpoint, build engines, and evaluate engines.

There are three steps in the workflow:

1. Convert weights from different source frameworks into TensorRT-LLM checkpoint.
2. Build the TensorRT-LLM checkpoint into TensorRT engines with a unified build command.
3. Load the engines to TensorRT-LLM model runner and evaluate with different evaluation tasks.
1. Convert weights from different source frameworks into TensorRT LLM checkpoint.
2. Build the TensorRT LLM checkpoint into TensorRT engines with a unified build command.
3. Load the engines to TensorRT LLM model runner and evaluate with different evaluation tasks.

```
NeMo -------------
|
HuggingFace ------
| convert build load
Modelopt --------- ----------> TensorRT-LLM Checkpoint --------> TensorRT Engine ------> TensorRT-LLM ModelRunner
Modelopt --------- ----------> TensorRT LLM Checkpoint --------> TensorRT Engine ------> TensorRT LLM ModelRunner
|
JAX --------------
|
DeepSpeed --------
```

## Prepare the TensorRT-LLM Checkpoint
## Prepare the TensorRT LLM Checkpoint

TensorRT-LLM aims at supporting different sources:
TensorRT LLM aims at supporting different sources:

1. Trained models from NVIDIA NeMo, Microsoft DeepSpeed, and JAX
2. Quantized models from NVIDIA Modelopt
3. Popular models from HuggingFace

TensorRT-LLM defines its own checkpoint format. A checkpoint directory includes:
TensorRT LLM defines its own checkpoint format. A checkpoint directory includes:

1. One config `json` file, which contains several model hyper-parameters.
2. One or several rank weights files, each containing a dictionary of tensors (weights).
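
As a minimal sketch, a two-rank checkpoint could be written as below; the `rank{N}.safetensors` file naming and the config fields are assumptions for illustration rather than a normative specification:

```python
import json
from pathlib import Path

import torch
from safetensors.torch import save_file

ckpt_dir = Path("./my_model_ckpt")
ckpt_dir.mkdir(exist_ok=True)

# 1. One config JSON file with the model hyper-parameters.
config = {"architecture": "MyModelForCausalLM", "dtype": "float16", "num_hidden_layers": 2}
(ckpt_dir / "config.json").write_text(json.dumps(config, indent=2))

# 2. One weights file per rank, each holding a flat dict of named tensors.
for rank in range(2):
    weights = {"transformer.layers.0.mlp.fc.weight": torch.zeros(16, 32, dtype=torch.float16)}
    save_file(weights, str(ckpt_dir / f"rank{rank}.safetensors"))
```
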
@@ -107,7 +107,7 @@ Here is the model specific config list:
### Rank Weights

As in PyTorch, the tensor (weight) name is a string containing hierarchical information,
which is uniquely mapped to a certain parameter of a TensorRT-LLM model.
which is uniquely mapped to a certain parameter of a TensorRT LLM model.

For example, each transformer layer of the OPT model contains an `Attention` layer, an `MLP` layer, and two `LayerNorm` layers.
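
For instance, the weights of layer 0 might be exposed under names like the following (an illustrative, non-exhaustive list; the exact submodule names depend on the model definition):

```python
# Hierarchical weight names for layer 0 of an OPT-style model (illustrative).
layer0_weight_names = [
    "transformer.layers.0.attention.qkv.weight",
    "transformer.layers.0.attention.dense.weight",
    "transformer.layers.0.mlp.fc.weight",
    "transformer.layers.0.mlp.proj.weight",
    "transformer.layers.0.input_layernorm.weight",
    "transformer.layers.0.post_layernorm.weight",
]
```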

@@ -169,7 +169,7 @@ Here is the AWQ scaling factors of `mlp.fc` linear layer:
- `transformer.layers.0.mlp.fc.prequant_scaling_factor`

```{note}
The linear weights in TensorRT-LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT-LLM implemented by plugin may use (`in_feature`, `out_fature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
The linear weights in a TensorRT LLM checkpoint always follow the (`out_feature`, `in_feature`) shape, whereas some quantized linear layers implemented by TensorRT LLM plugins may use the (`in_feature`, `out_feature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.

### Example

@@ -218,7 +218,7 @@ Here is the `config.json`:

## Build Checkpoint into TensorRT Engine

TensorRT-LLM provides a unified build command: `trtllm-build`. Before using it,
TensorRT LLM provides a unified build command: `trtllm-build`. Before using it,
you may need to add it to the `PATH`.

```bash
77 changes: 67 additions & 10 deletions docs/source/architecture/overview.md
@@ -1,18 +1,75 @@
(architecture-overview)=
# Architecture Overview

# TensorRT-LLM Architecture
The `LLM` class is a core entry point for TensorRT LLM, providing a simplified `generate()` API for efficient large language model inference. This abstraction aims to streamline the user experience, as demonstrated with TinyLlama:

TensorRT-LLM is a toolkit to assemble optimized solutions to perform Large Language Model (LLM) inference. It offers a Model Definition API to define models and compile efficient [TensorRT](https://developer.nvidia.com/tensorrt) engines for NVIDIA GPUs. It also contains Python and C++ components to build runtimes to execute those engines as well as backends for the [Triton Inference
Server](https://developer.nvidia.com/nvidia-triton-inference-server) to easily create web-based services for LLMs. TensorRT-LLM supports multi-GPU and multi-node configurations (through MPI).
```python
from tensorrt_llm import LLM

As a user, the very first step to create an inference solution is to either define your own model or select a pre-defined network architecture (refer to {ref}`models` for the list of models supported by TensorRT-LLM). Once defined, that model must be trained using a training framework (training is outside of the scope of TensorRT-LLM). For pre-defined models, checkpoints can be downloaded from various providers. To illustrate that point, a lot of examples in TensorRT-LLM use model weights obtained from the [Hugging Face](https://huggingface.co) hub and trained using [NVIDIA Nemo](https://developer.nvidia.com/nemo) or [PyTorch](https://pytorch.org).
# Initialize the LLM with a specified model
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

Equipped with the model definition and the weights, a user must use TensorRT-LLM's Model Definition API to recreate the model in a way that can be compiled by TensorRT into an efficient engine. For ease of use, TensorRT-LLM already supports a handful of standard models.
# Generate text using the model
output = llm.generate("Hello, my name is")
```
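
The returned object can then be inspected for the generated text; the attribute access shown below follows the usual LLM API result structure and is given here as an assumption:

```python
# Print the completion produced for the prompt above.
print(output.outputs[0].text)
```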

Together with the Model Definition API to describe models, TensorRT-LLM provides users with components to create a runtime that executes the efficient TensorRT engine. Runtime components offer beam-search, along with extensive sampling functionalities such as top-K and top-P sampling. The exhaustive list can be found in the documentation of the {ref}`gpt-runtime`. The C++ runtime is the recommended runtime.
The `LLM` class automatically manages essential pre- and post-processing steps, including tokenization (encoding input prompts into numerical representations) and detokenization (decoding model outputs back into human-readable text).

TensorRT-LLM also includes Python and C++ backends for NVIDIA Triton Inference Server to assemble solutions for LLM online serving. The C++ backend implements in-flight batching as explained in the {ref}`executor` documentation and is the recommended backend.
Internally, the `LLM` class orchestrates the creation of a dedicated `PyExecutor(Worker)` process on each rank.

## Model Weights
![TensorRT LLM Architecture Overview](../media/TRTLLM_Architecture_Overview.png)

TensorRT-LLM is a library for LLM inference, and so to use it, you need to supply a set of trained weights. You can either use your own model weights trained in a framework like [NVIDIA NeMo](https://www.nvidia.com/en-us/ai-data-science/generative-ai/nemo-framework/) or pull a set of pretrained weights from repositories like the Hugging Face Hub.
This `PyExecutor` operates in a continuous background loop, designed for the efficient, asynchronous processing of inference requests.

The `PyExecutor`'s functionality is built upon several key components:

- `Scheduler`: Responsible for determining which active requests are ready for execution at each processing step.

- `KVCacheManager`: Manages the allocation, deallocation, and maintenance of the Key-Value (KV) Cache. This is a critical optimization for Transformer models, significantly enhancing performance during autoregressive text generation by storing previously computed attention keys and values.

- `ModelEngine`: Handles the loading and highly efficient execution of the language model on the GPU hardware.

- `Sampler`: Takes the raw outputs (logits) from the `ModelEngine` and applies appropriate sampling strategies (e.g., greedy, top-k, top-p, beam search) to generate the final output tokens.

During each iteration of its background loop, the `PyExecutor` performs the following sequence of operations, sketched in code after this list:

- Request Fetching: Retrieves new inference requests from an internal request queue, if available.

- Scheduling: Interacts with the `Scheduler` to identify and prioritize requests that are ready to be processed in the current step.

- Resource Preparation: Coordinates with the `KVCacheManager` to ensure that the necessary Key-Value (KV) Cache resources are allocated for the selected requests.

- Model Execution: Invokes the `ModelEngine` to perform a forward pass on the scheduled requests, predicting the next output tokens.

- Output Handling: Updates the partial outputs for ongoing requests and finalizes the results for any requests that have reached completion, returning them to the user.
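
A highly simplified sketch of that loop is shown below; the `_schedule`, `_forward_step`, and `_sample_async` names mirror the snippet in the Overlap Scheduler section, while the queue, KV-cache, and response-handling calls are illustrative assumptions rather than the actual `PyExecutor` implementation:

```python
def executor_loop(self):
    """Illustrative background loop, not the real PyExecutor code."""
    while not self.shutdown_requested:
        # Request fetching: pull any newly submitted requests.
        self.active_requests.extend(self.request_queue.drain())

        # Scheduling: pick the requests that can run in this step.
        scheduled_batch, _, _ = self._schedule()

        # Resource preparation: make sure KV cache blocks are available.
        self.kv_cache_manager.prepare_resources(scheduled_batch)

        # Model execution: one forward pass over the scheduled batch.
        batch_outputs = self._forward_step(scheduled_batch)

        # Output handling: sample tokens, update partial outputs, and
        # return finished requests to their callers.
        sample_state = self._sample_async(scheduled_batch, batch_outputs)
        self._handle_responses(scheduled_batch, sample_state)
```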


## Runtime Optimizations

TensorRT LLM enhances inference throughput and reduces latency by integrating a suite of runtime optimizations, including CUDA Graphs, the [Overlap Scheduler](../features/overlap-scheduler.md), and [speculative decoding](../features/speculative-decoding.md), among others.

### CUDA Graph

CUDA Graphs drastically reduce the CPU-side overhead associated with launching GPU kernels, which is particularly impactful in PyTorch-based inference where Python's host-side code can be a bottleneck. By capturing a sequence of CUDA operations as a single graph, the entire sequence can be launched with one API call, minimizing CPU-GPU synchronization and driver overhead.

To maximize the "hit rate" of these cached graphs, TensorRT LLM employs CUDA Graph padding. If an incoming batch's size doesn't match a captured graph, it's padded to the nearest larger, supported size for which a graph exists. While this incurs minor overhead from computing "wasted" tokens, it's often a better trade-off than falling back to slower eager mode execution. This optimization has a significant impact, demonstrating up to a 22% end-to-end throughput increase on certain models and hardware.
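
As a rough illustration of that padding decision (the captured batch sizes below are arbitrary example values, not TensorRT LLM defaults):

```python
import bisect

# Batch sizes for which CUDA graphs were captured (example values only).
captured_batch_sizes = [1, 2, 4, 8, 16, 32]

def pick_graph_batch_size(batch_size: int):
    """Return the smallest captured size >= batch_size, or None to fall back to eager mode."""
    idx = bisect.bisect_left(captured_batch_sizes, batch_size)
    return captured_batch_sizes[idx] if idx < len(captured_batch_sizes) else None

print(pick_graph_batch_size(6))    # 8 -> pad the batch and replay the captured graph
print(pick_graph_batch_size(48))   # None -> no graph large enough, run in eager mode
```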

### Overlap Scheduler

The Overlap Scheduler maximizes GPU utilization by hiding CPU-bound latency behind GPU computation.

The key strategy is to launch the GPU's work for the next step (n+1) immediately, without waiting for the CPU to finish processing the results of the current step (n). This allows the CPU to handle tasks like checking stop criteria or updating responses for one batch while the GPU is already executing the model for the subsequent batch.

This concurrent execution pipeline is illustrated in the `PyExecutor`'s logic:

```python
# Schedule and launch GPU work for the current step (n)
scheduled_batch, _, _ = self._schedule()
batch_outputs = self._forward_step(scheduled_batch, previous_tensors_device)
sample_state = self._sample_async(scheduled_batch, batch_outputs)

# While the GPU is busy, process the CPU-bound results from the previous step (n-1)
if self.previous_batch is not None:
    self._process_previous_batch()
```

This approach effectively reduces GPU idle time and improves overall hardware occupancy. While it introduces one extra decoding step into the pipeline, the resulting throughput gain makes it a worthwhile trade-off. For this reason, the Overlap Scheduler is enabled by default in TensorRT LLM.