A TensorRT Toolbox for Optimized Large Language Model Inference
@@ -18,20 +18,20 @@ TensorRT-LLM
## Tech Blogs
-* [08/06] Running a High Performance GPT-OSS-120B Inference Server with TensorRT-LLM
+* [08/06] Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM
✨ [➡️ link](./docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md)
-* [08/01] Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization)
+* [08/01] Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
✨ [➡️ link](./docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md)
* [07/26] N-Gram Speculative Decoding in TensorRT‑LLM
✨ [➡️ link](./docs/source/blogs/tech_blog/blog7_NGram_performance_Analysis_And_Auto_Enablement.md)
-* [06/19] Disaggregated Serving in TensorRT-LLM
+* [06/19] Disaggregated Serving in TensorRT LLM
✨ [➡️ link](./docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md)
-* [06/05] Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP)
+* [06/05] Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
✨ [➡️ link](./docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
* [05/30] Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
@@ -44,21 +44,21 @@ TensorRT-LLM
✨ [➡️ link](./docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md)
## Latest News
-* [07/15] 🌟 TensorRT-LLM delivers Day-0 support for LG AI Research's latest model, EXAONE 4.0 [➡️ link](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B)
+* [07/15] 🌟 TensorRT LLM delivers Day-0 support for LG AI Research's latest model, EXAONE 4.0 [➡️ link](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B)
* [06/17] Join NVIDIA and DeepInfra for a developer meetup on June 26 ✨ [➡️ link](https://events.nvidia.com/scaletheunscalablenextgenai)
* [05/22] Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick
✨ [➡️ link](https://developer.nvidia.com/blog/blackwell-breaks-the-1000-tps-user-barrier-with-metas-llama-4-maverick/)
-* [04/10] TensorRT-LLM DeepSeek R1 performance benchmarking best practices now published.
+* [04/10] TensorRT LLM DeepSeek R1 performance benchmarking best practices now published.
✨ [➡️ link](./docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md)
-* [04/05] TensorRT-LLM can run Llama 4 at over 40,000 tokens per second on B200 GPUs!
+* [04/05] TensorRT LLM can run Llama 4 at over 40,000 tokens per second on B200 GPUs!

-* [03/22] TensorRT-LLM is now fully open-source, with developments moved to GitHub!
-* [03/18] 🚀🚀 NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance with TensorRT-LLM [➡️ Link](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
-* [02/28] 🌟 NAVER Place Optimizes SLM-Based Vertical Services with TensorRT-LLM [➡️ Link](https://developer.nvidia.com/blog/spotlight-naver-place-optimizes-slm-based-vertical-services-with-nvidia-tensorrt-llm/)
+* [03/22] TensorRT LLM is now fully open-source, with developments moved to GitHub!
+* [03/18] 🚀🚀 NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance with TensorRT LLM [➡️ Link](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
+* [02/28] 🌟 NAVER Place Optimizes SLM-Based Vertical Services with TensorRT LLM [➡️ Link](https://developer.nvidia.com/blog/spotlight-naver-place-optimizes-slm-based-vertical-services-with-nvidia-tensorrt-llm/)
* [02/25] 🌟 DeepSeek-R1 performance now optimized for Blackwell [➡️ Link](https://huggingface.co/nvidia/DeepSeek-R1-FP4)
@@ -82,11 +82,11 @@ TensorRT-LLM
* [2025/01/23] 🚀 Fast, Low-Cost Inference Offers Key to Profitable AI [➡️ link](https://blogs.nvidia.com/blog/ai-inference-platform/?ncid=so-twit-693236-vt04&linkId=100000332307804)
-* [2025/01/16] Introducing New KV Cache Reuse Optimizations in TensorRT-LLM [➡️ link](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/?ncid=so-twit-363876&linkId=100000330323229)
+* [2025/01/16] Introducing New KV Cache Reuse Optimizations in TensorRT LLM [➡️ link](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/?ncid=so-twit-363876&linkId=100000330323229)
-* [2025/01/14] 📣 Bing's Transition to LLM/SLM Models: Optimizing Search with TensorRT-LLM [➡️ link](https://blogs.bing.com/search-quality-insights/December-2024/Bing-s-Transition-to-LLM-SLM-Models-Optimizing-Search-with-TensorRT-LLM)
+* [2025/01/14] 📣 Bing's Transition to LLM/SLM Models: Optimizing Search with TensorRT LLM [➡️ link](https://blogs.bing.com/search-quality-insights/December-2024/Bing-s-Transition-to-LLM-SLM-Models-Optimizing-Search-with-TensorRT-LLM)
-* [2025/01/04] ⚡Boost Llama 3.3 70B Inference Throughput 3x with TensorRT-LLM Speculative Decoding
+* [2025/01/04] ⚡Boost Llama 3.3 70B Inference Throughput 3x with TensorRT LLM Speculative Decoding
[➡️ link](https://developer.nvidia.com/blog/boost-llama-3-3-70b-inference-throughput-3x-with-nvidia-tensorrt-llm-speculative-decoding/)
* [2024/12/10] ⚡ Llama 3.3 70B from AI at Meta is accelerated by TensorRT-LLM. 🌟 State-of-the-art model on par with Llama 3.1 405B for reasoning, math, instruction following and tool use. Explore the preview
@@ -102,16 +102,16 @@ TensorRT-LLM
✅ Easy building of TensorRT engines
[➡️ link](https://developer.nvidia.com/nsight-dl-designer?ncid=so-link-485689&linkId=100000315016072)
-* [2024/11/26] 📣 Introducing TensorRT-LLM for Jetson AGX Orin, making it even easier to deploy on Jetson AGX Orin with initial support in JetPack 6.1 via the v0.12.0-jetson branch of the TensorRT-LLM repo. ✅ Pre-compiled TensorRT-LLM wheels & containers for easy integration ✅ Comprehensive guides & docs to get you started
+* [2024/11/26] 📣 Introducing TensorRT LLM for Jetson AGX Orin, making it even easier to deploy on Jetson AGX Orin with initial support in JetPack 6.1 via the v0.12.0-jetson branch of the TensorRT LLM repo. ✅ Pre-compiled TensorRT LLM wheels & containers for easy integration ✅ Comprehensive guides & docs to get you started
[➡️ link](https://forums.developer.nvidia.com/t/tensorrt-llm-for-jetson/313227?linkId=100000312718869)
-* [2024/11/21] NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200
+* [2024/11/21] NVIDIA TensorRT LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200
[➡️ link](https://developer.nvidia.com/blog/nvidia-tensorrt-llm-multiblock-attention-boosts-throughput-by-more-than-3x-for-long-sequence-lengths-on-nvidia-hgx-h200/)
* [2024/11/19] Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs
[➡️ link](https://developer.nvidia.com/blog/llama-3-2-full-stack-optimizations-unlock-high-performance-on-nvidia-gpus/?ncid=so-link-721194)
-* [2024/11/09] 🚀🚀🚀 3x Faster AllReduce with NVSwitch and TensorRT-LLM MultiShot
+* [2024/11/09] 🚀🚀🚀 3x Faster AllReduce with NVSwitch and TensorRT LLM MultiShot
[➡️ link](https://developer.nvidia.com/blog/3x-faster-allreduce-with-nvswitch-and-tensorrt-llm-multishot/)
* [2024/11/09] ✨ NVIDIA advances the AI ecosystem with the AI model of LG AI Research 🙌
@@ -137,16 +137,16 @@ TensorRT-LLM
* [2024/09/29] 🌟 AI at Meta PyTorch + TensorRT v2.4 🌟 ⚡TensorRT 10.1 ⚡PyTorch 2.4 ⚡CUDA 12.4 ⚡Python 3.12
[➡️ link](https://github.com/pytorch/TensorRT/releases/tag/v2.4.0)
-* [2024/09/17] ✨ NVIDIA TensorRT-LLM Meetup
+* [2024/09/17] ✨ NVIDIA TensorRT LLM Meetup
[➡️ link](https://drive.google.com/file/d/1RR8GqC-QbuaKuHj82rZcXb3MS20SWo6F/view?usp=share_link)
* [2024/09/17] ✨ Accelerating LLM Inference at Databricks with TensorRT-LLM
[➡️ link](https://drive.google.com/file/d/1NeSmrLaWRJAY1rxD9lJmzpB9rzr38j8j/view?usp=sharing)
-* [2024/09/17] ✨ TensorRT-LLM @ Baseten
+* [2024/09/17] ✨ TensorRT LLM @ Baseten
[➡️ link](https://drive.google.com/file/d/1Y7L2jqW-aRmt31mCdqhwvGMmCSOzBUjG/view?usp=share_link)
-* [2024/09/04] 🏎️🏎️🏎️ Best Practices for Tuning TensorRT-LLM for Optimal Serving with BentoML
+* [2024/09/04] 🏎️🏎️🏎️ Best Practices for Tuning TensorRT LLM for Optimal Serving with BentoML
[➡️ link](https://www.bentoml.com/blog/tuning-tensor-rt-llm-for-optimal-serving-with-bentoml)
@@ -193,7 +193,7 @@ Technical Deep Dive for serious coders ✅+99% compression ✅1 set of weights
👀 📚 DIY [➡️ link](https://console.brev.dev/launchable/deploy?userID=2x2sil999&orgID=ktj33l4xj&launchableID=env-2h6bym7h5GFNho3vpWQQeUYMwTM&instance=L4%40g6.xlarge&diskStorage=500&cloudID=devplane-brev-1&baseImage=nvcr.io%2Fnvidia%2Ftensorrt%3A24.05-py3&file=https%3A%2F%2Fgithub.com%2FNVIDIA%2FTensorRT%2Fblob%2Frelease%2F10.0%2Fsamples%2Fpython%2Fsample_weight_stripping%2Fnotebooks%2Fweight_stripping.ipynb&name=tensorrt_weight_stripping_resnet50)
* [2024/05/21] ✨@modal_labs has the codes for serverless @AIatMeta Llama 3 on #TensorRT #LLM ✨👀 📚 Marvelous Modal Manual:
-Serverless TensorRT-LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.com/docs/examples/trtllm_llama)
+Serverless TensorRT LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.com/docs/examples/trtllm_llama)
* [2024/05/08] NVIDIA TensorRT Model Optimizer -- the newest member of the #TensorRT ecosystem is a library of post-training and training-in-the-loop model optimization techniques ✅quantization ✅sparsity ✅QAT [➡️ blog](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)
@@ -202,23 +202,23 @@ Serverless TensorRT-LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.co
* [2024/02/06] [🚀 Speed up inference with SOTA quantization techniques in TRT-LLM](./docs/source/blogs/quantization-in-TRT-LLM.md)
* [2024/01/30] [ New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget](./docs/source/blogs/XQA-kernel.md)
* [2023/12/04] [Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100](./docs/source/blogs/Falcon180B-H200.md)
-* [2023/11/27] [SageMaker LMI now supports TensorRT-LLM - improves throughput by 60%, compared to previous version](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/)
+* [2023/11/27] [SageMaker LMI now supports TensorRT LLM - improves throughput by 60%, compared to previous version](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/)
* [2023/11/13] [H200 achieves nearly 12,000 tok/sec on Llama2-13B](./docs/source/blogs/H200launch.md)
-* [2023/10/22] [🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙](https://github.com/NVIDIA/trt-llm-rag-windows#readme)
+* [2023/10/22] [🚀 RAG on Windows using TensorRT LLM and LlamaIndex 🦙](https://github.com/NVIDIA/trt-llm-rag-windows#readme)
* [2023/10/19] Getting Started Guide - [Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available
](https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/)
-* [2023/10/17] [Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows
+* [2023/10/17] [Large Language Models up to 4x Faster on RTX With TensorRT LLM for Windows
](https://blogs.nvidia.com/blog/2023/10/17/tensorrt-llm-windows-stable-diffusion-rtx/)
-## TensorRT-LLM Overview
+## TensorRT LLM Overview
-TensorRT-LLM is an open-sourced library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, [FP4](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/), INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.
+TensorRT LLM is an open-source library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, [FP4](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/), INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.
-[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/arch_overview.md), TensorRT-LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
+[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/arch_overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
-TensorRT-LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py), making it easy to adapt the system to specific needs.
+TensorRT LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py), making it easy to adapt the system to specific needs.
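For reference, a minimal use of the LLM API mentioned above looks roughly like the sketch below (the model name is a placeholder; the quick-start guide linked earlier remains the authoritative workflow):

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder model: any supported Hugging Face ID or local checkpoint path works.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# generate() runs batched inference and returns one result per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```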
## Getting Started
@@ -235,14 +235,14 @@ To get started with TensorRT-LLM, visit our documentation:
## Deprecation Policy
-Deprecation is used to inform developers that some APIs and tools are no longer recommended for use. Beginning with version 1.0, TensorRT-LLM has the following deprecation policy:
+Deprecation is used to inform developers that some APIs and tools are no longer recommended for use. Beginning with version 1.0, TensorRT LLM has the following deprecation policy:
1. Communication of Deprecation
- Deprecation notices are documented in the Release Notes.
- Deprecated APIs, methods, classes, or parameters include a statement in the source code indicating when they were deprecated.
- If used, deprecated methods, classes, or parameters issue runtime deprecation warnings.
2. Migration Period
- - TensorRT-LLM provides a 3-month migration period after deprecation.
+ - TensorRT LLM provides a 3-month migration period after deprecation.
- During this period, deprecated APIs, tools, or parameters continue to work but trigger warnings.
3. Scope of Deprecation
- Full API/Method/Class Deprecation: The entire API/method/class is marked for removal.
@@ -253,5 +253,5 @@ Deprecation is used to inform developers that some APIs and tools are no longer
## Useful Links
- [Quantized models on Hugging Face](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4): A growing collection of quantized (e.g., FP8, FP4) and optimized LLMs, including [DeepSeek FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), ready for fast inference with TensorRT-LLM.
- [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo): A datacenter scale distributed inference serving framework that works seamlessly with TensorRT-LLM.
-- [AutoDeploy](./examples/auto_deploy/README.md): A prototype backend for TensorRT-LLM to simplify and accelerate the deployment of PyTorch models.
-- [WeChat Discussion Group](https://github.com/NVIDIA/TensorRT-LLM/issues/5359): A real-time channel for TensorRT-LLM Q&A and news.
+- [AutoDeploy](./examples/auto_deploy/README.md): A prototype backend for TensorRT LLM to simplify and accelerate the deployment of PyTorch models.
+- [WeChat Discussion Group](https://github.com/NVIDIA/TensorRT-LLM/issues/5359): A real-time channel for TensorRT LLM Q&A and news.
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index 4a8c8e9267f..d532c23ef26 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -64,7 +64,7 @@ else()
message(STATUS "NVTX is enabled")
endif()
-# Add TensorRT-LLM Gen export interface and CUDA support
+# Add TensorRT LLM Gen export interface and CUDA support
add_compile_definitions("TLLM_GEN_EXPORT_INTERFACE")
add_compile_definitions("TLLM_ENABLE_CUDA")
@@ -134,9 +134,9 @@ execute_process(
OUTPUT_STRIP_TRAILING_WHITESPACE)
if(TRTLLM_VERSION_RESULT EQUAL 0)
- message(STATUS "TensorRT-LLM version: ${TRTLLM_VERSION}")
+ message(STATUS "TensorRT LLM version: ${TRTLLM_VERSION}")
else()
- message(FATAL_ERROR "Failed to determine Tensorrt-LLM version")
+ message(FATAL_ERROR "Failed to determine TensorRT LLM version")
endif()
configure_file(
diff --git a/cpp/cmake/modules/cuda_configuration.cmake b/cpp/cmake/modules/cuda_configuration.cmake
index 57f957da39e..1251c79ea6b 100644
--- a/cpp/cmake/modules/cuda_configuration.cmake
+++ b/cpp/cmake/modules/cuda_configuration.cmake
@@ -116,7 +116,7 @@ function(setup_cuda_architectures)
unset(CMAKE_CUDA_ARCHITECTURES_RAW)
message(
STATUS
- "Setting CMAKE_CUDA_ARCHITECTURES to all enables all architectures TensorRT-LLM optimized for, "
+ "Setting CMAKE_CUDA_ARCHITECTURES to all enables all architectures TensorRT LLM optimized for, "
"not all architectures CUDA compiler supports.")
elseif(CMAKE_CUDA_ARCHITECTURES_RAW STREQUAL "all-major")
message(
diff --git a/docker/README.md b/docker/README.md
index 275de142a32..ca1fd2196b1 100644
--- a/docker/README.md
+++ b/docker/README.md
@@ -2,12 +2,12 @@
## Multi-stage Builds with Docker
-TensorRT-LLM can be compiled in Docker using a multi-stage build implemented in [`Dockerfile.multi`](Dockerfile.multi).
+TensorRT LLM can be compiled in Docker using a multi-stage build implemented in [`Dockerfile.multi`](Dockerfile.multi).
The following build stages are defined:
* `devel`: this image provides all dependencies for building TensorRT-LLM.
* `wheel`: this image contains the source code and the compiled binary distribution.
-* `release`: this image has the binaries installed and contains TensorRT-LLM examples in `/app/tensorrt_llm`.
+* `release`: this image has the binaries installed and contains TensorRT LLM examples in `/app/tensorrt_llm`.
## Building Docker Images with GNU `make`
@@ -19,7 +19,7 @@ separated by `_`. The following actions are available:
* `_push`: pushes the docker image for the stage to a docker registry (implies `_build`).
* `_run`: runs the docker image for the stage in a new container.
-For example, the `release` stage is built and pushed from the top-level directory of TensorRT-LLM as follows:
+For example, the `release` stage is built and pushed from the top-level directory of TensorRT LLM as follows:
```bash
make -C docker release_push
@@ -178,4 +178,4 @@ a corresponding message. The heuristics can be overridden by specifying
Since Docker rootless mode remaps the UID/GID and the remapped UIDs and GIDs
(typically configured in `/etc/subuid` and `/etc/subgid`) generally do not coincide
with the local UID/GID, both IDs need to be translated using a tool like `bindfs` in order to be able to smoothly share a local working directory with any containers
-started with `LOCAL_USER=1`. In this case, the `SOURCE_DIR` and `HOME_DIR` Makefile variables need to be set to the locations of the translated versions of the TensorRT-LLM working copy and the user home directory, respectively.
+started with `LOCAL_USER=1`. In this case, the `SOURCE_DIR` and `HOME_DIR` Makefile variables need to be set to the locations of the translated versions of the TensorRT LLM working copy and the user home directory, respectively.
diff --git a/docker/develop.md b/docker/develop.md
index 2e1884b5cc4..556663c08ae 100644
--- a/docker/develop.md
+++ b/docker/develop.md
@@ -1,34 +1,34 @@
# Description
-TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
-state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to
+TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
+state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to
create Python and C++ runtimes that orchestrate the inference execution in a performant way.
# Overview
-## TensorRT-LLM Develop Container
+## TensorRT LLM Develop Container
-The TensorRT-LLM Develop container includes all necessary dependencies to build TensorRT-LLM from source. It is
-specifically designed to be used alongside the source code cloned from the official TensorRT-LLM repository:
+The TensorRT LLM Develop container includes all necessary dependencies to build TensorRT LLM from source. It is
+specifically designed to be used alongside the source code cloned from the official TensorRT LLM repository:
[GitHub Repository - NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
-Full instructions for cloning the TensorRT-LLM repository can be found in
-the [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
+Full instructions for cloning the TensorRT LLM repository can be found in
+the [TensorRT LLM Documentation](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
-### Running TensorRT-LLM Using Docker
+### Running TensorRT LLM Using Docker
-With the top-level directory of the TensorRT-LLM repository cloned to your local machine, you can run the following
+With the top-level directory of the TensorRT LLM repository cloned to your local machine, you can run the following
command to start the development container:
```bash
make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.y.z
```
-where `x.y.z` is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
+where `x.y.z` is the version of the TensorRT LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
NVIDIA NGC registry, sets up the local user's account within the container, and launches it with full GPU support. The
-local source code of TensorRT-LLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
-integration. Ensure that the image version matches the version of TensorRT-LLM in your currently checked out local git branch. Not
+local source code of TensorRT LLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
+integration. Ensure that the image version matches the version of TensorRT LLM in your currently checked out local git branch. Not
specifying a `IMAGE_TAG` will attempt to resolve this automatically, but not every intermediate release might be
accompanied by a development container. In that case, use the latest version preceding the version of your development
branch.
@@ -50,9 +50,9 @@ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
Note that this will start the container with the user `root`, which may leave files with root ownership in your local
checkout.
-### Building the TensorRT-LLM Wheel within the Container
+### Building the TensorRT LLM Wheel within the Container
-You can build the TensorRT-LLM Python wheel inside the development container using the following command:
+You can build the TensorRT LLM Python wheel inside the development container using the following command:
```bash
./scripts/build_wheel.py --clean --use_ccache --cuda_architectures=native
@@ -78,7 +78,7 @@ The wheel will be built in the `build` directory and can be installed using `pip
pip install ./build/tensorrt_llm*.whl
```
-For additional information on building the TensorRT-LLM wheel, refer to
+For additional information on building the TensorRT LLM wheel, refer to
the [official documentation on building from source](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#option-1-full-build-with-c-compilation).
### Security CVEs
diff --git a/docker/release.md b/docker/release.md
index b016a0b204e..88d59518c69 100644
--- a/docker/release.md
+++ b/docker/release.md
@@ -1,18 +1,18 @@
# Description
-TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
-state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to
+TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
+state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to
create Python and C++ runtimes that orchestrate the inference execution in a performant way.
# Overview
-## TensorRT-LLM Release Container
+## TensorRT LLM Release Container
-The TensorRT-LLM Release container provides a pre-built environment for running TensorRT-LLM.
+The TensorRT LLM Release container provides a pre-built environment for running TensorRT LLM.
Visit the [official GitHub repository](https://github.com/NVIDIA/TensorRT-LLM) for more details.
-### Running TensorRT-LLM Using Docker
+### Running TensorRT LLM Using Docker
A typical command to launch the container is:
@@ -21,16 +21,16 @@ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpu
nvcr.io/nvidia/tensorrt-llm/release:x.y.z
```
-where x.y.z is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:
+where x.y.z is the version of the TensorRT LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:
```bash
python3 -c "import tensorrt_llm"
```
-This command will print the TensorRT-LLM version if everything is working correctly. After verification, you can explore
+This command will print the TensorRT LLM version if everything is working correctly. After verification, you can explore
and try the example scripts included in `/app/tensorrt_llm/examples`.
-Alternatively, if you have already cloned the TensorRT-LLM repository, you can use the following convenient command to
+Alternatively, if you have already cloned the TensorRT LLM repository, you can use the following convenient command to
run the container:
```bash
@@ -43,8 +43,8 @@ container, and launches it with full GPU support.
For comprehensive information about TensorRT-LLM, including documentation, source code, examples, and installation
guidelines, visit the following official resources:
-- [TensorRT-LLM GitHub Repository](https://github.com/NVIDIA/TensorRT-LLM)
-- [TensorRT-LLM Online Documentation](https://nvidia.github.io/TensorRT-LLM/latest/index.html)
+- [TensorRT LLM GitHub Repository](https://github.com/NVIDIA/TensorRT-LLM)
+- [TensorRT LLM Online Documentation](https://nvidia.github.io/TensorRT-LLM/latest/index.html)
### Security CVEs
diff --git a/docs/source/blogs/Falcon180B-H200.md b/docs/source/blogs/Falcon180B-H200.md
index 01e5eeba59a..b245a6009f6 100644
--- a/docs/source/blogs/Falcon180B-H200.md
+++ b/docs/source/blogs/Falcon180B-H200.md
@@ -21,7 +21,7 @@ memory footprint, allows for great performance on Falcon-180B on a single GPU.
Preliminary measured Performance, subject to change. TP1 does not represent peak performance on H200.
-TensorRT-LLM v0.7a |
+TensorRT LLM v0.7a |
Falcon-180B |
1xH200 TP1 |
INT4 AWQ |
@@ -38,7 +38,7 @@ while maintaining high accuracy.
Preliminary measured accuracy, subject to change.
-TensorRT-LLM v0.7a |
+TensorRT LLM v0.7a |
Falcon-180B |
1xH200 TP1 |
INT4 AWQ
@@ -84,20 +84,20 @@ than A100.
Preliminary measured performance, subject to change.
-TensorRT-LLM v0.7a |
+TensorRT LLM v0.7a |
Llama2-70B |
1xH200 = TP1, 8xH200 = max TP/PP/DP config |
FP8 |
BS: (in order) 960, 960, 192, 560, 96, 640
-**TensorRT-LLM GQA now 2.4x faster on H200**
+**TensorRT LLM GQA now 2.4x faster on H200**
Preliminary measured performance, subject to change.
-TensorRT-LLM v0.7a vs TensorRT-LLM v0.6a |
+TensorRT LLM v0.7a vs TensorRT LLM v0.6a |
Llama2-70B |
1xH200 TP1 |
FP8 |
diff --git a/docs/source/blogs/XQA-kernel.md b/docs/source/blogs/XQA-kernel.md
index 92858239891..dacc3657f32 100644
--- a/docs/source/blogs/XQA-kernel.md
+++ b/docs/source/blogs/XQA-kernel.md
@@ -10,7 +10,7 @@ Looking at the Throughput-Latency curves below, we see that the enabling of XQA
-Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT-LLM v0.8a
+Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT LLM v0.8a
## Llama-70B on H200 up to 2.4x increased throughput with XQA within same latency budget
diff --git a/docs/source/blogs/quantization-in-TRT-LLM.md b/docs/source/blogs/quantization-in-TRT-LLM.md
index 74fbd96506c..7476ac27273 100644
--- a/docs/source/blogs/quantization-in-TRT-LLM.md
+++ b/docs/source/blogs/quantization-in-TRT-LLM.md
@@ -5,7 +5,7 @@ The deployment and inference speed of LLMs are often impeded by limitations in m
In this blog, we provide an overview of the quantization features in TensorRT-LLM, share benchmark, and offer best practices of selecting the appropriate quantization methods tailored to your specific use case.
## Quantization in TensorRT-LLM
-TensorRT-LLM offers a best-in-class unified quantization toolkit to significantly speedup DL/GenAI deployment on NVIDIA hardware, while maintaining model accuracy. This toolkit is designed with easy-of-use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of codes. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will soon expand to more model optimization techniques in the near future.
+TensorRT LLM offers a best-in-class unified quantization toolkit to significantly speed up DL/GenAI deployment on NVIDIA hardware while maintaining model accuracy. This toolkit is designed with ease of use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of code. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will expand to more model optimization techniques in the near future.
## Benchmark
@@ -63,7 +63,7 @@ Based on specific use cases, users might have different tolerances on accuracy i
\* The performance and impact are measured on 10+ popular LLMs. We'll follow up with more data points.
** Calibration time is subject to the actual model size.
-We note that TensorRT-LLM also offers INT8 and FP8 quantization for KV cache. KV cache differs from normal activation because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using KV cache on Hopper & Ada GPUs, We recommend using FP8 KV cache over Int8 because the former has a lower accuracy impact than the latter in most tested cases. When switching from FP16 KV cache to FP8 KV cache, it also enables you to run 2-3x larger batch size on H100 machine for models like GPT-J which further brings about 1.5x performance benefit.
+We note that TensorRT LLM also offers INT8 and FP8 quantization for the KV cache. The KV cache differs from normal activations because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using the KV cache on Hopper and Ada GPUs, we recommend FP8 KV cache over INT8 because the former has a lower accuracy impact than the latter in most tested cases. Switching from FP16 KV cache to FP8 KV cache also enables you to run a 2-3x larger batch size on H100 machines for models like GPT-J, which brings a further ~1.5x performance benefit.
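For illustration, a minimal sketch of requesting FP8 weights together with an FP8 KV cache through the LLM API is shown below. The `QuantConfig`/`QuantAlgo` names and the `kv_cache_quant_algo` field are assumed from the current LLM API; the linked user guide remains the authoritative reference.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# Assumed LLM API spelling: FP8 weights/activations plus an FP8 KV cache,
# matching the recommendation above for Hopper and Ada GPUs.
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8,
)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quant_config=quant_config)
```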
## What’s coming next
-TensorRT-LLM continues to make improvements on our quantization features, such as Int4-FP8 AWQ (W4A8) public examples and more model supports. Please stay tuned for our upcoming releases.
+TensorRT LLM continues to improve its quantization features, such as Int4-FP8 AWQ (W4A8) public examples and support for more models. Please stay tuned for our upcoming releases.
diff --git a/docs/source/dev-on-cloud/dev-on-runpod.md b/docs/source/dev-on-cloud/dev-on-runpod.md
index 5062b598b94..96f66945420 100644
--- a/docs/source/dev-on-cloud/dev-on-runpod.md
+++ b/docs/source/dev-on-cloud/dev-on-runpod.md
@@ -1,7 +1,7 @@
(dev-on-runpod)=
-# Develop TensorRT-LLM on Runpod
-[Runpod](https://runpod.io) is a popular cloud platform among many researchers. This doc describes how to develop TensorRT-LLM on Runpod.
+# Develop TensorRT LLM on Runpod
+[Runpod](https://runpod.io) is a popular cloud platform among many researchers. This doc describes how to develop TensorRT LLM on Runpod.
## Prepare
@@ -13,7 +13,7 @@ Please refer to the [Configure SSH Key](https://docs.runpod.io/pods/configuratio
Note that we can skip the step of "Start your Pod. Make sure of the following things" here as we will introduce it below.
-## Build the TensorRT-LLM Docker Image and Upload to DockerHub
+## Build the TensorRT LLM Docker Image and Upload to DockerHub
Please refer to the [Build Image to DockerHub](build-image-to-dockerhub.md).
Note that the docker image must enable ssh access. See on [Enable ssh access to the container](build-image-to-dockerhub.md#enable-ssh-access-to-the-container).
diff --git a/docs/source/developer-guide/dev-containers.md b/docs/source/developer-guide/dev-containers.md
index 27b5e26d15e..0203a0ea0c8 100644
--- a/docs/source/developer-guide/dev-containers.md
+++ b/docs/source/developer-guide/dev-containers.md
@@ -1,10 +1,10 @@
# Using Dev Containers
-The TensorRT-LLM repository contains a [Dev Containers](https://containers.dev/)
+The TensorRT LLM repository contains a [Dev Containers](https://containers.dev/)
configuration in `.devcontainer`. These files are intended for
use with [Visual Studio Code](https://code.visualstudio.com/).
-Due to the various container options supported by TensorRT-LLM (see
+Due to the various container options supported by TensorRT LLM (see
[](/installation/build-from-source-linux.md) and
), the Dev
Container configuration also offers some degree of customization.
diff --git a/docs/source/developer-guide/perf-benchmarking.md b/docs/source/developer-guide/perf-benchmarking.md
index 6c7dbc97c34..dba690193a3 100644
--- a/docs/source/developer-guide/perf-benchmarking.md
+++ b/docs/source/developer-guide/perf-benchmarking.md
@@ -181,7 +181,7 @@ trtllm-bench --model meta-llama/Llama-3.1-8B \
===========================================================
Model: meta-llama/Llama-3.1-8B
Model Path: /Ckpt/Path/To/Llama-3.1-8B
-TensorRT-LLM Version: 0.17.0
+TensorRT LLM Version: 0.17.0
Dtype: bfloat16
KV Cache Dtype: None
Quantization: FP8
diff --git a/docs/source/features/auto_deploy/advanced/benchmarking_with_trtllm_bench.md b/docs/source/features/auto_deploy/advanced/benchmarking_with_trtllm_bench.md
index 10515500797..1bbf2d1bd89 100644
--- a/docs/source/features/auto_deploy/advanced/benchmarking_with_trtllm_bench.md
+++ b/docs/source/features/auto_deploy/advanced/benchmarking_with_trtllm_bench.md
@@ -4,7 +4,7 @@ AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utilit
## Getting Started
-Before benchmarking with AutoDeploy, review the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow) to familiarize yourself with the standard trtllm-bench workflow and best practices.
+Before benchmarking with AutoDeploy, review the [TensorRT LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow) to familiarize yourself with the standard trtllm-bench workflow and best practices.
## Basic Usage
diff --git a/docs/source/features/auto_deploy/advanced/expert_configurations.md b/docs/source/features/auto_deploy/advanced/expert_configurations.md
index 86109a30e5b..60cfd197a97 100644
--- a/docs/source/features/auto_deploy/advanced/expert_configurations.md
+++ b/docs/source/features/auto_deploy/advanced/expert_configurations.md
@@ -1,6 +1,6 @@
# Expert Configuration of LLM API
-For advanced TensorRT-LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use at your own risk. The argument list may diverge from the standard TensorRT LLM argument list.
+For advanced TensorRT LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use at your own risk. The argument list may diverge from the standard TensorRT LLM argument list.
- All configuration fields used by the AutoDeploy core pipeline, `InferenceOptimizer`, are exposed exclusively in `AutoDeployConfig` in `tensorrt_llm._torch.auto_deploy.llm_args`.
Please make sure to refer to those first.
diff --git a/docs/source/features/disagg-serving.md b/docs/source/features/disagg-serving.md
index 14de507a89d..b8c65f615f4 100644
--- a/docs/source/features/disagg-serving.md
+++ b/docs/source/features/disagg-serving.md
@@ -60,7 +60,7 @@ In TensorRT-LLM, the KV cache exchange is modularly decoupled from the KV cache
### Overlap Optimization
-To optimize the overall performance of disaggregated serving, TensorRT-LLM overlaps the KV cache transmission with computation for multiple independent requests. While one request is sending or receiving its KV cache blocks, other requests can proceed with computation, as illustrated in Figure 4. Furthermore, if context and generation instances are using multiple GPUs per instance, KV cache transmission between different sets of GPUs can occur in parallel.
+To optimize the overall performance of disaggregated serving, TensorRT LLM overlaps the KV cache transmission with computation for multiple independent requests. While one request is sending or receiving its KV cache blocks, other requests can proceed with computation, as illustrated in Figure 4. Furthermore, if context and generation instances are using multiple GPUs per instance, KV cache transmission between different sets of GPUs can occur in parallel.
@@ -71,7 +71,7 @@ To optimize the overall performance of disaggregated serving, TensorRT-LLM overl
### Cache Layout Transformation
-To minimize KV cache transmission latency, TensorRT-LLM currently uses direct transmission between device memories for cache transfer. The KV cache transmission supports using different parallel strategies for the context and generation phases. In such cases, careful orchestration of KV cache block mapping is required. Figure 5 illustrates this using the example of context phase with TP2 and generation phase with PP2.
+To minimize KV cache transmission latency, TensorRT LLM currently uses direct transmission between device memories for cache transfer. The KV cache transmission supports using different parallel strategies for the context and generation phases. In such cases, careful orchestration of KV cache block mapping is required. Figure 5 illustrates this using the example of context phase with TP2 and generation phase with PP2.
@@ -80,13 +80,13 @@ To minimize KV cache transmission latency, TensorRT-LLM currently uses direct tr
Figure 5. KV cache layout conversion
-The optimizations required for KV cache transmission vary depending on whether it's single-node multi-GPU, multi-node multi-GPU, or different GPU models. To accommodate this, TensorRT-LLM provides a set of environment variables for selection in different environments. Please refer to the following section for details [Environment Variables](#Environment-Variables).
+The optimizations required for KV cache transmission vary depending on whether it's single-node multi-GPU, multi-node multi-GPU, or different GPU models. To accommodate this, TensorRT LLM provides a set of environment variables for selection in different environments. Please refer to the following section for details [Environment Variables](#Environment-Variables).
## Usage
### trtllm-serve
-The first approach to do disaggregated LLM inference with TensorRT-LLM involves launching a separate OpenAI-compatible server per context and generation instance using `trtllm-serve`. An additional server, referred to as the "disaggregated" server, is also launched with `trtllm-serve` and acts as an orchestrator which receives client requests and dispatches them to the appropriate context and generation servers via OpenAI REST API. Figure 6 below illustrates the disaggregated serving workflow when using this approach. When a context instance is done generating the KV blocks associated with the prompt, it returns a response to the disaggregated server. This response includes the prompt tokens, the first generated token and metadata associated with the context request and context instance. This metadata is referred to as context parameters (`ctx_params` in Figure 6). These parameters are then used by the generation instances to establish communication with the context instance and retrieve the KV cache blocks associated with the request.
+The first approach to disaggregated LLM inference with TensorRT LLM involves launching a separate OpenAI-compatible server per context and generation instance using `trtllm-serve`. An additional server, referred to as the "disaggregated" server, is also launched with `trtllm-serve` and acts as an orchestrator which receives client requests and dispatches them to the appropriate context and generation servers via the OpenAI REST API. Figure 6 below illustrates the disaggregated serving workflow when using this approach. When a context instance is done generating the KV blocks associated with the prompt, it returns a response to the disaggregated server. This response includes the prompt tokens, the first generated token and metadata associated with the context request and context instance. This metadata is referred to as context parameters (`ctx_params` in Figure 6). These parameters are then used by the generation instances to establish communication with the context instance and retrieve the KV cache blocks associated with the request.
diff --git a/docs/source/features/kvcache.md b/docs/source/features/kvcache.md
index 3f6b394d0e3..ee44ed4fdfc 100644
--- a/docs/source/features/kvcache.md
+++ b/docs/source/features/kvcache.md
@@ -1,6 +1,6 @@
# KV Cache System
-The KV cache stores previously computed key-value pairs for reuse during generation in order to avoid redundant calculations. The TensorRT-LLM KV cache system also supports reuse across requests and uses a suite of tools like offloading and prioritized eviction to increase reuse. It supports variable attention window sizes and Multi-Head Attention (MHA) optimization techniques such as MQA and GQA.
+The KV cache stores previously computed key-value pairs for reuse during generation in order to avoid redundant calculations. The TensorRT LLM KV cache system also supports reuse across requests and uses a suite of tools like offloading and prioritized eviction to increase reuse. It supports variable attention window sizes and Multi-Head Attention (MHA) optimization techniques such as MQA and GQA.
## The Basics
@@ -34,11 +34,11 @@ Reuse across requests is supported by all speculative decoding models. Please se
## Limited Attention Window Size
-TensorRT-LLM takes advantage of layers with limited attention window size in order to reduce computations and memory usage. Blocks that leave the attention window are freed and placed on the radix search tree so they can be reused.
+TensorRT LLM takes advantage of layers with limited attention window size in order to reduce computations and memory usage. Blocks that leave the attention window are freed and placed on the radix search tree so they can be reused.
## MQA / GQA
-TensorRT-LLM takes advantage of grouped query attention in order to save memory. KV cache will create blocks with only enough space to store state for the discrete query head groups. For MHA, there is one group per head, for MQA there is a single group for all the heads. GQA strikes a balance between these two.
+TensorRT LLM takes advantage of grouped query attention in order to save memory. The KV cache creates blocks with only enough space to store state for the discrete query head groups. For MHA, there is one group per head; for MQA, there is a single group for all heads. GQA strikes a balance between these two.
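As a quick illustration of the knobs discussed above, the sketch below configures block reuse, host offloading, and a limited attention window through the LLM API. The exact `KvCacheConfig` field names are assumptions based on the current API; the section below and the API reference are authoritative.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Assumed field names: enable cross-request block reuse, offload evicted blocks
# to ~4 GiB of host memory, and limit the per-layer attention window.
kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,
    host_cache_size=4 * 1024**3,   # bytes of host memory for offloaded blocks
    max_attention_window=[4096],   # one entry per layer, or a single shared value
)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          kv_cache_config=kv_cache_config)
```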
## Controlling KV Cache Behavior
diff --git a/docs/source/features/long-sequence.md b/docs/source/features/long-sequence.md
index 2dffc150479..a5cd512c6ef 100644
--- a/docs/source/features/long-sequence.md
+++ b/docs/source/features/long-sequence.md
@@ -1,15 +1,15 @@
# Long Sequences
-In many real-world scenarios, such as long documents summarization or multi-turn conversations, LLMs are required to perform cognitive tasks across long sequences to get better results. This will present challenges to the LLM inference. TensorRT-LLM can support different methods to process long sequences efficiently. This document will introduce those optimization techniques.
+In many real-world scenarios, such as long-document summarization or multi-turn conversations, LLMs must perform cognitive tasks across long sequences to get better results. This presents challenges for LLM inference. TensorRT LLM supports several methods to process long sequences efficiently. This document introduces those optimization techniques.
## Chunked Context
-Chunked context allows TensorRT-LLM to divide the input tokens into smaller chunks and batch those chunks with the decode requests.
+Chunked context allows TensorRT LLM to divide the input tokens into smaller chunks and batch those chunks with the decode requests.
With the chunked context feature, there are two benefits:
- This can prevent the context phase from becoming a bottleneck, enable more parallelization with tokens in the decode phase, and increase GPU utilization.
-- Chunked context allows TensorRT-LLM to handle requests with longer contexts while achieving higher concurrency. Since memory usage depends on the number of tokens processed per iteration, chunked context decouples memory consumption from the input request's context length, changing it to the smaller chunk size. This enables TensorRT-LLM to process longer contexts without increasing memory requirements, which can also help increase the concurrency under the same memory consumption.
+- Chunked context allows TensorRT LLM to handle requests with longer contexts while achieving higher concurrency. Since memory usage depends on the number of tokens processed per iteration, chunked context decouples memory consumption from the input request's context length, reducing it to the smaller chunk size. This enables TensorRT LLM to process longer contexts without increasing memory requirements, which also helps increase concurrency under the same memory budget.
To enable chunked context, please set the `enable_chunked_prefill` in `LLM` API to `True`.
```bash
@@ -35,7 +35,7 @@ Instead of splitting the input tokens into smaller chunks for the whole model, c
With chunked attention, the tokens in context requests are split into chunks of a specified size. Then tokens can only attend to other tokens in the same chunk. For example, if the chunk size is 3, we might have a mask illustrated in Figure 1. Each token only needs to attend to at most the past chunk-sized tokens. As a result, both the KV cache size and the attention computation can be significantly reduced.
-Currently TensorRT-LLM can only support chunked attention in llama4 model with TRTLLM attention backend. TensorRT-LLM will read `attention_chunk_size` from the model config. If it is not None, the chunked attention will be enabled with chunk size `attention_chunk_size`. If you want to enable chunked attention to other models, you can set the `attention_chunk_size` in attention API to a valid value.
+Currently, TensorRT LLM supports chunked attention only for the Llama 4 model with the TRTLLM attention backend. TensorRT LLM reads `attention_chunk_size` from the model config. If it is not None, chunked attention is enabled with chunk size `attention_chunk_size`. If you want to enable chunked attention for other models, you can set `attention_chunk_size` in the attention API to a valid value.
Note that chunked attention can only be applied to context requests.
@@ -53,7 +53,7 @@ Since attention layers are usually the performance bottleneck when processing re
Figure 2 shows the sliding window attention mask. Each token will only attend to the past `N` tokens. If the number of past tokens surpasses the max attention window size, `Sliding Window Attention` will be activated.
-TensorRT-LLM treats the kv cache as a circular buffer to support this feature, which is also called `Cyclic KV Cache`. It only stores the kv cache for the last `N` tokens, where `N` is determined by the `KvCacheConfig.max_attention_window` parameter in `LLM` API. TensorRT-LLM allows different `N` values for each layer and users can simply provide a `list[int]` to the `KvCacheConfig.max_attention_window`. To enable this feature, users can set
+TensorRT LLM treats the KV cache as a circular buffer to support this feature, which is also called `Cyclic KV Cache`. It only stores the KV cache for the last `N` tokens, where `N` is determined by the `KvCacheConfig.max_attention_window` parameter in the `LLM` API. TensorRT LLM allows different `N` values for each layer, and users can simply provide a `list[int]` to `KvCacheConfig.max_attention_window`. To enable this feature, users can set
```bash
kv_cache_config = KvCacheConfig(
...
diff --git a/docs/source/models/adding-new-model.md b/docs/source/models/adding-new-model.md
index 26d7fa9c15b..7477758fbf3 100644
--- a/docs/source/models/adding-new-model.md
+++ b/docs/source/models/adding-new-model.md
@@ -183,7 +183,7 @@ __all__ = [
#### Out-of-Tree Models
-Alternatively, you can register the new model as an out-of-tree model, so that you can use the new model without touching the TensorRT-LLM codebase. To do so, place `modeling_mymodel.py` (and potentially `configuration_mymodel.py`) in your working directory, and import the modeling code in your script:
+Alternatively, you can register the new model as an out-of-tree model, so that you can use the new model without touching the TensorRT LLM codebase. To do so, place `modeling_mymodel.py` (and potentially `configuration_mymodel.py`) in your working directory, and import the modeling code in your script:
```python
from tensorrt_llm import LLM
diff --git a/docs/source/torch/adding_new_model.md b/docs/source/torch/adding_new_model.md
index 55cbfd4794c..ffe7e60f72c 100644
--- a/docs/source/torch/adding_new_model.md
+++ b/docs/source/torch/adding_new_model.md
@@ -183,7 +183,7 @@ __all__ = [
#### Out-of-Tree Models
-Alternatively, you can register the new model as an out-of-tree model, so that you can use the new model without touching the TensorRT-LLM codebase. To do so, place `modeling_mymodel.py` (and potentially `configuration_mymodel.py`) in your working directory, and import the modeling code in your script:
+Alternatively, you can register the new model as an out-of-tree model, so that you can use the new model without touching the TensorRT LLM codebase. To do so, place `modeling_mymodel.py` (and potentially `configuration_mymodel.py`) in your working directory, and import the modeling code in your script:
```python
from tensorrt_llm import LLM
diff --git a/docs/source/torch/arch_overview.md b/docs/source/torch/arch_overview.md
index ec7f6e51abf..552a1200ea0 100644
--- a/docs/source/torch/arch_overview.md
+++ b/docs/source/torch/arch_overview.md
@@ -1,6 +1,6 @@
# Architecture Overview
-TensorRT-LLM is a toolkit designed to create optimized solutions for Large Language Model (LLM) inference.
+TensorRT LLM is a toolkit designed to create optimized solutions for Large Language Model (LLM) inference.
Besides TensorRT, PyTorch can also serve as the backend for TensorRT-LLM. This document provides an overview of the PyTorch Backend architecture.
## Top Level API
diff --git a/docs/source/torch/attention.md b/docs/source/torch/attention.md
index 2cde32ae905..3ee7be6effd 100644
--- a/docs/source/torch/attention.md
+++ b/docs/source/torch/attention.md
@@ -9,7 +9,7 @@ involves a sequence of batched matrix multiplications, a softmax operation, and
as described in the [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper.
[Multi-query Attention (MQA)](https://arxiv.org/abs/1911.02150) and [Group-query Attention (GQA)](https://arxiv.org/abs/2307.09288) are
variants of MHA that use fewer KV heads than the number of query heads.
-TensorRT-LLM provides several implementations using different backends in `tensorrt_llm/_torch/attention_backend/`.
+TensorRT LLM provides several implementations using different backends in `tensorrt_llm/_torch/attention_backend/`.
The following sections explain how to use these implementations and provide a brief guide on implementing new backends.
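As an illustration only, selecting a backend might look like the sketch below; this assumes the PyTorch-backend `LLM` constructor exposes an `attn_backend` argument, and both the argument name and the accepted values are assumptions — check the backend module above for the current interface.

```python
from tensorrt_llm import LLM

# Assumed knob: attn_backend picks one of the implementations under
# tensorrt_llm/_torch/attention_backend/ (e.g. "TRTLLM" or "FLASHINFER").
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    attn_backend="FLASHINFER",
)
```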
## Attention Backends
diff --git a/docs/source/torch/scheduler.md b/docs/source/torch/scheduler.md
index 57652642e4f..c52811d3b2b 100644
--- a/docs/source/torch/scheduler.md
+++ b/docs/source/torch/scheduler.md
@@ -1,6 +1,6 @@
# Scheduler
-TensorRT-LLM PyTorch backend employs inflight batching, a mechanism where batching and scheduling occur dynamically at each LLM step.
+The TensorRT LLM PyTorch backend employs inflight batching, a mechanism where batching and scheduling occur dynamically at each LLM step.
The scheduler is invoked to determine which requests are scheduled at the current step.
## Scheduler Introduction
diff --git a/examples/apps/README.md b/examples/apps/README.md
index d0971150c51..34e5b773152 100644
--- a/examples/apps/README.md
+++ b/examples/apps/README.md
@@ -18,7 +18,7 @@ Note that, the `model_dir` could accept the following formats:
## FastAPI server
NOTE: This FastAPI-based server is only an example for demonstrating the usage
-of TensorRT-LLM LLM API. It is not intended for production use.
+of the TensorRT LLM LLM API. It is not intended for production use.
For production, use the `trtllm-serve` command. The server exposes OpenAI compatible API endpoints.
### Install the additional requirements
diff --git a/examples/auto_deploy/CONTRIBUTING.md b/examples/auto_deploy/CONTRIBUTING.md
index f9ead35f5d4..defea1559e5 100644
--- a/examples/auto_deploy/CONTRIBUTING.md
+++ b/examples/auto_deploy/CONTRIBUTING.md
@@ -4,7 +4,7 @@
### 0. Clone the repo
-Clone the TensorRT-LLM repo and `cd` into it:
+Clone the TensorRT LLM repo and `cd` into it:
```bash
git clone https://github.com/NVIDIA/TensorRT-LLM.git
@@ -76,7 +76,7 @@ python build_and_run_ad.py --config '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1
### Linting and Pre-commit hooks
-TensorRT-LLM uses pre-commit hooks to lint code.
+TensorRT LLM uses pre-commit hooks to lint code.
#### Set up pre-commit hooks
diff --git a/examples/auto_deploy/README.md b/examples/auto_deploy/README.md
index cba226e7310..eb6b3754b9d 100644
--- a/examples/auto_deploy/README.md
+++ b/examples/auto_deploy/README.md
@@ -252,7 +252,7 @@ for more detail on how AutoDeploy is configured via the `**kwargs` of the `LLM`
### Expert Configuration of LLM API
-For expert TensorRT-LLM users, we also expose the full set of [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)
+For expert TensorRT LLM users, we also expose the full set of [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)
*at your own risk* (the argument list diverges from TRT-LLM's argument list):
diff --git a/examples/bindings/executor/README.md b/examples/bindings/executor/README.md
index 51a605c8376..df44568fd4f 100644
--- a/examples/bindings/executor/README.md
+++ b/examples/bindings/executor/README.md
@@ -5,7 +5,7 @@ using a TensorRT engine.
## Setup
-Build a TensorRT engine for one of the supported TensorRT-LLM model following
+Build a TensorRT engine for one of the supported TensorRT LLM models following the
instructions in the corresponding `examples` folder.
## Usage
diff --git a/examples/cpp/executor/README.md b/examples/cpp/executor/README.md
index 4cc9b72ad98..a5e0815ce9b 100644
--- a/examples/cpp/executor/README.md
+++ b/examples/cpp/executor/README.md
@@ -10,9 +10,9 @@ This directory contains several examples that demonstrate how to use the `Execut
## Building the examples
-To build the examples, you first need to build the TensorRT-LLM C++ shared libraries (`libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so`) using the [`build_wheel.py`](source:scripts/build_wheel.py) script. Alternatively, if you have already build the TensorRT-LLM libraries, you can modify the provided `CMakeLists.txt` such that the `libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so` are imported properly.
+To build the examples, you first need to build the TensorRT LLM C++ shared libraries (`libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so`) using the [`build_wheel.py`](source:scripts/build_wheel.py) script. Alternatively, if you have already built the TensorRT LLM libraries, you can modify the provided `CMakeLists.txt` such that the `libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so` are imported properly.
-Once the TensorRT-LLM libraries are built, you can run
+Once the TensorRT LLM libraries are built, you can run
```
mkdir build
@@ -22,9 +22,9 @@ make -j
```
from the `./examples/cpp/executor/` folder to build the basic and advanced examples.
-## Preparing the TensorRT-LLM engine(s)
+## Preparing the TensorRT LLM engine(s)
-Before you run the examples, please make sure that you have already built engine(s) using the TensorRT-LLM API.
+Before you run the examples, please make sure that you have already built engine(s) using the TensorRT LLM API.
Use `trtllm-build` to build the TRT-LLM engine.
diff --git a/examples/disaggregated/README.md b/examples/disaggregated/README.md
index 2319e3f5b91..11c1888d78a 100644
--- a/examples/disaggregated/README.md
+++ b/examples/disaggregated/README.md
@@ -1,6 +1,6 @@
# Disaggregated Serving
-To run TensorRT-LLM in disaggregated mode, you must first launch context (prefill) and generation (decode) servers using `trtllm-serve`.
+To run TensorRT LLM in disaggregated mode, you must first launch context (prefill) and generation (decode) servers using `trtllm-serve`.
## Launching disaggregated servers locally on single node
@@ -142,7 +142,7 @@ CUDA_VISIBLE_DEVICES=3 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--extra_llm_api_options ./gen_extra-llm-api-config.yml \
--metadata_server_config_file ./metadata_config.yml &> log_gen_0 &
```
-TensorRT-LLM will automatically register any newly launched server with the ETCD server, allowing the router to send new requests to the added server.
+TensorRT LLM will automatically register any newly launched server with the ETCD server, allowing the router to send new requests to the added server.
### Dynamically removing servers
diff --git a/examples/disaggregated/slurm/README.md b/examples/disaggregated/slurm/README.md
index a81607b8bd4..c13ffaa1abf 100644
--- a/examples/disaggregated/slurm/README.md
+++ b/examples/disaggregated/slurm/README.md
@@ -1,6 +1,6 @@
# Disaggregated Inference Benchmark Scripts
-This directory contains scripts to run disaggregated inference benchmarks using TensorRT-LLM and SLURM.
+This directory contains scripts to run disaggregated inference benchmarks using TensorRT LLM and SLURM.
## Overview
diff --git a/examples/draft_target_model/README.md b/examples/draft_target_model/README.md
index 85766aa177b..49128ea4c5d 100644
--- a/examples/draft_target_model/README.md
+++ b/examples/draft_target_model/README.md
@@ -1,10 +1,10 @@
# Draft-Target-Model Speculative Decoding (DTM)
-This document shows how to build and run a model using DTM speculative decoding (also known as `Speculative-Sampling`, [`Paper`](https://arxiv.org/abs/2302.01318)) in TensorRT-LLM on single GPU, or single node multiple GPU.
+This document shows how to build and run a model using DTM speculative decoding (also known as `Speculative-Sampling`, [`Paper`](https://arxiv.org/abs/2302.01318)) in TensorRT LLM on a single GPU or on a single node with multiple GPUs.
## Overview
-We provide two styles of running DTM now: using TensorRT-LLM-BLS in Triton Inference Server, or using TensorRT-LLM directly. Here we introduce the detailed steps of running DTM in both workflows.
+We currently provide two ways to run DTM: using TensorRT-LLM-BLS in Triton Inference Server, or using TensorRT LLM directly. The detailed steps for both workflows are described below.
## Support Matrix
* GPU Compute Capability >= 8.0 (Ampere or newer)
@@ -64,7 +64,7 @@ trtllm-build \
--max_seq_len=${MAX_SEQ_LEN}
```
-### TensorRT-LLM workflow
+### TensorRT LLM workflow
+ `--draft_engine_dir` and `--engine_dir` must be specified for the draft and target engines respectively.
+ `--draft_target_model_config` is the corresponding DTM configuration, which has 4 hyperparameters that you need to specify to control the generation process:
diff --git a/examples/eagle/README.md b/examples/eagle/README.md
index 0b103ca40ed..99a8c792616 100644
--- a/examples/eagle/README.md
+++ b/examples/eagle/README.md
@@ -1,11 +1,11 @@
# EAGLE Speculative Decoding
-This document shows how to build and run a model using EAGLE decoding ([`GitHub`](https://github.com/SafeAILab/EAGLE/tree/main), [`BLOG`](https://sites.google.com/view/eagle-llm)) in TensorRT-LLM on a single node with one or multiple GPUs.
+This document shows how to build and run a model using EAGLE decoding ([`GitHub`](https://github.com/SafeAILab/EAGLE/tree/main), [`BLOG`](https://sites.google.com/view/eagle-llm)) in TensorRT LLM on a single node with one or multiple GPUs.
## Overview
Different from other models, EAGLE decoding needs a base model and an EAGLE model.
-The TensorRT-LLM EAGLE decoding implementation can be found in [tensorrt_llm/models/eagle/model.py](../../tensorrt_llm/models/eagle/model.py).
+The TensorRT LLM EAGLE decoding implementation can be found in [tensorrt_llm/models/eagle/model.py](../../tensorrt_llm/models/eagle/model.py).
The implementation adds an EAGLE drafter network to a base model.
For more info about EAGLE, refer to [speculative decoding documentation](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html).
@@ -23,10 +23,10 @@ For more info about EAGLE, refer to [speculative decoding documentation](https:/
* C++ runtime
* Tensor Parallel
-This example is based on the Vicuna-7b v1.3 model, a fine-tuned Llama. With some modifications, you can add EAGLE to other base models as well. Some TensorRT-LLM models might not work with EAGLE due to the missing head size in the speculative decoding XQA attention kernels.
+This example is based on the Vicuna-7b v1.3 model, a fine-tuned Llama. With some modifications, you can add EAGLE to other base models as well. Some TensorRT LLM models might not work with EAGLE due to the missing head size in the speculative decoding XQA attention kernels.
## Usage
-The TensorRT-LLM EAGLE example code is located in [`examples/eagle`](./). There is one [`convert_checkpoint.py`](./convert_checkpoint.py) file to convert and build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run models with EAGLE decoding support.
+The TensorRT LLM EAGLE example code is located in [`examples/eagle`](./). There is one [`convert_checkpoint.py`](./convert_checkpoint.py) file to convert and build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run models with EAGLE decoding support.
In this example, we use the model from HuggingFace [`yuhuili/EAGLE-Vicuna-7B-v1.3`](https://huggingface.co/yuhuili/EAGLE-Vicuna-7B-v1.3), which is a LLAMA-based model.
### Build TensorRT engine(s)
@@ -80,7 +80,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_4gpu_eagle \
### Run
-To run a TensorRT-LLM model with EAGLE-1 decoding support, you can use `../run.py` script, with an additional argument
+To run a TensorRT LLM model with EAGLE-1 decoding support, you can use the `../run.py` script with an additional argument
`--eagle_choices`.
The `--eagle_choices` argument is of type `list[list[int]]`. If you do not specify any choices, the
default, [mc_sim_7b_63](https://github.com/FasterDecoding/Medusa/blob/main/medusa/model/medusa_choices.py#L1) choices
diff --git a/examples/language_adapter/README.md b/examples/language_adapter/README.md
index 93be421ff53..8487c8ab42a 100755
--- a/examples/language_adapter/README.md
+++ b/examples/language_adapter/README.md
@@ -1,6 +1,6 @@
# Language-Adapter
-This document shows how to build and run a model with Language-Adapter plugin in TensorRT-LLM on NVIDIA GPUs.
+This document shows how to build and run a model with Language-Adapter plugin in TensorRT LLM on NVIDIA GPUs.
## Overview
The concept of Language Adapter during inference time was introduced in [MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer
diff --git a/examples/lookahead/README.md b/examples/lookahead/README.md
index efb3d14e8d5..9c33d8caa44 100644
--- a/examples/lookahead/README.md
+++ b/examples/lookahead/README.md
@@ -11,7 +11,7 @@ Lookahead algorithm is configured with a tuple of `(windows_size, ngram_size, ve
+ `ngram_size` is the n-gram size, meaning the maximum number of draft tokens accepted per iteration.
+ `verification_set_size` is the maximum number of n-grams considered for verification, meaning the number of draft token beam hypotheses.
-You can enable Lookahead decoding for any of decoder-only autoregressive LLM models without any fine-tuning. Some TensorRT-LLM models might not work with Lookahead due to the missing head size in the speculative decoding XQA attention kernels. Lookahead performance greatly depends on the base model, hardware, batch size, sequence length, and the dataset. It is recommended to profile various configurations to find the best `(W, N, G)` configuration given the setup.
+You can enable Lookahead decoding for any of decoder-only autoregressive LLM models without any fine-tuning. Some TensorRT LLM models might not work with Lookahead due to the missing head size in the speculative decoding XQA attention kernels. Lookahead performance greatly depends on the base model, hardware, batch size, sequence length, and the dataset. It is recommended to profile various configurations to find the best `(W, N, G)` configuration given the setup.
Specify the Lookahead related flags in three places:
@@ -25,8 +25,8 @@ def max_draft_len(windows_size, ngram_size, verification_set_size):
+ (windows_size - 1 + verification_set_size) * (ngram_size - 1)
```
-2. *Setup TensorRT-LLM runtime*
-When TensorRT-LLM server starts, the server reserves resources according to the `executor_lookahead_config`. `executor_lookahead_config` is noted as `(W, N, G)`. Ensure the `max_draft_len` derived from `executor_lookahead_config` equals to the `max_draft_len` specified in the engine-building phase -- `--max_draft_len == max_draft_len(W, N, G)`.
+2. *Setup TensorRT LLM runtime*
+When the TensorRT LLM server starts, it reserves resources according to `executor_lookahead_config`, noted as `(W, N, G)`. Ensure that the `max_draft_len` derived from `executor_lookahead_config` equals the `max_draft_len` specified in the engine-building phase -- `--max_draft_len == max_draft_len(W, N, G)`.
3. *Setup the request*
Each request can specify a Lookahead configuration, noted as `(w, n, g)`. If none is specified, the `executor_lookahead_config` is used. The minimum Lookahead config `(1, 1, 0)` forces non-speculative, autoregressive mode. The meaningful minimum configuration is `(2, 2, 1)`. Ensure the Lookahead configuration for each request satisfies `w <= W, n <= N, g <= G`.
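As a small sketch of the constraints described above (tuple names follow this README; the helper is illustrative, not part of TensorRT LLM):

```python
def validate_lookahead_request(executor_cfg, request_cfg):
    """executor_cfg = (W, N, G) reserved at server start; request_cfg = (w, n, g) per request."""
    W, N, G = executor_cfg
    w, n, g = request_cfg
    # Each per-request config must fit inside the executor-level reservation.
    return w <= W and n <= N and g <= G

# (1, 1, 0) forces non-speculative, autoregressive decoding; (2, 2, 1) is the
# smallest configuration that actually performs Lookahead speculation.
assert validate_lookahead_request((8, 8, 8), (2, 2, 1))
```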
diff --git a/examples/medusa/README.md b/examples/medusa/README.md
index 6ef93f36b2c..eb442554ec4 100644
--- a/examples/medusa/README.md
+++ b/examples/medusa/README.md
@@ -1,9 +1,9 @@
# Medusa Decoding
-This document shows how to build and run a model using Medusa decoding([`Github`](https://github.com/FasterDecoding/Medusa), [`BLOG`](https://sites.google.com/view/medusa-llm)) in TensorRT-LLM on single GPU, single node multiple GPU.
+This document shows how to build and run a model using Medusa decoding ([`Github`](https://github.com/FasterDecoding/Medusa), [`BLOG`](https://sites.google.com/view/medusa-llm)) in TensorRT LLM on a single GPU or on a single node with multiple GPUs.
## Overview
-Different from other models, Medusa decoding needs a base model and Medusa heads. The TensorRT-LLM Medusa Decoding implementation can be found in [tensorrt_llm/models/medusa/model.py](../../tensorrt_llm/models/medusa/model.py). The implementation adds Medusa heads to a base model.
+Different from other models, Medusa decoding needs a base model and Medusa heads. The TensorRT LLM Medusa Decoding implementation can be found in [tensorrt_llm/models/medusa/model.py](../../tensorrt_llm/models/medusa/model.py). The implementation adds Medusa heads to a base model.
For more info about Medusa visit [speculative decoding documentation](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html).
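For intuition, each Medusa head is commonly a small residual block followed by its own LM head that predicts a token several positions ahead of the base model's next token. A rough PyTorch sketch (not the TensorRT LLM implementation in `model.py`) looks like:

```python
import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    """One speculative head: residual block + LM head (sketch only)."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the base model's hidden state dominant.
        hidden_states = hidden_states + self.act(self.proj(hidden_states))
        return self.lm_head(hidden_states)  # logits for a future position
```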
@@ -16,7 +16,7 @@ For more info about Medusa visit [speculative decoding documentation](https://nv
* Tensor Parallel
## Usage
-The TensorRT-LLM Medusa example code is located in [`examples/medusa`](./). There is one [`convert_checkpoint.py`](./convert_checkpoint.py) file to convert and build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run models with Medusa decoding support.
+The TensorRT LLM Medusa example code is located in [`examples/medusa`](./). There is one [`convert_checkpoint.py`](./convert_checkpoint.py) file to convert and build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run models with Medusa decoding support.
In this example, we demonstrate the usage of two models:
1. The Vicuna 7B model from Hugging Face [`FasterDecoding/medusa-vicuna-7b-v1.3`](https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3) with its Medusa heads [`medusa-vicuna-7b-v1.3`](https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3).
2. The quantized checkpoint [`nvidia/Llama-3.1-8B-Medusa-FP8`](https://huggingface.co/nvidia/Llama-3.1-8B-Medusa-FP8) on Hugging Face by [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (ModelOpt). This model is based on [Llama-3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B) and enhanced with Medusa heads, with both the base model (except lm_head) and Medusa heads already quantized in FP8.
@@ -32,7 +32,7 @@ git clone https://huggingface.co/lmsys/vicuna-7b-v1.3
https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3
```
-We use `convert_checkpoint.py` script to convert the model for Medusa decoding into TensorRT-LLM checkpoint format.
+We use the `convert_checkpoint.py` script to convert the model for Medusa decoding into the TensorRT LLM checkpoint format.
We can use `--num_medusa_heads` to set the number of Medusa heads to use. If not specified, `num_medusa_heads` is set according to `medusa_num_heads` in the Medusa weights' `config.json`.
Here is the example:
@@ -118,7 +118,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_base_model_fp8_medusa_fp16
```
### Run
-To run a TensorRT-LLM model with Medusa decoding support, we can use `../run.py` script, with an additional argument `--medusa_choices`.
+To run a TensorRT LLM model with Medusa decoding support, we can use the `../run.py` script with an additional argument `--medusa_choices`.
The `--medusa_choices` argument is of type `list[list[int]]`.
Medusa decoding is supported by the Python runtime and the C++ runtime with inflight batching. The C++ runtime is recommended for performance.
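To illustrate the `list[list[int]]` format with a made-up toy tree (not the `mc_sim_7b_63` defaults): each inner list is a path in the draft-token tree, where entry *i* picks a top-k candidate index of the (*i*+1)-th Medusa head along that path.

```python
# Toy example of the choices format (4 draft-tree nodes):
#   [0]    -> top-1 candidate of the first Medusa head
#   [1]    -> top-2 candidate of the first Medusa head
#   [0, 0] -> top-1 of the second head, following the [0] branch
#   [0, 1] -> top-2 of the second head, following the [0] branch
medusa_choices = [[0], [1], [0, 0], [0, 1]]
```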
diff --git a/examples/models/contrib/arctic/README.md b/examples/models/contrib/arctic/README.md
index 977f464ac0f..b707726881a 100644
--- a/examples/models/contrib/arctic/README.md
+++ b/examples/models/contrib/arctic/README.md
@@ -2,13 +2,13 @@
This document shows how to build and run an [Arctic](https://huggingface.co/Snowflake/snowflake-arctic-instruct) model in TensorRT-LLM.
-The TensorRT-LLM Arctic implementation is based on the LLaMA model, with Mixture of Experts (MoE) enabled. The implementation can
+The TensorRT LLM Arctic implementation is based on the LLaMA model, with Mixture of Experts (MoE) enabled. The implementation can
be found in [llama/model.py](../../../../tensorrt_llm/models/llama/model.py).
See the LLaMA example [`examples/models/core/llama`](../../../llama) for details.
- [Arctic](#arctic)
- [Download model checkpoints](#download-model-checkpoints)
- - [TensorRT-LLM workflow](#tensorrt-llm-workflow)
+ - [TensorRT LLM workflow](#tensorrt-llm-workflow)
- [Apply FP8 PTQ](#apply-fp8-ptq)
- [Build TensorRT engine](#build-tensorrt-engine)
- [Run Engine](#run-engine)
@@ -26,7 +26,7 @@ git clone https://huggingface.co/Snowflake/snowflake-arctic-instruct tmp/hf_chec
```
-## TensorRT-LLM workflow
+## TensorRT LLM workflow
Next, we use the general quantization script `quantize.py` to convert the checkpoints to FP8 and build the model with `trtllm-build` on multiple GPUs. In the example below, we use Tensor Parallelism (TP) across 8 GPUs.
**Note: for such a large model, it is necessary to apply Post-Training Quantization (PTQ) methods to the model weights to deploy it on a cluster node, e.g., 8xH100 GPUs. In this example, we demonstrate the FP8 quantization workflow, which is supported on Hopper and later GPU architectures. For instructions on PTQ methods other than FP8, please refer to the LLaMA or Mixtral examples.**
@@ -47,7 +47,7 @@ mkdir -p tmp/trt_engines
Notes:
- currently `quantize.py` does not support Expert Parallelism (EP) mode yet. Users should use `../llama/convert_checkpoint.py` and specify `--moe_ep_size 1` instead, if needed.
-- TensorRT-LLM uses static quantization methods, which is expected to be faster at runtime as compared to dynamic quantization methods. This comes at a cost of an offline calibration step during quantization. `batch_size` and `calib_size` can be adjusted to shorten the calibration time. Please refer to ../quantization/README.md for explanation.
+TensorRT LLM uses static quantization methods, which are expected to be faster at runtime than dynamic quantization methods. This comes at the cost of an offline calibration step during quantization. `batch_size` and `calib_size` can be adjusted to shorten the calibration time. Please refer to ../quantization/README.md for an explanation.
- **due to the large model size and the calibration step (which has to load the HuggingFace model and run forward passes), you will likely need more GPUs during the quantization step than for engine building and final deployment. For example, use 16xH100 or 8xH200 for quantization and 8xH100 for deployment.**
```bash
diff --git a/examples/models/contrib/baichuan/README.md b/examples/models/contrib/baichuan/README.md
index 13e3b01e889..876a70b17ed 100644
--- a/examples/models/contrib/baichuan/README.md
+++ b/examples/models/contrib/baichuan/README.md
@@ -1,6 +1,6 @@
# Baichuan
-This document shows how to build and run a Baichuan models (including `v1_7b`/`v1_13b`/`v2_7b`/`v2_13b`) in TensorRT-LLM on both single GPU and single node multi-GPU.
+This document shows how to build and run Baichuan models (including `v1_7b`/`v1_13b`/`v2_7b`/`v2_13b`) in TensorRT LLM on both a single GPU and a single node with multiple GPUs.
## Table of Contents
@@ -21,9 +21,9 @@ This document shows how to build and run a Baichuan models (including `v1_7b`/`v
## Overview
-The TensorRT-LLM Baichuan implementation can be found in [tensorrt_llm/models/baichuan/model.py](../../tensorrt_llm/models/baichuan/model.py). The TensorRT-LLM Baichuan example code is located in [`examples/models/contrib/baichuan`](./). There is one main file:
+The TensorRT LLM Baichuan implementation can be found in [tensorrt_llm/models/baichuan/model.py](../../tensorrt_llm/models/baichuan/model.py). The TensorRT LLM Baichuan example code is located in [`examples/models/contrib/baichuan`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert supported checkpoints into TensorRT-LLM format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert supported checkpoints into TensorRT LLM format.
The script accepts an argument named model_version, whose value should be `v1_7b`/`v1_13b`/`v2_7b`/`v2_13b` and the default value is `v1_13b`.
@@ -43,7 +43,7 @@ In addition, there are two shared files in the folder [`examples`](../../../) fo
## Usage
-The TensorRT-LLM Baichuan example code locates at [examples/models/contrib/baichuan](./). It takes HF weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM Baichuan example code is located at [examples/models/contrib/baichuan](./). It takes HF weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### Build TensorRT engine(s)
@@ -55,14 +55,14 @@ pip install -r requirements.txt
You need to specify the HF Baichuan checkpoint path. For `v1_13b`, you should use either [baichuan-inc/Baichuan-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan-13B-Chat) or [baichuan-inc/Baichuan-13B-Base](https://huggingface.co/baichuan-inc/Baichuan-13B-Base). For `v2_13b`, you should use either [baichuan-inc/Baichuan2-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat) or [baichuan-inc/Baichuan2-13B-Base](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base). More Baichuan models can be found at [baichuan-inc](https://huggingface.co/baichuan-inc).
-TensorRT-LLM Baichuan builds TensorRT engine(s) from HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build engine(s) with dummy weights.
+TensorRT LLM Baichuan builds TensorRT engine(s) from HF checkpoint. If no checkpoint directory is specified, TensorRT LLM will build engine(s) with dummy weights.
***For all kinds of checkpoints, they share the same trtllm-build command like:***
```bash
-# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
+# Enable several TensorRT LLM plugins to increase runtime performance. It also helps with build time.
-# The TensorRT-LLM GPT Attention plugin (--gpt_attention_plugin) is
+# The TensorRT LLM GPT Attention plugin (--gpt_attention_plugin) is
# enabled by default to increase runtime performance.
# 7B models should always enable `gpt_attention_plugin` since RoPE is only
# supported with GPTAttention plugin now.
@@ -117,7 +117,7 @@ python convert_checkpoint.py --model_version v1_13b \
#### SmoothQuant
-The SmoothQuant supports all Baichuan model variants. Unlike the FP16 build where the HF weights are processed and loaded into the TensorRT-LLM directly, the SmoothQuant needs to load INT8 weights which should be pre-processed before building an engine.
+SmoothQuant supports all Baichuan model variants. Unlike the FP16 build, where the HF weights are processed and loaded into TensorRT LLM directly, SmoothQuant needs to load INT8 weights that are pre-processed before building an engine.
`--smoothquant` is the starting point of INT8 inference. By default, it
will run the model in the _per-tensor_ mode.
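For intuition about what the _per-tensor_ mode means here (a conceptual sketch only, not what the conversion script does internally): a single scale is calibrated offline for an entire tensor and then reused unchanged at runtime.

```python
import torch

def calibrate_per_tensor_scale(calibration_activations: torch.Tensor) -> float:
    """Offline calibration: one INT8 scale for the whole tensor."""
    return calibration_activations.abs().max().item() / 127.0

def quantize_per_tensor(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Static quantization reuses the calibrated scale at runtime (no per-batch max).
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
```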
@@ -150,7 +150,7 @@ python ../../../quantization/quantize.py --model_dir /code/model/Baichuan2-13B-C
--calib_size 256
```
-The quantized model checkpoint is saved to `./quantized_fp8/` for future TensorRT-LLM engine build directly with the `trtllm-build` command mentioned above.
+The quantized model checkpoint is saved to `./quantized_fp8/` and can be used directly to build the TensorRT LLM engine with the `trtllm-build` command mentioned above.
Note that you can enable fp8 context fmha to get further acceleration by setting `--use_fp8_context_fmha enable` when building the engines.
#### Groupwise quantization (AWQ/GPTQ)
@@ -164,7 +164,7 @@ python ../../../quantization/quantize.py --model_dir /code/model/Baichuan2-13B-C
--output_dir ./quantized_int4-awq_gs128 \
--calib_size 32
```
-The quantized model checkpoint is saved to `./quantized_int4-awq_gs128/` for future TensorRT-LLM engine build directly with the `trtllm-build` command mentioned above.
+The quantized model checkpoint is saved to `./quantized_int4-awq_gs128/` and can be used directly to build the TensorRT LLM engine with the `trtllm-build` command mentioned above.
##### GPTQ
To run the GPTQ Baichuan example, the following steps are required:
@@ -173,7 +173,7 @@ To run the GPTQ Baichuan example, the following steps are required:
Quantized weights for GPTQ can be generated using an open source project such as [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa.git).
- Let us build the TensorRT-LLM engine with the saved `./baichuan-2-13b-4bit-gs64.safetensors`.
+ Let us build the TensorRT LLM engine with the saved `./baichuan-2-13b-4bit-gs64.safetensors`.
2. Checkpoint conversion:
@@ -189,7 +189,7 @@ To run the GPTQ Baichuan example, the following steps are required:
--tp_size 2 \
--output_dir ./tmp/baichuan_v2_13b/trt_ckpts/int4_gptq_gs64/2-gpu/
```
- The quantized model checkpoint is saved for future TensorRT-LLM engine build directly with the `trtllm-build` command mentioned above.
+ The quantized model checkpoint is saved and can be used directly to build the TensorRT LLM engine with the `trtllm-build` command mentioned above.
#### INT8 KV cache
INT8 KV cache can be enabled to reduce the memory footprint. It brings more performance gains when the batch size gets larger.
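As a rough back-of-the-envelope illustration of the savings (the layer/head/dimension values below are assumed Baichuan-13B-like shapes, not taken from this README):

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes_per_element
layers, kv_heads, head_dim = 40, 40, 128   # assumed Baichuan-13B-like shapes
seq_len, batch = 4096, 8

def kv_cache_gib(bytes_per_element: int) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_element / 1024**3

print(f"FP16 KV cache: {kv_cache_gib(2):.1f} GiB, INT8 KV cache: {kv_cache_gib(1):.1f} GiB")
```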
@@ -249,7 +249,7 @@ python convert_checkpoint.py --model_version v1_13b \
### Run
-To run a TensorRT-LLM Baichuan model using the engines generated by `trtllm-build`
+To run a TensorRT LLM Baichuan model using the engines generated by `trtllm-build`
```bash
# With fp16 inference
diff --git a/examples/models/contrib/bloom/README.md b/examples/models/contrib/bloom/README.md
index f4f738c35e1..e0ab0ad6553 100644
--- a/examples/models/contrib/bloom/README.md
+++ b/examples/models/contrib/bloom/README.md
@@ -1,6 +1,6 @@
# BLOOM
-This document shows how to build and run a BLOOM model in TensorRT-LLM on both single GPU, single node multi-GPU and multi-node multi-GPU.
+This document shows how to build and run a BLOOM model in TensorRT LLM on a single GPU, a single node with multiple GPUs, and multiple nodes with multiple GPUs.
## Table of Contents
@@ -17,9 +17,9 @@ This document shows how to build and run a BLOOM model in TensorRT-LLM on both s
## Overview
-The TensorRT-LLM BLOOM implementation can be found in [tensorrt_llm/models/bloom/model.py](../../tensorrt_llm/models/bloom/model.py). The TensorRT-LLM BLOOM example code is located in [`examples/models/contrib/bloom`](./). There is one main file:
+The TensorRT LLM BLOOM implementation can be found in [tensorrt_llm/models/bloom/model.py](../../tensorrt_llm/models/bloom/model.py). The TensorRT LLM BLOOM example code is located in [`examples/models/contrib/bloom`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -36,7 +36,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
## Usage
-The TensorRT-LLM BLOOM example code locates at [examples/models/contrib/bloom](./). It takes HF weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM BLOOM example code is located at [examples/models/contrib/bloom](./). It takes HF weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### Build TensorRT engine(s)
@@ -57,7 +57,7 @@ rm -rf ./bloom/560M
mkdir -p ./bloom/560M && git clone https://huggingface.co/bigscience/bloom-560m ./bloom/560M
```
-TensorRT-LLM BLOOM builds TensorRT engine(s) from HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build engine(s) with dummy weights.
+TensorRT LLM BLOOM builds TensorRT engine(s) from HF checkpoint. If no checkpoint directory is specified, TensorRT LLM will build engine(s) with dummy weights.
Normally `trtllm-build` only requires a single GPU, but if you already have all the GPUs needed for inference, you can enable parallel building to make the engine building process faster by adding the `--workers` argument. Please note that currently the `workers` feature only supports a single node.
@@ -159,7 +159,7 @@ trtllm-build --checkpoint_dir ./bloom/560m/trt_ckpt/int8/1-gpu/ \
#### SmoothQuant
-Unlike the FP16 build where the HF weights are processed and loaded into the TensorRT-LLM directly, the SmoothQuant needs to load INT8 weights which should be pre-processed before building an engine.
+Unlike the FP16 build, where the HF weights are processed and loaded into TensorRT LLM directly, SmoothQuant needs to load INT8 weights that are pre-processed before building an engine.
Example:
```bash
diff --git a/examples/models/contrib/chatglm-6b/README.md b/examples/models/contrib/chatglm-6b/README.md
index fbe463b4c5c..73d60e235bb 100644
--- a/examples/models/contrib/chatglm-6b/README.md
+++ b/examples/models/contrib/chatglm-6b/README.md
@@ -1,6 +1,6 @@
# ChatGLM
-This document explains how to build the [ChatGLM-6B](https://huggingface.co/THUDM/chatglm-6b) models using TensorRT-LLM and run on a single GPU, a single node with multiple GPUs or multiple nodes with multiple GPUs.
+This document explains how to build the [ChatGLM-6B](https://huggingface.co/THUDM/chatglm-6b) models using TensorRT LLM and run on a single GPU, a single node with multiple GPUs or multiple nodes with multiple GPUs.
- [ChatGLM](#chatglm)
- [Overview](#overview)
@@ -9,7 +9,7 @@ This document explains how to build the [ChatGLM-6B](https://huggingface.co/THUD
- [Tokenizer and special tokens comparison](#tokenizer-and-special-tokens-comparison)
- [Usage](#usage)
- [1. Download repo and weights from HuggingFace Transformers](#1-download-repo-and-weights-from-huggingface-transformers)
- - [2. Convert weights from HF Transformers to TensorRT-LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
+ - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
- [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
- [Enable plugins](#enable-plugins)
- [In-flight batching](#in-flight-batching)
@@ -26,10 +26,10 @@ This document explains how to build the [ChatGLM-6B](https://huggingface.co/THUD
## Overview
-The TensorRT-LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).
-The TensorRT-LLM ChatGLM example code is located in [`examples/models/contrib/chatglm-6b`](./). There is one main file:
+The TensorRT LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).
+The TensorRT LLM ChatGLM example code is located in [`examples/models/contrib/chatglm-6b`](./). There is one main file:
-* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
diff --git a/examples/models/contrib/chatglm2-6b/README.md b/examples/models/contrib/chatglm2-6b/README.md
index 30fc3ce3933..b75b3a069ea 100644
--- a/examples/models/contrib/chatglm2-6b/README.md
+++ b/examples/models/contrib/chatglm2-6b/README.md
@@ -1,6 +1,6 @@
# ChatGLM
-This document explains how to build the [ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b), [ChatGLM2-6B-32k](https://huggingface.co/THUDM/chatglm2-6b-32k) models using TensorRT-LLM and run on a single GPU, a single node with multiple GPUs or multiple nodes with multiple GPUs.
+This document explains how to build the [ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b), [ChatGLM2-6B-32k](https://huggingface.co/THUDM/chatglm2-6b-32k) models using TensorRT LLM and run on a single GPU, a single node with multiple GPUs or multiple nodes with multiple GPUs.
- [ChatGLM](#chatglm)
- [Overview](#overview)
@@ -9,7 +9,7 @@ This document explains how to build the [ChatGLM2-6B](https://huggingface.co/THU
- [Tokenizer and special tokens comparison](#tokenizer-and-special-tokens-comparison)
- [Usage](#usage)
- [1. Download repo and weights from HuggingFace Transformers](#1-download-repo-and-weights-from-huggingface-transformers)
- - [2. Convert weights from HF Transformers to TensorRT-LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
+ - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
- [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
- [Enable plugins](#enable-plugins)
- [In-flight batching](#in-flight-batching)
@@ -26,10 +26,10 @@ This document explains how to build the [ChatGLM2-6B](https://huggingface.co/THU
## Overview
-The TensorRT-LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).
-The TensorRT-LLM ChatGLM example code is located in [`examples/models/contrib/chatglm2-6b`](./). There is one main file:
+The TensorRT LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).
+The TensorRT LLM ChatGLM example code is located in [`examples/models/contrib/chatglm2-6b`](./). There is one main file:
-* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
diff --git a/examples/models/contrib/chatglm3-6b-32k/README.md b/examples/models/contrib/chatglm3-6b-32k/README.md
index 211844d95e4..ed9d7ab81f4 100644
--- a/examples/models/contrib/chatglm3-6b-32k/README.md
+++ b/examples/models/contrib/chatglm3-6b-32k/README.md
@@ -1,6 +1,6 @@
# ChatGLM
-This document explains how to build the [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b), [ChatGLM3-6B-Base](https://huggingface.co/THUDM/chatglm3-6b-base), [ChatGLM3-6B-32k](https://huggingface.co/THUDM/chatglm3-6b-32k) models using TensorRT-LLM and run on a single GPU, a single node with multiple GPUs or multiple nodes with multiple GPUs.
+This document explains how to build the [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b), [ChatGLM3-6B-Base](https://huggingface.co/THUDM/chatglm3-6b-base), [ChatGLM3-6B-32k](https://huggingface.co/THUDM/chatglm3-6b-32k) models using TensorRT LLM and run on a single GPU, a single node with multiple GPUs or multiple nodes with multiple GPUs.
- [ChatGLM](#chatglm)
- [Overview](#overview)
@@ -9,7 +9,7 @@ This document explains how to build the [ChatGLM3-6B](https://huggingface.co/THU
- [Tokenizer and special tokens comparison](#tokenizer-and-special-tokens-comparison)
- [Usage](#usage)
- [1. Download repo and weights from HuggingFace Transformers](#1-download-repo-and-weights-from-huggingface-transformers)
- - [2. Convert weights from HF Transformers to TensorRT-LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
+ - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
- [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
- [Enable plugins](#enable-plugins)
- [In-flight batching](#in-flight-batching)
@@ -26,10 +26,10 @@ This document explains how to build the [ChatGLM3-6B](https://huggingface.co/THU
## Overview
-The TensorRT-LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).
-The TensorRT-LLM ChatGLM example code is located in [`examples/models/contrib/chatglm3-6b-32k`](./). There is one main file:
+The TensorRT LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../tensorrt_llm/models/chatglm/model.py).
+The TensorRT LLM ChatGLM example code is located in [`examples/models/contrib/chatglm3-6b-32k`](./). There is one main file:
-* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
diff --git a/examples/models/contrib/dbrx/README.md b/examples/models/contrib/dbrx/README.md
index e9a9f580f2f..fb799c7c749 100644
--- a/examples/models/contrib/dbrx/README.md
+++ b/examples/models/contrib/dbrx/README.md
@@ -14,7 +14,7 @@ This document shows how to build and run a DBRX model in TensorRT-LLM. DBRX is a
## Overview
-The TensorRT-LLM DBRX implementation can be found in [tensorrt_llm/models/dbrx/model.py](../../../../tensorrt_llm/models/dbrx/model.py).
+The TensorRT LLM DBRX implementation can be found in [tensorrt_llm/models/dbrx/model.py](../../../../tensorrt_llm/models/dbrx/model.py).
## Support Matrix
* BF16
@@ -41,7 +41,7 @@ pip install -r requirements.txt
git lfs install
```
-Download one or more DBRX models that you would like to build to TensorRT-LLM engines. You can download from the [HuggingFace](https://huggingface.co) hub:
+Download one or more DBRX models that you would like to build into TensorRT LLM engines. You can download them from the [HuggingFace](https://huggingface.co) hub:
```bash
# Download dbrx-base
@@ -53,9 +53,9 @@ git clone https://huggingface.co/databricks/dbrx-instruct
### Build TensorRT engine(s)
-The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF weights to TensorRT-LLM checkpoints. A DBRX model has 132B parameters, so you need at least 4 x 80GB GPUs to load the model in 16-bit precision for weight conversion.
+The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF weights to TensorRT LLM checkpoints. A DBRX model has 132B parameters, so you need at least 4 x 80GB GPUs to load the model in 16-bit precision for weight conversion.
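A quick sanity check of that GPU count (rough arithmetic for the weights alone, ignoring activations and framework overhead):

```python
import math

params = 132e9          # DBRX parameter count
bytes_per_param = 2     # 16-bit precision
weights_gb = params * bytes_per_param / 1e9     # ~264 GB of weights
gpus_needed = math.ceil(weights_gb / 80)        # 80 GB per GPU -> 4
print(f"{weights_gb:.0f} GB of weights -> at least {gpus_needed} x 80GB GPUs")
```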
-The `trtllm-build` command builds TensorRT-LLM engines from TensorRT-LLM checkpoints. The number of engine files is same to the number of GPUs used to run inference. Normally, `trtllm-build` uses one GPU by default, but if you have already more GPUs available at build time, you may enable parallel builds to make the engine building process faster by adding the `--workers` argument.
+The `trtllm-build` command builds TensorRT LLM engines from TensorRT LLM checkpoints. The number of engine files is the same as the number of GPUs used to run inference. Normally, `trtllm-build` uses one GPU by default, but if you already have more GPUs available at build time, you may enable parallel builds to make the engine building process faster by adding the `--workers` argument.
Here are some examples:
@@ -221,10 +221,10 @@ mpirun -n 8 \
If the engines are run successfully, you will see output like:
```
......
-[04/02/2024-11:16:37] [TRT-LLM] [I] TensorRT-LLM (total latency: 9.962657451629639 sec)
-[04/02/2024-11:16:37] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1189)
-[04/02/2024-11:16:37] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 119.34566713477734)
-[04/02/2024-11:16:37] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[04/02/2024-11:16:37] [TRT-LLM] [I] TensorRT LLM (total latency: 9.962657451629639 sec)
+[04/02/2024-11:16:37] [TRT-LLM] [I] TensorRT LLM (total output tokens: 1189)
+[04/02/2024-11:16:37] [TRT-LLM] [I] TensorRT LLM (tokens per second: 119.34566713477734)
+[04/02/2024-11:16:37] [TRT-LLM] [I] TensorRT LLM beam 0 result
[04/02/2024-11:16:37] [TRT-LLM] [I] rouge1 : 26.842471264679535
[04/02/2024-11:16:37] [TRT-LLM] [I] rouge2 : 9.979512100961314
[04/02/2024-11:16:37] [TRT-LLM] [I] rougeL : 19.50336050538688
diff --git a/examples/models/contrib/deepseek_v1/README.md b/examples/models/contrib/deepseek_v1/README.md
index 3e18c3a7da2..d3d6272a436 100755
--- a/examples/models/contrib/deepseek_v1/README.md
+++ b/examples/models/contrib/deepseek_v1/README.md
@@ -25,15 +25,15 @@ The Deepseek-v1 model requires 1x80G GPU memory.
## Overview
-The TensorRT-LLM Deepseek-v1 implementation can be found in [tensorrt_llm/models/deepseek_v1/model.py](../../tensorrt_llm/models/deepseek_v1/model.py). The TensorRT-LLM Deepseek-v1 example code is located in [`examples/models/contrib/deepseek_v1`](./). There is one main file:
+The TensorRT LLM Deepseek-v1 implementation can be found in [tensorrt_llm/models/deepseek_v1/model.py](../../tensorrt_llm/models/deepseek_v1/model.py). The TensorRT LLM Deepseek-v1 example code is located in [`examples/models/contrib/deepseek_v1`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the Deepseek-v1 model into tensorrt-llm checkpoint format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the Deepseek-v1 model into TensorRT LLM checkpoint format.
In addition, there are three shared files in the parent folder [`examples`](../../../) that can be used for inference and evaluation:
* [`../../../run.py`](../../../run.py) to run model inference given an input text.
-* [`../../../summarize.py`](../../../summarize.py) to summarize the article from [cnn_dailmail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset, it can running the summarize from HF model and TensorRT-LLM model.
-* [`../../../mmlu.py`](../../../mmlu.py) to running score script from https://github.com/declare-lab/instruct-eval to compare HF model and TensorRT-LLM model on the MMLU dataset.
+* [`../../../summarize.py`](../../../summarize.py) to summarize articles from the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset; it can run summarization with both the HF model and the TensorRT LLM model.
+* [`../../../mmlu.py`](../../../mmlu.py) to run the scoring script from https://github.com/declare-lab/instruct-eval to compare the HF model and the TensorRT LLM model on the MMLU dataset.
## Support Matrix
@@ -43,13 +43,13 @@ In addition, there are three shared files in the parent folder [`examples`](../.
## Usage
-The TensorRT-LLM Deepseek-v1 example code locates at [examples/models/contrib/deepseek_v1](./). It takes PyTorch weights as input, and builds corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM Deepseek-v1 example code is located at [examples/models/contrib/deepseek_v1](./). It takes PyTorch weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### Build TensorRT engine(s)
Below are the step-by-step instructions to run Deepseek-v1 with TensorRT-LLM.
-First the checkpoint will be converted to the TensorRT-LLM checkpoint format by apply [`convert_checkpoint.py`](./convert_checkpoint.py). After that, the TensorRT engine(s) can be build with TensorRT-LLM checkpoint.
+First, the checkpoint is converted to the TensorRT LLM checkpoint format by applying [`convert_checkpoint.py`](./convert_checkpoint.py). After that, the TensorRT engine(s) can be built with the TensorRT LLM checkpoint.
```bash
# Build the bfloat16 engine from Deepseek-v1 HF weights.
@@ -76,7 +76,7 @@ python ../../../run.py --engine_dir ./trtllm_engines/deepseek_v1/bf16/tp1 \
### FP8 Quantization
-The [`../../../quantization/quantize.py`](../../../quantization/quantize.py) script can be used to quantize the models and export TensorRT-LLM checkpoints.
+The [`../../../quantization/quantize.py`](../../../quantization/quantize.py) script can be used to quantize the models and export TensorRT LLM checkpoints.
```bash
# Deepseek-v1: single gpu, fp8 quantization
diff --git a/examples/models/contrib/deepseek_v2/README.md b/examples/models/contrib/deepseek_v2/README.md
index b26ba54fadf..01d22e4dd8b 100644
--- a/examples/models/contrib/deepseek_v2/README.md
+++ b/examples/models/contrib/deepseek_v2/README.md
@@ -27,15 +27,15 @@ The Deepseek-v2 model requires least 8x80G GPU memory, model contains 236B param
## Overview
-The TensorRT-LLM Deepseek-v2 implementation can be found in [tensorrt_llm/models/deepseek_v2/model.py](../../tensorrt_llm/models/deepseek_v2/model.py). The TensorRT-LLM Deepseek-v2 example code is located in [`examples/models/contrib/deepseek_v2`](./). There is one main file:
+The TensorRT LLM Deepseek-v2 implementation can be found in [tensorrt_llm/models/deepseek_v2/model.py](../../tensorrt_llm/models/deepseek_v2/model.py). The TensorRT LLM Deepseek-v2 example code is located in [`examples/models/contrib/deepseek_v2`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the Deepseek-v2 model into tensorrt-llm checkpoint format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the Deepseek-v2 model into TensorRT LLM checkpoint format.
In addition, there are three shared files in the parent folder [`examples`](../../../) that can be used for inference and evaluation:
* [`../../../run.py`](../../../run.py) to run model inference given an input text.
-* [`../../../summarize.py`](../../../summarize.py) to summarize the article from [cnn_dailmail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset, it can running the summarize from HF model and TensorRT-LLM model.
-* [`../../../mmlu.py`](../../../mmlu.py) to running score script from https://github.com/declare-lab/instruct-eval to compare HF model and TensorRT-LLM model on the MMLU dataset.
+* [`../../../summarize.py`](../../../summarize.py) to summarize articles from the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset; it can run summarization with both the HF model and the TensorRT LLM model.
+* [`../../../mmlu.py`](../../../mmlu.py) to run the scoring script from https://github.com/declare-lab/instruct-eval to compare the HF model and the TensorRT LLM model on the MMLU dataset.
## Support Matrix
@@ -46,16 +46,16 @@ In addition, there are three shared files in the parent folder [`examples`](../.
## Usage
-The TensorRT-LLM Deepseek-v2 example code locates at [examples/models/contrib/deepseek_v2](./). It takes PyTorch weights as input, and builds corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM Deepseek-v2 example code is located at [examples/models/contrib/deepseek_v2](./). It takes PyTorch weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### Build TensorRT engine(s)
Below are the step-by-step instructions to run Deepseek-v2 with TensorRT-LLM.
-First the checkpoint will be converted to the TensorRT-LLM checkpoint format by apply [`convert_checkpoint.py`](./convert_checkpoint.py). After that, the TensorRT engine(s) can be build with TensorRT-LLM checkpoint.
+First, the checkpoint is converted to the TensorRT LLM checkpoint format by applying [`convert_checkpoint.py`](./convert_checkpoint.py). After that, the TensorRT engine(s) can be built with the TensorRT LLM checkpoint.
```bash
-# Convert Deepseek-v2 HF weights to TensorRT-LLM checkpoint format.
+# Convert Deepseek-v2 HF weights to TensorRT LLM checkpoint format.
python convert_checkpoint.py --model_dir ./DeepSeek-V2 \
--output_dir ./trtllm_checkpoint_deepseek_v2_8gpu_bf16 \
--dtype bfloat16 \
@@ -72,7 +72,7 @@ python convert_checkpoint.py --model_dir ./DeepSeek-V2 \
We observe that using GPUs (8xH200) the checkpoint conversion took ~34 minutes, while using CPUs it took ~21 minutes and required >= 770GB of CPU memory.
-After the checkpoint conversion, the TensorRT engine(s) can be built with the TensorRT-LLM checkpoint.
+After the checkpoint conversion, the TensorRT engine(s) can be built with the TensorRT LLM checkpoint.
```bash
# Build engine
@@ -142,10 +142,10 @@ and the output will be like:
[10/28/2024-16:46:22] [TRT-LLM] [I]
Output : [[' James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88.']]
[10/28/2024-16:46:22] [TRT-LLM] [I] ---------------------------------------------------------
-[10/28/2024-16:49:33] [TRT-LLM] [I] TensorRT-LLM (total latency: 32.02327513694763 sec)
-[10/28/2024-16:49:33] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1394)
-[10/28/2024-16:49:33] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 43.53083793080361)
-[10/28/2024-16:49:33] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[10/28/2024-16:49:33] [TRT-LLM] [I] TensorRT LLM (total latency: 32.02327513694763 sec)
+[10/28/2024-16:49:33] [TRT-LLM] [I] TensorRT LLM (total output tokens: 1394)
+[10/28/2024-16:49:33] [TRT-LLM] [I] TensorRT LLM (tokens per second: 43.53083793080361)
+[10/28/2024-16:49:33] [TRT-LLM] [I] TensorRT LLM beam 0 result
[10/28/2024-16:49:33] [TRT-LLM] [I] rouge1 : 17.85755990133811
[10/28/2024-16:49:33] [TRT-LLM] [I] rouge2 : 6.273032755727469
[10/28/2024-16:49:33] [TRT-LLM] [I] rougeL : 14.768323033457317
diff --git a/examples/models/contrib/dit/README.md b/examples/models/contrib/dit/README.md
index d0e163dc1b3..7d8105fd3b2 100644
--- a/examples/models/contrib/dit/README.md
+++ b/examples/models/contrib/dit/README.md
@@ -3,9 +3,9 @@ This document shows how to build and run a [DiT](https://arxiv.org/abs/2212.0974
## Overview
-The TensorRT-LLM DiT implementation can be found in [tensorrt_llm/models/dit/model.py](../../../../tensorrt_llm/models/dit/model.py). The TensorRT-LLM DiT example code is located in [`examples/dit`](./). There are main files to build and run DiT with TensorRT-LLM:
+The TensorRT LLM DiT implementation can be found in [tensorrt_llm/models/dit/model.py](../../../../tensorrt_llm/models/dit/model.py). The TensorRT LLM DiT example code is located in [`examples/dit`](./). The main files to build and run DiT with TensorRT LLM are:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the DiT model into tensorrt-llm checkpoint format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the DiT model into TensorRT LLM checkpoint format.
* [`sample.py`](./sample.py) to generate images with TensorRT engine(s).
## Support Matrix
@@ -17,13 +17,13 @@ The TensorRT-LLM DiT implementation can be found in [tensorrt_llm/models/dit/mod
## Usage
-The TensorRT-LLM DiT example code locates at [examples/dit](./). It takes PyTorch weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM DiT example code is located at [examples/dit](./). It takes PyTorch weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### Build DiT TensorRT engine(s)
First, download the pretrained DiT-XL/2 PyTorch checkpoint from the official PyTorch implementation repo [here](https://github.com/facebookresearch/DiT/tree/main?tab=readme-ov-file#sampling--). Please review its license terms before use.
-This checkpoint will be converted to the TensorRT-LLM checkpoint format by [`convert_checkpoint.py`](./convert_checkpoint.py). After that, we can build TensorRT engine(s) with the TensorRT-LLM checkpoint.
+This checkpoint will be converted to the TensorRT LLM checkpoint format by [`convert_checkpoint.py`](./convert_checkpoint.py). After that, we can build TensorRT engine(s) with the TensorRT LLM checkpoint.
To run inference with FP8 quantization, note that currently only linear layers can be quantized. Make sure that the scaling factors for the weights are also stored in the quantized checkpoint.
@@ -57,7 +57,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_fp8/ \
Set `--max_batch_size` to specify the maximum number of images you would like to generate. We disable `--remove_input_padding` since we don't need to pad DiT's patches. Besides, we disable `--bert_attention_plugin` for better performance, since the plugin's fmha is not supported for DiT's hidden size (72 for DiT-XL).
-After build, we can find a `./engine_output` directory, it is ready for running DiT model with TensorRT-LLM now.
+After the build, we can find an `./engine_output` directory; it is now ready for running the DiT model with TensorRT LLM.
### Build VAE TensorRT engine
We can further accelerate the VAE decoder with TensorRT.
diff --git a/examples/models/contrib/falcon/README.md b/examples/models/contrib/falcon/README.md
index 613def2eb0b..3c5c8a229cf 100644
--- a/examples/models/contrib/falcon/README.md
+++ b/examples/models/contrib/falcon/README.md
@@ -1,13 +1,13 @@
# Falcon
-This document shows how to build and run a Falcon model in TensorRT-LLM on single GPU, single node multi-GPU, and multi-node multi-GPU.
+This document shows how to build and run a Falcon model in TensorRT LLM on a single GPU, a single node with multiple GPUs, and multiple nodes with multiple GPUs.
- [Falcon](#falcon)
- [Overview](#overview)
- [Support Matrix](#support-matrix)
- [Usage](#usage)
- [1. Download weights from HuggingFace Transformers](#1-download-weights-from-huggingface-transformers)
- - [2. Convert weights from HF Transformers to TensorRT-LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
+ - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
- [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
- [4. Run summarization task with the TensorRT engine(s)](#4-run-summarization-task-with-the-tensorrt-engines)
- [FP8 Post-Training Quantization](#fp8-post-training-quantization)
@@ -18,9 +18,9 @@ This document shows how to build and run a Falcon model in TensorRT-LLM on singl
## Overview
-The TensorRT-LLM Falcon implementation can be found in [tensorrt_llm/models/falcon/model.py](../../tensorrt_llm/models/falcon/model.py). The TensorRT-LLM Falcon example code is located in [`examples/models/contrib/falcon`](./). There is one main file:
+The TensorRT LLM Falcon implementation can be found in [tensorrt_llm/models/falcon/model.py](../../tensorrt_llm/models/falcon/model.py). The TensorRT LLM Falcon example code is located in [`examples/models/contrib/falcon`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -39,7 +39,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
## Usage
The next two sections describe how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers)
-format to the TensorRT-LLM format.
+format to the TensorRT LLM format.
### 1. Download weights from HuggingFace Transformers
@@ -72,8 +72,8 @@ git clone https://huggingface.co/tiiuae/falcon-180B falcon/180b
git clone https://huggingface.co/tiiuae/falcon-11B falcon/11b
```
-### 2. Convert weights from HF Transformers to TensorRT-LLM format
-The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF weights to TensorRT-LLM checkpoints. The number of checkpoint files (in .safetensors format) is same to the number of GPUs used to run inference.
+### 2. Convert weights from HF Transformers to TensorRT LLM format
+The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF weights to TensorRT LLM checkpoints. The number of checkpoint files (in .safetensors format) is the same as the number of GPUs used to run inference.
```bash
# falcon-rw-1b: single gpu, dtype float16
@@ -127,7 +127,7 @@ For example, you can't configure 2-way tensor parallelism for [falcon-7b](https:
### 3. Build TensorRT engine(s)
-The `trtllm-build` command builds TensorRT-LLM engines from TensorRT-LLM checkpoints. The number of engine files is also same to the number of GPUs used to run inference.
+The `trtllm-build` command builds TensorRT LLM engines from TensorRT LLM checkpoints. The number of engine files is also the same as the number of GPUs used to run inference.
Normally, the `trtllm-build` command only requires a single GPU, but you can enable parallel building by passing the number of GPUs to the `--workers` argument.
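For illustration, a parallel build over two GPUs might look like this (a sketch; the checkpoint and engine paths are placeholders, and `--workers` only controls how many engines are built concurrently):

```bash
# Sketch: build 2-way tensor-parallel engines using two build workers.
trtllm-build --checkpoint_dir ./falcon/40b/trt_ckpt/fp16/2-gpu/ \
             --output_dir ./falcon/40b/trt_engines/fp16/2-gpu/ \
             --gemm_plugin float16 \
             --workers 2
```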
@@ -239,8 +239,8 @@ python ../../../summarize.py --test_trt_llm \
If the engines are run successfully, you will see output like (falcon-rw-1b as the example):
```
......
-[12/27/2023-03:57:02] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.816917419433594 sec)
-[12/27/2023-03:57:02] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[12/27/2023-03:57:02] [TRT-LLM] [I] TensorRT LLM (total latency: 5.816917419433594 sec)
+[12/27/2023-03:57:02] [TRT-LLM] [I] TensorRT LLM beam 0 result
[12/27/2023-03:57:02] [TRT-LLM] [I] rouge1 : 15.061493342516243
[12/27/2023-03:57:02] [TRT-LLM] [I] rouge2 : 4.495335888974063
[12/27/2023-03:57:02] [TRT-LLM] [I] rougeL : 11.800002670828547
diff --git a/examples/models/contrib/gptj/README.md b/examples/models/contrib/gptj/README.md
index 35d6e1cc52c..177acd063cd 100644
--- a/examples/models/contrib/gptj/README.md
+++ b/examples/models/contrib/gptj/README.md
@@ -1,6 +1,6 @@
# GPT-J
-This document explains how to build the [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b) model using TensorRT-LLM and run on a single GPU.
+This document explains how to build the [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b) model using TensorRT LLM and run on a single GPU.
- [GPT-J](#gpt-j)
- [Overview](#overview)
@@ -18,10 +18,10 @@ This document explains how to build the [GPT-J](https://huggingface.co/EleutherA
## Overview
-The TensorRT-LLM GPT-J implementation can be found in [`tensorrt_llm/models/gptj/model.py`](../../tensorrt_llm/models/gptj/model.py). The TensorRT-LLM GPT-J example
+The TensorRT LLM GPT-J implementation can be found in [`tensorrt_llm/models/gptj/model.py`](../../tensorrt_llm/models/gptj/model.py). The TensorRT LLM GPT-J example
code is located in [`examples/models/contrib/gptj`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -61,7 +61,7 @@ wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/merges.txt
### 2. Build TensorRT engine(s)
-TensorRT-LLM builds TensorRT engine(s) using a HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build engine(s) using
+TensorRT LLM builds TensorRT engine(s) using an HF checkpoint. If no checkpoint directory is specified, TensorRT LLM will build engine(s) using
dummy weights.
Examples of build invocations:
@@ -76,7 +76,7 @@ python convert_checkpoint.py --model_dir ./gpt-j-6b \
***All kinds of checkpoints share the same `trtllm-build` command, for example:***
```bash
-# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
+# Enable several TensorRT LLM plugins to increase runtime performance. It also helps with build time.
trtllm-build --checkpoint_dir ./trt_ckpt/gptj_fp16_tp1/ \
--output_dir ./trt_engines/gptj_fp16_tp1/ \
--gemm_plugin float16 \
@@ -229,7 +229,7 @@ Building command is identical to the common one above.
### 3. Run
-To run a TensorRT-LLM GPT-J model:
+To run a TensorRT LLM GPT-J model:
```bash
python3 ../../../run.py --max_output_len=50 --engine_dir=gptj_engine --tokenizer_dir=gptj_model
@@ -237,7 +237,7 @@ python3 ../../../run.py --max_output_len=50 --engine_dir=gptj_engine --tokenizer
## Summarization using the GPT-J model
-The following section describes how to run a TensorRT-LLM GPT-J model to summarize the articles from the
+The following section describes how to run a TensorRT LLM GPT-J model to summarize the articles from the
[cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset. For each summary, the script can compute the
[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
The script can also perform the same summarization using the HF GPT-J model.
diff --git a/examples/models/contrib/gptneox/README.md b/examples/models/contrib/gptneox/README.md
index 5c0a7289947..a34c9bcd990 100644
--- a/examples/models/contrib/gptneox/README.md
+++ b/examples/models/contrib/gptneox/README.md
@@ -1,6 +1,6 @@
# GPT-NeoX
-This document explains how to build the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model using TensorRT-LLM and run on a single GPU and a single node with
+This document explains how to build the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model using TensorRT LLM and run on a single GPU and a single node with
multiple GPUs.
- [GPT-NeoX](#gpt-neox)
@@ -8,21 +8,21 @@ multiple GPUs.
- [Support Matrix](#support-matrix)
- [Usage](#usage)
- [1. Download weights from HuggingFace (HF) Transformers](#1-download-weights-from-huggingface-hf-transformers)
- - [2. Convert weights from HF Transformers to TensorRT-LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
+ - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
- [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
- [4. Summarization using the GPT-NeoX model](#4-summarization-using-the-gpt-neox-model)
- [Apply groupwise quantization GPTQ](#apply-groupwise-quantization-gptq)
- [1. Download weights from HuggingFace (HF)](#1-download-weights-from-huggingface-hf)
- [2. Generating quantized weights](#2-generating-quantized-weights)
- - [3. Convert weights from HF Transformers to TensorRT-LLM format](#3-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
+ - [3. Convert weights from HF Transformers to TensorRT LLM format](#3-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
- [4. Build TensorRT engine(s)](#4-build-tensorrt-engines)
- [5. Summarization using the GPT-NeoX model](#5-summarization-using-the-gpt-neox-model)
## Overview
-The TensorRT-LLM GPT-NeoX implementation can be found in [`tensorrt_llm/models/gptneox/model.py`](../../tensorrt_llm/models/gptneox/model.py). The TensorRT-LLM GPT-NeoX example code is located in [`examples/models/contrib/gptneox`](./). There is one main file:
+The TensorRT LLM GPT-NeoX implementation can be found in [`tensorrt_llm/models/gptneox/model.py`](../../tensorrt_llm/models/gptneox/model.py). The TensorRT LLM GPT-NeoX example code is located in [`examples/models/contrib/gptneox`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -37,7 +37,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
## Usage
-The TensorRT-LLM GPT-NeoX example code locates at [examples/models/contrib/gptneox](./). It takes HF weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM GPT-NeoX example code is located in [examples/models/contrib/gptneox](./). It takes HF weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### 1. Download weights from HuggingFace (HF) Transformers
@@ -52,7 +52,7 @@ pip install -r requirements.txt
git clone https://huggingface.co/EleutherAI/gpt-neox-20b gptneox_model
```
-### 2. Convert weights from HF Transformers to TensorRT-LLM format
+### 2. Convert weights from HF Transformers to TensorRT LLM format
If you want to use INT8 weight-only quantization, just add the `--use_weight_only` flag.
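As an illustration, the two conversions below differ only by that flag (a sketch; the output paths are placeholders, and `--dtype` follows the convention of the other convert scripts in this repo):

```bash
# FP16 conversion (single GPU).
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --output_dir ./gptneox/20B/trt_ckpt/fp16/1-gpu/

# Same conversion with INT8 weight-only quantization.
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --dtype float16 \
                              --use_weight_only \
                              --output_dir ./gptneox/20B/trt_ckpt/int8_wo/1-gpu/
```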
@@ -117,7 +117,7 @@ trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int8_wo/2-gpu/ \
### 4. Summarization using the GPT-NeoX model
-The following section describes how to run a TensorRT-LLM GPT-NeoX model to summarize the articles from the
+The following section describes how to run a TensorRT LLM GPT-NeoX model to summarize the articles from the
[cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset. For each summary, the script can compute the
[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
The script can also perform the same summarization using the HF GPT-NeoX model.
@@ -164,7 +164,7 @@ In this example, the weights are quantized using [GPTQ-for-LLaMa](https://github
sh gptq_convert.sh
```
-### 3. Convert weights from HF Transformers to TensorRT-LLM format
+### 3. Convert weights from HF Transformers to TensorRT LLM format
To apply groupwise GPTQ quantization, additional command-line flags need to be passed to `convert_checkpoint.py`:
Here the `--quant_ckpt_path` flag specifies the safetensors file produced by the `gptq_convert.sh` script.
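A sketch of such an invocation is shown below; the output path and quantized-weights filename are placeholders, and the weight-only flags are assumptions based on the converter's help text rather than this README.

```bash
# Hypothetical GPTQ conversion; check `python3 convert_checkpoint.py --help`
# for the exact flag names supported by your version.
python3 convert_checkpoint.py --model_dir ./gptneox_model \
                              --use_weight_only \
                              --weight_only_precision int4_gptq \
                              --quant_ckpt_path ./gptneox-20b-gptq.safetensors \
                              --output_dir ./gptneox/20B/trt_ckpt/int4_gptq/1-gpu/
```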
diff --git a/examples/models/contrib/grok/README.md b/examples/models/contrib/grok/README.md
index 0e6f228ffa7..8395a32f03f 100644
--- a/examples/models/contrib/grok/README.md
+++ b/examples/models/contrib/grok/README.md
@@ -1,6 +1,6 @@
# Grok-1
-This document shows how to build and run grok-1 model in TensorRT-LLM on both single GPU, single node multi-GPU and multi-node multi-GPU.
+This document shows how to build and run the Grok-1 model in TensorRT LLM on a single GPU, a single node with multiple GPUs, and multiple nodes with multiple GPUs.
- [Grok1](#Grok-1)
- [Prerequisite](#prerequisite)
@@ -22,9 +22,9 @@ The grok-1 model requires a node with 8x80GB GPU memory(at least).
## Overview
-The TensorRT-LLM Grok-1 implementation can be found in [tensorrt_llm/models/grok/model.py](../../../../tensorrt_llm/models/grok/model.py). The TensorRT-LLM Grok-1 example code is located in [`examples/models/contrib/grok`](./). There is one main file:
+The TensorRT LLM Grok-1 implementation can be found in [tensorrt_llm/models/grok/model.py](../../../../tensorrt_llm/models/grok/model.py). The TensorRT LLM Grok-1 example code is located in [`examples/models/contrib/grok`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the Grok-1 model into tensorrt-llm checkpoint format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the Grok-1 model into TensorRT LLM checkpoint format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -38,7 +38,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
## Usage
-The TensorRT-LLM Grok-1 example code locates at [examples/models/contrib/grok](./). It takes xai weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM Grok-1 example code is located in [examples/models/contrib/grok](./). It takes xAI weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### Build TensorRT engine(s)
@@ -50,7 +50,7 @@ pip install -r requirements.txt
Prepare the Grok-1 checkpoint by following the guide at https://github.com/xai-org/grok-1.
-TensorRT-LLM Grok-1 builds TensorRT engine(s) from Xai's checkpoints.
+TensorRT LLM Grok-1 builds TensorRT engine(s) from xAI's checkpoints.
Normally `trtllm-build` only requires a single GPU, but if you already have all the GPUs needed for inference, you can enable parallel building to make the engine building process faster by adding the `--workers` argument. Please note that currently the `workers` feature only supports a single node.
diff --git a/examples/models/contrib/hyperclovax/README.md b/examples/models/contrib/hyperclovax/README.md
index a870a178e63..7a634639d4a 100644
--- a/examples/models/contrib/hyperclovax/README.md
+++ b/examples/models/contrib/hyperclovax/README.md
@@ -84,7 +84,7 @@ The output will be like:
For more information, you can refer to [examples/llm-api](../../../llm-api).
## TRT flow
-The next section describes how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format. We will use llama's [convert_checkpoint.py](../../core/llama/convert_checkpoint.py) for the HyperCLOVAX model and then build the model with `trtllm-build`.
+The next section describes how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format. We will use llama's [convert_checkpoint.py](../../core/llama/convert_checkpoint.py) for the HyperCLOVAX model and then build the model with `trtllm-build`.
### Convert checkpoint and build TensorRT engine(s)
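A minimal end-to-end sketch is shown below, assuming `$MODEL_NAME` points at the downloaded HF checkpoint directory; the intermediate checkpoint path is a placeholder, and the engine path mirrors the `summarize.py` command shown below.

```bash
# Convert the HF checkpoint with llama's converter, then build the engine.
python3 ../../core/llama/convert_checkpoint.py --model_dir $MODEL_NAME \
                                               --dtype float16 \
                                               --output_dir trt_ckpt/$MODEL_NAME/fp16/1-gpu

trtllm-build --checkpoint_dir trt_ckpt/$MODEL_NAME/fp16/1-gpu \
             --output_dir trt_engines/$MODEL_NAME/fp16/1-gpu \
             --gemm_plugin float16
```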
@@ -248,5 +248,5 @@ python ../../../summarize.py \
--engine_dir trt_engines/$MODEL_NAME/fp16/1-gpu
```
-The TensorRT-LLM HyperCLOVAX implementation is based on the LLaMA model. The implementation can be found in [llama/model.py](../../../../tensorrt_llm/models/llama/model.py).
+The TensorRT LLM HyperCLOVAX implementation is based on the LLaMA model. The implementation can be found in [llama/model.py](../../../../tensorrt_llm/models/llama/model.py).
For more examples, see [`examples/models/core/llama/README.md`](../../core/llama/README.md)
diff --git a/examples/models/contrib/internlm/README.md b/examples/models/contrib/internlm/README.md
index b9a063caafa..1295d6626b9 100644
--- a/examples/models/contrib/internlm/README.md
+++ b/examples/models/contrib/internlm/README.md
@@ -1,6 +1,6 @@
# InternLM
-This document shows how to build and run InternLM 7B / 20B models in TensorRT-LLM on both single GPU, single node multi-GPU and multi-node multi-GPU.
+This document shows how to build and run the InternLM 7B / 20B models in TensorRT LLM on a single GPU, a single node with multiple GPUs, and multiple nodes with multiple GPUs.
- [InternLM](#internlm)
- [Overview](#overview)
@@ -14,12 +14,12 @@ This document shows how to build and run InternLM 7B / 20B models in TensorRT-LL
## Overview
-The TensorRT-LLM InternLM implementation is based on the LLaMA model. The implementation can
+The TensorRT LLM InternLM implementation is based on the LLaMA model. The implementation can
be found in [tensorrt_llm/models/llama/model.py](../../tensorrt_llm/models/llama/model.py).
-The TensorRT-LLM InternLM example code lies in [`examples/models/contrib/internlm`](./):
+The TensorRT LLM InternLM example code lies in [`examples/models/contrib/internlm`](./):
-* [`convert_checkpoint.py`](../../../llama/convert_checkpoint.py) converts the Huggingface Model of InternLM into TensorRT-LLM checkpoint.
-* [`convert_checkpoint.py`] to to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format
+* [`convert_checkpoint.py`](../../../llama/convert_checkpoint.py) to convert an InternLM checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM checkpoint format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -35,7 +35,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
## Usage
-The TensorRT-LLM InternLM example code locates at [examples/models/contrib/internlm](./). It takes HF weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM InternLM example code is located in [examples/models/contrib/internlm](./). It takes HF weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### Build TensorRT engine(s)
@@ -45,7 +45,7 @@ Please install required packages first:
pip install -r requirements.txt
```
-TensorRT-LLM InternLM builds TensorRT engine(s) from HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build engine(s) with dummy weights.
+TensorRT LLM InternLM builds TensorRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, TensorRT LLM will build engine(s) with dummy weights.
InternLM has released several checkpoints of different sizes and capabilities under https://huggingface.co/internlm. Users can pick any repository and follow its instructions to prepare the checkpoint.
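For example, a sketch of fetching the 7B base checkpoint (the repository name is assumed from the HuggingFace hub listing, not from this README):

```bash
# Clone the InternLM 7B weights from HuggingFace (requires git-lfs).
git lfs install
git clone https://huggingface.co/internlm/internlm-7b ./internlm-7b
```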
@@ -178,7 +178,7 @@ python ../../../summarize.py --test_trt_llm --test_hf \
#### SmoothQuant
-Unlike the FP16 build where the HF weights are processed and loaded into the TensorRT-LLM directly, the SmoothQuant needs to load INT8 weights which should be pre-processed before building an engine.
+Unlike the FP16 build, where the HF weights are processed and loaded into TensorRT LLM directly, SmoothQuant needs to load INT8 weights, which should be pre-processed before building an engine.
Example:
```bash
@@ -247,7 +247,7 @@ python ../../../summarize.py --test_trt_llm --test_hf \
### Run
-To run a TensorRT-LLM InternLM model using the engines generated by `trtllm-build`
+To run a TensorRT LLM InternLM model using the engines generated by `trtllm-build`
```bash
# InternLM 7B with fp16
diff --git a/examples/models/contrib/jais/README.md b/examples/models/contrib/jais/README.md
index 5c54d4631bd..1e01af8b9af 100644
--- a/examples/models/contrib/jais/README.md
+++ b/examples/models/contrib/jais/README.md
@@ -16,9 +16,9 @@ Currently it has been tested on
## Overview
-The TensorRT-LLM support for Jais is based on the GPT model, the implementation can be found in [tensorrt_llm/models/gpt/model.py](../../../../tensorrt_llm/models/gpt/model.py). Jais model resembles GPT very much except it uses alibi embedding, embedding scale, swiglu, and logits scale, we therefore reuse the [GPT example code](../../../gpt) for Jais,
+The TensorRT LLM support for Jais is based on the GPT model; the implementation can be found in [tensorrt_llm/models/gpt/model.py](../../../../tensorrt_llm/models/gpt/model.py). The Jais model closely resembles GPT except that it uses ALiBi embeddings, an embedding scale, SwiGLU, and a logits scale, so we reuse the [GPT example code](../../../gpt) for Jais:
-* [`convert_checkpoint.py`](../../../gpt/convert_checkpoint.py) to convert the Jais model into tensorrt-llm checkpoint format.
+* [`convert_checkpoint.py`](../../../gpt/convert_checkpoint.py) to convert the Jais model into TensorRT LLM checkpoint format.
In addition, there are two shared files in the parent folder [`examples`](../) for inference and evaluation:
@@ -34,7 +34,7 @@ The tested configurations are:
## Usage
-This section gives a whole process where we convert HF models, build TensorRT-LLM engines and ultimately perform summarization.
+This section walks through the whole process: converting HF models, building TensorRT LLM engines, and finally performing summarization.
### Build TensorRT engine(s)
@@ -54,15 +54,15 @@ python3 ../../../gpt/convert_checkpoint.py --model_dir core42/jais-30b-chat-v3 \
```
```bash
-# Build a single-GPU float16 engine from TensorRT-LLM checkpoint for jais-13b-chat
-# Enable the special TensorRT-LLM GPT Attention plugin (--gpt_attention_plugin) to increase runtime performance.
+# Build a single-GPU float16 engine from TensorRT LLM checkpoint for jais-13b-chat
+# Enable the special TensorRT LLM GPT Attention plugin (--gpt_attention_plugin) to increase runtime performance.
# It is recommended to use --remove_input_padding along with --gpt_attention_plugin for better performance
trtllm-build --checkpoint_dir jais-13b-chat/trt_ckpt/fp16/1-gpu \
--gpt_attention_plugin float16 \
--remove_input_padding enable \
--output_dir jais-13b-chat/trt_engines/fp16/1-gpu
-# Build 2-way tensor parallelism engines from TensorRT-LLM checkpoint for jais-30b-chat-v3
+# Build 2-way tensor parallelism engines from TensorRT LLM checkpoint for jais-30b-chat-v3
trtllm-build --checkpoint_dir jais-30b-chat-v3/trt_ckpt/fp16/2-gpu \
--gpt_attention_plugin float16 \
--remove_input_padding enable \
diff --git a/examples/models/contrib/mmdit/README.md b/examples/models/contrib/mmdit/README.md
index 2b34c216cc9..603a0553c94 100644
--- a/examples/models/contrib/mmdit/README.md
+++ b/examples/models/contrib/mmdit/README.md
@@ -3,9 +3,9 @@ This document shows how to build and run a [MMDiT](https://github.com/huggingfac
## Overview
-The TensorRT-LLM implementation of MMDiT can be found in [tensorrt_llm/models/sd3/model.py](../../../../tensorrt_llm/models/mmdit_sd3/model.py). The TensorRT-LLM MMDiT (SD 3/3.5) example code is located in [`examples/models/contrib/mmdit`](./). There are main files to build and run MMDiT with TensorRT-LLM:
+The TensorRT LLM implementation of MMDiT can be found in [tensorrt_llm/models/sd3/model.py](../../../../tensorrt_llm/models/mmdit_sd3/model.py). The TensorRT LLM MMDiT (SD 3/3.5) example code is located in [`examples/models/contrib/mmdit`](./). These are the main files for building and running MMDiT with TensorRT LLM:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the MMDiT model into tensorrt-llm checkpoint format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the MMDiT model into TensorRT LLM checkpoint format.
* [`sample.py`](./sample.py) to run the [diffusers](https://huggingface.co/docs/diffusers/index) pipeline with TensorRT engine(s) to generate images.
## Support Matrix
@@ -16,11 +16,11 @@ The TensorRT-LLM implementation of MMDiT can be found in [tensorrt_llm/models/sd
## Usage
-The TensorRT-LLM MMDiT example code locates at [examples/models/contrib/mmdit](./). It takes HuggingFace checkpoint as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM MMDiT example code is located in [examples/models/contrib/mmdit](./). It takes a HuggingFace checkpoint as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### Build MMDiT TensorRT engine(s)
-This checkpoint will be converted to the TensorRT-LLM checkpoint format by [`convert_checkpoint.py`](./convert_checkpoint.py). After that, we can build TensorRT engine(s) with the TensorRT-LLM checkpoint.
+This checkpoint will be converted to the TensorRT LLM checkpoint format by [`convert_checkpoint.py`](./convert_checkpoint.py). After that, we can build TensorRT engine(s) with the TensorRT LLM checkpoint.
```
# Convert to TRT-LLM
@@ -33,7 +33,7 @@ trtllm-build --checkpoint_dir=./tllm_checkpoint/ \
Set `--max_batch_size` to specify the maximum number of images you would like to generate per batch. We disable `--remove_input_padding` since we do not need to pad MMDiT's patches.
-After build, we can find a `./engine_output` directory, it is ready for running MMDiT model with TensorRT-LLM now.
+After the build, we can find a `./engine_output` directory; it is ready for running the MMDiT model with TensorRT LLM.
### Generate images
diff --git a/examples/models/contrib/mpt/README.md b/examples/models/contrib/mpt/README.md
index 8223fc7acc0..2ee34d6c436 100644
--- a/examples/models/contrib/mpt/README.md
+++ b/examples/models/contrib/mpt/README.md
@@ -1,6 +1,6 @@
# MPT
-This document explains how to build the [MPT](https://huggingface.co/mosaicml/mpt-7b) model using TensorRT-LLM and run on a single GPU and a single node with multiple GPUs.
+This document explains how to build the [MPT](https://huggingface.co/mosaicml/mpt-7b) model using TensorRT LLM and run on a single GPU and a single node with multiple GPUs.
- [MPT](#mpt)
- [Overview](#overview)
@@ -22,9 +22,9 @@ This document explains how to build the [MPT](https://huggingface.co/mosaicml/mp
## Overview
-The TensorRT-LLM MPT implementation can be found in [`tensorrt_llm/models/mpt/model.py`](../../tensorrt_llm/models/mpt/model.py). The TensorRT-LLM MPT example code is located in [`examples/models/contrib/mpt`](./). There is one main file:
+The TensorRT LLM MPT implementation can be found in [`tensorrt_llm/models/mpt/model.py`](../../tensorrt_llm/models/mpt/model.py). The TensorRT LLM MPT example code is located in [`examples/models/contrib/mpt`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
diff --git a/examples/models/contrib/opt/README.md b/examples/models/contrib/opt/README.md
index c2cb288ff46..e1d08142515 100644
--- a/examples/models/contrib/opt/README.md
+++ b/examples/models/contrib/opt/README.md
@@ -1,6 +1,6 @@
# OPT
-This document explains how to build the [OPT](https://huggingface.co/docs/transformers/model_doc/opt) model using TensorRT-LLM and run on a single GPU, a single node with
+This document explains how to build the [OPT](https://huggingface.co/docs/transformers/model_doc/opt) model using TensorRT LLM and run on a single GPU, a single node with
multiple GPUs or multiple nodes with multiple GPUs.
- [OPT](#opt)
@@ -8,7 +8,7 @@ multiple GPUs or multiple nodes with multiple GPUs.
- [Support Matrix](#support-matrix)
- [Usage](#usage)
- [1. Download weights from HuggingFace Transformers](#1-download-weights-from-huggingface-transformers)
- - [2. Convert weights from HF Transformers to TensorRT-LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
+ - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
- [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
- [4. Summarization using the OPT model](#4-summarization-using-the-opt-model)
- [Fused MultiHead Attention (FMHA)](#fused-multihead-attention-fmha)
@@ -18,9 +18,9 @@ multiple GPUs or multiple nodes with multiple GPUs.
## Overview
-The TensorRT-LLM OPT implementation can be found in [`tensorrt_llm/models/opt/model.py`](../../tensorrt_llm/models/opt/model.py). The TensorRT-LLM OPT example code is located in [`examples/models/contrib/opt`](./). There is one file:
+The TensorRT LLM OPT implementation can be found in [`tensorrt_llm/models/opt/model.py`](../../tensorrt_llm/models/opt/model.py). The TensorRT LLM OPT example code is located in [`examples/models/contrib/opt`](./). There is one file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format
In addition, there are two shared files in the parent folder [`examples`](../) for inference and evaluation:
@@ -35,7 +35,7 @@ In addition, there are two shared files in the parent folder [`examples`](../) f
## Usage
The next two sections describe how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers)
-format to the TensorRT-LLM format.
+format to the TensorRT LLM format.
### 1. Download weights from HuggingFace Transformers
@@ -61,7 +61,7 @@ git-lfs clone https://huggingface.co/facebook/opt-2.7b
git-lfs clone https://huggingface.co/facebook/opt-66b
```
-### 2. Convert weights from HF Transformers to TensorRT-LLM format
+### 2. Convert weights from HF Transformers to TensorRT LLM format
```bash
# OPT-125M
@@ -126,7 +126,7 @@ trtllm-build --checkpoint_dir ./opt/66B/trt_ckpt/fp16/4-gpu/ \
### 4. Summarization using the OPT model
-The following section describes how to run a TensorRT-LLM OPT model to summarize the articles from the
+The following section describes how to run a TensorRT LLM OPT model to summarize the articles from the
[cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset. For each summary, the script can compute the
[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
The script can also perform the same summarization using the HF OPT model.
diff --git a/examples/models/contrib/skywork/README.md b/examples/models/contrib/skywork/README.md
index ff3f7032ef2..acadf88ac6f 100644
--- a/examples/models/contrib/skywork/README.md
+++ b/examples/models/contrib/skywork/README.md
@@ -3,11 +3,11 @@
This document explains how to build the [Skywork](https://huggingface.co/Skywork/) model into runnable engines on a single GPU node and perform a summarization task using these engines.
## Overview
-The TensorRT-LLM Skywork implementation is based on the LLaMA model. The implementation can
+The TensorRT LLM Skywork implementation is based on the LLaMA model. The implementation can
be found in [tensorrt_llm/models/llama/model.py](../../../../tensorrt_llm/models/llama/model.py).
-The TensorRT-LLM Skywork example code lies in [`examples/models/contrib/skywork`](./):
+The TensorRT LLM Skywork example code lies in [`examples/models/contrib/skywork`](./):
-* [`convert_checkpoint.py`](../llama/convert_checkpoint.py) converts the Huggingface Model of Skywork into TensorRT-LLM checkpoint.
+* [`convert_checkpoint.py`](../llama/convert_checkpoint.py) converts the Huggingface Model of Skywork into TensorRT LLM checkpoint.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -19,7 +19,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
## Usage
-This section gives a whole process where we convert HF models, build TensorRT-LLM engines and ultimately perform summarization.
+This section walks through the whole process: converting HF models, building TensorRT LLM engines, and finally performing summarization.
### 1. Clone Code and Weights from Huggingface
@@ -78,7 +78,7 @@ trtllm-build --checkpoint_dir ./skywork-13b-base/trt_ckpt/bf16 \
### 4. Summarization using the Engines
-After building TRT engines, we can use them to perform various tasks. TensorRT-LLM provides handy code to run summarization on [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset and get [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores. The `ROUGE-1` score can be used to validate model implementations.
+After building TRT engines, we can use them to perform various tasks. TensorRT LLM provides handy code to run summarization on [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset and get [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores. The `ROUGE-1` score can be used to validate model implementations.
```bash
# fp16
diff --git a/examples/models/contrib/smaug/README.md b/examples/models/contrib/smaug/README.md
index 736151e8cf5..12673f12a90 100644
--- a/examples/models/contrib/smaug/README.md
+++ b/examples/models/contrib/smaug/README.md
@@ -4,9 +4,9 @@ This document elaborates how to build the [Smaug-72B-v0.1](https://huggingface.c
## Overview
-The TensorRT-LLM support for Smaug-72B-v0.1 is based on the LLaMA model, the implementation can be found in [tensorrt_llm/models/llama/model.py](../../../../tensorrt_llm/models/llama/model.py). Smaug model resembles LLaMA very much except it uses bias term in its attention module, we therefore reuse the [LLaMA example code](../../../llama) for Smaug,
+The TensorRT LLM support for Smaug-72B-v0.1 is based on the LLaMA model; the implementation can be found in [tensorrt_llm/models/llama/model.py](../../../../tensorrt_llm/models/llama/model.py). The Smaug model closely resembles LLaMA except that it uses a bias term in its attention module, so we reuse the [LLaMA example code](../../../llama) for Smaug:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the LLaMA model into tensorrt-llm checkpoint format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the LLaMA model into TensorRT LLM checkpoint format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -19,7 +19,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
## Usage
-This section gives a whole process where we convert HF models, build TensorRT-LLM engines and ultimately perform summarization.
+This section walks through the whole process: converting HF models, building TensorRT LLM engines, and finally performing summarization.
### Build TensorRT engine(s)
@@ -43,7 +43,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp8 \
### Run Summarization
-After building TRT engine, we can use it to perform various tasks. TensorRT-LLM provides handy code to run summarization on [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset and get [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores. The `ROUGE-1` score can be used to validate model implementations.
+After building TRT engine, we can use it to perform various tasks. TensorRT LLM provides handy code to run summarization on [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset and get [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores. The `ROUGE-1` score can be used to validate model implementations.
```bash
mpirun -n 8 -allow-run-as-root python ../../../summarize.py \
diff --git a/examples/models/contrib/stdit/README.md b/examples/models/contrib/stdit/README.md
index 5b918464b2b..0bfd0160f06 100644
--- a/examples/models/contrib/stdit/README.md
+++ b/examples/models/contrib/stdit/README.md
@@ -3,9 +3,9 @@ This document shows how to build and run a STDiT in [OpenSoRA](https://github.co
## Overview
-The TensorRT-LLM implementation of STDiT can be found in [tensorrt_llm/models/stdit/model.py](../../../../tensorrt_llm/models/stdit/model.py). The TensorRT-LLM STDiT (OpenSoRA) example code is located in [`examples/models/contrib/stdit`](./). There are main files to build and run STDiT with TensorRT-LLM:
+The TensorRT LLM implementation of STDiT can be found in [tensorrt_llm/models/stdit/model.py](../../../../tensorrt_llm/models/stdit/model.py). The TensorRT LLM STDiT (OpenSoRA) example code is located in [`examples/models/contrib/stdit`](./). These are the main files for building and running STDiT with TensorRT LLM:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the STDiT model into tensorrt-llm checkpoint format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the STDiT model into TensorRT LLM checkpoint format.
* [`sample.py`](./sample.py) to run the pipeline with TensorRT engine(s) to generate videos.
## Support Matrix
@@ -16,7 +16,7 @@ The TensorRT-LLM implementation of STDiT can be found in [tensorrt_llm/models/st
## Usage
-The TensorRT-LLM STDiT example code locates at [examples/models/contrib/stdit](./). It takes HuggingFace checkpoint as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM STDiT example code is located in [examples/models/contrib/stdit](./). It takes a HuggingFace checkpoint as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### Requirements
@@ -30,7 +30,7 @@ pip install colossalai --no-deps
### Build STDiT TensorRT engine(s)
-This checkpoint will be converted to the TensorRT-LLM checkpoint format by [`convert_checkpoint.py`](./convert_checkpoint.py). After that, we can build TensorRT engine(s) with the TensorRT-LLM checkpoint. The pretrained checkpoint can be downloaded from [here](https://huggingface.co/hpcai-tech/OpenSora-STDiT-v3).
+This checkpoint will be converted to the TensorRT LLM checkpoint format by [`convert_checkpoint.py`](./convert_checkpoint.py). After that, we can build TensorRT engine(s) with the TensorRT LLM checkpoint. The pretrained checkpoint can be downloaded from [here](https://huggingface.co/hpcai-tech/OpenSora-STDiT-v3).
```bash
# Convert to TRT-LLM
@@ -46,7 +46,7 @@ trtllm-build --checkpoint_dir=tllm_checkpoint/ \
--context_fmha=enable
```
-After build, we can find a `./engine_output` directory, it is ready for running STDiT model with TensorRT-LLM now.
+After the build, we can find a `./engine_output` directory; it is ready for running the STDiT model with TensorRT LLM.
### Generate videos
diff --git a/examples/models/core/bert/README.md b/examples/models/core/bert/README.md
index 8c5b1a366f7..da1826edff5 100644
--- a/examples/models/core/bert/README.md
+++ b/examples/models/core/bert/README.md
@@ -4,10 +4,10 @@ This document explains how to build the BERT family, specifically [BERT](https:/
## Overview
-The TensorRT-LLM BERT family implementation can be found in [`tensorrt_llm/models/bert/model.py`](../../../../tensorrt_llm/models/bert/model.py).
-The TensorRT-LLM BERT family example code is located in [`examples/models/core/bert`](./). There are two main files in that folder:
+The TensorRT LLM BERT family implementation can be found in [`tensorrt_llm/models/bert/model.py`](../../../../tensorrt_llm/models/bert/model.py).
+The TensorRT LLM BERT family example code is located in [`examples/models/core/bert`](./). There are two main files in that folder:
- * [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the BERT model into tensorrt-llm checkpoint format.
+ * [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the BERT model into TensorRT LLM checkpoint format.
* [`run.py`](./run.py) to run the inference on an input text,
## Convert Weights
@@ -47,14 +47,14 @@ python convert_checkpoint.py \
## Build TensorRT engine(s)
-TensorRT-LLM converts HuggingFace BERT family models into TensorRT engine(s).
+TensorRT LLM converts HuggingFace BERT family models into TensorRT engine(s).
To build the TensorRT engine, the basic command is:
```bash
trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint \
--output_dir ${model_name}_engine_outputs \
```
-Beside the basic engine build, TensorRT-LLM provides these features by adding these flags with basic build command:
+Besides the basic engine build, TensorRT LLM provides the following features, which can be enabled by adding these flags to the basic build command:
- To use `bert_attention_plugin`, add `--bert_attention_plugin` to command.
@@ -75,7 +75,7 @@ trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint \
```
## Run TensorRT engine(s)
-Run a TensorRT-LLM BERT model using the engines generated by build command mentioned above.
+Run a TensorRT LLM BERT model using the engines generated by the build command mentioned above.
Note that during model deployment, only the TensorRT engine files are needed. Previously downloaded model checkpoints and converted weights can be removed.
[`run.py`](./run.py) provides an example for performing the inference and decoding the output. By default, it will use the task specific datasets as input text, for example, ['squad_v2'](https://huggingface.co/datasets/rajpurkar/squad_v2) for BertForQuestionAnswering.
diff --git a/examples/models/core/commandr/README.md b/examples/models/core/commandr/README.md
index 3bffe933cce..88788e65eb6 100644
--- a/examples/models/core/commandr/README.md
+++ b/examples/models/core/commandr/README.md
@@ -1,13 +1,13 @@
# Command R
-This document explains how to build the [C4AI Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01), [C4AI Command R+](https://huggingface.co/CohereForAI/c4ai-command-r-plus), [Aya-23-8B](https://huggingface.co/CohereForAI/aya-23-8B), [Aya-23-35B](https://huggingface.co/CohereForAI/aya-23-35B) models using TensorRT-LLM and run on a single GPU or a single node with multiple GPUs.
+This document explains how to build the [C4AI Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01), [C4AI Command R+](https://huggingface.co/CohereForAI/c4ai-command-r-plus), [Aya-23-8B](https://huggingface.co/CohereForAI/aya-23-8B), [Aya-23-35B](https://huggingface.co/CohereForAI/aya-23-35B) models using TensorRT LLM and run on a single GPU or a single node with multiple GPUs.
- [Command R](#Command-R)
- [Overview](#overview)
- [Support Matrix](#support-matrix)
- [Usage](#usage)
- [1. Download repo and weights from HuggingFace Transformers](#1-download-repo-and-weights-from-huggingface-transformers)
- - [2. Convert weights from HF Transformers to TensorRT-LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
+ - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
- [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
- [4. Run inference](#4-run-inference)
- [Single node, single GPU](#single-node-single-gpu)
@@ -18,10 +18,10 @@ This document explains how to build the [C4AI Command-R](https://huggingface.co/
## Overview
-The TensorRT-LLM Command-R implementation can be found in [`tensorrt_llm/models/commandr/model.py`](../../../../tensorrt_llm/models/commandr/model.py).
-The TensorRT-LLM Command-R example code is located in [`examples/models/core/commandr`](./). There is one main file:
+The TensorRT LLM Command-R implementation can be found in [`tensorrt_llm/models/commandr/model.py`](../../../../tensorrt_llm/models/commandr/model.py).
+The TensorRT LLM Command-R example code is located in [`examples/models/core/commandr`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -52,9 +52,9 @@ git clone https://huggingface.co/CohereForAI/aya-23-8B aya_23_8
git clone https://huggingface.co/CohereForAI/aya-23-35B aya_23_35B
```
-### 2. Convert weights from HF Transformers to TensorRT-LLM format
+### 2. Convert weights from HF Transformers to TensorRT LLM format
-The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF weights to TensorRT-LLM checkpoints. The number of checkpoint files (in .safetensors format) is same to the number of GPUs used to run inference.
+The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF weights to TensorRT LLM checkpoints. The number of checkpoint files (in .safetensors format) is the same as the number of GPUs used to run inference.
```bash
# Command-R: single gpu, dtype float16
@@ -72,7 +72,7 @@ python3 convert_checkpoint.py --model_dir aya_23_35B --output_dir trt_ckpt/aya_2
### 3. Build TensorRT engine(s)
-The `trtllm-build` command builds TensorRT-LLM engines from TensorRT-LLM checkpoints. The number of engine files is also same to the number of GPUs used to run inference.
+The `trtllm-build` command builds TensorRT LLM engines from TensorRT LLM checkpoints. The number of engine files is also the same as the number of GPUs used to run inference.
Normally, the `trtllm-build` command only requires a single GPU, but you can enable parallel building by passing the number of GPUs to the `--workers` argument.
@@ -174,10 +174,10 @@ If the engines are run successfully, you will see output like (Command-R as the
```txt
......
-[01/26/2024-02:51:56] [TRT-LLM] [I] TensorRT-LLM (total latency: 81.05689692497253 sec)
-[01/26/2024-02:51:56] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2000)
-[01/26/2024-02:51:56] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 24.67402621952367)
-[01/26/2024-02:51:56] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[01/26/2024-02:51:56] [TRT-LLM] [I] TensorRT LLM (total latency: 81.05689692497253 sec)
+[01/26/2024-02:51:56] [TRT-LLM] [I] TensorRT LLM (total output tokens: 2000)
+[01/26/2024-02:51:56] [TRT-LLM] [I] TensorRT LLM (tokens per second: 24.67402621952367)
+[01/26/2024-02:51:56] [TRT-LLM] [I] TensorRT LLM beam 0 result
[01/26/2024-02:51:56] [TRT-LLM] [I] rouge1 : 24.06804397902119
[01/26/2024-02:51:56] [TRT-LLM] [I] rouge2 : 6.456513335555016
[01/26/2024-02:51:56] [TRT-LLM] [I] rougeL : 16.77644999660741
diff --git a/examples/models/core/deepseek_v3/README.md b/examples/models/core/deepseek_v3/README.md
index 2efe14b986d..3fb4a22372c 100644
--- a/examples/models/core/deepseek_v3/README.md
+++ b/examples/models/core/deepseek_v3/README.md
@@ -1,11 +1,11 @@
# DeepSeek‑V3 and DeepSeek-R1
-This guide walks you through the examples to run the DeepSeek‑V3/DeepSeek-R1 models using NVIDIA's TensorRT-LLM framework with the PyTorch backend.
+This guide walks you through the examples to run the DeepSeek‑V3/DeepSeek-R1 models using NVIDIA's TensorRT LLM framework with the PyTorch backend.
**DeepSeek-R1 and DeepSeek-V3 share the exact same model architecture (only the weights differ) and the same code path in TensorRT-LLM. For brevity we provide only one model example; the example commands can be used interchangeably by simply replacing the model name with the other one.**
To benchmark the model with best configurations, refer to [DeepSeek R1 benchmarking blog](../../../../docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md).
-Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html) for how to build TensorRT-LLM from source and start a TRT-LLM docker container.
+Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html) for how to build TensorRT LLM from source and start a TRT-LLM docker container.
> [!NOTE]
> This guide assumes that you replace placeholder values (e.g. ``) with the appropriate paths.
@@ -390,7 +390,7 @@ settings for your specific use case.
### Dynamo
NVIDIA Dynamo is a high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
-Dynamo supports TensorRT-LLM as one of its inference engine. For details on how to use TensorRT-LLM with Dynamo please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md)
+Dynamo supports TensorRT LLM as one of its inference engines. For details on how to use TensorRT LLM with Dynamo, please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md).
### tensorrtllm_backend for triton inference server (Prototype)
To serve the model using [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend.git), make sure the version is v0.19+ in which the pytorch path is added as a prototype feature.
@@ -414,7 +414,7 @@ Available parameters for the requests are listed in https://github.com/triton-in
## Advanced Usages
### Multi-node
-TensorRT-LLM supports multi-node inference. You can use mpirun or Slurm to launch multi-node jobs. We will use two nodes for this example.
+TensorRT LLM supports multi-node inference. You can use mpirun or Slurm to launch multi-node jobs. We will use two nodes for this example.
#### mpirun
mpirun requires each node to have passwordless SSH access to the other node. We need to set up the environment inside the Docker container. Run the container with host networking and mount the current directory as well as the model directory into the container.
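An illustrative container launch on each node is sketched below; `<trtllm_docker_image>` and the model path are placeholders for your own values, not values taken from this guide.

```bash
# Host networking plus mounts for the working directory and the model weights.
docker run --rm -it --gpus all --network host \
    -v $(pwd):/workspace \
    -v /path/to/DeepSeek-R1:/models/DeepSeek-R1 \
    <trtllm_docker_image> bash
```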
@@ -606,7 +606,7 @@ sbatch --nodes=2 --ntasks=8 --ntasks-per-node=4 benchmark.slurm
### DeepGEMM
-TensorRT-LLM uses DeepGEMM for DeepSeek-V3/R1, which provides significant e2e performance boost on Hopper GPUs. DeepGEMM can be disabled by setting the environment variable `TRTLLM_DG_ENABLED` to `0`:
+TensorRT LLM uses DeepGEMM for DeepSeek-V3/R1, which provides a significant end-to-end performance boost on Hopper GPUs. DeepGEMM can be disabled by setting the environment variable `TRTLLM_DG_ENABLED` to `0`:
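For example (a minimal sketch; export the variable in whichever shell launches the server or benchmark):

```bash
# Turn DeepGEMM off for subsequently launched TensorRT LLM processes.
export TRTLLM_DG_ENABLED=0
```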
DeepGEMM-related behavior can be controlled by the following environment variables:
@@ -677,7 +677,7 @@ mpirun -H :8,:8 \
```
### FlashMLA
-TensorRT-LLM has already integrated FlashMLA in the PyTorch backend. It is enabled automatically when running DeepSeek-V3/R1.
+TensorRT LLM has already integrated FlashMLA in the PyTorch backend. It is enabled automatically when running DeepSeek-V3/R1.
### FP8 KV Cache and MLA
@@ -693,7 +693,7 @@ You can enable FP8 MLA through either of these methods:
**Option 1: Checkpoint config**
-TensorRT-LLM automatically detects the `hf_quant_config.json` file in the model directory, which configures both GEMM and KV cache quantization. For example, see the FP4 DeepSeek-R1 checkpoint [configuration](https://huggingface.co/nvidia/DeepSeek-R1-FP4/blob/main/hf_quant_config.json) provided by [ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
+TensorRT LLM automatically detects the `hf_quant_config.json` file in the model directory, which configures both GEMM and KV cache quantization. For example, see the FP4 DeepSeek-R1 checkpoint [configuration](https://huggingface.co/nvidia/DeepSeek-R1-FP4/blob/main/hf_quant_config.json) provided by [ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
To enable FP8 MLA, modify the `kv_cache_quant_algo` property. The following shows the config for DeepSeek's block-wise FP8 GEMM quantization + FP8 MLA:
@@ -717,7 +717,7 @@ kv_cache_dtype: fp8
### W4AFP8
-TensorRT-LLM supports W(INT)4-A(FP)8 for DeepSeek on __Hopper__. Activations and weights are quantized at per-tensor and per-group (1x128) granularity respectively for MoE, and FP8 block scaling is preserved for dense layers.
+TensorRT LLM supports W(INT)4-A(FP)8 for DeepSeek on __Hopper__. Activations and weights are quantized at per-tensor and per-group (1x128) granularity respectively for MoE, and FP8 block scaling is preserved for dense layers.
We provide a pre-quantized checkpoint for DeepSeek-R1 W4AFP8 at [HF model hub](https://huggingface.co/Barrrrry/DeepSeek-R1-W4AFP8).
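For instance, the checkpoint can be fetched with git-lfs (a sketch; the clone is large, so ensure sufficient disk space):

```bash
# Download the pre-quantized DeepSeek-R1 W4AFP8 checkpoint from the HF model hub.
git lfs install
git clone https://huggingface.co/Barrrrry/DeepSeek-R1-W4AFP8
```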
diff --git a/examples/models/core/enc_dec/README.md b/examples/models/core/enc_dec/README.md
index aa6d94abb09..878aa9ec607 100644
--- a/examples/models/core/enc_dec/README.md
+++ b/examples/models/core/enc_dec/README.md
@@ -1,6 +1,6 @@
# Encoder-Decoder
-This document shows how to build and run an Encoder-Decoder (Enc-Dec) model in TensorRT-LLM on NVIDIA GPUs.
+This document shows how to build and run an Encoder-Decoder (Enc-Dec) model in TensorRT LLM on NVIDIA GPUs.
## Table of Contents
@@ -27,7 +27,7 @@ This document shows how to build and run an Encoder-Decoder (Enc-Dec) model in T
## Overview
-The TensorRT-LLM Enc-Dec implementation can be found in [tensorrt_llm/models/enc_dec/model.py](../../../../tensorrt_llm/models/enc_dec/model.py). The TensorRT-LLM Enc-Dec example code is located in [`examples/models/core/enc_dec`](./):
+The TensorRT LLM Enc-Dec implementation can be found in [tensorrt_llm/models/enc_dec/model.py](../../../../tensorrt_llm/models/enc_dec/model.py). The TensorRT LLM Enc-Dec example code is located in [`examples/models/core/enc_dec`](./):
* `trtllm-build` to build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run the Enc-Dec model,
* [`run.py`](./run.py) to run the inference on an example input text.
@@ -35,7 +35,7 @@ The TensorRT-LLM Enc-Dec implementation can be found in [tensorrt_llm/models/enc
* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert weights from HuggingFace or FairSeq format to TRT-LLM format, and split weights for multi-GPU inference,
## Usage
-The TensorRT-LLM Enc-Dec example code locates at [examples/models/core/enc_dec](./). It takes HuggingFace or FairSeq model name as input, and builds the corresponding TensorRT engines. On each GPU, there will be two TensorRT engines, one for Encoder and one for Decoder.
+The TensorRT LLM Enc-Dec example code is located in [examples/models/core/enc_dec](./). It takes a HuggingFace or FairSeq model name as input and builds the corresponding TensorRT engines. On each GPU, there will be two TensorRT engines, one for the Encoder and one for the Decoder.
## Encoder-Decoder Model Support
@@ -68,7 +68,7 @@ The `convert_checkpoint.py` script converts weights from HuggingFace or FairSeq
The HuggingFace or Fairseq checkpoints of the enc-dec models mentioned in this Readme are all float32 precision. Use `--dtype` to set the target inference precision during the weight conversion.
-After weight conversion, TensorRT-LLM converted weights and model configuration will be saved under `/` directory, which is the `--checkpoint_dir` input path you should give to the **next** engine building phase.
+After weight conversion, TensorRT LLM converted weights and model configuration will be saved under `/` directory, which is the `--checkpoint_dir` input path you should give to the **next** engine building phase.
Take T5 for example:
@@ -91,7 +91,7 @@ python convert_checkpoint.py --model_type ${MODEL_TYPE} \
### Build TensorRT engine(s)
-TensorRT-LLM builds TensorRT engine(s) with flexible controls on different types of optimizations. Note that these are just examples to demonstrate multi-GPU inference. For small models like T5-small, single GPU is usually sufficient.
+TensorRT LLM builds TensorRT engine(s) with flexible controls on different types of optimizations. Note that these are just examples to demonstrate multi-GPU inference. For small models like T5-small, single GPU is usually sufficient.
After engine building, the TensorRT engines will be saved under the `/` directory, which is the `--engine_dir` path you should give to the next engine running phase. It is recommended to have `/` in the output path, where `Y` is the total number of GPU ranks in a multi-node, multi-GPU setup, because the same `Y` GPUs could be executed with different TP (Tensor Parallelism) and PP (Pipeline Parallelism) combinations.
@@ -194,7 +194,7 @@ trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/${INFERENCE_PRECISION
### Run
-Run a TensorRT-LLM Enc-Dec model using the engines generated by build.py.
+Run a TensorRT LLM Enc-Dec model using the engines generated by `trtllm-build`.
Note that during model deployment, only the TensorRT engine files are needed. Previously downloaded model checkpoints and converted weights can be removed.
Different types of runtime are provided for encoder-decoder models. In order of serving performance and usability, we recommend:
diff --git a/examples/models/core/exaone/README.md b/examples/models/core/exaone/README.md
index 51c17e14c02..a2989c835db 100644
--- a/examples/models/core/exaone/README.md
+++ b/examples/models/core/exaone/README.md
@@ -2,7 +2,7 @@
This document shows how to build and run an [EXAONE](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct) model in TensorRT-LLM.
-The TensorRT-LLM EXAONE implementation is based on the LLaMA model. The implementation can be found in [llama/model.py](../../../../tensorrt_llm/models/llama/model.py).
+The TensorRT LLM EXAONE implementation is based on the LLaMA model. The implementation can be found in [llama/model.py](../../../../tensorrt_llm/models/llama/model.py).
See the LLaMA example [`examples/models/core/llama`](../llama) for details.
- [EXAONE](#exaone)
@@ -114,7 +114,7 @@ For models with sliding window attention, DynamicCache is less memory-efficient
### TRT flow
-The next section describe how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format. We will use llama's [convert_checkpoint.py](../llama/convert_checkpoint.py) for EXAONE model and then we build the model with `trtllm-build`.
+The next section describes how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format. We will use llama's [convert_checkpoint.py](../llama/convert_checkpoint.py) for the EXAONE model and then build the model with `trtllm-build`.
### Convert checkpoint and build TensorRT engine(s)
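As a hedged sketch of that flow (the local paths and flag values below are illustrative assumptions, not taken from this README):

```bash
# Assumed local paths; EXAONE reuses LLaMA's converter as described above.
python ../llama/convert_checkpoint.py --model_dir ./EXAONE-3.0-7.8B-Instruct \
        --output_dir ./exaone/trt_ckpt/fp16/1-gpu \
        --dtype float16

trtllm-build --checkpoint_dir ./exaone/trt_ckpt/fp16/1-gpu \
        --gemm_plugin auto \
        --output_dir ./exaone/trt_engines/fp16/1-gpu
```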
diff --git a/examples/models/core/gemma/README.md b/examples/models/core/gemma/README.md
index b84ad90ffc9..ac6cfaea4b5 100644
--- a/examples/models/core/gemma/README.md
+++ b/examples/models/core/gemma/README.md
@@ -51,7 +51,7 @@ Please install required packages first:
pip install -r requirements.txt
```
-Users can use `convert_checkpoint.py` to convert the different source checkpoint to unified TensorRT-LLM checkpoint format. Users could set `--dtype` to determine the inference data type, and set the quantization options like `--enable_fp8`, `--fp8_kv_cache` `--use_smooth_quant`, `--calibrate_kv_cache` (for INT8 kv cache) and `--use-weight-only-with-precision` (weight only). Users could also control the source checkpoint type by `--ckpt-type`. Currently, supported checkpoint types are `jax`, `torch` and `keras`.
+Users can use `convert_checkpoint.py` to convert the different source checkpoints to the unified TensorRT LLM checkpoint format. Users can set `--dtype` to determine the inference data type, and set quantization options like `--enable_fp8`, `--fp8_kv_cache`, `--use_smooth_quant`, `--calibrate_kv_cache` (for INT8 KV cache) and `--use-weight-only-with-precision` (weight only). Users can also control the source checkpoint type with `--ckpt-type`. Currently, the supported checkpoint types are `jax`, `torch` and `keras`.
```bash
CKPT_PATH=/tmp/models/gemma_nv/checkpoints/tmp_2b_it
@@ -67,7 +67,7 @@ python3 ./convert_checkpoint.py \
### Build engine
-After getting checkpoint, we can use `trtllm-build` command to build TensorRT-LLM engines from TensorRT-LLM checkpoints.
+After getting the checkpoint, we can use the `trtllm-build` command to build TensorRT LLM engines from TensorRT LLM checkpoints.
```bash
ENGINE_PATH=/tmp/gemma/2B/bf16/1-gpu/
@@ -97,7 +97,7 @@ python3 ../../../run.py --engine_dir ${ENGINE_PATH} \
--max_output_len 30 \
--vocab_file ${VOCAB_FILE_PATH}
-[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600Input [Text 0]: " Born in north-east France, Soyer trained as a"
+[TensorRT-LLM] TensorRT LLM version: 0.9.0.dev2024020600Input [Text 0]: " Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: "chef in the renowned kitchens of Lyon. After honing his skills in various Michelin-starred establishments, he embarked on a solo venture, establishing his own restaurant"
```
@@ -110,10 +110,10 @@ python3 ../../../summarize.py --test_trt_llm \
--max_ite 5 \
--vocab_file ${VOCAB_FILE_PATH}
-[02/06/2024-10:08:54] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.2821836471557617 sec)
-[02/06/2024-10:08:54] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1989)
-[02/06/2024-10:08:54] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 605.9989975648089)
-[02/06/2024-10:08:54] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[02/06/2024-10:08:54] [TRT-LLM] [I] TensorRT LLM (total latency: 3.2821836471557617 sec)
+[02/06/2024-10:08:54] [TRT-LLM] [I] TensorRT LLM (total output tokens: 1989)
+[02/06/2024-10:08:54] [TRT-LLM] [I] TensorRT LLM (tokens per second: 605.9989975648089)
+[02/06/2024-10:08:54] [TRT-LLM] [I] TensorRT LLM beam 0 result
[02/06/2024-10:08:55] [TRT-LLM] [I] rouge1 : 26.376388677070615
[02/06/2024-10:08:55] [TRT-LLM] [I] rouge2 : 7.468157586877296
[02/06/2024-10:08:55] [TRT-LLM] [I] rougeL : 17.953060795106556
@@ -178,10 +178,10 @@ python3 ../../../summarize.py --test_trt_llm \
--batch_size 8 \
--max_ite 5
-[03/05/2024-02:24:39] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.0897433757781982 sec)
-[03/05/2024-02:24:39] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2141)
-[03/05/2024-02:24:39] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 692.9378073221881)
-[03/05/2024-02:24:39] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[03/05/2024-02:24:39] [TRT-LLM] [I] TensorRT LLM (total latency: 3.0897433757781982 sec)
+[03/05/2024-02:24:39] [TRT-LLM] [I] TensorRT LLM (total output tokens: 2141)
+[03/05/2024-02:24:39] [TRT-LLM] [I] TensorRT LLM (tokens per second: 692.9378073221881)
+[03/05/2024-02:24:39] [TRT-LLM] [I] TensorRT LLM beam 0 result
[03/05/2024-02:24:39] [TRT-LLM] [I] rouge1 : 21.042873132085678
[03/05/2024-02:24:39] [TRT-LLM] [I] rouge2 : 6.322669223228836
[03/05/2024-02:24:39] [TRT-LLM] [I] rougeL : 16.450116567540338
@@ -226,10 +226,10 @@ python3 ../../../summarize.py --test_trt_llm \
--batch_size 8 \
--max_ite 5
-[02/08/2024-10:37:15] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.116227149963379 sec)
-[02/08/2024-10:37:15] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2419)
-[02/08/2024-10:37:15] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 776.259201781368)
-[02/08/2024-10:37:15] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[02/08/2024-10:37:15] [TRT-LLM] [I] TensorRT LLM (total latency: 3.116227149963379 sec)
+[02/08/2024-10:37:15] [TRT-LLM] [I] TensorRT LLM (total output tokens: 2419)
+[02/08/2024-10:37:15] [TRT-LLM] [I] TensorRT LLM (tokens per second: 776.259201781368)
+[02/08/2024-10:37:15] [TRT-LLM] [I] TensorRT LLM beam 0 result
[02/08/2024-10:37:15] [TRT-LLM] [I] rouge1 : 20.206082692133098
[02/08/2024-10:37:15] [TRT-LLM] [I] rouge2 : 5.902141189518428
[02/08/2024-10:37:15] [TRT-LLM] [I] rougeL : 15.403458457907643
@@ -274,10 +274,10 @@ python3 ../../../summarize.py --test_trt_llm \
--batch_size 8 \
--max_ite 5
-[02/08/2024-04:42:06] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.460859775543213 sec)
-[02/08/2024-04:42:06] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1786)
-[02/08/2024-04:42:06] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 516.0567361385428)
-[02/08/2024-04:42:06] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[02/08/2024-04:42:06] [TRT-LLM] [I] TensorRT LLM (total latency: 3.460859775543213 sec)
+[02/08/2024-04:42:06] [TRT-LLM] [I] TensorRT LLM (total output tokens: 1786)
+[02/08/2024-04:42:06] [TRT-LLM] [I] TensorRT LLM (tokens per second: 516.0567361385428)
+[02/08/2024-04:42:06] [TRT-LLM] [I] TensorRT LLM beam 0 result
[02/08/2024-04:42:06] [TRT-LLM] [I] rouge1 : 22.534044843245525
[02/08/2024-04:42:06] [TRT-LLM] [I] rouge2 : 5.940093176022924
[02/08/2024-04:42:06] [TRT-LLM] [I] rougeL : 16.258991712579736
@@ -319,10 +319,10 @@ python3 ../../../summarize.py --test_trt_llm \
--batch_size 8 \
--max_ite 5
-[02/08/2024-04:44:54] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.5987987518310547 sec)
-[02/08/2024-04:44:54] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1797)
-[02/08/2024-04:44:54] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 499.3332842203787)
-[02/08/2024-04:44:54] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[02/08/2024-04:44:54] [TRT-LLM] [I] TensorRT LLM (total latency: 3.5987987518310547 sec)
+[02/08/2024-04:44:54] [TRT-LLM] [I] TensorRT LLM (total output tokens: 1797)
+[02/08/2024-04:44:54] [TRT-LLM] [I] TensorRT LLM (tokens per second: 499.3332842203787)
+[02/08/2024-04:44:54] [TRT-LLM] [I] TensorRT LLM beam 0 result
[02/08/2024-04:44:54] [TRT-LLM] [I] rouge1 : 24.48521318679745
[02/08/2024-04:44:54] [TRT-LLM] [I] rouge2 : 7.240543314565931
[02/08/2024-04:44:54] [TRT-LLM] [I] rougeL : 17.857921729984078
@@ -360,10 +360,10 @@ python3 ../../../summarize.py --test_trt_llm \
--batch_size 8 \
--max_ite 5
-[02/08/2024-04:52:22] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.5348474979400635 sec)
-[02/08/2024-04:52:22] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1819)
-[02/08/2024-04:52:22] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 514.5907994786265)
-[02/08/2024-04:52:22] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[02/08/2024-04:52:22] [TRT-LLM] [I] TensorRT LLM (total latency: 3.5348474979400635 sec)
+[02/08/2024-04:52:22] [TRT-LLM] [I] TensorRT LLM (total output tokens: 1819)
+[02/08/2024-04:52:22] [TRT-LLM] [I] TensorRT LLM (tokens per second: 514.5907994786265)
+[02/08/2024-04:52:22] [TRT-LLM] [I] TensorRT LLM beam 0 result
[02/08/2024-04:52:22] [TRT-LLM] [I] rouge1 : 24.0397941580232
[02/08/2024-04:52:22] [TRT-LLM] [I] rouge2 : 7.325311340360227
[02/08/2024-04:52:22] [TRT-LLM] [I] rougeL : 17.54210044633271
@@ -447,10 +447,10 @@ python3 ../../../summarize.py --test_trt_llm \
--batch_size 8 \
--max_ite 5
-[02/08/2024-06:42:13] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.884302377700806 sec)
-[02/08/2024-06:42:13] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2694)
-[02/08/2024-06:42:13] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 457.8282737830064)
-[02/08/2024-06:42:13] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[02/08/2024-06:42:13] [TRT-LLM] [I] TensorRT LLM (total latency: 5.884302377700806 sec)
+[02/08/2024-06:42:13] [TRT-LLM] [I] TensorRT LLM (total output tokens: 2694)
+[02/08/2024-06:42:13] [TRT-LLM] [I] TensorRT LLM (tokens per second: 457.8282737830064)
+[02/08/2024-06:42:13] [TRT-LLM] [I] TensorRT LLM beam 0 result
[02/08/2024-06:42:13] [TRT-LLM] [I] rouge1 : 27.18633861010837
[02/08/2024-06:42:13] [TRT-LLM] [I] rouge2 : 7.734928823230158
[02/08/2024-06:42:13] [TRT-LLM] [I] rougeL : 19.32537431798716
@@ -488,10 +488,10 @@ python3 ../../../summarize.py --test_trt_llm \
--max_ite 5
[02/19/2024-10:02:53] [TRT-LLM] [I] ---------------------------------------------------------
-[02/19/2024-10:03:09] [TRT-LLM] [I] TensorRT-LLM (total latency: 13.65670919418335 sec)
-[02/19/2024-10:03:09] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 8351)
-[02/19/2024-10:03:09] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 611.494312521266)
-[02/19/2024-10:03:09] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[02/19/2024-10:03:09] [TRT-LLM] [I] TensorRT LLM (total latency: 13.65670919418335 sec)
+[02/19/2024-10:03:09] [TRT-LLM] [I] TensorRT LLM (total output tokens: 8351)
+[02/19/2024-10:03:09] [TRT-LLM] [I] TensorRT LLM (tokens per second: 611.494312521266)
+[02/19/2024-10:03:09] [TRT-LLM] [I] TensorRT LLM beam 0 result
[02/19/2024-10:03:09] [TRT-LLM] [I] rouge1 : 28.8107815115074
[02/19/2024-10:03:09] [TRT-LLM] [I] rouge2 : 8.623835512061866
[02/19/2024-10:03:09] [TRT-LLM] [I] rougeL : 19.7277195532959
@@ -537,10 +537,10 @@ python3 ../../../summarize.py --test_trt_llm \
--batch_size 8 \
--max_ite 5
-[02/08/2024-07:38:15] [TRT-LLM] [I] TensorRT-LLM (total latency: 8.49835753440857 sec)
-[02/08/2024-07:38:15] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2654)
-[02/08/2024-07:38:15] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 312.2956393931832)
-[02/08/2024-07:38:15] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[02/08/2024-07:38:15] [TRT-LLM] [I] TensorRT LLM (total latency: 8.49835753440857 sec)
+[02/08/2024-07:38:15] [TRT-LLM] [I] TensorRT LLM (total output tokens: 2654)
+[02/08/2024-07:38:15] [TRT-LLM] [I] TensorRT LLM (tokens per second: 312.2956393931832)
+[02/08/2024-07:38:15] [TRT-LLM] [I] TensorRT LLM beam 0 result
[02/08/2024-07:38:16] [TRT-LLM] [I] rouge1 : 20.396209981234687
[02/08/2024-07:38:16] [TRT-LLM] [I] rouge2 : 5.73302850102211
[02/08/2024-07:38:16] [TRT-LLM] [I] rougeL : 16.001683776127507
@@ -577,10 +577,10 @@ python3 ../../../summarize.py --test_trt_llm \
--batch_size 8 \
--max_ite 5
-[02/08/2024-07:51:11] [TRT-LLM] [I] TensorRT-LLM (total latency: 8.73880124092102 sec)
-[02/08/2024-07:51:11] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2771)
-[02/08/2024-07:51:11] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 317.09154649544956)
-[02/08/2024-07:51:11] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[02/08/2024-07:51:11] [TRT-LLM] [I] TensorRT LLM (total latency: 8.73880124092102 sec)
+[02/08/2024-07:51:11] [TRT-LLM] [I] TensorRT LLM (total output tokens: 2771)
+[02/08/2024-07:51:11] [TRT-LLM] [I] TensorRT LLM (tokens per second: 317.09154649544956)
+[02/08/2024-07:51:11] [TRT-LLM] [I] TensorRT LLM beam 0 result
[02/08/2024-07:51:11] [TRT-LLM] [I] rouge1 : 20.934864626327627
[02/08/2024-07:51:11] [TRT-LLM] [I] rouge2 : 4.954721611692932
[02/08/2024-07:51:11] [TRT-LLM] [I] rougeL : 15.307592049634444
@@ -669,10 +669,10 @@ python3 ../../../summarize.py --test_trt_llm \
...
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (512, 512, 512, 512, 512, 3100) * 4 + (512, 512)
...
-[04/09/2025-18:28:26] [TRT-LLM] [I] TensorRT-LLM (total latency: 1.6197962760925293 sec)
-[04/09/2025-18:28:26] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 475)
-[04/09/2025-18:28:26] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 293.2467539349165)
-[04/09/2025-18:28:26] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[04/09/2025-18:28:26] [TRT-LLM] [I] TensorRT LLM (total latency: 1.6197962760925293 sec)
+[04/09/2025-18:28:26] [TRT-LLM] [I] TensorRT LLM (total output tokens: 475)
+[04/09/2025-18:28:26] [TRT-LLM] [I] TensorRT LLM (tokens per second: 293.2467539349165)
+[04/09/2025-18:28:26] [TRT-LLM] [I] TensorRT LLM beam 0 result
[04/09/2025-18:28:26] [TRT-LLM] [I] rouge1: 22.780314381954003
[04/09/2025-18:28:26] [TRT-LLM] [I] rouge2: 4.331099231480823
[04/09/2025-18:28:26] [TRT-LLM] [I] rougeL: 15.26751867562475
@@ -768,7 +768,7 @@ curl http://localhost:8000/v1/completions \
#### Dynamo
NVIDIA Dynamo is a high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
-Dynamo supports TensorRT-LLM as one of its inference engine. For details on how to use TensorRT-LLM with Dynamo please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md)
+Dynamo supports TensorRT LLM as one of its inference engines. For details on how to use TensorRT LLM with Dynamo, please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md)
### Run Modelopt Quantization
diff --git a/examples/models/core/glm-4-9b/README.md b/examples/models/core/glm-4-9b/README.md
index 04988e59e82..9766c9124c5 100644
--- a/examples/models/core/glm-4-9b/README.md
+++ b/examples/models/core/glm-4-9b/README.md
@@ -1,6 +1,6 @@
# ChatGLM
-This document explains how to build the [glm-4-9b](https://huggingface.co/THUDM/glm-4-9b) models using TensorRT-LLM and run on a single GPU, a single node with multiple GPUs or multiple nodes with multiple GPUs.
+This document explains how to build the [glm-4-9b](https://huggingface.co/THUDM/glm-4-9b) models using TensorRT LLM and run them on a single GPU, a single node with multiple GPUs, or multiple nodes with multiple GPUs.
- [glm-4-9b](#glm-4-9b)
- [Overview](#overview)
@@ -9,7 +9,7 @@ This document explains how to build the [glm-4-9b](https://huggingface.co/THUDM/
- [Tokenizer and special tokens comparison](#tokenizer-and-special-tokens-comparison)
- [Usage](#usage)
- [1. Download repo and weights from HuggingFace Transformers](#1-download-repo-and-weights-from-huggingface-transformers)
- - [2. Convert weights from HF Transformers to TensorRT-LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
+ - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
- [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
- [Enable plugins](#enable-plugins)
- [In-flight batching](#in-flight-batching)
@@ -26,10 +26,10 @@ This document explains how to build the [glm-4-9b](https://huggingface.co/THUDM/
## Overview
-The TensorRT-LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../../../tensorrt_llm/models/chatglm/model.py).
-The TensorRT-LLM ChatGLM example code is located in [`examples/models/core/glm-4-9b`](./). There is one main file:
+The TensorRT LLM ChatGLM implementation can be found in [`tensorrt_llm/models/chatglm/model.py`](../../../../tensorrt_llm/models/chatglm/model.py).
+The TensorRT LLM ChatGLM example code is located in [`examples/models/core/glm-4-9b`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -98,9 +98,9 @@ git clone https://huggingface.co/THUDM/glm-10b glm_10b
git clone https://huggingface.co/THUDM/glm-4-9b glm_4_9b
```
-### 2. Convert weights from HF Transformers to TensorRT-LLM format
+### 2. Convert weights from HF Transformers to TensorRT LLM format
-The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF weights to TensorRT-LLM checkpoints. The number of checkpoint files (in .safetensors format) is same to the number of GPUs used to run inference.
+The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF weights to TensorRT LLM checkpoints. The number of checkpoint files (in .safetensors format) is the same as the number of GPUs used to run inference.
```bash
# GLM-4-9B: single gpu, dtype float16
@@ -109,7 +109,7 @@ python3 convert_checkpoint.py --model_dir glm_4_9b --output_dir trt_ckpt/glm_4_9
### 3. Build TensorRT engine(s)
-The `trtllm-build` command builds TensorRT-LLM engines from TensorRT-LLM checkpoints. The number of engine files is also same to the number of GPUs used to run inference.
+The `trtllm-build` command builds TensorRT LLM engines from TensorRT LLM checkpoints. The number of engine files is also the same as the number of GPUs used to run inference.
Normally, the `trtllm-build` command only requires a single GPU, but you can enable parallel building by passing the number of GPUs to the `--workers` argument.
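For example, a hedged two-GPU sketch (paths are illustrative, and the converter's `--tp_size` flag is assumed to behave as in the other example converters) could build both engine files in parallel:

```bash
# GLM-4-9B: 2-way tensor parallelism; two worker processes build the two engines.
python3 convert_checkpoint.py --model_dir glm_4_9b \
        --tp_size 2 \
        --output_dir trt_ckpt/glm_4_9b/fp16/2-gpu

trtllm-build --checkpoint_dir trt_ckpt/glm_4_9b/fp16/2-gpu \
        --workers 2 \
        --output_dir trt_engines/glm_4_9b/fp16/2-gpu
```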
@@ -240,7 +240,7 @@ python3 ../../../run.py --input_text "What's new between ChatGLM3-6B and ChatGLM
### Activation-aware Weight Quantization (AWQ)
-The [`quantize.py`](../../../quantization/quantize.py) script can be used to quantize the models and export TensorRT-LLM checkpoints.
+The [`quantize.py`](../../../quantization/quantize.py) script can be used to quantize the models and export TensorRT LLM checkpoints.
```bash
# glm_4_9b: single gpu, int4 awq quantization
@@ -263,7 +263,7 @@ python3 ../../../run.py --input_text "What's new between ChatGLM3-6B and ChatGLM
### FP8 Quantization
-The [`quantize.py`](../../../quantization/quantize.py) script can be used to quantize the models and export TensorRT-LLM checkpoints.
+The [`quantize.py`](../../../quantization/quantize.py) script can be used to quantize the models and export TensorRT LLM checkpoints.
```bash
# glm_4_9b: single gpu, fp8 quantization
diff --git a/examples/models/core/gpt/README.md b/examples/models/core/gpt/README.md
index 376839b3c4a..703347325f0 100644
--- a/examples/models/core/gpt/README.md
+++ b/examples/models/core/gpt/README.md
@@ -1,13 +1,13 @@
# GPT
-This document explains how to build the [GPT](https://huggingface.co/gpt2) model using TensorRT-LLM and run on a single GPU, a single node with multiple GPUs or multiple nodes with multiple GPUs.
+This document explains how to build the [GPT](https://huggingface.co/gpt2) model using TensorRT LLM and run it on a single GPU, a single node with multiple GPUs, or multiple nodes with multiple GPUs.
- [GPT](#gpt)
- [Overview](#overview)
- [Support Matrix](#support-matrix)
- [Usage](#usage)
- [1. Download weights from HuggingFace Transformers](#1-download-weights-from-huggingface-transformers)
- - [2. Convert weights from HF Transformers to TensorRT-LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
+ - [2. Convert weights from HF Transformers to TensorRT LLM format](#2-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
- [3. Build TensorRT engine(s)](#3-build-tensorrt-engines)
- [Fused MultiHead Attention (FMHA)](#fused-multihead-attention-fmha)
- [In-flight batching and paged KV cache](#in-flight-batching-and-paged-kv-cache)
@@ -37,9 +37,9 @@ This document explains how to build the [GPT](https://huggingface.co/gpt2) model
## Overview
-The TensorRT-LLM GPT implementation can be found in [`tensorrt_llm/models/gpt/model.py`](../../../../tensorrt_llm/models/gpt/model.py). The TensorRT-LLM GPT example code is located in [`examples/models/core/gpt`](./). There is one main file:
+The TensorRT LLM GPT implementation can be found in [`tensorrt_llm/models/gpt/model.py`](../../../../tensorrt_llm/models/gpt/model.py). The TensorRT LLM GPT example code is located in [`examples/models/core/gpt`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -62,7 +62,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
## Usage
The next two sections describe how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers)
-format to the TensorRT-LLM format.
+format to the TensorRT LLM format.
### 1. Download weights from HuggingFace Transformers
@@ -78,8 +78,8 @@ rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2
pushd gpt2 && rm pytorch_model.bin model.safetensors && wget -q https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin && popd
```
-### 2. Convert weights from HF Transformers to TensorRT-LLM format
-The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF weights to TensorRT-LLM checkpoints. The number of checkpoint files (in .safetensors format) is same to the number of GPUs used to run inference.
+### 2. Convert weights from HF Transformers to TensorRT LLM format
+The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF weights to TensorRT LLM checkpoints. The number of checkpoint files (in .safetensors format) is the same as the number of GPUs used to run inference.
```bash
# single gpu, dtype float16
@@ -102,7 +102,7 @@ python3 convert_checkpoint.py --model_dir gpt2 \
```
### 3. Build TensorRT engine(s)
-The `trtllm-build` command builds TensorRT-LLM engines from TensorRT-LLM checkpoints. The checkpoint directory provides the model's weights and architecture configuration. The number of engine files is also same to the number of GPUs used to run inference.
+The `trtllm-build` command builds TensorRT LLM engines from TensorRT LLM checkpoints. The checkpoint directory provides the model's weights and architecture configuration. The number of engine files is also the same as the number of GPUs used to run inference.
`trtllm-build` command has a variety of options. In particular, the plugin-related options have two categories:
* Plugin options that requires a data type (e.g., `gpt_attention_plugin`), you can
@@ -117,16 +117,16 @@ The defaults have been carefully tuned for better performance. For example, `gpt
Normally, the `trtllm-build` command only requires a single GPU, but you can enable parallel building by passing the number of GPUs to the `--workers` argument.
```bash
-# Build a single-GPU float16 engine from TensorRT-LLM checkpoint.
-# gpt_attention_plugin (the special TensorRT-LLM GPT Attention plugin) and remove_input_padding are enabled by default for runtime performance.
+# Build a single-GPU float16 engine from TensorRT LLM checkpoint.
+# gpt_attention_plugin (the special TensorRT LLM GPT Attention plugin) and remove_input_padding are enabled by default for runtime performance.
trtllm-build --checkpoint_dir gpt2/trt_ckpt/fp16/1-gpu \
--output_dir gpt2/trt_engines/fp16/1-gpu
-# Build 2-way tensor parallelism engines from TensorRT-LLM checkpoint.
+# Build 2-way tensor parallelism engines from TensorRT LLM checkpoint.
trtllm-build --checkpoint_dir gpt2/trt_ckpt/fp16/2-gpu \
--output_dir gpt2/trt_engines/fp16/2-gpu
-# Build 2-way tensor parallelism and 2-way pipeline parallelism engines from TensorRT-LLM checkpoint.
+# Build 2-way tensor parallelism and 2-way pipeline parallelism engines from TensorRT LLM checkpoint.
trtllm-build --checkpoint_dir gpt2/trt_ckpt/fp16/4-gpu \
--output_dir gpt2/trt_engines/fp16/4-gpu
```
@@ -157,7 +157,7 @@ Note that the FMHA kernels have to be used together with `gpt_attention_plugin`
If one wants to use [in-flight batching in C++ runtime](../../docs/in_flight_batching.md), the engine(s) must be built accordingly. In-flight batching in C++ runtime works only with attention plugin, paged KV cache and with packed data. Currently, the `trtllm-build` by default enables `gpt_attention_plugin`, `paged_kv_cache` and `remove_input_padding`, so the built engine(s) can support in-flight batching (unless you explicitly disable one of these options). One can additionally control the size of the block in paged KV cache using `--tokens_per_block=N`.
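As a hedged illustration (the block-size value is arbitrary and the paths reuse the earlier gpt2 example), an in-flight-batching-capable engine with a custom KV-cache block size could be built as:

```bash
# gpt_attention_plugin, paged_kv_cache and remove_input_padding are already on
# by default; --tokens_per_block only tunes the paged KV cache block size.
trtllm-build --checkpoint_dir gpt2/trt_ckpt/fp16/1-gpu \
        --tokens_per_block=64 \
        --output_dir gpt2/trt_engines/fp16-ifb/1-gpu
```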
### 4. Build TensorRT engine(s) with Random Weights
-You can build engine(s) using random weights, which is useful for benchmarking. First, the [`../generate_checkpoint_config.py`](../generate_checkpoint_config.py) script can be used to generate a TensorRT-LLM checkpoint config file:
+You can build engine(s) using random weights, which is useful for benchmarking. First, the [`../generate_checkpoint_config.py`](../generate_checkpoint_config.py) script can be used to generate a TensorRT LLM checkpoint config file:
```bash
# Generate an 8-GPU GPT-175B float16 checkpoint config file.
@@ -186,7 +186,7 @@ Then, use `trtllm-build` command to build engine(s) with random weights and the
```bash
# Build 8-GPU GPT-175B float16 engines using dummy weights, useful for performance tests.
-# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
+# Enable several TensorRT LLM plugins to increase runtime performance. It also helps with build time.
trtllm-build --model_config gpt_175b/trt_ckpt/fp16/8-gpu/config.json \
--gemm_plugin auto \
--max_batch_size 256 \
@@ -194,7 +194,7 @@ trtllm-build --model_config gpt_175b/trt_ckpt/fp16/8-gpu/config.json \
--workers 8
# Build 16-GPU GPT-530B float16 engines using dummy weights, useful for performance tests.
-# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
+# Enable several TensorRT LLM plugins to increase runtime performance. It also helps with build time.
trtllm-build --model_config gpt_530b/trt_ckpt/fp16/16-gpu/config.json \
--gemm_plugin auto \
--max_batch_size 128 \
@@ -225,7 +225,7 @@ Output [Text 0 Beam 0]: " chef before moving to London in the early"
The [`summarize.py`](../../../summarize.py) script can run the built engines to summarize the articles from the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset.
For each summary, the script can compute the
[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
-By passing `--test_trt_llm` flag, the script will evaluate TensorRT-LLM engines. You may also pass `--test_hf` flag to evaluate the HF model.
+By passing the `--test_trt_llm` flag, the script will evaluate TensorRT LLM engines. You may also pass the `--test_hf` flag to evaluate the HF model.
```bash
python3 ../../../summarize.py --engine_dir gpt2/trt_engines/fp16/1-gpu \
@@ -236,10 +236,10 @@ python3 ../../../summarize.py --engine_dir gpt2/trt_engines/fp16/1-gpu \
If the engines are run successfully, you will see output like:
```
......
-[03/13/2024-05:43:18] [TRT-LLM] [I] TensorRT-LLM (total latency: 1.520904541015625 sec)
-[03/13/2024-05:43:18] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 0)
-[03/13/2024-05:43:18] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 0.0)
-[03/13/2024-05:43:18] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[03/13/2024-05:43:18] [TRT-LLM] [I] TensorRT LLM (total latency: 1.520904541015625 sec)
+[03/13/2024-05:43:18] [TRT-LLM] [I] TensorRT LLM (total output tokens: 0)
+[03/13/2024-05:43:18] [TRT-LLM] [I] TensorRT LLM (tokens per second: 0.0)
+[03/13/2024-05:43:18] [TRT-LLM] [I] TensorRT LLM beam 0 result
[03/13/2024-05:43:18] [TRT-LLM] [I] rouge1 : 21.13474087351942
[03/13/2024-05:43:18] [TRT-LLM] [I] rouge2 : 6.2641616526063775
[03/13/2024-05:43:18] [TRT-LLM] [I] rougeL : 16.693574311238077
@@ -270,7 +270,7 @@ mpirun -np 8 \
#### Multiple nodes, multiple GPUs using [Slurm](https://slurm.schedmd.com)
-To run engines using multiple nodes, you should use a cluster manager like `Slurm`. The following section shows how to configure TensorRT-LLM to execute on two nodes using Slurm.
+To run engines using multiple nodes, you should use a cluster manager like `Slurm`. The following section shows how to configure TensorRT LLM to execute on two nodes using Slurm.
We start by preparing an `sbatch` script called `tensorrt_llm_run.sub`. That script contains the following code (you must replace the `` strings with your own values):
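As a rough, hedged sketch of the shape such a script can take (every value below is a placeholder to adapt to your cluster, and the actual script may differ):

```bash
#!/bin/bash
#SBATCH --job-name=tensorrt_llm_run
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8     # one rank per GPU
#SBATCH --gpus-per-node=8
#SBATCH --partition=<partition>
#SBATCH --account=<account>

# Placeholder paths: point at your engine directory and tokenizer.
srun --mpi=pmix \
     python3 ../../../run.py --engine_dir <engine_dir> \
                             --tokenizer_dir gpt2 \
                             --max_output_len 8
```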
@@ -340,7 +340,7 @@ during inference.
#### INT8 inference
-The INT8 quantization scheme used in TensorRT-LLM theoretically works on any
+The INT8 quantization scheme used in TensorRT LLM theoretically works on any
GPT model. However, Smoothquant'd models tend to produce more accurate results
with reduced precision.
@@ -354,7 +354,7 @@ range: `[-128, 127]`. Similarly, W is scaled, `W_{i8} <- W_{fp} * s_w` but that
operation is done at model export time, no need for subsequent operations at
run-time.
-The optimized TensorRT-LLM GEMM implementation for SmoothQuant does the integer
+The optimized TensorRT LLM GEMM implementation for SmoothQuant does the integer
matrix-multiplication `Y_{i32} <- X_{i8} W_{i8}` and rescales the result to its
original range `Y_{fp} <- Y_{i32} * (s_x)^{-1} * (s_w)^{-1}`. Note that
`Y_{i32}` isn't stored in memory, the re-scaling happens in the GEMM's epilogue
@@ -364,9 +364,9 @@ By default `s_x` and `s_w` are single-value coefficients. This is the
*per-tensor* mode. Values for `s_x` and `s_w` are static, estimated at model
export time.
-TensorRT-LLM also supports more elaborate modes:
+TensorRT LLM also supports more elaborate modes:
- per-channel: `s_w` is a fixed vector of size `[1, m]`. For that,
- TensorRT-LLM loads the adequately scaled version of of `W_{i8}` at model
+  TensorRT LLM loads the adequately scaled version of `W_{i8}` at model
construction time.
- per-token: `s_x` is a vector of size `[n, 1]` determined at run-time, based
on the per-token (a.k.a. per-row) absolute maximum of `X`.
@@ -482,7 +482,7 @@ trtllm-build --checkpoint_dir gpt2/trt_ckpt/int4-wo/1-gpu \
### FP8 Quantization
-[`quantize.py`](../../../quantization/quantize.py) can do FP8 quantization and/or FP8 kv cache quantization, and export TensorRT-LLM checkpoint.
+[`quantize.py`](../../../quantization/quantize.py) can do FP8 quantization and/or FP8 kv cache quantization, and export TensorRT LLM checkpoint.
```bash
# FP8 quantization with FP8 kv cache
@@ -544,14 +544,14 @@ For Granite, the steps are similar to StarCoder.
# Download hf granite model
git clone https://huggingface.co/ibm-granite/granite-34b-code-instruct granite
-# Convert to TensorRT-LLM checkpoint
+# Convert to TensorRT LLM checkpoint
python3 convert_checkpoint.py --model_dir granite \
--dtype float16 \
--gpt_variant starcoder \
--tp_size 4 \
--output_dir granite/trt_ckpt/fp16/4-gpu
-# Build TensorRT-LLM engines
+# Build TensorRT LLM engines
trtllm-build --checkpoint_dir granite/trt_ckpt/fp16/4-gpu \
--gemm_plugin auto \
--output_dir granite/trt_engines/fp16/4-gpu
@@ -572,13 +572,13 @@ The SantaCoder extends the existing GPT model with multi-query attention mechani
# Download hf santacoder model
git clone https://huggingface.co/bigcode/santacoder
-# Convert to TensorRT-LLM checkpoint
+# Convert to TensorRT LLM checkpoint
python3 convert_checkpoint.py --model_dir santacoder \
--dtype float16 \
--tp_size 4 \
--output_dir santacoder/trt_ckpt/fp16/4-gpu
-# Build TensorRT-LLM engines
+# Build TensorRT LLM engines
trtllm-build --checkpoint_dir santacoder/trt_ckpt/fp16/4-gpu \
--gemm_plugin auto \
--output_dir santacoder/trt_engines/fp16/4-gpu
@@ -600,13 +600,13 @@ For StarCoder, the steps are similar to SantaCoder.
# Download hf starcoder model
git clone https://huggingface.co/bigcode/starcoder
-# Convert to TensorRT-LLM checkpoint
+# Convert to TensorRT LLM checkpoint
python3 convert_checkpoint.py --model_dir starcoder \
--dtype float16 \
--tp_size 4 \
--output_dir starcoder/trt_ckpt/fp16/4-gpu
-# Build TensorRT-LLM engines
+# Build TensorRT LLM engines
trtllm-build --checkpoint_dir starcoder/trt_ckpt/fp16/4-gpu \
--gemm_plugin auto \
--output_dir starcoder/trt_engines/fp16/4-gpu
@@ -626,7 +626,7 @@ For StarCoder2, you can use almost the same steps as shown above.
### Run StarCoder2 with LoRA
-TensorRT-LLM supports running StarCoder2 models with FP16/BF16/FP32 LoRA. In this section, we use starcoder2-15b as an example to show how to run an FP8 base model with FP16 LoRA module.
+TensorRT LLM supports running StarCoder2 models with FP16/BF16/FP32 LoRA. In this section, we use starcoder2-15b as an example to show how to run an FP8 base model with an FP16 LoRA module.
* download the base model and lora model from HF
@@ -667,19 +667,19 @@ python ../../../run.py --engine_dir starcoder2-15b/trt_engines/fp8_lora/1-gpu \
NVIDIA has released a GPT-like model with some architectural improvements, that you can find here: [https://huggingface.co/nvidia/GPT-2B-001](https://huggingface.co/nvidia/GPT-2B-001). This architecture is also supported by TensorRT-LLM.
-Different from Huggingface's checkpoint, you should specify the NeMo checkpoint path using `--nemo_ckpt_path` for `convert_checkpoint.py`. The script also extracts the tokenizer file from the NeMo checkpoint and saves it to the TensorRT-LLM checkpoint folder, which can be used in the inference scripts.
+Unlike with a Huggingface checkpoint, you should specify the NeMo checkpoint path using `--nemo_ckpt_path` for `convert_checkpoint.py`. The script also extracts the tokenizer file from the NeMo checkpoint and saves it to the TensorRT LLM checkpoint folder, which can be used in the inference scripts.
```bash
# Download NeMo checkpoint
wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo
-# Convert to TensorRT-LLM checkpoint
-# It also extracts the tokenizer file and saves to the TensorRT-LLM checkpoint folder
+# Convert to TensorRT LLM checkpoint
+# It also extracts the tokenizer file and saves to the TensorRT LLM checkpoint folder
python3 convert_checkpoint.py --nemo_ckpt_path GPT-2B-001_bf16_tp1.nemo \
--dtype bfloat16 \
--output_dir gpt-next-2B/trt_ckpt/bf16/1-gpu
-# Build TensorRT-LLM engines
+# Build TensorRT LLM engines
# --gpt_attention_plugin must be set for GPT-Next since Rotary positional embeddings (RoPE) is only supported by the gpt attention plugin at this time.
trtllm-build --checkpoint_dir gpt-next-2B/trt_ckpt/bf16/1-gpu \
--output_dir gpt-next-2B/trt_engines/bf16/1-gpu
@@ -696,15 +696,15 @@ python3 ../../../run.py --engine_dir gpt-next-2B/trt_engines/bf16/1-gpu \
For efficient fine-tuning, the NeMo framework allows you to learn virtual tokens to accomplish a downstream task. For more details, please read the
NeMo documentation [here](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html).
-TensorRT-LLM supports inference with those virtual tokens. To enable it, pass the prompt embedding table's maximum size at build time with `--max_prompt_embedding_table_size N`. For example:
+TensorRT LLM supports inference with those virtual tokens. To enable it, pass the prompt embedding table's maximum size at build time with `--max_prompt_embedding_table_size N`. For example:
```bash
-# Convert to TensorRT-LLM checkpoint
+# Convert to TensorRT LLM checkpoint
python3 convert_checkpoint.py --nemo_ckpt_path megatron_converted_8b_tp4_pp1.nemo \
--dtype float16 \
--output_dir gpt-next-8B/trt_ckpt/fp16/1-gpu
-# Build TensorRT-LLM engines with prompt-tuning enabled
+# Build TensorRT LLM engines with prompt-tuning enabled
trtllm-build --checkpoint_dir gpt-next-8B/trt_ckpt/fp16/1-gpu \
--max_prompt_embedding_table_size 100 \
--output_dir gpt-next-8B/trt_engines/fp16/1-gpu
@@ -733,12 +733,12 @@ python3 ../../../run.py --engine_dir gpt-next-8B/trt_engines/fp16/1-gpu \
# Download NeMo checkpoint
wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo
-# Convert to TensorRT-LLM checkpoint
+# Convert to TensorRT LLM checkpoint
python3 convert_checkpoint.py --nemo_ckpt_path GPT-2B-001_bf16_tp1.nemo \
--dtype float16 \
--output_dir gpt-next-2B/trt_ckpt/fp16/1-gpu
-# Build TensorRT-LLM engines
+# Build TensorRT LLM engines
trtllm-build --checkpoint_dir gpt-next-2B/trt_ckpt/fp16/1-gpu \
--lora_plugin auto \
--lora_dir gpt2b_lora-900.nemo gpt2b_lora-stories.nemo \
diff --git a/examples/models/core/granite/README.md b/examples/models/core/granite/README.md
index 442085841c9..407107199c6 100644
--- a/examples/models/core/granite/README.md
+++ b/examples/models/core/granite/README.md
@@ -2,11 +2,11 @@
This document shows how to build and run a [Granite 3.0](https://huggingface.co/collections/ibm-granite/granite-30-language-models-66fdb59bbb54785c3512114f) model in TensorRT-LLM.
-The TensorRT-LLM Granite implementation is based on the LLaMA model, with Mixture of Experts (MoE) enabled. The implementation can be found in [`llama/model.py`](../../../../tensorrt_llm/models/llama/model.py). See the LLaMA example [`examples/models/core/llama`](../llama) for details.
+The TensorRT LLM Granite implementation is based on the LLaMA model, with Mixture of Experts (MoE) enabled. The implementation can be found in [`llama/model.py`](../../../../tensorrt_llm/models/llama/model.py). See the LLaMA example [`examples/models/core/llama`](../llama) for details.
- [Granite 3.0](#Granite)
- [Download model checkpoints](#download-model-checkpoints)
- - [Convert weights from HF Transformers to TensorRT-LLM format](#Convert-weights-from-HF-Transformers-to-TensorRT-LLM-format)
+ - [Convert weights from HF Transformers to TensorRT LLM format](#Convert-weights-from-HF-Transformers-to-TensorRT-LLM-format)
- [Build TensorRT engine](#build-tensorrt-engine)
- [Run Engine](#run-engine)
@@ -20,7 +20,7 @@ HF_MODEL="granite-3.0-8b-instruct" # or granite-3.0-3b-a800m-instruct
git clone https://huggingface.co/ibm-granite/${HF_MODEL} tmp/hf_checkpoints/${HF_MODEL}
```
-## Convert weights from HF Transformers to TensorRT-LLM format
+## Convert weights from HF Transformers to TensorRT LLM format
Set environment variables and necessary directory:
```bash
@@ -46,7 +46,7 @@ python3 ../llama/convert_checkpoint.py --model_dir tmp/hf_checkpoints/${HF_MODEL
### FP8 PTQ
Notes:
- Currently, quantize.py does not support Expert Parallelism (EP) mode. Users should use `../llama/convert_checkpoint.py` and specify `--moe_ep_size 1` instead, if needed.
-- TensorRT-LLM uses static quantization methods, which is expected to be faster at runtime as compared to dynamic quantization methods. This comes at a cost of an offline calibration step during quantization. `batch_size` and `calib_size` can be adjusted to shorten the calibration time. Please refer to `../../../quantization/README.md` for explanation.
+- TensorRT LLM uses static quantization methods, which are expected to be faster at runtime than dynamic quantization methods. This comes at the cost of an offline calibration step during quantization. `batch_size` and `calib_size` can be adjusted to shorten the calibration time. Please refer to `../../../quantization/README.md` for an explanation.
```bash
PREC_QUANT="fp8"
diff --git a/examples/models/core/internlm2/README.md b/examples/models/core/internlm2/README.md
index 7073999b850..d58f04713c8 100644
--- a/examples/models/core/internlm2/README.md
+++ b/examples/models/core/internlm2/README.md
@@ -1,14 +1,14 @@
# InternLM2
-This document shows how to build and run InternLM2 7B / 20B models in TensorRT-LLM on both single GPU, single node multi-GPU and multi-node multi-GPU.
+This document shows how to build and run InternLM2 7B / 20B models in TensorRT LLM on a single GPU, a single node with multiple GPUs, and multiple nodes with multiple GPUs.
## Overview
-The TensorRT-LLM InternLM2 implementation is based on the LLaMA model. The implementation can
+The TensorRT LLM InternLM2 implementation is based on the LLaMA model. The implementation can
be found in [model.py](../../../../tensorrt_llm/models/llama/model.py).
-The TensorRT-LLM InternLM2 example code lies in [`examples/models/core/internlm2`](./):
+The TensorRT LLM InternLM2 example code is located in [`examples/models/core/internlm2`](./):
-* [`convert_checkpoint.py`](./convert_checkpoint.py) converts the Huggingface Model of InternLM2 into TensorRT-LLM checkpoint.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) converts the Huggingface InternLM2 model into the TensorRT LLM checkpoint format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -23,7 +23,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
## Usage
-The TensorRT-LLM InternLM2 example code locates at [examples/models/core/internlm2](./). It takes HF weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM InternLM2 example code is located at [examples/models/core/internlm2](./). It takes HF weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### Build TensorRT engine(s)
@@ -33,7 +33,7 @@ Please install required packages first to make sure the example uses matched `te
pip install -r requirements.txt
```
-TensorRT-LLM InternLM2 builds TensorRT engine(s) from HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build engine(s) with dummy weights.
+TensorRT LLM InternLM2 builds TensorRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, TensorRT LLM will build engine(s) with dummy weights.
InternLM2 has released several checkpoints of different sizes and capabilities under https://huggingface.co/internlm. Users can pick any one repository and follow its instructions to prepare the checkpoint.
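For instance, a hedged single-GPU sketch using the `internlm2-chat-7b` checkpoint (paths and flag values are illustrative):

```bash
# Fetch an InternLM2 checkpoint, convert it, and build a single-GPU fp16 engine.
git clone https://huggingface.co/internlm/internlm2-chat-7b ./internlm2-chat-7b

python3 convert_checkpoint.py --model_dir ./internlm2-chat-7b \
        --dtype float16 \
        --output_dir ./internlm2-chat-7b/trt_ckpt/fp16/1-gpu

trtllm-build --checkpoint_dir ./internlm2-chat-7b/trt_ckpt/fp16/1-gpu \
        --gemm_plugin auto \
        --output_dir ./internlm2-chat-7b/trt_engines/fp16/1-gpu
```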
@@ -134,7 +134,7 @@ trtllm-build --checkpoint_dir ./internlm2-chat-20b/w8a16 \
### Run
-To run a TensorRT-LLM InternLM2 model using the engines generated by `trtllm-build`
+To run a TensorRT LLM InternLM2 model using the engines generated by `trtllm-build`
```bash
# InternLM2 7B with fp16
diff --git a/examples/models/core/llama/README.md b/examples/models/core/llama/README.md
index b888b287b01..e0c6d0858fc 100644
--- a/examples/models/core/llama/README.md
+++ b/examples/models/core/llama/README.md
@@ -1,6 +1,6 @@
# LLaMA
-This document shows how to build and run a LLaMA model in TensorRT-LLM on both single GPU, single node multi-GPU and multi-node multi-GPU.
+This document shows how to build and run a LLaMA model in TensorRT LLM on a single GPU, a single node with multiple GPUs, and multiple nodes with multiple GPUs.
- [LLaMA](#llama)
- [Overview](#overview)
@@ -34,19 +34,19 @@ This document shows how to build and run a LLaMA model in TensorRT-LLM on both s
- [Run INT4-AWQ LLaMa with several FP16 lora checkpoints](#run-int4-awq-llama-with-several-fp16-lora-checkpoints)
- [Run LLaMa with StreamingLLM](#run-llama-with-streamingllm)
- [Run LLaMA-3.1 405B Model](#run-llama-31-405b-model)
- - [Convert Checkpoint to TensorRT-LLM Unified Checkpoint](#convert-checkpoint-to-tensorrt-llm-unified-checkpoint)
+ - [Convert Checkpoint to TensorRT LLM Unified Checkpoint](#convert-checkpoint-to-tensorrt-llm-unified-checkpoint)
- [Build Engine](#build-engine)
- [Run Inference](#run-inference)
- [Run LLaMa-3.3 70B Model on PyTorch Backend](#run-llama-33-70b-model-on-pytorch-backend)
- - [Prepare TensorRT-LLM extra configs](#prepare-tensorrt-llm-extra-configs)
+ - [Prepare TensorRT LLM extra configs](#prepare-tensorrt-llm-extra-configs)
- [Launch trtllm-serve OpenAI-compatible API server](#launch-trtllm-serve-openai-compatible-api-server)
- [Run performance benchmarks](#run-performance-benchmarks)
## Overview
-The TensorRT-LLM LLaMA implementation can be found in [tensorrt_llm/models/llama/model.py](../../../../tensorrt_llm/models/llama/model.py). The TensorRT-LLM LLaMA example code is located in [`examples/models/core/llama`](./). There is one main file:
+The TensorRT LLM LLaMA implementation can be found in [tensorrt_llm/models/llama/model.py](../../../../tensorrt_llm/models/llama/model.py). The TensorRT LLM LLaMA example code is located in [`examples/models/core/llama`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the LLaMA model into tensorrt-llm checkpoint format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the LLaMA model into TensorRT LLM checkpoint format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -67,7 +67,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
## Usage
-The TensorRT-LLM LLaMA example code locates at [examples/models/core/llama](./). It takes HF weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM LLaMA example code is located at [examples/models/core/llama](./). It takes HF weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### Build TensorRT engine(s)
@@ -79,7 +79,7 @@ pip install --upgrade -r requirements.txt
Need to prepare the HF LLaMA checkpoint by following the guides here https://huggingface.co/docs/transformers/main/en/model_doc/llama.
-The `trtllm-build` command builds TensorRT engine(s) from HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build engine(s) with dummy weights.
+The `trtllm-build` command builds TensorRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, TensorRT LLM will build engine(s) with dummy weights.
`trtllm-build` command has a variety of options. In particular, the plugin-related options have two categories:
* Plugin options that requires a data type (e.g., `gpt_attention_plugin`), you can
@@ -614,7 +614,7 @@ python ../../../summarize.py --test_trt_llm \
### SmoothQuant
-The smoothquant supports both LLaMA v1 and LLaMA v2. Unlike the FP16 build where the HF weights are processed and loaded into the TensorRT-LLM directly, the SmoothQuant needs to load INT8 weights which should be pre-processed before building an engine.
+SmoothQuant supports both LLaMA v1 and LLaMA v2. Unlike the FP16 build, where the HF weights are processed and loaded into TensorRT LLM directly, SmoothQuant needs to load INT8 weights that should be pre-processed before building an engine.
Example:
```bash
@@ -800,7 +800,7 @@ To run the GPTQ LLaMa example, the following steps are required:
### w4aINT8 quantization (QServe)
-TensorRT-LLM integrates the quantized GEMM from [QServe](https://arxiv.org/abs/2405.04532), which employs 4-bit quantization for weights and 8-bit quantization for activations. This technique offers versatile performance benefits across different scenarios. When the GEMM's m dimension is small, as in small batch-size decoding, it achieves performance comparable to w4a16 by reducing the memory bandwidth required for weight access. Conversely, for larger m dimensions, such as during prefilling or large batch-size decoding, it matches the performance of w8a8 by leveraging INT8 Tensor Cores.
+TensorRT LLM integrates the quantized GEMM from [QServe](https://arxiv.org/abs/2405.04532), which employs 4-bit quantization for weights and 8-bit quantization for activations. This technique offers versatile performance benefits across different scenarios. When the GEMM's m dimension is small, as in small batch-size decoding, it achieves performance comparable to w4a16 by reducing the memory bandwidth required for weight access. Conversely, for larger m dimensions, such as during prefilling or large batch-size decoding, it matches the performance of w8a8 by leveraging INT8 Tensor Cores.
Please follow the steps to run the model using QServe w4aINT8:
@@ -815,7 +815,7 @@ Please follow the steps to run the model using QServe w4aINT8:
2. Checkpoint conversion:
- Convert the DeepCompressor checkpoint into TensorRT-LLM checkpoint, potentially with tensor parallelism:
+ Convert the DeepCompressor checkpoint into TensorRT LLM checkpoint, potentially with tensor parallelism:
```bash
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1 # The current checkpoint conversion code requires legacy path
@@ -868,7 +868,7 @@ Please follow the steps to run the model using:
### Run
-To run a TensorRT-LLM LLaMA model using the engines generated by `trtllm-build`
+To run a TensorRT LLM LLaMA model using the engines generated by `trtllm-build`
```bash
# With fp16 inference
@@ -921,7 +921,7 @@ srun --container-image= \
Finally, you can submit the task with `sbatch .sh`.
-Considering the Slurm or other cluster management systems may be highly customized and the task-submit command may be variant, the forementioned example is for reference only. The key point is to submit the Python script with the MPI runtime, and TensorRT-LLM will take care of the rest.
+Considering that Slurm or other cluster management systems may be highly customized and that the task-submission command may vary, the aforementioned example is for reference only. The key point is to submit the Python script with the MPI runtime, and TensorRT LLM will take care of the rest.
### Summarization using the LLaMA model
@@ -1151,7 +1151,7 @@ fine-tuned embedding table or logit GEMM, users should guarantee that all the in
embedding table or logit GEMM.
Here, we use two LoRA checkpoints as examples. These two LoRA checkpoints add LoRA modules to `q_proj` and `v_proj`. Because we only
support adding lora modules on `q`, `k` and `v` at the same time, we need to add `--lora_target_modules "attn_q" "attn_k" "attn_v"`.
-In this case, we assign null pointers for the `k` LoRA module in TensorRT-LLM and skip the computation at runtime.
+In this case, we assign null pointers for the `k` LoRA module in TensorRT LLM and skip the computation at runtime.
As the rank of the LoRA modules of both checkpoints is 8, we can set `--max_lora_rank 8` to reduce the memory requirement for the LoRA plugin.
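A hedged build sketch reflecting those constraints (the base checkpoint and the two LoRA directories are placeholders):

```bash
# The unused k LoRA module is assigned null pointers and skipped at runtime,
# as described above.
trtllm-build --checkpoint_dir ./llama/trt_ckpt/fp16/1-gpu \
        --lora_plugin auto \
        --lora_dir lora_ckpt_0 lora_ckpt_1 \
        --lora_target_modules "attn_q" "attn_k" "attn_v" \
        --max_lora_rank 8 \
        --output_dir ./llama/trt_engines/fp16-lora/1-gpu
```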
@@ -1273,7 +1273,7 @@ Output [Text 0 Beam 0]: "날씨가 아주 좋은 날에 공원에 갔을 때는
### Run INT4-AWQ LLaMa with several FP16 lora checkpoints
-TensorRT-LLM can also support Quantized base model + FP16/BF16 LoRA. We can first quantize the base model and build engine with the quantized checkpoint and different LoRA adapters. In this section, we show how to run an INT4-AWQ llama model with multiple FP16 LoRA modules.
+TensorRT LLM can also support a quantized base model + FP16/BF16 LoRA. We can first quantize the base model, then build the engine with the quantized checkpoint and different LoRA adapters. In this section, we show how to run an INT4-AWQ llama model with multiple FP16 LoRA modules.
* Quantize the llama model to INT4-AWQ from HF
```bash
@@ -1368,11 +1368,11 @@ Note that the sink tokens is included in the sliding attention tokens, and there
## Run LLaMA-3.1 405B Model
-Currently, TensorRT-LLM supports Meta checkpoint and Huggingface checkpoint for LLaMA-3.1. In this section, we demonstrate how to run the LLaMA-3.1 405B model via TensorRT-LLM. Here, we assume users have downloaded the checkpoints and placed them at `llama_3.1_405B_meta_model/` (Meta BF16 checkpoint), `llama_3.1_405B_HF_model/` (HF BF16 checkpoint) and `llama_3.1_405B_HF_FP8_model/` (HF FP8 checkpoint). Before converting the checkpoints to TensorRT-LLM unified checkpoints, **please check that `{"rope_scaling": {"rope_type": "llama3"}}` is set in the configuration file**. With this flag, TensorRT-LLM will enable the rope scaling of LLaMA-3.1. If not, please add it to the config file.
+Currently, TensorRT LLM supports Meta and Huggingface checkpoints for LLaMA-3.1. In this section, we demonstrate how to run the LLaMA-3.1 405B model via TensorRT LLM. Here, we assume users have downloaded the checkpoints and placed them at `llama_3.1_405B_meta_model/` (Meta BF16 checkpoint), `llama_3.1_405B_HF_model/` (HF BF16 checkpoint) and `llama_3.1_405B_HF_FP8_model/` (HF FP8 checkpoint). Before converting the checkpoints to TensorRT LLM unified checkpoints, **please check that `{"rope_scaling": {"rope_type": "llama3"}}` is set in the configuration file**. With this flag, TensorRT LLM will enable the rope scaling of LLaMA-3.1. If not, please add it to the config file.
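A quick, hedged way to check this before conversion (using the HF checkpoint directory assumed above):

```bash
# Expect a "rope_scaling" block whose "rope_type" is "llama3".
grep -A 3 '"rope_scaling"' llama_3.1_405B_HF_model/config.json
```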
Users can run the LLaMA-3.1 model with higher precision (bf16/fp16) or fp8. Here, to prevent accuracy drop, we perform per-channel per-token fp8 quantization (leveraged from https://github.com/pytorch/FBGEMM) on MLP layers, keeping other layers at higher precision. Note that per-channel per-token fp8 quantization is only supported on Huggingface checkpoint now. We will support it on Meta checkpoint soon. Note that this feature only supports SM90.
-### Convert Checkpoint to TensorRT-LLM Unified Checkpoint
+### Convert Checkpoint to TensorRT LLM Unified Checkpoint
To use the fp8 quantization, please add the `--use_fp8_rowwise` flag during the checkpoint conversion. In this demonstration, we convert the Meta checkpoint to bfloat16 with TP8-PP2 and the HF checkpoint to FP8 with TP8.
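As a hedged sketch of the HF-side conversion (the output directory naming is illustrative; the flag set follows the description above):

```bash
# HF checkpoint -> FP8-rowwise TensorRT LLM unified checkpoint with TP8.
python3 convert_checkpoint.py --model_dir llama_3.1_405B_HF_model \
        --output_dir llama_3.1_405B_HF_model/trt_ckpt/fp8_rowwise/tp8 \
        --dtype bfloat16 \
        --use_fp8_rowwise \
        --tp_size 8
```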
@@ -1548,10 +1548,10 @@ bash -c 'python ./examples/mmlu.py --test_trt_llm \
```
## Run LLaMa-3.3 70B Model on PyTorch Backend
-This section provides the steps to run LLaMa-3.3 70B model FP8 precision on PyTorch backend by launching TensorRT-LLM server and run performance benchmarks.
+This section provides the steps to run the LLaMa-3.3 70B model in FP8 precision on the PyTorch backend by launching a TensorRT LLM server and running performance benchmarks.
-### Prepare TensorRT-LLM extra configs
+### Prepare TensorRT LLM extra configs
```bash
cat >./extra-llm-api-config.yml <<EOF
# Download hf minitron model
git clone https://huggingface.co/nvidia/Minitron-4B-Base
-# Convert to TensorRT-LLM checkpoint
+# Convert to TensorRT LLM checkpoint
python3 ../gpt/convert_checkpoint.py --model_dir Minitron-4B-Base \
--dtype bfloat16 \
--output_dir minitron/trt_ckpt/bf16/1-gpu
-# Build TensorRT-LLM engines
+# Build TensorRT LLM engines
trtllm-build --checkpoint_dir minitron/trt_ckpt/bf16/1-gpu \
--gemm_plugin auto \
--output_dir minitron/trt_engines/bf16/1-gpu
diff --git a/examples/models/core/nemotron_nas/README.md b/examples/models/core/nemotron_nas/README.md
index ada6894fdc8..0b2201605ff 100644
--- a/examples/models/core/nemotron_nas/README.md
+++ b/examples/models/core/nemotron_nas/README.md
@@ -12,9 +12,9 @@ This document shows how to convert and build a model generated by Nemotron-NAS,
## Overview
-The TensorRT-LLM Nemotron-NAS implementation can be found in [tensorrt_llm/models/nemotron_nas/model.py](../../../../tensorrt_llm/models/nemotron_nas/model.py). The TensorRT-LLM Nemotron-NAS example code is located in [`examples/models/core/nemotron_nas`](./). There is one main file:
+The TensorRT LLM Nemotron-NAS implementation can be found in [tensorrt_llm/models/nemotron_nas/model.py](../../../../tensorrt_llm/models/nemotron_nas/model.py). The TensorRT LLM Nemotron-NAS example code is located in [`examples/models/core/nemotron_nas`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the model into tensorrt-llm checkpoint format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the model into TensorRT LLM checkpoint format.
The recommended flow for using Nemotron-NAS models is through TRTLLM's PyTorch-based flow.
An example of how to run `Nemotron-NAS` models through the PyTorch workflow can be found in the [PyTorch quickstart example](../../../pytorch/README.md).
@@ -43,7 +43,7 @@ Due the non-uniform architecture of the model, the different pipeline parallelis
## Usage
-The TensorRT-LLM example code is located at [examples/models/core/nemotron_nas](./).
+The TensorRT LLM example code is located at [examples/models/core/nemotron_nas](./).
The `convert_checkpoint.py` script accepts Hugging Face weights as input, and builds the corresponding TensorRT engines.
The number of TensorRT engines depends on the number of GPUs used to run inference.
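For instance, a two-GPU conversion and build might look like the following sketch (the flag names and paths are assumptions; the resulting checkpoint and engine contain one rank file per GPU):

```bash
python3 convert_checkpoint.py --model_dir ./Llama-3_1-Nemotron-51B-Instruct \
                              --dtype bfloat16 \
                              --tp_size 2 \
                              --output_dir ./nemotron_nas_ckpt_tp2
trtllm-build --checkpoint_dir ./nemotron_nas_ckpt_tp2 \
             --output_dir ./nemotron_nas_engine_tp2 \
             --gemm_plugin auto
```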
@@ -51,8 +51,8 @@ The number of TensorRT engines depends on the number of GPUs used to run inferen
To build a TensorRT engine, you first need to obtain a Nemotron-NAS checkpoint in Hugging Face format. For example, [Llama-3_1-Nemotron-51B-Instruct](https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct).
-The `trtllm-build` command builds TensorRT engines from a TensorRT-LLM checkpoint.
-If no checkpoint directory is specified, TensorRT-LLM builds the engines with dummy weights.
+The `trtllm-build` command builds TensorRT engines from a TensorRT LLM checkpoint.
+If no checkpoint directory is specified, TensorRT LLM builds the engines with dummy weights.
The `trtllm-build` command has a variety of options.
In particular, the plugin-related options have two categories:
@@ -118,7 +118,7 @@ The conversion script supports additional models with variable GQA, such as [Dec
## Runtime
-After you build the engine, you can use the engine with any TensorRT-LLM entrypoint or API.
+After you build the engine, you can use the engine with any TensorRT LLM entrypoint or API.
For example, you can run inference with [examples/run.py](../../../run.py):
```bash
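# A representative invocation for a 2-way tensor-parallel engine (the engine and
# tokenizer paths are assumptions; adjust them to your own build):
mpirun -n 2 --allow-run-as-root \
    python3 ../../../run.py --engine_dir ./nemotron_nas_engine_tp2 \
                            --tokenizer_dir ./Llama-3_1-Nemotron-51B-Instruct \
                            --max_output_len 100 \
                            --input_text "What is the capital of France?"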
diff --git a/examples/models/core/phi/README.md b/examples/models/core/phi/README.md
index 3a3543c3536..304bec48a25 100644
--- a/examples/models/core/phi/README.md
+++ b/examples/models/core/phi/README.md
@@ -1,12 +1,12 @@
# Phi
-This document explains how to build Phi-2, Phi-3 and Phi-3.5 family of models using TensorRT-LLM and run on a single or multiple GPUs.
+This document explains how to build Phi-2, Phi-3 and Phi-3.5 family of models using TensorRT LLM and run on a single or multiple GPUs.
For multimodal models (Phi-3-vision-128k-instruct and Phi-3.5-vision-instruct), see `../multimodal/README.md`.
- [Overview](#overview)
- [Support Matrix](#support-matrix)
- [Usage](#usage)
- - [1. Convert weights from HF Transformers to TensorRT-LLM format](#1-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
+ - [1. Convert weights from HF Transformers to TensorRT LLM format](#1-convert-weights-from-hf-transformers-to-tensorrt-llm-format)
- [2. Build TensorRT engine(s)](#2-build-tensorrt-engines)
- [3. Summarization using the Phi model](#3-summarization-using-the-phi-model)
- [4. Quantization](#4-quantization)
@@ -14,9 +14,9 @@ For multimodal models (Phi-3-vision-128k-instruct and Phi-3.5-vision-instruct),
## Overview
-The TensorRT-LLM Phi implementation can be found in [`tensorrt_llm/models/phi/model.py`](../../../../tensorrt_llm/models/phi/model.py) and [`tensorrt_llm/models/phi3/model.py`](../../../../tensorrt_llm/models/phi3/model.py). The TensorRT-LLM Phi example code is located in [`examples/models/core/phi`](./) with a single file:
+The TensorRT LLM Phi implementation can be found in [`tensorrt_llm/models/phi/model.py`](../../../../tensorrt_llm/models/phi/model.py) and [`tensorrt_llm/models/phi3/model.py`](../../../../tensorrt_llm/models/phi3/model.py). The TensorRT LLM Phi example code is located in [`examples/models/core/phi`](./) with a single file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -43,7 +43,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
## Usage
-### 1. Convert weights from HF Transformers to TensorRT-LLM format
+### 1. Convert weights from HF Transformers to TensorRT LLM format
Please install required packages first:
@@ -65,13 +65,13 @@ The section on Parallelism Modes in `../mixtral/README.md` discusses tensor and
### 2. Build TensorRT engine(s)
-TensorRT-LLM builds TensorRT engine(s) using a HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build engine(s) using dummy weights.
+TensorRT LLM builds TensorRT engine(s) using an HF checkpoint. If no checkpoint directory is specified, TensorRT LLM will build engine(s) using dummy weights.
Examples of build invocations:
```bash
# Build a float16 engine using a single GPU and HF weights.
-# Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time.
+# Enable several TensorRT LLM plugins to increase runtime performance. It also helps with build time.
trtllm-build \
--checkpoint_dir ./phi-checkpoint \
--output_dir ./phi-engine \
@@ -83,7 +83,7 @@ trtllm-build \
### 3. Summarization using the Phi model
-The following section describes how to run a TensorRT-LLM Phi model to summarize the articles from the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset. For each summary, the script can compute the [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
+The following section describes how to run a TensorRT LLM Phi model to summarize the articles from the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset. For each summary, the script can compute the [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) scores and use the `ROUGE-1` score to validate the implementation.
The script can also perform the same summarization using the HF Phi model.
As previously explained, the first step is to build the TensorRT engine as described above using HF weights. You also have to install the requirements:
@@ -95,7 +95,7 @@ pip install -r requirements.txt
The summarization can be done using the [`summarize.py`](../../../summarize.py) script as follows:
```bash
-# Run the summarization task using a TensorRT-LLM model and a single GPU.
+# Run the summarization task using a TensorRT LLM model and a single GPU.
python3 ../../../summarize.py --engine_dir ./phi-engine \
--hf_model_dir /path/to/phi-model \
--batch_size 1 \
@@ -105,7 +105,7 @@ python3 ../../../summarize.py --engine_dir ./phi-engine \
--check_accuracy \
--tensorrt_llm_rouge1_threshold=20
-# Run the summarization task using a TensorRT-LLM model and 2-way tensor parallelism.
+# Run the summarization task using a TensorRT LLM model and 2-way tensor parallelism.
mpirun -n 2 --allow-run-as-root \
python3 ../../../summarize.py --engine_dir ./phi-engine-tp2 \
--hf_model_dir /path/to/phi-model \
@@ -149,7 +149,7 @@ and to run [summarization test](#3-summarization-using-the-phi-model) are same a
### 5. Run Phi-3 with LoRA
-TensorRT-LLM supports running Phi-3-mini/small models with FP16/BF16/FP32 LoRA. In this section, we use Phi-3-mini as an example to show how to run an FP8 base model with FP16 LoRA module.
+TensorRT LLM supports running Phi-3-mini/small models with FP16/BF16/FP32 LoRA. In this section, we use Phi-3-mini as an example to show how to run an FP8 base model with an FP16 LoRA module.
* download the base model and lora model from HF (for example, as sketched below)
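The downloads might look like the following sketch (the base model repository is the public Phi-3-mini checkpoint on HF; the LoRA repository is a placeholder you must replace with your own adapter):

```bash
git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct ./phi-3-mini
# Replace <your-phi3-lora-repo> with an FP16 LoRA adapter trained for Phi-3-mini.
git clone https://huggingface.co/<your-phi3-lora-repo> ./phi-3-mini-lora
```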
diff --git a/examples/models/core/qwen/README.md b/examples/models/core/qwen/README.md
index dc160903ce2..65f66e66402 100644
--- a/examples/models/core/qwen/README.md
+++ b/examples/models/core/qwen/README.md
@@ -1,6 +1,6 @@
# Qwen
-This document shows how to build and run a [Qwen](https://huggingface.co/Qwen) model in TensorRT-LLM on both single GPU, single node multi-GPU.
+This document shows how to build and run a [Qwen](https://huggingface.co/Qwen) model in TensorRT LLM on both single GPU, single node multi-GPU.
- [Qwen](#qwen)
- [Overview](#overview)
@@ -33,7 +33,7 @@ This document shows how to build and run a [Qwen](https://huggingface.co/Qwen) m
## Overview
-The TensorRT-LLM Qwen implementation can be found in [models/qwen](../../../../tensorrt_llm/models/qwen/). The TensorRT-LLM Qwen example code is located in [`examples/models/core/qwen`](./). There is one main file:
+The TensorRT LLM Qwen implementation can be found in [models/qwen](../../../../tensorrt_llm/models/qwen/). The TensorRT LLM Qwen example code is located in [`examples/models/core/qwen`](./). There is one main file:
* [`convert_checkpoint.py`](./convert_checkpoint.py) to build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run the Qwen model.
@@ -89,7 +89,7 @@ For Qwen3 models, we only list the largest models for dense and MoE architecture
## Usage
-The TensorRT-LLM Qwen example code locates at [examples/models/core/qwen](./). It takes HF weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT LLM Qwen example code is located at [examples/models/core/qwen](./). It takes HF weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
### Download model weights
@@ -103,7 +103,7 @@ pip install -r requirements.txt
git lfs install
```
-Download one or more Qwen models that you would like to build to TensorRT-LLM engines. You may download from the [HuggingFace](https://huggingface.co) hub:
+Download one or more Qwen models that you would like to build into TensorRT LLM engines. You may download them from the [HuggingFace](https://huggingface.co) hub:
```bash
git clone https://huggingface.co/Qwen/Qwen-7B-Chat ./tmp/Qwen/7B
@@ -113,9 +113,9 @@ git clone https://huggingface.co/Qwen/Qwen-72B-Chat ./tmp/Qwen/72B
### Build TensorRT engine(s)
-The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF weights to TensorRT-LLM checkpoints.
+The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF weights to TensorRT LLM checkpoints.
-The `trtllm-build` command builds TensorRT-LLM engines from TensorRT-LLM checkpoints. The number of engine files is also same to the number of GPUs used to run inference.
+The `trtllm-build` command builds TensorRT LLM engines from TensorRT LLM checkpoints. The number of engine files is also the same as the number of GPUs used to run inference.
Normally `trtllm-build` only requires a single GPU, but if you already have all the GPUs needed for inference, you can enable parallel building to speed up the engine build by adding the `--workers` argument. Please note that currently the `workers` feature only supports a single node.
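A sketch of a two-GPU build that uses parallel workers (the checkpoint and engine paths are assumptions):

```bash
# Convert with 2-way tensor parallelism, then build both rank engines in parallel.
python3 convert_checkpoint.py --model_dir ./tmp/Qwen/7B \
                              --dtype float16 \
                              --tp_size 2 \
                              --output_dir ./tllm_checkpoint_2gpu_fp16
trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp16 \
             --output_dir ./tmp/Qwen/7B/trt_engines/fp16/2-gpu \
             --gemm_plugin auto \
             --workers 2
```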
@@ -232,7 +232,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_sq \
#### SmoothQuant
-The smoothquant supports Qwen models. Unlike the FP16 build where the HF weights are processed and loaded into the TensorRT-LLM directly, the SmoothQuant needs to load INT8 weights which should be pre-processed before building an engine.
+SmoothQuant is supported for Qwen models. Unlike the FP16 build, where the HF weights are processed and loaded into TensorRT LLM directly, SmoothQuant needs to load INT8 weights that are pre-processed before building an engine.
Example:
```bash
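# Illustrative sketch only; the SmoothQuant-related flag names are assumed from the
# Qwen convert_checkpoint.py options. Pre-process the INT8 weights during conversion,
# then build the engine from the resulting checkpoint as usual.
python3 convert_checkpoint.py --model_dir ./tmp/Qwen/7B \
                              --dtype float16 \
                              --smoothquant 0.5 \
                              --per_token \
                              --per_channel \
                              --output_dir ./tllm_checkpoint_1gpu_sq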
@@ -338,7 +338,7 @@ To run the AWQ Qwen example, the following steps are required:
### Run
-To run a TensorRT-LLM Qwen model using the engines generated by `trtllm-build`
+To run a TensorRT LLM Qwen model using the engines generated by `trtllm-build`
```bash
# With fp16 inference
@@ -586,8 +586,8 @@ Downloading and preparing dataset cnn_dailymail/3.0.0 to /root/.cache/huggingfac
load rouge ...
Downloading builder script: 5.60kB [00:00, 18.9MB/s]
load rouge done
-[11/09/2023-02:24:06] [TRT-LLM] [I] TensorRT-LLM (total latency: 30.13867211341858 sec)
-[11/09/2023-02:24:06] [TRT-LLM] [I] TensorRT-LLM beam 0 result
+[11/09/2023-02:24:06] [TRT-LLM] [I] TensorRT LLM (total latency: 30.13867211341858 sec)
+[11/09/2023-02:24:06] [TRT-LLM] [I] TensorRT LLM beam 0 result
[11/09/2023-02:24:06] [TRT-LLM] [I] rouge1 : 26.35215119137573
[11/09/2023-02:24:06] [TRT-LLM] [I] rouge2 : 9.507814774384485
[11/09/2023-02:24:06] [TRT-LLM] [I] rougeL : 18.171982659482865
@@ -596,7 +596,7 @@ load rouge done
## Qwen3
-TensorRT-LLM now supports Qwen3, the latest version of the Qwen model series. This guide walks you through the examples to run the Qwen3 models using NVIDIA's TensorRT-LLM framework with the PyTorch backend. According to the support matrix, TensorRT-LLM provides comprehensive support for various Qwen3 model variants including:
+TensorRT LLM now supports Qwen3, the latest version of the Qwen model series. This guide walks you through the examples to run the Qwen3 models using NVIDIA's TensorRT LLM framework with the PyTorch backend. According to the support matrix, TensorRT LLM provides comprehensive support for various Qwen3 model variants including:
- Qwen3-0.6B
- Qwen3-1.7B
@@ -607,7 +607,7 @@ TensorRT-LLM now supports Qwen3, the latest version of the Qwen model series. Th
- Qwen3-30B-A3B
- Qwen3-235B-A22B
-Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html) for how to build TensorRT-LLM from source and start a TRT-LLM docker container if needed.
+Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html) for how to build TensorRT LLM from source and start a TRT-LLM docker container if needed.
> [!NOTE]
> This guide assumes that you replace placeholder values (e.g. ``) with the appropriate paths.
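As a minimal illustration of the PyTorch-backend flow, a model from the list above can be served directly (the `trtllm-serve` options shown here, in particular `--backend pytorch`, are assumptions and may differ across TensorRT LLM versions):

```bash
trtllm-serve Qwen/Qwen3-30B-A3B --backend pytorch --host 0.0.0.0 --port 8000
```

Once the server is up, it exposes an OpenAI-compatible endpoint that clients can query on port 8000.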
@@ -924,7 +924,7 @@ For further details, please refer to [speculative-decoding.md](../../../../docs/
### Dynamo
NVIDIA Dynamo is a high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
-Dynamo supports TensorRT-LLM as one of its inference engine. For details on how to use TensorRT-LLM with Dynamo please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md)
+Dynamo supports TensorRT LLM as one of its inference engines. For details on how to use TensorRT LLM with Dynamo, please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md)
## Notes and Troubleshooting
diff --git a/examples/models/core/qwen2audio/README.md b/examples/models/core/qwen2audio/README.md
index 4fe7fb1667a..8f8d2ab4347 100644
--- a/examples/models/core/qwen2audio/README.md
+++ b/examples/models/core/qwen2audio/README.md
@@ -35,7 +35,7 @@
--output_dir=./tllm_checkpoint_1gpu_fp16_wo8
```
-- Build TensorRT-LLM engine
+- Build TensorRT LLM engine
NOTE: `max_prompt_embedding_table_size = query_token_num * max_batch_size`, therefore, if you change `max_batch_size`, `--max_prompt_embedding_table_size` must be reset accordingly.
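For instance, with a hypothetical `query_token_num` of 1024 and a `max_batch_size` of 8, you would pass `--max_prompt_embedding_table_size 8192` (1024 × 8).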
```bash
diff --git a/examples/models/core/qwenvl/README.md b/examples/models/core/qwenvl/README.md
index fa2672e693e..894d0c2ad46 100644
--- a/examples/models/core/qwenvl/README.md
+++ b/examples/models/core/qwenvl/README.md
@@ -30,7 +30,7 @@
--dtype float16
```
-- Build TensorRT-LLM engine
+- Build TensorRT LLM engine
NOTE: `max_prompt_embedding_table_size = query_token_num * max_batch_size`, therefore, if you change `max_batch_size`, `--max_prompt_embedding_table_size` must be reset accordingly.
```bash
diff --git a/examples/models/core/recurrentgemma/README.md b/examples/models/core/recurrentgemma/README.md
index c3c398f6ec0..050709c3859 100644
--- a/examples/models/core/recurrentgemma/README.md
+++ b/examples/models/core/recurrentgemma/README.md
@@ -4,9 +4,9 @@ This document shows how to build and run a [RecurrentGemma](https://github.com/g
## Overview
-The TensorRT-LLM RecurrentGemma implementation can be found in [`tensorrt_llm/models/recurrentgemma/model.py`](../../../../tensorrt_llm/models/recurrentgemma/model.py). The TensorRT-LLM RecurrentGemma example code is located in [`examples/models/core/recurrentgemma`](./). There is one main file:
+The TensorRT LLM RecurrentGemma implementation can be found in [`tensorrt_llm/models/recurrentgemma/model.py`](../../../../tensorrt_llm/models/recurrentgemma/model.py). The TensorRT LLM RecurrentGemma example code is located in [`examples/models/core/recurrentgemma`](./). There is one main file:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the JAX format to the TensorRT-LLM format.
+* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the JAX format to the TensorRT LLM format.
In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
@@ -19,7 +19,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
| Huggingface (HF) | Y | Y | Y | Y | Y | Y |
| Jax | Y | Y | N | N | N | Y |
-* TensorRT-LLM can support different post-training quantization for the Huggingface checkpoints, including FP8, INT8 SmoothQuant, and INT4 AWQ.
+* TensorRT LLM supports different post-training quantization methods for the Huggingface checkpoints, including FP8, INT8 SmoothQuant, and INT4 AWQ.
## Usage
@@ -48,8 +48,8 @@ git clone https://huggingface.co/google/recurrentgemma-2b-flax ./recurrentgemma_
git clone https://huggingface.co/google/recurrentgemma-2b-it-flax ./recurrentgemma_model/recurrentgemma-2b-it-flax
```
-### 2. Convert weights from JAX to TensorRT-LLM format
-The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF/JAX weights to TensorRT-LLM checkpoints. TensorRT-LLM can support different post-training quantization methods. Here we use recurrentgemma-2b-it model as an example to show how to run quantized model.
+### 2. Convert weights from JAX to TensorRT LLM format
+The [`convert_checkpoint.py`](./convert_checkpoint.py) script converts HF/JAX weights to TensorRT LLM checkpoints. TensorRT LLM supports different post-training quantization methods. Here we use the recurrentgemma-2b-it model as an example to show how to run a quantized model.
```bash
# recurrentgemma-2b
@@ -109,7 +109,7 @@ python convert_checkpoint.py --model_dir ${CKPT_2B_IT_FLAX_PATH} \
```
### 3. Build TensorRT engine(s)
-After getting checkpoint, we can use `trtllm-build` command to build TensorRT-LLM engines from TensorRT-LLM checkpoints.
+After getting the checkpoint, we can use the `trtllm-build` command to build TensorRT LLM engines from TensorRT LLM checkpoints.
```bash
# recurrentgemma-2b
diff --git a/examples/models/core/whisper/README.md b/examples/models/core/whisper/README.md
index 4a5d5652cfb..49712ccf793 100755
--- a/examples/models/core/whisper/README.md
+++ b/examples/models/core/whisper/README.md
@@ -1,6 +1,6 @@
# Whisper
-This document shows how to build and run a [whisper model](https://github.com/openai/whisper/tree/main) in TensorRT-LLM on a single GPU.
+This document shows how to build and run a [whisper model](https://github.com/openai/whisper/tree/main) in TensorRT LLM on a single GPU.
- [Whisper](#whisper)
- [Overview](#overview)
@@ -16,7 +16,7 @@ This document shows how to build and run a [whisper model](https://github.com/op
## Overview
-The TensorRT-LLM Whisper example code is located in [`examples/models/core/whisper`](./).
+The TensorRT LLM Whisper example code is located in [`examples/models/core/whisper`](./).
* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert weights from OpenAI Whisper format to TRT-LLM format.
* `trtllm-build` to build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run the Whisper model.
@@ -29,7 +29,7 @@ The TensorRT-LLM Whisper example code is located in [`examples/models/core/whisp
## Usage
-The TensorRT-LLM Whisper example code locates at [examples/models/core/whisper](./). It takes whisper pytorch weights as input, and builds the corresponding TensorRT engines.
+The TensorRT LLM Whisper example code is located at [examples/models/core/whisper](./). It takes Whisper PyTorch weights as input and builds the corresponding TensorRT engines.
### Build TensorRT engine(s)
@@ -44,7 +44,7 @@ wget --directory-prefix=assets https://raw.githubusercontent.com/yuekaizhang/Tri
wget --directory-prefix=assets https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt
```
-TensorRT-LLM Whisper builds TensorRT engine(s) from the pytorch checkpoint.
+TensorRT LLM Whisper builds TensorRT engine(s) from the pytorch checkpoint.
```bash
# install requirements first
@@ -57,7 +57,7 @@ MAX_BATCH_SIZE=8
checkpoint_dir=whisper_large_v3_weights_${WEIGHT_ONLY_PRECISION}
output_dir=whisper_large_v3_${WEIGHT_ONLY_PRECISION}
-# Convert the large-v3 model weights into TensorRT-LLM format.
+# Convert the large-v3 model weights into TensorRT LLM format.
python3 convert_checkpoint.py \
--use_weight_only \
--weight_only_precision $WEIGHT_ONLY_PRECISION \
@@ -136,7 +136,7 @@ Calculates the character error rate (CER) instead of the word error rate (WER) f
These options allow you to select different decoding audio datasets from Hugging Face.
### Distil-Whisper
-TensorRT-LLM also supports using [distil-whisper's](https://github.com/huggingface/distil-whisper) different models by first converting their params and weights from huggingface's naming format to [openai whisper](https://github.com/openai/whisper) naming format.
+TensorRT LLM also supports the different [distil-whisper](https://github.com/huggingface/distil-whisper) models by first converting their params and weights from the Hugging Face naming format to the [openai whisper](https://github.com/openai/whisper) naming format.
You can do so by running the script [distil_whisper/convert_from_distil_whisper.py](./convert_from_distil_whisper.py) as follows:
```bash
@@ -196,4 +196,4 @@ python3 run.py --engine_dir $output_dir --dataset hf-internal-testing/librispeec
### Acknowledgment
-This implementation of TensorRT-LLM for Whisper has been adapted from the [NVIDIA TensorRT-LLM Hackathon 2023](https://github.com/NVIDIA/trt-samples-for-hackathon-cn/tree/master/Hackathon2023) submission of Jinheng Wang, which can be found in the repository [Eddie-Wang-Hackathon2023](https://github.com/Eddie-Wang1120/Eddie-Wang-Hackathon2023) on GitHub. We extend our gratitude to Jinheng for providing a foundation for the implementation.
+This implementation of TensorRT LLM for Whisper has been adapted from the [NVIDIA TensorRT LLM Hackathon 2023](https://github.com/NVIDIA/trt-samples-for-hackathon-cn/tree/master/Hackathon2023) submission of Jinheng Wang, which can be found in the repository [Eddie-Wang-Hackathon2023](https://github.com/Eddie-Wang1120/Eddie-Wang-Hackathon2023) on GitHub. We extend our gratitude to Jinheng for providing a foundation for the implementation.
diff --git a/examples/ngram/README.md b/examples/ngram/README.md
index 60201ce063f..7be1bab5fea 100644
--- a/examples/ngram/README.md
+++ b/examples/ngram/README.md
@@ -1,6 +1,6 @@
# NGram Speculative Decoding
-This document shows how to build and run a model using NGram speculative decoding (supported as `ASSISTED_GENERATION` in transformers and vLLM, source: [GitHub](https://github.com/apoorvumang/prompt-lookup-decoding/tree/main)) in TensorRT-LLM on single GPU, or single node multiple GPU.
+This document shows how to build and run a model using NGram speculative decoding (supported as `ASSISTED_GENERATION` in transformers and vLLM, source: [GitHub](https://github.com/apoorvumang/prompt-lookup-decoding/tree/main)) in TensorRT LLM on a single GPU or on a single node with multiple GPUs.
## Overview
diff --git a/examples/openai_triton/README.md b/examples/openai_triton/README.md
index b0428c2541d..b5f39d10597 100644
--- a/examples/openai_triton/README.md
+++ b/examples/openai_triton/README.md
@@ -1,6 +1,6 @@
# Integration for OpenAI Triton
-The typical approach to integrate a kernel into TensorRT-LLM is to create TensorRT plugins.
+The typical approach to integrate a kernel into TensorRT LLM is to create TensorRT plugins.
Specifically, for integrating OpenAI Triton kernels, there are two methods:
1. Creating a TensorRT plugin manually; you can refer to the [manual plugin example](./manual_plugin/) for details,
diff --git a/examples/openai_triton/manual_plugin/README.md b/examples/openai_triton/manual_plugin/README.md
index 40e6f124b93..5c8b5d481d5 100644
--- a/examples/openai_triton/manual_plugin/README.md
+++ b/examples/openai_triton/manual_plugin/README.md
@@ -4,12 +4,12 @@ This document describes how to build and run a custom plugin leveraging [OpenAI
The workflow can be summarized as follows.
1. Implement a kernel using Triton in Python.
2. Compile that kernel using Triton AoT (Ahead-of-Time) compilation tool to generate C files.
- 3. Implement a custom TensorRT-LLM plugin to execute the compiled kernel.
+ 3. Implement a custom TensorRT LLM plugin to execute the compiled kernel.
4. Build the TensorRT engine.
5. It is ready to be executed by TensorRT.
-In this example, we show how to create a TensorRT-LLM plugin to wrap a [Fused Attention]((fmha_triton.py)) kernel implemented in OpenAI Triton.
-As a prerequisite, it is necessary to have the TensorRT-LLM C++ runtime library.
+In this example, we show how to create a TensorRT LLM plugin to wrap a [Fused Attention](fmha_triton.py) kernel implemented in OpenAI Triton.
+As a prerequisite, it is necessary to have the TensorRT LLM C++ runtime library.
The instructions to build that library can be found [here](../../README.md#build-from-source).
## 1. Triton AoT Preparation
@@ -71,7 +71,7 @@ If GPU resources are limited, it is recommended to adjust the number of stages o
## 2. Implement a Custom TensorRT Plugin
-This section describes how to implement a custom plugin for TensorRT-LLM to execute the Triton kernel created in the previous section.
+This section describes how to implement a custom plugin for TensorRT LLM to execute the Triton kernel created in the previous section.
We provide an example of plugin implementation.
- TritonFlashAttentionPlugin([.cpp](TritonFlashAttentionPlugin.cpp), [.h](TritonFlashAttentionPlugin.h)): TensorRT plugin.
- [plugin.py](plugin.py): Python wrapper.
@@ -88,15 +88,15 @@ mkdir -p build && cd build
cmake .. && make
cd ..
```
-As mentioned in the previous section, it is necessary to have the TensorRT-LLM C++ runtime library.
+As mentioned in the previous section, it is necessary to have the TensorRT LLM C++ runtime library.
If you want to specify the library paths, run:
```bash
cmake -DTRT_LIB_DIR= -DTRT_INCLUDE_DIR= -DTRT_LLM_LIB_DIR= ..
```
If the build is successful, you should be able to find a shared library for the custom plugin at `build/libtrt_llm_custom_plugins.so`.
-A Python wrapper of the Fused Multihead Attention (FMHA) operator and the corresponding TensorRT-LLM layer are implemented in [plugin.py](plugin.py).
-It is similar to other TensorRT-LLM operators and layers implemented in [functional.py](../../tensorrt_llm/functional.py) and [layers](../../tensorrt_llm/layers), respectively.
+A Python wrapper of the Fused Multihead Attention (FMHA) operator and the corresponding TensorRT LLM layer are implemented in [plugin.py](plugin.py).
+It is similar to other TensorRT LLM operators and layers implemented in [functional.py](../../tensorrt_llm/functional.py) and [layers](../../tensorrt_llm/layers), respectively.
That FMHA operator uses the custom plugin that wraps the functions generated from the Triton kernel.
## 3. Build and Run the TensorRT Engine
diff --git a/examples/openai_triton/plugin_autogen/README.md b/examples/openai_triton/plugin_autogen/README.md
index e671c6f62fa..0c046330fa0 100644
--- a/examples/openai_triton/plugin_autogen/README.md
+++ b/examples/openai_triton/plugin_autogen/README.md
@@ -20,7 +20,7 @@ There are three command-line arguments:
1. `workspace`: This is the root directory to hold the temporary generation files. PluginGen should not alter anything outside of the workspace,
2. `kernel_config`: This is a Python file that holds a variable called `KERNELS` of type `List[KernelMetaData]`. PluginGen can process one or more kernels at a time,
-3. `tensorrt_llm_include_path`: This is the path to the TensorRT-LLM include directory. It is used to include the TensorRT-LLM header files in the generated plugin.
+3. `tensorrt_llm_include_path`: This is the path to the TensorRT LLM include directory. It is used to include the TensorRT LLM header files in the generated plugin.
You can refer to [./kernel_config.py](./kernel_config.py) for an example of `KernelMetaData` for the Fused Attention kernel. It contains several fields:
@@ -41,7 +41,7 @@ The user should provide the kernel configurations as well as the Triton kernel s
4. Perform the compilation and generate `libtriton_plugins.so`.
5. Generate a `functional.py` containing a Python wrapper for this plugin.
-After the generation, you should have `libtriton_plugins.so` and `functional.py` in the workspace. You can use them to integrate the Triton kernel by simply using the corresponding Python methods in the generated `functional.py` during the model-building stage, just like other layers located in the TensorRT-LLM built-in `functional.py`.
+After the generation, you should have `libtriton_plugins.so` and `functional.py` in the workspace. You can use them to integrate the Triton kernel by simply using the corresponding Python methods in the generated `functional.py` during the model-building stage, just like other layers located in the TensorRT LLM built-in `functional.py`.
## End-to-End Example for FHMA Kernel Integration
@@ -79,7 +79,7 @@ PluginGen will generate all the necessary files within the `./tmp` directory. Th
### Post-Stage: Use the Plugin
-To use the plugin in a TensorRT-LLM model, please refer to the generated `output/functional.py`. It should contain Python wrappers for all the plugins. To use the plugins, first import `functional.py` and then use the corresponding Python methods to build the model.
+To use the plugin in a TensorRT LLM model, please refer to the generated `output/functional.py`. It should contain Python wrappers for all the plugins. To use the plugins, first import `functional.py` and then use the corresponding Python methods to build the model.
For an example of using the Fused Attention plugin in a model, please refer to [build_engine.py](./build_engine.py) for building the TensorRT engine and [run_engine.py](./run_engine.py) for running the engine in the runtime.
diff --git a/examples/python_plugin/README.md b/examples/python_plugin/README.md
index c7ca1b9ab98..8079d381109 100644
--- a/examples/python_plugin/README.md
+++ b/examples/python_plugin/README.md
@@ -1,18 +1,18 @@
-# TensorRT-LLM Python Plugin
+# TensorRT LLM Python Plugin
-TensorRT-LLM provides a Python plugin interface to integrate TensorRT-LLM with pure Python.
+TensorRT LLM provides a Python plugin interface to integrate TensorRT LLM with pure Python.
+ `openai_triton_plugin`: plugin package
-+ `build_lookup.py`: Build a TensorRT engine with TensorRT-LLM Python plugin
++ `build_lookup.py`: Build a TensorRT engine with TensorRT LLM Python plugin
+ `run_lookup.py`: Run the engine and compare the result with PyTorch
## Plugin Definition
The following code shows how to create a look-up plugin.
-We only need to do a few things to define a TensorRT-LLM plugin.
+We only need to do a few things to define a TensorRT LLM plugin.
1. Inherit the `PluginBase`.
-2. Register the plugin class to TensorRT-LLM by using `@trtllm_plugin("your_plugin_name")`.
+2. Register the plugin class to TensorRT LLM by using `@trtllm_plugin("your_plugin_name")`.
3. Define an `__init__` function and initialize the base class.
4. Define a shape and dtype inference function.
5. Define the compute flow.
@@ -55,7 +55,7 @@ class LookUpPlugin(PluginBase):
```
-## Adding a TensorRT-LLM Plugin to a Network
+## Adding a TensorRT LLM Plugin to a Network
You only need an instance of the plugin object and then call it with `tensorrt_llm.Tensor` as input arguments.
@@ -80,7 +80,7 @@ with tensorrt_llm.net_guard(network):
## Plugin Code Structure
-Because TensorRT-LLM performs plugin registration when importing the custom TensorRT-LLM plugin, there are some code structure conventions to register the plugin at runtime.
+Because TensorRT LLM performs plugin registration when importing the custom TensorRT LLM plugin, there are some code structure conventions to register the plugin at runtime.
```text
plugin_lib
@@ -99,7 +99,7 @@ from .lookup_plugin import LookUpPlugin
__all__ = ["LookUpPlugin"]
```
-## Deserialize an Engine with TensorRT-LLM Plugin
+## Deserialize an Engine with TensorRT LLM Plugin
During deserialization, TensorRT needs to find the user-defined plugins. Thus, we need to import the plugins once to register them. If the plugins follow the code structure convention, users only need to import that package to register all the custom plugins.
diff --git a/examples/quantization/README.md b/examples/quantization/README.md
index 94b3510ac18..e74736b61b8 100644
--- a/examples/quantization/README.md
+++ b/examples/quantization/README.md
@@ -1,10 +1,10 @@
-# TensorRT-LLM Quantization Toolkit Installation Guide
+# TensorRT LLM Quantization Toolkit Installation Guide
## Introduction
This document introduces:
-- The steps to install the TensorRT-LLM quantization toolkit.
+- The steps to install the TensorRT LLM quantization toolkit.
- The Python APIs to quantize the models.
The detailed LLM quantization recipe is provided in the README.md of the corresponding model examples.
@@ -129,13 +129,13 @@ FP_O * output_scale = FP8_O
### Format of Mixed Precision Checkpoints
-ModelOpt can produce a mixed precision TensorRT-LLM checkpoint. After producing the quantized checkpoint, you can build engine directly by `trtllm-build` command:
+ModelOpt can produce a mixed precision TensorRT LLM checkpoint. After producing the quantized checkpoint, you can build the engine directly with the `trtllm-build` command:
```bash
trtllm-build --checkpoint_dir --output_dir $OUTPUT_PATH
```
If you have special requirements for the model weights, such as INT4 for the MLP and INT8 for the rest, you need to generate the checkpoint and config files yourself.
-The `trtllm-build` command consumes the same format of weights, which is presented in [TensorRT-LLM checkpoint formats](https://nvidia.github.io/TensorRT-LLM/architecture/checkpoint.html), but has different quantization method for every linear. Therefore, each layer, such as layer30.mlp.fc, layer30.attention.dense, and so on, keeps the same model weights according to the quantization formats in TensorRT-LLM checkpoint. What's more, the `quantization` field in `config.json` will be like this:
+The `trtllm-build` command consumes the same weight format, which is described in [TensorRT LLM checkpoint formats](https://nvidia.github.io/TensorRT-LLM/architecture/checkpoint.html), but allows a different quantization method for every linear layer. Each layer, such as layer30.mlp.fc or layer30.attention.dense, keeps its model weights in the corresponding quantization format of the TensorRT LLM checkpoint. In addition, the `quantization` field in `config.json` will look like this:
```
"quantization": {
"quant_algo": "MIXED_PRECISION",
@@ -171,11 +171,11 @@ There will be another file about per-layer quantization information named `quant
}
```
-TensorRT-LLM will automatically read `quant_cfg.json` after recogniziong the `MIXED_PRECISION` quantization method in `config.json`. All the specific algorithm keeps the same as what in `quantization` field before. If some layers are not listed, they'll be treated as no quantization.
+TensorRT LLM automatically reads `quant_cfg.json` after recognizing the `MIXED_PRECISION` quantization method in `config.json`. Each algorithm-specific setting keeps the same meaning as in the `quantization` field described before. Layers that are not listed are treated as unquantized.
## APIs
-[`quantize.py`](./quantize.py) uses the quantization toolkit to calibrate the PyTorch models and export TensorRT-LLM checkpoints. Each TensorRT-LLM checkpoint contains a config file (in .json format) and one or several rank weight files (in .safetensors format). It will produce one another quantization config for per-layer's information when setting auto quantization. The checkpoints can be directly used by `trtllm-build` command to build TensorRT-LLM engines. See this [`doc`](../../docs/source/architecture/checkpoint.md) for more details on the TensorRT-LLM checkpoint format.
+[`quantize.py`](./quantize.py) uses the quantization toolkit to calibrate the PyTorch models and export TensorRT LLM checkpoints. Each TensorRT LLM checkpoint contains a config file (in .json format) and one or several rank weight files (in .safetensors format). When auto quantization is enabled, it also produces an additional quantization config with per-layer information. The checkpoints can be used directly by the `trtllm-build` command to build TensorRT LLM engines. See this [`doc`](../../docs/source/architecture/checkpoint.md) for more details on the TensorRT LLM checkpoint format.
> *This quantization step may take a long time to finish and requires large GPU memory. Please use a server grade GPU if a GPU out-of-memory error occurs*
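A representative FP8 calibration run might look like the following sketch (the flag values and paths are assumptions; adjust the model directory, quantization format, and calibration size to your needs):

```bash
python3 quantize.py --model_dir /path/to/hf-model \
                    --dtype float16 \
                    --qformat fp8 \
                    --kv_cache_dtype fp8 \
                    --calib_size 512 \
                    --output_dir ./quantized_fp8_ckpt

# The exported checkpoint can then be consumed directly by trtllm-build.
trtllm-build --checkpoint_dir ./quantized_fp8_ckpt --output_dir ./engine_fp8
```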
@@ -227,7 +227,7 @@ with torch.no_grad():
### Export Quantized Model
-After the model is quantized, it can be exported to a TensorRT-LLM checkpoint, which includes
+After the model is quantized, it can be exported to a TensorRT LLM checkpoint, which includes
- One json file recording the model structure and metadata, and
- One or several rank weight files storing quantized model weights and scaling factors.
diff --git a/examples/redrafter/README.md b/examples/redrafter/README.md
index 0e08871f2dc..5f7c2bec110 100644
--- a/examples/redrafter/README.md
+++ b/examples/redrafter/README.md
@@ -1,6 +1,6 @@
# Recurrent Drafter (ReDrafter) Speculative Decoding
-This document describes how to build and run a model using the ReDrafter speculative decoding technique ([`Github`](https://github.com/apple/ml-recurrent-drafter), [`Paper`](https://arxiv.org/abs/2403.09919)) in TensorRT-LLM on single GPU, single node multiple GPU.
+This document describes how to build and run a model using the ReDrafter speculative decoding technique ([`Github`](https://github.com/apple/ml-recurrent-drafter), [`Paper`](https://arxiv.org/abs/2403.09919)) in TensorRT LLM on a single GPU or on a single node with multiple GPUs.
## Overview
Similar to other speculative decoding techniques, ReDrafter contains two major components: a base LLM and a drafter model which contains one language model (LM) head.
@@ -26,7 +26,7 @@ While choosing a large number of beams and maximum draft length per beam can lea
* Tensor Parallel
## Usage
-The TensorRT-LLM ReDrafter example code is located in [`examples/redrafter`](./). There is one [`convert_checkpoint.py`](./convert_checkpoint.py) file to convert and build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run models with ReDrafter decoding support.
+The TensorRT LLM ReDrafter example code is located in [`examples/redrafter`](./). There is one [`convert_checkpoint.py`](./convert_checkpoint.py) file to convert and build the [TensorRT](https://developer.nvidia.com/tensorrt) engine(s) needed to run models with ReDrafter decoding support.
**NOTE**: At the time of writing this, the Drafter checkpoint is not public. The following assumes that the base model is Vicuna 7B and you have access to a Drafter checkpoint for this model.
@@ -41,7 +41,7 @@ git clone https://huggingface.co/lmsys/vicuna-7b-v1.3
# assuming the drafter checkpoint is located in dir "vicuna-7b-drafter"
```
-We use `convert_checkpoint.py` script to convert the model for ReDrafter decoding into TensorRT-LLM checkpoint format.
+We use the `convert_checkpoint.py` script to convert the model for ReDrafter decoding into the TensorRT LLM checkpoint format.
You can specify the 3 hyperparameters (described above) during this conversion. The resulting config.json file can be modified to alter these hyperparameters before the engine building process.
```bash
diff --git a/examples/sample_weight_stripping/README.md b/examples/sample_weight_stripping/README.md
index a005f0904b1..a427dd3df45 100644
--- a/examples/sample_weight_stripping/README.md
+++ b/examples/sample_weight_stripping/README.md
@@ -14,11 +14,11 @@
- [Engine Plan File Size Results](#engine-plan-file-size-results)
- [Prototype](#prototype)
* [Checkpoint Pruner](#checkpoint-pruner)
- * [Pruning a TensorRT-LLM Checkpoint](#pruning-a-tensorrt-llm-checkpoint)
+ * [Pruning a TensorRT LLM Checkpoint](#pruning-a-tensorrt-llm-checkpoint)
## Overview
-This workflow introduces a new script `trtllm-refit`. `trtllm-refit` allows you to refit the generated engine with weights from any TensorRT-LLM checkpoint matching the same architecture, so long as you build the engine as refittable or stripped.
+This workflow introduces a new script `trtllm-refit`. `trtllm-refit` allows you to refit the generated engine with weights from any TensorRT LLM checkpoint matching the same architecture, so long as you build the engine as refittable or stripped.
### Build Weights Stripped Engine
TensorRT can generate refittable engines with the same performance as non-refittable ones when the TensorRT builder optimizes under the assumption that the engine will be refitted with weights identical to those provided at build time. Those refittable weights can be stripped to reduce the engine plan file size, with the option to subsequently supply them via the refit interface.
@@ -30,7 +30,7 @@ trtllm-build --strip_plan --checkpoint_dir ${CHECKPOINT_DIR} --output_dir ${ENGI
```
### Engine Refitter
-The refitter allows you to refit an engine with weights in a TensorRT-LLM checkpoint. It does this by doing a textual match between engine and checkpoint weight names. In order for the refitter to work, the engine must be built with refitting enabled. This can be accomplished by passing `--strip_plan` to `trtllm-build`.
+The refitter allows you to refit an engine with weights in a TensorRT LLM checkpoint. It does this by doing a textual match between engine and checkpoint weight names. In order for the refitter to work, the engine must be built with refitting enabled. This can be accomplished by passing `--strip_plan` to `trtllm-build`.
After building a stripped engine via `trtllm-build`, run
@@ -61,7 +61,7 @@ wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/vocab.json
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/merges.txt
```
-2. Convert the Hugging Face checkpoint into TensorRT-LLM format.
+2. Convert the Hugging Face checkpoint into TensorRT LLM format.
Run the command lines below in the [`examples/models/contrib/gpt`](../gptj) directory.
```bash
# Build a float16 checkpoint using HF weights.
@@ -112,7 +112,7 @@ python3 ../summarize.py --engine_dir ./trt_engines/gptj_fp16_tp1.refit \
1. Download the llama-7b-hf checkpoint and saved in /llm-models/llama-models/llama-7b-hf/.
-2. Calibrate the checkpoint and convert into TensorRT-LLM format.
+2. Calibrate the checkpoint and convert into TensorRT LLM format.
Run the command lines below in the [`examples/models/core/llama`](../llama) directory.
```bash
# Calibrate INT4 using AMMO.
@@ -153,7 +153,7 @@ python3 ../summarize.py --engine_dir trt_int4_AWQ_full_from_wtless \
1. Download the llama-7b-hf checkpoint and saved in /llm-models/llama-models/llama-7b-hf/.
-2. Convert the checkpoint into TensorRT-LLM format.
+2. Convert the checkpoint into TensorRT LLM format.
Run the command lines below in the [`examples/models/core/llama`](../llama) directory.
```bash
python3 convert_checkpoint.py --model_dir /llm-models/llama-models/llama-7b-hf/ \
@@ -193,7 +193,7 @@ python3 ../summarize.py --engine_dir ./engines/llama-7b-hf-fp16-woq-1gpu-wtless-
1. Download the llama-v2-70b-hf checkpoint and saved in /llm-models/llama-models-v2/llama-v2-70b-hf/.
-2. Calibrate the checkpoint and convert into TensorRT-LLM format.
+2. Calibrate the checkpoint and convert into TensorRT LLM format.
Run the command lines below in the [`examples/models/core/llama`](../llama) directory.
```bash
# Calibrate FP8 using AMMO.
@@ -241,16 +241,16 @@ python3 ../summarize.py --engine_dir engines/llama2-70b-hf-fp8-tp2.refit \
## Prototype
### Checkpoint Pruner
-The checkpoint pruner allows you to strip `Conv` and `Gemm` weights out of a TensorRT-LLM [checkpoint](https://nvidia.github.io/TensorRT-LLM/latest/architecture/checkpoint.html). Since these make up the vast majority of weights, the pruner will decrease the size of your checkpoint up to 99%.
+The checkpoint pruner allows you to strip `Conv` and `Gemm` weights out of a TensorRT LLM [checkpoint](https://nvidia.github.io/TensorRT-LLM/latest/architecture/checkpoint.html). Since these make up the vast majority of weights, the pruner will decrease the size of your checkpoint by up to 99%.
-When building an engine with a pruned checkpoint, TensorRT-LLM fills in the missing weights with random ones. These weights should later be [refit](#engine-refitter) with the original weights to preserve the intended behavior.
+When building an engine with a pruned checkpoint, TensorRT LLM fills in the missing weights with random ones. These weights should later be [refit](#engine-refitter) with the original weights to preserve the intended behavior.
Building an engine from a pruned checkpoint will also allow the engine to be [refit](#engine-refitter).
-#### Pruning a TensorRT-LLM Checkpoint
+#### Pruning a TensorRT LLM Checkpoint
1. Install [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md) either through [pip](https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md#installation) or [from the source](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/installation/build-from-source-linux.md).
-2. Download a model of your choice and convert it to a TensorRT-LLM checkpoint ([llama instructions](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/llama/README.md#usage)).
+2. Download a model of your choice and convert it to a TensorRT LLM checkpoint ([llama instructions](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/llama/README.md#usage)).
3. (Optional) Run the `trtllm-prune` command.
```bash
# Prunes the TRT-LLM checkpoint at ${CHECKPOINT_DIR}, and stores it in the directory ${CHECKPOINT_DIR}.pruned
diff --git a/examples/wide_ep/slurm_scripts/README.md b/examples/wide_ep/slurm_scripts/README.md
index 3bd5e926b21..611737a875e 100644
--- a/examples/wide_ep/slurm_scripts/README.md
+++ b/examples/wide_ep/slurm_scripts/README.md
@@ -1,6 +1,6 @@
-# TensorRT-LLM Wide-EP Benchmark Scripts
+# TensorRT LLM Wide-EP Benchmark Scripts
-This directory contains scripts for benchmarking TensorRT-LLM wide-ep performance using SLURM job scheduler.
+This directory contains scripts for benchmarking TensorRT LLM wide-ep performance using the SLURM job scheduler.
## ⚠️ DISCLAIMER
@@ -28,7 +28,7 @@ Note that, core implementation of the slurm scripts are included in `examples/di
Before running the scripts, ensure you have:
- Access to a SLURM cluster
-- Container image with TensorRT-LLM installed
+- Container image with TensorRT LLM installed
- Model files accessible on the cluster
- Required environment variables set
diff --git a/tensorrt_llm/_torch/auto_deploy/custom_ops/README.md b/tensorrt_llm/_torch/auto_deploy/custom_ops/README.md
index 6bef175199b..53bd48d29d5 100644
--- a/tensorrt_llm/_torch/auto_deploy/custom_ops/README.md
+++ b/tensorrt_llm/_torch/auto_deploy/custom_ops/README.md
@@ -38,5 +38,5 @@ The table below lists the operators ordered by their backend.
| `torch.ops.auto_deploy.triton_attention_fused_flattened_mla_with_cache` | Triton fused flattened Multi-head Latent Attention with cache support |
| `torch.ops.auto_deploy.triton_rope_on_flattened_inputs` | Triton RoPE on flattened inputs |
| `torch.ops.auto_deploy.triton_rope_with_input_pos` | Triton RoPE with input positions |
-| `torch.ops.auto_deploy.trtllm_moe_fused` | TensorRT-LLM fused MoE implementation |
-| `torch.ops.auto_deploy.trtllm_dist_fused_linear_all_reduce` | TensorRT-LLM fused linear layer followed by all-reduce operation |
+| `torch.ops.auto_deploy.trtllm_moe_fused` | TensorRT LLM fused MoE implementation |
+| `torch.ops.auto_deploy.trtllm_dist_fused_linear_all_reduce` | TensorRT LLM fused linear layer followed by all-reduce operation |
diff --git a/tensorrt_llm/scaffolding/README.md b/tensorrt_llm/scaffolding/README.md
index 5f81141928c..ff6adc0672e 100644
--- a/tensorrt_llm/scaffolding/README.md
+++ b/tensorrt_llm/scaffolding/README.md
@@ -23,7 +23,7 @@ Now Scaffolding is a module in TensorRT-LLM, so users just need to install Tenso
``` bash
python examples/scaffolding/run_basic_generation.py --model_dir PATH/TO/MODEL
```
-This example run the generation with TensorRT-LLM backend. It shows the step of using Scaffolding. Users firstly need to create `Controller` and `Worker` instance, then map the worker tag to the worker instance, finally create the `ScaffoldingLlm` instance and run the request. It also shows how to run scaffolding on asyncio and run the batched request.
+This example runs generation with the TensorRT LLM backend and shows the steps of using Scaffolding. Users first need to create `Controller` and `Worker` instances, then map the worker tag to the worker instance, and finally create the `ScaffoldingLlm` instance and run the request. It also shows how to run Scaffolding with asyncio and how to run batched requests.
[More examples](../../examples/scaffolding)
These examples show how to run more complex methods such as majority voting and best-of-n, how to collect statistics on the output tokens with the decorator, and how to run a dataset concurrently and collect statistics on the results.
diff --git a/tests/README.md b/tests/README.md
index 69c39e9a24b..7e6d439ea45 100644
--- a/tests/README.md
+++ b/tests/README.md
@@ -9,7 +9,7 @@ Unit test should be small, fast, and test only for specific function.
If you need to run them locally, the only dependencies are `requirements-dev.txt`.
```bash
-# in tensorrt-llm source repo root dir
+# in TensorRT LLM source repo root dir
# use editable install, such that your local changes will be used immediately in the tests w/o another install
# see https://setuptools.pypa.io/en/latest/userguide/development_mode.html
pip install -e ./
diff --git a/tests/integration/README.md b/tests/integration/README.md
index 1cdb69cd2d0..48999db930f 100644
--- a/tests/integration/README.md
+++ b/tests/integration/README.md
@@ -26,7 +26,7 @@ All the perf test names are in the form of `perf/test_perf.py::test_perf[...]` w
Below are some specific pytest options used for perf tests
```bash
-# execute these in the tensorrt-llm source repo root dir.
+# execute these in the TensorRT LLM source repo root dir.
# install dependencies, do not need to do it every time if already installed.
pip install -r requirements-dev.txt
diff --git a/tests/integration/defs/perf/README_release_test.md b/tests/integration/defs/perf/README_release_test.md
index 2fe42147c7d..0fdf4eaa855 100644
--- a/tests/integration/defs/perf/README_release_test.md
+++ b/tests/integration/defs/perf/README_release_test.md
@@ -1,12 +1,12 @@
-# TensorRT-LLM Performance Test Flow (Default PyTorch Flow)
+# TensorRT LLM Performance Test Flow (Default PyTorch Flow)
## Overview
-This document describes the complete TensorRT-LLM performance testing workflow, particularly for the default PyTorch backend testing process for release testing.
+This document describes the complete TensorRT LLM performance testing workflow, particularly for the default PyTorch backend testing process for release testing.
## 1. Test Scripts
### Main Test Script
-The main script for TensorRT-LLM performance testing is `test_perf.py`, which is responsible for executing all performance test cases.
+The main script for TensorRT LLM performance testing is `test_perf.py`, which is responsible for executing all performance test cases.
### Performance Metrics
For trtllm-bench, the test extracts the following key performance metrics from logs:
diff --git a/triton_backend/all_models/disaggregated_serving/README.md b/triton_backend/all_models/disaggregated_serving/README.md
index 51998a324a0..9ebb917c752 100644
--- a/triton_backend/all_models/disaggregated_serving/README.md
+++ b/triton_backend/all_models/disaggregated_serving/README.md
@@ -26,7 +26,7 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
!-->
-# Running Disaggregated Serving with Triton TensorRT-LLM Backend
+# Running Disaggregated Serving with Triton TensorRT LLM Backend
## Overview
diff --git a/triton_backend/all_models/disaggregated_serving/disaggregated_serving.md b/triton_backend/all_models/disaggregated_serving/disaggregated_serving.md
index 51998a324a0..9ebb917c752 100644
--- a/triton_backend/all_models/disaggregated_serving/disaggregated_serving.md
+++ b/triton_backend/all_models/disaggregated_serving/disaggregated_serving.md
@@ -26,7 +26,7 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
!-->
-# Running Disaggregated Serving with Triton TensorRT-LLM Backend
+# Running Disaggregated Serving with Triton TensorRT LLM Backend
## Overview
diff --git a/triton_backend/ci/README.md b/triton_backend/ci/README.md
index a54b3675243..55cff0967de 100644
--- a/triton_backend/ci/README.md
+++ b/triton_backend/ci/README.md
@@ -26,7 +26,7 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->
-# Testing TensorRT-LLM backend
+# Testing TensorRT LLM backend
Tests in this CI directory can be run manually to provide extensive testing.
@@ -66,7 +66,7 @@ requests to the deployed `ensemble` model.
The ensemble model is composed of three models: `preprocessing`, `tensorrt_llm` and `postprocessing`:
- "preprocessing": This model is used for tokenizing, meaning the conversion from prompts(string) to input_ids(list of ints).
-- "tensorrt_llm": This model is a wrapper of your TensorRT-LLM model and is used for inferencing
+- "tensorrt_llm": This model is a wrapper of your TensorRT LLM model and is used for inferencing
- "postprocessing": This model is used for de-tokenizing, meaning the conversion from output_ids(list of ints) to outputs(string).
The end-to-end latency includes the total latency of the three parts of an ensemble model.