
Commit ae1a389

nv-guomingz authored and dominicshanshan committed
[None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … (NVIDIA#7850)
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
1 parent 1907ab8 commit ae1a389

File tree

95 files changed (+842, −582 lines)


README.md

Lines changed: 35 additions & 35 deletions
Large diffs are not rendered by default.

cpp/CMakeLists.txt

Lines changed: 3 additions & 3 deletions
```diff
@@ -68,7 +68,7 @@ else()
   message(STATUS "NVTX is enabled")
 endif()
 
-# Add TensorRT-LLM Gen export interface and CUDA support
+# Add TensorRT LLM Gen export interface and CUDA support
 add_compile_definitions("TLLM_GEN_EXPORT_INTERFACE")
 add_compile_definitions("TLLM_ENABLE_CUDA")
 
@@ -138,9 +138,9 @@ execute_process(
   OUTPUT_STRIP_TRAILING_WHITESPACE)
 
 if(TRTLLM_VERSION_RESULT EQUAL 0)
-  message(STATUS "TensorRT-LLM version: ${TRTLLM_VERSION}")
+  message(STATUS "TensorRT LLM version: ${TRTLLM_VERSION}")
 else()
-  message(FATAL_ERROR "Failed to determine Tensorrt-LLM version")
+  message(FATAL_ERROR "Failed to determine TensorRT LLM version")
 endif()
 
 configure_file(
```

cpp/cmake/modules/cuda_configuration.cmake

Lines changed: 1 addition & 1 deletion
```diff
@@ -116,7 +116,7 @@ function(setup_cuda_architectures)
     unset(CMAKE_CUDA_ARCHITECTURES_RAW)
     message(
       STATUS
-      "Setting CMAKE_CUDA_ARCHITECTURES to all enables all architectures TensorRT-LLM optimized for, "
+      "Setting CMAKE_CUDA_ARCHITECTURES to all enables all architectures TensorRT LLM optimized for, "
       "not all architectures CUDA compiler supports.")
   elseif(CMAKE_CUDA_ARCHITECTURES_RAW STREQUAL "all-major")
     message(
```

docker/README.md

Lines changed: 4 additions & 4 deletions
````diff
@@ -2,12 +2,12 @@
 
 ## Multi-stage Builds with Docker
 
-TensorRT-LLM can be compiled in Docker using a multi-stage build implemented in [`Dockerfile.multi`](Dockerfile.multi).
+TensorRT LLM can be compiled in Docker using a multi-stage build implemented in [`Dockerfile.multi`](Dockerfile.multi).
 The following build stages are defined:
 
 * `devel`: this image provides all dependencies for building TensorRT-LLM.
 * `wheel`: this image contains the source code and the compiled binary distribution.
-* `release`: this image has the binaries installed and contains TensorRT-LLM examples in `/app/tensorrt_llm`.
+* `release`: this image has the binaries installed and contains TensorRT LLM examples in `/app/tensorrt_llm`.
 
 ## Building Docker Images with GNU `make`
 
@@ -19,7 +19,7 @@ separated by `_`. The following actions are available:
 * `<stage>_push`: pushes the docker image for the stage to a docker registry (implies `<stage>_build`).
 * `<stage>_run`: runs the docker image for the stage in a new container.
 
-For example, the `release` stage is built and pushed from the top-level directory of TensorRT-LLM as follows:
+For example, the `release` stage is built and pushed from the top-level directory of TensorRT LLM as follows:
 
 ```bash
 make -C docker release_push
@@ -178,4 +178,4 @@ a corresponding message. The heuristics can be overridden by specifying
 Since Docker rootless mode remaps the UID/GID and the remapped UIDs and GIDs
 (typically configured in `/etc/subuid` and `/etc/subgid`) generally do not coincide
 with the local UID/GID, both IDs need to be translated using a tool like `bindfs` in order to be able to smoothly share a local working directory with any containers
-started with `LOCAL_USER=1`. In this case, the `SOURCE_DIR` and `HOME_DIR` Makefile variables need to be set to the locations of the translated versions of the TensorRT-LLM working copy and the user home directory, respectively.
+started with `LOCAL_USER=1`. In this case, the `SOURCE_DIR` and `HOME_DIR` Makefile variables need to be set to the locations of the translated versions of the TensorRT LLM working copy and the user home directory, respectively.
````
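For orientation, the `make` targets mentioned in this diff follow a `<stage>_<action>` pattern. Below is a minimal sketch of the `release` stage workflow, assuming only the targets the file itself names (`release_push` implies `release_build`, and `release_run` starts a container):

```bash
# Run from the top-level directory of the TensorRT LLM working copy.
make -C docker release_build             # build the release image
make -C docker release_push              # push to the registry (implies release_build)
make -C docker release_run               # start a container from the release image

# LOCAL_USER=1 keeps files created in a shared working directory owned by the
# local user, as discussed in the rootless-mode paragraph above.
make -C docker release_run LOCAL_USER=1
```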

docker/develop.md

Lines changed: 16 additions & 16 deletions
````diff
@@ -1,37 +1,37 @@
 # Description
 
-TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
-state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to
+TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
+state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to
 create Python and C++ runtimes that orchestrate the inference execution in a performant way.
 
 # Overview
 
-## TensorRT-LLM Develop Container
+## TensorRT LLM Develop Container
 
-The TensorRT-LLM Develop container includes all necessary dependencies to build TensorRT-LLM from source. It is
-specifically designed to be used alongside the source code cloned from the official TensorRT-LLM repository:
+The TensorRT LLM Develop container includes all necessary dependencies to build TensorRT LLM from source. It is
+specifically designed to be used alongside the source code cloned from the official TensorRT LLM repository:
 
 [GitHub Repository - NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
 
-Full instructions for cloning the TensorRT-LLM repository can be found in
-the [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
+Full instructions for cloning the TensorRT LLM repository can be found in
+the [TensorRT LLM Documentation](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html).
 
 > **Note:**
-> This container does not contain a pre-built binary release of `TensorRT-LLM` or tools like `trtllm-serve`.
+> This container does not contain a pre-built binary release of `TensorRT LLM` or tools like `trtllm-serve`.
 
-### Running the TensorRT-LLM Develop Container Using Docker
+### Running the TensorRT LLM Develop Container Using Docker
 
-With the top-level directory of the TensorRT-LLM repository cloned to your local machine, you can run the following
+With the top-level directory of the TensorRT LLM repository cloned to your local machine, you can run the following
 command to start the development container:
 
 ```bash
 make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.y.z
 ```
 
-where `x.y.z` is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
+where `x.y.z` is the version of the TensorRT LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
 NVIDIA NGC registry, sets up the local user's account within the container, and launches it with full GPU support. The
-local source code of TensorRT-LLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
-integration. Ensure that the image version matches the version of TensorRT-LLM in your currently checked out local git branch. Not
+local source code of TensorRT LLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
+integration. Ensure that the image version matches the version of TensorRT LLM in your currently checked out local git branch. Not
 specifying a `IMAGE_TAG` will attempt to resolve this automatically, but not every intermediate release might be
 accompanied by a development container. In that case, use the latest version preceding the version of your development
 branch.
@@ -53,9 +53,9 @@ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
 Note that this will start the container with the user `root`, which may leave files with root ownership in your local
 checkout.
 
-### Building the TensorRT-LLM Wheel within the Container
+### Building the TensorRT LLM Wheel within the Container
 
-You can build the TensorRT-LLM Python wheel inside the development container using the following command:
+You can build the TensorRT LLM Python wheel inside the development container using the following command:
 
 ```bash
 ./scripts/build_wheel.py --clean --use_ccache --cuda_architectures=native
@@ -81,7 +81,7 @@ The wheel will be built in the `build` directory and can be installed using `pip
 pip install ./build/tensorrt_llm*.whl
 ```
 
-For additional information on building the TensorRT-LLM wheel, refer to
+For additional information on building the TensorRT LLM wheel, refer to
 the [official documentation on building from source](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#option-1-full-build-with-c-compilation).
 
 ### Security CVEs
````
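Taken together, the commands quoted in this file amount to the following develop-container workflow. This is a sketch built only from the commands shown above; `x.y.z` remains a placeholder tag, and the mount path is the one the file describes:

```bash
# On the host: start the develop container from the cloned repository.
make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.y.z

# Inside the container: the local checkout is mounted at /code/tensorrt_llm.
cd /code/tensorrt_llm

# Build the TensorRT LLM Python wheel.
./scripts/build_wheel.py --clean --use_ccache --cuda_architectures=native

# Install the wheel produced in the build directory.
pip install ./build/tensorrt_llm*.whl
```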

docker/release.md

Lines changed: 10 additions & 10 deletions
````diff
@@ -1,18 +1,18 @@
 # Description
 
-TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
-state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to
+TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
+state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to
 create Python and C++ runtimes that orchestrate the inference execution in a performant way.
 
 # Overview
 
-## TensorRT-LLM Release Container
+## TensorRT LLM Release Container
 
-The TensorRT-LLM Release container provides a pre-built environment for running TensorRT-LLM.
+The TensorRT LLM Release container provides a pre-built environment for running TensorRT-LLM.
 
 Visit the [official GitHub repository](https://github.com/NVIDIA/TensorRT-LLM) for more details.
 
-### Running TensorRT-LLM Using Docker
+### Running TensorRT LLM Using Docker
 
 A typical command to launch the container is:
 
@@ -21,16 +21,16 @@ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpu
 nvcr.io/nvidia/tensorrt-llm/release:x.y.z
 ```
 
-where x.y.z is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:
+where x.y.z is the version of the TensorRT LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:
 
 ```bash
 python3 -c "import tensorrt_llm"
 ```
 
-This command will print the TensorRT-LLM version if everything is working correctly. After verification, you can explore
+This command will print the TensorRT LLM version if everything is working correctly. After verification, you can explore
 and try the example scripts included in `/app/tensorrt_llm/examples`.
 
-Alternatively, if you have already cloned the TensorRT-LLM repository, you can use the following convenient command to
+Alternatively, if you have already cloned the TensorRT LLM repository, you can use the following convenient command to
 run the container:
 
 ```bash
@@ -43,8 +43,8 @@ container, and launches it with full GPU support.
 For comprehensive information about TensorRT-LLM, including documentation, source code, examples, and installation
 guidelines, visit the following official resources:
 
-- [TensorRT-LLM GitHub Repository](https://github.com/NVIDIA/TensorRT-LLM)
-- [TensorRT-LLM Online Documentation](https://nvidia.github.io/TensorRT-LLM/latest/index.html)
+- [TensorRT LLM GitHub Repository](https://github.com/NVIDIA/TensorRT-LLM)
+- [TensorRT LLM Online Documentation](https://nvidia.github.io/TensorRT-LLM/latest/index.html)
 
 ### Security CVEs
 
````
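For reference, the launch-and-verify flow quoted in this file looks roughly like the sketch below. The hunk header above truncates the full `docker run` command, so the GPU flag shown here is an assumption rather than the file's exact invocation:

```bash
# Launch the release container (the --gpus flag is assumed; the other flags
# come from the truncated hunk header above).
docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus all nvcr.io/nvidia/tensorrt-llm/release:x.y.z

# Inside the container: sanity check the installation; per the file, this
# prints the TensorRT LLM version when everything works.
python3 -c "import tensorrt_llm"

# Example scripts ship under /app/tensorrt_llm/examples.
ls /app/tensorrt_llm/examples
```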

docs/source/blogs/Falcon180B-H200.md

Lines changed: 5 additions & 5 deletions
```diff
@@ -21,7 +21,7 @@ memory footprint, allows for great performance on Falcon-180B on a single GPU.
 
 <sup>Preliminary measured Performance, subject to change. TP1 does not represent peak performance on H200. </sup>
 <sup>
-TensorRT-LLM v0.7a |
+TensorRT LLM v0.7a |
 Falcon-180B |
 1xH200 TP1 |
 INT4 AWQ |
@@ -38,7 +38,7 @@ while maintaining high accuracy.
 
 <sup>Preliminary measured accuracy, subject to change. </sup>
 <sup>
-TensorRT-LLM v0.7a |
+TensorRT LLM v0.7a |
 Falcon-180B |
 1xH200 TP1 |
 INT4 AWQ
@@ -84,20 +84,20 @@ than A100.
 
 <sup>Preliminary measured performance, subject to change. </sup>
 <sup>
-TensorRT-LLM v0.7a |
+TensorRT LLM v0.7a |
 Llama2-70B |
 1xH200 = TP1, 8xH200 = max TP/PP/DP config |
 FP8 |
 BS: (in order) 960, 960, 192, 560, 96, 640 </sup>
 
 
-**TensorRT-LLM GQA now 2.4x faster on H200**
+**TensorRT LLM GQA now 2.4x faster on H200**
 
 <img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/Falcon180B-H200_DecvOct.png?raw=true" alt="Llama-70B H200 December vs Oct." width="400" height="auto">
 
 <sup>Preliminary measured performance, subject to change.</sup>
 <sup>
-TensorRT-LLM v0.7a vs TensorRT-LLM v0.6a |
+TensorRT LLM v0.7a vs TensorRT LLM v0.6a |
 Llama2-70B |
 1xH200 TP1 |
 FP8 |
```

docs/source/blogs/XQA-kernel.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -10,7 +10,7 @@ Looking at the Throughput-Latency curves below, we see that the enabling of XQA
 
 <img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/XQA_ThroughputvsLatency.png?raw=true" alt="XQA increased throughput within same latency budget" width="950" height="auto">
 
-<sub>Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT-LLM v0.8a</sub>
+<sub>Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT LLM v0.8a</sub>
 
 
 ## Llama-70B on H200 up to 2.4x increased throughput with XQA within same latency budget
```

docs/source/blogs/quantization-in-TRT-LLM.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -5,7 +5,7 @@ The deployment and inference speed of LLMs are often impeded by limitations in m
 In this blog, we provide an overview of the quantization features in TensorRT-LLM, share benchmark, and offer best practices of selecting the appropriate quantization methods tailored to your specific use case.
 
 ## Quantization in TensorRT-LLM
-TensorRT-LLM offers a best-in-class unified quantization toolkit to significantly speedup DL/GenAI deployment on NVIDIA hardware, while maintaining model accuracy. This toolkit is designed with easy-of-use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of codes. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will soon expand to more model optimization techniques in the near future.
+TensorRT LLM offers a best-in-class unified quantization toolkit to significantly speedup DL/GenAI deployment on NVIDIA hardware, while maintaining model accuracy. This toolkit is designed with easy-of-use in mind. You can follow [this user guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) to quantize [supported LLMs](../reference/support-matrix.md#models) with a few lines of codes. We currently focus on providing SOTA **Post-Training Quantization (PTQ)** and will soon expand to more model optimization techniques in the near future.
 
 ## Benchmark
 
@@ -63,7 +63,7 @@ Based on specific use cases, users might have different tolerances on accuracy i
 \* The performance and impact are measured on 10+ popular LLMs. We'll follow up with more data points.
 ** Calibration time is subject to the actual model size.
 
-We note that TensorRT-LLM also offers INT8 and FP8 quantization for KV cache. KV cache differs from normal activation because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using KV cache on Hopper & Ada GPUs, We recommend using FP8 KV cache over Int8 because the former has a lower accuracy impact than the latter in most tested cases. When switching from FP16 KV cache to FP8 KV cache, it also enables you to run 2-3x larger batch size on H100 machine for models like GPT-J which further brings about 1.5x performance benefit.
+We note that TensorRT LLM also offers INT8 and FP8 quantization for KV cache. KV cache differs from normal activation because it occupies non-negligible persistent memory under scenarios like large batch sizes or long context lengths. If you're using KV cache on Hopper & Ada GPUs, We recommend using FP8 KV cache over Int8 because the former has a lower accuracy impact than the latter in most tested cases. When switching from FP16 KV cache to FP8 KV cache, it also enables you to run 2-3x larger batch size on H100 machine for models like GPT-J which further brings about 1.5x performance benefit.
 
 ## What’s coming next
-TensorRT-LLM continues to make improvements on our quantization features, such as Int4-FP8 AWQ (W4A8) public examples and more model supports. Please stay tuned for our upcoming releases.
+TensorRT LLM continues to make improvements on our quantization features, such as Int4-FP8 AWQ (W4A8) public examples and more model supports. Please stay tuned for our upcoming releases.
```

docs/source/developer-guide/dev-containers.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -1,10 +1,10 @@
 # Using Dev Containers
 
-The TensorRT-LLM repository contains a [Dev Containers](https://containers.dev/)
+The TensorRT LLM repository contains a [Dev Containers](https://containers.dev/)
 configuration in `.devcontainer`. These files are intended for
 use with [Visual Studio Code](https://code.visualstudio.com/).
 
-Due to the various container options supported by TensorRT-LLM (see
+Due to the various container options supported by TensorRT LLM (see
 [](/installation/build-from-source-linux.md) and
 <https://github.com/NVIDIA/TensorRT-LLM/tree/main/docker>), the Dev
 Container configuration also offers some degree of customization.
```
