[None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … #7850
Conversation
…remaining docs. Signed-off-by: nv-guomingz <[email protected]>
📝 Walkthrough

This PR standardizes branding from “TensorRT-LLM” to “TensorRT LLM” across documentation and build messages, and expands several READMEs with new guidance and examples (e.g., DTM, NGram, Medusa, BERT, quantization, docker develop). No code, APIs, or control flow are changed; only documentation text and message strings are updated.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Client as Client (API/CLI)
    participant Triton as Triton Server
    rect rgba(224,240,255,0.5)
        note right of Triton: Speculative decoding with Draft/Target engines
        participant Draft as Draft Model
        participant Target as Target Model
    end
    User->>Client: Submit prompt
    Client->>Triton: generate(request)
    Triton->>Draft: Propose draft tokens (k)
    Draft-->>Triton: Draft tokens
    Triton->>Target: Validate draft tokens
    alt Draft accepted
        Target-->>Triton: Accept/commit tokens
        Triton-->>Client: Stream committed tokens
    else Partial/none accepted
        Target-->>Triton: Fallback/partial accept
        Triton-->>Client: Stream accepted + new token
    end
    loop Until stop criteria
        Triton->>Draft: Next proposals
        Draft-->>Triton: Draft tokens
        Triton->>Target: Validate
    end
    Triton-->>Client: Finished response
```
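The diagram boils down to a simple propose/verify loop. Below is a minimal sketch of that loop in Python, assuming hypothetical `draft_model.propose()` and `target_model.verify()` helpers (illustrative names only, not TensorRT LLM or Triton APIs):

```python
# Illustrative draft/target speculative decoding loop (names are placeholders).
def speculative_decode(prompt_ids, draft_model, target_model, k=4, max_new_tokens=128):
    tokens = list(prompt_ids)
    produced = 0
    while produced < max_new_tokens:
        # Draft model cheaply proposes up to k candidate tokens.
        draft_tokens = draft_model.propose(tokens, k)                     # assumed helper
        # Target model validates the proposals in a single pass and returns
        # the accepted prefix plus one token it generates itself.
        accepted, next_token = target_model.verify(tokens, draft_tokens)  # assumed helper
        tokens.extend(accepted)
        tokens.append(next_token)
        produced += len(accepted) + 1
        if next_token == target_model.eos_token_id:                       # stop criterion
            break
    return tokens
```

Each iteration commits `len(accepted) + 1` tokens, which is why the draft/target flow above pays off when the draft model's acceptance rate is high.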
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
/bot run
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
examples/models/contrib/mpt/README.md (2)
150-151
: Typo breaks command: `--tp_szie` → `--tp_size`. This will fail if copy-pasted.
-python convert_checkpoint.py --model_dir mosaicml/mpt-30b --output_dir ./ckpts/mpt-30b/fp16_tp4/ --tp_szie 4 --dtype float16 +python convert_checkpoint.py --model_dir mosaicml/mpt-30b --output_dir ./ckpts/mpt-30b/fp16_tp4/ --tp_size 4 --dtype float16
174-176
: Engine path inconsistency with build step. Build output_dir is `./trt_engines/mpt-30b/fp16_tp4`, but the run example uses `./trt_engines/mpt-30b/fp16/4-gpu/`. Align to avoid confusion.
- --engine_dir ./trt_engines/mpt-30b/fp16/4-gpu/ \ + --engine_dir ./trt_engines/mpt-30b/fp16_tp4 \
🧹 Nitpick comments (54)
docs/source/developer-guide/perf-benchmarking.md (4)
117-117
: Fix terminology: `input_ids` are token IDs, not logits. Update description to reflect token IDs.
-| `input_ids` | Y* | List[Integer] | List of logits that make up the request prompt. | +| `input_ids` | Y* | List[Integer] | List of token IDs that encode the request prompt. |
121-123
: Clarify mutual exclusivity and wording. Replace “prompts and logits” with precise fields and tighten grammar.
-\* Specifying `prompt` or `input_ids` is required. However, you can not have both prompts and logits (`input_ids`) -defined at the same time. If you specify `input_ids`, the `prompt` entry is ignored for request generation. +\* Specifying `prompt` or `input_ids` is required. However, you cannot provide both `prompt` and `input_ids` +at the same time. If you specify `input_ids`, the `prompt` entry is ignored during request generation.
134-134
: Terminology consistency in example header. Use `input_ids` instead of “logits”.
-- Entries which contain logits. +- Entries which contain `input_ids`.
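To make the two entry kinds concrete, a pair of illustrative dataset rows is sketched below as Python dict literals. Only `prompt` and `input_ids` come from the benchmarking guide itself; the surrounding field names are assumptions for illustration, not the documented schema:

```python
# Hypothetical trtllm-bench dataset rows (normally serialized one JSON object per line).
# "task_id" and "output_tokens" are assumed field names used only for illustration.
prompt_entry = {"task_id": 0, "prompt": "Summarize the following article ...", "output_tokens": 128}
ids_entry = {"task_id": 1, "input_ids": [101, 2023, 2003, 1037, 3231], "output_tokens": 128}  # token IDs, not logits
```

Per the note above, a given entry should carry either `prompt` or `input_ids`; when both are present, `input_ids` wins and `prompt` is ignored.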
212-219
: Show how to actually enable streaming. The section says “When enabling streaming…” but the example command doesn’t include a streaming flag. Add the exact flag used by `trtllm-bench` (e.g., `--streaming`) so users can reproduce TTFT/ITL.

tensorrt_llm/scaffolding/README.md (1)
26-26
: Tighten grammar and style. Minor edits for clarity and correctness.
-This example run the generation with TensorRT LLM backend. It shows the step of using Scaffolding. Users firstly need to create `Controller` and `Worker` instance, then map the worker tag to the worker instance, finally create the `ScaffoldingLlm` instance and run the request. It also shows how to run scaffolding on asyncio and run the batched request. +This example runs generation with the TensorRT LLM backend. It shows the steps of using Scaffolding: first create `Controller` and `Worker` instances, then map the worker tag to the worker instance, and finally create the `ScaffoldingLlm` instance to run the request. It also shows how to run Scaffolding with asyncio and how to run batched requests.There’s an earlier brand mention that still says “TensorRT-LLM” (Line 19). Consider updating it for consistency with this PR’s goal.
examples/disaggregated/README.md (1)
145-145
: Minor doc polish: fix a nearby typo and header. While unrelated to this exact line, two small doc issues nearby can confuse users:
- Line 131: “refersh_interval” → “refresh_interval”.
- Line 196: “Know Issues” → “Known Issues”.
triton_backend/ci/README.md (1)
70-70
: LGTM — wording consistent with prior rename. One nearby consistency follow-up: later text still says “TensorRT-LLM” when describing latency (Lines 91–93). Consider updating that to “TensorRT LLM.”
examples/apps/README.md (1)
21-21
: Fix duplicated "LLM LLM" → "LLM". Small branding typo introduced during the rename; remove the duplicated "LLM" in the files below.
- examples/apps/README.md:21 — apply the fix:
-NOTE: This FastAPI-based server is only an example for demonstrating the usage -of TensorRT LLM LLM API. It is not intended for production use. +NOTE: This FastAPI-based server is only an example for demonstrating the usage +of the TensorRT LLM API. It is not intended for production use.
- Also fix similar occurrences:
- examples/apps/fastapi_server.py:3
- docs/source/performance/performance-tuning-guide/benchmarking-default-performance.md:9
- .github/CODEOWNERS:114
examples/models/contrib/chatglm-6b/README.md (1)
30-30
: Address the markdown style inconsistency. The static analysis tool correctly identified that the list uses asterisk bullets instead of the expected dash format for consistency with the rest of the repository.
Apply this diff to fix the list style:
-* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format. +- [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.

examples/models/core/nemotron_nas/README.md (1)
17-17
: Address the markdown style inconsistency. The static analysis tool correctly identified that the list uses asterisk bullets instead of the expected dash format for consistency with the rest of the repository.
Apply this diff to fix the list style:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the model into TensorRT LLM checkpoint format. +- [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the model into TensorRT LLM checkpoint format.

examples/models/core/granite/README.md (1)
9-9
: Fix link fragment to match actual heading. The link fragment references the old heading format but the actual heading on line 23 uses a different case. The link should point to the correct anchor.
Based on the static analysis hint and examining the actual heading on line 23, apply this diff to fix the link fragment:
- - [Convert weights from HF Transformers to TensorRT LLM format](#Convert-weights-from-HF-Transformers-to-TensorRT-LLM-format) + - [Convert weights from HF Transformers to TensorRT LLM format](#convert-weights-from-hf-transformers-to-tensorrt-llm-format)

examples/models/contrib/stdit/README.md (1)
8-8
: Minor formatting improvement needed. The static analysis correctly identifies a markdown formatting inconsistency at Line 33.
Apply this diff to fix the markdown formatting:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the STDiT model into TensorRT LLM checkpoint format. +- [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the STDiT model into TensorRT LLM checkpoint format.

examples/models/core/nemotron/README.md (1)
17-17
: Minor formatting consistency issue. The static analysis correctly identifies a markdown list formatting inconsistency.
Apply this diff to fix the markdown formatting:
-* [`run.py`](../../../run.py) to run the inference on an input text; +- [`run.py`](../../../run.py) to run the inference on an input text;

examples/models/core/gpt/README.md (1)
42-42
: Minor formatting consistency issue. The static analysis correctly identifies a markdown list formatting inconsistency at Line 42.
Apply this diff to fix the markdown formatting:
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format. +- [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.

examples/models/contrib/internlm/README.md (1)
21-21
: Minor formatting consistency issue. The static analysis correctly identifies a markdown list formatting inconsistency at Line 21.
Apply this diff to fix the markdown formatting:
-* [`convert_checkpoint.py`](../../../llama/convert_checkpoint.py) converts the Huggingface Model of InternLM into TensorRT LLM checkpoint. +- [`convert_checkpoint.py`](../../../llama/convert_checkpoint.py) converts the Huggingface Model of InternLM into TensorRT LLM checkpoint.

examples/models/core/internlm2/README.md (5)
58-58
: Typo: “BUild” → “Build”. Minor capitalization fix in the comment line.
-# BUild the InternLM2 7B model using a single GPU +# Build the InternLM2 7B model using a single GPU
63-63
: Grammar/punctuation nit. Double period at the end; drop the extra dot.
-# Convert the InternLM2 7B model using a single GPU and apply INT8 weight-only quantization.. +# Convert the InternLM2 7B model using a single GPU and apply INT8 weight-only quantization.
49-50
: Flag name clarity. Use the actual flag form in the tip.
-# Try use_gemm_plugin to prevent accuracy issue. +# Try `--gemm_plugin` to prevent accuracy issues.
86-96
: 20B build uses 7B checkpoint path. `trtllm-build` for 20B points to a 7B bf16/2‑gpu directory. Fix paths to the 20B checkpoint.
-trtllm-build --checkpoint_dir ./internlm2-chat-7b/trt_engines/bf16/2-gpu/ \ +trtllm-build --checkpoint_dir ./internlm2-chat-20b/trt_engines/bf16/2-gpu/ \ --output_dir ./engine_outputs \ --gpt_attention_plugin bfloat16 \ --gemm_plugin bfloat16
165-171
: 20B run example uses 7B tokenizer/engine paths. Point tokenizer_dir and engine_dir to 20B.
- --tokenizer_dir ./internlm2-chat-7b/ \ - --engine_dir=./internlm2-chat-7b/trt_engines/bf16/4-gpu/ + --tokenizer_dir ./internlm2-chat-20b/ \ + --engine_dir=./internlm2-chat-20b/trt_engines/bf16/4-gpu/

examples/models/contrib/deepseek_v2/README.md (5)
53-53
: Leftover branding. Change “TensorRT-LLM” → “TensorRT LLM”.
-Below is the step-by-step to run Deepseek-v2 with TensorRT-LLM. +Below is the step-by-step to run Deepseek-v2 with TensorRT LLM.
55-56
: Grammar cleanup. “Tense/wording” and passive vs. active voice.
-First the checkpoint will be converted to the TensorRT LLM checkpoint format by apply [`convert_checkpoint.py`](./convert_checkpoint.py). After that, the TensorRT engine(s) can be build with TensorRT LLM checkpoint. +First, convert the checkpoint to the TensorRT LLM checkpoint format by applying [`convert_checkpoint.py`](./convert_checkpoint.py). After that, build the TensorRT engine(s) from the TensorRT LLM checkpoint.
37-39
: Typo: dataset name. “cnn_dailmail” → “cnn_dailymail”.
-* [`../../../summarize.py`](../../../summarize.py) to summarize the article from [cnn_dailmail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset, it can running the summarize from HF model and TensorRT LLM model. +* [`../../../summarize.py`](../../../summarize.py) to summarize articles from the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset; it can run summarization with both the HF model and the TensorRT LLM model.
73-74
: Wording/units (minutes) and readability. Use “minutes” and clarify subject.
-We observe use GPUs(8xH200) the checkpoint conversion time took ~ 34 mints, while use CPUs took ~ 21 mints and CPU memory required >= 770GB. +Using 8× H200 GPUs, checkpoint conversion took ~34 minutes; using CPUs, it took ~21 minutes (requires ≥770 GB CPU memory).
102-121
: Add fenced code languages to logs. Label log/code fences (e.g., “text”) to satisfy MD040.
-``` +```text ... -``` +```text ...
Also applies to: 136-161
examples/models/contrib/dbrx/README.md (2)
195-202
: Incorrect relative path to run.py.From this folder,
run.py
is at../../../run.py
.-mpirun -n 8 \ - python3 ../run.py --engine_dir dbrx/trt_engines/bf16/tp8 \ +mpirun -n 8 \ + python3 ../../../run.py --engine_dir dbrx/trt_engines/bf16/tp8 \ --tokenizer_dir dbrx-base \ --max_output_len 10 \ --input_text "What is AGI?"
214-219
: Incorrect relative path to summarize.py.Adjust to
../../../summarize.py
.-mpirun -n 8 \ - python ../summarize.py --engine_dir dbrx/trt_engines/bf16/tp8 \ +mpirun -n 8 \ + python ../../../summarize.py --engine_dir dbrx/trt_engines/bf16/tp8 \ --hf_model_dir dbrx-base \ --test_trt_llmexamples/models/core/qwen/README.md (4)
219-228
: Mismatched checkpoint dir names (INT8 KV cache example). Build step uses `./tllm_checkpoint_1gpu_sq` but earlier output is `./tllm_checkpoint_1gpu_fp16_int8kv`.
-python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \ - --output_dir ./tllm_checkpoint_1gpu_fp16_int8kv +python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \ + --output_dir ./tllm_checkpoint_1gpu_fp16_int8kv \ --dtype float16 \ --int8_kv_cache - -trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_sq \ +trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_int8kv \ --output_dir ./engine_outputs \ --gemm_plugin float16
235-236
: Terminology capitalization. Use “SmoothQuant” as a proper name.
-The smoothquant supports Qwen models. +SmoothQuant supports Qwen models.
873-874
: Config filename mismatch (.yml vs .yaml). Earlier, the guide writes `disagg-config.yml`; here it’s `disagg-config.yaml`. Align to one.
-trtllm-serve disaggregated -c disagg-config.yaml +trtllm-serve disaggregated -c disagg-config.yml
927-928
: Branding in link text. Keep URL as-is but update visible text to “TensorRT LLM”.
-Dynamo supports TensorRT LLM as one of its inference engine. For details on how to use TensorRT LLM with Dynamo please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md) +Dynamo supports TensorRT LLM as one of its inference engines. For details on how to use TensorRT LLM with Dynamo please refer to [LLM Deployment Examples using TensorRT LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md)

examples/models/contrib/gptneox/README.md (1)
21-26
: markdownlint MD004: consistent list markers. Use dashes instead of asterisks to satisfy repo linting.
-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format. +- [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.

examples/models/contrib/jais/README.md (2)
19-23
: Parent link and wording.
- The link labeled “examples” points to
../
, which lands inexamples/models/contrib
. Consider linking to the repo’s examples root or drop the link.- Minor grammar.
-The TensorRT LLM support for Jais is based on the GPT model, the implementation can be found in [tensorrt_llm/models/gpt/model.py](../../../../tensorrt_llm/models/gpt/model.py). Jais model resembles GPT very much except it uses alibi embedding, embedding scale, swiglu, and logits scale, we therefore reuse the [GPT example code](../../../gpt) for Jais, +The TensorRT LLM support for Jais is based on the GPT model; the implementation is in [tensorrt_llm/models/gpt/model.py](../../../../tensorrt_llm/models/gpt/model.py). The Jais model closely resembles GPT (alibi embedding, embedding scale, SwiGLU, logits scale), so we reuse the [GPT example code](../../../gpt) for Jais.-In addition, there are two shared files in the parent folder [`examples`](../) for inference and evaluation: +In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:
41-43
: Grammar. “Tense” and article.
-Run the following commands and TRT-LLM will first transforms a HF model into its own checkpoint format, then builds a TRT engine based on the checkpoint +Run the following commands. TRT-LLM first transforms an HF model into its checkpoint format, then builds a TRT engine from that checkpoint.examples/models/contrib/dit/README.md (3)
6-9
: Consistency: branding and count.
- Keep branding consistent within the sentence.
- Specify “two” main files.
-The TensorRT LLM DiT implementation can be found in [tensorrt_llm/models/dit/model.py](../../../../tensorrt_llm/models/dit/model.py). The TensorRT LLM DiT example code is located in [`examples/dit`](./). There are main files to build and run DiT with TensorRT-LLM: +The TensorRT LLM DiT implementation can be found in [tensorrt_llm/models/dit/model.py](../../../../tensorrt_llm/models/dit/model.py). The TensorRT LLM DiT example code is located in [`examples/dit`](./). There are two main files to build and run DiT with TensorRT LLM:
31-45
: Output dir and terminology alignment.
- Add
--output_dir
for the first build to mirror later examples and the text below.- Prefer “TensorRT LLM” in comments for consistency.
-# Convert to TRT-LLM with float16(by default) +# Convert to TensorRT LLM with float16 (default) python convert_checkpoint.py trtllm-build --checkpoint_dir ./tllm_checkpoint/ \ --max_batch_size 8 \ --remove_input_padding disable \ - --bert_attention_plugin disable + --bert_attention_plugin disable \ + --output_dir ./engine_outputs/ -# Convert to TRT-LLM with float8 +# Convert to TensorRT LLM with float8 python convert_checkpoint.py --fp8_linear --timm_ckpt=</path/to/quantized_ckpt> --output_dir=tllm_checkpoint_fp8 trtllm-build --checkpoint_dir ./tllm_checkpoint_fp8/ \ --output_dir ./engine_outputs_fp8/ \
60-61
: Directory name mismatch.Text says
./engine_output
, but commands produce./engine_outputs/
(andengine_outputs_fp8
). Align wording.-After build, we can find a `./engine_output` directory, it is ready for running DiT model with TensorRT LLM now. +After build, the `./engine_outputs/` (or `./engine_outputs_fp8/`) directory is ready for running the DiT model with TensorRT LLM.docs/source/features/disagg-serving.md (1)
83-84
: Anchor case: fix link fragment (MD051). Use lower-case fragment to match the header id.
-Please refer to the following section for details [Environment Variables](#Environment-Variables). +Please refer to the following section for details [Environment Variables](#environment-variables).examples/openai_triton/README.md (1)
3-4
: Branding looks good; minor grammar nit. Keep the rename; adjust “Specially” → “Especially” for clarity.
-The typical approach to integrate a kernel into TensorRT LLM is to create TensorRT plugins. -Specially for integrating OpenAI Triton kernels, there are two methods: +The typical approach to integrate a kernel into TensorRT LLM is to create TensorRT plugins. +Especially for integrating OpenAI Triton kernels, there are two methods:examples/cpp/executor/README.md (4)
13-13
: Fix potentially broken link targetThe “source:” prefix in
[build_wheel.py](source:scripts/build_wheel.py)
won’t resolve in GitHub Markdown. Use a relative repo path.-To build the examples, you first need to build the TensorRT LLM C++ shared libraries (`libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so`) using the [`build_wheel.py`](source:scripts/build_wheel.py) script. Alternatively, if you have already build the TensorRT LLM libraries, you can modify the provided `CMakeLists.txt` such that the `libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so` are imported properly. +To build the examples, you first need to build the TensorRT LLM C++ shared libraries (`libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so`) using [`scripts/build_wheel.py`](../../../scripts/build_wheel.py). Alternatively, if you have already built the TensorRT LLM libraries, you can modify the provided `CMakeLists.txt` such that the `libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so` are imported properly.
15-16
: Add language to fenced block (markdownlint MD040). Annotate the shell block.
-Once the TensorRT LLM libraries are built, you can run - -``` +Once the TensorRT LLM libraries are built, you can run + +```bash mkdir build cd build cmake .. make -j--- `29-29`: **Leftover branding** “TRT-LLM engine” should be “TensorRT LLM engine” unless you intentionally refer to a CLI artifact. ```diff -Use `trtllm-build` to build the TRT-LLM engine. +Use `trtllm-build` to build the TensorRT LLM engine.
35-36
: Tiny grammar nits (“get run” → “run”). Optional polish; improves readability.
-From the `examples/cpp/executor/build` folder, you can get run the `executorExampleBasic` example with: +From the `examples/cpp/executor/build` folder, you can run the `executorExampleBasic` example with:(repeat for the other two occurrences)
Also applies to: 45-46, 112-116
examples/models/contrib/chatglm2-6b/README.md (1)
29-33
: List marker style (markdownlint MD004). The file predominantly uses “-” list markers; switch the “*” item to “-” for consistency.
-* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format. +- [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.examples/models/contrib/mpt/README.md (1)
15-16
: Duplicate subsection numbering (“1.6” twice). Renumber to keep the table of contents and anchors clean.
Also applies to: 100-107
examples/models/core/recurrentgemma/README.md (1)
112-113
: Tiny grammar nit. “After getting checkpoint” → “After getting the checkpoint(s)”.
-After getting checkpoint, we can use `trtllm-build` command to build TensorRT LLM engines from TensorRT LLM checkpoints. +After getting the checkpoint(s), we can use the `trtllm-build` command to build TensorRT LLM engines from TensorRT LLM checkpoints.examples/models/core/multimodal/README.md (1)
595-596
: Mixed branding in one sentenceUse “TensorRT LLM” consistently; keep “trtllm” only for CLI.
-[LLaVA](...) and [VILA](...) are both visual language models (VLM) that can be deployed in TensorRT LLM with many quantization options. [LLaVA‑NeXT](...) is an extension of LLaVA. TRT-LLM currently supports [Mistral-7b](...) and [ Nous‑Hermes‑2‑Yi‑34B](...) variant of LLaVA-NeXT. [LLaVA-OneVision](...) is another extension of LLaVA. +[LLaVA](...) and [VILA](...) are both visual language models (VLM) that can be deployed in TensorRT LLM with many quantization options. [LLaVA‑NeXT](...) is an extension of LLaVA. TensorRT LLM currently supports [Mistral‑7b](...) and [Nous‑Hermes‑2‑Yi‑34B](...) variants of LLaVA‑NeXT. [LLaVA‑OneVision](...) is another extension of LLaVA.examples/models/core/deepseek_v3/README.md (4)
3-9
: One leftover “TensorRT‑LLM” and acronym in this intro. Prefer “TensorRT LLM” in prose; keep commands/paths as-is.
-**DeepSeek-R1 and DeepSeek-V3 share exact same model architecture other than weights differences, and share same code path in TensorRT-LLM, for brevity we only provide one model example, the example command to be used interchangeably by only replacing the model name to the other one**. +**DeepSeek-R1 and DeepSeek-V3 share the exact same model architecture other than weight differences, and share the same code path in TensorRT LLM. For brevity, we provide one model example; use the example commands interchangeably by replacing the model name.**Also consider:
-... build TensorRT LLM from source and start a TRT-LLM docker container. +... build TensorRT LLM from source and start a TRT LLM Docker container.
393-394
: Grammar + branding in Dynamo section. Use article “an” and align link text with branding.
-Dynamo supports TensorRT LLM as one of its inference engine. For details on how to use TensorRT LLM with Dynamo please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md) +Dynamo supports TensorRT LLM as one of its inference engines. For details on how to use TensorRT LLM with Dynamo, please refer to [LLM Deployment Examples using TensorRT LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md).
608-619
: Minor wording cleanup (optional). “e2e” → “end‑to‑end” for external docs readability.
-TensorRT LLM uses DeepGEMM for DeepSeek-V3/R1, which provides significant e2e performance boost on Hopper GPUs. +TensorRT LLM uses DeepGEMM for DeepSeek‑V3/R1, which provides significant end‑to‑end performance boost on Hopper GPUs.
720-721
: Fix markdownlint MD050: use asterisks for bold, not underscores. Also remove inline emphasis redundancy.
-TensorRT LLM supports W(INT)4-A(FP)8 for DeepSeek on __Hopper__. Activations and weights are quantized at per-tensor and per-group (1x128) granularity respectively for MoE, and FP8 block scaling is preserved for dense layers. +TensorRT LLM supports W(INT)4‑A(FP)8 for DeepSeek on **Hopper**. Activations and weights are quantized at per‑tensor and per‑group (1×128) granularity respectively for MoE, and FP8 block scaling is preserved for dense layers.examples/models/contrib/mmdit/README.md (1)
6-10
: Path label/target mismatch and small grammar fix.The link label points to sd3/model.py but the target is mmdit_sd3/model.py. Also “There are main files” → “There are two main files”.
-The TensorRT LLM implementation of MMDiT can be found in [tensorrt_llm/models/sd3/model.py](../../../../tensorrt_llm/models/mmdit_sd3/model.py). The TensorRT LLM MMDiT (SD 3/3.5) example code is located in [`examples/models/contrib/mmdit`](./). There are main files to build and run MMDiT with TensorRT-LLM: +The TensorRT LLM implementation of MMDiT can be found in [tensorrt_llm/models/mmdit_sd3/model.py](../../../../tensorrt_llm/models/mmdit_sd3/model.py). The TensorRT LLM MMDiT (SD 3/3.5) example code is located in [`examples/models/contrib/mmdit`](./). There are two main files to build and run MMDiT with TensorRT LLM:docs/source/blogs/quantization-in-TRT-LLM.md (1)
8-8
: Typo: “easy‑of‑use” → “ease of use”.-This toolkit is designed with easy-of-use in mind. +This toolkit is designed with ease of use in mind.examples/redrafter/README.md (1)
8-8
: Fix the inconsistent link reference.The link text references "model.py" but the URL points to "drafter.py". This creates confusion about the actual location of the drafter component.
Apply this fix to correct the link:
-The TensorRT-LLM's ReDrafter implementation can be found in [tensorrt_llm/models/redrafter/model.py](../../tensorrt_llm/models/redrafter/model.py), which combines the base model and the drafter definition which can be found in [tensorrt_llm/models/redrafter/model.py](../../tensorrt_llm/models/redrafter/drafter.py). +The TensorRT LLM's ReDrafter implementation can be found in [tensorrt_llm/models/redrafter/model.py](../../tensorrt_llm/models/redrafter/model.py), which combines the base model and the drafter definition which can be found in [tensorrt_llm/models/redrafter/drafter.py](../../tensorrt_llm/models/redrafter/drafter.py).
/bot run
PR_Github #19298 [ run ] triggered by Bot
PR_Github #19298 [ run ] completed with state
NVIDIA#7850) Signed-off-by: nv-guomingz <[email protected]>
NVIDIA#7850) Signed-off-by: nv-guomingz <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…remaining docs.
Summary by CodeRabbit
Documentation
Style
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
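For example, a typical invocation combining several of the options described below (see the flag list that follows) might be:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1, xxx"

Both flags are taken from the descriptions below; replace the stage names with the stages you actually want to run.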
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.