
Conversation

@nv-guomingz (Collaborator) commented Sep 19, 2025

…remaining docs.

Summary by CodeRabbit

  • Documentation

    • Rebranded all docs from “TensorRT-LLM” to “TensorRT LLM.”
    • Added “Partial Deprecation” to Deprecation Policy.
    • Expanded guides: Docker develop (build flags, run options), DTM (LLaMA, TP, fast logits), Lookahead config, NGram workflows/hyperparams.
    • Enhanced model docs (e.g., Medusa support matrix/Qwen2 note, BERT remove_input_padding/FMHA flags, multiple contrib/core README updates).
    • Updated quantization, benchmarking, and serving docs; numerous links/titles reflect new naming.
  • Style

    • Updated build/runtime status messages to display “TensorRT LLM” in CMake and CUDA configuration outputs.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
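
For illustration only, a comment that runs the ordinary L0 pre-merge pipeline plus one extra stage with fail-fast disabled could look like the sketch below (the stage name is taken from the help text's own example and may not exist in every pipeline):

```
/bot run --disable-fail-fast --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"
```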

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.
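
A minimal sketch of the required form, with a placeholder reason (the comment text is illustrative, not a prescribed wording):

```
/bot skip --comment "Documentation-only change; no test stages affected"
```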

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

@nv-guomingz requested a review from a team as a code owner on September 19, 2025 02:01
coderabbitai bot (Contributor) commented Sep 19, 2025

📝 Walkthrough

Walkthrough

This PR standardizes branding from “TensorRT-LLM” to “TensorRT LLM” across documentation and build messages, and expands several READMEs with new guidance and examples (e.g., DTM, NGram, Medusa, BERT, quantization, docker develop). No code, APIs, or control flow are changed; only documentation text and message strings are updated.

Changes

| Cohort / File(s) | Change summary |
|---|---|
| **Branding rename: top-level/docs**<br>`README.md`, `docs/source/...` (blogs, features, developer-guide, torch, models, performance), `docker/README.md`, `docker/release.md`, `triton_backend/.../README*.md` | Replace “TensorRT-LLM” with “TensorRT LLM” in titles, prose, captions, and example outputs. Links generally preserved; display text updated. No functional changes. |
| **CMake/CUDA message strings**<br>`cpp/CMakeLists.txt`, `cpp/cmake/modules/cuda_configuration.cmake` | Update message literals from “TensorRT-LLM” to “TensorRT LLM” for version/status/fatal logs. No logic changes. |
| **Docker develop guidance expansion**<br>`docker/develop.md` | Rename branding; add build flags explanation, explicit docker run alternative, version matching notes. |
| **Torch out-of-tree example addition**<br>`docs/source/torch/adding_new_model.md` | Rename branding; add runnable out-of-tree usage example and reference to example path. |
| **DTM (Draft/Target Model) README expansion**<br>`examples/draft_target_model/README.md` | Rename branding; add extensive usage guidance (llama paths, required flags, TP mode, config edits), fast logits D2D transfer, Triton workflow notes. |
| **NGram workflows and hyperparams**<br>`examples/ngram/README.md` | Rename branding; add V1/V2 workflows, hyperparameters, concrete examples, and step-by-step build/run commands. |
| **Lookahead clarifications**<br>`examples/lookahead/README.md` | Rename branding; add config tuple explanation, server-level scope note, updated max_draft_len formula. |
| **Medusa updates**<br>`examples/medusa/README.md` | Rename branding; add Support Matrix; add flags (`--hf_model_dir`, `--tokenizer_dir`, `--use_py_session`); new Qwen2 note. |
| **BERT feature notes**<br>`examples/models/core/bert/README.md` | Rename branding; add guidance for `remove_input_padding` and FMHA-related flags; new example. |
| **Quantization README edits**<br>`examples/quantization/README.md` | Rename branding; remove bullet about Python APIs; text updates around checkpoint formats and commands. |
| **Additional documentation tweaks**<br>`examples/redrafter/README.md`, `examples/models/contrib/*/README.md`, `examples/models/core/*/README.md`, `examples/.../README.md`, `tests/**/*.md`, `tensorrt_llm/_torch/auto_deploy/custom_ops/README.md`, `tensorrt_llm/scaffolding/README.md` | Rename branding across numerous example and test READMEs; minor content additions/notes in some files (e.g., BLOOM shared files, commandr nav, mmdit sample note). No code changes. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Client as Client (API/CLI)
  participant Triton as Triton Server
  rect rgba(224,240,255,0.5)
    note right of Triton: Speculative decoding with Draft/Target engines
    participant Draft as Draft Model
    participant Target as Target Model
  end

  User->>Client: Submit prompt
  Client->>Triton: generate(request)
  Triton->>Draft: Propose draft tokens (k)
  Draft-->>Triton: Draft tokens
  Triton->>Target: Validate draft tokens
  alt Draft accepted
    Target-->>Triton: Accept/commit tokens
    Triton-->>Client: Stream committed tokens
  else Partial/none accepted
    Target-->>Triton: Fallback/partial accept
    Triton-->>Client: Stream accepted + new token
  end
  loop Until stop criteria
    Triton->>Draft: Next proposals
    Draft-->>Triton: Draft tokens
    Triton->>Target: Validate
  end
  Triton-->>Client: Finished response

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

1.0_doc

Suggested reviewers

  • laikhtewari
  • QiJune

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description Check | ⚠️ Warning | The PR body contains only the repository PR template and the placeholder “…remaining docs.”; the Description and Test Coverage sections are empty, and there is no concise summary of what changed, why, or how it was validated. As a result, the PR description is largely incomplete and does not meet the repository's required PR description template. | Please complete the PR description per the repository template: add a short Description summarizing the changes and rationale, list the key files/areas modified and any migration or user-impact notes, and populate Test Coverage with the relevant tests/CI stages that validate the change (or explicitly mark it as "documentation-only"). Ensure the PR title follows the required format (e.g., [TRTLLM-1234][doc] or [None][doc] ...) or use the @coderabbitai helper, confirm checklist items, and add appropriate reviewers/CODEOWNERS if ownership changed. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title "[None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the …" directly reflects the primary change in the PR (a documentation/branding rename from "TensorRT-LLM" to "TensorRT LLM") and follows the repository's expected prefix format ([None][doc]). The trailing ellipsis truncates the scope but does not obscure the main intent for reviewers scanning history. Overall, the title adequately summarizes the main change in this changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |

Comment @coderabbitai help to get the list of available commands and usage tips.

@nv-guomingz (Collaborator, Author) commented:

/bot run

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
examples/models/contrib/mpt/README.md (2)

150-151: Typo breaks command: --tp_szie → --tp_size

This will fail if copy-pasted.

-python convert_checkpoint.py --model_dir mosaicml/mpt-30b --output_dir ./ckpts/mpt-30b/fp16_tp4/ --tp_szie 4 --dtype float16
+python convert_checkpoint.py --model_dir mosaicml/mpt-30b --output_dir ./ckpts/mpt-30b/fp16_tp4/ --tp_size 4 --dtype float16

174-176: Engine path inconsistency with build step

Build output_dir is ./trt_engines/mpt-30b/fp16_tp4, but run example uses ./trt_engines/mpt-30b/fp16/4-gpu/. Align to avoid confusion.

-                     --engine_dir ./trt_engines/mpt-30b/fp16/4-gpu/ \
+                     --engine_dir ./trt_engines/mpt-30b/fp16_tp4 \
🧹 Nitpick comments (54)
docs/source/developer-guide/perf-benchmarking.md (4)

117-117: Fix terminology: input_ids are token IDs, not logits

Update description to reflect token IDs.

-| `input_ids`     |    Y*    | List[Integer] | List of logits that make up the request prompt. |
+| `input_ids`     |    Y*    | List[Integer] | List of token IDs that encode the request prompt. |

121-123: Clarify mutual exclusivity and wording

Replace “prompts and logits” with precise fields and tighten grammar.

-\* Specifying `prompt` or `input_ids` is required. However, you can not have both prompts and logits (`input_ids`)
-defined at the same time. If you specify `input_ids`, the `prompt` entry is ignored for request generation.
+\* Specifying `prompt` or `input_ids` is required. However, you cannot provide both `prompt` and `input_ids`
+at the same time. If you specify `input_ids`, the `prompt` entry is ignored during request generation.

134-134: Terminology consistency in example header

Use “input_ids” instead of “logits”.

-- Entries which contain logits.
+- Entries which contain `input_ids`.

212-219: Show how to actually enable streaming

The section says “When enabling streaming…” but the example command doesn’t include a streaming flag. Add the exact flag used by trtllm-bench (e.g., --streaming) so users can reproduce TTFT/ITL.

tensorrt_llm/scaffolding/README.md (1)

26-26: Tighten grammar and style.

Minor edits for clarity and correctness.

-This example run the generation with TensorRT LLM backend. It shows the step of using Scaffolding. Users firstly need to create `Controller` and `Worker` instance, then map the worker tag to the worker instance, finally create the `ScaffoldingLlm` instance and run the request. It also shows how to run scaffolding on asyncio and run the batched request.
+This example runs generation with the TensorRT LLM backend. It shows the steps of using Scaffolding: first create `Controller` and `Worker` instances, then map the worker tag to the worker instance, and finally create the `ScaffoldingLlm` instance to run the request. It also shows how to run Scaffolding with asyncio and how to run batched requests.

There’s an earlier brand mention that still says “TensorRT-LLM” (Line 19). Consider updating it for consistency with this PR’s goal.

examples/disaggregated/README.md (1)

145-145: Minor doc polish: fix a nearby typo and header.

While unrelated to this exact line, two small doc issues nearby can confuse users:

  • Line 131: “refersh_interval” → “refresh_interval”.
  • Line 196: “Know Issues” → “Known Issues”.
triton_backend/ci/README.md (1)

70-70: LGTM — wording consistent with prior rename.

One nearby consistency follow-up: later text still says “TensorRT-LLM” when describing latency (Lines 91–93). Consider updating that to “TensorRT LLM.”

examples/apps/README.md (1)

21-21: Fix duplicated "LLM LLM" → "LLM"

Small branding typo introduced during the rename; remove the duplicated "LLM" in the files below.

  • examples/apps/README.md:21 — apply the fix:
-NOTE: This FastAPI-based server is only an example for demonstrating the usage
-of TensorRT LLM LLM API. It is not intended for production use.
+NOTE: This FastAPI-based server is only an example for demonstrating the usage
+of the TensorRT LLM API. It is not intended for production use.
  • Also fix similar occurrences:
    • examples/apps/fastapi_server.py:3
    • docs/source/performance/performance-tuning-guide/benchmarking-default-performance.md:9
    • .github/CODEOWNERS:114
examples/models/contrib/chatglm-6b/README.md (1)

30-30: Address the markdown style inconsistency.

The static analysis tool correctly identified that the list uses asterisk bullets instead of the expected dash format for consistency with the rest of the repository.

Apply this diff to fix the list style:

-* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
+- [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
examples/models/core/nemotron_nas/README.md (1)

17-17: Address the markdown style inconsistency.

The static analysis tool correctly identified that the list uses asterisk bullets instead of the expected dash format for consistency with the rest of the repository.

Apply this diff to fix the list style:

-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the model into TensorRT LLM checkpoint format.
+- [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the model into TensorRT LLM checkpoint format.
examples/models/core/granite/README.md (1)

9-9: Fix link fragment to match actual heading.

The link fragment references the old heading format but the actual heading on line 23 uses a different case. The link should point to the correct anchor.

Based on the static analysis hint and examining the actual heading on line 23, apply this diff to fix the link fragment:

-  - [Convert weights from HF Transformers to TensorRT LLM format](#Convert-weights-from-HF-Transformers-to-TensorRT-LLM-format)
+  - [Convert weights from HF Transformers to TensorRT LLM format](#convert-weights-from-hf-transformers-to-tensorrt-llm-format)
examples/models/contrib/stdit/README.md (1)

8-8: Minor formatting improvement needed.

The static analysis correctly identifies a markdown formatting inconsistency at Line 33.

Apply this diff to fix the markdown formatting:

-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the STDiT model into TensorRT LLM checkpoint format.
+- [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the STDiT model into TensorRT LLM checkpoint format.
examples/models/core/nemotron/README.md (1)

17-17: Minor formatting consistency issue.

The static analysis correctly identifies a markdown list formatting inconsistency.

Apply this diff to fix the markdown formatting:

-* [`run.py`](../../../run.py) to run the inference on an input text;
+- [`run.py`](../../../run.py) to run the inference on an input text;
examples/models/core/gpt/README.md (1)

42-42: Minor formatting consistency issue.

The static analysis correctly identifies a markdown list formatting inconsistency at Line 42.

Apply this diff to fix the markdown formatting:

-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
+- [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
examples/models/contrib/internlm/README.md (1)

21-21: Minor formatting consistency issue.

The static analysis correctly identifies a markdown list formatting inconsistency at Line 21.

Apply this diff to fix the markdown formatting:

-* [`convert_checkpoint.py`](../../../llama/convert_checkpoint.py) converts the Huggingface Model of InternLM into TensorRT LLM checkpoint.
+- [`convert_checkpoint.py`](../../../llama/convert_checkpoint.py) converts the Huggingface Model of InternLM into TensorRT LLM checkpoint.
examples/models/core/internlm2/README.md (5)

58-58: Typo: “BUild” → “Build”.

Minor capitalization fix in the comment line.

-# BUild the InternLM2 7B model using a single GPU
+# Build the InternLM2 7B model using a single GPU

63-63: Grammar/punctuation nit.

Double period at the end; drop the extra dot.

-# Convert the InternLM2 7B model using a single GPU and apply INT8 weight-only quantization..
+# Convert the InternLM2 7B model using a single GPU and apply INT8 weight-only quantization.

49-50: Flag name clarity.

Use the actual flag form in the tip.

-# Try use_gemm_plugin to prevent accuracy issue.
+# Try `--gemm_plugin` to prevent accuracy issues.

86-96: 20B build uses 7B checkpoint path.

trtllm-build for 20B points to a 7B bf16/2‑gpu directory. Fix paths to the 20B checkpoint.

-trtllm-build --checkpoint_dir ./internlm2-chat-7b/trt_engines/bf16/2-gpu/ \
+trtllm-build --checkpoint_dir ./internlm2-chat-20b/trt_engines/bf16/2-gpu/ \
              --output_dir ./engine_outputs \
              --gpt_attention_plugin bfloat16  \
              --gemm_plugin bfloat16

165-171: 20B run example uses 7B tokenizer/engine paths.

Point tokenizer_dir and engine_dir to 20B.

-                     --tokenizer_dir ./internlm2-chat-7b/ \
-                     --engine_dir=./internlm2-chat-7b/trt_engines/bf16/4-gpu/
+                     --tokenizer_dir ./internlm2-chat-20b/ \
+                     --engine_dir=./internlm2-chat-20b/trt_engines/bf16/4-gpu/
examples/models/contrib/deepseek_v2/README.md (5)

53-53: Leftover branding.

Change “TensorRT-LLM” → “TensorRT LLM”.

-Below is the step-by-step to run Deepseek-v2 with TensorRT-LLM.
+Below is the step-by-step to run Deepseek-v2 with TensorRT LLM.

55-56: Grammar cleanup.

“Tense/wording” and passive vs. active voice.

-First the checkpoint will be converted to the TensorRT LLM checkpoint format by apply [`convert_checkpoint.py`](./convert_checkpoint.py). After that, the TensorRT engine(s) can be build with TensorRT LLM checkpoint.
+First, convert the checkpoint to the TensorRT LLM checkpoint format by applying [`convert_checkpoint.py`](./convert_checkpoint.py). After that, build the TensorRT engine(s) from the TensorRT LLM checkpoint.

37-39: Typo: dataset name.

“cnn_dailmail” → “cnn_dailymail”.

-* [`../../../summarize.py`](../../../summarize.py) to summarize the article from [cnn_dailmail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset, it can running the summarize from HF model and TensorRT LLM model.
+* [`../../../summarize.py`](../../../summarize.py) to summarize articles from the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset; it can run summarization with both the HF model and the TensorRT LLM model.

73-74: Wording/units (minutes) and readability.

Use “minutes” and clarify subject.

-We observe use GPUs(8xH200) the checkpoint conversion time took ~ 34 mints, while use CPUs took ~ 21 mints and CPU memory required >= 770GB.
+Using 8× H200 GPUs, checkpoint conversion took ~34 minutes; using CPUs, it took ~21 minutes (requires ≥770 GB CPU memory).

102-121: Add fenced code languages to logs.

Label log/code fences (e.g., “text”) to satisfy MD040.

-```
+```text
...
-```
+```text
...

Also applies to: 136-161

examples/models/contrib/dbrx/README.md (2)

195-202: Incorrect relative path to run.py.

From this folder, run.py is at ../../../run.py.

-mpirun -n 8 \
-    python3 ../run.py --engine_dir dbrx/trt_engines/bf16/tp8 \
+mpirun -n 8 \
+    python3 ../../../run.py --engine_dir dbrx/trt_engines/bf16/tp8 \
         --tokenizer_dir dbrx-base \
         --max_output_len 10 \
         --input_text "What is AGI?"

214-219: Incorrect relative path to summarize.py.

Adjust to ../../../summarize.py.

-mpirun -n 8 \
-    python ../summarize.py --engine_dir dbrx/trt_engines/bf16/tp8 \
+mpirun -n 8 \
+    python ../../../summarize.py --engine_dir dbrx/trt_engines/bf16/tp8 \
         --hf_model_dir dbrx-base \
         --test_trt_llm
examples/models/core/qwen/README.md (4)

219-228: Mismatched checkpoint dir names (INT8 KV cache example).

Build step uses ./tllm_checkpoint_1gpu_sq but earlier output is ./tllm_checkpoint_1gpu_fp16_int8kv.

-python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/   \
-                             --output_dir ./tllm_checkpoint_1gpu_fp16_int8kv
+python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/   \
+                             --output_dir ./tllm_checkpoint_1gpu_fp16_int8kv \
                              --dtype float16  \
                              --int8_kv_cache
-
-trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_sq \
+trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_int8kv \
              --output_dir ./engine_outputs \
              --gemm_plugin float16

235-236: Terminology capitalization.

Use “SmoothQuant” as a proper name.

-The smoothquant supports Qwen models.
+SmoothQuant supports Qwen models.

873-874: Config filename mismatch (.yml vs .yaml).

Earlier, the guide writes disagg-config.yml; here it’s disagg-config.yaml. Align to one.

-trtllm-serve disaggregated -c disagg-config.yaml
+trtllm-serve disaggregated -c disagg-config.yml

927-928: Branding in link text.

Keep URL as-is but update visible text to “TensorRT LLM”.

-Dynamo supports TensorRT LLM as one of its inference engine. For details on how to use TensorRT LLM with Dynamo please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md)
+Dynamo supports TensorRT LLM as one of its inference engines. For details on how to use TensorRT LLM with Dynamo please refer to [LLM Deployment Examples using TensorRT LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md)
examples/models/contrib/gptneox/README.md (1)

21-26: markdownlint MD004: consistent list markers.

Use dashes instead of asterisks to satisfy repo linting.

-* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
+- [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
examples/models/contrib/jais/README.md (2)

19-23: Parent link and wording.

  • The link labeled “examples” points to ../, which lands in examples/models/contrib. Consider linking to the repo’s examples root or drop the link.
  • Minor grammar.
-The TensorRT LLM support for Jais is based on the GPT model, the implementation can be found in [tensorrt_llm/models/gpt/model.py](../../../../tensorrt_llm/models/gpt/model.py). Jais model resembles GPT very much except it uses alibi embedding, embedding scale, swiglu, and logits scale, we therefore reuse the [GPT example code](../../../gpt) for Jais,
+The TensorRT LLM support for Jais is based on the GPT model; the implementation is in [tensorrt_llm/models/gpt/model.py](../../../../tensorrt_llm/models/gpt/model.py). The Jais model closely resembles GPT (alibi embedding, embedding scale, SwiGLU, logits scale), so we reuse the [GPT example code](../../../gpt) for Jais.
-In addition, there are two shared files in the parent folder [`examples`](../) for inference and evaluation:
+In addition, there are two shared files in the parent folder [`examples`](../../../) for inference and evaluation:

41-43: Grammar.

“Tense” and article.

-Run the following commands and TRT-LLM will first transforms a HF model into its own checkpoint format, then builds a TRT engine based on the checkpoint
+Run the following commands. TRT-LLM first transforms an HF model into its checkpoint format, then builds a TRT engine from that checkpoint.
examples/models/contrib/dit/README.md (3)

6-9: Consistency: branding and count.

  • Keep branding consistent within the sentence.
  • Specify “two” main files.
-The TensorRT LLM DiT implementation can be found in [tensorrt_llm/models/dit/model.py](../../../../tensorrt_llm/models/dit/model.py). The TensorRT LLM DiT example code is located in [`examples/dit`](./). There are main files to build and run DiT with TensorRT-LLM:
+The TensorRT LLM DiT implementation can be found in [tensorrt_llm/models/dit/model.py](../../../../tensorrt_llm/models/dit/model.py). The TensorRT LLM DiT example code is located in [`examples/dit`](./). There are two main files to build and run DiT with TensorRT LLM:

31-45: Output dir and terminology alignment.

  • Add --output_dir for the first build to mirror later examples and the text below.
  • Prefer “TensorRT LLM” in comments for consistency.
-# Convert to TRT-LLM with float16(by default)
+# Convert to TensorRT LLM with float16 (default)
 python convert_checkpoint.py
 trtllm-build --checkpoint_dir ./tllm_checkpoint/ \
                 --max_batch_size 8 \
                 --remove_input_padding disable \
-                --bert_attention_plugin disable
+                --bert_attention_plugin disable \
+                --output_dir ./engine_outputs/
 
-# Convert to TRT-LLM with float8
+# Convert to TensorRT LLM with float8
 python convert_checkpoint.py --fp8_linear --timm_ckpt=</path/to/quantized_ckpt> --output_dir=tllm_checkpoint_fp8
 trtllm-build --checkpoint_dir ./tllm_checkpoint_fp8/ \
              --output_dir ./engine_outputs_fp8/ \

60-61: Directory name mismatch.

Text says ./engine_output, but commands produce ./engine_outputs/ (and engine_outputs_fp8). Align wording.

-After build, we can find a `./engine_output` directory, it is ready for running DiT model with TensorRT LLM now.
+After build, the `./engine_outputs/` (or `./engine_outputs_fp8/`) directory is ready for running the DiT model with TensorRT LLM.
docs/source/features/disagg-serving.md (1)

83-84: Anchor case: fix link fragment (MD051).

Use lower-case fragment to match the header id.

-Please refer to the following section for details [Environment Variables](#Environment-Variables).
+Please refer to the following section for details [Environment Variables](#environment-variables).
examples/openai_triton/README.md (1)

3-4: Branding looks good; minor grammar nit

Keep the rename; adjust “Specially” → “Especially” for clarity.

-The typical approach to integrate a kernel into TensorRT LLM is to create TensorRT plugins.
-Specially for integrating OpenAI Triton kernels, there are two methods:
+The typical approach to integrate a kernel into TensorRT LLM is to create TensorRT plugins.
+Especially for integrating OpenAI Triton kernels, there are two methods:
examples/cpp/executor/README.md (4)

13-13: Fix potentially broken link target

The “source:” prefix in [build_wheel.py](source:scripts/build_wheel.py) won’t resolve in GitHub Markdown. Use a relative repo path.

-To build the examples, you first need to build the TensorRT LLM C++ shared libraries (`libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so`) using the [`build_wheel.py`](source:scripts/build_wheel.py) script. Alternatively, if you have already build the TensorRT LLM libraries, you can modify the provided `CMakeLists.txt` such that the `libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so` are imported properly.
+To build the examples, you first need to build the TensorRT LLM C++ shared libraries (`libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so`) using [`scripts/build_wheel.py`](../../../scripts/build_wheel.py). Alternatively, if you have already built the TensorRT LLM libraries, you can modify the provided `CMakeLists.txt` such that the `libtensorrt_llm.so` and `libnvinfer_plugin_tensorrt_llm.so` are imported properly.

15-16: Add language to fenced block (markdownlint MD040)

Annotate the shell block.

-Once the TensorRT LLM libraries are built, you can run
-
-```
+Once the TensorRT LLM libraries are built, you can run
+
+```bash
 mkdir build
 cd build
 cmake ..
 make -j

29-29: Leftover branding

“TRT-LLM engine” should be “TensorRT LLM engine” unless you intentionally refer to a CLI artifact.

-Use `trtllm-build` to build the TRT-LLM engine.
+Use `trtllm-build` to build the TensorRT LLM engine.

35-36: Tiny grammar nits (“get run” → “run”)

Optional polish; improves readability.

-From the `examples/cpp/executor/build` folder, you can get run the `executorExampleBasic` example with:
+From the `examples/cpp/executor/build` folder, you can run the `executorExampleBasic` example with:

(repeat for the other two occurrences)

Also applies to: 45-46, 112-116

examples/models/contrib/chatglm2-6b/README.md (1)

29-33: List marker style (markdownlint MD004)

The file predominantly uses “-” list markers; switch the “*” item to “-” for consistency.

-* [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
+- [`examples/models/core/glm-4-9b/convert_checkpoint.py`](../../../glm-4-9b/convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT LLM format.
examples/models/contrib/mpt/README.md (1)

15-16: Duplicate subsection numbering (“1.6” twice)

Renumber to keep the table of contents and anchors clean.

Also applies to: 100-107

examples/models/core/recurrentgemma/README.md (1)

112-113: Tiny grammar nit

“After getting checkpoint” → “After getting the checkpoint(s)”.

-After getting checkpoint, we can use `trtllm-build` command to build TensorRT LLM engines from TensorRT LLM checkpoints.
+After getting the checkpoint(s), we can use the `trtllm-build` command to build TensorRT LLM engines from TensorRT LLM checkpoints.
examples/models/core/multimodal/README.md (1)

595-596: Mixed branding in one sentence

Use “TensorRT LLM” consistently; keep “trtllm” only for CLI.

-[LLaVA](...) and [VILA](...) are both visual language models (VLM) that can be deployed in TensorRT LLM with many quantization options. [LLaVA‑NeXT](...) is an extension of LLaVA. TRT-LLM currently supports [Mistral-7b](...) and [ Nous‑Hermes‑2‑Yi‑34B](...) variant of LLaVA-NeXT. [LLaVA-OneVision](...) is another extension of LLaVA.
+[LLaVA](...) and [VILA](...) are both visual language models (VLM) that can be deployed in TensorRT LLM with many quantization options. [LLaVA‑NeXT](...) is an extension of LLaVA. TensorRT LLM currently supports [Mistral‑7b](...) and [Nous‑Hermes‑2‑Yi‑34B](...) variants of LLaVA‑NeXT. [LLaVA‑OneVision](...) is another extension of LLaVA.
examples/models/core/deepseek_v3/README.md (4)

3-9: One leftover “TensorRT‑LLM” and acronym in this intro.

Prefer “TensorRT LLM” in prose; keep commands/paths as-is.

-**DeepSeek-R1 and DeepSeek-V3 share exact same model architecture other than weights differences, and share same code path in TensorRT-LLM, for brevity we only provide one model example, the example command to be used interchangeably by only replacing the model name to the other one**.
+**DeepSeek-R1 and DeepSeek-V3 share the exact same model architecture other than weight differences, and share the same code path in TensorRT LLM. For brevity, we provide one model example; use the example commands interchangeably by replacing the model name.** 

Also consider:

-... build TensorRT LLM from source and start a TRT-LLM docker container.
+... build TensorRT LLM from source and start a TRT LLM Docker container.

393-394: Grammar + branding in Dynamo section.

Use article “an” and align link text with branding.

-Dynamo supports TensorRT LLM as one of its inference engine. For details on how to use TensorRT LLM with Dynamo please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md)
+Dynamo supports TensorRT LLM as one of its inference engines. For details on how to use TensorRT LLM with Dynamo, please refer to [LLM Deployment Examples using TensorRT LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md).

608-619: Minor wording cleanup (optional).

“e2e” → “end‑to‑end” for external docs readability.

-TensorRT LLM uses DeepGEMM for DeepSeek-V3/R1, which provides significant e2e performance boost on Hopper GPUs.
+TensorRT LLM uses DeepGEMM for DeepSeek‑V3/R1, which provides significant end‑to‑end performance boost on Hopper GPUs.

720-721: Fix markdownlint MD050: use asterisks for bold, not underscores.

Also remove inline emphasis redundancy.

-TensorRT LLM supports W(INT)4-A(FP)8 for DeepSeek on __Hopper__. Activations and weights are quantized at per-tensor and per-group (1x128) granularity respectively for MoE, and FP8 block scaling is preserved for dense layers.
+TensorRT LLM supports W(INT)4‑A(FP)8 for DeepSeek on **Hopper**. Activations and weights are quantized at per‑tensor and per‑group (1×128) granularity respectively for MoE, and FP8 block scaling is preserved for dense layers.
examples/models/contrib/mmdit/README.md (1)

6-10: Path label/target mismatch and small grammar fix.

The link label points to sd3/model.py but the target is mmdit_sd3/model.py. Also “There are main files” → “There are two main files”.

-The TensorRT LLM implementation of MMDiT can be found in [tensorrt_llm/models/sd3/model.py](../../../../tensorrt_llm/models/mmdit_sd3/model.py). The TensorRT LLM MMDiT (SD 3/3.5) example code is located in [`examples/models/contrib/mmdit`](./). There are main files to build and run MMDiT with TensorRT-LLM:
+The TensorRT LLM implementation of MMDiT can be found in [tensorrt_llm/models/mmdit_sd3/model.py](../../../../tensorrt_llm/models/mmdit_sd3/model.py). The TensorRT LLM MMDiT (SD 3/3.5) example code is located in [`examples/models/contrib/mmdit`](./). There are two main files to build and run MMDiT with TensorRT LLM:
docs/source/blogs/quantization-in-TRT-LLM.md (1)

8-8: Typo: “easy‑of‑use” → “ease of use”.

-This toolkit is designed with easy-of-use in mind.
+This toolkit is designed with ease of use in mind.
examples/redrafter/README.md (1)

8-8: Fix the inconsistent link reference.

The link text references "model.py" but the URL points to "drafter.py". This creates confusion about the actual location of the drafter component.

Apply this fix to correct the link:

-The TensorRT-LLM's ReDrafter implementation can be found in [tensorrt_llm/models/redrafter/model.py](../../tensorrt_llm/models/redrafter/model.py), which combines the base model and the drafter definition which can be found in [tensorrt_llm/models/redrafter/model.py](../../tensorrt_llm/models/redrafter/drafter.py).
+The TensorRT LLM's ReDrafter implementation can be found in [tensorrt_llm/models/redrafter/model.py](../../tensorrt_llm/models/redrafter/model.py), which combines the base model and the drafter definition which can be found in [tensorrt_llm/models/redrafter/drafter.py](../../tensorrt_llm/models/redrafter/drafter.py).

@nv-guomingz (Collaborator, Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #19298 [ run ] triggered by Bot

@nv-guomingz added the "Release Blocker" label (PRs blocking the final release build or branching out the release branch) on Sep 19, 2025
@tensorrt-cicd (Collaborator) commented:

PR_Github #19298 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #423 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@nv-guomingz merged commit af3ea37 into NVIDIA:release/1.0 on Sep 19, 2025
5 checks passed
dominicshanshan pushed commits to dominicshanshan/TensorRT-LLM that referenced this pull request (Sep 23–25, 2025)
chzblych pushed a commit that referenced this pull request Sep 25, 2025
@nv-guomingz deleted the user/guomingz/update_homepage_brand_new branch on September 30, 2025 07:57