
Conversation

@dominicshanshan (Collaborator) commented Sep 8, 2025

Summary by CodeRabbit

  • New Features

    • Introduced trtllm-eval CLI for offline accuracy evaluation.
    • Added a public BuildConfig for engine build configuration.
    • Exposed LoRARequest in the LLM API.
  • Documentation

    • Rebranded “TensorRT-LLM” to “TensorRT LLM” and overhauled navigation/installation.
    • Added/expanded guides: AutoDeploy (and advanced), Disaggregated Serving, Attention backends, KV Cache, Long-sequence, Overlap Scheduler, IFB/Scheduler, Parallel Strategies, Quantization, Sampling, Speculative Decoding, Multi‑modality.
    • New model support/feature matrices, deployment recipes, and benchmarking/profiling guides.
    • Updated LoRA docs with DoRA scales in the weights format and expanded usage examples.
    • Refreshed architecture overview and blogs.
  • Known Issues

    • Unresolved merge markers present in the Llama 4 Scout quick‑start doc.

Description

Only cherry-picks #6696, #7549, and #7554 (@nv-guomingz) for the massive documentation change in the release/1.0 branch.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force-run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
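
For reference, an illustrative invocation that combines several of the options above (the stage and GPU names are the example values from this help text, not necessarily real pipeline stages):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"

/bot run --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp"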

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous; skipping tests without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous; reusing results without careful validation can break the top of tree.

coderabbitai bot (Contributor) commented Sep 8, 2025

📝 Walkthrough

Extensive documentation reorganization and branding updates across the docs tree, addition of many new feature/how-to pages, and updates to deployment guides. Code changes introduce a new public BuildConfig dataclass in tensorrt_llm/builder.py and add LoRARequest to tensorrt_llm.llmapi exports.

Changes

Cohort / File(s) Summary
Branding rename (TensorRT‑LLM → TensorRT LLM)
docs/source/**/*.md, docs/source/**/*.rst, docs/source/blogs/*, docs/source/torch/*.md
Consistent renaming in titles, headings, prose, and captions; links and anchors adjusted where applicable. No logic changes.
Docs IA overhaul & index/nav updates
docs/source/index.rst, docs/source/overview.md, docs/source/installation/index.rst, docs/source/installation/containers.md, docs/source/installation/linux.md, docs/source/installation/build-from-source-linux.md, docs/source/reference/support-matrix.md
Reworked ToC, consolidated installation path, added anchors, moved sections into nested toctrees; minor anchor removal in support-matrix.
Architecture & Advanced docs refresh
docs/source/architecture/*.md, docs/source/advanced/*.md
Architecture overview rewritten; multiple pages updated/retargeted links; DoRA LoRA weights shape documented; speculative decoding cross-links adjusted.
New features and deep-dives (PyTorch backend)
docs/source/features/attention.md, .../kvcache.md, .../long-sequence.md, .../sampling.md, .../speculative-decoding.md, .../paged-attention-ifb-scheduler.md, .../parallel-strategy.md, .../quantization.md, .../multi-modality.md, .../feature-combination-matrix.md, .../disagg-serving.md, .../checkpoint-loading.md, .../overlap-scheduler.md
Added comprehensive guides covering attention backends, KV cache system, long-sequence methods, sampling, speculative decoding, scheduler, parallel strategies, quantization, multimodality, disaggregated serving, checkpoint loading, and feature matrix.
AutoDeploy docs (new)
docs/source/features/auto_deploy/auto-deploy.md, .../advanced/*.md, docs/source/examples/dynamo_k8s_example.rst
Introduces AutoDeploy overview, support matrix, advanced workflow/config/logging, benchmarking with trtllm-bench, and example runs; adds Kubernetes example.
KV cache examples (new)
docs/source/examples/kvcacheconfig.md, docs/source/examples/kvcacheretentionconfig.md
New example guides for KvCacheConfig and KvCacheRetentionConfig with usage snippets.
Deployment guides and recipes
docs/source/deployment-guide/index.rst, .../quick-start-recipe-for-deepseek-r1-on-trtllm.md, .../quick-start-recipe-for-llama3.3-70b-on-trtllm.md, .../quick-start-recipe-for-llama4-scout-on-trtllm.md
Adds Model Recipes section; content updates and formatting; one file shows unresolved merge-conflict markers.
Command docs
docs/source/commands/trtllm-bench.rst, docs/source/commands/trtllm-eval.rst, docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md
Branding updates; new trtllm-eval CLI doc with tasks, usage, and Click directive; benchmark guide wording updates.
Blogs updates
docs/source/blogs/*.md, docs/source/blogs/tech_blog/*.md
Branding changes; select content additions (e.g., FP4 MoE notes), ToC reshuffles; link target updates; future work items added in disaggregated serving blog.
Config/Conf minor
docs/source/conf.py
Trailing comma removed in myst_substitutions entry.
Examples link fix
examples/wide_ep/README.md
Updated disaggregated serving troubleshooting link to new docs path.
Quick Start & Torch landing
docs/source/quick-start-guide.md, docs/source/torch.md, docs/source/architecture/checkpoint.md, docs/source/architecture/overview.md
Quick Start restructured, ports exposed in example; checkpoint/overview branding and narrative revamp.
LLM API export change
tensorrt_llm/llmapi/__init__.py
Adds LoRARequest to public exports.
Build pipeline config (new public dataclass)
tensorrt_llm/builder.py
Adds BuildConfig with many fields and helpers (serialization, defaults, KV cache type reconciliation, updates, JSON loading). Intended as structured input for engine building.
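
A minimal illustrative sketch of the two code-level changes above (not taken from the PR diff; the LoRARequest constructor arguments and any BuildConfig fields beyond those confirmed in the review comments below are assumptions):

from tensorrt_llm.builder import BuildConfig   # new public dataclass in this PR
from tensorrt_llm.llmapi import LoRARequest    # newly exported in this PR

# Structured input for engine building; update_from_dict is one of the helpers
# mentioned above.
build_config = BuildConfig(max_seq_len=4096, gather_context_logits=True)
build_config.update_from_dict({"gather_generation_logits": True})

# Assumed constructor: adapter name, integer id, path to the adapter weights.
lora_request = LoRARequest("my-adapter", 1, "/path/to/lora_adapter")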

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

1.0_doc, Documentation


coderabbitai bot (Contributor) left a comment


Actionable comments posted: 38

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (18)
docs/source/features/checkpoint-loading.md (1)

319-327: Typo: “asscoiated” → “associated”.

Fix spelling.

-By setting the model name, the registered mapper will be asscoiated with the specific model.
+By setting the model name, the registered mapper will be associated with the specific model.
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (5)

79-98: Low‑latency 1x GPU example uses undefined ${num_gpus} and incorrect TP/EP.

For a 1x GPU block, set TP=1 and EP=1; don’t reference ${num_gpus} here.

Apply:

-    --tp ${num_gpus} \
-    --ep 1 \
+    --tp 1 \
+    --ep 1 \

Optional: show max_batch_size=1 for true minimal latency.


138-146: Max‑throughput block label vs content mismatch.

This block title says “1x B200/GB200/H200” but uses num_gpus=8 and --tp/--ep ${num_gpus}. Rename the summary to “8x …” or set num_gpus=1 and adjust flags.

Apply (if intended to be 8 GPUs):

-<details open> <summary>1x B200/GB200/H200</summary>
+<details open> <summary>8x B200/GB200/H200</summary>

172-191: Section title branding + single‑rank serve example uses TP=8/EP=8 with mpirun -n 1.

  • Use “TensorRT LLM” branding.
  • With -n 1, --tp_size and --ep_size must both be 1. Either increase -n to tp*ep or reduce sizes to 1 for the single‑GPU example.

Apply:

-## Launch the TensorRT-LLM Server
+## Launch the TensorRT LLM Server
@@
-mpirun -n 1 --oversubscribe --allow-run-as-root \
-trtllm-serve  openai/gpt-oss-120b \
+mpirun -n 1 --oversubscribe --allow-run-as-root \
+trtllm-serve openai/gpt-oss-120b \
@@
-  --tp_size 8 \
-  --ep_size 8 \
-  --max_batch_size 640 \
+  --tp_size 1 \
+  --ep_size 1 \
+  --max_batch_size ${max_batch_size} \

If you want to show an 8‑GPU serve example, add a separate details block with mpirun -n 8 and --tp_size/--ep_size that multiply to 8.


268-336: Sanitize the example response (remove internal “analysis” content).

The sample JSON embeds non‑API “analysis” text and very long content. Replace with a short, realistic OpenAI‑compatible response.

Apply:

-```bash
-{ ... very long object with internal analysis ... }
-```
+```json
+{
+  "id": "chatcmpl-123",
+  "object": "chat.completion",
+  "created": 1754358426,
+  "model": "openai/gpt-oss-120b",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "NVIDIA’s inference advantage comes from Tensor Cores, an optimized software stack (TensorRT + Triton), and high-bandwidth interconnects (NVLink/NVSwitch) that deliver low latency and high throughput at scale."
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 17,
+    "completion_tokens": 42,
+    "total_tokens": 59
+  }
+}
+```

344-356: Remove outdated, contradictory MoE section (duplicates with conflicting guidance).

This “(H200/H100 Only)” block contradicts earlier guidance (CUTLASS used for throughput; TRITON recommended for H200). It should be deleted or reconciled in one canonical section above.

Apply:

-## (H200/H100 Only) Using OpenAI Triton Kernels for MoE
-...
-  backend: TRITON
-```
+<!-- Removed: duplicate/contradictory MoE section. See "(H200 Only) Using OpenAI Triton Kernels for MoE" above. -->
docs/source/examples/customization.md (2)

7-13: Import LLM in the quantization snippet

LLM is used but not imported in this snippet.

-from tensorrt_llm.llmapi import QuantConfig, QuantAlgo
+from tensorrt_llm.llmapi import LLM, QuantConfig, QuantAlgo

90-96: Inconsistent API: skip_tokenizer_init belongs to LLM(), not generate()

The text says to pass skip_tokenizer_init=True when creating LLM, but the code passes it to generate(). Align the example with the actual API.

-llm = LLM(<llama_model_path>)
-for output in llm.generate([[32, 12]], skip_tokenizer_init=True):
+llm = LLM(<llama_model_path>, skip_tokenizer_init=True)
+for output in llm.generate([[32, 12]]):
     print(output)
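
For reference, the corrected snippet in full (a hedged consolidation of the suggestion above; <llama_model_path> stays a placeholder from the original doc):

from tensorrt_llm.llmapi import LLM

llm = LLM(<llama_model_path>, skip_tokenizer_init=True)  # tokenizer init is skipped here, not in generate()
for output in llm.generate([[32, 12]]):
    print(output)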
docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md (1)

230-231: Fix “Placement” typos in headings/body.

There are split words “Placemen t” in headings/body. Replace with “Placement”.

-  * Orchestrate the process (**Update Weights \& Placemen**t component)
+  * Orchestrate the process (**Update Weights & Placement** component)
-For the **Update Weights \& Placemen**t component, we identified two design choices:
+For the **Update Weights & Placement** component, we identified two design choices:

Also applies to: 241-247

docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md (1)

161-169: Sentence fragment in “Explanation” bullet.

Complete the sentence for clarity.

-- `trtllm-bench`: A CLI benchmarking utility that aims to make it easier for users to reproduce our officially published. See [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details.
+- `trtllm-bench`: A CLI benchmarking utility that helps users reproduce our officially published results. See [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details.
docs/source/models/adding-new-model.md (1)

199-204: Update example path to reflect actual directory
In docs/source/models/adding-new-model.md (lines 199–204), replace

python examples/pytorch/out_of_tree_example/main.py

with

python examples/llm-api/out_of_tree_example/main.py

and confirm that examples/llm-api/out_of_tree_example/main.py defines a main() entrypoint and runs with the current API.

tensorrt_llm/builder.py (4)

520-552: Fix dataclass types and defaults (bools, Optional).

Current types use int for booleans and non-Optional annotations with None defaults.

-    max_seq_len: int = None
+    max_seq_len: Optional[int] = None
@@
-    kv_cache_type: KVCacheType = None
-    gather_context_logits: int = False
-    gather_generation_logits: int = False
+    kv_cache_type: Optional[KVCacheType] = None
+    gather_context_logits: bool = False
+    gather_generation_logits: bool = False
@@
-    input_timing_cache: str = None
+    input_timing_cache: Optional[str] = None
@@
-    visualize_network: str = None
+    visualize_network: Optional[str] = None

734-743: Make update_from_dict robust: convert enums and merge nested configs.

Avoids breaking types when users pass strings/ints or nested dicts.

-    def update_from_dict(self, config: dict):
-        for name, value in config.items():
-            if not hasattr(self, name):
-                raise AttributeError(
-                    f"{self.__class__} object has no attribute {name}")
-            setattr(self, name, value)
+    def update_from_dict(self, config: dict):
+        for name, value in config.items():
+            if name == "plugin_config" and isinstance(value, dict):
+                self.plugin_config.update_from_dict(value)
+                continue
+            if name == "lora_config" and isinstance(value, dict):
+                self.lora_config.update_from_dict(value)
+                continue
+            if name == "auto_parallel_config" and isinstance(value, dict):
+                self.auto_parallel_config.update_from_dict(value)
+                continue
+            if name == "kv_cache_type":
+                if value is None or isinstance(value, KVCacheType):
+                    self.kv_cache_type = value
+                else:
+                    self.kv_cache_type = KVCacheType.from_string(str(value))
+                continue
+            if name == "speculative_decoding_mode":
+                if isinstance(value, SpeculativeDecodingMode):
+                    self.speculative_decoding_mode = value
+                else:
+                    # accept int or name
+                    try:
+                        self.speculative_decoding_mode = SpeculativeDecodingMode(value)
+                    except Exception:
+                        self.speculative_decoding_mode = SpeculativeDecodingMode[str(value)]
+                continue
+            if not hasattr(self, name):
+                raise AttributeError(f"{self.__class__} object has no attribute {name}")
+            setattr(self, name, value)
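
A hedged usage example of the behavior this suggestion enables (the accepted string spelling for KVCacheType, e.g. "paged", is an assumption):

cfg = BuildConfig()
# Enum-valued fields can now be passed as plain strings or ints; nested
# plugin/lora/auto-parallel configs can be passed as dicts.
cfg.update_from_dict({"kv_cache_type": "paged", "gather_context_logits": True})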

1081-1097: Add missing dtype mapping for managed weights deserialization.

Reading int8 managed weights fails with “Unsupported dtype: I8”.

-            elif dtype == "I32":
+            elif dtype == "I32":
                 dtype = np.int32
+            elif dtype == "I8":
+                dtype = np.int8
             else:
                 raise RuntimeError(f"Unsupported dtype: {dtype}")

1247-1259: Remove outdated quantization restrictions for SM≥100
Blackwell (SM≥100) now supports INT8/INT4 weight-only and SmoothQuant workflows in TensorRT-LLM (added in v0.17). Drop or revise these RuntimeError checks in tensorrt_llm/builder.py (lines 1247–1259) to allow these quant modes.

docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (2)

233-258: Fix inconsistent sample response and branding.

The prose claims the response begins “New York is a state ...” but the JSON shows unrelated text. This confuses users validating their setup.

Replace the example payload with output that matches the prompt and keep it short:

-Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
+Here is an example response showing the TensorRT LLM server completion for the prompt.
-{"id":"cmpl-...","object":"text_completion","created":1754294810,"model":"deepseek-ai/DeepSeek-R1-0528","choices":[{"index":0,"text":" / by Megan Stine ; illustrated by John Hinderliter.\n\nBook | Gross","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}
+{"id":"cmpl-...","object":"text_completion","created": 1754294810,"model":"deepseek-ai/DeepSeek-R1-0528","choices":[{"index":0,"text":"New York is a state in the northeastern United States.","finish_reason":"length"}],"usage":{"prompt_tokens":6,"completion_tokens":16,"total_tokens":22}}

263-263: Fix PyTorch docs URL.

The path has “docs” twice.

-https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
+https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1)

33-44: Update to latest published NGC image tag
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (lines 33–44, 46–53): replace

nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6

with

nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc3

1.1.0rc3 is the current latest published release on NGC (catalog.ngc.nvidia.com, github.com)
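
For reference, pulling the updated image would look like the following (tag as cited above; verify the latest tag on NGC before use):

docker pull nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc3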

@dominicshanshan force-pushed the mi-release-1.0-4 branch 4 times, most recently from b1ba8ae to d22f5df on September 8, 2025 at 13:06
@nv-guomingz (Collaborator) left a comment


Overall LGTM. @dominicshanshan please trigger the weekly release process once those comments are addressed, thanks.

@dominicshanshan force-pushed the mi-release-1.0-4 branch 4 times, most recently from 660c9fc to e736627 on September 9, 2025 at 03:15
@nv-guomingz (Collaborator) commented:

/bot skip --comment "docs change only"

@tensorrt-cicd (Collaborator) commented:

PR_Github #18140 [ skip ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #18140 [ skip ] completed with state SUCCESS
Skipping testing for commit b6d67ad

@nv-guomingz merged commit 7f3f658 into NVIDIA:main on Sep 9, 2025
5 checks passed
nv-guomingz added a commit to nv-guomingz/TensorRT-LLM that referenced this pull request Sep 9, 2025
chzblych pushed a commit that referenced this pull request Sep 9, 2025
gergely-magyar pushed a commit to gergely-magyar/TensorRT-LLM that referenced this pull request Sep 9, 2025
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025

5 participants