[None][chore] Mass integration of release/1.0 - 4th (release/1.0 doc change mainly) #7607
Conversation
Force-pushed from 8e6ccaa to 2537515
📝 Walkthrough
Extensive documentation reorganization and branding updates across the docs tree, addition of many new feature and how-to pages, and updates to deployment guides. Code changes introduce a new public BuildConfig dataclass in tensorrt_llm/builder.py and add LoRARequest to the tensorrt_llm.llmapi exports.
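The two code-level changes called out above are small API additions. As a rough illustration (the constructor arguments below are assumptions, not values taken from this PR), they could be exercised like this:

```python
# Illustrative sketch only: BuildConfig is the dataclass made public in
# tensorrt_llm/builder.py and LoRARequest is newly exported from
# tensorrt_llm.llmapi, per the walkthrough above. Field and argument
# names below are assumptions.
from tensorrt_llm.builder import BuildConfig
from tensorrt_llm.llmapi import LoRARequest

# Engine-build limits; max_seq_len is one of the fields discussed later in the
# builder.py review comments, max_batch_size is assumed here.
build_config = BuildConfig(max_batch_size=8, max_seq_len=4096)

# A LoRA adapter request; the (name, id, path) argument ordering is assumed.
lora_request = LoRARequest("my-adapter", 1, "/path/to/lora/checkpoint")

print(build_config.max_seq_len, lora_request)
```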
Sequence Diagram(s): (omitted)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 38
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (18)
docs/source/features/checkpoint-loading.md (1)

319-327: Typo: “asscoiated” → “associated”. Fix spelling.
-By setting the model name, the registered mapper will be asscoiated with the specific model.
+By setting the model name, the registered mapper will be associated with the specific model.

docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (5)
79-98: Low-latency 1x GPU example uses undefined ${num_gpus} and incorrect TP/EP. For a 1x GPU block, set TP=1 and EP=1; don’t reference ${num_gpus} here.
Apply:
-    --tp ${num_gpus} \
-    --ep 1 \
+    --tp 1 \
+    --ep 1 \
Optional: show max_batch_size=1 for true minimal latency.
138-146: Max-throughput block label vs content mismatch. This block title says “1x B200/GB200/H200” but uses num_gpus=8 and --tp/--ep ${num_gpus}. Rename the summary to “8x …” or set num_gpus=1 and adjust flags.
Apply (if intended to be 8 GPUs):
-<details open> <summary>1x B200/GB200/H200</summary>
+<details open> <summary>8x B200/GB200/H200</summary>
172-191: Section title branding + single-rank serve example uses TP=8/EP=8 with mpirun -n 1.
- Use “TensorRT LLM” branding.
- With -n 1, --tp_size and --ep_size must both be 1. Either increase -n to tp*ep or reduce the sizes to 1 for the single-GPU example.
Apply:
-## Launch the TensorRT-LLM Server
+## Launch the TensorRT LLM Server
@@
-mpirun -n 1 --oversubscribe --allow-run-as-root \
-trtllm-serve openai/gpt-oss-120b \
+mpirun -n 1 --oversubscribe --allow-run-as-root \
+trtllm-serve openai/gpt-oss-120b \
@@
-    --tp_size 8 \
-    --ep_size 8 \
-    --max_batch_size 640 \
+    --tp_size 1 \
+    --ep_size 1 \
+    --max_batch_size ${max_batch_size} \
If you want to show an 8-GPU serve example, add a separate details block with mpirun -n 8 and --tp_size/--ep_size values that multiply to 8.
268-336: Sanitize the example response (remove internal “analysis” content). The sample JSON embeds non-API “analysis” text and very long content. Replace with a short, realistic OpenAI-compatible response.
Apply:
-```bash
-{ ... very long object with internal analysis ... }
-```
+```json
+{
+  "id": "chatcmpl-123",
+  "object": "chat.completion",
+  "created": 1754358426,
+  "model": "openai/gpt-oss-120b",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "NVIDIA’s inference advantage comes from Tensor Cores, an optimized software stack (TensorRT + Triton), and high-bandwidth interconnects (NVLink/NVSwitch) that deliver low latency and high throughput at scale."
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 17,
+    "completion_tokens": 42,
+    "total_tokens": 59
+  }
+}
+```
344-356: Remove outdated, contradictory MoE section (duplicates with conflicting guidance). This “(H200/H100 Only)” block contradicts earlier guidance (CUTLASS used for throughput; TRITON recommended for H200). It should be deleted or reconciled in one canonical section above.
Apply:
-## (H200/H100 Only) Using OpenAI Triton Kernels for MoE
-...
-    backend: TRITON
-```
+<!-- Removed: duplicate/contradictory MoE section. See "(H200 Only) Using OpenAI Triton Kernels for MoE" above. -->

docs/source/examples/customization.md (2)
7-13: Import LLM in the quantization snippet. LLM is used but not imported in this snippet.
-from tensorrt_llm.llmapi import QuantConfig, QuantAlgo
+from tensorrt_llm.llmapi import LLM, QuantConfig, QuantAlgo
90-96: Inconsistent API: skip_tokenizer_init belongs to LLM(), not generate(). The text says to pass skip_tokenizer_init=True when creating LLM, but the code passes it to generate(). Align the example with the actual API.
-llm = LLM(<llama_model_path>)
-for output in llm.generate([[32, 12]], skip_tokenizer_init=True):
+llm = LLM(<llama_model_path>, skip_tokenizer_init=True)
+for output in llm.generate([[32, 12]]):
     print(output)
docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md (1)

230-231: Fix “Placement” typos in headings/body. There are split words “Placemen t” in headings and body text. Replace with “Placement”.
- * Orchestrate the process (**Update Weights \& Placemen**t component)
+ * Orchestrate the process (**Update Weights & Placement** component)
-For the **Update Weights \& Placemen**t component, we identified two design choices:
+For the **Update Weights & Placement** component, we identified two design choices:
Also applies to: 241-247
docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md (1)

161-169: Sentence fragment in “Explanation” bullet. Complete the sentence for clarity.
-- `trtllm-bench`: A CLI benchmarking utility that aims to make it easier for users to reproduce our officially published. See [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details.
+- `trtllm-bench`: A CLI benchmarking utility that helps users reproduce our officially published results. See [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details.

docs/source/models/adding-new-model.md (1)
199-204: Update example path to reflect actual directory.
In docs/source/models/adding-new-model.md (lines 199–204), replace
python examples/pytorch/out_of_tree_example/main.py
with
python examples/llm-api/out_of_tree_example/main.py
and confirm that examples/llm-api/out_of_tree_example/main.py defines a main() entrypoint and runs with the current API.

tensorrt_llm/builder.py (4)
520-552: Fix dataclass types and defaults (bools, Optional). Current types use int for booleans and non-Optional annotations with None defaults.
- max_seq_len: int = None
+ max_seq_len: Optional[int] = None
@@
- kv_cache_type: KVCacheType = None
- gather_context_logits: int = False
- gather_generation_logits: int = False
+ kv_cache_type: Optional[KVCacheType] = None
+ gather_context_logits: bool = False
+ gather_generation_logits: bool = False
@@
- force_num_profiles: Optional[int] = None
+ force_num_profiles: Optional[int] = None
@@
- input_timing_cache: str = None
+ input_timing_cache: Optional[str] = None
@@
- visualize_network: str = None
+ visualize_network: Optional[str] = None
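The pattern behind that suggestion, shown on a throwaway dataclass rather than the real BuildConfig, is simply to use bool for flags and Optional[...] for any field that defaults to None:

```python
# Generic illustration of the typing fix (not TRT-LLM code): flags are annotated
# as bool, and fields with a None default are Optional.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExampleConfig:
    max_seq_len: Optional[int] = None         # was: int = None
    gather_context_logits: bool = False       # was: int = False
    input_timing_cache: Optional[str] = None  # was: str = None
```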
734-743: Make update_from_dict robust: convert enums and merge nested configs. Avoids breaking types when users pass strings/ints or nested dicts.
- def update_from_dict(self, config: dict):
-     for name, value in config.items():
-         if not hasattr(self, name):
-             raise AttributeError(
-                 f"{self.__class__} object has no attribute {name}")
-         setattr(self, name, value)
+ def update_from_dict(self, config: dict):
+     for name, value in config.items():
+         if name == "plugin_config" and isinstance(value, dict):
+             self.plugin_config.update_from_dict(value)
+             continue
+         if name == "lora_config" and isinstance(value, dict):
+             self.lora_config.update_from_dict(value)
+             continue
+         if name == "auto_parallel_config" and isinstance(value, dict):
+             self.auto_parallel_config.update_from_dict(value)
+             continue
+         if name == "kv_cache_type":
+             if value is None or isinstance(value, KVCacheType):
+                 self.kv_cache_type = value
+             else:
+                 self.kv_cache_type = KVCacheType.from_string(str(value))
+             continue
+         if name == "speculative_decoding_mode":
+             if isinstance(value, SpeculativeDecodingMode):
+                 self.speculative_decoding_mode = value
+             else:
+                 # accept int or name
+                 try:
+                     self.speculative_decoding_mode = SpeculativeDecodingMode(value)
+                 except Exception:
+                     self.speculative_decoding_mode = SpeculativeDecodingMode[str(value)]
+             continue
+         if not hasattr(self, name):
+             raise AttributeError(f"{self.__class__} object has no attribute {name}")
+         setattr(self, name, value)
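With that hardening in place, a call like the following hypothetical one would keep field types intact instead of overwriting them with raw strings or dicts:

```python
# Hypothetical usage of the hardened update_from_dict; the "paged" string value
# and its conversion path are illustrative assumptions.
from tensorrt_llm.builder import BuildConfig

config = BuildConfig()
config.update_from_dict({
    "max_seq_len": 8192,
    "kv_cache_type": "paged",  # would be converted via KVCacheType.from_string
})
print(config.max_seq_len, config.kv_cache_type)
```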
1081-1097: Add missing dtype mapping for managed weights deserialization. Reading int8 managed weights fails with “Unsupported dtype: I8”.
- elif dtype == "I32":
+ elif dtype == "I32":
      dtype = np.int32
+ elif dtype == "I8":
+     dtype = np.int8
  else:
      raise RuntimeError(f"Unsupported dtype: {dtype}")
1247-1259: Remove outdated quantization restrictions for SM≥100.
Blackwell (SM≥100) now supports INT8/INT4 weight-only and SmoothQuant workflows in TensorRT-LLM (added in v0.17). Drop or revise these RuntimeError checks in tensorrt_llm/builder.py (lines 1247–1259) to allow these quant modes.

docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (2)
233-258: Fix inconsistent sample response and branding. The prose claims the response begins “New York is a state ...” but the JSON shows unrelated text. This confuses users validating their setup.
Replace the example payload with output that matches the prompt and keep it short:
-Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
+Here is an example response showing the TensorRT LLM server completion for the prompt.
-{"id":"cmpl-...","object":"text_completion","created":1754294810,"model":"deepseek-ai/DeepSeek-R1-0528","choices":[{"index":0,"text":" / by Megan Stine ; illustrated by John Hinderliter.\n\nBook | Gross","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}
+{"id":"cmpl-...","object":"text_completion","created":1754294810,"model":"deepseek-ai/DeepSeek-R1-0528","choices":[{"index":0,"text":"New York is a state in the northeastern United States.","finish_reason":"length"}],"usage":{"prompt_tokens":6,"completion_tokens":16,"total_tokens":22}}
263-263: Fix PyTorch docs URL. The path has “docs” twice.
-https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
+https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1)
33-44: Update to latest published NGC image tag.
In docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (lines 33–44, 46–53), replace
nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6
with
nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc3
1.1.0rc3 is the current latest published release on NGC (catalog.ngc.nvidia.com, github.com).
Force-pushed from b1ba8ae to d22f5df
Overall LGTM. @dominicshanshan please trigger the weekly release process once those comments are addressed, thanks.
Force-pushed from 660c9fc to e736627
Force-pushed from e736627 to 8e63a3f
Signed-off-by: nv-guomingz <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Force-pushed from 8e63a3f to b6d67ad
/bot skip --comment "docs change only"
PR_Github #18140 [ skip ] triggered by Bot
PR_Github #18140 [ skip ] completed with state
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: nv-guomingz <[email protected]> Signed-off-by: Gergely Magyar <[email protected]>
Summary by CodeRabbit
New Features
Documentation
Known Issues
Description
Only cherry-picks #6696, #7549, #7554 (@nv-guomingz) for the massive doc changes in the release/1.0 branch.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional) pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline plus the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.