[None][doc] Create deployment guide for Llama4 Scout FP8 and NVFP4 #6550
📝 **Walkthrough**

A new quick start guide for deploying and benchmarking the Llama4 Scout 17B model using TensorRT-LLM has been added. Additionally, the pre-commit configuration was updated to exclude Markdown files from the trailing-whitespace check, in addition to patch files.

**Sequence Diagram(s):** Not applicable; changes are documentation and config-related only.

**Estimated code review effort:** 🎯 2 (Simple) | ⏱️ ~7 minutes
Actionable comments posted: 2
🧹 Nitpick comments (5)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (5)
**25-30**: Model bullet wording is inconsistent with link target

The bullet claims "NVFP4 model" but the link points to a plain "FP4" weight. Either rename the anchor to match the file name or change the descriptive text so the term used is consistent across the doc.

```diff
-* NVFP4 model: [Llama-4-Scout-17B-16E-Instruct-FP4](https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP4)
+* FP4 (NVFP4) model: [Llama-4-Scout-17B-16E-Instruct-FP4](https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP4)
```
**64-71**: YAML example hard-codes `dtype: fp8`

The sample config is under the "recommended performance settings" section, yet the guide targets both FP8 and NVFP4. Consider either noting that the snippet is FP8-specific or parameterising the value.

```diff
- dtype: fp8
+# For NVFP4 replace with: dtype: fp4
+ dtype: fp8
```
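As an illustration of that parameterisation, a complete `extra_llm_api_options` file might look like the following sketch. The keys come from the options the guide documents (`kv_cache_config`, `cuda_graph_config`); the exact values, and the `fp4` spelling for the NVFP4 case, are assumptions rather than the guide's verbatim recommendation.

```yaml
# Hypothetical extra_llm_api_options YAML; values are illustrative.
kv_cache_config:
  # fp8 matches the FP8 checkpoint; for NVFP4, fp4 (or auto, which follows
  # the checkpoint's own data type) is the assumed counterpart.
  dtype: fp8
cuda_graph_config:
  # Must not exceed the --max_batch_size command-line option.
  max_batch_size: 512
```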
**98-106**: Heading level jumps violate MD001

`####` immediately follows an `##` header, skipping `###`. Markdown-lint flags this. Dropping one `#` fixes all option sub-sections.

```diff
-#### `--tp_size`
+### `--tp_size`
```

Apply the same adjustment to every option block that currently uses four `#` symbols.
**180-184**: Dangling back-tick at end of sentence

There is an unmatched back-tick after `extra_llm_api_options`, creating a rendering glitch.

```diff
-... options which can be used in the extra\_llm\_api\_options`.`
+... options which can be used in the `extra_llm_api_options`.
```
**322-349**: Missing fenced-code language tag (MD040)

Add a language identifier so syntax highlighting engines don't complain. `text` or `none` is sufficient.

````diff
-```
+```text
````
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
🪛 LanguageTool
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
[grammar] ~13-~13: Ensure spelling is correct
Context: ...xecution. # Access & Licensing To use Llama4 Scout 17B, you must first agree to Meta...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[style] ~117-~117: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ch** backend. #### --max_batch_size
Description: The maximum number of ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~121-~121: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rocessing. #### --max_num_of_tokens
Description: The maximum total numb...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~125-~125: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...scheduled batch. #### --max_seq_len
Description: The maximum possible s...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~129-~129: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ut tokens. #### --trust_remote_code
Description: Allows TensorRT-LLM to...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~144-~144: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: Sets the data type for the KV cache. Default: auto (uses the data ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~148-~148: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...checkpoint). #### cuda_graph_config
Description: A section for configur...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~158-~158: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or which a CUDA graph will be created. Default: 0 **Rec...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~162-~162: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...-max_batch_size command-line option. batch_sizes: A specific list of ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~164-~164: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for. Default: None #### `moe_conf...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~174-~174: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations. Default: CUTLASS #### `atten...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~180-~180: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations. Default: TRTLLM See the [https://g...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~301-~301: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace --mod...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
🪛 markdownlint-cli2 (0.17.2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
98-98: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4
(MD001, heading-increment)
322-322: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Actionable comments posted: 0
🧹 Nitpick comments (4)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (4)
**95-104**: Heading level skips H3 – breaks MD001 rule

`#### --tp_size` (Line 98) is an H4 directly under an H2 (`## Configs and Parameters`, Line 95), skipping the H3 level. To satisfy common Markdown conventions and automated linters, downgrade these option headings to H3 (`###`) or insert an intermediate H3 section.

```diff
-#### `--tp_size`
+### `--tp_size`
```

Apply the same change to all sibling option headings in this section.
**322-349**: Add language identifier to fenced code block

The benchmark sample output block (Line 322) lacks a language hint, triggering MD040 and disabling syntax highlighting in some renderers.

````diff
-```
+```text
````
**181-185**: Extraneous back-tick in hyperlink sentence

The sentence on Line 183 ends with a stray back-tick after the period, which renders oddly.

```diff
-... extra_llm_api_options`.`
+... `extra_llm_api_options`.
```
**1-1**: File name contains spaces – consider renaming

Having spaces in the Markdown filename (`Deployment Guide for TRT-LLM + Llama4 Scout.md`) complicates URL generation and shell scripting. Replacing spaces with dashes or underscores (e.g., `deployment_guide_trt_llm_llama4_scout.md`) improves portability.
🔇 Additional comments (2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (2)
**78-86**: Correct flag name restored – thanks for fixing the earlier typo

The launch snippet now correctly uses `--max_num_tokens` (Line 84) instead of the previously-misspelled `--max_num_of_tokens`. No further action needed.
**254-258**: MODEL_PATH now points to FP4 – issue resolved

The evaluation command for NVFP4 inference (Line 256) now references `nvidia/Llama-4-Scout-17B-16E-Instruct-FP4`, eliminating the earlier mismatch with FP8 weights.
Actionable comments posted: 3
🧹 Nitpick comments (2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (2)
**76-88**: Consider showing a TensorRT backend example for maximum performance

`--backend pytorch` is easier to demo but leaves a lot of perf on the table. A one-liner alternative with `--backend tensorrt` (or instructions to convert weights) would help production users. No code change required if you intentionally keep it minimal.
**318-322**: Add a language tag to the fenced block to satisfy MD040 and improve rendering

````diff
-```
+```text
````

Applies to this and any other plain fenced blocks without a language spec.
📒 Files selected for processing (2)
- .pre-commit-config.yaml (1 hunks)
- examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- .pre-commit-config.yaml
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in the tensorrt-llm waive list merging system, removed lines are always located at the end of the me...
Learnt from: yiqingy0
PR: NVIDIA/TensorRT-LLM#5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.
Applied to files:
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx} : use the llvm clang-format tool for formatting your changes prio...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Use the LLVM clang-format tool for formatting your changes prior to submitting the PR.
Applied to files:
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
Force-pushed from `ca16dd0` to `6e23a92`.
Actionable comments posted: 0
♻️ Duplicate comments (3)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (3)
**15-19**: Prerequisite lines still break list structure and misstate NVFP4 hardware scope

The hardware + software requirements are still plain text, so Markdown rendering collapses them. Additionally, "Blackwell or Hopper" still suggests Hopper can run NVFP4.

```diff
-*GPU: NVIDIA Blackwell or Hopper Architecture
-*OS: Linux
-*Drivers: CUDA Driver 575 or Later
-*Docker with NVIDIA Container Toolkit installed
-*Python3 and python3-pip (Optional, for accuracy evaluation only)
+* **GPU**
+  * FP8: Blackwell **or** Hopper
+  * NVFP4: **Blackwell only**
+* **OS:** Linux
+* **Driver:** CUDA ≥ 575
+* **Docker:** NVIDIA Container Toolkit installed
+* **Python 3 (optional):** required only for accuracy evaluation
```
**67-68**: Hard-coding `dtype: fp8` makes the YAML unusable for NVFP4 runs

Leaving `dtype` fixed to FP8 silently converts FP4 checkpoints and wastes memory.

```diff
-kv_cache_config:
-  dtype: fp8
+kv_cache_config:
+  # Use `auto` to match the checkpoint, or set `fp4` / `fp8` explicitly.
+  dtype: auto
```
**154-156**: Parameter name typo – should be `max_batch_size`, not `max_batch_sizes`

The codebase recognizes the singular form; the plural variant will be ignored and mislead readers.

```diff
-  `max_batch_sizes`: Sets the maximum batch size for which a CUDA graph will be created.
+  `max_batch_size`: Sets the maximum batch size for which a CUDA graph will be created.
```
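With the singular key, the corresponding `cuda_graph_config` fragment of an `extra_llm_api_options` file would read roughly as follows. This is a sketch: the batch-size values are illustrative, and `batch_sizes` is shown only because the guide lists it as the alternative whose default is `None`.

```yaml
cuda_graph_config:
  # Singular key; must not exceed the --max_batch_size command-line option.
  max_batch_size: 512
  # Alternatively, enumerate the exact sizes to capture graphs for:
  # batch_sizes: [1, 2, 4, 8, 16]
```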
🧹 Nitpick comments (1)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (1)
**320-346**: Add a language identifier to the final fenced code block

Markdown-lint (`MD040`) flags this block. Adding `text` (or `shell` / `console`) preserves syntax highlighting and silences the linter.

````diff
-```
+```text
````
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx} : use the llvm clang-format tool for formatting your changes prio...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Use the LLVM clang-format tool for formatting your changes prior to submitting the PR.
Applied to files:
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
🪛 LanguageTool
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
[grammar] ~11-~11: Ensure spelling is correct
Context: ...ecution. ## Access & Licensing To use Llama4 Scout 17B, you must first agree to Meta...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[style] ~114-~114: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ch** backend. #### --max_batch_size
Description: The maximum number of ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...r processing. #### --max_num_tokens
Description: The maximum total numb...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~122-~122: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...scheduled batch. #### --max_seq_len
Description: The maximum possible s...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~126-~126: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ut tokens. #### --trust_remote_code
Description: Allows TensorRT-LLM to...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~141-~141: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: Sets the data type for the KV cache. Default: auto (uses the data ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~145-~145: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...checkpoint). #### cuda_graph_config
Description: A section for configur...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~155-~155: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or which a CUDA graph will be created. Default: 0 **Rec...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~159-~159: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...--max_batch_size command-line option.   
batch_sizes`: A specific list of...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~161-~161: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for. Default: None #### `moe_conf...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~171-~171: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations. Default: CUTLASS #### `atten...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~177-~177: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations. Default: TRTLLM See the [https://g...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~298-~298: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace `--mod...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
🪛 markdownlint-cli2 (0.17.2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
319-319: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
/bot run
Actionable comments posted: 0
♻️ Duplicate comments (2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (2)
15-19: Prerequisites list is still unstructured and conflates NVFP4-only vs FP8 hardware
The section remains plain text and still lists “Blackwell or Hopper” as a single bullet, which incorrectly implies NVFP4 is Hopper-capable and breaks Markdown list consistency.
61-68: `dtype` hard-coded to `fp8`, making the YAML unusable for NVFP4 users
Previous feedback recommended `dtype: auto` (or explicit `fp4`) plus explanatory text. The issue is unchanged and will silently cast FP4 checkpoints to FP8.
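The reviewer's suggested fix could be sketched as follows. This is a hypothetical fragment: the file name `extra-llm-api-config.yml` and the presence of a top-level `dtype` key are assumed from the review context, not verified against the guide.

```shell
# Hypothetical extra-options file: "dtype: auto" defers precision to the
# checkpoint, so one config serves both FP8 and NVFP4 weights instead of
# force-casting everything to FP8.
cat <<'EOF' > extra-llm-api-config.yml
dtype: auto
EOF
cat extra-llm-api-config.yml
```

With `auto`, users of either quantized checkpoint can reuse the same YAML unchanged.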
🧹 Nitpick comments (2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (2)
268-272: Avoid linking to an external `bench.sh` domain; use inline code instead
The Markdown link [bench.sh](http://bench.sh) points to a real external site; readers expect an inline file name.
-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper [bench.sh](http://bench.sh) script.
+To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. First, create a wrapper script named `bench.sh`.
319-346: Code fence lacks a language identifier; triggers MD040 and loses syntax highlighting
Add `text` (or `none`) after the opening back-ticks:
-```
+```text
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (1 hunks)
PR_Github #13925 [ run ] triggered by Bot
PR_Github #13925 [ run ] completed with state
033a77b to fce89cc Compare
/bot run
Actionable comments posted: 1
🧹 Nitpick comments (4)
examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md (4)
24-27: Inconsistent precision terminology (NVFP4 vs FP4)
The guide introduces the second model as “NVFP4” (Line 26) but the bullet label uses “FP4” (Line 24-25). Mixing names can confuse users searching on NGC/HF. Align the terminology (prefer “NVFP4”) throughout the doc.
48-49: Minor command typo – missing space & flag
`mkdir ~/.cache` is shown as mkdir `~/.cache`. Drop the back-ticks & extra space:
-If the `~/.cache` directory doesn’t exist please create it using mkdir `~/.cache`.
+If the `~/.cache` directory doesn’t exist, create it with:
+```shell
+mkdir -p ~/.cache
+```
271-297: bench.sh script lacks she-bang & safe-defaults
Without `#!/usr/bin/env bash`, `bench.sh` may execute under `/bin/sh` on some systems (dash, busybox) and fail on Bash-isms (e.g., `$((…))`). Add she-bang and `set -euo pipefail` to abort on errors:
-cat <<EOF > bench.sh
+cat <<'EOF' > bench.sh
+#!/usr/bin/env bash
+set -euo pipefail
318-346: Missing language identifier on fenced block breaks markdownlint (MD040)
Add `text` (or `none`) after the opening back-ticks to silence lint and enable proper rendering:
-```
+```text
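The she-bang/safe-defaults suggestion can be sketched end to end. This is a minimal hypothetical skeleton, not the guide's actual script: the real `benchmark_serving.py` invocation is elided, and the concurrency values are illustrative.

```shell
# Hypothetical bench.sh skeleton with the reviewer's safeguards:
# - the she-bang pins Bash (not /bin/sh) as the interpreter
# - set -euo pipefail aborts on errors, unset variables, and pipe failures
cat <<'EOF' > bench.sh
#!/usr/bin/env bash
set -euo pipefail
for concurrency in 1 8 32; do
  echo "benchmarking at concurrency=${concurrency}"
  # python benchmark_serving.py --max-concurrency "${concurrency}" ...  (elided)
done
EOF
chmod +x bench.sh
./bench.sh
```

Quoting the heredoc delimiter (`<<'EOF'`) also keeps `${concurrency}` from being expanded at file-creation time, which is a second subtle bug class the original unquoted `<<EOF` invites.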
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .pre-commit-config.yaml (1 hunks)
- examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- .pre-commit-config.yaml
PR_Github #14253 [ run ] triggered by Bot
PR_Github #14253 [ run ] completed with state
Signed-off-by: Chenfei Zhang <[email protected]>
fce89cc to 2ee3fe7 Compare
/bot run
Actionable comments posted: 1
♻️ Duplicate comments (1)
examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md (1)
34-43: Add image digest pinning and larger shared-memory size to Docker run command
Large-model workloads frequently exceed Docker’s default 64 MiB `/dev/shm`, and unpinned tags can silently break when the image is updated. The same feedback was given on an earlier commit and still applies.
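As a hedged illustration of what this comment asks for, the run command could take the shape below. The tag, `<digest>` placeholder, 32g shared-memory size, and mount path are all assumptions for the sketch, not values taken from the guide; the snippet only prints the command, since executing it requires a GPU host with Docker.

```shell
# Illustrative only: tag, <digest> placeholder, shm size, and mount are
# assumptions, not the guide's actual values. The digest suffix pins the
# image so tag updates cannot silently change what gets pulled.
image='nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc4@sha256:<digest>'
cmd="docker run --rm -it --gpus all --shm-size=32g -v ~/.cache:/root/.cache -p 8000:8000 $image"
# Print rather than run: actually executing needs a GPU host with this image.
echo "$cmd" > docker_run_cmd.txt
cat docker_run_cmd.txt
```

`--shm-size=32g` lifts the 64 MiB default `/dev/shm`, which multi-GPU inference workloads commonly exhaust.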
🧹 Nitpick comments (1)
examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md (1)
319-346: Missing language hint on fenced block triggers markdown-lint (MD040)
The sample benchmark output block lacks a language identifier and is currently flagged by CI tooling. Add `text` (or `none`) after the triple back-ticks:
-```
+```text
 ============ Serving Benchmark Result ============
 ...
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .pre-commit-config.yaml (1 hunks)
- examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- .pre-commit-config.yaml
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~155-~155: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or which a CUDA graph will be created. Default: 0 **Rec...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~159-~159: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...--max_batch_size command-line option.   
batch_sizes`: A specific list of...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~161-~161: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for. Default: None #### `moe_conf...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~167-~167: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...n for Mixture-of-Experts (MoE) models. Options: backend
: Th...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~171-~171: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations. Default: CUTLASS #### `atten...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~177-~177: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations. Default: TRTLLM See the [TorchLlmA...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~298-~298: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace `--mod...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
🪛 markdownlint-cli2 (0.17.2)
examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md
319-319: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
We create a YAML configuration file /tmp/config.yml for the TensorRT-LLM Server and populate it with the following recommended performance settings.

```shell
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: fp8
EOF
```
kv_cache_config.dtype
hard-codes fp8
– breaks NVFP4 workflow
The YAML template always sets
kv_cache_config:
dtype: fp8
When the user follows the guide to launch an NVFP4 model, this override forces the KV-cache back to FP8 and defeats the purpose of FP4 quantization (it also wastes memory). Either omit the field (to inherit from the checkpoint) or parameterise it:
```diff
-kv_cache_config:
-  dtype: fp8
+# For FP8 models keep fp8; for NVFP4 models set fp4 or omit to use checkpoint default
+kv_cache_config:
+  # dtype: auto
```
🤖 Prompt for AI Agents
In examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md
around lines 56 to 69, the YAML configuration hard-codes kv_cache_config.dtype
to fp8, which breaks the NVFP4 workflow by overriding the intended FP4
quantization. To fix this, remove the dtype: fp8 line from the YAML so it
inherits the dtype from the checkpoint, or modify the script to parameterize
this field so it can be set appropriately based on the model being used.
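To make the suggested fix concrete, one way to parameterise the KV-cache dtype when generating the config file is sketched below. Note that `MODEL_PRECISION` and the `auto` fallback are illustrative assumptions for this sketch, not variables or flags defined by the guide:

```shell
#!/bin/sh
# Illustrative sketch: pick the KV-cache dtype to match the model precision.
# MODEL_PRECISION is a hypothetical variable, not part of the TRT-LLM guide.
MODEL_PRECISION="${MODEL_PRECISION:-nvfp4}"

if [ "$MODEL_PRECISION" = "fp8" ]; then
    KV_DTYPE=fp8      # FP8 checkpoint: keep the fp8 KV cache
else
    KV_DTYPE=auto     # NVFP4 (and others): inherit the dtype from the checkpoint
fi

EXTRA_LLM_API_FILE="$(mktemp)"
cat << EOF > "$EXTRA_LLM_API_FILE"
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: ${KV_DTYPE}
EOF

cat "$EXTRA_LLM_API_FILE"
```

With the default `nvfp4` setting, the generated file carries `dtype: auto`, so the KV cache follows the checkpoint instead of being forced back to FP8.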
PR_Github #14287 [ run ] triggered by Bot

Bypassing and merging now. The doc change won't affect CI.

PR_Github #14287 [ run ] completed with state
…VIDIA#6550) Signed-off-by: Chenfei Zhang <[email protected]> Co-authored-by: Tao Li @ NVIDIA <[email protected]>
Summary by CodeRabbit
Documentation
Chores
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
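Taken together, the help above maps onto comment-driven usage such as the following (the flag values are illustrative examples drawn from the help text, not commands that were run on this PR):

```
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
/bot skip --comment "Docs-only change, no CI impact"
/bot reuse-pipeline
```

Each command is posted as a standalone PR comment; per the help text, `run` kills all previously running jobs before launching the new pipeline.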