Conversation

Collaborator

@chenfeiz0326 chenfeiz0326 commented Aug 1, 2025

Summary by CodeRabbit

  • Documentation

    • Added a comprehensive quick start guide for deploying and benchmarking Llama4 Scout 17B models with TensorRT-LLM, including setup, configuration, API usage, evaluation, and troubleshooting instructions.
    • Expanded documentation coverage for quantization options (FP8, NVFP4) and performance benchmarking.
  • Chores

    • Updated pre-commit configuration to exclude Markdown files from the trailing whitespace check.

Description

Test Coverage

GitHub Bot Help

/bot [-h] {run, kill, skip, reuse-pipeline} ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
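
The run grammar above can be sanity-checked with a small parser. The following is only an illustrative sketch of the documented flags, not the bot's actual implementation (the real parser runs server-side in Jenkins and may differ):

```python
import argparse
import shlex

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical mirror of the documented /bot grammar.
    parser = argparse.ArgumentParser(prog="/bot")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="Launch build/test pipelines")
    # --reuse-test takes an optional pipeline-id; bare flag means "last pipeline"
    run.add_argument("--reuse-test", nargs="?", const="last", metavar="pipeline-id")
    run.add_argument("--disable-reuse-test", action="store_true")
    run.add_argument("--disable-fail-fast", action="store_true")
    run.add_argument("--skip-test", action="store_true")
    run.add_argument("--stage-list")
    run.add_argument("--gpu-type")
    run.add_argument("--test-backend")
    run.add_argument("--add-multi-gpu-test", action="store_true")
    run.add_argument("--only-multi-gpu-test", action="store_true")
    run.add_argument("--disable-multi-gpu-test", action="store_true")
    run.add_argument("--post-merge", action="store_true")
    run.add_argument("--extra-stage")
    run.add_argument("--detailed-log", action="store_true")
    run.add_argument("--debug", action="store_true")

    sub.add_parser("kill", help="Kill all running builds for the PR")
    skip = sub.add_parser("skip", help="Skip testing for the latest commit")
    skip.add_argument("--comment", required=True)
    sub.add_parser("reuse-pipeline", help="Reuse a previous pipeline")
    return parser

cmd = '/bot run --disable-fail-fast --stage-list "A10-PyTorch-1" --gpu-type "A30, H100_PCIe"'
tokens = shlex.split(cmd)[1:]  # drop the leading "/bot"
args = build_parser().parse_args(tokens)
print(args.command, "|", args.stage_list, "|", args.gpu_type)
```

Quoted values such as `"A30, H100_PCIe"` survive as single arguments because the sketch tokenizes with `shlex`, matching how the comma-separated lists are written in the help text above.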

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of tree.

Contributor

coderabbitai bot commented Aug 1, 2025

📝 Walkthrough


A new quick start guide for deploying and benchmarking the Llama4 Scout 17B model using TensorRT-LLM has been added. Additionally, the pre-commit configuration was updated to exclude Markdown files from the trailing-whitespace check, in addition to patch files.

Changes

  • Pre-commit Exclude Update (.pre-commit-config.yaml): Expanded the trailing-whitespace hook exclusion to also ignore .md files, updating the regex accordingly.
  • Llama4 Scout Deployment Guide (examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md): Added a comprehensive quick start guide for deploying, evaluating, and benchmarking Llama4 Scout 17B with TRT-LLM.
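
The widened trailing-whitespace exclusion can be checked with a quick script. The pattern below is a hypothetical stand-in for the actual regex in .pre-commit-config.yaml, which may differ:

```python
import re

# Hypothetical exclusion pattern for the trailing-whitespace hook:
# .patch files were already skipped; .md files are now skipped as well.
# The actual regex in .pre-commit-config.yaml may differ.
EXCLUDE = re.compile(r".*\.(patch|md)$")

def is_excluded(path: str) -> bool:
    # pre-commit matches exclude patterns against the full file path
    return bool(EXCLUDE.match(path))

print(is_excluded("docs/guide.md"))        # True: markdown is now skipped
print(is_excluded("fix.patch"))            # True: patches were already skipped
print(is_excluded("tensorrt_llm/llm.py"))  # False: still checked
```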

Sequence Diagram(s)

Not applicable: changes are documentation and config-related only.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~7 minutes

Possibly related PRs

Suggested labels

Documentation, Community want to contribute

Suggested reviewers

  • jiahanc
  • QiJune
  • nv-guomingz


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (5)

25-30: Model bullet wording is inconsistent with link target

The bullet claims “NVFP4 model” but the link points to a plain “FP4” weight. Either rename the anchor to match the file name or change the descriptive text so the term used is consistent across the doc.

-* NVFP4 model: [Llama-4-Scout-17B-16E-Instruct-FP4](https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP4)
+* FP4 (NVFP4) model: [Llama-4-Scout-17B-16E-Instruct-FP4](https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP4)

64-71: YAML example hard-codes dtype: fp8

The sample config is under the “recommended performance settings” section, yet the guide targets both FP8 and NVFP4. Consider either noting that the snippet is FP8-specific or parameterising the value.

-  dtype: fp8
+# For NVFP4 replace with:  dtype: fp4
+  dtype: fp8

98-106: Heading level jumps violate MD001

#### immediately follows an ## header, skipping ###. Markdown-lint flags this. Dropping one # fixes all option sub-sections.

-#### `--tp_size`
+### `--tp_size`

Apply the same adjustment to every option block that currently uses four # symbols.


180-184: Dangling back-tick at end of sentence

There is an unmatched back-tick after extra_llm_api_options, creating a rendering glitch.

-... options which can be used in the extra\_llm\_api\_options`.`
+... options which can be used in the `extra_llm_api_options`.

322-349: Missing fenced-code language tag (MD040)

Add a language identifier so syntax highlighting engines don’t complain.
text or none is sufficient.

-```
+```text
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac23f4a and 2bff3ab.

📒 Files selected for processing (1)
  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
🪛 LanguageTool
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md

[grammar] ~13-~13: Ensure spelling is correct
Context: ...xecution. # Access & Licensing To use Llama4 Scout 17B, you must first agree to Meta...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[style] ~117-~117: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ch** backend. #### --max_batch_sizeDescription: The maximum number of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~121-~121: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rocessing. #### --max_num_of_tokensDescription: The maximum total numb...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~125-~125: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...scheduled batch. #### --max_seq_lenDescription: The maximum possible s...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~129-~129: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ut tokens. #### --trust_remote_codeDescription: Allows TensorRT-LLM to...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~144-~144: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: Sets the data type for the KV cache.   Default: auto (uses the data ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~148-~148: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...checkpoint). #### cuda_graph_configDescription: A section for configur...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~158-~158: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or which a CUDA graph will be created.   Default: 0   **Rec...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~162-~162: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...-max_batch_size command-line option.   batch_sizes: A specific list of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~164-~164: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for.   Default: None #### `moe_conf...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~174-~174: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations.   Default: CUTLASS #### `atten...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~180-~180: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations.  Default: TRTLLM See the [https://g...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~301-~301: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace --mod...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.17.2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md

98-98: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)


322-322: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (4)

95-104: Heading level skips H3 – breaks MD001 rule

#### --tp_size (Line 98) is an H4 directly under an H2 (## Configs and Parameters, Line 95), skipping the H3 level.
To satisfy common Markdown conventions and automated linters, downgrade these option headings to H3 (###) or insert an intermediate H3 section.

-#### `--tp_size`
+### `--tp_size`

Apply the same change to all sibling option headings in this section.


322-349: Add language identifier to fenced code block

The benchmark sample output block (Line 322) lacks a language hint, triggering MD040 and disabling syntax highlighting in some renderers.

-```
+```text

181-185: Extraneous back-tick in hyperlink sentence

Line 183 ends with “extra_llm_api_options.`` — the trailing back-tick before the period is stray and renders oddly.

-... extra_llm_api_options`.`
+... `extra_llm_api_options`.

1-1: File name contains spaces – consider renaming

Having spaces in the Markdown filename (Deployment Guide for TRT-LLM + Llama4 Scout.md) complicates URL generation and shell scripting. Replacing spaces with dashes or underscores (e.g., deployment_guide_trt_llm_llama4_scout.md) improves portability.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2bff3ab and 01a70ef.

📒 Files selected for processing (1)
  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
🪛 LanguageTool
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md

[grammar] ~13-~13: Ensure spelling is correct
Context: ...xecution. # Access & Licensing To use Llama4 Scout 17B, you must first agree to Meta...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[style] ~117-~117: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ch** backend. #### --max_batch_sizeDescription: The maximum number of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~121-~121: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...r processing. #### --max_num_tokensDescription: The maximum total numb...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~125-~125: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...scheduled batch. #### --max_seq_lenDescription: The maximum possible s...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~129-~129: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ut tokens. #### --trust_remote_codeDescription: Allows TensorRT-LLM to...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~144-~144: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: Sets the data type for the KV cache.   Default: auto (uses the data ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~148-~148: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...checkpoint). #### cuda_graph_configDescription: A section for configur...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~158-~158: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or which a CUDA graph will be created.   Default: 0   **Rec...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~162-~162: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...-max_batch_size command-line option.   batch_sizes: A specific list of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~164-~164: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for.   Default: None #### `moe_conf...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~174-~174: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations.   Default: CUTLASS #### `atten...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~180-~180: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations.  Default: TRTLLM See the [https://g...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~301-~301: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace --mod...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.17.2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md

98-98: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)


322-322: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (2)

78-86: Correct flag name restored – thanks for fixing the earlier typo

The launch snippet now correctly uses --max_num_tokens (Line 84) instead of the previously-misspelled --max_num_of_tokens. No further action needed.


254-258: MODEL_PATH now points to FP4 – issue resolved

The evaluation command for NVFP4 inference (Line 256) now references
nvidia/Llama-4-Scout-17B-16E-Instruct-FP4, eliminating the earlier mismatch with FP8 weights.

@litaotju litaotju requested a review from hyukn August 2, 2025 00:28
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (2)

76-88: Consider showing a TensorRT backend example for maximum performance

--backend pytorch is easier to demo but leaves a lot of perf on the table. A one-liner alternative with --backend tensorrt (or instructions to convert weights) would help production users.

No code change required if you intentionally keep it minimal.


318-322: Add a language tag to the fenced block to satisfy MD040 and improve rendering

-```
+```text

Applies to this and any other plain fenced blocks without a language spec.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01a70ef and ca16dd0.

📒 Files selected for processing (2)
  • .pre-commit-config.yaml (1 hunks)
  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .pre-commit-config.yaml
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: in the tensorrt-llm waive list merging system, removed lines are always located at the end of the me...
Learnt from: yiqingy0
PR: NVIDIA/TensorRT-LLM#5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx} : use the llvm clang-format tool for formatting your changes prio...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Use the LLVM clang-format tool for formatting your changes prior to submitting the PR.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
🪛 LanguageTool
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md

[grammar] ~11-~11: Ensure spelling is correct
Context: ...ecution. ## Access & Licensing To use Llama4 Scout 17B, you must first agree to Meta...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[style] ~114-~114: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ch** backend. #### --max_batch_sizeDescription: The maximum number of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...r processing. #### --max_num_tokensDescription: The maximum total numb...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~122-~122: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...scheduled batch. #### --max_seq_lenDescription: The maximum possible s...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~126-~126: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ut tokens. #### --trust_remote_codeDescription: Allows TensorRT-LLM to...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~141-~141: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: Sets the data type for the KV cache.   Default: auto (uses the data ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~145-~145: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...checkpoint). #### cuda_graph_configDescription: A section for configur...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~155-~155: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or which a CUDA graph will be created.   Default: 0   **Rec...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~159-~159: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...--max_batch_size command-line option. &emsp;&emsp;batch_sizes`: A specific list of...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~161-~161: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for.   Default: None #### `moe_conf...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~171-~171: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations.   Default: CUTLASS #### `atten...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~177-~177: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations.  Default: TRTLLM See the [https://g...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~298-~298: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace `--mod...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.17.2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md

319-319: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@chenfeiz0326 chenfeiz0326 force-pushed the feat/llama-4-deployment-guide branch from ca16dd0 to 6e23a92 on August 4, 2025 03:32
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (3)

15-19: Prerequisite lines still break list structure and misstate NVFP4 hardware scope

The hardware + software requirements are still plain text, so Markdown rendering collapses them.
Additionally, “Blackwell or Hopper” still suggests Hopper can run NVFP4.

-*GPU: NVIDIA Blackwell or Hopper Architecture  
-*OS: Linux  
-*Drivers: CUDA Driver 575 or Later  
-*Docker with NVIDIA Container Toolkit installed  
-*Python3 and python3-pip (Optional, for accuracy evaluation only)
+* **GPU**  
+  * FP8: Blackwell **or** Hopper  
+  * NVFP4: **Blackwell only**
+* **OS:** Linux  
+* **Driver:** CUDA ≥ 575  
+* **Docker:** NVIDIA Container Toolkit installed  
+* **Python 3 (optional):** required only for accuracy evaluation

67-68: Hard-coding dtype: fp8 makes the YAML unusable for NVFP4 runs

Leaving dtype fixed to FP8 silently converts FP4 checkpoints and wastes memory.

-kv_cache_config:
-  dtype: fp8
+kv_cache_config:
+  # Use `auto` to match the checkpoint, or set `fp4` / `fp8` explicitly.
+  dtype: auto

154-156: Parameter name typo – should be max_batch_size, not max_batch_sizes

The codebase recognizes the singular form; the plural variant will be ignored and mislead readers.

-&emsp;&emsp;`max_batch_sizes`: Sets the maximum batch size for which a CUDA graph will be created.
+&emsp;&emsp;`max_batch_size`: Sets the maximum batch size for which a CUDA graph will be created.
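Taken together with the `dtype: auto` fix above, a sketch of how the corrected keys might appear in an extra-options YAML (key names follow the review comments; the values are illustrative placeholders, not taken from the guide):

```yaml
# Illustrative extra LLM API options file; key names follow the review
# comments above, values are placeholders.
kv_cache_config:
  dtype: auto            # match the checkpoint's dtype (fp8 or nvfp4)
cuda_graph_config:
  enable_padding: true
  max_batch_size: 512    # singular key; `max_batch_sizes` would be ignored
```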
🧹 Nitpick comments (1)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (1)

320-346: Add a language identifier to the final fenced code block

Markdown-lint (MD040) flags this block. Adding text (or shell / console) preserves syntax highlighting and silences the linter.

-``` 
+```text
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca16dd0 and 6e23a92.

📒 Files selected for processing (2)
  • .pre-commit-config.yaml (1 hunks)
  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .pre-commit-config.yaml
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: in the tensorrt-llm waive list merging system, removed lines are always located at the end of the me...
Learnt from: yiqingy0
PR: NVIDIA/TensorRT-LLM#5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx} : use the llvm clang-format tool for formatting your changes prio...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Use the LLVM clang-format tool for formatting your changes prior to submitting the PR.

Applied to files:

  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md

@chenfeiz0326
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (2)

15-19: Prerequisites list is still unstructured and conflates NVFP4-only vs FP8 hardware
The section remains plain text and still lists “Blackwell or Hopper” as a single bullet, which incorrectly implies NVFP4 is Hopper-capable and breaks Markdown list consistency.


61-68: dtype hard-coded to fp8, making the YAML unusable for NVFP4 users
Previous feedback recommended dtype: auto (or explicit fp4) plus explanatory text. The issue is unchanged and will silently cast FP4 checkpoints to FP8.

🧹 Nitpick comments (2)
examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (2)

268-272: Avoid linking to an external bench.sh domain—use inline code instead
The Markdown link [bench.sh](http://bench.sh) points to a real external site; readers expect an inline file name.

-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper [bench.sh](http://bench.sh) script.
+To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. First, create a wrapper script named `bench.sh`.

319-346: Code fence lacks a language identifier; triggers MD040 and loses syntax highlighting

Add text (or none) after the opening back-ticks:

-```
+```text
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e23a92 and 033a77b.

📒 Files selected for processing (1)
  • examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md (1 hunks)

@tensorrt-cicd
Collaborator

PR_Github #13925 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13925 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10487 completed with status: 'FAILURE'

@chenfeiz0326 chenfeiz0326 force-pushed the feat/llama-4-deployment-guide branch from 033a77b to fce89cc on August 6, 2025 06:16
@chenfeiz0326 chenfeiz0326 requested a review from a team as a code owner August 6, 2025 06:16
@chenfeiz0326
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md (4)

24-27: Inconsistent precision terminology (NVFP4 vs FP4)
The guide introduces the second model as “NVFP4” (Line 26) but the bullet label uses “FP4” (Line 24-25). Mixing names can confuse users searching on NGC/HF. Align the terminology (prefer “NVFP4”) throughout the doc.


48-49: Minor command typo – stray backticks and missing `-p` flag
The directory-creation command is rendered as ``mkdir `~/.cache` ``. Drop the stray backticks and extra space, and add `-p`:

-If the `~/.cache` directory doesn’t exist please create it using  mkdir `~/.cache`.
+If the `~/.cache` directory doesn’t exist, create it with:
+```shell
+mkdir -p ~/.cache
+```

271-297: bench.sh script lacks a shebang & safe defaults
Without `#!/usr/bin/env bash`, bench.sh may execute under /bin/sh on some systems (dash, busybox) and fail on Bash-isms (e.g., `$((…))`). Add a shebang and `set -euo pipefail` to abort on errors:

-cat <<EOF >  bench.sh
+cat <<'EOF' > bench.sh
+#!/usr/bin/env bash
+set -euo pipefail
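As a concrete illustration of why the strict-mode preamble matters, here is a hypothetical minimal bench.sh skeleton (the loop body is a placeholder `echo`, not the real `benchmark_serving.py` invocation; variable names and values are assumptions):

```shell
#!/usr/bin/env bash
# Hypothetical bench.sh skeleton; a real script would invoke
# benchmark_serving.py inside the loop instead of echo.
set -euo pipefail   # abort on errors, unset variables, and pipe failures

isl=1024            # input sequence length (placeholder value)
osl=1024            # output sequence length (placeholder value)

run_sweep() {
  # Sweep a few concurrency levels; with `set -u`, a typo such as
  # ${concurency} aborts the script instead of expanding to empty.
  for concurrency in 1 2 4; do
    echo "concurrency=${concurrency} isl=${isl} osl=${osl}"
  done
}

run_sweep
```

Quoting the heredoc delimiter (`<<'EOF'`) as suggested above also prevents the host shell from expanding `$`-variables before the script is written to disk.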

318-346: Missing language identifier on fenced block breaks markdownlint (MD040)
Add text (or none) after the opening back-ticks to silence lint and enable proper rendering:

-```
+```text
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 033a77b and fce89cc.

📒 Files selected for processing (2)
  • .pre-commit-config.yaml (1 hunks)
  • examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .pre-commit-config.yaml
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T05:46:41.308Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Learnt from: yiqingy0
PR: NVIDIA/TensorRT-LLM#5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T05:46:41.308Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md
🪛 LanguageTool
examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md

[style] ~114-~114: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ch** backend. #### --max_batch_sizeDescription: The maximum number of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...r processing. #### --max_num_tokensDescription: The maximum total numb...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~122-~122: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...scheduled batch. #### --max_seq_lenDescription: The maximum possible s...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~126-~126: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ut tokens. #### --trust_remote_codeDescription: Allows TensorRT-LLM to...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~141-~141: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: Sets the data type for the KV cache.   Default: auto (uses the data ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~145-~145: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...checkpoint). #### cuda_graph_configDescription: A section for configur...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~147-~147: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...g CUDA graphs to optimize performance.  Options:   `enable_paddi...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~155-~155: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or which a CUDA graph will be created.   Default: 0   **Rec...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~159-~159: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...--max_batch_size command-line option. &emsp;&emsp;batch_sizes`: A specific list of...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~161-~161: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for.   Default: None #### `moe_conf...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~167-~167: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...n for Mixture-of-Experts (MoE) models.  Options:   backend: Th...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~171-~171: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations.   Default: CUTLASS #### `atten...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~177-~177: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations.  Default: TRTLLM See the [TorchLlmA...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~298-~298: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace `--mod...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.17.2)
examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md

319-319: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Collaborator

PR_Github #14253 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14253 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10760 completed with status: 'FAILURE'

Signed-off-by: Chenfei Zhang <[email protected]>
@chenfeiz0326 chenfeiz0326 force-pushed the feat/llama-4-deployment-guide branch from fce89cc to 2ee3fe7 on August 6, 2025 09:02
@chenfeiz0326
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md (1)

34-43: Add image digest pinning and larger shared-memory size to Docker run command

Large-model workloads frequently exceed Docker’s default 64 MiB /dev/shm, and unpinned tags can silently break when the image is updated. The same feedback was given on an earlier commit and still applies.
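
A hedged sketch of what the suggested fix could look like. The image digest, shared-memory size, and mount path below are illustrative placeholders, not values taken from the guide:

```shell
# Illustrative template only: pin the image by digest so updates cannot
# silently change behavior, and raise /dev/shm above Docker's 64 MiB default.
# Replace <digest> and the mount path with real values before running.
docker run --rm -it \
  --gpus all \
  --shm-size=32g \
  -p 8000:8000 \
  -v /path/to/models:/models \
  nvcr.io/nvidia/tensorrt-llm/release@sha256:<digest> \
  /bin/bash
```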

🧹 Nitpick comments (1)
examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md (1)

319-346: Missing language hint on fenced block triggers markdown-lint (MD040)

The sample benchmark output block lacks a language identifier and is currently flagged by CI tooling. Add a language hint such as `text` after the opening triple back-ticks:

-```
+```text
============ Serving Benchmark Result ============
...
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fce89cc and 2ee3fe7.

📒 Files selected for processing (2)
  • .pre-commit-config.yaml (1 hunks)
  • examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .pre-commit-config.yaml
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T08:45:40.690Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Learnt from: yiqingy0
PR: NVIDIA/TensorRT-LLM#5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T08:45:40.690Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md
🪛 LanguageTool and 🪛 markdownlint-cli2 findings: identical to the earlier review above.

Comment on lines +56 to +69
We create a YAML configuration file /tmp/config.yml for the TensorRT-LLM Server and populate it with the following recommended performance settings.

```shell
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: fp8
EOF
```

🛠️ Refactor suggestion

kv_cache_config.dtype hard-codes fp8 – breaks NVFP4 workflow

The YAML template always sets

kv_cache_config:
  dtype: fp8

When the user follows the guide to launch an NVFP4 model, this override forces the KV-cache back to FP8 and defeats the purpose of FP4 quantization (it also wastes memory). Either omit the field (to inherit from the checkpoint) or parameterise it:

-kv_cache_config:
-  dtype: fp8
+# For FP8 models keep fp8; for NVFP4 models set fp4 or omit to use checkpoint default
+kv_cache_config:
+  # dtype: auto
🤖 Prompt for AI Agents
In examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md
around lines 56 to 69, the YAML configuration hard-codes kv_cache_config.dtype
to fp8, which breaks the NVFP4 workflow by overriding the intended FP4
quantization. To fix this, remove the dtype: fp8 line from the YAML so it
inherits the dtype from the checkpoint, or modify the script to parameterize
this field so it can be set appropriately based on the model being used.
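
One hedged way to apply the suggestion is to drive the dtype from an environment variable. The `KV_DTYPE` variable name below is an assumed choice for illustration, not part of the guide:

```shell
# Sketch: parameterize the KV-cache dtype so one template serves both workflows.
# Set KV_DTYPE=fp8 for FP8 checkpoints; the default "auto" inherits the dtype
# from the model checkpoint, which keeps the NVFP4 path intact.
KV_DTYPE="${KV_DTYPE:-auto}"
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: ${KV_DTYPE}
EOF
```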

@tensorrt-cicd
Collaborator

PR_Github #14287 [ run ] triggered by Bot

@litaotju
Collaborator

litaotju commented Aug 6, 2025

Bypassing and merge now. The doc change won't affect CI.

@litaotju litaotju merged commit a16ba64 into NVIDIA:main Aug 6, 2025
3 checks passed
@tensorrt-cicd
Collaborator

PR_Github #14287 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10790 completed with status: 'FAILURE'
