
Conversation


@xinhe-nv xinhe-nv commented Aug 18, 2025

Summary by CodeRabbit

  • Tests

    • Expanded coverage for Llama 4 Maverick and new Scout variants, adding TP4 configurations, CUDA Graph permutations, FP8/FP4, and chunked prefill scenarios.
    • Introduced device- and memory-aware gating to improve test stability across diverse hardware.
    • Updated sanity and QA test matrices with broader permutations and IDs.
  • Documentation

    • Updated accuracy references with new FP8 results for Maverick on MMLU (86.40) and GSM8K (90.20).

Description

Test Coverage

https://prod.blsm.nvidia.com/swqa-tensorrt-qa-test/job/DEBUG_LLM_FUNCTION_CLUSTER_TEST/88/testReport/
https://prod.blsm.nvidia.com/swqa-tensorrt-qa-test/view/TRT-LLM-Function-Pipelines/job/DEBUG_LLM_FUNCTION_CLUSTER_TEST/89/testReport/

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.
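For example, /bot run --disable-fail-fast --gpu-type "A30, H100_PCIe" launches the pre-merge pipeline restricted to the listed GPU types with fail-fast disabled (an illustrative combination of the options documented below).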

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline, ensuring that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.


coderabbitai bot commented Aug 18, 2025

📝 Walkthrough

Walkthrough

Replaces MPI world-size gating with device-count and memory checks in PyTorch LLM accuracy tests, introduces KvCacheConfig and CudaGraphConfig in LLM instantiation, adds a new Scout test class, expands tp4/tp8 parameterizations, adds test list entries, and augments MMLU/GSM8K accuracy reference YAMLs. Adds get_device_count/get_device_memory utilities.
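For orientation, a minimal sketch of the gating pattern described above, assuming the get_device_count/get_device_memory helpers called out in this walkthrough; the memory threshold and parameter grid are illustrative rather than the exact values used in the PR.

import pytest

from ..conftest import get_device_count, get_device_memory  # helpers referenced by this PR


@pytest.mark.parametrize("tp_size,pp_size", [(4, 1), (8, 1)], ids=["tp4", "tp8"])
def test_auto_dtype_gating_sketch(tp_size, pp_size):
    # Run only when the node exposes exactly the world size this configuration needs.
    if get_device_count() != tp_size * pp_size:
        pytest.skip("Device count mismatch with world size")
    # Illustrative memory floor (MiB assumed); the actual tests use per-case thresholds.
    if get_device_memory() < 80000:
        pytest.skip("Not enough device memory for this test")
    # ... instantiate LLM(kv_cache_config=..., cuda_graph_config=...) and evaluate MMLU/GSM8K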

Changes

Cohort / File(s): Summary

  • PyTorch LLM accuracy tests (tests/integration/defs/accuracy/test_llm_api_pytorch.py): Updated gating to device-count and memory checks; added KvCacheConfig (free_gpu_memory_fraction=0.8) and CudaGraphConfig usage; expanded parameterizations (tp4/tp8, ep variants, cuda_graph flags); updated FP8/FP4 model paths; added TestLlama4ScoutInstruct with analogous tests and gating; adjusted chunked_prefill tests; included Eagle3 references in FP8 tests.
  • Test utilities (tests/integration/conftest.py): Added public helpers get_device_count() and get_device_memory(), imported by accuracy tests for gating (see the sketch after this list).
  • QA test lists (tests/integration/test_lists/qa/llm_function_full.txt, tests/integration/test_lists/qa/llm_function_sanity.txt): Added entries for new tp4/tp8 parameterizations across Maverick/Scout auto_dtype, FP8/FP4, and chunked_prefill tests; included Eagle3 FP8 variants.
  • Accuracy references (tests/integration/defs/accuracy/references/mmlu.yaml, tests/integration/defs/accuracy/references/gsm8k.yaml): Appended FP8 entries for meta-llama/Llama-4-Maverick-17B-128E-Instruct with kv_cache_quant_algo: FP8; recorded accuracies (MMLU 86.40; GSM8K 90.20).
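As a rough illustration of what the two helpers above might look like (hypothetical sketch; the real implementations live in the integration conftest and may differ, for example in whether memory is reported per device or in total, a point a later comment in this review asks to confirm):

import torch


def get_device_count() -> int:
    """Number of CUDA devices visible to the test process."""
    return torch.cuda.device_count()


def get_device_memory() -> int:
    """Memory of device 0 in MiB (unit and per-device scope are assumptions here)."""
    props = torch.cuda.get_device_properties(0)
    return props.total_memory // (1024 * 1024)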

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor QA as QA Runner
  participant T as PyTorch Accuracy Test
  participant U as Test Utils (conftest)
  participant L as LLM Factory

  QA->>T: Parametrized test (tp/ep, cuda_graph)
  T->>U: get_device_count(), get_device_memory()
  alt Gating not satisfied
    T-->>QA: skip (device-count/memory/host-memory)
  else Gating satisfied
    T->>L: Instantiate LLM(kv_cache_config, cuda_graph_config)
    L-->>T: LLM instance
    T->>L: Run generation/prefill
    L-->>T: Outputs
    T-->>QA: Assertions pass/fail
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • litaotju
  • crazydemo
  • LarryXFly
  • yilin-void



📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4e8dd72 and 05936d2.

📒 Files selected for processing (5)
  • tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (11 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/defs/accuracy/references/gsm8k.yaml
  • tests/integration/defs/accuracy/references/mmlu.yaml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_sanity.txt
  • tests/integration/test_lists/qa/llm_function_full.txt
🧬 Code Graph Analysis (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tests/integration/defs/conftest.py (3)
  • get_device_count (1941-1943)
  • get_device_memory (1946-1977)
  • llm_models_root (77-83)
tensorrt_llm/llmapi/llm_args.py (2)
  • KvCacheConfig (929-1024)
  • CudaGraphConfig (106-163)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1079-1095)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tests/integration/test_lists/qa/llm_function_sanity.txt (1)

92-100: Scout tp4 chunked_prefill entries — confirm reference coverage (if measured) or mark as not requiring references.

If these chunked_prefill tests participate in accuracy assertions, ensure references are present. If they are functional-only, confirm the harness won’t require a baseline.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (10)

30-33: LGTM: imports centralize device/memory utilities.

Using get_device_count/get_device_memory from the shared conftest keeps gating logic consistent across tests.


512-518: LGTM: TP4/EP permutations added to Maverick auto_dtype.

The parameterization looks correct and aligns with the new QA entries.


524-533: LGTM: pass KvCacheConfig to reduce OOM probability.

Using free_gpu_memory_fraction=0.8 for CI is a sensible workaround (WAR).
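A condensed sketch of the pattern being approved here, assuming the public llmapi exports for KvCacheConfig and CudaGraphConfig; the model path is a placeholder and the evaluation harness is omitted.

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig, KvCacheConfig

# Cap KV-cache memory and control CUDA graphs through config objects
# rather than boolean flags.
with LLM("<llama4-maverick-checkpoint>",  # placeholder path
         tensor_parallel_size=8,
         kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8),
         cuda_graph_config=CudaGraphConfig()) as llm:
    outputs = llm.generate(["Hello, world"])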


558-564: LGTM: FP8 permutations and memory pre-skip for Maverick.

Expansion to TP4/EP variants looks consistent with QA lists.


577-578: LGTM: standardized CUDA graph usage via CudaGraphConfig.

Shifting away from boolean flags to CudaGraphConfig keeps the API consistent.

Also applies to: 601-602, 687-689


641-643: LGTM: class-level memory gating for Scout.

skip_less_device_memory/host_memory markers provide clear preconditions; good placement at class scope.
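Roughly, the class-level gating being praised here has this shape (hypothetical skeleton; the marker names follow the repo convention cited in this review, and the thresholds are illustrative):

import pytest


@pytest.mark.skip_less_device_memory(80000)   # per-device floor, MiB assumed
@pytest.mark.skip_less_host_memory(100000)    # illustrative host-memory floor
class TestLlama4ScoutInstructSketch:

    def test_auto_dtype(self):
        # Per-test world-size and memory checks go here, as in the diffs shown later.
        ...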


648-651: LGTM: Scout TP4 permutations align with Maverick coverage.

Parameterization mirrors Maverick and matches QA test lists.


653-655: LGTM: device-count verification per parameter set.

Explicit check avoids running with partial allocations; consistent with the new gating approach.


676-678: LGTM: per-config device-count check for Scout FP8.

Matches auto_dtype gating; no additional changes needed.


735-736: LGTM: chunked_prefill now uses CudaGraphConfig.

This addresses earlier divergence (use_cuda_graph vs cuda_graph_config) and keeps the interface uniform.

Also applies to: 758-759




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🔭 Outside diff range comments (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

748-763: Align CUDA graph parameter naming for LLM instantiation

Replace the deprecated use_cuda_graph argument with the new cuda_graph_config across this test to prevent runtime errors once use_cuda_graph is removed.

• File: tests/integration/defs/accuracy/test_llm_api_pytorch.py
• Lines: 748–763

     def test_fp4_chunked_prefill(self, cuda_graph, tp_size, pp_size, ep_size):
-        with LLM(
+        with LLM(
                 f"{llm_models_root()}/llama4-models/Llama-4-Scout-17B-16E-Instruct-FP4",
                 tensor_parallel_size=tp_size,
                 pipeline_parallel_size=pp_size,
                 moe_expert_parallel_size=ep_size,
                 max_seq_len=22000,
                 enable_chunked_prefill=True,
                 max_num_tokens=256,
-                use_cuda_graph=cuda_graph) as llm:
+                cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
             assert llm.args.quant_config.quant_algo == QuantAlgo.NVFP4
             assert llm.args.quant_config.kv_cache_quant_algo == QuantAlgo.FP8
             task = MMLU(self.MODEL_NAME)
             task.evaluate(llm)
             task = GSM8K(self.MODEL_NAME)
             task.evaluate(llm)
🧹 Nitpick comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

652-654: Consider adding per-config memory gating for Scout auto_dtype

Unlike Maverick, Scout auto_dtype lacks a mem guard. To avoid CI OOM on smaller cards, mirror the param-aware gating used for Maverick.

Example:

     def test_auto_dtype(self, cuda_graph, tp_size, pp_size, ep_size):
-        if get_device_count() != tp_size * pp_size:
+        if get_device_count() != tp_size * pp_size:
             pytest.skip("Device count mismatch with world size")
+        required_mem = 140000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory: require >= {required_mem}, got {get_device_memory()}"
+            )

726-741: Update deprecated use_cuda_graph usage in integration tests
Two occurrences of the now-removed use_cuda_graph arg remain in
tests/integration/defs/accuracy/test_llm_api_pytorch.py. Replace each with
cuda_graph_config=CudaGraphConfig() if cuda_graph else None for consistency with the rest of the file.

• At line 735, change:

-                use_cuda_graph=cuda_graph) as llm:
+                cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:

• At line 757, change:

-                use_cuda_graph=cuda_graph) as llm:
+                cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:


📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 55f4f2d and b93fc7f.

📒 Files selected for processing (3)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_sanity.txt
  • tests/integration/test_lists/qa/llm_function_full.txt
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🔇 Additional comments (12)
tests/integration/test_lists/qa/llm_function_sanity.txt (3)

73-75: Maverick: tp4 variants added — IDs align with parametrize IDs

The new entries for tp4/tp4ep2/tp4ep4 match the IDs defined in tests/integration/defs/accuracy/test_llm_api_pytorch.py. Looks consistent with the expanded matrix.


78-86: Maverick FP8 matrix expansion looks consistent

The additional FP8 and chunked_prefill permutations (tp4/tp8 with ep variants) align with the upstream test parameterization and cuda_graph toggles. No issues spotted.


91-93: Scout: tp4 variants added — consistent with defs

The newly listed tp4/tp4ep2/tp4ep4 variants correspond to the new param sets in the test definitions. All good.

tests/integration/test_lists/qa/llm_function_full.txt (3)

464-466: Maverick: tp4 auto_dtype variants added — matches defs

These additions for tp4/tp4ep2/tp4ep4 mirror the new parameterizations. No concerns.


472-474: Maverick: tp4 FP8 variants added — matches defs

FP8 permutations for tp4 are listed correctly and align with test IDs in the defs.


482-484: Scout: tp4 auto_dtype variants added — matches defs

Scout’s added tp4 permutations look correct with respect to the updated parameterization.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (6)

30-33: New runtime gating imports — confirm availability and units

Good move to centralize resource gating. Please confirm:

  • get_device_count() and get_device_memory() are available from tests/conftest and stable on CI.
  • get_device_memory()’s unit (MiB vs MB vs GiB) and whether it’s per-device or total. Thresholds elsewhere (e.g., 80_000, 140_000, 270_000) implicitly assume specific units.

If needed, add a brief comment on expected units to prevent future regressions.

Would you like a quick repo scan script to verify the marker and helpers exist and to grep for their usage?
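If useful, a small Python scan along these lines could confirm the helpers and marker are defined and used (hypothetical script; paths assumed relative to the repo root):

import re
from pathlib import Path

# Patterns grounded in this review: helper definitions and the memory-skip marker.
PATTERNS = [r"def get_device_count", r"def get_device_memory", r"skip_less_device_memory"]

for path in Path("tests/integration").rglob("*.py"):
    text = path.read_text(errors="ignore")
    for pattern in PATTERNS:
        for match in re.finditer(pattern, text):
            line_no = text.count("\n", 0, match.start()) + 1
            print(f"{path}:{line_no}: {pattern}")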


512-518: Param expansion to include EP — LGTM

The expanded tp/pp/ep grid and IDs for Maverick auto_dtype are clear and consistent with the test list entries.


524-533: KvCacheConfig added — sensible default

free_gpu_memory_fraction=0.8 is a reasonable starting point to avoid build/warmup OOM; matches other tests. No changes needed.


601-603: CUDA graph control via cuda_graph_config — LGTM

Consistent with new API; aligns with migration away from use_cuda_graph.


647-651: Scout param expansion — LGTM

The tp/pp/ep grid and IDs mirror Maverick’s expansion. Looks good.


676-678: World-size check — LGTM

Per-config world-size validation is correct and prevents misconfigured runs.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (4)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)

519-523: Per-config resource gating to avoid over-skip and potential OOMs

The current check can skip valid 4-GPU runs and won’t guard low-memory 8-GPU runs. Gate by param set: verify world-size first, then apply a tp_size-based per-device memory threshold.

Apply:

-        if get_device_memory() < 270000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        # Gate per-configuration to avoid over-skipping and OOMs.
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        required_mem = 270000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(f"Not enough device memory: require >= {required_mem} MiB")

558-558: Use repository-standard memory skip marker

pytest.mark.skip_device_memory is not a registered marker here; use skip_less_device_memory so the skip actually applies.

Apply:

-    @pytest.mark.skip_device_memory(80000)
+    @pytest.mark.skip_less_device_memory(80000)

565-568: Refactor FP8 gating to be param-aware and ordered

Same gating issue as auto_dtype. Validate world-size first, then guard using tp_size-based per-device memory threshold.

Apply:

-        if get_device_memory() < 140000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        required_mem = 140000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(f"Not enough device memory: require >= {required_mem} MiB")

641-641: Fix class-level memory skip marker on Scout

Replace unregistered skip_device_memory marker with skip_less_device_memory to ensure pytest enforces it.

Apply:

-@pytest.mark.skip_device_memory(80000)
+@pytest.mark.skip_less_device_memory(80000)
🧹 Nitpick comments (3)
tests/integration/defs/accuracy/references/mmlu.yaml (1)

74-75: Align reference with FP8 KV-cache used by tests

The new FP8 entry lacks kv_cache_quant_algo. Given the tests assert FP8 for both weights and KV cache on Maverick, consider specifying kv_cache_quant_algo: FP8 explicitly to avoid selector ambiguity during accuracy matching.

Apply:

-  - quant_algo: FP8
-    accuracy: 86.40
+  - quant_algo: FP8
+    kv_cache_quant_algo: FP8
+    accuracy: 86.40

Please confirm the evaluator’s matching logic; if it requires exact matches on kv_cache_quant_algo when present in the run, this change is necessary to prevent falling back to a less-specific reference.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

733-736: Migrate from use_cuda_graph to cuda_graph_config for chunked prefill (Scout FP8)

Other tests use CudaGraphConfig; keep this consistent and avoid deprecated flags.

Apply:

-                enable_chunked_prefill=True,
-                max_num_tokens=256,
-                use_cuda_graph=cuda_graph) as llm:
+                enable_chunked_prefill=True,
+                max_num_tokens=256,
+                cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:

754-757: Migrate from use_cuda_graph to cuda_graph_config for chunked prefill (Scout FP4)

Consistency with the newer CUDA graph configuration API.

Apply:

-                max_seq_len=22000,
-                enable_chunked_prefill=True,
-                max_num_tokens=256,
-                use_cuda_graph=cuda_graph) as llm:
+                max_seq_len=22000,
+                enable_chunked_prefill=True,
+                max_num_tokens=256,
+                cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b93fc7f and ce2a074.

📒 Files selected for processing (4)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/test_lists/qa/llm_function_sanity.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_full.txt
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧬 Code Graph Analysis (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tests/integration/defs/conftest.py (4)
  • get_device_count (1941-1943)
  • get_device_memory (1946-1977)
  • llm_models_root (77-83)
  • parametrize_with_ids (1786-1811)
tensorrt_llm/llmapi/llm_args.py (2)
  • KvCacheConfig (929-1024)
  • CudaGraphConfig (106-163)
tests/integration/defs/accuracy/accuracy_core.py (1)
  • LlmapiAccuracyTestHarness (768-779)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tests/integration/test_lists/qa/llm_function_full.txt (2)

464-474: Added tp4 variants for Maverick FP8/auto_dtype look correct

The new entries’ parameter IDs (tp4, tp4ep2, tp4ep4 with cuda_graph toggles) match the parameterization in TestLlama4MaverickInstruct.


482-484: Added tp4 variants for Scout auto_dtype look correct

The new tp4, tp4ep2, tp4ep4 cases and CUDA graph toggle align with TestLlama4ScoutInstruct parameter IDs.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

30-33: Consolidated imports from test conftest are appropriate

Importing get_device_count/get_device_memory/parametrize_with_ids here keeps resource gating and parametrization consistent with the shared test utilities.


514-517: Expanding to tp4/ep2/ep4 improves coverage

Adding 4-GPU expert-parallel combinations to Maverick auto_dtype meaningfully extends the matrix without touching runtime logic.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (4)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)

519-523: Fix memory gating: current condition over-skips 4-GPU runs and under-protects 8-GPU runs

The check get_device_memory() < 270000 and get_device_count() < 8 will skip all <8-GPU jobs (even valid tp4 cases) and won’t guard 8-GPU low-memory runs. Gate per configuration and check device-count alignment first.

Apply:

-        if get_device_memory() < 270000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        # 1) Ensure exact world-size match for this config
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        # 2) Per-config memory threshold
+        required_mem = 270000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory: require >= {required_mem}, got {get_device_memory()}"
+            )

558-558: Replace non-registered marker skip_device_memory with repo-standard skip_less_device_memory

@pytest.mark.skip_device_memory isn’t a standard marker in this repo and likely won’t take effect. Use skip_less_device_memory for consistency.

Apply:

-    @pytest.mark.skip_device_memory(80000)
+    @pytest.mark.skip_less_device_memory(80000)

You can verify marker usage/registration with:

#!/bin/bash
set -euo pipefail
# '|| true' keeps the scan going under 'set -e' when a pattern has no matches.
rg -n --no-ignore -C2 '@pytest\.mark\.skip_device_memory\(' || true
rg -n --no-ignore -C2 '@pytest\.mark\.skip_less_device_memory\(' || true
rg -n --no-ignore -C2 'skip_device_memory' pytest.ini setup.cfg tox.ini **/conftest.py || true
rg -n --no-ignore -C2 'skip_less_device_memory' pytest.ini setup.cfg tox.ini **/conftest.py || true

565-569: Fix FP8 memory gating: avoid blanket skip and guard 8-GPU runs

Same gating issue as auto_dtype. Check world-size first, then a tp_size-based memory threshold.

Apply:

-        if get_device_memory() < 140000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        required_mem = 140000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory: require >= {required_mem}, got {get_device_memory()}"
+            )

641-641: Replace class-level skip_device_memory marker

Same marker issue as above; please switch to skip_less_device_memory or register the custom marker.

Apply:

-@pytest.mark.skip_device_memory(80000)
+@pytest.mark.skip_less_device_memory(80000)
🧹 Nitpick comments (3)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)

652-654: Consider adding per-config memory gating (or rely on a fixed class-level marker)

Device-count check is correct. Once the class-level marker is fixed to skip_less_device_memory, memory pressure is mitigated. If CI still OOMs, gate per tp_size as done for Maverick.


726-736: Align chunked prefill to CudaGraphConfig (avoid deprecated use_cuda_graph)

The rest of this file migrated to cuda_graph_config. Update this remaining usage.

Apply:

-                use_cuda_graph=cuda_graph) as llm:
+                cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:

747-757: Align FP4 chunked prefill to CudaGraphConfig

Mirror the change suggested for FP8 chunked prefill for consistency and to avoid deprecated flags.

Apply:

-                use_cuda_graph=cuda_graph) as llm:
+                cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ce2a074 and ddd3e69.

📒 Files selected for processing (4)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/defs/accuracy/references/mmlu.yaml
  • tests/integration/test_lists/qa/llm_function_sanity.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_full.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (8)

30-33: LGTM: imports expanded for resource-aware gating and utilities

Bringing in get_device_count/get_device_memory aligns with the new per-test gating logic and broader parametrization helpers. No issues spotted.


512-518: LGTM: expanded TP/EP matrix for Llama-4 Maverick

The added parameterization covers the tp4 ep2/ep4 cases and keeps IDs consistent with existing conventions.


524-533: LGTM: inject KvCacheConfig to control memory footprint

Passing KvCacheConfig(free_gpu_memory_fraction=0.8) is sensible to reduce OOM risk during warmup for these large-context tests.


561-564: LGTM: expanded TP/EP matrix for Maverick FP8

Matrix mirrors auto_dtype coverage; IDs remain consistent.


577-579: LGTM: switched to CudaGraphConfig

Using cuda_graph_config=CudaGraphConfig() only when CUDA graphs are enabled matches the newer API surface and avoids deprecated flags.


601-603: LGTM: chunked prefill uses CudaGraphConfig toggling

Consistent with the rest of the file.


647-651: LGTM: Scout auto_dtype matrix adds tp4 variants

IDs mirror Maverick’s naming; consistent parameterization.


687-689: LGTM: Scout FP8 uses consistent kv_cache_config/cuda_graph_config

Asserting QuantAlgo.FP8 and enabling graphs conditionally is consistent with the rest of the suite.

tests/integration/test_lists/qa/llm_function_full.txt (3)

464-466: LGTM: add tp4 variants for Llama4 Maverick auto_dtype

Entries match the new parametrization IDs ([tp4], [tp4ep2], [tp4ep4]) and combine with the cuda_graph toggle as expected.


472-474: LGTM: add tp4 variants for Llama4 Maverick FP8

These align with the added test matrix and ID naming in the test file.


482-484: LGTM: add tp4 variants for Llama4 Scout auto_dtype

IDs mirror the test parameterization; inclusion here ensures QA coverage for the new matrix.

@xinhe-nv xinhe-nv force-pushed the user/xinhe/test branch 2 times, most recently from 68026e9 to c0d099c on August 19, 2025 at 02:09

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

734-735: Deprecated arg use_cuda_graph — switch to cuda_graph_config

The LLM API has moved to cuda_graph_config; keep Scout FP8 chunked_prefill aligned with other tests.

-                use_cuda_graph=cuda_graph) as llm:
+                cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:

756-756: Deprecated arg use_cuda_graph — switch to cuda_graph_config (FP4)

Mirror the FP8 fix for the FP4 chunked_prefill test.

-                use_cuda_graph=cuda_graph) as llm:
+                cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
♻️ Duplicate comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

519-523: Fix memory gating: avoid over-skipping <8-GPU runs and under-guarding 8-GPU runs

This global check can skip valid 4-GPU configs and won’t protect low-memory 8-GPU runs. Gate by world-size first, then use a per-tp_size memory threshold.

-        if get_device_memory() < 270000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        # Ensure exact world size
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        # Per-config device memory threshold (MiB)
+        required_mem = 270000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory: require >= {required_mem}, got {get_device_memory()}"
+            )

565-569: Refactor FP8 gating to per-tp_size thresholds

Same gating issue as auto_dtype: the “< 8 GPUs” clause can misgate runs. Check world size first, then apply tp-aware memory thresholds.

-        if get_device_memory() < 140000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        required_mem = 140000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory: require >= {required_mem}, got {get_device_memory()}"
+            )
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

18-18: Unify imports from conftest for consistency

Use a single relative import from ..conftest for all helpers to avoid mixed absolute/relative imports.

-from defs.conftest import get_sm_version
+from ..conftest import get_sm_version
@@
-from ..conftest import (get_device_count, get_device_memory, llm_models_root,
-                        parametrize_with_ids, skip_no_hopper,
-                        skip_post_blackwell, skip_pre_ada, skip_pre_blackwell,
-                        skip_pre_hopper)
+from ..conftest import (
+    get_device_count,
+    get_device_memory,
+    llm_models_root,
+    parametrize_with_ids,
+    skip_no_hopper,
+    skip_post_blackwell,
+    skip_pre_ada,
+    skip_pre_blackwell,
+    skip_pre_hopper,
+    get_sm_version,
+)

Also applies to: 30-33

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 68026e9 and c0d099c.

📒 Files selected for processing (5)
  • tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/defs/accuracy/references/gsm8k.yaml
  • tests/integration/defs/accuracy/references/mmlu.yaml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_sanity.txt
  • tests/integration/test_lists/qa/llm_function_full.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (20)
tests/integration/test_lists/qa/llm_function_full.txt (3)

464-466: Maverick tp4 auto_dtype additions — LGTM

tp4/tp4ep2/tp4ep4 variants expand coverage appropriately and match the new parameterization in the test file. No issues spotted.


472-474: Maverick tp4 FP8 additions — LGTM

Adding tp4 permutations for FP8 keeps parity with tp8 entries and aligns with CI gating in the tests.


482-484: Scout tp4 auto_dtype additions — LGTM

tp4 variants are consistent with the test implementation’s parameter IDs. Looks good.

tests/integration/test_lists/qa/llm_function_sanity.txt (4)

74-76: Maverick tp4 auto_dtype in sanity — LGTM

Good to have light-weight tp4 sanity coverage mirroring the full list.


79-86: Maverick FP8 entries in sanity — LGTM; confirm CI capacity

These FP8 permutations mirror the full list. Ensure the CI stage gating/device filters select suitable GPU SKUs for these cases to avoid timeouts.


92-94: Scout tp4 auto_dtype in sanity — LGTM

Parameter IDs align with the test definitions; no discrepancies found.


99-100: Scout chunked_prefill sanity entries: ensure CUDA graph API alignment

These tests will exercise CUDA graph enabled paths. The implementation currently passes use_cuda_graph rather than cuda_graph_config in the chunked_prefill tests. Please update the test code to use cuda_graph_config to avoid divergence with the rest of the suite.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (13)

512-518: Maverick tp8/tp4(+EP) matrix expansion — LGTM

Broadened matrix is coherent and matches the IDs used in the test lists.


524-533: KV cache budget wired in — LGTM

Passing KvCacheConfig(free_gpu_memory_fraction=0.8) helps avoid warmup OOM. Looks correct.


558-563: Added explicit device-memory skip for FP8 — LGTM

Marker-based gating is consistent with repo conventions. Good addition.


577-578: CUDA graph toggling via config — LGTM

Correct migration to cuda_graph_config for FP8 path.


601-603: FP8 chunked_prefill uses cuda_graph_config — LGTM

Matches the new API; consistent with other tests.


641-641: Class-level device-memory guard on Scout — LGTM

Using skip_less_device_memory is consistent with the rest of the suite.


647-650: Scout tp8/tp4(+EP) matrix expansion — LGTM

IDs align with the sanity/full lists.


652-654: World-size check — LGTM

Simple and effective. Keeps runs from misconfiguring MPI world size.


675-677: World-size check in FP8 — LGTM

Consistent with auto_dtype; good.


686-688: KV cache + CUDA graph config in FP8 — LGTM

Right knobs for memory and CG control. No issues.


700-702: World-size check in FP4 — LGTM

Matches FP8 gating; OK.


711-713: FP4 CUDA graph usage — LGTM

Using cuda_graph_config toggling is correct.


743-744: MPI world-size skip for FP4 chunked_prefill — LGTM

Appropriate gating for 4-GPU configuration.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
tests/integration/test_lists/qa/llm_function_sanity.txt (1)

99-100: Switch chunked_prefill to pass cuda_graph_config instead of use_cuda_graph in tests

These entries are fine, but the corresponding tests in tests/integration/defs/accuracy/test_llm_api_pytorch.py still pass the deprecated use_cuda_graph flag for chunked_prefill (Scout FP8/FP4). Please update the test code to pass cuda_graph_config=CudaGraphConfig() when cuda_graph=True to match the rest of the suite.

Apply the diff suggested in the test file comments for the two occurrences.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)

519-522: Refine gating: order checks and gate by tp_size-specific memory thresholds

Current check can behave unexpectedly across environments (e.g., it never skips 8-GPU runs even if total device memory is insufficient). Prefer verifying device-count alignment first, then gating memory per tp_size against total device memory returned by get_device_memory().

Apply this diff:

-        if get_device_memory() < 270000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        # Ensure exact world size for this parameter set
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        # Gate by total device memory heuristics per tp_size
+        required_mem = 270000 if tp_size >= 8 else 140000  # MiB (total across GPUs)
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough total GPU memory: require >= {required_mem} MiB, "
+                f"got {get_device_memory()} MiB"
+            )

558-566: Remove redundant runtime memory check; rely on decorators + world-size check

This test already has @pytest.mark.skip_less_device_memory(80000). Keep the device-count check and drop the extra runtime memory condition to avoid double/uneven gating logic.

Apply this diff:

-        if get_device_memory() < 140000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")

725-735: Replace deprecated use_cuda_graph with cuda_graph_config in FP8 chunked_prefill

These calls still pass use_cuda_graph. Update to the new CudaGraphConfig API like the other tests in this file to avoid divergence and potential deprecation issues.

Apply this diff:

-        with LLM(
+        with LLM(
                 f"{llm_models_root()}/llama4-models/Llama-4-Scout-17B-16E-Instruct-FP8",
                 tensor_parallel_size=tp_size,
                 max_seq_len=22000,
                 pipeline_parallel_size=pp_size,
                 moe_expert_parallel_size=ep_size,
                 enable_chunked_prefill=True,
                 max_num_tokens=256,
-                use_cuda_graph=cuda_graph) as llm:
+                cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:

748-756: Replace deprecated use_cuda_graph with cuda_graph_config in FP4 chunked_prefill

Same issue as FP8 chunked_prefill: switch to cuda_graph_config.

Apply this diff:

-        with LLM(
+        with LLM(
                 f"{llm_models_root()}/llama4-models/Llama-4-Scout-17B-16E-Instruct-FP4",
                 tensor_parallel_size=tp_size,
                 pipeline_parallel_size=pp_size,
                 moe_expert_parallel_size=ep_size,
                 max_seq_len=22000,
                 enable_chunked_prefill=True,
                 max_num_tokens=256,
-                use_cuda_graph=cuda_graph) as llm:
+                cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
🧹 Nitpick comments (1)
tests/integration/test_lists/qa/llm_function_full.txt (1)

482-484: Commented-out Scout tp4 auto_dtype in “full” list — confirm intent

Unlike in the sanity list, the tp4-based Scout auto_dtype entries are commented out here. If the omission is intentional (e.g., for resource gating), consider adding a brief comment at the top of this file explaining why some entries are elided from the “full” list. Otherwise, enable them to keep the full matrix consistent.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c0d099c and b84884d.

📒 Files selected for processing (5)
  • tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (2 hunks)
  • tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/defs/accuracy/references/mmlu.yaml
  • tests/integration/defs/accuracy/references/gsm8k.yaml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_sanity.txt
  • tests/integration/test_lists/qa/llm_function_full.txt
🧬 Code Graph Analysis (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
tensorrt_llm/llmapi/utils.py (1)
  • get_device_count (83-84)
tests/integration/defs/conftest.py (4)
  • get_device_count (1941-1943)
  • get_device_memory (1946-1977)
  • llm_models_root (77-83)
  • parametrize_with_ids (1786-1811)
tensorrt_llm/llmapi/llm_args.py (2)
  • KvCacheConfig (929-1024)
  • CudaGraphConfig (106-163)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1079-1095)
🔇 Additional comments (14)
tests/integration/test_lists/qa/llm_function_sanity.txt (2)

74-86: Maverick tp4/tp4ep{2,4} additions align with new parameterization

The added entries match the expanded permutations in TestLlama4MaverickInstruct (tp4, tp4ep2, tp4ep4) and reflect the cuda_graph variants present in the tests.


92-94: Scout tp4-based auto_dtype entries look consistent

These entries mirror the new tp4-based permutations for TestLlama4ScoutInstruct::test_auto_dtype in the test file.

tests/integration/test_lists/qa/llm_function_full.txt (3)

464-466: Add tp4-based permutations for Maverick auto_dtype — good

These entries are consistent with the new parameterization in the test code.


472-474: Add tp4-based permutations for Maverick FP8 — good

The new FP8 tp4/tp4ep{2,4} entries align with tests and improve coverage.


493-493: DeepSeekV3Lite bfloat16 entry updated — OK

The updated test id reflects current parameters (no stray enable_chunked_prefill here). Looks synchronized with the test definition.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (9)

30-33: Importing device/memory utilities and helpers — OK

Using get_device_count/get_device_memory with parametrize_with_ids and skip markers aligns with the new gating approach.


512-517: Expanded Maverick tp4/tp4ep{2,4} parameterization — OK

Accurately reflects added configurations; ids are clear and consistent.


524-533: KvCacheConfig usage in Maverick auto_dtype — good

Passing KvCacheConfig(free_gpu_memory_fraction=0.8) is consistent with the rest of the suite and helps avoid warmup OOMs.
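
For context, a minimal sketch of the call pattern this comment endorses, assuming the import paths listed in the code-graph analysis below (tensorrt_llm/llmapi/llm.py and tensorrt_llm/llmapi/llm_args.py); the function name, model path, and parallel sizes are placeholders, not values from the test.

from tensorrt_llm.llmapi.llm import LLM
from tensorrt_llm.llmapi.llm_args import CudaGraphConfig, KvCacheConfig


def run_auto_dtype_sketch(model_path: str, tp_size: int, cuda_graph: bool) -> None:
    """Illustrative only: shows the kv_cache_config / cuda_graph_config plumbing."""
    with LLM(model_path,
             tensor_parallel_size=tp_size,
             # Cap KV-cache allocation to leave warmup headroom, per the review.
             kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8),
             # Enable CUDA graphs only when the parametrized flag asks for them.
             cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
        # The accuracy harness would run MMLU / GSM8K evaluations with `llm` here.
        pass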


577-578: cuda_graph_config conditional — OK

This mirrors the new API usage and matches patterns elsewhere in the file.


641-651: Class-level device-memory gating for Scout — OK

Good use of @skip_less_device_memory with clear tp/ep permutations.


652-654: World-size check for Scout auto_dtype — OK

Consistent with other tests’ gating approach.


686-689: KvCacheConfig in Scout FP8 — sensible default

Using free_gpu_memory_fraction=0.8 is consistent with similar tests; ensures headroom.


701-702: World-size check for Scout FP4 — OK

Keeps parameterized runs aligned with available devices.


743-744: MPI world-size gating for FP4 chunked_prefill — OK

Appropriate guard for tp4ep4 config.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between b84884d and e2dc37a.

📒 Files selected for processing (5)
  • tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (11 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (2 hunks)
  • tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/integration/defs/accuracy/references/gsm8k.yaml
  • tests/integration/test_lists/qa/llm_function_sanity.txt
  • tests/integration/defs/accuracy/references/mmlu.yaml
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_full.txt
🔇 Additional comments (2)
tests/integration/test_lists/qa/llm_function_full.txt (2)

464-466: Llama-4 Maverick tp4 auto_dtype additions verified

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py:
    TestLlama4MaverickInstruct::test_auto_dtype defines the new TP4 variants (tp4, tp4ep2, tp4ep4) alongside existing TP8 patterns.

482-484: Confirmed tp4 auto_dtype permutations are present in llm_function_full.txt

All six tensor-parallel variants—tp8, tp8ep4, tp8ep8 and tp4, tp4ep2, tp4ep4—with their respective cuda_graph flags appear in tests/integration/test_lists/qa/llm_function_full.txt (lines 479–484), matching the parametrization in TestLlama4ScoutInstruct. No further changes needed.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

519-523: Global memory gate can skip valid 4-GPU runs and fail to protect low-mem 8-GPU runs

As previously noted, the condition get_device_memory() < 270000 and get_device_count() < 8 over-skips <8-GPU runs and under-protects 8-GPU runs. Gate per configuration based on tp_size and check device-count alignment first.

Apply:

-        if get_device_memory() < 270000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        # Gate by per-config resources to avoid over/under-skipping.
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        required_mem = 270000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory: require >= {required_mem} MiB per device, "
+                f"got {get_device_memory()}"
+            )

565-569: Same gating issue as auto_dtype — use tp_size-based memory threshold

Order checks: first device-count alignment, then per-tp memory threshold. This avoids over-skipping tp4 and protects low-mem tp8.

Apply:

-        if get_device_memory() < 140000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        required_mem = 140000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory: require >= {required_mem} MiB per device, "
+                f"got {get_device_memory()}"
+            )
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

577-578: Conditional CUDA Graph config — nit: consider DRY-ing

The pattern CudaGraphConfig() if cuda_graph else None repeats across many tests. As a minor cleanup, assign it once to a local (e.g., cg = CudaGraphConfig() if cuda_graph else None) to reduce duplication and ease future tuning; see the sketch below.
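
A minimal sketch of that cleanup, assuming the CudaGraphConfig import path from the code-graph analysis; cuda_graph_config_for is a hypothetical local helper name, not an existing API.

from tensorrt_llm.llmapi.llm_args import CudaGraphConfig


def cuda_graph_config_for(cuda_graph: bool):
    """Hypothetical helper: build the optional config once instead of inline everywhere."""
    return CudaGraphConfig() if cuda_graph else None


# Each test then passes cuda_graph_config=cuda_graph_config_for(cuda_graph)
# instead of repeating the inline conditional.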

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between e2dc37a and a5a29d2.

📒 Files selected for processing (6)
  • tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (11 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
  • tests/integration/test_lists/waives.txt (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/test_lists/waives.txt
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/defs/accuracy/references/mmlu.yaml
  • tests/integration/test_lists/qa/llm_function_sanity.txt
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_full.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (18)
tests/integration/defs/accuracy/references/gsm8k.yaml (1)

24-26: FP8 GSM8K baseline added for Maverick — MMLU counterpart confirmed

The new FP8 + FP8 KV entry for meta-llama/Llama-4-Maverick-17B-128E-Instruct also exists in the MMLU references:

  • tests/integration/defs/accuracy/references/mmlu.yaml: lines 74–75 and 77–79

LGTM.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (14)

30-33: Importing device gating utilities — good addition

Pulling in get_device_count/get_device_memory and common markers centralizes resource gating. No issues.


514-517: Expanded Maverick permutations (tp8/tp4 with EP variants) — good coverage

The added parameter space aligns with the PR goal (tp4). Test IDs are clear and consistent.


524-533: KvCacheConfig injection for Maverick auto_dtype — good

Setting free_gpu_memory_fraction=0.8 is sensible to reduce OOM risk at warmup and aligns with surrounding tests.


558-558: Standardized memory marker — good

Using @pytest.mark.skip_less_device_memory(80000) matches repo conventions and ensures the skip is honored.


561-563: Expanded FP8 permutations for Maverick — good

Covering both tp8 and tp4 with EP variants increases confidence in FP8 coverage.


601-602: Chunked prefill FP8 (Maverick) — conditional CUDA Graph looks correct

Matches the updated API surface (cuda_graph_config) and other tests.


641-643: Class-level device + host memory gating for Scout — good

These broad gates help keep per-test logic lean. The per-test device-count checks complement them well.
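
For reference, a minimal sketch of this class-level gating pattern; the marker names and thresholds are the ones quoted in this review, while the class and test names are hypothetical placeholders.

import pytest


@pytest.mark.skip_less_device_memory(80000)  # device-memory threshold cited in the review
@pytest.mark.skip_less_host_memory(100000)   # host-memory threshold cited in the review
class TestScoutLikeSuite:
    """Class-level gates absorb the memory checks; tests keep only world-size logic."""

    def test_placeholder(self):
        # Real tests would additionally skip when
        # get_device_count() != tp_size * pp_size.
        pass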


648-651: Scout: tp8/tp4 with EP variants — good parity with Maverick

Mirrors Maverick coverage, which is helpful for cross-model comparisons.


653-655: Device-count alignment gate — good

Checking get_device_count() == tp_size * pp_size up front prevents misconfigured runs.


676-678: Device-count alignment for Scout FP8 — good

Consistent gating across tests; keeps CI predictable.


687-687: Scout FP8: KvCacheConfig added — good

Using free_gpu_memory_fraction=0.8 matches Maverick and reduces OOM risk.


735-736: Scout FP8 chunked prefill: conditional CUDA Graph — good

Matches the updated CUDA graph configuration pattern.


745-746: World-size gate for Scout FP4 chunked prefill — good

skip_less_mpi_world_size(4) is appropriate for tp4-only parametrization here.


758-759: Scout FP4 chunked prefill: conditional CUDA Graph — good

Consistent with other tests.

tests/integration/test_lists/qa/llm_function_full.txt (3)

470-472: QA list: added Maverick tp4 auto_dtype entries — aligned with defs

IDs and cuda_graph flags match the parameterization in TestLlama4MaverickInstruct::test_auto_dtype.


478-480: QA list: added Maverick tp4 FP8 entries — aligned with defs

Reflects new FP8 permutations for tp4; good to keep the QA list in sync.


488-490: QA list: added Scout tp4 auto_dtype entries — aligned with defs

Parallels Maverick’s additions; consistent IDs and graph flags.

@xinhe-nv xinhe-nv changed the title from "[None][feat] add llama4 tp4 tests" to "[TRTLLM-7205][feat] add llama4 tp4 tests" on Aug 19, 2025
@xinhe-nv xinhe-nv marked this pull request as ready for review August 19, 2025 08:26
@xinhe-nv xinhe-nv enabled auto-merge (squash) August 19, 2025 08:26
@xinhe-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15747 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

512-523: Memory/device gating can skip valid 4-GPU runs and miss low-mem 8-GPU OOMs

The condition get_device_memory() < 270000 and get_device_count() < 8 skips any <8-GPU runner that falls below the 270000 threshold, regardless of that configuration's actual requirement, and never guards 8-GPU nodes with insufficient memory. This risks both over-skipping and OOMs.

Refactor to:

  • Check device-count alignment first.
  • Gate memory per tp_size (heuristic thresholds; tune as needed).

Apply:

 @skip_pre_blackwell
 @parametrize_with_ids("cuda_graph", [False, True])
 @pytest.mark.parametrize(
     "tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4), (8, 1, 8), (4, 1, 1),
                                 (4, 1, 2), (4, 1, 4)],
     ids=["tp8", "tp8ep4", "tp8ep8", "tp4", "tp4ep2", "tp4ep4"])
 def test_auto_dtype(self, cuda_graph, tp_size, pp_size, ep_size):
-        if get_device_memory() < 270000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        # Ensure world-size alignment first
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        # Per-config memory threshold (heuristic)
+        required_mem = 270000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory: require >= {required_mem} MiB, got {get_device_memory()} MiB"
+            )
 
         kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.8)

558-569: Same gating issue in FP8 — fix ordering and thresholds to avoid OOM/over-skips

Mirror the auto_dtype fix: first check device-count alignment, then apply a tp_size-based memory threshold. The current condition (get_device_memory() < 140000 and get_device_count() < 8) leaves 8-GPU low-memory nodes unguarded and can over-skip 4-GPU configs.

Apply:

 @skip_pre_hopper
 @pytest.mark.skip_less_device_memory(80000)
 @parametrize_with_ids("cuda_graph", [False, True])
 @pytest.mark.parametrize(
     "tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4), (8, 1, 8), (4, 1, 1),
                                 (4, 1, 2), (4, 1, 4)],
     ids=["tp8", "tp8ep4", "tp8ep8", "tp4", "tp4ep2", "tp4ep4"])
 def test_fp8(self, cuda_graph, tp_size, pp_size, ep_size):
-        if get_device_memory() < 140000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        # World-size alignment first
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        # Memory threshold per tp_size (tune if needed)
+        required_mem = 140000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory: require >= {required_mem} MiB, got {get_device_memory()} MiB"
+            )
🧹 Nitpick comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

721-737: Consider adding a device-memory gate for chunked_prefill tp4ep4

max_seq_len=22000 with chunked_prefill and ep4 can be memory-intensive. Adding a device-memory skip reduces CI OOM risk on smaller nodes.

Add this decorator above the test:

@pytest.mark.skip_less_device_memory(80000)

745-759: Same recommendation for FP4 chunked_prefill

Same rationale as FP8; add a device-memory guard for safety on memory-constrained runners.

Add:

@pytest.mark.skip_less_device_memory(80000)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between a5a29d2 and 85079ec.

📒 Files selected for processing (6)
  • tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (11 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
  • tests/integration/test_lists/waives.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/integration/defs/accuracy/references/mmlu.yaml
  • tests/integration/test_lists/waives.txt
  • tests/integration/defs/accuracy/references/gsm8k.yaml
  • tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_sanity.txt
🧬 Code Graph Analysis (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tests/integration/defs/conftest.py (4)
  • get_device_count (1941-1943)
  • get_device_memory (1946-1977)
  • llm_models_root (77-83)
  • parametrize_with_ids (1786-1811)
tensorrt_llm/llmapi/llm_args.py (2)
  • KvCacheConfig (929-1024)
  • CudaGraphConfig (106-163)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1079-1095)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/test_lists/qa/llm_function_sanity.txt (2)

74-86: Nodeid Formatting Verified — No Updates Required

The decorator order in TestLlama4MaverickInstruct (@pytest.mark.parametrize for tp_size/pp_size/ep_size closest to the function, then @parametrize_with_ids for cuda_graph) produces nodeids in the form [tpX-cuda_graph=Y]. That matches the entries in llm_function_sanity.txt, so no changes are needed.
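
A small self-contained illustration of that ordering, using plain pytest.mark.parametrize in place of the repo's parametrize_with_ids helper; the test name and the reduced parameter set are placeholders.

import pytest


# The decorator closest to the function contributes the leading id component,
# so collection yields nodeids such as test_auto_dtype[tp4-cuda_graph=False].
@pytest.mark.parametrize("cuda_graph", [False, True],
                         ids=lambda v: f"cuda_graph={v}")
@pytest.mark.parametrize("tp_size,pp_size,ep_size",
                         [(8, 1, 1), (4, 1, 1)],
                         ids=["tp8", "tp4"])
def test_auto_dtype(cuda_graph, tp_size, pp_size, ep_size):
    pass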


92-100: Verify Scout permutations & chunked_prefill nodeid order and cuda_graph_config usage

  • File tests/integration/test_lists/qa/llm_function_sanity.txt (lines 92–100): confirm that each entry for TestLlama4ScoutInstruct exactly matches the pytest-generated nodeid (including id text and order) for
    • test_auto_dtype
    • test_fp8
    • test_fp4
    • test_fp8_chunked_prefill
    • test_fp4_chunked_prefill
  • The underlying tests in tests/integration/defs/accuracy/test_llm_api_pytorch.py now all use cuda_graph_config (no deprecated use_cuda_graph remains).
  • Run locally to inspect collection order, for example:
    pytest -q --collect-only \
      tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_auto_dtype \
      | rg '::test_auto_dtype\['
    pytest -q --collect-only \
      tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_fp8_chunked_prefill \
      | rg '::test_fp8_chunked_prefill\['
  • If any id or ordering mismatches appear, update either the list entries or the decorator/argument order in the tests to keep them in sync.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)

30-33: Importing get_device_count/get_device_memory for gating — LGTM

Using centralized helpers improves consistency across tests.


524-535: KV cache fraction injection and cuda_graph_config control — LGTM

Introducing KvCacheConfig(free_gpu_memory_fraction=0.8) and switching to cuda_graph_config makes the new permutations more robust and consistent with the API.


577-578: CUDA graph arg migration — LGTM

Using cuda_graph_config=CudaGraphConfig() instead of the deprecated flag is correct and consistent with the rest of the suite.

Also applies to: 601-602


641-643: Class-level resource gates for Scout — LGTM

skip_less_device_memory(80000) and skip_less_host_memory(100000) provide a useful coarse guard for the expanded matrix.


648-655: Scout auto_dtype permutations and device-count gate — LGTM

Param ids and the device-count alignment check look good for the new tp4/ep variants.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

519-523: Fix memory gating: current condition over-skips TP4 and under-gates TP8

The check get_device_memory() < 270000 and get_device_count() < 8 will:

  • Skip all <8-GPU runs (including valid TP4 configs), and
  • Never skip 8-GPU runs even if memory is insufficient.

Gate by tp_size and check device-count alignment first. Suggested change:

-        if get_device_memory() < 270000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        # Ensure we have exactly tp_size × pp_size GPUs
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+
+        # Memory gate tuned per TP size (heuristics; adjust as needed)
+        required_mem = 270000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory: require ≥ {required_mem} MiB per GPU, "
+                f"got {get_device_memory()} MiB"
+            )

565-569: Fix FP8 memory gating: same issue as auto_dtype

Same over/under gating risk as the bf16 test. Apply tp_size-aware memory gating after verifying device count:

-        if get_device_memory() < 140000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        required_mem = 140000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory: require ≥ {required_mem} MiB per GPU, "
+                f"got {get_device_memory()} MiB"
+            )
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

745-746: Prefer device-based gating over MPI world-size marker for consistency

Elsewhere we’ve moved to device/memory gating; this test still uses skip_less_mpi_world_size(4). For consistency (and to decouple from MPI), consider switching to device-based gating:

-    @pytest.mark.skip_less_mpi_world_size(4)
+    @pytest.mark.skip_less_device(4)

If you want to harden it further, add a top-of-test guard:

if get_device_count() != tp_size * pp_size:
    pytest.skip("Device count mismatch with world size")
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 85079ec and a93d565.

📒 Files selected for processing (6)
  • tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (11 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
  • tests/integration/test_lists/waives.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/integration/test_lists/waives.txt
  • tests/integration/defs/accuracy/references/mmlu.yaml
  • tests/integration/defs/accuracy/references/gsm8k.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_sanity.txt
  • tests/integration/test_lists/qa/llm_function_full.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (8)

30-33: LGTM: centralizing device/memory helpers

Importing get_device_count/get_device_memory from defs.conftest is consistent with how the file imports get_sm_version and aligns with the new per-test resource gating.


514-518: Good coverage expansion: add TP/EP permutations incl. TP4

The expanded parameterization (tp8/ep{1,4,8} and tp4/ep{1,2,4}) matches the PR intent (tp4 scenarios) and broadens coverage meaningfully.


524-524: KV cache config added — sensible defaults for stability

Using KvCacheConfig(free_gpu_memory_fraction=0.8) is a reasonable starting point to mitigate warmup OOMs in CI while keeping prefill room. No issues here.

Also applies to: 532-533


558-564: LGTM: scoped skip marker and expanded permutations

  • skip_less_device_memory(80000) on FP8 is aligned with other tests.
  • Adding TP/EP permutations mirrors the auto_dtype expansion.

577-578: LGTM: switched to CudaGraphConfig consistently

Using cuda_graph_config=CudaGraphConfig() if cuda_graph else None is consistent across tests and avoids divergence from deprecated flags.

Also applies to: 601-602, 735-737, 758-759


641-643: Class-level host/device memory gating for Scout: good guardrails

skip_less_device_memory(80000) and skip_less_host_memory(100000) at class scope keep individual tests simple and reduce repeated checks. Looks good.


649-651: Param expansion for Scout (TP8/TP4; EP variants) is appropriate

These mirror Maverick’s matrix and deliver TP4 coverage per PR goal.


653-655: Device-count alignment checks are correct

Validating get_device_count() == tp_size * pp_size is the right gate for these Scout tests, given class-level memory guards.

Also applies to: 676-678, 701-703

tests/integration/test_lists/qa/llm_function_sanity.txt (2)

74-86: Sanity list updated with Maverick TP4 and FP8 variants — good

The added tp4-{cg=False, ep2/ep4 cg=True} variants and FP8 permutations reflect the new parameterizations. This keeps the smoke surface representative while not exploding runtime.


92-100: ✅ New Llama4 TP4 and chunked_prefill entries verified in llm_function_sanity.txt

The following entries were found:

• tests/integration/test_lists/qa/llm_function_sanity.txt:
– Line 74: accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_auto_dtype[tp4-cuda_graph=False]
– Line 84: accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_fp8[tp4-cuda_graph=True]
– Line 92: accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_auto_dtype[tp4-cuda_graph=False]
– Line 99: accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_fp8_chunked_prefill[tp4ep4-cuda_graph=True]

All entries align with the new TestLlama4ScoutInstruct coverage. No further updates needed.

@tensorrt-cicd
Collaborator

PR_Github #15747 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11836 completed with status: 'FAILURE'

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

519-523: Fix memory gating: current AND condition skips valid 4‑GPU runs and leaves 8‑GPU runs ungated (risking OOM)

  • get_device_memory() < 270000 and get_device_count() < 8 will never gate 8‑GPU runs even when memory is too low.
  • It can over‑skip 4‑GPU runs irrespective of their lower memory needs.

Order checks by device-count alignment first, then gate per tp_size.

Apply:

-        if get_device_memory() < 270000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        # Gate by per-config resources to avoid over/under-skipping.
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        required_mem = 270000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory for tp{tp_size}: "
+                f"require >= {required_mem} MiB, got {get_device_memory()} MiB"
+            )

565-569: Fix FP8 memory gating: same AND issue; gate per tp_size

As above, the current logic doesn’t protect 8‑GPU runs and can over‑skip 4‑GPU runs.

Apply:

-        if get_device_memory() < 140000 and get_device_count() < 8:
-            pytest.skip("Not enough memory for this test")
-        if get_device_count() != tp_size * pp_size:
-            pytest.skip("Device count mismatch with world size")
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+        required_mem = 140000 if tp_size >= 8 else 80000
+        if get_device_memory() < required_mem:
+            pytest.skip(
+                f"Not enough device memory for tp{tp_size}: "
+                f"require >= {required_mem} MiB, got {get_device_memory()} MiB"
+            )
🧹 Nitpick comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

30-33: Use a single import style (relative vs package) to avoid confusion

This file mixes from defs.conftest ... (Line 18) with from ..conftest ... (Lines 30–33). Prefer one style consistently. Since this module already uses relative imports, pull get_sm_version into the same relative import and drop the package import at Line 18.

Apply within this hunk:

-from ..conftest import (get_device_count, get_device_memory, llm_models_root,
-                        parametrize_with_ids, skip_no_hopper,
-                        skip_post_blackwell, skip_pre_ada, skip_pre_blackwell,
-                        skip_pre_hopper)
+from ..conftest import (get_device_count, get_device_memory, llm_models_root,
+                        parametrize_with_ids, skip_no_hopper,
+                        skip_post_blackwell, skip_pre_ada, skip_pre_blackwell,
+                        skip_pre_hopper, get_sm_version)

Then remove the earlier absolute import at Line 18:

# Remove this after adding get_sm_version to the relative import:
# from defs.conftest import get_sm_version

712-713: Consider adding kv_cache_config to Scout FP4 for parity and OOM headroom

FP4 also benefits from capping KV allocation (FP8 KV is already asserted). Mirror the FP8 test by passing a KV cache config.

Apply:

                 moe_expert_parallel_size=ep_size,
+                kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8),
                 cuda_graph_config=CudaGraphConfig()
                 if cuda_graph else None) as llm:
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between a93d565 and 4e8dd72.

📒 Files selected for processing (6)
  • tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (11 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
  • tests/integration/test_lists/waives.txt (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/defs/accuracy/references/mmlu.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/integration/defs/accuracy/references/gsm8k.yaml
  • tests/integration/test_lists/qa/llm_function_sanity.txt
  • tests/integration/test_lists/waives.txt
  • tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
tests/integration/defs/conftest.py (4)
  • get_device_count (1941-1943)
  • get_device_memory (1946-1977)
  • llm_models_root (77-83)
  • parametrize_with_ids (1786-1811)
tensorrt_llm/llmapi/llm_args.py (2)
  • KvCacheConfig (929-1024)
  • CudaGraphConfig (106-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (14)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (14)

514-517: LGTM: TP/EP matrix expansion for Llama‑4 Maverick

The added parameterization covers tp8 and tp4 variants with EP scaling. Aligns with the PR objective to add tp4 tests.


524-534: LGTM: Reduce KV cache pressure

Passing KvCacheConfig(free_gpu_memory_fraction=0.8) and toggling cuda_graph_config per case is a sensible way to reduce warmup/alloc OOMs and keeps the CG behavior explicit.


558-558: LGTM: Correct marker used

Switching to the repository-standard skip_less_device_memory(80000) makes the skip effective under pytest.


560-563: LGTM: FP8 matrix expansion

Added tp8 and tp4 with EP scaling; consistent with the Maverick coverage.


577-578: LGTM: CUDA Graph config toggle

Passing CudaGraphConfig() only when requested is clear and consistent with other tests.


641-643: LGTM: Class-level device and host memory guards for Scout

These thresholds help avoid node instability and match comments about system constraints.


648-651: LGTM: Scout TP/EP matrix expansion

Covers tp8/ep variants and tp4/ep variants. Looks good.


653-655: LGTM: Device-count alignment check

Early skip on world-size mismatch is correct for per-parameter gating.


676-678: LGTM: Device-count check for Scout FP8

Keeps the run constrained to the intended topology.


687-687: LGTM: Lower KV cache footprint for Scout FP8

Setting free_gpu_memory_fraction=0.8 mirrors Maverick and reduces OOM risk.


701-703: LGTM: Device-count gating for Scout FP4

Topology check is correct. One minor improvement (adding a kv_cache_config, see the nitpick above) would align memory-pressure handling with FP8.


735-736: LGTM: CUDA Graph config toggle (chunked prefill FP8)

Consistent with the rest of the suite.


745-746: LGTM: MPI world-size gating retained for this specific case

Given chunked prefill sensitivity, the explicit skip_less_mpi_world_size(4) here is acceptable.


758-759: LGTM: CUDA Graph config toggle (chunked prefill FP4)

Matches the FP8 pattern.

Signed-off-by: Xin He (SW-GPU) <[email protected]>
@xinhe-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15831 [ run ] triggered by Bot

@LarryXFly LarryXFly disabled auto-merge August 20, 2025 05:21
@LarryXFly LarryXFly merged commit 9e71b4f into NVIDIA:main Aug 20, 2025
3 checks passed
@xinhe-nv xinhe-nv deleted the user/xinhe/test branch August 20, 2025 05:22
@tensorrt-cicd
Collaborator

PR_Github #15831 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11901 completed with status: 'FAILURE'
