[TRTLLM-7205][feat] add llama4 tp4 tests #6989
Conversation
📝 Walkthrough

Replaces MPI world-size gating with device-count and memory checks in PyTorch LLM accuracy tests, introduces KvCacheConfig and CudaGraphConfig in LLM instantiation, adds a new Scout test class, expands tp4/tp8 parameterizations, adds test list entries, and augments the MMLU/GSM8K accuracy reference YAMLs. Adds get_device_count/get_device_memory utilities.

Sequence Diagram(s)

sequenceDiagram
autonumber
actor QA as QA Runner
participant T as PyTorch Accuracy Test
participant U as Test Utils (conftest)
participant L as LLM Factory
QA->>T: Parametrized test (tp/ep, cuda_graph)
T->>U: get_device_count(), get_device_memory()
alt Gating not satisfied
T-->>QA: skip (device-count/memory/host-memory)
else Gating satisfied
T->>L: Instantiate LLM(kv_cache_config, cuda_graph_config)
L-->>T: LLM instance
T->>L: Run generation/prefill
L-->>T: Outputs
T-->>QA: Assertions pass/fail
end
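To make the flow above concrete, here is a minimal sketch of the test pattern the diagram describes: resource gating via the conftest helpers, then LLM construction with KvCacheConfig and CudaGraphConfig. It is illustrative only; the import paths follow the modules cited later in this review (tensorrt_llm/llmapi and tests/integration/defs/conftest.py), while the thresholds, parameter grid, and evaluation step are assumptions rather than the PR's exact code.

```python
# Illustrative sketch only; thresholds, parameter grid, and the evaluation step
# are assumptions based on the review discussion, not the PR's exact test code.
import pytest

from tensorrt_llm.llmapi.llm import LLM
from tensorrt_llm.llmapi.llm_args import CudaGraphConfig, KvCacheConfig

# In the real suite these helpers come from tests/integration/defs/conftest.py.
from ..conftest import get_device_count, get_device_memory, llm_models_root


@pytest.mark.parametrize("cuda_graph", [False, True])
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(4, 1, 1), (8, 1, 8)],
                         ids=["tp4", "tp8ep8"])
def test_scout_fp8_sketch(cuda_graph, tp_size, pp_size, ep_size):
    # Gate on exact world size before doing anything expensive.
    if get_device_count() != tp_size * pp_size:
        pytest.skip("Device count mismatch with world size")
    # Gate on device memory; units assumed to be MiB, per the review discussion.
    required_mem = 270000 if tp_size >= 8 else 80000
    if get_device_memory() < required_mem:
        pytest.skip(f"Not enough device memory: require >= {required_mem}")

    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.8)
    with LLM(f"{llm_models_root()}/llama4-models/Llama-4-Scout-17B-16E-Instruct-FP8",
             tensor_parallel_size=tp_size,
             pipeline_parallel_size=pp_size,
             moe_expert_parallel_size=ep_size,
             kv_cache_config=kv_cache_config,
             cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
        # MMLU/GSM8K accuracy tasks would be evaluated against `llm` here.
        assert llm is not None
```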
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 5
🔭 Outside diff range comments (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
748-763: Align CUDA graph parameter naming for LLM instantiation

Replace the deprecated use_cuda_graph argument with the new cuda_graph_config across this test to prevent runtime errors once use_cuda_graph is removed.

• File: tests/integration/defs/accuracy/test_llm_api_pytorch.py
• Lines: 748–763

 def test_fp4_chunked_prefill(self, cuda_graph, tp_size, pp_size, ep_size):
-    with LLM(
+    with LLM(
             f"{llm_models_root()}/llama4-models/Llama-4-Scout-17B-16E-Instruct-FP4",
             tensor_parallel_size=tp_size,
             pipeline_parallel_size=pp_size,
             moe_expert_parallel_size=ep_size,
             max_seq_len=22000,
             enable_chunked_prefill=True,
             max_num_tokens=256,
-            use_cuda_graph=cuda_graph) as llm:
+            cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
         assert llm.args.quant_config.quant_algo == QuantAlgo.NVFP4
         assert llm.args.quant_config.kv_cache_quant_algo == QuantAlgo.FP8
         task = MMLU(self.MODEL_NAME)
         task.evaluate(llm)
         task = GSM8K(self.MODEL_NAME)
         task.evaluate(llm)
🧹 Nitpick comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
652-654: Consider adding per-config memory gating for Scout auto_dtype

Unlike Maverick, Scout auto_dtype lacks a memory guard. To avoid CI OOM on smaller cards, mirror the param-aware gating used for Maverick.

Example:

 def test_auto_dtype(self, cuda_graph, tp_size, pp_size, ep_size):
-    if get_device_count() != tp_size * pp_size:
+    if get_device_count() != tp_size * pp_size:
         pytest.skip("Device count mismatch with world size")
+    required_mem = 140000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory: require >= {required_mem}, got {get_device_memory()}"
+        )
726-741: Update deprecated use_cuda_graph usage in integration tests

Two occurrences of the now-removed use_cuda_graph arg remain in tests/integration/defs/accuracy/test_llm_api_pytorch.py. Replace each with cuda_graph_config=CudaGraphConfig() if cuda_graph else None for consistency with the rest of the file.

• At line 735, change:

-    use_cuda_graph=cuda_graph) as llm:
+    cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:

• At line 757, change:

-    use_cuda_graph=cuda_graph) as llm:
+    cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
- tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_sanity.txt
tests/integration/test_lists/qa/llm_function_full.txt
tests/integration/defs/accuracy/test_llm_api_pytorch.py
🔇 Additional comments (12)
tests/integration/test_lists/qa/llm_function_sanity.txt (3)
73-75: Maverick: tp4 variants added — IDs align with parametrize IDs

The new entries for tp4/tp4ep2/tp4ep4 match the IDs defined in tests/integration/defs/accuracy/test_llm_api_pytorch.py. Looks consistent with the expanded matrix.

78-86: Maverick FP8 matrix expansion looks consistent

The additional FP8 and chunked_prefill permutations (tp4/tp8 with ep variants) align with the upstream test parameterization and cuda_graph toggles. No issues spotted.

91-93: Scout: tp4 variants added — consistent with defs

The newly listed tp4/tp4ep2/tp4ep4 variants correspond to the new param sets in the test definitions. All good.
tests/integration/test_lists/qa/llm_function_full.txt (3)
464-466: Maverick: tp4 auto_dtype variants added — matches defs

These additions for tp4/tp4ep2/tp4ep4 mirror the new parameterizations. No concerns.

472-474: Maverick: tp4 FP8 variants added — matches defs

FP8 permutations for tp4 are listed correctly and align with test IDs in the defs.

482-484: Scout: tp4 auto_dtype variants added — matches defs

Scout's added tp4 permutations look correct with respect to the updated parameterization.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (6)
30-33: New runtime gating imports — confirm availability and units

Good move to centralize resource gating. Please confirm:
- get_device_count() and get_device_memory() are available from tests/conftest and stable on CI.
- get_device_memory()’s unit (MiB vs MB vs GiB) and whether it’s per-device or total. Thresholds elsewhere (e.g., 80_000, 140_000, 270_000) implicitly assume specific units.
If needed, add a brief comment on expected units to prevent future regressions.
Would you like a quick repo scan script to verify the marker and helpers exist and to grep for their usage?
512-518: Param expansion to include EP — LGTM

The expanded tp/pp/ep grid and IDs for Maverick auto_dtype are clear and consistent with the test list entries.

524-533: KvCacheConfig added — sensible default

free_gpu_memory_fraction=0.8 is a reasonable starting point to avoid build/warmup OOM; matches other tests. No changes needed.

601-603: CUDA graph control via cuda_graph_config — LGTM

Consistent with new API; aligns with migration away from use_cuda_graph.

647-651: Scout param expansion — LGTM

The tp/pp/ep grid and IDs mirror Maverick's expansion. Looks good.

676-678: World-size check — LGTM

Per-config world-size validation is correct and prevents misconfigured runs.
b93fc7f to ce2a074 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (4)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
519-523: Per-config resource gating to avoid over-skip and potential OOMs

The current check can skip valid 4-GPU runs and won't guard low-memory 8-GPU runs. Gate by param set: verify world-size first, then apply a tp_size-based per-device memory threshold.

Apply:

-    if get_device_memory() < 270000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    # Gate per-configuration to avoid over-skipping and OOMs.
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    required_mem = 270000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(f"Not enough device memory: require >= {required_mem} MiB")
558-558: Use repository-standard memory skip marker

pytest.mark.skip_device_memory is not a registered marker here; use skip_less_device_memory so the skip actually applies.

Apply:

-    @pytest.mark.skip_device_memory(80000)
+    @pytest.mark.skip_less_device_memory(80000)
565-568: Refactor FP8 gating to be param-aware and ordered

Same gating issue as auto_dtype. Validate world-size first, then guard using a tp_size-based per-device memory threshold.

Apply:

-    if get_device_memory() < 140000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    required_mem = 140000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(f"Not enough device memory: require >= {required_mem} MiB")
641-641: Fix class-level memory skip marker on Scout

Replace the unregistered skip_device_memory marker with skip_less_device_memory to ensure pytest enforces it.

Apply:

-@pytest.mark.skip_device_memory(80000)
+@pytest.mark.skip_less_device_memory(80000)
🧹 Nitpick comments (3)
tests/integration/defs/accuracy/references/mmlu.yaml (1)
74-75: Align reference with FP8 KV-cache used by tests

The new FP8 entry lacks kv_cache_quant_algo. Given tests assert FP8 for both weights and KV cache on Maverick, consider specifying KV8 explicitly to avoid selector ambiguity during accuracy matching.

Apply:

-- quant_algo: FP8
-  accuracy: 86.40
+- quant_algo: FP8
+  kv_cache_quant_algo: FP8
+  accuracy: 86.40

Please confirm the evaluator's matching logic; if it requires exact matches on kv_cache_quant_algo when present in the run, this change is necessary to prevent falling back to a less-specific reference.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
733-736: Migrate from use_cuda_graph to cuda_graph_config for chunked prefill (Scout FP8)

Other tests use CudaGraphConfig; keep this consistent and avoid deprecated flags.

Apply:

-    enable_chunked_prefill=True,
-    max_num_tokens=256,
-    use_cuda_graph=cuda_graph) as llm:
+    enable_chunked_prefill=True,
+    max_num_tokens=256,
+    cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:

754-757: Migrate from use_cuda_graph to cuda_graph_config for chunked prefill (Scout FP4)

Consistency with the newer CUDA graph configuration API.

Apply:

-    max_seq_len=22000,
-    enable_chunked_prefill=True,
-    max_num_tokens=256,
-    use_cuda_graph=cuda_graph) as llm:
+    max_seq_len=22000,
+    enable_chunked_prefill=True,
+    max_num_tokens=256,
+    cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
- tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/qa/llm_function_sanity.txt
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_full.txt
tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧬 Code Graph Analysis (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tests/integration/defs/conftest.py (4)
- get_device_count (1941-1943)
- get_device_memory (1946-1977)
- llm_models_root (77-83)
- parametrize_with_ids (1786-1811)

tensorrt_llm/llmapi/llm_args.py (2)
- KvCacheConfig (929-1024)
- CudaGraphConfig (106-163)

tests/integration/defs/accuracy/accuracy_core.py (1)
- LlmapiAccuracyTestHarness (768-779)
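For orientation, the conftest helpers listed above might look roughly like the stand-ins below. This is an assumption based on their names and the MiB-scale thresholds used in the tests; the actual implementations at the cited lines may differ (for example, by querying nvidia-smi instead of torch).

```python
# Hypothetical stand-ins for the conftest helpers referenced above; the real
# implementations in tests/integration/defs/conftest.py may differ.
import torch


def get_device_count() -> int:
    """Number of visible CUDA devices."""
    return torch.cuda.device_count()


def get_device_memory() -> int:
    """Total memory of device 0 in MiB (unit assumed from the test thresholds)."""
    if torch.cuda.device_count() == 0:
        return 0
    return torch.cuda.get_device_properties(0).total_memory // (1024 * 1024)
```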
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tests/integration/test_lists/qa/llm_function_full.txt (2)
464-474: Added tp4 variants for Maverick FP8/auto_dtype look correct

The new entries' parameter IDs (tp4, tp4ep2, tp4ep4 with cuda_graph toggles) match the parameterization in TestLlama4MaverickInstruct.

482-484: Added tp4 variants for Scout auto_dtype look correct

The new tp4, tp4ep2, tp4ep4 cases and CUDA graph toggle align with TestLlama4ScoutInstruct parameter IDs.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
30-33: Consolidated imports from test conftest are appropriate

Importing get_device_count/get_device_memory/parametrize_with_ids here keeps resource gating and parametrization consistent with the shared test utilities.

514-517: Expanding to tp4/ep2/ep4 improves coverage

Adding 4-GPU expert-parallel combinations to Maverick auto_dtype meaningfully extends the matrix without touching runtime logic.
ce2a074 to ddd3e69 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (4)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
519-523: Fix memory gating: current condition over-skips 4-GPU runs and under-protects 8-GPU runs

The check get_device_memory() < 270000 and get_device_count() < 8 will skip all <8-GPU jobs (even valid tp4 cases) and won't guard 8-GPU low-memory runs. Gate per configuration and check device-count alignment first.

Apply:

-    if get_device_memory() < 270000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    # 1) Ensure exact world-size match for this config
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    # 2) Per-config memory threshold
+    required_mem = 270000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory: require >= {required_mem}, got {get_device_memory()}"
+        )
558-558: Replace non-registered marker skip_device_memory with repo-standard skip_less_device_memory

@pytest.mark.skip_device_memory isn't a standard marker in this repo and likely won't take effect. Use skip_less_device_memory for consistency.

Apply:

-    @pytest.mark.skip_device_memory(80000)
+    @pytest.mark.skip_less_device_memory(80000)

You can verify marker usage/registration with:

#!/bin/bash
set -euo pipefail
rg -n --no-ignore -C2 '@pytest\.mark\.skip_device_memory\('
rg -n --no-ignore -C2 '@pytest\.mark\.skip_less_device_memory\('
rg -n --no-ignore -C2 'skip_device_memory' pytest.ini setup.cfg tox.ini **/conftest.py || true
rg -n --no-ignore -C2 'skip_less_device_memory' pytest.ini setup.cfg tox.ini **/conftest.py || true
565-569: Fix FP8 memory gating: avoid blanket skip and guard 8-GPU runs

Same gating issue as auto_dtype. Check world-size first, then a tp_size-based memory threshold.

Apply:

-    if get_device_memory() < 140000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    required_mem = 140000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory: require >= {required_mem}, got {get_device_memory()}"
+        )
641-641: Replace class-level skip_device_memory marker

Same marker issue as above; please switch to skip_less_device_memory or register the custom marker.

Apply:

-@pytest.mark.skip_device_memory(80000)
+@pytest.mark.skip_less_device_memory(80000)
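If the custom marker were kept rather than renamed, registering it and enforcing the skip could look roughly like the sketch below. This is a generic pytest pattern with placeholder names, not the repository's actual conftest plumbing for skip_less_device_memory.

```python
# Hypothetical conftest.py sketch for registering and enforcing a memory marker.
# Names and the memory query are placeholders; the repo's real marker machinery
# may be implemented differently.
import pytest
import torch


def _device_memory_mib() -> int:
    """Total memory of device 0 in MiB, or 0 when no GPU is visible."""
    if torch.cuda.device_count() == 0:
        return 0
    return torch.cuda.get_device_properties(0).total_memory // (1024 * 1024)


def pytest_configure(config):
    # Register the marker so `pytest --strict-markers` accepts it.
    config.addinivalue_line(
        "markers",
        "skip_less_device_memory(mib): skip when device memory is below `mib` MiB")


def pytest_collection_modifyitems(config, items):
    available = _device_memory_mib()
    for item in items:
        marker = item.get_closest_marker("skip_less_device_memory")
        if marker is not None and available < marker.args[0]:
            item.add_marker(
                pytest.mark.skip(
                    reason=f"Requires >= {marker.args[0]} MiB device memory"))
```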
🧹 Nitpick comments (3)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
652-654: Consider adding per-config memory gating (or rely on a fixed class-level marker)

Device-count check is correct. Once the class-level marker is fixed to skip_less_device_memory, memory pressure is mitigated. If CI still OOMs, gate per tp_size as done for Maverick.

726-736: Align chunked prefill to CudaGraphConfig (avoid deprecated use_cuda_graph)

The rest of this file migrated to cuda_graph_config. Update this remaining usage.

Apply:

-    use_cuda_graph=cuda_graph) as llm:
+    cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:

747-757: Align FP4 chunked prefill to CudaGraphConfig

Mirror the change suggested for FP8 chunked prefill for consistency and to avoid deprecated flags.

Apply:

-    use_cuda_graph=cuda_graph) as llm:
+    cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
- tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/defs/accuracy/references/mmlu.yaml
- tests/integration/test_lists/qa/llm_function_sanity.txt
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_full.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (8)
30-33: LGTM: imports expanded for resource-aware gating and utilities

Bringing in get_device_count/get_device_memory aligns with the new per-test gating logic and broader parametrization helpers. No issues spotted.

512-518: LGTM: expanded TP/EP matrix for Llama-4 Maverick

The added parameterization covers the tp4 ep2/ep4 cases and keeps IDs consistent with existing conventions.

524-533: LGTM: inject KvCacheConfig to control memory footprint

Passing KvCacheConfig(free_gpu_memory_fraction=0.8) is sensible to reduce OOM risk during warmup for these large-context tests.

561-564: LGTM: expanded TP/EP matrix for Maverick FP8

Matrix mirrors auto_dtype coverage; IDs remain consistent.

577-579: LGTM: switched to CudaGraphConfig

Using cuda_graph_config=CudaGraphConfig() iff enabled matches the newer API surface and avoids deprecated flags.

601-603: LGTM: chunked prefill uses CudaGraphConfig toggling

Consistent with the rest of the file.

647-651: LGTM: Scout auto_dtype matrix adds tp4 variants

IDs mirror Maverick's naming; consistent parameterization.

687-689: LGTM: Scout FP8 uses consistent kv_cache_config/cuda_graph_config

Asserting QuantAlgo.FP8 and enabling graphs conditionally is consistent with the rest of the suite.
tests/integration/test_lists/qa/llm_function_full.txt (3)
464-466: LGTM: add tp4 variants for Llama4 Maverick auto_dtype

Entries match the new parametrization IDs ([tp4], [tp4ep2], [tp4ep4]) and combine with the cuda_graph toggle as expected.

472-474: LGTM: add tp4 variants for Llama4 Maverick FP8

These align with the added test matrix and ID naming in the test file.

482-484: LGTM: add tp4 variants for Llama4 Scout auto_dtype

IDs mirror the test parameterization; inclusion here ensures QA coverage for the new matrix.
68026e9 to c0d099c (Compare)
Actionable comments posted: 0
🔭 Outside diff range comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
734-735: Deprecated arg use_cuda_graph — switch to cuda_graph_config

The LLM API has moved to cuda_graph_config; keep Scout FP8 chunked_prefill aligned with other tests.

-    use_cuda_graph=cuda_graph) as llm:
+    cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:

756-756: Deprecated arg use_cuda_graph — switch to cuda_graph_config (FP4)

Mirror the FP8 fix for the FP4 chunked_prefill test.

-    use_cuda_graph=cuda_graph) as llm:
+    cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
♻️ Duplicate comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
519-523: Fix memory gating: avoid over-skipping <8-GPU runs and under-guarding 8-GPU runs

This global check can skip valid 4-GPU configs and won't protect low-memory 8-GPU runs. Gate by world-size first, then use a per-tp_size memory threshold.

-    if get_device_memory() < 270000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    # Ensure exact world size
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    # Per-config device memory threshold (MiB)
+    required_mem = 270000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory: require >= {required_mem}, got {get_device_memory()}"
+        )

565-569: Refactor FP8 gating to per-tp_size thresholds

Same gating issue as auto_dtype: the "< 8 GPUs" clause can misgate runs. Check world size first, then apply tp-aware memory thresholds.

-    if get_device_memory() < 140000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    required_mem = 140000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory: require >= {required_mem}, got {get_device_memory()}"
+        )
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
18-18: Unify imports from conftest for consistency

Use a single relative import from ..conftest for all helpers to avoid mixed absolute/relative imports.

-from defs.conftest import get_sm_version
+from ..conftest import get_sm_version
@@
-from ..conftest import (get_device_count, get_device_memory, llm_models_root,
-                        parametrize_with_ids, skip_no_hopper,
-                        skip_post_blackwell, skip_pre_ada, skip_pre_blackwell,
-                        skip_pre_hopper)
+from ..conftest import (
+    get_device_count,
+    get_device_memory,
+    llm_models_root,
+    parametrize_with_ids,
+    skip_no_hopper,
+    skip_post_blackwell,
+    skip_pre_ada,
+    skip_pre_blackwell,
+    skip_pre_hopper,
+    get_sm_version,
+)

Also applies to: 30-33
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
- tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
- tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/defs/accuracy/references/gsm8k.yaml
- tests/integration/defs/accuracy/references/mmlu.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_sanity.txt
tests/integration/test_lists/qa/llm_function_full.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (20)
tests/integration/test_lists/qa/llm_function_full.txt (3)
464-466: Maverick tp4 auto_dtype additions — LGTM

tp4/tp4ep2/tp4ep4 variants expand coverage appropriately and match the new parameterization in the test file. No issues spotted.

472-474: Maverick tp4 FP8 additions — LGTM

Adding tp4 permutations for FP8 keeps parity with tp8 entries and aligns with CI gating in the tests.

482-484: Scout tp4 auto_dtype additions — LGTM

tp4 variants are consistent with the test implementation's parameter IDs. Looks good.
tests/integration/test_lists/qa/llm_function_sanity.txt (4)
74-76: Maverick tp4 auto_dtype in sanity — LGTM

Good to have light-weight tp4 sanity coverage mirroring the full list.

79-86: Maverick FP8 entries in sanity — LGTM; confirm CI capacity

These FP8 permutations mirror the full list. Ensure the CI stage gating/device filters select suitable GPU SKUs for these cases to avoid timeouts.

92-94: Scout tp4 auto_dtype in sanity — LGTM

Parameter IDs align with the test definitions; no discrepancies found.

99-100: Scout chunked_prefill sanity entries: ensure CUDA graph API alignment

These tests will exercise CUDA graph enabled paths. The implementation currently passes use_cuda_graph rather than cuda_graph_config in the chunked_prefill tests. Please update the test code to use cuda_graph_config to avoid divergence with the rest of the suite.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (13)
512-518: Maverick tp8/tp4(+EP) matrix expansion — LGTM

Broadened matrix is coherent and matches the IDs used in the test lists.

524-533: KV cache budget wired in — LGTM

Passing KvCacheConfig(free_gpu_memory_fraction=0.8) helps avoid warmup OOM. Looks correct.

558-563: Added explicit device-memory skip for FP8 — LGTM

Marker-based gating is consistent with repo conventions. Good addition.

577-578: CUDA graph toggling via config — LGTM

Correct migration to cuda_graph_config for FP8 path.

601-603: FP8 chunked_prefill uses cuda_graph_config — LGTM

Matches the new API; consistent with other tests.

641-641: Class-level device-memory guard on Scout — LGTM

Using skip_less_device_memory is consistent with the rest of the suite.

647-650: Scout tp8/tp4(+EP) matrix expansion — LGTM

IDs align with the sanity/full lists.

652-654: World-size check — LGTM

Simple and effective. Keeps runs from misconfiguring MPI world size.

675-677: World-size check in FP8 — LGTM

Consistent with auto_dtype; good.

686-688: KV cache + CUDA graph config in FP8 — LGTM

Right knobs for memory and CG control. No issues.

700-702: World-size check in FP4 — LGTM

Matches FP8 gating; OK.

711-713: FP4 CUDA graph usage — LGTM

Using cuda_graph_config toggling is correct.

743-744: MPI world-size skip for FP4 chunked_prefill — LGTM

Appropriate gating for 4-GPU configuration.
c0d099c to b84884d (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (5)
tests/integration/test_lists/qa/llm_function_sanity.txt (1)
99-100: Switch chunked_prefill to pass cuda_graph_config instead of use_cuda_graph in tests

These entries are fine, but the corresponding tests in tests/integration/defs/accuracy/test_llm_api_pytorch.py still pass the deprecated use_cuda_graph flag for chunked_prefill (Scout FP8/FP4). Please update the test code to pass cuda_graph_config=CudaGraphConfig() when cuda_graph=True to match the rest of the suite.

Apply the diff suggested in the test file comments for the two occurrences.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
519-522: Refine gating: order checks and gate by tp_size-specific memory thresholds

Current check can behave unexpectedly across environments (e.g., it never skips 8-GPU runs even if total device memory is insufficient). Prefer verifying device-count alignment first, then gating memory per tp_size against total device memory returned by get_device_memory().

Apply this diff:

-    if get_device_memory() < 270000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    # Ensure exact world size for this parameter set
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    # Gate by total device memory heuristics per tp_size
+    required_mem = 270000 if tp_size >= 8 else 140000  # MiB (total across GPUs)
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough total GPU memory: require >= {required_mem} MiB, "
+            f"got {get_device_memory()} MiB"
+        )

558-566: Remove redundant runtime memory check; rely on decorators + world-size check

This test already has @pytest.mark.skip_less_device_memory(80000). Keep the device-count check and drop the extra runtime memory condition to avoid double/uneven gating logic.

Apply this diff:

-    if get_device_memory() < 140000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")

725-735: Replace deprecated use_cuda_graph with cuda_graph_config in FP8 chunked_prefill

These calls still pass use_cuda_graph. Update to the new CudaGraphConfig API like the other tests in this file to avoid divergence and potential deprecation issues.

Apply this diff:

-    with LLM(
+    with LLM(
             f"{llm_models_root()}/llama4-models/Llama-4-Scout-17B-16E-Instruct-FP8",
             tensor_parallel_size=tp_size,
             max_seq_len=22000,
             pipeline_parallel_size=pp_size,
             moe_expert_parallel_size=ep_size,
             enable_chunked_prefill=True,
             max_num_tokens=256,
-            use_cuda_graph=cuda_graph) as llm:
+            cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:

748-756: Replace deprecated use_cuda_graph with cuda_graph_config in FP4 chunked_prefill

Same issue as FP8 chunked_prefill: switch to cuda_graph_config.

Apply this diff:

-    with LLM(
+    with LLM(
             f"{llm_models_root()}/llama4-models/Llama-4-Scout-17B-16E-Instruct-FP4",
             tensor_parallel_size=tp_size,
             pipeline_parallel_size=pp_size,
             moe_expert_parallel_size=ep_size,
             max_seq_len=22000,
             enable_chunked_prefill=True,
             max_num_tokens=256,
-            use_cuda_graph=cuda_graph) as llm:
+            cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
🧹 Nitpick comments (1)
tests/integration/test_lists/qa/llm_function_full.txt (1)
482-484: Commented-out Scout tp4 auto_dtype in "full" list — confirm intent

Unlike the sanity list, tp4-based Scout auto_dtype entries are commented out here. If the omission is intentional for resource gating, consider adding a brief comment at the top of this file clarifying why some entries are elided in the "full" list. Otherwise, enable them to keep the full matrix consistent.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
- tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (2 hunks)
- tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/defs/accuracy/references/mmlu.yaml
- tests/integration/defs/accuracy/references/gsm8k.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_sanity.txt
tests/integration/test_lists/qa/llm_function_full.txt
🧬 Code Graph Analysis (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
tensorrt_llm/llmapi/utils.py (1)
- get_device_count (83-84)

tests/integration/defs/conftest.py (4)
- get_device_count (1941-1943)
- get_device_memory (1946-1977)
- llm_models_root (77-83)
- parametrize_with_ids (1786-1811)

tensorrt_llm/llmapi/llm_args.py (2)
- KvCacheConfig (929-1024)
- CudaGraphConfig (106-163)

tensorrt_llm/llmapi/llm.py (1)
- LLM (1079-1095)
🔇 Additional comments (14)
tests/integration/test_lists/qa/llm_function_sanity.txt (2)
74-86: Maverick tp4/tp4ep{2,4} additions align with new parameterization

The added entries match the expanded permutations in TestLlama4MaverickInstruct (tp4, tp4ep2, tp4ep4) and reflect the cuda_graph variants present in the tests.

92-94: Scout tp4-based auto_dtype entries look consistent

These entries mirror the new tp4-based permutations for TestLlama4ScoutInstruct::test_auto_dtype in the test file.
tests/integration/test_lists/qa/llm_function_full.txt (3)
464-466: Add tp4-based permutations for Maverick auto_dtype — good

These entries are consistent with the new parameterization in the test code.

472-474: Add tp4-based permutations for Maverick FP8 — good

The new FP8 tp4/tp4ep{2,4} entries align with tests and improve coverage.

493-493: DeepSeekV3Lite bfloat16 entry updated — OK

The updated test id reflects current parameters (no stray enable_chunked_prefill here). Looks synchronized with the test definition.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (9)
30-33: Importing device/memory utilities and helpers — OK

Using get_device_count/get_device_memory with parametrize_with_ids and skip markers aligns with the new gating approach.

512-517: Expanded Maverick tp4/tp4ep{2,4} parameterization — OK

Accurately reflects added configurations; ids are clear and consistent.

524-533: KvCacheConfig usage in Maverick auto_dtype — good

Passing KvCacheConfig(free_gpu_memory_fraction=0.8) is consistent with the rest of the suite and helps avoid warmup OOMs.

577-578: cuda_graph_config conditional — OK

This mirrors the new API usage and matches patterns elsewhere in the file.

641-651: Class-level device-memory gating for Scout — OK

Good use of @skip_less_device_memory with clear tp/ep permutations.

652-654: World-size check for Scout auto_dtype — OK

Consistent with other tests' gating approach.

686-689: KvCacheConfig in Scout FP8 — sensible default

Using free_gpu_memory_fraction=0.8 is consistent with similar tests; ensures headroom.

701-702: World-size check for Scout FP4 — OK

Keeps parameterized runs aligned with available devices.

743-744: MPI world-size gating for FP4 chunked_prefill — OK

Appropriate guard for tp4ep4 config.
b84884d to e2dc37a (Compare)
Actionable comments posted: 2
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
- tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (11 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (2 hunks)
- tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/integration/defs/accuracy/references/gsm8k.yaml
- tests/integration/test_lists/qa/llm_function_sanity.txt
- tests/integration/defs/accuracy/references/mmlu.yaml
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_full.txt
🔇 Additional comments (2)
tests/integration/test_lists/qa/llm_function_full.txt (2)
464-466: Llama-4 Maverick tp4 auto_dtype additions verified

- tests/integration/defs/accuracy/test_llm_api_pytorch.py: TestLlama4MaverickInstruct::test_auto_dtype defines the new TP4 variants (tp4, tp4ep2, tp4ep4) alongside existing TP8 patterns.

482-484: Confirmed tp4 auto_dtype permutations are present in llm_function_full.txt

All six tensor-parallel variants—tp8, tp8ep4, tp8ep8 and tp4, tp4ep2, tp4ep4—with their respective cuda_graph flags appear in tests/integration/test_lists/qa/llm_function_full.txt (lines 479–484), matching the parametrization in TestLlama4ScoutInstruct. No further changes needed.
e2dc37a to a5a29d2 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
519-523: Global memory gate can skip valid 4-GPU runs and fail to protect low-mem 8-GPU runs

As previously noted, the condition get_device_memory() < 270000 and get_device_count() < 8 over-skips <8-GPU runs and under-protects 8-GPU runs. Gate per configuration based on tp_size and check device-count alignment first.

Apply:

-    if get_device_memory() < 270000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    # Gate by per-config resources to avoid over/under-skipping.
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    required_mem = 270000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory: require >= {required_mem} MiB per device, "
+            f"got {get_device_memory()}"
+        )

565-569: Same gating issue as auto_dtype — use tp_size-based memory threshold

Order checks: first device-count alignment, then per-tp memory threshold. This avoids over-skipping tp4 and protects low-mem tp8.

Apply:

-    if get_device_memory() < 140000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    required_mem = 140000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory: require >= {required_mem} MiB per device, "
+            f"got {get_device_memory()}"
+        )
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
577-578: Conditional CUDA Graph config — nit: consider DRY-ing

The pattern CudaGraphConfig() if cuda_graph else None repeats in many tests. Minor: assign a local cg = CudaGraphConfig() if cuda_graph else None to reduce duplication and ease future tuning.
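One possible shape for that refactor is sketched below; the helper name is hypothetical, and whether to use a local variable, a helper, or a fixture is a matter of taste.

```python
# Hypothetical helper to DRY up the repeated conditional; the name is illustrative.
from typing import Optional

from tensorrt_llm.llmapi.llm_args import CudaGraphConfig


def make_cuda_graph_config(cuda_graph: bool) -> Optional[CudaGraphConfig]:
    """Return a default CudaGraphConfig when enabled, otherwise None."""
    return CudaGraphConfig() if cuda_graph else None


# Usage inside a test body:
#   cuda_graph_config = make_cuda_graph_config(cuda_graph)
#   with LLM(..., cuda_graph_config=cuda_graph_config) as llm:
#       ...
```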
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
- tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (11 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
- tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
- tests/integration/test_lists/waives.txt (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/integration/test_lists/waives.txt
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/defs/accuracy/references/mmlu.yaml
- tests/integration/test_lists/qa/llm_function_sanity.txt
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_full.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (18)
tests/integration/defs/accuracy/references/gsm8k.yaml (1)
24-26: FP8 GSM8K baseline added for Maverick — MMLU counterpart confirmed

The new FP8 + FP8 KV entry for meta-llama/Llama-4-Maverick-17B-128E-Instruct also exists in the MMLU references:

- tests/integration/defs/accuracy/references/mmlu.yaml: lines 74–75 and 77–79

LGTM.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (14)
30-33: Importing device gating utilities — good addition. Pulling in get_device_count/get_device_memory and common markers centralizes resource gating. No issues.
514-517: Expanded Maverick permutations (tp8/tp4 with EP variants) — good coverage. The added parameter space aligns with the PR goal (tp4). Test IDs are clear and consistent.
524-533: KvCacheConfig injection for Maverick auto_dtype — good. Setting free_gpu_memory_fraction=0.8 is sensible to reduce OOM risk at warmup and aligns with surrounding tests; a minimal sketch of the pattern follows below.
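For context, here is a minimal sketch of the LLM construction pattern these comments refer to. It assumes KvCacheConfig and CudaGraphConfig are importable from tensorrt_llm.llmapi, as they are in the test module; model_path, tp_size, ep_size and the helper name are placeholders, not code from the PR.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig, KvCacheConfig

def build_llm(model_path, tp_size, ep_size, cuda_graph):
    # Cap KV-cache allocation at 80% of free GPU memory to leave headroom
    # for warmup and prefill, mirroring the reviewed tests.
    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.8)
    return LLM(
        model_path,
        tensor_parallel_size=tp_size,
        moe_expert_parallel_size=ep_size,
        kv_cache_config=kv_cache_config,
        # Only build CUDA graphs when the parametrization asks for them.
        cuda_graph_config=CudaGraphConfig() if cuda_graph else None,
    )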
558-558: Standardized memory marker — good. Using @pytest.mark.skip_less_device_memory(80000) matches repo conventions and ensures the skip is honored.
561-563: Expanded FP8 permutations for Maverick — good. Covering both tp8 and tp4 with EP variants increases confidence in FP8 coverage.
601-602: Chunked prefill FP8 (Maverick) — conditional CUDA Graph looks correct. Matches the updated API surface (cuda_graph_config) and other tests.
641-643: Class-level device + host memory gating for Scout — good. These broad gates help keep per-test logic lean. The per-test device-count checks complement them well.
648-651: Scout: tp8/tp4 with EP variants — good parity with Maverick. Mirrors Maverick coverage, which is helpful for cross-model comparisons.
653-655: Device-count alignment gate — good. Checking get_device_count() == tp_size * pp_size up front prevents misconfigured runs; a sketch of such a guard follows below.
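A minimal sketch of such a resource gate, assuming the defs.conftest helpers cited above are importable from the test; the helper name and the 80000 MiB fallback threshold are illustrative assumptions, not values taken from the PR.
import pytest
from defs.conftest import get_device_count, get_device_memory

def skip_if_insufficient_resources(tp_size, pp_size, required_mem_mib=80000):
    """Skip the current test when the runner cannot host tp_size * pp_size ranks."""
    # World-size alignment: one GPU per rank.
    if get_device_count() != tp_size * pp_size:
        pytest.skip("Device count mismatch with world size")
    # Per-GPU memory floor (MiB); tune per configuration.
    if get_device_memory() < required_mem_mib:
        pytest.skip(f"Need >= {required_mem_mib} MiB per GPU, got {get_device_memory()} MiB")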
676-678: Device-count alignment for Scout FP8 — good. Consistent gating across tests; keeps CI predictable.
687-687: Scout FP8: KvCacheConfig added — good. Using free_gpu_memory_fraction=0.8 matches Maverick and reduces OOM risk.
735-736: Scout FP8 chunked prefill: conditional CUDA Graph — good. Matches the updated CUDA graph configuration pattern.
745-746: World-size gate for Scout FP4 chunked prefill — good. skip_less_mpi_world_size(4) is appropriate for tp4-only parametrization here.
758-759: Scout FP4 chunked prefill: conditional CUDA Graph — good. Consistent with other tests.
tests/integration/test_lists/qa/llm_function_full.txt (3)
470-472: QA list: added Maverick tp4 auto_dtype entries — aligned with defs. IDs and cuda_graph flags match the parameterization in TestLlama4MaverickInstruct::test_auto_dtype.
478-480: QA list: added Maverick tp4 FP8 entries — aligned with defs. Reflects new FP8 permutations for tp4; good to keep the QA list in sync.
488-490: QA list: added Scout tp4 auto_dtype entries — aligned with defs. Parallels Maverick's additions; consistent IDs and graph flags. Example nodeids are shown below.
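For reference, the QA list entries discussed here are plain pytest nodeids, one per line. The following are illustrative examples of the tp4 variants, not an exhaustive list of what the PR adds:
accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_auto_dtype[tp4-cuda_graph=True]
accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_auto_dtype[tp4-cuda_graph=False]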
a5a29d2 to 85079ec (Compare)
/bot run
PR_Github #15747 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
512-523: Memory/device gating can skip valid 4-GPU runs and miss low-mem 8-GPU OOMs. The condition get_device_memory() < 270000 and get_device_count() < 8 skips all <8-GPU runners regardless of per-config requirements, and never guards 8-GPU nodes with insufficient memory. This risks OOMs and over-skipping. Refactor to:
- Check device-count alignment first.
- Gate memory per tp_size (heuristic thresholds; tune as needed).
Apply:
@skip_pre_blackwell
@parametrize_with_ids("cuda_graph", [False, True])
@pytest.mark.parametrize(
    "tp_size,pp_size,ep_size",
    [(8, 1, 1), (8, 1, 4), (8, 1, 8), (4, 1, 1), (4, 1, 2), (4, 1, 4)],
    ids=["tp8", "tp8ep4", "tp8ep8", "tp4", "tp4ep2", "tp4ep4"])
def test_auto_dtype(self, cuda_graph, tp_size, pp_size, ep_size):
-    if get_device_memory() < 270000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    # Ensure world-size alignment first
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    # Per-config memory threshold (heuristic)
+    required_mem = 270000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory: require >= {required_mem} MiB, got {get_device_memory()} MiB"
+        )
     kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.8)
558-569: Same gating issue in FP8 — fix ordering and thresholds to avoid OOM/over-skips. Mirror the auto_dtype fix: first check device-count alignment, then apply a tp_size-based memory threshold. The current ... and get_device_count() < 8 leaves 8-GPU low-memory nodes unguarded and can over-skip 4-GPU configs. Apply:
@skip_pre_hopper
@pytest.mark.skip_less_device_memory(80000)
@parametrize_with_ids("cuda_graph", [False, True])
@pytest.mark.parametrize(
    "tp_size,pp_size,ep_size",
    [(8, 1, 1), (8, 1, 4), (8, 1, 8), (4, 1, 1), (4, 1, 2), (4, 1, 4)],
    ids=["tp8", "tp8ep4", "tp8ep8", "tp4", "tp4ep2", "tp4ep4"])
def test_fp8(self, cuda_graph, tp_size, pp_size, ep_size):
-    if get_device_memory() < 140000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    # World-size alignment first
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    # Memory threshold per tp_size (tune if needed)
+    required_mem = 140000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory: require >= {required_mem} MiB, got {get_device_memory()} MiB"
+        )
🧹 Nitpick comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
721-737: Consider adding a device-memory gate for chunked_prefill tp4ep4. max_seq_len=22000 with chunked_prefill and ep4 can be memory-intensive. Adding a device-memory skip reduces CI OOM risk on smaller nodes. Add this decorator above the test:
@pytest.mark.skip_less_device_memory(80000)
745-759: Same recommendation for FP4 chunked_prefill. Same rationale as FP8; add a device-memory guard for safety on memory-constrained runners. Add:
@pytest.mark.skip_less_device_memory(80000)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
- tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (11 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
- tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
- tests/integration/test_lists/waives.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/integration/defs/accuracy/references/mmlu.yaml
- tests/integration/test_lists/waives.txt
- tests/integration/defs/accuracy/references/gsm8k.yaml
- tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_sanity.txt
🧬 Code Graph Analysis (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
- tests/integration/defs/conftest.py (4): get_device_count (1941-1943), get_device_memory (1946-1977), llm_models_root (77-83), parametrize_with_ids (1786-1811)
- tensorrt_llm/llmapi/llm_args.py (2): KvCacheConfig (929-1024), CudaGraphConfig (106-163)
- tensorrt_llm/llmapi/llm.py (1): LLM (1079-1095)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/test_lists/qa/llm_function_sanity.txt (2)
74-86: Nodeid Formatting Verified — No Updates Required. The decorator order in TestLlama4MaverickInstruct (@pytest.mark.parametrize for tp_size/pp_size/ep_size closest to the function, then @parametrize_with_ids for cuda_graph) produces nodeids in the form [tpX-cuda_graph=Y]. That matches the entries in llm_function_sanity.txt, so no changes are needed.
92-100: Verify Scout permutations & chunked_prefill nodeid order and cuda_graph_config usage
- File tests/integration/test_lists/qa/llm_function_sanity.txt (lines 92–100): confirm that each entry for TestLlama4ScoutInstruct exactly matches the pytest-generated nodeid (including id text and order) for:
  • test_auto_dtype
  • test_fp8
  • test_fp4
  • test_fp8_chunked_prefill
  • test_fp4_chunked_prefill
- The underlying tests in tests/integration/defs/accuracy/test_llm_api_pytorch.py now all use cuda_graph_config (no deprecated use_cuda_graph remains).
- Run locally to inspect collection order, for example:
pytest -q --collect-only \
  tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_auto_dtype \
  | rg '::test_auto_dtype\['
pytest -q --collect-only \
  tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_fp8_chunked_prefill \
  | rg '::test_fp8_chunked_prefill\['
- If any id or ordering mismatches appear, update either the list entries or the decorator/argument order in the tests to keep them in sync.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)
30-33: Importing get_device_count/get_device_memory for gating — LGTM. Using centralized helpers improves consistency across tests.
524-535: KV cache fraction injection and cuda_graph_config control — LGTM. Introducing KvCacheConfig(free_gpu_memory_fraction=0.8) and switching to cuda_graph_config makes the new permutations more robust and consistent with the API.
577-578: CUDA graph arg migration — LGTM. Using cuda_graph_config=CudaGraphConfig() instead of the deprecated flag is correct and consistent with the rest of the suite. Also applies to: 601-602
641-643: Class-level resource gates for Scout — LGTM. skip_less_device_memory(80000) and skip_less_host_memory(100000) provide a useful coarse guard for the expanded matrix.
648-655: Scout auto_dtype permutations and device-count gate — LGTM. Param ids and the device-count alignment check look good for the new tp4/ep variants.
85079ec to a93d565 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
519-523: Fix memory gating: current condition over-skips TP4 and under-gates TP8. The check get_device_memory() < 270000 and get_device_count() < 8 will:
- Skip all <8-GPU runs (including valid TP4 configs), and
- Never skip 8-GPU runs even if memory is insufficient.
Gate by tp_size and check device-count alignment first. Suggested change:
-    if get_device_memory() < 270000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    # Ensure we have exactly tp_size × pp_size GPUs
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+
+    # Memory gate tuned per TP size (heuristics; adjust as needed)
+    required_mem = 270000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory: require ≥ {required_mem} MiB per GPU, "
+            f"got {get_device_memory()} MiB"
+        )
565-569: Fix FP8 memory gating: same issue as auto_dtype. Same over/under gating risk as the bf16 test. Apply tp_size-aware memory gating after verifying device count:
-    if get_device_memory() < 140000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    required_mem = 140000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory: require ≥ {required_mem} MiB per GPU, "
+            f"got {get_device_memory()} MiB"
+        )
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
745-746: Prefer device-based gating over MPI world-size marker for consistency. Elsewhere we've moved to device/memory gating; this test still uses skip_less_mpi_world_size(4). For consistency (and to decouple from MPI), consider switching to device-based gating:
- @pytest.mark.skip_less_mpi_world_size(4)
+ @pytest.mark.skip_less_device(4)
If you want to harden it further, add a top-of-test guard:
if get_device_count() != tp_size * pp_size:
    pytest.skip("Device count mismatch with world size")
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
- tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (11 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
- tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
- tests/integration/test_lists/waives.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/integration/test_lists/waives.txt
- tests/integration/defs/accuracy/references/mmlu.yaml
- tests/integration/defs/accuracy/references/gsm8k.yaml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_sanity.txt
tests/integration/test_lists/qa/llm_function_full.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (8)
30-33: LGTM: centralizing device/memory helpers. Importing get_device_count/get_device_memory from defs.conftest is consistent with how the file imports get_sm_version and aligns with the new per-test resource gating.
514-518: Good coverage expansion: add TP/EP permutations incl. TP4. The expanded parameterization (tp8/ep{1,4,8} and tp4/ep{1,2,4}) matches the PR intent (tp4 scenarios) and broadens coverage meaningfully.
524-524: KV cache config added — sensible defaults for stability. Using KvCacheConfig(free_gpu_memory_fraction=0.8) is a reasonable starting point to mitigate warmup OOMs in CI while keeping prefill room. No issues here. Also applies to: 532-533
558-564: LGTM: scoped skip marker and expanded permutations
- skip_less_device_memory(80000) on FP8 is aligned with other tests.
- Adding TP/EP permutations mirrors the auto_dtype expansion.
577-578: LGTM: switched to CudaGraphConfig consistently. Using cuda_graph_config=CudaGraphConfig() if cuda_graph else None is consistent across tests and avoids divergence from deprecated flags. Also applies to: 601-602, 735-737, 758-759
641-643: Class-level host/device memory gating for Scout: good guardrails. skip_less_device_memory(80000) and skip_less_host_memory(100000) at class scope keep individual tests simple and reduce repeated checks. Looks good.
649-651: Param expansion for Scout (TP8/TP4; EP variants) is appropriate. These mirror Maverick's matrix and deliver TP4 coverage per the PR goal.
653-655: Device-count alignment checks are correct. Validating get_device_count() == tp_size * pp_size is the right gate for these Scout tests, given class-level memory guards. Also applies to: 676-678, 701-703
tests/integration/test_lists/qa/llm_function_sanity.txt (2)
74-86: Sanity list updated with Maverick TP4 and FP8 variants — good. The added tp4-{cg=False, ep2/ep4 cg=True} variants and FP8 permutations reflect the new parameterizations. This keeps the smoke surface representative while not exploding runtime.
92-100: ✅ New Llama4 TP4 and chunked_prefill entries verified in llm_function_sanity.txt. The following entries were found:
• tests/integration/test_lists/qa/llm_function_sanity.txt:
– Line 74: accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_auto_dtype[tp4-cuda_graph=False]
– Line 84: accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_fp8[tp4-cuda_graph=True]
– Line 92: accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_auto_dtype[tp4-cuda_graph=False]
– Line 99: accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_fp8_chunked_prefill[tp4ep4-cuda_graph=True]
All entries align with the new TestLlama4ScoutInstruct coverage. No further updates needed.
PR_Github #15747 [ run ] completed with state
a93d565 to 4e8dd72 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
519-523: Fix memory gating: current AND condition skips valid 4-GPU runs and leaves 8-GPU runs ungated (risking OOM)
- get_device_memory() < 270000 and get_device_count() < 8 will never gate 8-GPU runs even when memory is too low.
- It can over-skip 4-GPU runs irrespective of their lower memory needs.
Order checks by device-count alignment first, then gate per tp_size.
Apply:
-    if get_device_memory() < 270000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    # Gate by per-config resources to avoid over/under-skipping.
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    required_mem = 270000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory for tp{tp_size}: "
+            f"require >= {required_mem} MiB, got {get_device_memory()} MiB"
+        )
565-569: Fix FP8 memory gating: same AND issue; gate per tp_size. As above, the current logic doesn't protect 8-GPU runs and can over-skip 4-GPU runs.
Apply:
-    if get_device_memory() < 140000 and get_device_count() < 8:
-        pytest.skip("Not enough memory for this test")
-    if get_device_count() != tp_size * pp_size:
-        pytest.skip("Device count mismatch with world size")
+    if get_device_count() != tp_size * pp_size:
+        pytest.skip("Device count mismatch with world size")
+    required_mem = 140000 if tp_size >= 8 else 80000
+    if get_device_memory() < required_mem:
+        pytest.skip(
+            f"Not enough device memory for tp{tp_size}: "
+            f"require >= {required_mem} MiB, got {get_device_memory()} MiB"
+        )
🧹 Nitpick comments (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
30-33: Use a single import style (relative vs package) to avoid confusion. This file mixes
from defs.conftest ...
(Line 18) withfrom ..conftest ...
(Lines 30–33). Prefer one style consistently. Since this module already uses relative imports, pullget_sm_version
into the same relative import and drop the package import at Line 18.Apply within this hunk:
-from ..conftest import (get_device_count, get_device_memory, llm_models_root, - parametrize_with_ids, skip_no_hopper, - skip_post_blackwell, skip_pre_ada, skip_pre_blackwell, - skip_pre_hopper) +from ..conftest import (get_device_count, get_device_memory, llm_models_root, + parametrize_with_ids, skip_no_hopper, + skip_post_blackwell, skip_pre_ada, skip_pre_blackwell, + skip_pre_hopper, get_sm_version)Then remove the earlier absolute import at Line 18:
# Remove this after adding get_sm_version to the relative import: # from defs.conftest import get_sm_version
712-713
: Consider adding kv_cache_config to Scout FP4 for parity and OOM headroomFP4 also benefits from capping KV allocation (FP8 KV is already asserted). Mirror the FP8 test by passing a KV cache config.
Apply:
moe_expert_parallel_size=ep_size, + kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8), cuda_graph_config=CudaGraphConfig() if cuda_graph else None) as llm:
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
- tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (11 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
- tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
- tests/integration/test_lists/waives.txt (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/integration/defs/accuracy/references/mmlu.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/integration/defs/accuracy/references/gsm8k.yaml
- tests/integration/test_lists/qa/llm_function_sanity.txt
- tests/integration/test_lists/waives.txt
- tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧬 Code Graph Analysis (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
- tests/integration/defs/conftest.py (4): get_device_count (1941-1943), get_device_memory (1946-1977), llm_models_root (77-83), parametrize_with_ids (1786-1811)
- tensorrt_llm/llmapi/llm_args.py (2): KvCacheConfig (929-1024), CudaGraphConfig (106-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (14)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (14)
514-517: LGTM: TP/EP matrix expansion for Llama-4 Maverick. The added parameterization covers tp8 and tp4 variants with EP scaling. Aligns with the PR objective to add tp4 tests.
524-534: LGTM: Reduce KV cache pressure. Passing
KvCacheConfig(free_gpu_memory_fraction=0.8)
and togglingcuda_graph_config
per case is a sensible way to reduce warmup/alloc OOMs and keeps the CG behavior explicit.
558-558: LGTM: Correct marker used. Switching to the repository-standard skip_less_device_memory(80000) makes the skip effective under pytest.
560-563: LGTM: FP8 matrix expansion. Added tp8 and tp4 with EP scaling; consistent with the Maverick coverage.
577-578: LGTM: CUDA Graph config toggle. Passing CudaGraphConfig() only when requested is clear and consistent with other tests.
641-643: LGTM: Class-level device and host memory guards for Scout. These thresholds help avoid node instability and match comments about system constraints.
648-651: LGTM: Scout TP/EP matrix expansion. Covers tp8/ep variants and tp4/ep variants. Looks good.
653-655: LGTM: Device-count alignment check. Early skip on world-size mismatch is correct for per-parameter gating.
676-678: LGTM: Device-count check for Scout FP8. Keeps the run constrained to the intended topology.
687-687: LGTM: Lower KV cache footprint for Scout FP8. Setting free_gpu_memory_fraction=0.8 mirrors Maverick and reduces OOM risk.
701-703: LGTM: Device-count gating for Scout FP4. Topology check is correct. One minor improvement below to align memory pressure handling with FP8.
735-736: LGTM: CUDA Graph config toggle (chunked prefill FP8). Consistent with the rest of the suite.
745-746: LGTM: MPI world-size gating retained for this specific case. Given chunked prefill sensitivity, the explicit skip_less_mpi_world_size(4) here is acceptable.
758-759: LGTM: CUDA Graph config toggle (chunked prefill FP4). Matches the FP8 pattern.
Signed-off-by: Xin He (SW-GPU) <[email protected]>
4e8dd72 to 05936d2 (Compare)
/bot run
PR_Github #15831 [ run ] triggered by Bot
PR_Github #15831 [ run ] completed with state
Summary by CodeRabbit
Tests
Documentation
Description
Test Coverage
https://prod.blsm.nvidia.com/swqa-tensorrt-qa-test/job/DEBUG_LLM_FUNCTION_CLUSTER_TEST/88/testReport/
https://prod.blsm.nvidia.com/swqa-tensorrt-qa-test/view/TRT-LLM-Function-Pipelines/job/DEBUG_LLM_FUNCTION_CLUSTER_TEST/89/testReport/
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
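For illustration only, a typical invocation combining a few of the flags documented below might look like this (the stage and GPU names are placeholders taken from the examples in this help text, not a recommendation):
/bot run --disable-fail-fast --gpu-type "H100_PCIe"
/bot run --stage-list "A10-PyTorch-1" --detailed-log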
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.