chore: Mass integration of release/0.21 (part 4) #6211
Conversation
Actionable comments posted: 2
🔭 Outside diff range comments (2)
tests/integration/test_lists/test-db/l0_a100.yml (1)

30-36: YAML indentation error – list items need 4-space indent

`tests:` is indented two spaces, so its list items must be indented four. Currently line 31 starts with only two spaces:

```yaml
  - unittest/trt/attention/test_sage_attention.py …
```

This breaks YAML parsing and will make the test-selector loader crash.

```diff
-  - unittest/trt/attention/test_sage_attention.py …
+    - unittest/trt/attention/test_sage_attention.py …
```

Please fix before merging to avoid CI failures.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

391-399: Critical: Remove duplicated code block

This code block is an exact duplicate of lines 381-389. While the AI summary mentions this duplication was added to prevent illegal memory access during KV cache writes in overlap scheduler warmup, having identical code blocks like this creates maintenance issues and suggests a logical error in the control flow.

The duplication should be removed and the logic restructured so that the tensor zeroing happens exactly once under the appropriate conditions. Consider consolidating the logic or adding proper conditional checks to avoid redundant execution (a minimal illustration follows after the diff below).

```diff
-if (not self._disable_overlap_scheduler
-        and next_draft_tokens_device is None
-        and len(extend_requests) > 0):
-    # During warmup, for those generation requests, we don't have previous tensors,
-    # so we need to set the previous_pos_id_offsets and previous_kv_lens_offsets to zeros
-    # to skip the value changes in _preprocess_inputs. Otherwise, there will be illegal memory access
-    # when writing key/values to the KV cache.
-    self.previous_pos_id_offsets_cuda *= 0
-    self.previous_kv_lens_offsets_cuda *= 0
```
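Purely as an illustration of the suggested consolidation (not TensorRT-LLM's actual code), here is a minimal, self-contained sketch; plain Python lists stand in for the CUDA tensors, and all names are hypothetical:

```python
# Illustrative only: zero the "previous" offset buffers exactly once via a single
# helper, instead of repeating the same guarded block twice in the forward path.
class OffsetsSketch:
    def __init__(self):
        self._disable_overlap_scheduler = False
        # Stand-ins for previous_pos_id_offsets_cuda / previous_kv_lens_offsets_cuda.
        self.previous_pos_id_offsets = [3, 1, 4]
        self.previous_kv_lens_offsets = [1, 5]

    def _zero_previous_offsets_if_needed(self, next_draft_tokens_device, extend_requests):
        # Same condition as the duplicated block: overlap scheduler enabled,
        # no draft tokens yet, and at least one extend (generation) request.
        if (not self._disable_overlap_scheduler
                and next_draft_tokens_device is None
                and len(extend_requests) > 0):
            self.previous_pos_id_offsets = [0] * len(self.previous_pos_id_offsets)
            self.previous_kv_lens_offsets = [0] * len(self.previous_kv_lens_offsets)


engine = OffsetsSketch()
engine._zero_previous_offsets_if_needed(next_draft_tokens_device=None, extend_requests=["req0"])
print(engine.previous_pos_id_offsets, engine.previous_kv_lens_offsets)  # [0, 0, 0] [0, 0]
```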
🧹 Nitpick comments (7)
tests/integration/defs/triton_server/test_triton.py (1)

511-511: Format long cmake command for better readability.

Good consistency with the build script changes. However, the line exceeds the 120-character limit and should be formatted for better readability.

Apply this formatting improvement:

```diff
-f"cmake .. -DTRTLLM_DIR={llm_root} -DCMAKE_INSTALL_PREFIX=install/ -DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 "
+f"cmake .. -DTRTLLM_DIR={llm_root} -DCMAKE_INSTALL_PREFIX=install/ "
+f"-DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON "
+f"-DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 "
+f"-DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 "
```
tests/unittest/llmapi/test_mpi_session.py (1)

63-72: Add `text=True` and remove redundant `universal_newlines`

`universal_newlines=True` is the legacy spelling of `text=True` (its alias since Python 3.7). Since you are already on 3.8+ you can simplify:

```diff
-    bufsize=1,
-    start_new_session=True,
-    universal_newlines=True,
+    bufsize=1,
+    start_new_session=True,
+    text=True,
```

Nit-level, but it brings the code in line with modern std-lib usage and is marginally clearer for newcomers. A hedged sketch of the resulting call follows below.
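For illustration only, here is a minimal sketch of what the resulting launch could look like, also folding in the safety timeout suggested for this file further below; the script path handling and the 120 s budget are assumptions, not the test's actual code:

```python
import os
import subprocess

# Hypothetical path resolution for the helper script shipped next to the test.
script = os.path.join(os.path.dirname(os.path.abspath(__file__)), "_test_remote_mpi_session.sh")

process = subprocess.Popen(
    ["bash", script],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    bufsize=1,
    start_new_session=True,
    text=True,  # modern alias for universal_newlines=True (Python 3.7+)
)
try:
    # Fail fast if the helper hangs before mpirun's own `timeout 60` can kick in.
    return_code = process.wait(timeout=120)
except subprocess.TimeoutExpired:
    process.kill()
    raise
print("exit code:", return_code)
```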
tests/unittest/llmapi/_test_remote_mpi_session.sh (1)

9-11: `--allow-run-as-root` may not exist in all MPI stacks

`mpirun --allow-run-as-root` works for OpenMPI/HPC-X, but other MPI impls (Intel-MPI, MPICH) error out on the unknown flag, failing the test even when it would otherwise pass. If portability matters, gate the flag:

```bash
if mpirun --help 2>&1 | grep -q -- '--allow-run-as-root'; then
  root_flag=--allow-run-as-root
fi
timeout 60 mpirun ${root_flag:+$root_flag} -np 2 …
```
tests/integration/test_lists/waives.txt (1)

86-86: Hard-to-maintain one-liner – split long skip list

The mega-line containing 11 separate paths (line 86) is nearly 300 chars. Breaking it into one test per line improves diff-ability and future merges.

Not urgent, but worth refactoring next time this file is touched.
docs/source/release-notes.md (3)

31-32: Capitalize “CUDA” for consistency with NVIDIA terminology

Spelling nit: “cuda” ➜ “CUDA”.

```diff
-  - Added Piecewise cuda graph support for MLA
+  - Added Piecewise CUDA graph support for MLA
```

34-35: Use the official “PyTorch” casing

```diff
-  - Integrated TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner
+  - Integrated TRT-LLM Gen FP8 block-scale MoE with PyTorch workflow kernel autotuner
```

43-44: Align tool name with existing docs (“trtllm-serve bench”)

The CLI utility is referred to elsewhere as `trtllm-serve bench`. Consider updating for clarity:

```diff
-  - Added no_kv_cache_reuse option and streaming support for trtllm serve bench
+  - Added no_kv_cache_reuse option and streaming support for `trtllm-serve bench`
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- docs/source/installation/linux.md (1 hunks)
- docs/source/performance/perf-overview.md (3 hunks)
- docs/source/quick-start-guide.md (2 hunks)
- docs/source/reference/support-matrix.md (2 hunks)
- docs/source/release-notes.md (1 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (10 hunks)
- tests/integration/defs/triton_server/test_triton.py (1 hunks)
- tests/integration/test_lists/test-db/l0_a100.yml (2 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1 hunks)
- tests/integration/test_lists/waives.txt (2 hunks)
- tests/unittest/_torch/test_fp8_per_tensor_scale_tllmg_gemm.py (1 hunks)
- tests/unittest/llmapi/_test_remote_mpi_session.sh (1 hunks)
- tests/unittest/llmapi/test_llm.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (0 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (1 hunks)
- tests/unittest/llmapi/test_mpi_session.py (1 hunks)
- triton_backend/ci/L0_backend_trtllm/custom_metrics_verification_tests.py (4 hunks)
- triton_backend/ci/L0_backend_trtllm/test.sh (1 hunks)
- triton_backend/inflight_batcher_llm/scripts/build.sh (1 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
🧰 Additional context used
🧠 Learnings (3)
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
docs/source/release-notes.md (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Shellcheck (0.10.0)
triton_backend/inflight_batcher_llm/scripts/build.sh
[warning] 55-55: Quote this to prevent word splitting.
(SC2046)
triton_backend/ci/L0_backend_trtllm/test.sh
[warning] 233-233: Quote this to prevent word splitting.
(SC2046)
🪛 Ruff (0.12.2)
tests/integration/defs/triton_server/test_triton.py
511-511: Line too long (245 > 120)
(E501)
🪛 markdownlint-cli2 (0.17.2)
docs/source/performance/perf-overview.md
62-62: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🪛 LanguageTool
docs/source/performance/perf-overview.md
[style] ~243-~243: ‘a majority of’ might be wordy. Consider a shorter alternative.
Context: ... - 2048 - 4096 - 8192 ``` In a majority of cases, we also use a higher KV cache pe...
(EN_WORDINESS_PREMIUM_A_MAJORITY_OF)
🔇 Additional comments (38)
tests/unittest/_torch/test_fp8_per_tensor_scale_tllmg_gemm.py (1)
103-103
: LGTM! Tolerance adjustment aligns with FP8 quantization improvements.The relaxation of tolerance from 1e-3 to 1e-2 for both
atol
andrtol
parameters brings consistency with the second test function (line 196) and aligns with the FP8 quantization enhancements introduced in TensorRT-LLM Release 0.21.0. This adjustment ensures robust validation under the updated numerical behaviors of FP8 block-scale GEMM operations.triton_backend/ci/L0_backend_trtllm/custom_metrics_verification_tests.py (3)
85-110
: LGTM on V1 batching strategy cleanup.The removal of
is_v1
parameters and V1-specific logic aligns well with the deprecation of static batching mentioned in the test script. The simplified interface is cleaner and the assertion on line 109 properly enforces the exclusion of V1-specific metrics.
112-138
: Simplified test method improves maintainability.The removal of V1-specific conditional logic makes the test method more straightforward while preserving all the core validation functionality.
139-166
: Test method updates are consistent with interface changes.All test methods properly use the simplified
_base_test
signature, and the focus on inflight batching (IFB) tests aligns with the deprecation of V1 batching strategy.tests/unittest/llmapi/test_mpi_session.py (1)
63-72: Consider a safety timeout for the subprocess

In the shell script you guard `mpirun` with `timeout 60`, but the Python side will happily block forever if that script itself hangs before spawning MPI (e.g., path typo). Add `timeout=` to `process.wait()` or switch to `subprocess.run(..., timeout=…)` to prevent indeterminate CI stalls.

```diff
-return_code = process.wait()
+# Fail fast after 120 s wall-clock to keep CI responsive
+return_code = process.wait(timeout=120)
```

docs/source/installation/linux.md (1)
35-35
: Good addition for legal transparency.Adding this warning about third-party software after the installation command is appropriate and helps users make informed decisions about licensing compliance.
tests/unittest/llmapi/test_llm_pytorch.py (1)
9-14
: Test coverage for capture-request-error is maintainedThe
_test_llm_capture_request_error
helper and its correspondingtest_llm_capture_request_error
are still defined and exercised in:
tests/unittest/llmapi/test_llm.py
tests/unittest/llmapi/test_llm_multi_gpu.py
No further changes needed here.
tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1)
18-18
: Appropriate timeout configuration for GB200 hardware.Adding the timeout designation for the
latency_trtllmgen
test variant is reasonable given the resource-intensive nature of accuracy tests on GB200 multi-node setups. The 180-second timeout aligns with similar tests in the configuration.tests/unittest/llmapi/test_llm_multi_gpu.py (1)
469-469: Verified removal of `pytorch_backend` from all call sites

All instances of `_test_llm_capture_request_error` now match the simplified signature—no remaining calls include the removed `pytorch_backend` parameter:

- tests/unittest/llmapi/test_llm_multi_gpu.py: `_test_llm_capture_request_error(tp_size=2)`
- tests/unittest/llmapi/test_llm.py:
  • Definition: `def _test_llm_capture_request_error(tp_size: int = 1):`
  • Call site: `_test_llm_capture_request_error(tp_size=1)`
docs/source/reference/support-matrix.md (2)
28-29
: Proper documentation of new Qwen3 model support.The addition of
Qwen3ForCausalLM
andQwen3MoeForCausalLM
entries with their respective HuggingFace examples is well-formatted and consistent with the existing documentation structure.
77-77
: Consistent update to TensorRT backend model list.Including Qwen3 in the TensorRT backend Qwen model link maintains consistency with the PyTorch backend additions and provides users with complete support information.
docs/source/quick-start-guide.md (3)
11-11
: LGTM: Important legal notice added.The addition of a license warning about third-party software is a good practice for legal compliance and user awareness.
19-19
: LGTM: Helpful Docker usage guidance.The recommendation to run Docker commands as root or with appropriate permissions will help users avoid common permission issues during setup.
97-97
: LGTM: Improved code block formatting.Adding the language specifier with file path (
bash:docs/source/quick-start-guide.md
) will improve syntax highlighting and documentation rendering.tensorrt_llm/llmapi/llm.py (2)
41-41
: LGTM: Appropriate import for P2P functionality.The import of
can_access_peer
is consistent with the AI summary indicating peer-to-peer access checks are being added to support enhanced model features.
563-563
: LGTM: Improved error message consistency.Changing from
max_seq_len
tobuild_config.max_seq_len
in the error message makes it consistent with the actual source of the value and improves debugging clarity.tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
41-41
: LGTM: Consistent P2P import pattern.The import of
can_access_peer
follows the same pattern as in the LLM file and supports the P2P access checking functionality.
606-606
: LGTM: Clear P2P support detection.Initializing
is_p2p_supported
withcan_access_peer(mapping)
provides a clear way to determine if peer-to-peer access is available for the current mapping configuration.
801-802
: LGTM: Enhanced fusion condition with P2P check.The addition of
self.is_p2p_supported
to thedo_finalize
condition is logical - the fused operations likely require peer-to-peer access to work correctly. This prevents attempting fusion when P2P isn't available, which could lead to performance issues or failures.tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
208-210
: LGTM: More specific dummy request filtering logicThe change to exclude
is_attention_dp_dummy
andis_cuda_graph_dummy
requests (in addition to the generalis_dummy
check) from appending last tokens is more precise and aligns with the specific types of dummy requests that should be filtered out during extend requests without previous tensors.tests/integration/defs/accuracy/test_llm_api_pytorch.py (10)
661-661
: LGTM: Memory fraction reduction for improved stability.The reduction from 0.9 to 0.75 aligns with the broader memory management improvements in TensorRT-LLM Release 0.21.0 and should help prevent OOM issues in CI environments.
702-702
: LGTM: Consistent memory optimization for multi-GPU scenarios.This change maintains consistency with the single-GPU test and helps ensure stable test execution across different GPU configurations.
742-742
: LGTM: Appropriate memory management for FP8 quantization tests.The reduced memory fraction is well-suited for FP8 quantized model testing where memory usage patterns may differ from standard configurations.
828-828
: LGTM: Memory optimization for CUDA graph padding scenarios.The reduced memory fraction appropriately accounts for additional overhead from CUDA graph padding operations.
853-853
: LGTM: Consistent memory management for multi-GPU CUDA graph scenarios.This change maintains consistency across similar test configurations and helps ensure stable execution in resource-intensive multi-GPU setups.
896-896
: LGTM: Appropriate memory allocation for multi-GPU FP8 testing.The conservative memory fraction is well-suited for intensive 4-GPU FP8 block scales testing scenarios.
995-995
: LGTM: Conservative memory allocation for expert parallel load balancing.The reduced memory fraction is appropriate for complex MoE scenarios with static expert parallel load balancing.
1088-1088
: LGTM: Appropriate memory management for NVFP4 quantization.The conservative memory allocation is well-suited for NVFP4 quantization testing, which may have unique memory usage patterns.
1140-1140
: LGTM: Consistent memory optimization for multi-GPU NVFP4 scenarios.This maintains consistency with other multi-GPU test configurations and ensures stable execution for complex NVFP4 quantization scenarios.
1197-1198
: LGTM: Appropriate memory management for no KV cache reuse scenarios.The conservative memory fraction is well-suited for tests with KV cache block reuse disabled, where memory usage patterns differ from standard configurations.
tests/unittest/llmapi/test_llm.py (2)
2093-2107
: LGTM: Test simplification consolidates error handling logic.The removal of the
pytorch_backend
conditional logic simplifies the test by always using theLLM
class with a consistentBuildConfig
. This change aligns with the broader consolidation effort mentioned in the AI summary, where backend-specific error capture tests were unified into a single implementation.The test logic remains correct:
- Uses a fixed
max_num_tokens=64
configuration- Tests with a prompt of 65 'A' tokens to exceed the limit
- Properly expects a
RequestError
exception
2109-2110
: LGTM: Wrapper function correctly updated.The
test_llm_capture_request_error
wrapper function has been appropriately updated to call_test_llm_capture_request_error
without thepytorch_backend
argument, consistent with the simplified function signature.docs/source/performance/perf-overview.md (6)
32-46
: LGTM: Improved table formatting and structure.The Llama 3.3 70B FP4 table has been reformatted with clearer column headers by explicitly labeling "TP Size" and removing the redundant "GPU" column. The throughput values have been updated and the table structure is more readable.
50-64
: LGTM: Expanded coverage with new TP size configurations.The Llama 3.1 405B FP4 table now includes TP sizes 4 and 8 with corresponding throughput data, providing more comprehensive performance metrics. The table formatting is consistent with the other tables.
69-71
: LGTM: Model list updated to reflect latest support.The FP8 models list has been appropriately updated to replace "Llama 3.1 70B FP8" with "Llama 3.3 70B FP8" and add the new "Llama 4 Maverick FP8" model, aligning with the performance tables below.
92-110
: LGTM: Comprehensive performance data for Llama 3.3 70B FP8.The table provides extensive throughput measurements across different TP sizes and GPU configurations (H200 and H100). The formatting is consistent and the data appears complete, though some cells are intentionally left empty where measurements may not be available.
128-143
: LGTM: New Llama 4 Maverick FP8 performance section.The addition of the Llama 4 Maverick FP8 performance table provides valuable benchmarking data for the new model. The table follows the established format and includes measurements for both H200 and H100 GPUs across various input/output sequence length combinations.
219-243
: LGTM: Clear documentation of benchmark methodology.The addition of the note specifying that v0.20 benchmark data was collected using the provided
llm_options.yml
file improves reproducibility. The YAML configuration is properly formatted and the explanation about KV cache percentage usage provides helpful context for users attempting to reproduce results.
Force-pushed from 2314239 to ed31c7d
/bot run --disable-fail-fast
PR_Github #12422 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (2)
triton_backend/inflight_batcher_llm/scripts/build.sh (1)

54-55: Fix shell quoting issue in cmake command.

The TODO comment provides good context for the temporary version specification. However, the command substitution should be quoted to prevent word splitting.

Apply this fix for the shell quoting issue:

```diff
-cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ${BUILD_TESTS_ARG} -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 ..
+cmake -DCMAKE_INSTALL_PREFIX:PATH="`pwd`"/install ${BUILD_TESTS_ARG} -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 ..
```

triton_backend/ci/L0_backend_trtllm/test.sh (1)

232-237: Approve simplified V1 deprecation handling, but fix shell quoting.

The approach of checking server logs for the deprecation warning is more efficient than running full test suites. This aligns well with the V1 cleanup in the metrics verification tests.

Apply this fix for the shell quoting issue:

```diff
- if [ `grep -c "Static batching type is deprecated" $SERVER_LOG` == "0" ]; then
+ if [ "$(grep -c "Static batching type is deprecated" "$SERVER_LOG")" == "0" ]; then
```
🧹 Nitpick comments (6)
tests/integration/defs/triton_server/test_triton.py (1)

511-511: Break long line for better readability.

The explicit Triton repository tags ensure version consistency, which aligns well with the build script changes. However, the line exceeds 120 characters.

Consider breaking the line for better readability:

```diff
-f"cmake .. -DTRTLLM_DIR={llm_root} -DCMAKE_INSTALL_PREFIX=install/ -DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 "
+f"cmake .. -DTRTLLM_DIR={llm_root} -DCMAKE_INSTALL_PREFIX=install/ -DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON "
+f"-DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 "
+f"-DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 "
```

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

391-400: Consider refactoring to eliminate code duplication

This block is identical to lines 381-390, creating code duplication. While the reinforced zeroing may be intentional for preventing illegal memory access during KV cache writes (as mentioned in the release notes), consider extracting this logic into a helper method or ensuring it only executes once under the same conditions.

```diff
+def _zero_previous_offsets_if_needed(self, next_draft_tokens_device, extend_requests):
+    if (not self._disable_overlap_scheduler
+            and next_draft_tokens_device is None
+            and len(extend_requests) > 0):
+        # During warmup, for those generation requests, we don't have previous tensors,
+        # so we need to set the previous_pos_id_offsets and previous_kv_lens_offsets to zeros
+        # to skip the value changes in _preprocess_inputs. Otherwise, there will be illegal memory access
+        # when writing key/values to the KV cache.
+        self.previous_pos_id_offsets_cuda *= 0
+        self.previous_kv_lens_offsets_cuda *= 0
```

Then call this method once at the appropriate location instead of having duplicate blocks.
docs/source/release-notes.md (4)

37-44: Unify list-item heading style for “Benchmark” section

`Benchmark:` is the only second-level list heading that keeps a trailing colon and is not bolded (compare with `**Model Support**`, `**Features**`). For consistency and cleaner Markdown rendering, drop the colon and bold the tag:

```diff
-  - Benchmark:
+  - **Benchmark**
```

81-82: Fix punctuation typo (“support.Refer”)

There is a missing space after the period.

```diff
-  - Added Qwen3 support.Refer to “Qwen3” section in `examples/models/core/qwen/README.md`.
+  - Added Qwen3 support. Refer to “Qwen3” section in `examples/models/core/qwen/README.md`.
```

54-55: Wrap `_AutoDeployLlmArgs` in back-ticks

The leading underscore triggers italic formatting in Markdown, making the identifier hard to read. Use code styling and tighten wording:

```diff
-  - Set _AutoDeployLlmArgs as primary config object
+  - Set `_AutoDeployLlmArgs` as the primary configuration object
```

17-18: Use code formatting for quantization mode names

Technical tokens such as `w4a8_mxfp4_fp8` and `fp8 rowwise quantization` are easier to spot when wrapped in back-ticks.

```diff
-  - Added support for w4a8_mxfp4_fp8 quantization
-  - Added support for fp8 rowwise quantization
+  - Added support for `w4a8_mxfp4_fp8` quantization
+  - Added support for `fp8` row-wise quantization
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- docs/source/installation/linux.md (1 hunks)
- docs/source/performance/perf-overview.md (3 hunks)
- docs/source/quick-start-guide.md (2 hunks)
- docs/source/reference/support-matrix.md (2 hunks)
- docs/source/release-notes.md (1 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (10 hunks)
- tests/integration/defs/triton_server/test_triton.py (1 hunks)
- tests/integration/test_lists/test-db/l0_a100.yml (2 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1 hunks)
- tests/integration/test_lists/waives.txt (2 hunks)
- tests/unittest/_torch/test_fp8_per_tensor_scale_tllmg_gemm.py (1 hunks)
- tests/unittest/llmapi/_test_remote_mpi_session.sh (1 hunks)
- tests/unittest/llmapi/test_llm.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (0 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (1 hunks)
- tests/unittest/llmapi/test_mpi_session.py (1 hunks)
- triton_backend/ci/L0_backend_trtllm/custom_metrics_verification_tests.py (4 hunks)
- triton_backend/ci/L0_backend_trtllm/test.sh (1 hunks)
- triton_backend/inflight_batcher_llm/scripts/build.sh (1 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
🧰 Additional context used
🧠 Learnings (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
docs/source/release-notes.md (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
KvCacheConfig
(764-843)
🪛 Ruff (0.12.2)
tests/integration/defs/triton_server/test_triton.py
511-511: Line too long (245 > 120)
(E501)
🪛 Shellcheck (0.10.0)
triton_backend/inflight_batcher_llm/scripts/build.sh
[warning] 55-55: Quote this to prevent word splitting.
(SC2046)
triton_backend/ci/L0_backend_trtllm/test.sh
[warning] 233-233: Quote this to prevent word splitting.
(SC2046)
🪛 markdownlint-cli2 (0.17.2)
docs/source/performance/perf-overview.md
66-66: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
67-67: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 LanguageTool
docs/source/performance/perf-overview.md
[style] ~243-~243: ‘a majority of’ might be wordy. Consider a shorter alternative.
Context: ... - 2048 - 4096 - 8192 ``` In a majority of cases, we also use a higher KV cache pe...
(EN_WORDINESS_PREMIUM_A_MAJORITY_OF)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (28)
triton_backend/ci/L0_backend_trtllm/custom_metrics_verification_tests.py (1)
85-165
: LGTM: Clean V1 deprecation removal with improved validation.

The systematic removal of V1-specific logic from metrics parsing and testing is well-executed and aligns with the broader V1 deprecation effort across the codebase. The addition of the explicit assertion `self.assertNotIn("v1_specific_metric", key)` at line 109 provides good defensive programming to ensure V1 metrics are properly rejected.

The method signature simplifications and consistent updates to all test methods maintain code clarity while removing deprecated functionality. A minimal illustration of this defensive-assertion pattern follows below.
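For illustration only, a tiny self-contained sketch of the defensive-assertion pattern being praised; the parser and metric names below are made up and are not the real verification test:

```python
import unittest


class MetricsKeySketch(unittest.TestCase):
    """Illustrative only: assert that no V1-specific keys survive metric parsing."""

    def _parse_metrics(self):
        # Stand-in for the real parser; returns metric-name -> value.
        return {"request_count": 10, "inflight_batcher_specific_metric": 3}

    def test_no_v1_metrics_present(self):
        for key in self._parse_metrics():
            # Fails loudly if a deprecated V1 metric ever reappears.
            self.assertNotIn("v1_specific_metric", key)


if __name__ == "__main__":
    unittest.main()
```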
tests/integration/test_lists/waives.txt (1)
86-86
: Re-enabling MPI session tests on Blackwell platforms.The removal of
unittest/llmapi/test_mpi_session.py
from the waive lists for both B200_PCIe and B200 platforms indicates that the MPI session test is now stable and ready to run on Blackwell hardware. This aligns with the coordinated changes in the test configuration files to actively include this test in the pre-merge testing pipeline.Also applies to: 158-158
tests/integration/test_lists/test-db/l0_a100.yml (2)
17-17
: Logical placement of MPI session test in pre-merge PyTorch pipeline.Adding
unittest/llmapi/test_mpi_session.py
to the pre-merge PyTorch backend test suite makes sense as MPI session functionality is foundational and should be validated early in the pipeline. The "generic tests" comment appropriately categorizes this as infrastructure testing.
31-31
: Consolidated test configuration for post-merge TensorRT backend.The restructuring to move MPI session tests to the PyTorch backend while keeping other unit tests in the TensorRT backend reflects a logical separation of concerns between PyTorch-specific MPI functionality and TensorRT-specific components.
tests/unittest/llmapi/_test_remote_mpi_session.sh (1)
10-10
: Security consideration: Adding root privileges to MPI execution.The
--allow-run-as-root
flag is necessary for MPI to run in containerized CI environments where tests may execute as root. While this introduces elevated privileges, it's appropriate for this test-specific script that runs in controlled CI environments.Please ensure this script is only used in test environments and not in production contexts where running MPI as root could pose security risks.
tests/unittest/llmapi/test_mpi_session.py (1)
71-71
: Improved subprocess reliability with explicit working directory.Setting
cwd=os.path.dirname(os.path.abspath(__file__))
ensures the shell script_test_remote_mpi_session.sh
can be located reliably regardless of the test execution context. This makes the test more robust and location-independent, which is essential for consistent CI execution.docs/source/installation/linux.md (1)
35-35
: Good addition for legal compliance and user awareness.The warning about third-party software downloads and license review is appropriately placed immediately after the installation command and provides important legal information to users.
tests/unittest/_torch/test_fp8_per_tensor_scale_tllmg_gemm.py (1)
103-103
: Appropriate tolerance adjustment for FP8 precision characteristics.

The relaxation from 1e-3 to 1e-2 tolerance is reasonable for FP8 quantization operations, which have inherent precision limitations compared to higher precision formats. This aligns with the FP8 enhancements mentioned in the release notes. The sketch below illustrates what the looser bound means in practice.
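To make the effect of the relaxed bound concrete, here is a small illustrative check using the usual `atol + rtol * |expected|` closeness rule (the convention NumPy/PyTorch-style allclose helpers follow); the numbers are made up:

```python
def within_tolerance(actual: float, expected: float, atol: float, rtol: float) -> bool:
    # Absolute slack plus slack proportional to the expected magnitude.
    return abs(actual - expected) <= atol + rtol * abs(expected)


expected, actual = 1.000, 1.004  # a ~0.4% FP8 rounding error, invented for illustration
print(within_tolerance(actual, expected, atol=1e-3, rtol=1e-3))  # False – old, tighter bound
print(within_tolerance(actual, expected, atol=1e-2, rtol=1e-2))  # True  – relaxed bound
```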
docs/source/reference/support-matrix.md (2)
28-29
: Proper documentation of new Qwen3 model support.The new model entries follow the existing format consistently, with appropriate HuggingFace examples and correct modality classification.
77-77
: Consistent reference update for Qwen3 support.The link update appropriately includes Qwen3 alongside the existing Qwen model references.
tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1)
18-18
: Appropriate test timeout configuration addition.The new test case entry follows the existing format and timeout value (180 seconds) is consistent with other entries. This properly manages execution expectations for the DeepSeekR1 NVFP4 multi-GPU test.
tests/unittest/llmapi/test_llm_multi_gpu.py (1)
469-469
: Good refactoring that simplifies the test interface.Removing the
pytorch_backend=False
parameter aligns with the helper function refactoring mentioned in the AI summary, where backend-specific logic was unified. This simplifies the test call while maintaining functionality.docs/source/quick-start-guide.md (3)
11-11
: Excellent addition for legal compliance and user awareness.The warning about third-party software downloads and license review is a valuable addition that enhances transparency and helps users make informed decisions about software usage.
19-19
: Good improvement to Docker usage guidance.The expanded instructions with the recommendation to run as root for streamlined setup will help users avoid common Docker permission issues during development and testing.
97-99
: Good documentation formatting improvement.Adding the bash language specifier enables proper syntax highlighting in the documentation, improving readability and following documentation best practices.
tests/unittest/llmapi/test_llm_pytorch.py (1)
9-14
: Good test code consolidation.The removal of the
_test_llm_capture_request_error
import and associated test function aligns with the broader effort to consolidate error capture testing logic, reducing code duplication across test files.tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
661-661
: Systematic memory management improvement.The consistent reduction of
free_gpu_memory_fraction
from 0.9 to 0.75 across multiple test configurations represents a conservative approach to GPU memory allocation, likely improving stability and compatibility with the enhanced quantization features in release 0.21.0.Also applies to: 702-702, 742-742, 828-828, 853-853, 896-896, 995-995, 1088-1088, 1140-1140, 1197-1197
tensorrt_llm/llmapi/llm.py (1)
563-563
: LGTM: Error message now consistently references the correct max_seq_len value.The error message now directly references
build_config.max_seq_len
, which aligns with the logic above wheremax_seq_len
is determined (line 555). This creates consistency between the validation logic and error reporting.tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
208-210
: LGTM: Proper exclusion of additional dummy request typesThe logic correctly excludes
is_attention_dp_dummy
andis_cuda_graph_dummy
requests from token ID copying during extend request processing, which aligns with the existing pattern of not processing token IDs for dummy requests.tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
41-41
: LGTM: Import addition is well-placed and necessary.The import of
can_access_peer
fromtensorrt_llm._ipc_utils
is appropriately placed with other tensorrt_llm imports and is used later in the code for P2P support detection.
606-606
: LGTM: P2P support detection is properly initialized.The
is_p2p_supported
attribute is correctly initialized by callingcan_access_peer(mapping)
with the appropriate mapping object. This will be used to determine whether peer-to-peer access is available for MoE fusion optimizations.
797-801
: LGTM: P2P support check properly integrated into fusion logic.The addition of
self.is_p2p_supported
to thedo_finalize
condition is correct. The fusion optimization (do_finalize = False
) is now only enabled when P2P access is available, along with other requirements like small token count, TRTLLM backend, and NVFP4 quantization. This ensures that the MoE fusion pattern only runs when the underlying peer-to-peer communication infrastructure is available to support efficient expert parallelism.tests/unittest/llmapi/test_llm.py (2)
2093-2107
: Good simplification - unified error handling test.The removal of backend-specific logic makes this test cleaner and more focused. The unified approach of always expecting a
RequestError
for token limit violations aligns with the broader simplification mentioned in the AI summary.
2109-2111
: LGTM - test call updated correctly.The function call has been properly updated to match the simplified signature of
_test_llm_capture_request_error
.docs/source/performance/perf-overview.md (4)
31-64
: Performance data updated for FP4 models.The benchmark results for Llama 3.3 70B FP4 and Llama 3.1 405B FP4 have been updated with comprehensive throughput data across different tensor parallel configurations. The data structure and formatting are consistent.
69-143
: Model names corrected and FP8 performance data updated.Good corrections to the model names to match the actual NVIDIA model identifiers. The extensive FP8 performance data updates provide comprehensive benchmark results across different models and configurations.
219-220
: Good addition - benchmark methodology transparency.Adding the note about which configuration file was used for v0.20 benchmarks improves reproducibility and transparency of the performance data collection process.
243-243
: Minor wording improvement for clarity.The revision improves readability. This also addresses the static analysis hint about "a majority of" being potentially wordy.
PR_Github #12422 [ run ] completed
… to 0.75 to prevent OOM on CI. (NVIDIA#5896) Signed-off-by: Bo Li <[email protected]>
Signed-off-by: nv-guomingz <[email protected]> Signed-off-by: Yingge He <[email protected]> Signed-off-by: Martin Marciniszyn Mehringer <[email protected]> Signed-off-by: Kaiyu Xie <[email protected]> Co-authored-by: nv-guomingz <[email protected]> Co-authored-by: Yingge He <[email protected]> Co-authored-by: Martin Marciniszyn Mehringer <[email protected]> Co-authored-by: Kaiyu Xie <[email protected]> Co-authored-by: zpatel <[email protected]>
Signed-off-by: Superjomn <[email protected]>
…IA#5849) Signed-off-by: Nikita Korobov <[email protected]>
Signed-off-by: Yi Zhang <[email protected]>
NVIDIA#6039) Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: junq <[email protected]> Signed-off-by: Sharan Chetlur <[email protected]> Signed-off-by: QI JUN <[email protected]> Co-authored-by: Sharan Chetlur <[email protected]> Co-authored-by: Yanchao Lu <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: bhsueh <[email protected]>
Signed-off-by: Pengyun Lin <[email protected]>
Force-pushed from ed31c7d to 02615c4
/bot skip --comment "already succeed in previous pipeline"
Actionable comments posted: 0
♻️ Duplicate comments (2)
triton_backend/inflight_batcher_llm/scripts/build.sh (1)

54-55: Fix shell quoting issue in cmake command.

The TODO comment provides good context for the temporary version specification. However, the command substitution should be quoted to prevent word splitting.

Apply this fix for the shell quoting issue:

```diff
-cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ${BUILD_TESTS_ARG} -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 ..
+cmake -DCMAKE_INSTALL_PREFIX:PATH="`pwd`"/install ${BUILD_TESTS_ARG} -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 ..
```

triton_backend/ci/L0_backend_trtllm/test.sh (1)

232-237: Fix shell quoting issue and correct error message.

The shell quoting issue on line 233 was already identified in a previous review but wasn't fixed. Additionally, the error message on line 234 mentions "GPT model type error" but the actual check is for the deprecation warning.

Apply the previously suggested fix for shell quoting:

```diff
- if [ `grep -c "Static batching type is deprecated" $SERVER_LOG` == "0" ]; then
+ if [ "$(grep -c "Static batching type is deprecated" "$SERVER_LOG")" == "0" ]; then
```

And correct the error message:

```diff
- echo -e "\n***\n*** GPT model type error not handled gracefully: line ${LINENO}\n***"
+ echo -e "\n***\n*** Static batching deprecation warning not found: line ${LINENO}\n***"
```
🧹 Nitpick comments (7)
docs/source/quick-start-guide.md (1)

19-19: Consider security implications of root access recommendation.

While running as root simplifies setup, this recommendation could pose security risks in production environments. Consider adding a note about using root only for development/testing purposes and following least-privilege principles in production.

Consider this enhancement:

```diff
-The following examples can most easily be executed using the prebuilt [Docker release container available on NGC](https://registry.ngc.nvidia.com/orgs/nvstaging/teams/tensorrt-llm/containers/release) (see also [release.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/release.md) on GitHub). Ensure to run these commands as a user with appropriate permissions, preferably `root`, to streamline the setup process.
+The following examples can most easily be executed using the prebuilt [Docker release container available on NGC](https://registry.ngc.nvidia.com/orgs/nvstaging/teams/tensorrt-llm/containers/release) (see also [release.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/release.md) on GitHub). For development and testing, ensure to run these commands as a user with appropriate permissions, preferably `root`, to streamline the setup process. In production environments, follow the principle of least privilege.
```

tests/integration/defs/triton_server/test_triton.py (1)

511-511: Approve version pinning but fix line length violation.

The explicit Triton repository tag specifications improve build reproducibility. However, the line exceeds the 120-character limit.

Consider breaking the long CMake command into multiple lines for better readability:

```diff
-f"cmake .. -DTRTLLM_DIR={llm_root} -DCMAKE_INSTALL_PREFIX=install/ -DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 "
+f"cmake .. -DTRTLLM_DIR={llm_root} -DCMAKE_INSTALL_PREFIX=install/ -DBUILD_TESTS=ON "
+f"-DUSE_CXX11_ABI=ON -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 "
+f"-DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 "
```

docs/source/performance/perf-overview.md (1)

243-243: Consider rephrasing for conciseness.

The static analysis tool suggests "a majority of cases" might be wordy. Consider: "In most cases, we also use a higher KV cache percentage..."

```diff
-In a majority of cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command.
+In most cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command.
```
docs/source/release-notes.md (4)

17-19: Hyphenate “rowwise” for clarity.

“rowwise quantization” → “row-wise quantization” is the prevailing spelling in earlier sections of the docs.

```diff
-  - Added support for fp8 rowwise quantization
+  - Added support for fp8 row-wise quantization
```

30-33: Capitalize CUDA & tidy wording.

Minor consistency fixes: CUDA should be uppercase, and Eagle3 should keep its capital E.

```diff
-  - Added Piecewise cuda graph support for MLA
-  - Added model-agnostic one-engine eagle3
+  - Added piece-wise CUDA graph support for MLA
+  - Added model-agnostic one-engine Eagle3
```

37-44: Promote “Benchmark:” to a sub-heading or remove the colon.

The line “Benchmark:” is rendered as a bullet but visually looks like a heading. Either:

- Make it a level-3 heading:

```diff
-  - Benchmark:
+  ### Benchmark
```

- Or drop the colon so it cleanly joins the bulleted list:

```diff
-  - Benchmark:
-  - Added all_reduce.py benchmark script for testing
+  - **Benchmark**
+  - Added all_reduce.py benchmark script for testing
```

Choose one for consistency with previous release sections.

81-83: Add missing space after period.

```diff
-  - Added Qwen3 support.Refer to “Qwen3” section in `examples/models/core/qwen/README.md`.
+  - Added Qwen3 support. Refer to “Qwen3” section in `examples/models/core/qwen/README.md`.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- docs/source/installation/linux.md (1 hunks)
- docs/source/performance/perf-overview.md (3 hunks)
- docs/source/quick-start-guide.md (2 hunks)
- docs/source/reference/support-matrix.md (2 hunks)
- docs/source/release-notes.md (1 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (10 hunks)
- tests/integration/defs/triton_server/test_triton.py (1 hunks)
- tests/integration/test_lists/test-db/l0_a100.yml (2 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1 hunks)
- tests/integration/test_lists/waives.txt (2 hunks)
- tests/unittest/_torch/test_fp8_per_tensor_scale_tllmg_gemm.py (1 hunks)
- tests/unittest/llmapi/_test_remote_mpi_session.sh (1 hunks)
- tests/unittest/llmapi/test_llm.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (0 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (1 hunks)
- tests/unittest/llmapi/test_mpi_session.py (1 hunks)
- triton_backend/ci/L0_backend_trtllm/custom_metrics_verification_tests.py (4 hunks)
- triton_backend/ci/L0_backend_trtllm/test.sh (1 hunks)
- triton_backend/inflight_batcher_llm/scripts/build.sh (1 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
🧰 Additional context used
🧠 Learnings (3)
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
docs/source/release-notes.md (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Shellcheck (0.10.0)
triton_backend/inflight_batcher_llm/scripts/build.sh
[warning] 55-55: Quote this to prevent word splitting.
(SC2046)
triton_backend/ci/L0_backend_trtllm/test.sh
[warning] 233-233: Quote this to prevent word splitting.
(SC2046)
🪛 markdownlint-cli2 (0.17.2)
docs/source/performance/perf-overview.md
66-66: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
67-67: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 LanguageTool
docs/source/performance/perf-overview.md
[style] ~243-~243: ‘a majority of’ might be wordy. Consider a shorter alternative.
Context: ... - 2048 - 4096 - 8192 ``` In a majority of cases, we also use a higher KV cache pe...
(EN_WORDINESS_PREMIUM_A_MAJORITY_OF)
🪛 Ruff (0.12.2)
tests/integration/defs/triton_server/test_triton.py
511-511: Line too long (245 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (34)
docs/source/installation/linux.md (1)
35-35
: LGTM: Important compliance notice added.This legal disclaimer appropriately informs users about third-party software downloads and license review requirements, which is essential for compliance.
tests/integration/test_lists/test-db/l0_a100.yml (1)
17-17
: LGTM: Appropriate test suite reorganization.Moving
test_mpi_session.py
to the PyTorch backend pre_merge stage aligns with enabling this test on Blackwell GPUs and ensures MPI functionality is tested earlier in the pipeline.docs/source/reference/support-matrix.md (2)
28-29
: LGTM: New Qwen3 model support properly documented.The addition of
Qwen3ForCausalLM
andQwen3MoeForCausalLM
entries with correct HuggingFace examples and language-only modality designation is consistent with the existing documentation format.
77-77
: LGTM: TensorRT backend reference updated appropriately.The link text correctly reflects the expanded Qwen model support including Qwen3 alongside the existing versions.
tests/unittest/llmapi/test_mpi_session.py (1)
71-71
: LGTM: Explicit working directory ensures test robustness.Setting
cwd
to the test file's directory makes the subprocess execution more predictable and prevents issues when the test is run from different working directories.tests/unittest/llmapi/test_llm_pytorch.py (1)
9-14
: LGTM: Import cleanup aligns with test refactoring.Removing the unused
_test_llm_capture_request_error
import is consistent with the test function removal and simplification mentioned in the summary.tests/integration/test_lists/waives.txt (1)
86-86
: LGTM! Enabling MPI session test on Blackwell hardware.The removal of the waiver for
unittest/llmapi/test_mpi_session.py
on Blackwell platforms (B200_PCIe and B200) enables more comprehensive test coverage. This aligns with the broader changes mentioned in the AI summary to support MPI session testing across different hardware configurations.tests/unittest/llmapi/_test_remote_mpi_session.sh (1)
10-10
: LGTM! Adding root execution support for CI environments.The addition of the
--allow-run-as-root
flag to thempirun
command is appropriate for testing environments where processes may need to run with elevated privileges, such as in containerized CI systems.tests/unittest/llmapi/test_llm_multi_gpu.py (1)
469-469
: LGTM! Simplifying function call by removing explicit parameter.The removal of the
pytorch_backend=False
argument simplifies the function call to rely on default parameter handling, which aligns with the consolidation of error capture test logic mentioned in the AI summary.tests/unittest/_torch/test_fp8_per_tensor_scale_tllmg_gemm.py (1)
103-103
: LGTM! Tolerance adjustment aligns with FP8 quantization enhancements.The relaxed tolerance thresholds (from 1e-3 to 1e-2) in the assertion are appropriate given the FP8 quantization improvements and performance enhancements mentioned in the broader update. This ensures the test remains robust while accommodating the numerical differences introduced by the updated FP8 implementation.
tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1)
18-18
: LGTM! Test configuration addition is consistent.The new timeout entry for
TestDeepSeekR1::test_nvfp4_multi_gpus[latency_trtllmgen]
follows the established pattern and uses a consistent 180-second timeout value.docs/source/quick-start-guide.md (2)
11-12
: Excellent addition for legal compliance.The license warning appropriately informs users about third-party dependencies and their licensing obligations before installation.
97-99
: Good documentation improvement.Adding the language specifier to the code block enhances documentation clarity and enables proper syntax highlighting.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1219-1220
: LGTM - Improved specificity in dummy request handling.This change refines the conditional logic to be more granular, excluding only specific types of dummy requests (
is_attention_dp_dummy
oris_cuda_graph_dummy
) rather than all dummy requests. This improvement allows other types of dummy requests to have their token IDs processed when needed, which aligns with the broader PyTorch backend improvements mentioned in the summary.tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
650-650
: LGTM: Memory optimization adjustments align with performance improvements.

The systematic reduction of `free_gpu_memory_fraction` from 0.9 to 0.75 in KvCacheConfig instances is consistent with the broader optimization efforts mentioned in the PR summary. This change should help prevent OOM issues in CI environments while maintaining test coverage (the sketch below shows what the change looks like at a call site).

However, I notice two instances (lines 781 and 945) in the `test_cute_dsl_fp8_block_scales` methods still use 0.9. Please verify whether this is intentional due to specific memory requirements for CUTE DSL operations.

Also applies to: 690-690, 728-728, 816-816, 841-841, 882-882, 1073-1073, 1124-1124, 1181-1181
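A hedged sketch of this kind of call-site change; the import path follows the `tensorrt_llm/llmapi/llm_args.py` reference cited in this review, and any other constructor arguments are omitted:

```python
# Sketch only: the reviewed change lowers the KV-cache memory fraction at test call sites.
from tensorrt_llm.llmapi.llm_args import KvCacheConfig  # path per the code-graph reference above

# Before: KvCacheConfig(free_gpu_memory_fraction=0.9)
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.75)
print(kv_cache_config.free_gpu_memory_fraction)
```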
tensorrt_llm/llmapi/llm.py (1)
563-563
: LGTM: Consistent reference to max_seq_len in error message.This fix ensures the error message references the same
max_seq_len
value that was calculated and used in the validation logic above (lines 554-555), improving consistency and accuracy.docs/source/performance/perf-overview.md (5)
31-65
: LGTM: Performance data updates for Llama 3.3 70B FP4 and Llama 3.1 405B FP4. The updated throughput numbers and table formatting look consistent with the existing benchmark structure. The data provides comprehensive coverage across different tensor parallel sizes and input/output sequence length combinations.
69-69
: LGTM: Addition of Llama 3.3 70B FP8 model to the list. The model addition is consistent with the performance data updates in the tables below.
71-71
: LGTM: Addition of Llama 4 Maverick model to the list. This aligns with the new performance section added for this model.
75-143
: LGTM: Comprehensive performance data updates for FP8 models. The updated throughput numbers cover multiple models (Llama 3.1 8B FP8, Llama 3.3 70B FP8, Llama 3.1 405B FP8) and the new Llama 4 Maverick FP8 section. All data follows the consistent table format and provides good coverage across different hardware configurations (H200, H100) and parallelism settings.
219-220
: LGTM: Added clarification about benchmark configuration. The note about v0.20 benchmark data collection configuration helps users understand which benchmark version and settings were used for the data.
tests/unittest/llmapi/test_llm.py (2)
2093-2107
: LGTM: Good test simplification! The refactoring removes conditional backend logic and unifies error handling expectations, making the test more focused and maintainable. The change from conditionally expecting `ValueError` vs `RequestError` to always expecting `RequestError` aligns with the unified error handling approach mentioned in the PR summary.
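The unified expectation boils down to a single `pytest.raises` block, roughly as sketched below; the import location of `RequestError` and the shape of the failing call are assumptions based on the review, not the exact test code.

```python
import pytest
from tensorrt_llm.llmapi import RequestError  # assumed export location


def check_llm_error_capture(llm, oversized_prompt):
    # Regardless of backend, an invalid or oversized request is now expected
    # to surface as a RequestError rather than a backend-specific ValueError.
    with pytest.raises(RequestError):
        llm.generate(oversized_prompt)
```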
2109-2110
: LGTM: Consistent with helper function changes. The function call is correctly updated to match the simplified helper function signature.
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
41-41
: LGTM: Clean import addition for P2P capability checking. The import of `can_access_peer` is properly placed and necessary for the P2P support functionality added to the DeepSeek decoder layer.
606-606
: LGTM: Efficient P2P capability initialization. Computing P2P support once during initialization is more efficient than checking it repeatedly during forward passes. The placement is appropriate after mapping assignment.
797-801
: LGTM: Proper integration of P2P support in MoE finalization logic. The addition of `self.is_p2p_supported` to the `do_finalize` condition correctly ensures that MoE output finalization occurs immediately when P2P is not available, preventing the optimization path from being taken inappropriately.
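Taken together, the three comments describe a pattern along the lines of the sketch below; the import path, the `can_access_peer` signature, and the surrounding class are assumptions made for illustration only.

```python
# Condensed illustration of caching P2P capability and using it to gate
# deferred MoE finalization. Import path and signature are assumed here.
from tensorrt_llm._torch.utils import can_access_peer  # assumed location


class DecoderLayerSketch:
    def __init__(self, mapping):
        self.mapping = mapping
        # Query peer-to-peer capability once at construction time rather than
        # on every forward pass.
        self.is_p2p_supported = can_access_peer(mapping)  # assumed signature

    def moe_do_finalize(self, deferred_path_ok: bool) -> bool:
        # Finalize the MoE output immediately unless the deferred path is
        # usable, which now also requires P2P access between the GPUs involved.
        return not (deferred_path_ok and self.is_p2p_supported)
```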
triton_backend/ci/L0_backend_trtllm/custom_metrics_verification_tests.py (7)
85-85
: LGTM: Method signature simplification. Removing the `is_v1` parameter simplifies the metrics parsing by treating all metrics uniformly without version-specific branching.
94-94
: LGTM: Consistent method call update. The method call correctly removes the `is_v1` parameter to match the simplified method signature.
98-98
: LGTM: Consistent signature simplification. The method signature correctly removes version-specific parameters, maintaining consistency with the unified metrics handling approach.
109-109
: LGTM: Defensive assertion for metric consolidation. The assertion correctly validates that v1-specific metrics are not present in the unified metrics handling, ensuring the consolidation is working as expected.
112-112
: LGTM: Base test method simplification. The removal of the `is_v1` parameter from the main test orchestration method completes the transition to unified metrics handling.
114-114
: LGTM: Consistent method call in base test. The method call correctly removes the `is_v1` parameter, maintaining consistency with the simplified parsing method signature.
141-165
: LGTM: Complete test method consolidation. All test methods have been consistently updated to remove version-specific parameters, completing the transition to unified metrics verification. The consolidation from separate v1/v2 tests to a single approach simplifies the test suite while maintaining coverage.
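Taken together, the consolidation amounts to a version-agnostic parser plus a defensive check, roughly as sketched below; the function, label, and metric names are placeholders inferred from the review rather than the actual test code.

```python
# Placeholder names; the real test parses Triton's custom TRT-LLM metrics.
V1_ONLY_LABELS = ("v1_specific_metric",)


def parse_custom_metrics(metrics_text: str) -> dict:
    """Parse metrics uniformly, with no is_v1 flag or version-specific branching."""
    metrics = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    # Defensive check: the unified path must not expose v1-only metrics.
    assert not any(label in name for name in metrics for label in V1_ONLY_LABELS)
    return metrics
```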
docs/source/release-notes.md (1)
46-52
: Confirm version alignment across docs. The infrastructure table bumps PyTorch to 2.7.1 and TensorRT to 10.11, but older sections (0.20.x) list 2.7.0 / 10.10.0.
Double-check that the numbers here match the actual container tags and wheel metadata for 0.21.0 to avoid confusion.
PR_Github #12508 [ skip ] triggered by Bot
PR_Github #12508 [ skip ] completed with state
Mass integration of commits from the release 0.21 branch. (part 4)
Please help review your commit!
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
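For example, a pre-merge run that disables fail-fast and adds the multi-GPU tests can be requested with a comment like the following; the flag combination is only an illustration of the options listed above.

/bot run --disable-fail-fast --add-multi-gpu-test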
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
Documentation
New Features
Bug Fixes
Chores
Refactor