chore: Mass integration of release/0.21 (part 4) #6211
Conversation
Actionable comments posted: 2
🔭 Outside diff range comments (2)
tests/integration/test_lists/test-db/l0_a100.yml (1)

30-36: YAML indentation error – list items need 4-space indent

`tests:` is indented two spaces, so its list items must be indented four. Currently line 31 starts with only two spaces:

```yaml
  - unittest/trt/attention/test_sage_attention.py …
```

This breaks YAML parsing and will make the test-selector loader crash.

```diff
-  - unittest/trt/attention/test_sage_attention.py …
+    - unittest/trt/attention/test_sage_attention.py …
```

Please fix before merging to avoid CI failures.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

391-399: Critical: Remove duplicated code block

This code block is an exact duplicate of lines 381-389. While the AI summary mentions this duplication was added to prevent illegal memory access during KV cache writes in overlap scheduler warmup, having identical code blocks like this creates maintenance issues and suggests a logical error in the control flow.

The duplication should be removed and the logic restructured so that the tensor zeroing happens exactly once under the appropriate conditions. Consider consolidating the logic or adding proper conditional checks to avoid redundant execution (a minimal illustration follows after the diff below).

```diff
-if (not self._disable_overlap_scheduler
-        and next_draft_tokens_device is None
-        and len(extend_requests) > 0):
-    # During warmup, for those generation requests, we don't have previous tensors,
-    # so we need to set the previous_pos_id_offsets and previous_kv_lens_offsets to zeros
-    # to skip the value changes in _preprocess_inputs. Otherwise, there will be illegal memory access
-    # when writing key/values to the KV cache.
-    self.previous_pos_id_offsets_cuda *= 0
-    self.previous_kv_lens_offsets_cuda *= 0
```
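Purely as an illustration of the suggested consolidation (not TensorRT-LLM's actual code), here is a minimal, self-contained sketch; plain Python lists stand in for the CUDA tensors, and all names are hypothetical:

```python
# Illustrative only: zero the "previous" offset buffers exactly once via a single
# helper, instead of repeating the same guarded block twice in the forward path.
class OffsetsSketch:
    def __init__(self):
        self._disable_overlap_scheduler = False
        # Stand-ins for previous_pos_id_offsets_cuda / previous_kv_lens_offsets_cuda.
        self.previous_pos_id_offsets = [3, 1, 4]
        self.previous_kv_lens_offsets = [1, 5]

    def _zero_previous_offsets_if_needed(self, next_draft_tokens_device, extend_requests):
        # Same condition as the duplicated block: overlap scheduler enabled,
        # no draft tokens yet, and at least one extend (generation) request.
        if (not self._disable_overlap_scheduler
                and next_draft_tokens_device is None
                and len(extend_requests) > 0):
            self.previous_pos_id_offsets = [0] * len(self.previous_pos_id_offsets)
            self.previous_kv_lens_offsets = [0] * len(self.previous_kv_lens_offsets)


engine = OffsetsSketch()
engine._zero_previous_offsets_if_needed(next_draft_tokens_device=None, extend_requests=["req0"])
print(engine.previous_pos_id_offsets, engine.previous_kv_lens_offsets)  # [0, 0, 0] [0, 0]
```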
🧹 Nitpick comments (7)
tests/integration/defs/triton_server/test_triton.py (1)

511-511: Format long cmake command for better readability.

Good consistency with the build script changes. However, the line exceeds the 120-character limit and should be formatted for better readability.

Apply this formatting improvement:

```diff
-f"cmake .. -DTRTLLM_DIR={llm_root} -DCMAKE_INSTALL_PREFIX=install/ -DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 "
+f"cmake .. -DTRTLLM_DIR={llm_root} -DCMAKE_INSTALL_PREFIX=install/ "
+f"-DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON "
+f"-DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 "
+f"-DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 "
```
tests/unittest/llmapi/test_mpi_session.py (1)

63-72: Add `text=True` and remove redundant `universal_newlines`

`universal_newlines=True` is the legacy spelling of `text=True` (its alias since Python 3.7). Since you are already on 3.8+ you can simplify:

```diff
-    bufsize=1,
-    start_new_session=True,
-    universal_newlines=True,
+    bufsize=1,
+    start_new_session=True,
+    text=True,
```

Nit-level, but it brings the code in line with modern std-lib usage and is marginally clearer for newcomers. A hedged sketch of the resulting call follows below.
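For illustration only, here is a minimal sketch of what the resulting launch could look like, also folding in the safety timeout suggested for this file further below; the script path handling and the 120 s budget are assumptions, not the test's actual code:

```python
import os
import subprocess

# Hypothetical path resolution for the helper script shipped next to the test.
script = os.path.join(os.path.dirname(os.path.abspath(__file__)), "_test_remote_mpi_session.sh")

process = subprocess.Popen(
    ["bash", script],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    bufsize=1,
    start_new_session=True,
    text=True,  # modern alias for universal_newlines=True (Python 3.7+)
)
try:
    # Fail fast if the helper hangs before mpirun's own `timeout 60` can kick in.
    return_code = process.wait(timeout=120)
except subprocess.TimeoutExpired:
    process.kill()
    raise
print("exit code:", return_code)
```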
tests/unittest/llmapi/_test_remote_mpi_session.sh (1)

9-11: `--allow-run-as-root` may not exist in all MPI stacks

`mpirun --allow-run-as-root` works for OpenMPI/HPC-X, but other MPI impls (Intel-MPI, MPICH) error out on the unknown flag, failing the test even when it would otherwise pass. If portability matters, gate the flag:

```bash
if mpirun --help 2>&1 | grep -q -- '--allow-run-as-root'; then
  root_flag=--allow-run-as-root
fi
timeout 60 mpirun ${root_flag:+$root_flag} -np 2 …
```
tests/integration/test_lists/waives.txt (1)

86-86: Hard-to-maintain one-liner – split long skip list

The mega-line containing 11 separate paths (line 86) is nearly 300 chars. Breaking it into one test per line improves diff-ability and future merges.

Not urgent, but worth refactoring next time this file is touched.
docs/source/release-notes.md (3)

31-32: Capitalize “CUDA” for consistency with NVIDIA terminology

Spelling nit: “cuda” ➜ “CUDA”.

```diff
-  - Added Piecewise cuda graph support for MLA
+  - Added Piecewise CUDA graph support for MLA
```

34-35: Use the official “PyTorch” casing

```diff
-  - Integrated TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner
+  - Integrated TRT-LLM Gen FP8 block-scale MoE with PyTorch workflow kernel autotuner
```

43-44: Align tool name with existing docs (“trtllm-serve bench”)

The CLI utility is referred to elsewhere as `trtllm-serve bench`. Consider updating for clarity:

```diff
-  - Added no_kv_cache_reuse option and streaming support for trtllm serve bench
+  - Added no_kv_cache_reuse option and streaming support for `trtllm-serve bench`
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- docs/source/installation/linux.md (1 hunks)
- docs/source/performance/perf-overview.md (3 hunks)
- docs/source/quick-start-guide.md (2 hunks)
- docs/source/reference/support-matrix.md (2 hunks)
- docs/source/release-notes.md (1 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (10 hunks)
- tests/integration/defs/triton_server/test_triton.py (1 hunks)
- tests/integration/test_lists/test-db/l0_a100.yml (2 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1 hunks)
- tests/integration/test_lists/waives.txt (2 hunks)
- tests/unittest/_torch/test_fp8_per_tensor_scale_tllmg_gemm.py (1 hunks)
- tests/unittest/llmapi/_test_remote_mpi_session.sh (1 hunks)
- tests/unittest/llmapi/test_llm.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (0 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (1 hunks)
- tests/unittest/llmapi/test_mpi_session.py (1 hunks)
- triton_backend/ci/L0_backend_trtllm/custom_metrics_verification_tests.py (4 hunks)
- triton_backend/ci/L0_backend_trtllm/test.sh (1 hunks)
- triton_backend/inflight_batcher_llm/scripts/build.sh (1 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
🧰 Additional context used
🧠 Learnings (3)
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
docs/source/release-notes.md (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Shellcheck (0.10.0)
triton_backend/inflight_batcher_llm/scripts/build.sh
[warning] 55-55: Quote this to prevent word splitting.
(SC2046)
triton_backend/ci/L0_backend_trtllm/test.sh
[warning] 233-233: Quote this to prevent word splitting.
(SC2046)
🪛 Ruff (0.12.2)
tests/integration/defs/triton_server/test_triton.py
511-511: Line too long (245 > 120)
(E501)
🪛 markdownlint-cli2 (0.17.2)
docs/source/performance/perf-overview.md
62-62: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🪛 LanguageTool
docs/source/performance/perf-overview.md
[style] ~243-~243: ‘a majority of’ might be wordy. Consider a shorter alternative.
Context: ... - 2048 - 4096 - 8192 ``` In a majority of cases, we also use a higher KV cache pe...
(EN_WORDINESS_PREMIUM_A_MAJORITY_OF)
🔇 Additional comments (38)
tests/unittest/_torch/test_fp8_per_tensor_scale_tllmg_gemm.py (1)
103-103
: LGTM! Tolerance adjustment aligns with FP8 quantization improvements.The relaxation of tolerance from 1e-3 to 1e-2 for both
atol
andrtol
parameters brings consistency with the second test function (line 196) and aligns with the FP8 quantization enhancements introduced in TensorRT-LLM Release 0.21.0. This adjustment ensures robust validation under the updated numerical behaviors of FP8 block-scale GEMM operations.triton_backend/ci/L0_backend_trtllm/custom_metrics_verification_tests.py (3)
85-110
: LGTM on V1 batching strategy cleanup.The removal of
is_v1
parameters and V1-specific logic aligns well with the deprecation of static batching mentioned in the test script. The simplified interface is cleaner and the assertion on line 109 properly enforces the exclusion of V1-specific metrics.
112-138
: Simplified test method improves maintainability.The removal of V1-specific conditional logic makes the test method more straightforward while preserving all the core validation functionality.
139-166
: Test method updates are consistent with interface changes.All test methods properly use the simplified
_base_test
signature, and the focus on inflight batching (IFB) tests aligns with the deprecation of V1 batching strategy.tests/unittest/llmapi/test_mpi_session.py (1)
63-72: Consider a safety timeout for the subprocess

In the shell script you guard `mpirun` with `timeout 60`, but the Python side will happily block forever if that script itself hangs before spawning MPI (e.g., path typo). Add `timeout=` to `process.wait()` or switch to `subprocess.run(..., timeout=…)` to prevent indeterminate CI stalls.

```diff
-return_code = process.wait()
+# Fail fast after 120 s wall-clock to keep CI responsive
+return_code = process.wait(timeout=120)
```

docs/source/installation/linux.md (1)
35-35
: Good addition for legal transparency.Adding this warning about third-party software after the installation command is appropriate and helps users make informed decisions about licensing compliance.
tests/unittest/llmapi/test_llm_pytorch.py (1)
9-14
: Test coverage for capture-request-error is maintainedThe
_test_llm_capture_request_error
helper and its correspondingtest_llm_capture_request_error
are still defined and exercised in:
tests/unittest/llmapi/test_llm.py
tests/unittest/llmapi/test_llm_multi_gpu.py
No further changes needed here.
tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1)
18-18
: Appropriate timeout configuration for GB200 hardware.Adding the timeout designation for the
latency_trtllmgen
test variant is reasonable given the resource-intensive nature of accuracy tests on GB200 multi-node setups. The 180-second timeout aligns with similar tests in the configuration.tests/unittest/llmapi/test_llm_multi_gpu.py (1)
469-469: Verified removal of `pytorch_backend` from all call sites

All instances of `_test_llm_capture_request_error` now match the simplified signature—no remaining calls include the removed `pytorch_backend` parameter:

- tests/unittest/llmapi/test_llm_multi_gpu.py: `_test_llm_capture_request_error(tp_size=2)`
- tests/unittest/llmapi/test_llm.py:
  • Definition: `def _test_llm_capture_request_error(tp_size: int = 1):`
  • Call site: `_test_llm_capture_request_error(tp_size=1)`
docs/source/reference/support-matrix.md (2)
28-29
: Proper documentation of new Qwen3 model support.The addition of
Qwen3ForCausalLM
andQwen3MoeForCausalLM
entries with their respective HuggingFace examples is well-formatted and consistent with the existing documentation structure.
77-77
: Consistent update to TensorRT backend model list.Including Qwen3 in the TensorRT backend Qwen model link maintains consistency with the PyTorch backend additions and provides users with complete support information.
docs/source/quick-start-guide.md (3)
11-11
: LGTM: Important legal notice added.The addition of a license warning about third-party software is a good practice for legal compliance and user awareness.
19-19
: LGTM: Helpful Docker usage guidance.The recommendation to run Docker commands as root or with appropriate permissions will help users avoid common permission issues during setup.
97-97
: LGTM: Improved code block formatting.Adding the language specifier with file path (
bash:docs/source/quick-start-guide.md
) will improve syntax highlighting and documentation rendering.tensorrt_llm/llmapi/llm.py (2)
41-41
: LGTM: Appropriate import for P2P functionality.The import of
can_access_peer
is consistent with the AI summary indicating peer-to-peer access checks are being added to support enhanced model features.
563-563
: LGTM: Improved error message consistency.Changing from
max_seq_len
tobuild_config.max_seq_len
in the error message makes it consistent with the actual source of the value and improves debugging clarity.tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
41-41
: LGTM: Consistent P2P import pattern.The import of
can_access_peer
follows the same pattern as in the LLM file and supports the P2P access checking functionality.
606-606
: LGTM: Clear P2P support detection.Initializing
is_p2p_supported
withcan_access_peer(mapping)
provides a clear way to determine if peer-to-peer access is available for the current mapping configuration.
801-802
: LGTM: Enhanced fusion condition with P2P check.The addition of
self.is_p2p_supported
to thedo_finalize
condition is logical - the fused operations likely require peer-to-peer access to work correctly. This prevents attempting fusion when P2P isn't available, which could lead to performance issues or failures.tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
208-210
: LGTM: More specific dummy request filtering logicThe change to exclude
is_attention_dp_dummy
andis_cuda_graph_dummy
requests (in addition to the generalis_dummy
check) from appending last tokens is more precise and aligns with the specific types of dummy requests that should be filtered out during extend requests without previous tensors.tests/integration/defs/accuracy/test_llm_api_pytorch.py (10)
661-661
: LGTM: Memory fraction reduction for improved stability.The reduction from 0.9 to 0.75 aligns with the broader memory management improvements in TensorRT-LLM Release 0.21.0 and should help prevent OOM issues in CI environments.
702-702
: LGTM: Consistent memory optimization for multi-GPU scenarios.This change maintains consistency with the single-GPU test and helps ensure stable test execution across different GPU configurations.
742-742
: LGTM: Appropriate memory management for FP8 quantization tests.The reduced memory fraction is well-suited for FP8 quantized model testing where memory usage patterns may differ from standard configurations.
828-828
: LGTM: Memory optimization for CUDA graph padding scenarios.The reduced memory fraction appropriately accounts for additional overhead from CUDA graph padding operations.
853-853
: LGTM: Consistent memory management for multi-GPU CUDA graph scenarios.This change maintains consistency across similar test configurations and helps ensure stable execution in resource-intensive multi-GPU setups.
896-896
: LGTM: Appropriate memory allocation for multi-GPU FP8 testing.The conservative memory fraction is well-suited for intensive 4-GPU FP8 block scales testing scenarios.
995-995
: LGTM: Conservative memory allocation for expert parallel load balancing.The reduced memory fraction is appropriate for complex MoE scenarios with static expert parallel load balancing.
1088-1088
: LGTM: Appropriate memory management for NVFP4 quantization.The conservative memory allocation is well-suited for NVFP4 quantization testing, which may have unique memory usage patterns.
1140-1140
: LGTM: Consistent memory optimization for multi-GPU NVFP4 scenarios.This maintains consistency with other multi-GPU test configurations and ensures stable execution for complex NVFP4 quantization scenarios.
1197-1198
: LGTM: Appropriate memory management for no KV cache reuse scenarios.The conservative memory fraction is well-suited for tests with KV cache block reuse disabled, where memory usage patterns differ from standard configurations.
tests/unittest/llmapi/test_llm.py (2)
2093-2107
: LGTM: Test simplification consolidates error handling logic.The removal of the
pytorch_backend
conditional logic simplifies the test by always using theLLM
class with a consistentBuildConfig
. This change aligns with the broader consolidation effort mentioned in the AI summary, where backend-specific error capture tests were unified into a single implementation.The test logic remains correct:
- Uses a fixed
max_num_tokens=64
configuration- Tests with a prompt of 65 'A' tokens to exceed the limit
- Properly expects a
RequestError
exception
2109-2110
: LGTM: Wrapper function correctly updated.The
test_llm_capture_request_error
wrapper function has been appropriately updated to call_test_llm_capture_request_error
without thepytorch_backend
argument, consistent with the simplified function signature.docs/source/performance/perf-overview.md (6)
32-46
: LGTM: Improved table formatting and structure.The Llama 3.3 70B FP4 table has been reformatted with clearer column headers by explicitly labeling "TP Size" and removing the redundant "GPU" column. The throughput values have been updated and the table structure is more readable.
50-64
: LGTM: Expanded coverage with new TP size configurations.The Llama 3.1 405B FP4 table now includes TP sizes 4 and 8 with corresponding throughput data, providing more comprehensive performance metrics. The table formatting is consistent with the other tables.
69-71
: LGTM: Model list updated to reflect latest support.The FP8 models list has been appropriately updated to replace "Llama 3.1 70B FP8" with "Llama 3.3 70B FP8" and add the new "Llama 4 Maverick FP8" model, aligning with the performance tables below.
92-110
: LGTM: Comprehensive performance data for Llama 3.3 70B FP8.The table provides extensive throughput measurements across different TP sizes and GPU configurations (H200 and H100). The formatting is consistent and the data appears complete, though some cells are intentionally left empty where measurements may not be available.
128-143
: LGTM: New Llama 4 Maverick FP8 performance section.The addition of the Llama 4 Maverick FP8 performance table provides valuable benchmarking data for the new model. The table follows the established format and includes measurements for both H200 and H100 GPUs across various input/output sequence length combinations.
219-243
: LGTM: Clear documentation of benchmark methodology.The addition of the note specifying that v0.20 benchmark data was collected using the provided
llm_options.yml
file improves reproducibility. The YAML configuration is properly formatted and the explanation about KV cache percentage usage provides helpful context for users attempting to reproduce results.
Force-pushed from 2314239 to ed31c7d
/bot run --disable-fail-fast
PR_Github #12422 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (2)
triton_backend/inflight_batcher_llm/scripts/build.sh (1)

54-55: Fix shell quoting issue in cmake command.

The TODO comment provides good context for the temporary version specification. However, the command substitution should be quoted to prevent word splitting.

Apply this fix for the shell quoting issue:

```diff
-cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ${BUILD_TESTS_ARG} -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 ..
+cmake -DCMAKE_INSTALL_PREFIX:PATH="`pwd`"/install ${BUILD_TESTS_ARG} -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 ..
```

triton_backend/ci/L0_backend_trtllm/test.sh (1)

232-237: Approve simplified V1 deprecation handling, but fix shell quoting.

The approach of checking server logs for the deprecation warning is more efficient than running full test suites. This aligns well with the V1 cleanup in the metrics verification tests.

Apply this fix for the shell quoting issue:

```diff
- if [ `grep -c "Static batching type is deprecated" $SERVER_LOG` == "0" ]; then
+ if [ "$(grep -c "Static batching type is deprecated" "$SERVER_LOG")" == "0" ]; then
```
🧹 Nitpick comments (6)
tests/integration/defs/triton_server/test_triton.py (1)

511-511: Break long line for better readability.

The explicit Triton repository tags ensure version consistency, which aligns well with the build script changes. However, the line exceeds 120 characters.

Consider breaking the line for better readability:

```diff
-f"cmake .. -DTRTLLM_DIR={llm_root} -DCMAKE_INSTALL_PREFIX=install/ -DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 "
+f"cmake .. -DTRTLLM_DIR={llm_root} -DCMAKE_INSTALL_PREFIX=install/ -DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON "
+f"-DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 "
+f"-DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 "
```

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

391-400: Consider refactoring to eliminate code duplication

This block is identical to lines 381-390, creating code duplication. While the reinforced zeroing may be intentional for preventing illegal memory access during KV cache writes (as mentioned in the release notes), consider extracting this logic into a helper method or ensuring it only executes once under the same conditions.

```diff
+def _zero_previous_offsets_if_needed(self, next_draft_tokens_device, extend_requests):
+    if (not self._disable_overlap_scheduler
+            and next_draft_tokens_device is None
+            and len(extend_requests) > 0):
+        # During warmup, for those generation requests, we don't have previous tensors,
+        # so we need to set the previous_pos_id_offsets and previous_kv_lens_offsets to zeros
+        # to skip the value changes in _preprocess_inputs. Otherwise, there will be illegal memory access
+        # when writing key/values to the KV cache.
+        self.previous_pos_id_offsets_cuda *= 0
+        self.previous_kv_lens_offsets_cuda *= 0
```

Then call this method once at the appropriate location instead of having duplicate blocks.
docs/source/release-notes.md (4)

37-44: Unify list-item heading style for “Benchmark” section

`Benchmark:` is the only second-level list heading that keeps a trailing colon and is not bolded (compare with `**Model Support**`, `**Features**`). For consistency and cleaner Markdown rendering, drop the colon and bold the tag:

```diff
-  - Benchmark:
+  - **Benchmark**
```

81-82: Fix punctuation typo (“support.Refer”)

There is a missing space after the period.

```diff
-  - Added Qwen3 support.Refer to “Qwen3” section in `examples/models/core/qwen/README.md`.
+  - Added Qwen3 support. Refer to “Qwen3” section in `examples/models/core/qwen/README.md`.
```

54-55: Wrap `_AutoDeployLlmArgs` in back-ticks

The leading underscore triggers italic formatting in Markdown, making the identifier hard to read. Use code styling and tighten wording:

```diff
-  - Set _AutoDeployLlmArgs as primary config object
+  - Set `_AutoDeployLlmArgs` as the primary configuration object
```

17-18: Use code formatting for quantization mode names

Technical tokens such as `w4a8_mxfp4_fp8` and `fp8 rowwise quantization` are easier to spot when wrapped in back-ticks.

```diff
-  - Added support for w4a8_mxfp4_fp8 quantization
-  - Added support for fp8 rowwise quantization
+  - Added support for `w4a8_mxfp4_fp8` quantization
+  - Added support for `fp8` row-wise quantization
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- docs/source/installation/linux.md (1 hunks)
- docs/source/performance/perf-overview.md (3 hunks)
- docs/source/quick-start-guide.md (2 hunks)
- docs/source/reference/support-matrix.md (2 hunks)
- docs/source/release-notes.md (1 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (10 hunks)
- tests/integration/defs/triton_server/test_triton.py (1 hunks)
- tests/integration/test_lists/test-db/l0_a100.yml (2 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1 hunks)
- tests/integration/test_lists/waives.txt (2 hunks)
- tests/unittest/_torch/test_fp8_per_tensor_scale_tllmg_gemm.py (1 hunks)
- tests/unittest/llmapi/_test_remote_mpi_session.sh (1 hunks)
- tests/unittest/llmapi/test_llm.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (0 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (1 hunks)
- tests/unittest/llmapi/test_mpi_session.py (1 hunks)
- triton_backend/ci/L0_backend_trtllm/custom_metrics_verification_tests.py (4 hunks)
- triton_backend/ci/L0_backend_trtllm/test.sh (1 hunks)
- triton_backend/inflight_batcher_llm/scripts/build.sh (1 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
🧰 Additional context used
🧠 Learnings (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
docs/source/release-notes.md (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
KvCacheConfig
(764-843)
🪛 Ruff (0.12.2)
tests/integration/defs/triton_server/test_triton.py
511-511: Line too long (245 > 120)
(E501)
🪛 Shellcheck (0.10.0)
triton_backend/inflight_batcher_llm/scripts/build.sh
[warning] 55-55: Quote this to prevent word splitting.
(SC2046)
triton_backend/ci/L0_backend_trtllm/test.sh
[warning] 233-233: Quote this to prevent word splitting.
(SC2046)
🪛 markdownlint-cli2 (0.17.2)
docs/source/performance/perf-overview.md
66-66: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
67-67: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 LanguageTool
docs/source/performance/perf-overview.md
[style] ~243-~243: ‘a majority of’ might be wordy. Consider a shorter alternative.
Context: ... - 2048 - 4096 - 8192 ``` In a majority of cases, we also use a higher KV cache pe...
(EN_WORDINESS_PREMIUM_A_MAJORITY_OF)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (28)
triton_backend/ci/L0_backend_trtllm/custom_metrics_verification_tests.py (1)
85-165
: LGTM: Clean V1 deprecation removal with improved validation.

The systematic removal of V1-specific logic from metrics parsing and testing is well-executed and aligns with the broader V1 deprecation effort across the codebase. The addition of the explicit assertion `self.assertNotIn("v1_specific_metric", key)` at line 109 provides good defensive programming to ensure V1 metrics are properly rejected.

The method signature simplifications and consistent updates to all test methods maintain code clarity while removing deprecated functionality. A minimal illustration of this defensive-assertion pattern follows below.
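For illustration only, a tiny self-contained sketch of the defensive-assertion pattern being praised; the parser and metric names below are made up and are not the real verification test:

```python
import unittest


class MetricsKeySketch(unittest.TestCase):
    """Illustrative only: assert that no V1-specific keys survive metric parsing."""

    def _parse_metrics(self):
        # Stand-in for the real parser; returns metric-name -> value.
        return {"request_count": 10, "inflight_batcher_specific_metric": 3}

    def test_no_v1_metrics_present(self):
        for key in self._parse_metrics():
            # Fails loudly if a deprecated V1 metric ever reappears.
            self.assertNotIn("v1_specific_metric", key)


if __name__ == "__main__":
    unittest.main()
```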
tests/integration/test_lists/waives.txt (1)
86-86
: Re-enabling MPI session tests on Blackwell platforms.The removal of
unittest/llmapi/test_mpi_session.py
from the waive lists for both B200_PCIe and B200 platforms indicates that the MPI session test is now stable and ready to run on Blackwell hardware. This aligns with the coordinated changes in the test configuration files to actively include this test in the pre-merge testing pipeline.Also applies to: 158-158
tests/integration/test_lists/test-db/l0_a100.yml (2)
17-17
: Logical placement of MPI session test in pre-merge PyTorch pipeline.Adding
unittest/llmapi/test_mpi_session.py
to the pre-merge PyTorch backend test suite makes sense as MPI session functionality is foundational and should be validated early in the pipeline. The "generic tests" comment appropriately categorizes this as infrastructure testing.
31-31
: Consolidated test configuration for post-merge TensorRT backend.The restructuring to move MPI session tests to the PyTorch backend while keeping other unit tests in the TensorRT backend reflects a logical separation of concerns between PyTorch-specific MPI functionality and TensorRT-specific components.
tests/unittest/llmapi/_test_remote_mpi_session.sh (1)
10-10
: Security consideration: Adding root privileges to MPI execution.The
--allow-run-as-root
flag is necessary for MPI to run in containerized CI environments where tests may execute as root. While this introduces elevated privileges, it's appropriate for this test-specific script that runs in controlled CI environments.Please ensure this script is only used in test environments and not in production contexts where running MPI as root could pose security risks.
tests/unittest/llmapi/test_mpi_session.py (1)
71-71
: Improved subprocess reliability with explicit working directory.Setting
cwd=os.path.dirname(os.path.abspath(__file__))
ensures the shell script_test_remote_mpi_session.sh
can be located reliably regardless of the test execution context. This makes the test more robust and location-independent, which is essential for consistent CI execution.docs/source/installation/linux.md (1)
35-35
: Good addition for legal compliance and user awareness.The warning about third-party software downloads and license review is appropriately placed immediately after the installation command and provides important legal information to users.
tests/unittest/_torch/test_fp8_per_tensor_scale_tllmg_gemm.py (1)
103-103
: Appropriate tolerance adjustment for FP8 precision characteristics.

The relaxation from 1e-3 to 1e-2 tolerance is reasonable for FP8 quantization operations, which have inherent precision limitations compared to higher precision formats. This aligns with the FP8 enhancements mentioned in the release notes. The sketch below illustrates what the looser bound means in practice.
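To make the effect of the relaxed bound concrete, here is a small illustrative check using the usual `atol + rtol * |expected|` closeness rule (the convention NumPy/PyTorch-style allclose helpers follow); the numbers are made up:

```python
def within_tolerance(actual: float, expected: float, atol: float, rtol: float) -> bool:
    # Absolute slack plus slack proportional to the expected magnitude.
    return abs(actual - expected) <= atol + rtol * abs(expected)


expected, actual = 1.000, 1.004  # a ~0.4% FP8 rounding error, invented for illustration
print(within_tolerance(actual, expected, atol=1e-3, rtol=1e-3))  # False – old, tighter bound
print(within_tolerance(actual, expected, atol=1e-2, rtol=1e-2))  # True  – relaxed bound
```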
docs/source/reference/support-matrix.md (2)
28-29
: Proper documentation of new Qwen3 model support.The new model entries follow the existing format consistently, with appropriate HuggingFace examples and correct modality classification.
77-77
: Consistent reference update for Qwen3 support.The link update appropriately includes Qwen3 alongside the existing Qwen model references.
tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1)
18-18
: Appropriate test timeout configuration addition.The new test case entry follows the existing format and timeout value (180 seconds) is consistent with other entries. This properly manages execution expectations for the DeepSeekR1 NVFP4 multi-GPU test.
tests/unittest/llmapi/test_llm_multi_gpu.py (1)
469-469
: Good refactoring that simplifies the test interface.Removing the
pytorch_backend=False
parameter aligns with the helper function refactoring mentioned in the AI summary, where backend-specific logic was unified. This simplifies the test call while maintaining functionality.docs/source/quick-start-guide.md (3)
11-11
: Excellent addition for legal compliance and user awareness.The warning about third-party software downloads and license review is a valuable addition that enhances transparency and helps users make informed decisions about software usage.
19-19
: Good improvement to Docker usage guidance.The expanded instructions with the recommendation to run as root for streamlined setup will help users avoid common Docker permission issues during development and testing.
97-99
: Good documentation formatting improvement.Adding the bash language specifier enables proper syntax highlighting in the documentation, improving readability and following documentation best practices.
tests/unittest/llmapi/test_llm_pytorch.py (1)
9-14
: Good test code consolidation.The removal of the
_test_llm_capture_request_error
import and associated test function aligns with the broader effort to consolidate error capture testing logic, reducing code duplication across test files.tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
661-661
: Systematic memory management improvement.The consistent reduction of
free_gpu_memory_fraction
from 0.9 to 0.75 across multiple test configurations represents a conservative approach to GPU memory allocation, likely improving stability and compatibility with the enhanced quantization features in release 0.21.0.Also applies to: 702-702, 742-742, 828-828, 853-853, 896-896, 995-995, 1088-1088, 1140-1140, 1197-1197
tensorrt_llm/llmapi/llm.py (1)
563-563
: LGTM: Error message now consistently references the correct max_seq_len value.The error message now directly references
build_config.max_seq_len
, which aligns with the logic above wheremax_seq_len
is determined (line 555). This creates consistency between the validation logic and error reporting.tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
208-210
: LGTM: Proper exclusion of additional dummy request typesThe logic correctly excludes
is_attention_dp_dummy
andis_cuda_graph_dummy
requests from token ID copying during extend request processing, which aligns with the existing pattern of not processing token IDs for dummy requests.tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
41-41
: LGTM: Import addition is well-placed and necessary.The import of
can_access_peer
fromtensorrt_llm._ipc_utils
is appropriately placed with other tensorrt_llm imports and is used later in the code for P2P support detection.
606-606
: LGTM: P2P support detection is properly initialized.The
is_p2p_supported
attribute is correctly initialized by callingcan_access_peer(mapping)
with the appropriate mapping object. This will be used to determine whether peer-to-peer access is available for MoE fusion optimizations.
797-801
: LGTM: P2P support check properly integrated into fusion logic.The addition of
self.is_p2p_supported
to thedo_finalize
condition is correct. The fusion optimization (do_finalize = False
) is now only enabled when P2P access is available, along with other requirements like small token count, TRTLLM backend, and NVFP4 quantization. This ensures that the MoE fusion pattern only runs when the underlying peer-to-peer communication infrastructure is available to support efficient expert parallelism.tests/unittest/llmapi/test_llm.py (2)
2093-2107
: Good simplification - unified error handling test.The removal of backend-specific logic makes this test cleaner and more focused. The unified approach of always expecting a
RequestError
for token limit violations aligns with the broader simplification mentioned in the AI summary.
2109-2111
: LGTM - test call updated correctly.The function call has been properly updated to match the simplified signature of
_test_llm_capture_request_error
.docs/source/performance/perf-overview.md (4)
31-64
: Performance data updated for FP4 models.The benchmark results for Llama 3.3 70B FP4 and Llama 3.1 405B FP4 have been updated with comprehensive throughput data across different tensor parallel configurations. The data structure and formatting are consistent.
69-143
: Model names corrected and FP8 performance data updated.Good corrections to the model names to match the actual NVIDIA model identifiers. The extensive FP8 performance data updates provide comprehensive benchmark results across different models and configurations.
219-220
: Good addition - benchmark methodology transparency.Adding the note about which configuration file was used for v0.20 benchmarks improves reproducibility and transparency of the performance data collection process.
243-243
: Minor wording improvement for clarity.The revision improves readability. This also addresses the static analysis hint about "a majority of" being potentially wordy.
PR_Github #12422 [ run ] completed
… to 0.75 to prevent OOM on CI. (NVIDIA#5896) Signed-off-by: Bo Li <[email protected]>
Signed-off-by: nv-guomingz <[email protected]> Signed-off-by: Yingge He <[email protected]> Signed-off-by: Martin Marciniszyn Mehringer <[email protected]> Signed-off-by: Kaiyu Xie <[email protected]> Co-authored-by: nv-guomingz <[email protected]> Co-authored-by: Yingge He <[email protected]> Co-authored-by: Martin Marciniszyn Mehringer <[email protected]> Co-authored-by: Kaiyu Xie <[email protected]> Co-authored-by: zpatel <[email protected]>
Signed-off-by: Superjomn <[email protected]>
…IA#5849) Signed-off-by: Nikita Korobov <[email protected]>
Signed-off-by: Yi Zhang <[email protected]>
NVIDIA#6039) Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: junq <[email protected]> Signed-off-by: Sharan Chetlur <[email protected]> Signed-off-by: QI JUN <[email protected]> Co-authored-by: Sharan Chetlur <[email protected]> Co-authored-by: Yanchao Lu <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: bhsueh <[email protected]>
Signed-off-by: Pengyun Lin <[email protected]>
Force-pushed from ed31c7d to 02615c4
/bot skip --comment "already succeed in previous pipeline"
Actionable comments posted: 0
♻️ Duplicate comments (2)
triton_backend/inflight_batcher_llm/scripts/build.sh (1)

54-55: Fix shell quoting issue in cmake command.

The TODO comment provides good context for the temporary version specification. However, the command substitution should be quoted to prevent word splitting.

Apply this fix for the shell quoting issue:

```diff
-cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ${BUILD_TESTS_ARG} -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 ..
+cmake -DCMAKE_INSTALL_PREFIX:PATH="`pwd`"/install ${BUILD_TESTS_ARG} -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 ..
```

triton_backend/ci/L0_backend_trtllm/test.sh (1)

232-237: Fix shell quoting issue and correct error message.

The shell quoting issue on line 233 was already identified in a previous review but wasn't fixed. Additionally, the error message on line 234 mentions "GPT model type error" but the actual check is for the deprecation warning.

Apply the previously suggested fix for shell quoting:

```diff
- if [ `grep -c "Static batching type is deprecated" $SERVER_LOG` == "0" ]; then
+ if [ "$(grep -c "Static batching type is deprecated" "$SERVER_LOG")" == "0" ]; then
```

And correct the error message:

```diff
- echo -e "\n***\n*** GPT model type error not handled gracefully: line ${LINENO}\n***"
+ echo -e "\n***\n*** Static batching deprecation warning not found: line ${LINENO}\n***"
```
🧹 Nitpick comments (7)
docs/source/quick-start-guide.md (1)

19-19: Consider security implications of root access recommendation.

While running as root simplifies setup, this recommendation could pose security risks in production environments. Consider adding a note about using root only for development/testing purposes and following least-privilege principles in production.

Consider this enhancement:

```diff
-The following examples can most easily be executed using the prebuilt [Docker release container available on NGC](https://registry.ngc.nvidia.com/orgs/nvstaging/teams/tensorrt-llm/containers/release) (see also [release.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/release.md) on GitHub). Ensure to run these commands as a user with appropriate permissions, preferably `root`, to streamline the setup process.
+The following examples can most easily be executed using the prebuilt [Docker release container available on NGC](https://registry.ngc.nvidia.com/orgs/nvstaging/teams/tensorrt-llm/containers/release) (see also [release.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/release.md) on GitHub). For development and testing, ensure to run these commands as a user with appropriate permissions, preferably `root`, to streamline the setup process. In production environments, follow the principle of least privilege.
```

tests/integration/defs/triton_server/test_triton.py (1)

511-511: Approve version pinning but fix line length violation.

The explicit Triton repository tag specifications improve build reproducibility. However, the line exceeds the 120-character limit.

Consider breaking the long CMake command into multiple lines for better readability:

```diff
-f"cmake .. -DTRTLLM_DIR={llm_root} -DCMAKE_INSTALL_PREFIX=install/ -DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 -DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 "
+f"cmake .. -DTRTLLM_DIR={llm_root} -DCMAKE_INSTALL_PREFIX=install/ -DBUILD_TESTS=ON "
+f"-DUSE_CXX11_ABI=ON -DTRITON_COMMON_REPO_TAG=r25.05 -DTRITON_CORE_REPO_TAG=r25.05 "
+f"-DTRITON_THIRD_PARTY_REPO_TAG=r25.05 -DTRITON_BACKEND_REPO_TAG=r25.05 "
```

docs/source/performance/perf-overview.md (1)

243-243: Consider rephrasing for conciseness.

The static analysis tool suggests "a majority of cases" might be wordy. Consider: "In most cases, we also use a higher KV cache percentage..."

```diff
-In a majority of cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command.
+In most cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command.
```
docs/source/release-notes.md (4)

17-19: Hyphenate “rowwise” for clarity.

“rowwise quantization” → “row-wise quantization” is the prevailing spelling in earlier sections of the docs.

```diff
-  - Added support for fp8 rowwise quantization
+  - Added support for fp8 row-wise quantization
```

30-33: Capitalize CUDA & tidy wording.

Minor consistency fixes: CUDA should be uppercase, and Eagle3 should keep its capital E.

```diff
-  - Added Piecewise cuda graph support for MLA
-  - Added model-agnostic one-engine eagle3
+  - Added piece-wise CUDA graph support for MLA
+  - Added model-agnostic one-engine Eagle3
```

37-44: Promote “Benchmark:” to a sub-heading or remove the colon.

The line “Benchmark:” is rendered as a bullet but visually looks like a heading. Either:

- Make it a level-3 heading:

```diff
-  - Benchmark:
+  ### Benchmark
```

- Or drop the colon so it cleanly joins the bulleted list:

```diff
-  - Benchmark:
-  - Added all_reduce.py benchmark script for testing
+  - **Benchmark**
+  - Added all_reduce.py benchmark script for testing
```

Choose one for consistency with previous release sections.

81-83: Add missing space after period.

```diff
-  - Added Qwen3 support.Refer to “Qwen3” section in `examples/models/core/qwen/README.md`.
+  - Added Qwen3 support. Refer to “Qwen3” section in `examples/models/core/qwen/README.md`.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- docs/source/installation/linux.md (1 hunks)
- docs/source/performance/perf-overview.md (3 hunks)
- docs/source/quick-start-guide.md (2 hunks)
- docs/source/reference/support-matrix.md (2 hunks)
- docs/source/release-notes.md (1 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (10 hunks)
- tests/integration/defs/triton_server/test_triton.py (1 hunks)
- tests/integration/test_lists/test-db/l0_a100.yml (2 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1 hunks)
- tests/integration/test_lists/waives.txt (2 hunks)
- tests/unittest/_torch/test_fp8_per_tensor_scale_tllmg_gemm.py (1 hunks)
- tests/unittest/llmapi/_test_remote_mpi_session.sh (1 hunks)
- tests/unittest/llmapi/test_llm.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (0 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (1 hunks)
- tests/unittest/llmapi/test_mpi_session.py (1 hunks)
- triton_backend/ci/L0_backend_trtllm/custom_metrics_verification_tests.py (4 hunks)
- triton_backend/ci/L0_backend_trtllm/test.sh (1 hunks)
- triton_backend/inflight_batcher_llm/scripts/build.sh (1 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
🧰 Additional context used
🧠 Learnings (3)
tests/unittest/llmapi/test_llm_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
docs/source/release-notes.md (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Shellcheck (0.10.0)
triton_backend/inflight_batcher_llm/scripts/build.sh
[warning] 55-55: Quote this to prevent word splitting.
(SC2046)
triton_backend/ci/L0_backend_trtllm/test.sh
[warning] 233-233: Quote this to prevent word splitting.
(SC2046)
🪛 markdownlint-cli2 (0.17.2)
docs/source/performance/perf-overview.md
66-66: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
67-67: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 LanguageTool
docs/source/performance/perf-overview.md
[style] ~243-~243: ‘a majority of’ might be wordy. Consider a shorter alternative.
Context: ... - 2048 - 4096 - 8192 ``` In a majority of cases, we also use a higher KV cache pe...
(EN_WORDINESS_PREMIUM_A_MAJORITY_OF)
🪛 Ruff (0.12.2)
tests/integration/defs/triton_server/test_triton.py
511-511: Line too long (245 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (34)
docs/source/installation/linux.md (1)
35-35
: LGTM: Important compliance notice added.This legal disclaimer appropriately informs users about third-party software downloads and license review requirements, which is essential for compliance.
tests/integration/test_lists/test-db/l0_a100.yml (1)
17-17
: LGTM: Appropriate test suite reorganization.Moving
test_mpi_session.py
to the PyTorch backend pre_merge stage aligns with enabling this test on Blackwell GPUs and ensures MPI functionality is tested earlier in the pipeline.docs/source/reference/support-matrix.md (2)
28-29
: LGTM: New Qwen3 model support properly documented.The addition of
Qwen3ForCausalLM
andQwen3MoeForCausalLM
entries with correct HuggingFace examples and language-only modality designation is consistent with the existing documentation format.
77-77
: LGTM: TensorRT backend reference updated appropriately.The link text correctly reflects the expanded Qwen model support including Qwen3 alongside the existing versions.
tests/unittest/llmapi/test_mpi_session.py (1)
71-71
: LGTM: Explicit working directory ensures test robustness.Setting
cwd
to the test file's directory makes the subprocess execution more predictable and prevents issues when the test is run from different working directories.tests/unittest/llmapi/test_llm_pytorch.py (1)
9-14
: LGTM: Import cleanup aligns with test refactoring.Removing the unused
_test_llm_capture_request_error
import is consistent with the test function removal and simplification mentioned in the summary.tests/integration/test_lists/waives.txt (1)
86-86
: LGTM! Enabling MPI session test on Blackwell hardware.The removal of the waiver for
unittest/llmapi/test_mpi_session.py
on Blackwell platforms (B200_PCIe and B200) enables more comprehensive test coverage. This aligns with the broader changes mentioned in the AI summary to support MPI session testing across different hardware configurations.tests/unittest/llmapi/_test_remote_mpi_session.sh (1)
10-10
: LGTM! Adding root execution support for CI environments.The addition of the
--allow-run-as-root
flag to thempirun
command is appropriate for testing environments where processes may need to run with elevated privileges, such as in containerized CI systems.tests/unittest/llmapi/test_llm_multi_gpu.py (1)
469-469
: LGTM! Simplifying function call by removing explicit parameter.The removal of the
pytorch_backend=False
argument simplifies the function call to rely on default parameter handling, which aligns with the consolidation of error capture test logic mentioned in the AI summary.tests/unittest/_torch/test_fp8_per_tensor_scale_tllmg_gemm.py (1)
103-103
: LGTM! Tolerance adjustment aligns with FP8 quantization enhancements.The relaxed tolerance thresholds (from 1e-3 to 1e-2) in the assertion are appropriate given the FP8 quantization improvements and performance enhancements mentioned in the broader update. This ensures the test remains robust while accommodating the numerical differences introduced by the updated FP8 implementation.
tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1)
18-18
: LGTM! Test configuration addition is consistent.The new timeout entry for
TestDeepSeekR1::test_nvfp4_multi_gpus[latency_trtllmgen]
follows the established pattern and uses a consistent 180-second timeout value.docs/source/quick-start-guide.md (2)
11-12
: Excellent addition for legal compliance.The license warning appropriately informs users about third-party dependencies and their licensing obligations before installation.
97-99
: Good documentation improvement.Adding the language specifier to the code block enhances documentation clarity and enables proper syntax highlighting.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1219-1220
: LGTM - Improved specificity in dummy request handling.This change refines the conditional logic to be more granular, excluding only specific types of dummy requests (
is_attention_dp_dummy
oris_cuda_graph_dummy
) rather than all dummy requests. This improvement allows other types of dummy requests to have their token IDs processed when needed, which aligns with the broader PyTorch backend improvements mentioned in the summary.tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
650-650
: LGTM: Memory optimization adjustments align with performance improvements.

The systematic reduction of `free_gpu_memory_fraction` from 0.9 to 0.75 in KvCacheConfig instances is consistent with the broader optimization efforts mentioned in the PR summary. This change should help prevent OOM issues in CI environments while maintaining test coverage (the sketch below shows what the change looks like at a call site).

However, I notice two instances (lines 781 and 945) in the `test_cute_dsl_fp8_block_scales` methods still use 0.9. Please verify whether this is intentional due to specific memory requirements for CUTE DSL operations.

Also applies to: 690-690, 728-728, 816-816, 841-841, 882-882, 1073-1073, 1124-1124, 1181-1181
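A hedged sketch of this kind of call-site change; the import path follows the `tensorrt_llm/llmapi/llm_args.py` reference cited in this review, and any other constructor arguments are omitted:

```python
# Sketch only: the reviewed change lowers the KV-cache memory fraction at test call sites.
from tensorrt_llm.llmapi.llm_args import KvCacheConfig  # path per the code-graph reference above

# Before: KvCacheConfig(free_gpu_memory_fraction=0.9)
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.75)
print(kv_cache_config.free_gpu_memory_fraction)
```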
tensorrt_llm/llmapi/llm.py (1)
563-563
: LGTM: Consistent reference to max_seq_len in error message.This fix ensures the error message references the same
max_seq_len
value that was calculated and used in the validation logic above (lines 554-555), improving consistency and accuracy.docs/source/performance/perf-overview.md (5)
31-65
: LGTM: Performance data updates for Llama 3.3 70B FP4 and Llama 3.1 405B FP4. The updated throughput numbers and table formatting look consistent with the existing benchmark structure. The data provides comprehensive coverage across different tensor parallel sizes and input/output sequence length combinations.
69-69
: LGTM: Addition of Llama 3.3 70B FP8 model to the list. The model addition is consistent with the performance data updates in the tables below.
71-71
: LGTM: Addition of Llama 4 Maverick model to the list. This aligns with the new performance section added for this model.
75-143
: LGTM: Comprehensive performance data updates for FP8 models. The updated throughput numbers cover multiple models (Llama 3.1 8B FP8, Llama 3.3 70B FP8, Llama 3.1 405B FP8) and the new Llama 4 Maverick FP8 section. All data follows the consistent table format and provides good coverage across different hardware configurations (H200, H100) and parallelism settings.
219-220
: LGTM: Added clarification about benchmark configuration. The note about v0.20 benchmark data collection configuration helps users understand which benchmark version and settings were used for the data.
tests/unittest/llmapi/test_llm.py (2)
2093-2107
: LGTM: Good test simplification! The refactoring removes conditional backend logic and unifies error handling expectations, making the test more focused and maintainable. The change from conditionally expecting `ValueError` vs `RequestError` to always expecting `RequestError` aligns with the unified error handling approach mentioned in the PR summary.
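The unified expectation boils down to a single `pytest.raises` block, roughly as sketched below; the import location of `RequestError` and the shape of the failing call are assumptions based on the review, not the exact test code.

```python
import pytest
from tensorrt_llm.llmapi import RequestError  # assumed export location


def check_llm_error_capture(llm, oversized_prompt):
    # Regardless of backend, an invalid or oversized request is now expected
    # to surface as a RequestError rather than a backend-specific ValueError.
    with pytest.raises(RequestError):
        llm.generate(oversized_prompt)
```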
2109-2110
: LGTM: Consistent with helper function changes. The function call is correctly updated to match the simplified helper function signature.
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
41-41
: LGTM: Clean import addition for P2P capability checking. The import of `can_access_peer` is properly placed and necessary for the P2P support functionality added to the DeepSeek decoder layer.
606-606
: LGTM: Efficient P2P capability initialization. Computing P2P support once during initialization is more efficient than checking it repeatedly during forward passes. The placement is appropriate after mapping assignment.
797-801
: LGTM: Proper integration of P2P support in MoE finalization logic. The addition of `self.is_p2p_supported` to the `do_finalize` condition correctly ensures that MoE output finalization occurs immediately when P2P is not available, preventing the optimization path from being taken inappropriately.
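Taken together, the three comments describe a pattern along the lines of the sketch below; the import path, the `can_access_peer` signature, and the surrounding class are assumptions made for illustration only.

```python
# Condensed illustration of caching P2P capability and using it to gate
# deferred MoE finalization. Import path and signature are assumed here.
from tensorrt_llm._torch.utils import can_access_peer  # assumed location


class DecoderLayerSketch:
    def __init__(self, mapping):
        self.mapping = mapping
        # Query peer-to-peer capability once at construction time rather than
        # on every forward pass.
        self.is_p2p_supported = can_access_peer(mapping)  # assumed signature

    def moe_do_finalize(self, deferred_path_ok: bool) -> bool:
        # Finalize the MoE output immediately unless the deferred path is
        # usable, which now also requires P2P access between the GPUs involved.
        return not (deferred_path_ok and self.is_p2p_supported)
```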
triton_backend/ci/L0_backend_trtllm/custom_metrics_verification_tests.py (7)
85-85
: LGTM: Method signature simplification. Removing the `is_v1` parameter simplifies the metrics parsing by treating all metrics uniformly without version-specific branching.
94-94
: LGTM: Consistent method call update. The method call correctly removes the `is_v1` parameter to match the simplified method signature.
98-98
: LGTM: Consistent signature simplification. The method signature correctly removes version-specific parameters, maintaining consistency with the unified metrics handling approach.
109-109
: LGTM: Defensive assertion for metric consolidation. The assertion correctly validates that v1-specific metrics are not present in the unified metrics handling, ensuring the consolidation is working as expected.
112-112
: LGTM: Base test method simplification. The removal of the `is_v1` parameter from the main test orchestration method completes the transition to unified metrics handling.
114-114
: LGTM: Consistent method call in base test. The method call correctly removes the `is_v1` parameter, maintaining consistency with the simplified parsing method signature.
141-165
: LGTM: Complete test method consolidation. All test methods have been consistently updated to remove version-specific parameters, completing the transition to unified metrics verification. The consolidation from separate v1/v2 tests to a single approach simplifies the test suite while maintaining coverage.
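Taken together, the consolidation amounts to a version-agnostic parser plus a defensive check, roughly as sketched below; the function, label, and metric names are placeholders inferred from the review rather than the actual test code.

```python
# Placeholder names; the real test parses Triton's custom TRT-LLM metrics.
V1_ONLY_LABELS = ("v1_specific_metric",)


def parse_custom_metrics(metrics_text: str) -> dict:
    """Parse metrics uniformly, with no is_v1 flag or version-specific branching."""
    metrics = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    # Defensive check: the unified path must not expose v1-only metrics.
    assert not any(label in name for name in metrics for label in V1_ONLY_LABELS)
    return metrics
```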
docs/source/release-notes.md (1)
46-52
: Confirm version alignment across docs. The infrastructure table bumps PyTorch to 2.7.1 and TensorRT to 10.11, but older sections (0.20.x) list 2.7.0 / 10.10.0.
Double-check that the numbers here match the actual container tags and wheel metadata for 0.21.0 to avoid confusion.
PR_Github #12508 [ skip ] triggered by Bot
PR_Github #12508 [ skip ] completed with state
Mass integration of commits from the release 0.21 branch. (part 4)
Please help review your commit!
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
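For example, a pre-merge run that disables fail-fast and adds the multi-GPU tests can be requested with a comment like the following; the flag combination is only an illustration of the options listed above.

/bot run --disable-fail-fast --add-multi-gpu-test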
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
Documentation
New Features
Bug Fixes
Chores
Refactor