[None][chore] Mass integration of release/1.0 - 3rd #7519
Conversation
e87278d to 609b545 (Compare)
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (1)
1-3: Add required NVIDIA copyright header (2025)

Guidelines require the NVIDIA copyright header on all .py files.

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 import asyncio
 from difflib import SequenceMatcher
 from pathlib import Path
```

cpp/tensorrt_llm/batch_manager/logitsPostProcessor.cpp (1)
73-81: UB/crash: dereferencing possibly empty optional logitsPostProcessorBatched

If any request sets mApplyLogitsPostProcessorBatched and the caller didn't pass a batched processor (uses the default nullopt), (*logitsPostProcessorBatched)(...) is undefined behavior.

Apply:

```diff
     // Invoke batched processor
     if (!reqIdsVec.empty())
     {
         logitsPostProcessorIsApplied = true;
-        if (replicateLogitsPostProcessor || worldConfig.isFirstTensorParallelRank())
-        {
-            (*logitsPostProcessorBatched)(reqIdsVec, logitsVec, beamTokensVec, stream, clientIdsVec);
-        }
+        if (!logitsPostProcessorBatched)
+        {
+            TLLM_THROW("Batched logits post-processor requested but not provided.");
+        }
+        if (replicateLogitsPostProcessor || worldConfig.isFirstTensorParallelRank())
+        {
+            (*logitsPostProcessorBatched)(reqIdsVec, logitsVec, beamTokensVec, stream, clientIdsVec);
+        }
     }
```

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)
246-257: Data race: reading mReadyResponses without holding its mutex

!mReadyResponses.empty() is checked without mResponderMutex. Access to mReadyResponses must be protected consistently.

```diff
-    if (!isSending() && !mReadyResponses.empty())
+    bool hasReady = false;
+    {
+        std::unique_lock lk(mResponderMutex);
+        hasReady = !mReadyResponses.empty();
+    }
+    if (!isSending() && hasReady)
     {
         auto const& requestInfo = mSender->recvRequestInfo();
```

tests/integration/defs/disaggregated/test_disaggregated.py (2)
297-301: Avoid masking exceptions if process startup fails

server_proc/workers_proc may be undefined if context setup throws, causing UnboundLocalError in finally.

```diff
-    finally:
-        server_proc.terminate()
-        workers_proc.terminate()
-        server_proc.wait()
-        workers_proc.wait()
+    finally:
+        sp = locals().get("server_proc")
+        wp = locals().get("workers_proc")
+        if sp is not None:
+            sp.terminate()
+            sp.wait()
+        if wp is not None:
+            wp.terminate()
+            wp.wait()
```
1289-1293: Same cleanup robustness for benchmark runner

Mirror the safe termination pattern here as well.

```diff
-    finally:
-        server_proc.terminate()
-        workers_proc.terminate()
-        server_proc.wait()
-        workers_proc.wait()
+    finally:
+        sp = locals().get("server_proc")
+        wp = locals().get("workers_proc")
+        if sp is not None:
+            sp.terminate()
+            sp.wait()
+        if wp is not None:
+            wp.terminate()
+            wp.wait()
```

tensorrt_llm/llmapi/mpi_session.py (1)
418-449: Fix race on results aggregation and avoid silent drops when send retries are exhausted

self.results.append(...) and the length check run in multiple callbacks without synchronization; two near-simultaneous completions can both observe the "all done" condition and send twice, or clear while the other thread is still pickling, yielding empty/partial payloads. ZeroMqQueue.put_noblock now swallows EAGAIN after retries and only logs an error; the surrounding try/except won't detect failure. Clearing self.results unconditionally can lose results and cause the client to hang indefinitely.

Apply this diff to make the callback idempotent and thread-safe, and to avoid clearing until after a successful handoff. It also aligns retries with other call sites and adds a small backoff:

```diff
@@
-    def mpi_future_callback(self, future):
-        print_colored_debug(f"rank{global_mpi_rank()} got future: {future}\n",
-                            "red")
-        if future.exception() is not None:
-            print_colored_debug(
-                f"mpi_future got exception: {future.exception()}, quitting\n",
-                "red")
-            self.queue.put(future.exception())
-            return
-
-        result = future.result()
-        self.results.append(result)
-        print_colored_debug(
-            f"RemoteMpiCommSessionServer working status: {len(self.results)}/{self.num_results}\n",
-            "grey")
-        if len(self.results) == self.num_results:
-            print_colored_debug(
-                f"RemoteMpiCommSessionServer received all results, sending to client\n",
-                "green")
-            try:
-                self.queue.put_noblock(self.results, retry=2)
-            except zmq.ZMQError as e:
-                # The client could be shutdown first.
-                if e.errno == zmq.EAGAIN:
-                    pass
-                else:
-                    raise e
-
-            print_colored_debug(
-                f"RemoteMpiCommSessionServer sent results to client\n", "green")
-            self.results.clear()
+    def mpi_future_callback(self, future):
+        print_colored_debug(f"rank{global_mpi_rank()} got future: {future}\n", "red")
+
+        to_send = None
+        # Aggregate results and decide send-once under a lock
+        with self._results_lock:
+            if future.exception() is not None:
+                print_colored_debug(
+                    f"mpi_future got exception: {future.exception()}, quitting\n", "red")
+                if not self._results_sent:
+                    to_send = future.exception()
+                    self._results_sent = True
+            else:
+                result = future.result()
+                self.results.append(result)
+                print_colored_debug(
+                    f"RemoteMpiCommSessionServer working status: {len(self.results)}/{self.num_results}\n",
+                    "grey")
+                if len(self.results) == self.num_results and not self._results_sent:
+                    to_send = list(self.results)  # snapshot to avoid mutation while pickling
+                    self._results_sent = True
+
+        if to_send is None:
+            return
+
+        print_colored_debug(
+            f"RemoteMpiCommSessionServer sending {'exception' if not isinstance(to_send, list) else 'results'} to client\n",
+            "green")
+        try:
+            # Slightly longer retry for robustness; keep non-blocking semantics in callback context
+            self.queue.put_noblock(to_send, retry=4, wait_time=0.01)
+        except zmq.ZMQError as e:
+            if e.errno != zmq.EAGAIN:
+                raise
+        else:
+            if isinstance(to_send, list):
+                # Clear only after handoff attempt; we can't detect failure with current API,
+                # so rely on the idempotent send-once gate.
+                with self._results_lock:
+                    self.results.clear()
+        print_colored_debug(f"RemoteMpiCommSessionServer send complete\n", "green")
```

Add the following support changes outside this hunk:

```python
# In RemoteMpiCommSessionServer.__init__ (near line ~356)
self._results_lock = threading.Lock()
self._results_sent = False

# In RemoteMpiCommSessionServer.serve() just before scheduling futures (near lines ~409-416)
with self._results_lock:
    self.results = []
    self._results_sent = False
```

Optionally, consider making ZeroMqQueue.put_noblock return a bool to signal success so the server can decide whether to retain and retry later instead of unconditionally proceeding. I can draft that change if desired.
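For reference, a minimal sketch of that optional follow-up (a put_noblock that reports success), modeled on the put_noblock variant proposed further down for tensorrt_llm/executor/ipc.py; the exact field names (socket, use_hmac_encryption, _sign_data) are taken from that snippet and should be treated as assumptions:

```python
def put_noblock(self, obj, *, retry: int = 1, wait_time: float = 0.001) -> bool:
    """Non-blocking send with bounded retries; returns True only if the send succeeded."""
    self.setup_lazily()
    data = pickle.dumps(obj)  # nosec B301
    if self.use_hmac_encryption:
        data = self._sign_data(data)
    while True:
        try:
            self.socket.send(data, flags=zmq.NOBLOCK)
            return True
        except zmq.Again:
            if retry <= 0:
                logger.error("Failed to send object of type %s", type(obj).__name__)
                return False
            retry -= 1
            time.sleep(wait_time)
```

With that in place, mpi_future_callback could keep self.results (and leave the send-once flag unset) whenever the call returns False, instead of assuming the handoff succeeded.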
tests/integration/defs/perf/pytorch_model_config.py (1)

30-31: Fix Python 3.8 typing: list[str] is not supported without postponed evaluation

The repository targets Python 3.8+. Using list[str] will error at import time on 3.8. Switch to typing.List.

```diff
-def get_model_yaml_config(model_label: str,
-                          lora_dirs: list[str] = None) -> dict:
+from typing import Optional, List  # at file top
+
+def get_model_yaml_config(model_label: str,
+                          lora_dirs: Optional[List[str]] = None) -> dict:
```

Add the import near the other imports in this module if not present:

```python
from typing import Optional, List
```

tensorrt_llm/_torch/attention_backend/interface.py (1)
304-306: Also propagate buffers (and max_draft_tokens) to cross sub-metadata

Without this, cross metadata won't share the preallocated buffers and may reallocate.

```diff
-        cuda_graph_metadata.cross = cuda_graph_metadata.cross.create_cuda_graph_metadata(
-            max_batch_size, True)
+        cuda_graph_metadata.cross = cuda_graph_metadata.cross.create_cuda_graph_metadata(
+            max_batch_size,
+            sub_cross_metadata=True,
+            max_draft_tokens=max_draft_tokens,
+            buffers=buffers,
+        )
```

tensorrt_llm/_torch/modules/gated_mlp.py (1)
1-3: Add NVIDIA copyright header (2025)

This file is missing the required NVIDIA copyright header for 2025. Apply at file top:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (2)
1-5: Add NVIDIA copyright header (2025)

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

138-139: Disconnect intercommunicator on worker shutdown to prevent resource leaks

```diff
     llm.shutdown()
+    try:
+        intercomm.Disconnect()
+    except Exception as e:
+        print(f"Worker {rank} intercomm disconnect error: {e}", flush=True)
```
🧹 Nitpick comments (50)
tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (1)
20-20: Prefer xfail over skip to surface unexpected passes; also avoid invalid URL in reason

Using xfail keeps the test visible and reports XPASS if the bug is fixed unintentionally. Drop the "https://" prefix to avoid broken-link checkers.

```diff
-@pytest.mark.skip(reason="https://nvbugs/5470782")
+@pytest.mark.xfail(reason="nvbugs/5470782", strict=False)
```

jenkins/L0_Test.groovy (2)
2275-2282: Refactor is fine; prefer not to reach back into values here

Using values[2] is behaviorally equivalent to cpu_arch, but it couples this inner block to the outer list indexing. For readability and to avoid accidental drift, use the already-derived cpu_arch.

Also, please double-check that the CUDA repo path for Ubuntu 24.04 should remain sbsa for ARM. If NVIDIA switched to arm64 for 24.04, this would break.

Apply:

```diff
-            def ubuntu_version = key.contains("UB2404") ? "ubuntu2404" : "ubuntu2204"
-            def platform = values[2] == X86_64_TRIPLE ? "x86_64" : "sbsa"
+            def ubuntu_version = key.contains("UB2404") ? "ubuntu2404" : "ubuntu2204"
+            def platform = cpu_arch == X86_64_TRIPLE ? "x86_64" : "sbsa"
```

351-381: Unify dockerGPUOption naming (stray dockerGpuOption is unused)

There's a casing mismatch: dockerGpuOption is declared but dockerGPUOption is used/assigned. Keep one name to avoid confusion and accidental scope issues.

```diff
-    def dockerGpuOption = ""
+    def dockerGPUOption = ""
 @@
-        dockerGPUOption = sh(script: """
+        dockerGPUOption = sh(script: """
```

tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
6-7: Fail fast on unsupported SM and include details in the error

Good guard. Two improvements:

- Move the SM check before super().__init__ to avoid unnecessary init on unsupported GPUs.
- Include the detected SM in the error to aid debugging. Also confirm intent vs. docstring ("SM100 only") since the guard only blocks SM>=120.

Apply within this block to improve the message:

```diff
-        sm_version = get_sm_version()
-        if sm_version >= 120:
-            raise NotImplementedError(
-                "TRTLLMGenFusedMoE does not support SM120 and above.")
+        sm_version = get_sm_version()
+        if sm_version >= 120:
+            raise NotImplementedError(
+                f"TRTLLMGenFusedMoE does not support SM{sm_version} (>= 120).")
```

And move the check to run before super().__init__ (outside the selected range, example placement shown):

```python
# Place right at the start of __init__, before calling super().__init__
sm_version = get_sm_version()
if sm_version >= 120:
    raise NotImplementedError(f"TRTLLMGenFusedMoE does not support SM{sm_version} (>= 120).")
```

Also applies to: 83-87
tensorrt_llm/serve/openai_disagg_server.py (1)
325-327: Strengthen ctx_request_id validation

The None-check is good; also validate type and positivity to prevent downstream issues in metrics and routing.

```diff
-        if choices[0].disaggregated_params.ctx_request_id is None:
-            raise ValueError("Invalid disaggregated params in context phase response.")
+        ctx_id = choices[0].disaggregated_params.ctx_request_id
+        if ctx_id is None or not isinstance(ctx_id, int) or ctx_id <= 0:
+            raise ValueError("Invalid ctx_request_id in context phase response.")
```

tensorrt_llm/_torch/model_config.py (1)
497-499: Guard against non-divisible MLP size before TP split

Dividing by tp_size without checking divisibility can silently truncate and misconfigure bindings.

```diff
-            mlp_hidden_size = self._infer_nemotron_ffn_mult(
-            ) // self.mapping.tp_size
+            mlp_hidden_size_global = self._infer_nemotron_ffn_mult()
+            if mlp_hidden_size_global % self.mapping.tp_size != 0:
+                raise ValueError(
+                    f"Inferred mlp_hidden_size {mlp_hidden_size_global} is not divisible by tp_size {self.mapping.tp_size}"
+                )
+            mlp_hidden_size = mlp_hidden_size_global // self.mapping.tp_size
```

tests/integration/test_lists/waives.txt (1)
291-297: Dedupe duplicate SKIP entries

- tests/integration/test_lists/waives.txt lines 244 & 283: llmapi/test_llm_examples.py::test_llmapi_speculative_decoding_mtp
- tests/integration/test_lists/waives.txt lines 264 & 291: test_e2e.py::test_ptp_quickstart_multimodal[llava-v1.6-mistral-7b-llava-v1.6-mistral-7b-hf-image-False]

Remove one occurrence of each to keep the waiver list concise.
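If it helps, a small sketch for spotting any remaining duplicate waiver lines; it assumes the file is a plain line-per-entry list and that the script is run from the repo root:

```python
from collections import Counter
from pathlib import Path

lines = Path("tests/integration/test_lists/waives.txt").read_text().splitlines()
# Report every non-empty line that appears more than once, with its count.
for line, count in Counter(l for l in lines if l.strip()).items():
    if count > 1:
        print(f"{count}x {line}")
```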
tensorrt_llm/bench/dataclasses/configuration.py (1)

87-104: Add NVIDIA copyright header

Per repo guidelines, prepend the 2025 NVIDIA header at file top.

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+"""
 existing imports...
```

tensorrt_llm/executor/ipc.py (2)
128-158: Polish put_noblock: docstring D205 and avoid assert for validation

- Fix D205: add a blank line after the summary line.
- Replace assert with ValueError (asserts can be stripped with -O).
- Optional: use a loop instead of recursion; include queue name in the error for easier traceability.

```diff
     def put_noblock(self, obj: Any, *, retry: int = 1, wait_time: float = 0.001):
-        '''
-        Put an object into the queue without blocking, and retry if the send fails.
-        NOTE: It won't raise any error if the send fails.
+        '''
+        Put an object into the queue without blocking, and retry if the send fails.
+
+        NOTE: It won't raise any error if the send fails.
 
         Parameters:
             obj (Any): The object to send.
             retry (int): The number of times to retry sending the object.
             wait_time (float): The time to wait before retrying.
         '''
-        assert retry >= 0 and retry <= 10, "Retry must be between 0 and 10, adjust the wait_time if needed"
+        if not (0 <= retry <= 10):
+            raise ValueError("retry must be between 0 and 10; increase wait_time if needed")
 
         self.setup_lazily()
 
         with nvtx_range_debug("send", color="blue", category="IPC"):
             data = pickle.dumps(obj)  # nosec B301
 
         if self.use_hmac_encryption:
             data = self._sign_data(data)
 
-        try:
-            self.socket.send(data, flags=zmq.NOBLOCK)
-        except zmq.Again:
-            if retry > 0:
-                time.sleep(wait_time)
-                self.put_noblock(obj, retry=retry - 1, wait_time=wait_time)
-            else:
-                logger.error(f"Failed to send object: {obj}")
+        while True:
+            try:
+                self.socket.send(data, flags=zmq.NOBLOCK)
+                break
+            except zmq.Again:
+                if retry > 0:
+                    retry -= 1
+                    time.sleep(wait_time)
+                    continue
+                logger.error(f"[{self.name or 'ipc'}] Failed to send object: {type(obj).__name__}")
+                break
```
1-10: Add NVIDIA copyright header

This file is missing the standard header; please add.

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
 """
 existing imports...
```

examples/llm-api/_tensorrt_engine/quickstart_example.py (2)
7-9: Prefer constructor kwargs (if supported) and keep namespaced usage

Cleaner and immutable config if BuildConfig supports kwargs; also aligns with namespaced import.

Apply:

```diff
-    build_config = BuildConfig()
-    build_config.max_batch_size = 256
-    build_config.max_num_tokens = 1024
+    build_config = trtllm.BuildConfig(max_batch_size=256, max_num_tokens=1024)
```

If kwargs aren't supported, keep attribute assignment but use trtllm.BuildConfig().

13-14: Use namespaced symbols consistently

Keep SamplingParams and any other symbols under the module namespace per guidelines.

Apply:

```diff
-    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
-              build_config=build_config)
+    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+              build_config=build_config)
 @@
-    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+    sampling_params = trtllm.SamplingParams(temperature=0.8, top_p=0.95)
```

Also applies to: 24-24
cpp/tensorrt_llm/batch_manager/logitsPostProcessor.cpp (1)
42-47: Pre-size vectors to reduce reallocations in hot path

Reserve capacity to avoid repeated allocations when many requests use batched processing.

Apply:

```diff
     // Arguments for batched processor
     std::vector<LlmRequest::RequestIdType> reqIdsVec;
     std::vector<LlmRequest::TensorPtr> logitsVec;
     std::vector<std::reference_wrapper<LlmRequest::BeamTokens const>> beamTokensVec;
     std::vector<std::optional<LlmRequest::RequestIdType>> clientIdsVec;
+    reqIdsVec.reserve(inputBuffers.decoderRequests.size());
+    logitsVec.reserve(inputBuffers.decoderRequests.size());
+    beamTokensVec.reserve(inputBuffers.decoderRequests.size());
+    clientIdsVec.reserve(inputBuffers.decoderRequests.size());
```

Also applies to: 66-71
cpp/include/tensorrt_llm/batch_manager/logitsPostProcessor.h (4)
50-50: Avoid const& optional with std::nullopt default; also note ABI change

Binding a const& to a temporary optional via the default std::nullopt is legal but non-obvious and changes the public ABI. Passing by value is simpler, avoids lifetime surprises, and copies at most one small optional. If the intent is to avoid copying the contained std::function, prefer pass-by-value and move inside the definition.

```diff
-        std::optional<LogitsPostProcessorBatched> const& logitsPostProcessorBatched = std::nullopt) const;
+        std::optional<LogitsPostProcessorBatched> logitsPostProcessorBatched = std::nullopt) const;
```

Please confirm no out-of-tree consumers rely on the previous by-value signature.
1-16: Update copyright year to include 2025

Repo guideline says prepend current year. Header shows 2022-2024.

```diff
- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
18-18: Headers should use include guards, not only #pragma once

Guideline: use guards of the form TRTLLM_<FILENAME>_H.

Example (apply at file top/bottom):

```diff
-#pragma once
+#ifndef TRTLLM_LOGITSPOSTPROCESSOR_H
+#define TRTLLM_LOGITSPOSTPROCESSOR_H
 ...
+#endif // TRTLLM_LOGITSPOSTPROCESSOR_H
```
44-44: Constant naming style

Guideline: constants use k prefix with UPPER_SNAKE. Consider renaming name to kLOGITS_POST_PROCESSOR_NAME.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)
172-176: Ensure background task termination and surface exceptions

Destructor sets mTerminate but doesn't explicitly join/observe mResponseFuture. It's safer to wait and surface exceptions from response().

```diff
     ~Impl()
     {
         terminate();
+        if (mResponseFuture.valid())
+        {
+            mResponseFuture.get(); // join and surface any exception
+        }
     }
```
213-221: Detached per-send threads may balloon; prefer a pool or reuse async worker

Current path spawns a detached thread per response (when parallel send is enabled). This can cause unbounded thread creation under load.

- Reuse a fixed-size thread pool (or reuse the existing async worker).
- Alternatively, gate parallel sends with a semaphore (see the sketch below).
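Illustrative only (the file in question is C++): a minimal Python sketch of the semaphore-gating idea, bounding how many sends can be in flight at once. The names and the limit of 8 are assumptions, not part of the codebase:

```python
import threading

MAX_PARALLEL_SENDS = 8  # assumed bound; tune to the deployment
send_slots = threading.BoundedSemaphore(MAX_PARALLEL_SENDS)

def send_response_async(send_fn, response):
    """Run send_fn(response) on its own thread, but never more than MAX_PARALLEL_SENDS at once."""
    send_slots.acquire()  # blocks instead of spawning an unbounded number of workers

    def _worker():
        try:
            send_fn(response)
        finally:
            send_slots.release()

    threading.Thread(target=_worker, daemon=True).start()
```

The same structure maps to C++ with a counting semaphore (or a fixed-size thread pool) guarding the detached sends.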
docs/source/commands/trtllm-serve/trtllm-serve.rst (2)
204-210: Grammar/clarity fixes for PyTorch metrics note

Subject-verb agreement and clarity.

```diff
-  The metrics endpoint for the default PyTorch backend are in beta and are not as comprehensive as those for the TensorRT backend.
-  Some fields, such as CPU memory usage, are not yet available for the PyTorch backend.
-  Enabling ``enable_iter_perf_stats`` in the PyTorch backend can slightly impact performance, depending on the serving configuration.
+  The metrics endpoint for the default PyTorch backend is in beta and is not as comprehensive as the TensorRT backend.
+  Some fields (for example, CPU memory usage) are not yet available for the PyTorch backend.
+  Enabling ``enable_iter_perf_stats`` in the PyTorch backend can slightly impact performance, depending on the serving configuration.
```
219-227: Tighten wording around polling guidance

Minor style improvements and remove passive voice.

```diff
-After sending at least one inference request to the server, you can fetch runtime iteration statistics by polling the ``/metrics`` endpoint.
-Since the statistics are stored in an internal queue and removed once retrieved, it's recommended to poll the endpoint shortly after each request and store the results if needed.
+After you send at least one inference request, poll the ``/metrics`` endpoint to fetch runtime-iteration statistics.
+Because statistics are kept in an internal queue and removed once retrieved, poll shortly after each request and persist results if needed.
```
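As a concrete illustration of that guidance, a minimal polling sketch; the host/port and output file are assumptions, and only the ``/metrics`` endpoint itself comes from the documentation above:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed trtllm-serve address

# Send one inference request first, then poll shortly afterwards, before the
# statistics are dropped from the server's internal queue.
resp = requests.get(f"{BASE_URL}/metrics", timeout=5)
resp.raise_for_status()
with open("iteration_stats.log", "a") as f:
    f.write(resp.text + "\n")  # persist the raw payload; parse as needed
```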
tensorrt_llm/llmapi/utils.py (4)

521-538: Prevent duplicate status tags in generated docs

If ApiParamTagger already prefixes the field description with a tag, this adds a second identical tag on the arg line. Guard to avoid duplication.

```diff
-        arg_line = f"{indent} {field_name} ({type_str}): "
-        if status := field_info.get("status", None):
-            arg_line += f":tag:`{status}` "
+        arg_line = f"{indent} {field_name} ({type_str}): "
+        if status := field_info.get("status", None):
+            # Avoid duplicating tag if description already contains it
+            if not (field_description or "").lstrip().startswith(f":tag:`{status}`"):
+                arg_line += f":tag:`{status}` "
```
484-485: Future-proof Pydantic schema access

Prefer model_json_schema() in Pydantic v2 with a v1 fallback.

```diff
-    schema = model.schema()
+    schema = (model.model_json_schema()
+              if hasattr(model, "model_json_schema")
+              else model.schema())
```
562-565: Fix D200: one-line docstring formatting

Remove leading space and use concise wording.

```diff
-    """ The main entry point to tag the api doc. """
+    """The main entry point to tag the API doc."""
```
574-589: Handle None descriptions and idempotency in description tagging

If description is None, current code yields "None" in docs; also re-tagging can double-prefix.

```diff
-        for field_name in field_names:
-            field = cls.model_fields[field_name]
-            cls.model_fields[
-                field_name].description = f":tag:`{tag}` {field.description}"
+        for field_name in field_names:
+            field = cls.model_fields[field_name]
+            desc = (field.description or "").strip()
+            prefix = f":tag:`{tag}` "
+            if not desc.startswith(prefix):
+                desc = prefix + desc
+            cls.model_fields[field_name].description = desc
         cls.model_rebuild(force=True)
```

tests/unittest/llmapi/test_utils.py (1)
31-34: Strengthen assertion to reduce false positives

Also assert that at least one specific LlmArgs field line carries the tag.

```diff
 def test_generate_api_docs_as_docstring():
-    doc = generate_api_docs_as_docstring(LlmArgs)
-    assert ":tag:`beta`" in doc, "the label is not generated"
+    doc = generate_api_docs_as_docstring(LlmArgs)
+    assert ":tag:`beta`" in doc, "the label is not generated"
+    # Optional: check a representative field is tagged
+    assert re.search(r"\s+world_size.*:.*:tag:`beta`", doc), "world_size not tagged"
```

tests/integration/defs/test_e2e.py (1)
1746-1753: Advance regex cursor using match.end() to avoid re-scan

Slicing from the end of group(1) re-scans the inner text; using match.end() is cleaner and marginally faster.

```diff
-        _, end = match.span(1)
-        results.append(match.group(2))
-        item = item[end:]
+        results.append(match.group(2))
+        item = item[match.end():]
```

tensorrt_llm/_torch/autotuner.py (2)
372-375: Log dedup key scope (optional)

If you want one warning per (op, shape) instead of per op, include shapes in the key. Current change is correct; this is optional.

```diff
-            logger.warning_once(
-                f"[Autotuner] Using the fallback tactic, due to cache miss on input shapes={input_shapes}",
-                key=custom_op)
+            logger.warning_once(
+                f"[Autotuner] Using the fallback tactic, due to cache miss on input shapes={input_shapes}",
+                key=("cache_miss", custom_op, input_shapes))
```
1-1: License header consistency (nit)

Other Python files carry the NVIDIA SPDX header; this one doesn't. Consider adding for consistency/compliance.
tensorrt_llm/executor/proxy.py (1)
352-355: Optional: tune wait_time to reduce spin

Non-blocking retries default to 1ms sleeps. If shutdown storms persist, consider a slightly larger wait_time (e.g., 5-10ms).

```diff
-            self.request_queue.put_noblock(None, retry=4)
+            self.request_queue.put_noblock(None, retry=4, wait_time=0.005)
```

tensorrt_llm/logger.py (1)
111-116: Avoid using assert for runtime argument validation

asserts can be stripped with -O; prefer an explicit exception for API safety.

```diff
-    def log_once(self, severity, *msg, key):
-        assert key is not None, "key is required for log_once"
+    def log_once(self, severity, *msg, key):
+        if key is None:
+            raise ValueError("key is required for log_once")
         if key not in self._appeared_keys:
             self._appeared_keys.add(key)
             self.log(severity, *msg)
```

tensorrt_llm/_torch/models/modeling_qwen3.py (2)
161-178: Avoid hard-coding deep_gemm disable; make it configurable or scoped

Hard-coding disable_deep_gemm = True globally may regress performance where the bug doesn't manifest (e.g., non-FP8, non-SM100). Prefer a config/env toggle so it can be flipped without code changes.

Minimal change using an env var (defaults to disabled to preserve current behavior):

```diff
@@
-        # Qwen3 has accuracy issues with deep_gemm (see: https://nvbugspro.nvidia.com/bug/5461712)
-        disable_deep_gemm = True
+        # Qwen3 has accuracy issues with deep_gemm (see: https://nvbugspro.nvidia.com/bug/5461712)
+        # Allow override via env; default to disabling for safety.
+        disable_deep_gemm = os.getenv("TLLM_DISABLE_DEEP_GEMM_QWEN3", "1") == "1"
@@
             config=model_config,
             disable_deep_gemm=disable_deep_gemm,
         )
```

Add the import once at the top of the file:

```python
import os
```

If you prefer, we can instead plumb a disable_deep_gemm flag through ModelConfig and only disable when FP8 is active and sm == 100.
251-261: Mirror the configurability for MLP

Keep Attention/MLP behavior consistent by reusing the same toggle when constructing GatedMLP.

```diff
-        # Qwen3 has accuracy issues with deep_gemm (see: https://nvbugspro.nvidia.com/bug/5461712)
-        disable_deep_gemm = True
+        # Qwen3 has accuracy issues with deep_gemm
+        disable_deep_gemm = os.getenv("TLLM_DISABLE_DEEP_GEMM_QWEN3", "1") == "1"
@@
             config=model_config,
             disable_deep_gemm=disable_deep_gemm,
         )
```

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
856-881: Balance NVTX ranges: add range_pop for the pushes in Stage 2

Both branches push an NVTX range but never pop, which skews traces.

```diff
             if previous_batch is not None:
                 sample_state = previous_batch.sample_state
                 if not self.dist.is_last_pp_rank:
                     torch.cuda.nvtx.range_push(
                         "_handle_new_tokens_inter_pp")
@@
                 else:
                     torch.cuda.nvtx.range_push("_handle_new_tokens_last_pp")
                     sample_state.sampler_event.synchronize()
 
                 # Send tokens to next pp rank (w.r.t model forward direction)
                 # Second last rank does not need to since last rank has original decoded tokens
                 if not self.dist.is_second_last_pp_rank:
                     self.wait_on_pp_send_handles(prev_microbatch_id)
                     self.send_handles[
                         prev_microbatch_id] = self.dist.isend_object(
                             sample_state.host,
                             dest=self.dist.next_pp_rank,
                             tag=prev_microbatch_id)
+                torch.cuda.nvtx.range_pop()
```
tensorrt_llm/_torch/modules/linear.py (2)
616-631: Guard looks correct; add a quick SM cache to avoid repeated device queries

Minor: cache sm = get_sm_version() once per call (or at init) to avoid duplicate CUDA property queries in the hot path.

```diff
-        if get_sm_version() == 100 and not module.disable_deep_gemm:
+        sm = get_sm_version()
+        if sm == 100 and not module.disable_deep_gemm:
```
1598-1599: Public API grew: document disable_deep_gemm

Please add a short Google-style docstring entry for this constructor param and the new attribute to align with repo guidelines.

Also applies to: 1616-1617
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
603-606: OK to delegate to buffer-aware post-init; consider type annotation

_post_init_with_buffers(self, buffers) is fine. Consider annotating buffers: Optional[dict[str, list[torch.Tensor]]] for clarity.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)
1568-1572: Consolidate SM 120/121 skip and tighten condition

Use membership test and include actual SM in the message for quicker triage.

```diff
-        if moe_backend == "TRTLLM" and (get_sm_version() == 120
-                                        or get_sm_version() == 121):
-            pytest.skip(
-                "MOE TRTLLM backend does not support SM version 120 or 121")
+        if moe_backend == "TRTLLM":
+            sm = get_sm_version()
+            if sm in (120, 121):
+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")
```

1621-1624: Apply same SM check refactor here

Mirror the consolidated pattern to avoid duplication and repeated device queries.

```diff
-        if moe_backend == "TRTLLM" and (get_sm_version() == 120
-                                        or get_sm_version() == 121):
-            pytest.skip(
-                "MOE TRTLLM backend does not support SM version 120 or 121")
+        if moe_backend == "TRTLLM":
+            sm = get_sm_version()
+            if sm in (120, 121):
+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")
```

1895-1899: Same: compress SM 120/121 guard

Keep the message informative with the detected SM.

```diff
-        if moe_backend == "TRTLLM" and (get_sm_version() == 120
-                                        or get_sm_version() == 121):
-            pytest.skip(
-                "MOE TRTLLM backend does not support SM version 120 or 121")
+        if moe_backend == "TRTLLM":
+            sm = get_sm_version()
+            if sm in (120, 121):
+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")
```

2523-2527: Same: unify the TRTLLM MOE skip

Minor style and efficiency improvement.

```diff
-        if moe_backend == "TRTLLM" and (get_sm_version() == 120
-                                        or get_sm_version() == 121):
-            pytest.skip(
-                "MOE TRTLLM backend does not support SM version 120 or 121")
+        if moe_backend == "TRTLLM":
+            sm = get_sm_version()
+            if sm in (120, 121):
+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")
```

2719-2723: Same: compress condition, show SM

Align with other sites.

```diff
-        if moe_backend == "TRTLLM" and (get_sm_version() == 120
-                                        or get_sm_version() == 121):
-            pytest.skip(
-                "MOE TRTLLM backend does not support SM version 120 or 121")
+        if moe_backend == "TRTLLM":
+            sm = get_sm_version()
+            if sm in (120, 121):
+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")
```

tensorrt_llm/_torch/modules/gated_mlp.py (3)
32-34: New flags accepted; document behavior and defaults

Expose a brief docstring (init param section) for use_cute_dsl_blockscaling_mm and disable_deep_gemm to make their interaction clear.

Apply:

```diff
 class GatedMLP(nn.Module):
 @@
-    def __init__(self,
+    def __init__(self, *,
                  hidden_size: int,
 @@
-                 use_cute_dsl_blockscaling_mm: bool = False,
-                 disable_deep_gemm: bool = False):
+                 use_cute_dsl_blockscaling_mm: bool = False,
+                 disable_deep_gemm: bool = False):
+        """
+        Args:
+            use_cute_dsl_blockscaling_mm: Enable CUTE DSL block-scaling matmul path.
+            disable_deep_gemm: Force-disable deep-GEMM (e.g., on SM100) for this MLP.
+        """
```
108-118: Add a brief docstring for _apply_activation (new has_lora)

Clarify that FP8 is disabled when LoRA is active due to the grouped_gemm limitation.

```diff
-    def _apply_activation(self, x, *, has_lora: bool = False):
+    def _apply_activation(self, x, *, has_lora: bool = False):
+        """Apply activation.
+        Args:
+            has_lora: If True, route to non-FP8 path until LoRA grouped_gemm supports FP8.
+        """
```
146-151: Explicitness nit: pass has_lora=False for readability

No behavior change; improves clarity when scanning callsites.

```diff
-        h2 = self._apply_activation(h1)
+        h2 = self._apply_activation(h1, has_lora=False)
```

tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (5)
67-73: Generalize termination to detected remote size

Avoid hard-coded dest=0/1; iterate over the remote group.

```diff
 def mpi_send_termination_request(intercomm):
     if intercomm is not None:
-        # Send termination requests
-        intercomm.send(None, dest=0, tag=MPI_REQUEST)
-        intercomm.send(None, dest=1, tag=MPI_REQUEST)
-        print("Sent termination requests to the workers.")
+        # Send termination requests to all remote ranks
+        try:
+            for dest in range(intercomm.Get_remote_size()):
+                intercomm.send(None, dest=dest, tag=MPI_REQUEST)
+            print("Sent termination requests to the workers.", flush=True)
+        except Exception as e:
+            print(f"Termination request send error: {e}", flush=True)
```
87-96: Fix error messages: it's Lookup/Connect, not Publish

Message text is misleading when Connect fails.

```diff
-    except MPI.Exception as e:
-        print(f"Error publishing port name: {e}")
-        raise e
-    except Exception as e:
-        print(f"Unexpected error publishing port name: {e}")
-        raise e
+    except MPI.Exception as e:
+        print(f"Error looking up or connecting to port name: {e}", flush=True)
+        raise
+    except Exception as e:
+        print(f"Unexpected error looking up or connecting to port name: {e}", flush=True)
+        raise
```
381-384: Preserve exception details in MPI except block

```diff
-    except MPI.Exception as e:
-        print(f"MPI Error")
-        raise e
+    except MPI.Exception as e:
+        print(f"MPI Error: {e}", flush=True)
+        raise
```
485-488: Preserve exception details in MPI except block

```diff
-    except MPI.Exception as e:
-        print(f"MPI Error")
-        raise e
+    except MPI.Exception as e:
+        print(f"MPI Error: {e}", flush=True)
+        raise
```
25-36: Optional: unique service name per test to avoid cross-test collisions

Derive a unique service name (e.g., using PID/UUID) instead of hard-coding 'my_port'.

Example:

```diff
-MPI.Publish_name('my_port', port_name)
+service_name = f"trtllm_port_{os.getpid()}_{int(asyncio.get_running_loop().time()*1e9)}"
+MPI.Publish_name(service_name, port_name)
```

Propagate service_name to lookup sites.
/bot run --disable-fail-fast
PR_Github #17639 [ run ] triggered by Bot
NVIDIA#7159) Signed-off-by: Hui Gao <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Yuxian Qiu <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: qqiao <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…aph_config.max_batch_size is not provided in trtllm-bench (NVIDIA#7031) Signed-off-by: Jiagan Cheng <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…te (NVIDIA#7225) * Why? Some models (e.g. anything produced by Mistral) can have both sharded safetensors and a consolidated safetensor in the same checkpoint directory. In such cases, prefetching both to memory is a waste of time, and memory. * What? This commit skips over consolidated safetensors when they are not the only safetensor file present in the checkpoint directory. Signed-off-by: William Zhang <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…VIDIA#7265) Signed-off-by: William Zhang <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…DIA#7267) Signed-off-by: Balaram Buddharaju <[email protected]> Co-authored-by: Larry <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…er (NVIDIA#7245) Signed-off-by: Jin Li <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
67f6480 to b69fb27 (Compare)
/bot run --skip-test
PR_Github #17966 [ run ] triggered by Bot
PR_Github #17966 [ run ] completed with state
… for nemotron-nas (NVIDIA#7231) Signed-off-by: Venky Ganesh <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…VIDIA#7271) Signed-off-by: Balaram Buddharaju <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Wanli Jiang <[email protected]> Co-authored-by: Sharan Chetlur <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…a_params_from_requests (NVIDIA#7203) Signed-off-by: Amit Zuker <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…uracy issues (NVIDIA#7170) Signed-off-by: Dom Brown <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
b69fb27 to 74fc47c (Compare)
/bot skip --comment "Drop erroneous change"
PR_Github #17981 [ skip ] triggered by Bot
PR_Github #17981 [ skip ] completed with state
Signed-off-by: Nave Assaf <[email protected]> Signed-off-by: Wangshanshan <[email protected]> Signed-off-by: yechank <[email protected]> Signed-off-by: Balaram Buddharaju <[email protected]> Signed-off-by: Iman Tabrizian <[email protected]> Signed-off-by: qqiao <[email protected]> Signed-off-by: Superjomn <[email protected]> Signed-off-by: Bo Deng <[email protected]> Signed-off-by: Jin Li <[email protected]> Signed-off-by: Yifei Zhang <[email protected]> Signed-off-by: Amit Zuker <[email protected]> Signed-off-by: Erin Ho <[email protected]> Signed-off-by: Chenfei Zhang <[email protected]> Signed-off-by: Christina Zhang <[email protected]> Signed-off-by: Venky Ganesh <[email protected]> Signed-off-by: Pamela <[email protected]> Signed-off-by: Hui Gao <[email protected]> Signed-off-by: Alexandre Milesi <[email protected]> Signed-off-by: Shixiaowei02 <[email protected]> Signed-off-by: Michal Guzek <[email protected]> Signed-off-by: peaceh <[email protected]> Signed-off-by: nv-guomingz <[email protected]> Signed-off-by: Wanli Jiang <[email protected]> Signed-off-by: Patrice Castonguay <[email protected]> Signed-off-by: ruodil <[email protected]> Signed-off-by: Linda-Stadter <[email protected]> Signed-off-by: Yuxian Qiu <[email protected]> Signed-off-by: Jiagan Cheng <[email protected]> Signed-off-by: William Zhang <[email protected]> Signed-off-by: Dom Brown <[email protected]> Co-authored-by: Nave Assaf <[email protected]> Co-authored-by: Yechan Kim <[email protected]> Co-authored-by: brb-nv <[email protected]> Co-authored-by: Iman Tabrizian <[email protected]> Co-authored-by: Emma Qiao <[email protected]> Co-authored-by: Yan Chunwei <[email protected]> Co-authored-by: Bo Deng <[email protected]> Co-authored-by: Jin Li <[email protected]> Co-authored-by: yifeizhang-c <[email protected]> Co-authored-by: amitz-nv <[email protected]> Co-authored-by: Erin <[email protected]> Co-authored-by: chenfeiz0326 <[email protected]> Co-authored-by: ChristinaZ <[email protected]> Co-authored-by: Venky <[email protected]> Co-authored-by: Pamela Peng <[email protected]> Co-authored-by: HuiGao-NV <[email protected]> Co-authored-by: milesial <[email protected]> Co-authored-by: Shi Xiaowei <[email protected]> Co-authored-by: Michal Guzek <[email protected]> Co-authored-by: peaceh-nv <[email protected]> Co-authored-by: Guoming Zhang <[email protected]> Co-authored-by: Wanli Jiang <[email protected]> Co-authored-by: pcastonguay <[email protected]> Co-authored-by: ruodil <[email protected]> Co-authored-by: Linda <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Co-authored-by: Yuxian Qiu <[email protected]> Co-authored-by: Jiagan Cheng <[email protected]> Co-authored-by: William Zhang <[email protected]> Co-authored-by: Larry <[email protected]> Co-authored-by: Sharan Chetlur <[email protected]> Co-authored-by: Dom Brown <[email protected]>
Summary by CodeRabbit
New Features
Performance
Bug Fixes
Documentation
Description
As discussed with the PR authors, #6946 (@PerkzZheng), #7191 (@Shixiaowei02), #7193 (@crazydemo), and #7149 (@Wanli-Jiang) will not be cherry-picked; those PRs have already been merged into main through different PRs.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
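For example, commenting /bot run --disable-fail-fast --gpu-type "A30, H100_PCIe" on the pull request launches the pre-merge pipeline without fail-fast, restricted to those GPU types; the options above can be combined in a single run invocation as shown in the usage line.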
kill

kill

Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.