Conversation

dominicshanshan (Collaborator) commented Sep 4, 2025

Summary by CodeRabbit

  • New Features

    • Introduced BuildConfig; LLM now accepts build_config and max_seq_len (see the sketch after this summary).
    • Added CUDA graph buffer reuse in attention backend to enable preallocated buffers.
  • Performance

    • Reduced GPU allocations and startup overhead via buffer caching.
    • Auto-injects max_batch_size into PyTorch cuda_graph_config when missing.
  • Bug Fixes

    • Improved accuracy/stability by disabling a problematic GEMM path for specific models.
    • Block unsupported GPUs for a fused MoE path.
    • Added stricter validation in disaggregated serving.
    • Non-blocking queue sends now retry; cleaner shutdown to avoid hangs.
  • Documentation

    • Metrics endpoint guidance updated; examples and quickstart refreshed.
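
A minimal sketch of the new build_config usage, mirroring the quickstart example touched in this PR (model name and sampling values are illustrative, and the exact import path may differ for the _tensorrt_engine example):

from tensorrt_llm import LLM, BuildConfig, SamplingParams

# Build-time limits now flow through BuildConfig rather than separate LLM kwargs.
build_config = BuildConfig()
build_config.max_batch_size = 256
build_config.max_num_tokens = 1024

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", build_config=build_config)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)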

Description

As discussed with the respective PR authors, #6946 (@PerkzZheng), #7191 (@Shixiaowei02), #7193 (@crazydemo), and #7149 (@Wanli-Jiang) will not be cherry-picked; the corresponding changes have already been merged into main through different PRs.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
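
For example, both of the following are valid invocations (the stage names are illustrative): the first runs the ordinary pre-merge pipeline with fail-fast disabled plus one extra post-merge stage, while the second reuses build artifacts and successful test results from the last pipeline and runs only the listed stage.

/bot run --disable-fail-fast --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"

/bot run --reuse-test --stage-list "A10-PyTorch-1"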

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai bot (Contributor) commented Sep 4, 2025

📝 Walkthrough

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (11)
tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (1)

1-3: Add required NVIDIA copyright header (2025)

Guidelines require the NVIDIA copyright header on all .py files.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 import asyncio
 from difflib import SequenceMatcher
 from pathlib import Path
cpp/tensorrt_llm/batch_manager/logitsPostProcessor.cpp (1)

73-81: UB/crash: dereferencing possibly empty optional logitsPostProcessorBatched

If any request sets mApplyLogitsPostProcessorBatched and the caller didn’t pass a batched processor (uses default nullopt), (*logitsPostProcessorBatched)(...) is undefined behavior.

Apply:

     // Invoke batched processor
     if (!reqIdsVec.empty())
     {
         logitsPostProcessorIsApplied = true;
-        if (replicateLogitsPostProcessor || worldConfig.isFirstTensorParallelRank())
-        {
-            (*logitsPostProcessorBatched)(reqIdsVec, logitsVec, beamTokensVec, stream, clientIdsVec);
-        }
+        if (!logitsPostProcessorBatched)
+        {
+            TLLM_THROW("Batched logits post-processor requested but not provided.");
+        }
+        if (replicateLogitsPostProcessor || worldConfig.isFirstTensorParallelRank())
+        {
+            (*logitsPostProcessorBatched)(reqIdsVec, logitsVec, beamTokensVec, stream, clientIdsVec);
+        }
     }
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

246-257: Data race: reading mReadyResponses without holding its mutex

!mReadyResponses.empty() is checked without mResponderMutex. Access to mReadyResponses must be protected consistently.

-                if (!isSending() && !mReadyResponses.empty())
+                bool hasReady = false;
+                {
+                    std::unique_lock lk(mResponderMutex);
+                    hasReady = !mReadyResponses.empty();
+                }
+                if (!isSending() && hasReady)
                 {
                     auto const& requestInfo = mSender->recvRequestInfo();
tests/integration/defs/disaggregated/test_disaggregated.py (2)

297-301: Avoid masking exceptions if process startup fails.

server_proc/workers_proc may be undefined if context setup throws, causing UnboundLocalError in finally.

-    finally:
-        server_proc.terminate()
-        workers_proc.terminate()
-        server_proc.wait()
-        workers_proc.wait()
+    finally:
+        sp = locals().get("server_proc")
+        wp = locals().get("workers_proc")
+        if sp is not None:
+            sp.terminate()
+            sp.wait()
+        if wp is not None:
+            wp.terminate()
+            wp.wait()

1289-1293: Same cleanup robustness for benchmark runner.

Mirror the safe termination pattern here as well.

-    finally:
-        server_proc.terminate()
-        workers_proc.terminate()
-        server_proc.wait()
-        workers_proc.wait()
+    finally:
+        sp = locals().get("server_proc")
+        wp = locals().get("workers_proc")
+        if sp is not None:
+            sp.terminate()
+            sp.wait()
+        if wp is not None:
+            wp.terminate()
+            wp.wait()
tensorrt_llm/llmapi/mpi_session.py (1)

418-449: Fix race on results aggregation and avoid silent drops when send retries exhaust.

  • self.results.append(...) and the length check run in multiple callbacks without synchronization; two near-simultaneous completions can both observe the “all done” condition and send twice, or clear while the other thread is still pickling, yielding empty/partial payloads.
  • ZeroMqQueue.put_noblock now swallows EAGAIN after retries and only logs an error; the surrounding try/except won’t detect failure. Clearing self.results unconditionally can lose results and cause the client to hang indefinitely.

Apply this diff to make the callback idempotent and thread-safe, and to avoid clearing until after a successful handoff. It also aligns retries with other call sites and adds a small backoff:

@@
-    def mpi_future_callback(self, future):
-        print_colored_debug(f"rank{global_mpi_rank()} got future: {future}\n",
-                            "red")
-        if future.exception() is not None:
-            print_colored_debug(
-                f"mpi_future got exception: {future.exception()}, quitting\n",
-                "red")
-            self.queue.put(future.exception())
-            return
-
-        result = future.result()
-        self.results.append(result)
-        print_colored_debug(
-            f"RemoteMpiCommSessionServer working status: {len(self.results)}/{self.num_results}\n",
-            "grey")
-        if len(self.results) == self.num_results:
-            print_colored_debug(
-                f"RemoteMpiCommSessionServer received all results, sending to client\n",
-                "green")
-            try:
-                self.queue.put_noblock(self.results, retry=2)
-            except zmq.ZMQError as e:
-                # The client could be shutdown first.
-                if e.errno == zmq.EAGAIN:
-                    pass
-                else:
-                    raise e
-
-            print_colored_debug(
-                f"RemoteMpiCommSessionServer sent results to client\n", "green")
-            self.results.clear()
+    def mpi_future_callback(self, future):
+        print_colored_debug(f"rank{global_mpi_rank()} got future: {future}\n", "red")
+
+        to_send = None
+        # Aggregate results and decide send-once under a lock
+        with self._results_lock:
+            if future.exception() is not None:
+                print_colored_debug(
+                    f"mpi_future got exception: {future.exception()}, quitting\n", "red")
+                if not self._results_sent:
+                    to_send = future.exception()
+                    self._results_sent = True
+            else:
+                result = future.result()
+                self.results.append(result)
+                print_colored_debug(
+                    f"RemoteMpiCommSessionServer working status: {len(self.results)}/{self.num_results}\n",
+                    "grey")
+                if len(self.results) == self.num_results and not self._results_sent:
+                    to_send = list(self.results)  # snapshot to avoid mutation while pickling
+                    self._results_sent = True
+
+        if to_send is None:
+            return
+
+        print_colored_debug(
+            f"RemoteMpiCommSessionServer sending {'exception' if not isinstance(to_send, list) else 'results'} to client\n",
+            "green")
+        try:
+            # Slightly longer retry for robustness; keep non-blocking semantics in callback context
+            self.queue.put_noblock(to_send, retry=4, wait_time=0.01)
+        except zmq.ZMQError as e:
+            if e.errno != zmq.EAGAIN:
+                raise
+        else:
+            if isinstance(to_send, list):
+                # Clear only after handoff attempt; we can't detect failure with current API, so rely on idempotent send-once gate.
+                with self._results_lock:
+                    self.results.clear()
+        print_colored_debug(f"RemoteMpiCommSessionServer send complete\n", "green")

Add the following support changes outside this hunk:

# In RemoteMpiCommSessionServer.__init__ (near line ~356)
self._results_lock = threading.Lock()
self._results_sent = False
# In RemoteMpiCommSessionServer.serve() just before scheduling futures (near lines ~409-416)
with self._results_lock:
    self.results = []
    self._results_sent = False

Optionally, consider making ZeroMqQueue.put_noblock return a bool to signal success so the server can decide whether to retain and retry later instead of unconditionally proceeding. I can draft that change if desired.
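
A minimal sketch of that optional change, reusing the ZeroMqQueue internals already shown above (the bool return value and the caller-side handling are assumptions, not the existing API):

def put_noblock(self, obj: Any, *, retry: int = 1, wait_time: float = 0.001) -> bool:
    """Non-blocking send with retries; returns True on success, False if every attempt failed."""
    if not (0 <= retry <= 10):
        raise ValueError("retry must be between 0 and 10; increase wait_time if needed")
    self.setup_lazily()
    data = pickle.dumps(obj)  # nosec B301
    if self.use_hmac_encryption:
        data = self._sign_data(data)
    while True:
        try:
            self.socket.send(data, flags=zmq.NOBLOCK)
            return True
        except zmq.Again:
            if retry > 0:
                retry -= 1
                time.sleep(wait_time)
                continue
            logger.error(f"Failed to send object: {type(obj).__name__}")
            return False

With that in place, mpi_future_callback could keep self.results for a later retry whenever put_noblock returns False, instead of clearing them unconditionally.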

tests/integration/defs/perf/pytorch_model_config.py (1)

30-31: Fix Python 3.8 typing: list[str] is not supported without postponed evaluation.

The repository targets Python 3.8+. Using list[str] will error at import time on 3.8. Switch to typing.List.

-def get_model_yaml_config(model_label: str,
-                          lora_dirs: list[str] = None) -> dict:
+from typing import Optional, List  # at file top
+
+def get_model_yaml_config(model_label: str,
+                          lora_dirs: Optional[List[str]] = None) -> dict:

Add the import near the other imports in this module if not present:

from typing import Optional, List
tensorrt_llm/_torch/attention_backend/interface.py (1)

304-306: Also propagate buffers (and max_draft_tokens) to cross sub-metadata.

Without this, cross metadata won’t share the preallocated buffers and may reallocate.

-            cuda_graph_metadata.cross = cuda_graph_metadata.cross.create_cuda_graph_metadata(
-                max_batch_size, True)
+            cuda_graph_metadata.cross = cuda_graph_metadata.cross.create_cuda_graph_metadata(
+                max_batch_size,
+                sub_cross_metadata=True,
+                max_draft_tokens=max_draft_tokens,
+                buffers=buffers,
+            )
tensorrt_llm/_torch/modules/gated_mlp.py (1)

1-3: Add NVIDIA copyright header (2025).

This file is missing the required NVIDIA copyright header for 2025.

Apply at file top:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (2)

1-5: Add NVIDIA copyright header (2025).

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

138-139: Disconnect intercommunicator on worker shutdown to prevent resource leaks.

     llm.shutdown()
+    try:
+        intercomm.Disconnect()
+    except Exception as e:
+        print(f"Worker {rank} intercomm disconnect error: {e}", flush=True)
🧹 Nitpick comments (50)
tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (1)

20-20: Prefer xfail over skip to surface unexpected passes; also avoid invalid URL in reason

Using xfail keeps the test visible and reports XPASS if the bug is fixed unintentionally. Drop the "https://" prefix to avoid broken-link checkers.

-@pytest.mark.skip(reason="https://nvbugs/5470782")
+@pytest.mark.xfail(reason="nvbugs/5470782", strict=False)
jenkins/L0_Test.groovy (2)

2275-2282: Refactor is fine; prefer not to reach back into values here

Using values[2] is behaviorally equivalent to cpu_arch, but it couples this inner block to the outer list indexing. For readability and to avoid accidental drift, use the already-derived cpu_arch.

Also, please double-check that the CUDA repo path for Ubuntu 24.04 should remain sbsa for ARM. If NVIDIA switched to arm64 for 24.04, this would break.

Apply:

-            def ubuntu_version = key.contains("UB2404") ? "ubuntu2404" : "ubuntu2204"
-            def platform = values[2] == X86_64_TRIPLE ? "x86_64" : "sbsa"
+            def ubuntu_version = key.contains("UB2404") ? "ubuntu2404" : "ubuntu2204"
+            def platform = cpu_arch == X86_64_TRIPLE ? "x86_64" : "sbsa"

351-381: Unify dockerGPUOption naming (stray dockerGpuOption is unused)

There’s a casing mismatch: dockerGpuOption is declared but dockerGPUOption is used/assigned. Keep one name to avoid confusion and accidental scope issues.

-                def dockerGpuOption = ""
+                def dockerGPUOption = ""
@@
-                    dockerGPUOption = sh(script: """
+                    dockerGPUOption = sh(script: """
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)

6-7: Fail fast on unsupported SM and include details in the error

Good guard. Two improvements:

  • Move the SM check before super().__init__ to avoid unnecessary initialization on unsupported GPUs.
  • Include the detected SM in the error to aid debugging. Also confirm intent vs. docstring (“SM100 only”) since the guard only blocks SM>=120.

Apply within this block to improve the message:

-        sm_version = get_sm_version()
-        if sm_version >= 120:
-            raise NotImplementedError(
-                "TRTLLMGenFusedMoE does not support SM120 and above.")
+        sm_version = get_sm_version()
+        if sm_version >= 120:
+            raise NotImplementedError(
+                f"TRTLLMGenFusedMoE does not support SM{sm_version} (>= 120).")

And move the check to run before super().__init__ (outside the selected range; example placement shown):

# Place right at the start of __init__, before calling super().__init__
sm_version = get_sm_version()
if sm_version >= 120:
    raise NotImplementedError(f"TRTLLMGenFusedMoE does not support SM{sm_version} (>= 120).")

Also applies to: 83-87

tensorrt_llm/serve/openai_disagg_server.py (1)

325-327: Strengthen ctx_request_id validation

The None-check is good; also validate type and positivity to prevent downstream issues in metrics and routing.

-        if choices[0].disaggregated_params.ctx_request_id is None:
-            raise ValueError("Invalid disaggregated params in context phase response.")
+        ctx_id = choices[0].disaggregated_params.ctx_request_id
+        if ctx_id is None or not isinstance(ctx_id, int) or ctx_id <= 0:
+            raise ValueError("Invalid ctx_request_id in context phase response.")
tensorrt_llm/_torch/model_config.py (1)

497-499: Guard against non-divisible MLP size before TP split

Dividing by tp_size without checking divisibility can silently truncate and misconfigure bindings.

-                    mlp_hidden_size = self._infer_nemotron_ffn_mult(
-                    ) // self.mapping.tp_size
+                    mlp_hidden_size_global = self._infer_nemotron_ffn_mult()
+                    if mlp_hidden_size_global % self.mapping.tp_size != 0:
+                        raise ValueError(
+                            f"Inferred mlp_hidden_size {mlp_hidden_size_global} is not divisible by tp_size {self.mapping.tp_size}"
+                        )
+                    mlp_hidden_size = mlp_hidden_size_global // self.mapping.tp_size
tests/integration/test_lists/waives.txt (1)

291-297: Dedupe duplicate SKIP entries

  • tests/integration/test_lists/waives.txt lines 244 & 283: llmapi/test_llm_examples.py::test_llmapi_speculative_decoding_mtp
  • tests/integration/test_lists/waives.txt lines 264 & 291: test_e2e.py::test_ptp_quickstart_multimodal[llava-v1.6-mistral-7b-llava-v1.6-mistral-7b-hf-image-False]
    Remove one occurrence of each to keep the waiver list concise.
tensorrt_llm/bench/dataclasses/configuration.py (1)

87-104: Add NVIDIA copyright header.

Per repo guidelines, prepend the 2025 NVIDIA header at file top.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+#
+"""
 existing imports...
tensorrt_llm/executor/ipc.py (2)

128-158: Polish put_noblock: docstring D205 and avoid assert for validation.

  • Fix D205: add a blank line after the summary line.
  • Replace assert with ValueError (asserts can be stripped with -O).
  • Optional: use a loop instead of recursion; include queue name in the error for easier traceability.
     def put_noblock(self,
                     obj: Any,
                     *,
                     retry: int = 1,
                     wait_time: float = 0.001):
-        '''
-        Put an object into the queue without blocking, and retry if the send fails.
-        NOTE: It won't raise any error if the send fails.
+        '''
+        Put an object into the queue without blocking, and retry if the send fails.
+
+        NOTE: It won't raise any error if the send fails.
 
         Parameters:
             obj (Any): The object to send.
             retry (int): The number of times to retry sending the object.
             wait_time (float): The time to wait before retrying.
         '''
 
-        assert retry >= 0 and retry <= 10, "Retry must be between 0 and 10, adjust the wait_time if needed"
+        if not (0 <= retry <= 10):
+            raise ValueError("retry must be between 0 and 10; increase wait_time if needed")
 
         self.setup_lazily()
         with nvtx_range_debug("send", color="blue", category="IPC"):
             data = pickle.dumps(obj)  # nosec B301
             if self.use_hmac_encryption:
                 data = self._sign_data(data)
-            try:
-                self.socket.send(data, flags=zmq.NOBLOCK)
-            except zmq.Again:
-                if retry > 0:
-                    time.sleep(wait_time)
-                    self.put_noblock(obj, retry=retry - 1, wait_time=wait_time)
-                else:
-                    logger.error(f"Failed to send object: {obj}")
+            while True:
+                try:
+                    self.socket.send(data, flags=zmq.NOBLOCK)
+                    break
+                except zmq.Again:
+                    if retry > 0:
+                        retry -= 1
+                        time.sleep(wait_time)
+                        continue
+                    logger.error(f"[{self.name or 'ipc'}] Failed to send object: {type(obj).__name__}")
+                    break

1-10: Add NVIDIA copyright header.

This file is missing the standard header; please add.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+#
 """
 existing imports...
examples/llm-api/_tensorrt_engine/quickstart_example.py (2)

7-9: Prefer constructor kwargs (if supported) and keep namespaced usage

Cleaner and immutable config if BuildConfig supports kwargs; also aligns with namespaced import.

Apply:

-    build_config = BuildConfig()
-    build_config.max_batch_size = 256
-    build_config.max_num_tokens = 1024
+    build_config = trtllm.BuildConfig(max_batch_size=256, max_num_tokens=1024)

If kwargs aren’t supported, keep attribute assignment but use trtllm.BuildConfig().


13-14: Use namespaced symbols consistently

Keep SamplingParams and any other symbols under the module namespace per guidelines.

Apply:

-    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
-              build_config=build_config)
+    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+              build_config=build_config)
@@
-    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+    sampling_params = trtllm.SamplingParams(temperature=0.8, top_p=0.95)

Also applies to: 24-24

cpp/tensorrt_llm/batch_manager/logitsPostProcessor.cpp (1)

42-47: Pre-size vectors to reduce reallocations in hot path

Reserve capacity to avoid repeated allocations when many requests use batched processing.

Apply:

     // Arguments for batched processor
     std::vector<LlmRequest::RequestIdType> reqIdsVec;
     std::vector<LlmRequest::TensorPtr> logitsVec;
     std::vector<std::reference_wrapper<LlmRequest::BeamTokens const>> beamTokensVec;
     std::vector<std::optional<LlmRequest::RequestIdType>> clientIdsVec;
+    reqIdsVec.reserve(inputBuffers.decoderRequests.size());
+    logitsVec.reserve(inputBuffers.decoderRequests.size());
+    beamTokensVec.reserve(inputBuffers.decoderRequests.size());
+    clientIdsVec.reserve(inputBuffers.decoderRequests.size());

Also applies to: 66-71

cpp/include/tensorrt_llm/batch_manager/logitsPostProcessor.h (4)

50-50: Avoid const& optional with std::nullopt default; also note ABI change

Binding a const& to a temporary optional via the default std::nullopt is legal but non-obvious and changes the public ABI. Passing by value is simpler, avoids lifetime surprises, and copies at most one small optional. If the intent is to avoid copying the contained std::function, prefer pass-by-value and move inside the definition.

-        std::optional<LogitsPostProcessorBatched> const& logitsPostProcessorBatched = std::nullopt) const;
+        std::optional<LogitsPostProcessorBatched> logitsPostProcessorBatched = std::nullopt) const;

Please confirm no out-of-tree consumers rely on the previous by-value signature.


1-16: Update copyright year to include 2025

Repo guideline says prepend current year. Header shows 2022-2024.

- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

18-18: Headers should use include guards, not only #pragma once

Guideline: use guards of the form TRTLLM_<FILENAME>_H.

Example (apply at file top/bottom):

-#pragma once
+#ifndef TRTLLM_LOGITSPOSTPROCESSOR_H
+#define TRTLLM_LOGITSPOSTPROCESSOR_H
 ...
+#endif // TRTLLM_LOGITSPOSTPROCESSOR_H

44-44: Constant naming style

Guideline: constants use k prefix with UPPER_SNAKE. Consider renaming name to kLOGITS_POST_PROCESSOR_NAME.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

172-176: Ensure background task termination and surface exceptions

Destructor sets mTerminate but doesn’t explicitly join/observe mResponseFuture. It’s safer to wait and surface exceptions from response().

     ~Impl()
     {
         terminate();
+        if (mResponseFuture.valid()) {
+            mResponseFuture.get(); // join and surface any exception
+        }
     }

213-221: Detached per-send threads may balloon; prefer a pool or reuse async worker

Current path spawns a detached thread per response (when parallel send is enabled). This can cause unbounded thread creation under load.

  • Reuse a fixed-size thread pool (or reuse the existing async worker).
  • Alternatively, gate parallel sends with a semaphore.
docs/source/commands/trtllm-serve/trtllm-serve.rst (2)

204-210: Grammar/clarity fixes for PyTorch metrics note

Subject–verb agreement and clarity.

-   The metrics endpoint for the default PyTorch backend are in beta and are not as comprehensive as those for the TensorRT backend.
-   Some fields, such as CPU memory usage, are not yet available for the PyTorch backend.
-   Enabling ``enable_iter_perf_stats`` in the PyTorch backend can slightly impact performance, depending on the serving configuration.
+   The metrics endpoint for the default PyTorch backend is in beta and is not as comprehensive as the one for the TensorRT backend.
+   Some fields (for example, CPU memory usage) are not yet available for the PyTorch backend.
+   Enabling ``enable_iter_perf_stats`` in the PyTorch backend can slightly impact performance, depending on the serving configuration.

219-227: Tighten wording around polling guidance

Minor style improvements and remove passive voice.

-After sending at least one inference request to the server, you can fetch runtime iteration statistics by polling the ``/metrics`` endpoint.
-Since the statistics are stored in an internal queue and removed once retrieved, it's recommended to poll the endpoint shortly after each request and store the results if needed.
+After you send at least one inference request, poll the ``/metrics`` endpoint to fetch runtime-iteration statistics.
+Because statistics are kept in an internal queue and removed once retrieved, poll shortly after each request and persist results if needed.
tensorrt_llm/llmapi/utils.py (4)

521-538: Prevent duplicate status tags in generated docs

If ApiParamTagger already prefixes the field description with a tag, this adds a second identical tag on the arg line. Guard to avoid duplication.

-        arg_line = f"{indent}    {field_name} ({type_str}): "
-        if status := field_info.get("status", None):
-            arg_line += f":tag:`{status}` "
+        arg_line = f"{indent}    {field_name} ({type_str}): "
+        if status := field_info.get("status", None):
+            # Avoid duplicating tag if description already contains it
+            if not (field_description or "").lstrip().startswith(f":tag:`{status}`"):
+                arg_line += f":tag:`{status}` "

484-485: Future-proof Pydantic schema access

Prefer model_json_schema() in Pydantic v2 with a v1 fallback.

-    schema = model.schema()
+    schema = (model.model_json_schema()
+              if hasattr(model, "model_json_schema")
+              else model.schema())

562-565: Fix D200: one-line docstring formatting

Remove leading space and use concise wording.

-        """ The main entry point to tag the api doc. """
+        """The main entry point to tag the API doc."""

574-589: Handle None descriptions and idempotency in description tagging

If description is None, current code yields "None" in docs; also re-tagging can double-prefix.

-        for field_name in field_names:
-            field = cls.model_fields[field_name]
-            cls.model_fields[
-                field_name].description = f":tag:`{tag}` {field.description}"
+        for field_name in field_names:
+            field = cls.model_fields[field_name]
+            desc = (field.description or "").strip()
+            prefix = f":tag:`{tag}` "
+            if not desc.startswith(prefix):
+                desc = prefix + desc
+            cls.model_fields[field_name].description = desc
         cls.model_rebuild(force=True)
tests/unittest/llmapi/test_utils.py (1)

31-34: Strengthen assertion to reduce false positives

Also assert that at least one specific LlmArgs field line carries the tag.

 def test_generate_api_docs_as_docstring():
-    doc = generate_api_docs_as_docstring(LlmArgs)
-    assert ":tag:`beta`" in doc, "the label is not generated"
+    doc = generate_api_docs_as_docstring(LlmArgs)
+    assert ":tag:`beta`" in doc, "the label is not generated"
+    # Optional: check a representative field is tagged
+    assert re.search(r"\s+world_size.*:.*:tag:`beta`", doc), "world_size not tagged"
tests/integration/defs/test_e2e.py (1)

1746-1753: Advance regex cursor using match.end() to avoid re-scan

Slicing from the end of group(1) re-scans the inner text; using match.end() is cleaner and marginally faster.

-            _, end = match.span(1)
-            results.append(match.group(2))
-            item = item[end:]
+            results.append(match.group(2))
+            item = item[match.end():]
tensorrt_llm/_torch/autotuner.py (2)

372-375: Log dedup key scope (optional).

If you want one warning per (op, shape) instead of per op, include shapes in the key. Current change is correct; this is optional.

-                logger.warning_once(
-                    f"[Autotuner] Using the fallback tactic, due to cache miss on input shapes={input_shapes}",
-                    key=custom_op)
+                logger.warning_once(
+                    f"[Autotuner] Using the fallback tactic, due to cache miss on input shapes={input_shapes}",
+                    key=("cache_miss", custom_op, input_shapes))

1-1: License header consistency (nit).

Other Python files carry the NVIDIA SPDX header; this one doesn’t. Consider adding for consistency/compliance.

tensorrt_llm/executor/proxy.py (1)

352-355: Optional: tune wait_time to reduce spin.

Non-blocking retries default to 1ms sleeps. If shutdown storms persist, consider a slightly larger wait_time (e.g., 5–10ms).

-            self.request_queue.put_noblock(None, retry=4)
+            self.request_queue.put_noblock(None, retry=4, wait_time=0.005)
tensorrt_llm/logger.py (1)

111-116: Avoid using assert for runtime argument validation.

asserts can be stripped with -O; prefer an explicit exception for API safety.

-    def log_once(self, severity, *msg, key):
-        assert key is not None, "key is required for log_once"
+    def log_once(self, severity, *msg, key):
+        if key is None:
+            raise ValueError("key is required for log_once")
         if key not in self._appeared_keys:
             self._appeared_keys.add(key)
             self.log(severity, *msg)
tensorrt_llm/_torch/models/modeling_qwen3.py (2)

161-178: Avoid hard-coding deep_gemm disable; make it configurable or scoped.

Hard-coding disable_deep_gemm = True globally may regress performance where the bug doesn’t manifest (e.g., non-FP8, non-SM100). Prefer a config/env toggle so it can be flipped without code changes.

Minimal change using an env var (defaults to disabled to preserve current behavior):

@@
-        # Qwen3 has accuracy issues with deep_gemm (see: https://nvbugspro.nvidia.com/bug/5461712)
-        disable_deep_gemm = True
+        # Qwen3 has accuracy issues with deep_gemm (see: https://nvbugspro.nvidia.com/bug/5461712)
+        # Allow override via env; default to disabling for safety.
+        disable_deep_gemm = os.getenv("TLLM_DISABLE_DEEP_GEMM_QWEN3", "1") == "1"
@@
             config=model_config,
             disable_deep_gemm=disable_deep_gemm,
         )

Add import once at the top of the file:

import os

If you prefer, we can instead plumb a disable_deep_gemm flag through ModelConfig and only disable when FP8 is active and sm == 100.


251-261: Mirror the configurability for MLP.

Keep Attention/MLP behavior consistent by reusing the same toggle when constructing GatedMLP.

-        # Qwen3 has accuracy issues with deep_gemm (see: https://nvbugspro.nvidia.com/bug/5461712)
-        disable_deep_gemm = True
+        # Qwen3 has accuracy issues with deep_gemm
+        disable_deep_gemm = os.getenv("TLLM_DISABLE_DEEP_GEMM_QWEN3", "1") == "1"
@@
             config=model_config,
             disable_deep_gemm=disable_deep_gemm,
         )
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

856-881: Balance NVTX ranges: add range_pop for the pushes in Stage 2.

Both branches push an NVTX range but never pop, which skews traces.

                 if previous_batch is not None:
                     sample_state = previous_batch.sample_state
                     if not self.dist.is_last_pp_rank:
                         torch.cuda.nvtx.range_push(
                             "_handle_new_tokens_inter_pp")
@@
                     else:
                         torch.cuda.nvtx.range_push("_handle_new_tokens_last_pp")
                         sample_state.sampler_event.synchronize()
 
                     # Send tokens to next pp rank (w.r.t model forward direction)
                     # Second last rank does not need to since last rank has original decoded tokens
                     if not self.dist.is_second_last_pp_rank:
                         self.wait_on_pp_send_handles(prev_microbatch_id)
                         self.send_handles[
                             prev_microbatch_id] = self.dist.isend_object(
                                 sample_state.host,
                                 dest=self.dist.next_pp_rank,
                                 tag=prev_microbatch_id)
+                    torch.cuda.nvtx.range_pop()
tensorrt_llm/_torch/modules/linear.py (2)

616-631: Guard looks correct; add a quick SM cache to avoid repeated device queries.

Minor: cache sm = get_sm_version() once per call (or at init time) to avoid duplicate CUDA property queries in the hot path.

-        if get_sm_version() == 100 and not module.disable_deep_gemm:
+        sm = get_sm_version()
+        if sm == 100 and not module.disable_deep_gemm:

1598-1599: Public API grew: document disable_deep_gemm.

Please add a short Google-style docstring entry for this constructor param and the new attribute to align with repo guidelines.

Also applies to: 1616-1617

tensorrt_llm/_torch/attention_backend/trtllm.py (1)

603-606: OK to delegate to buffer-aware post-init; consider type annotation.

_post_init_with_buffers(self, buffers) is fine. Consider annotating buffers: Optional[dict[str, list[torch.Tensor]]] for clarity.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)

1568-1572: Consolidate SM 120/121 skip and tighten condition.

Use membership test and include actual SM in the message for quicker triage.

-        if moe_backend == "TRTLLM" and (get_sm_version() == 120
-                                        or get_sm_version() == 121):
-            pytest.skip(
-                "MOE TRTLLM backend does not support SM version 120 or 121")
+        if moe_backend == "TRTLLM":
+            sm = get_sm_version()
+            if sm in (120, 121):
+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")

1621-1624: Apply same SM check refactor here.

Mirror the consolidated pattern to avoid duplication and repeated device queries.

-        if moe_backend == "TRTLLM" and (get_sm_version() == 120
-                                        or get_sm_version() == 121):
-            pytest.skip(
-                "MOE TRTLLM backend does not support SM version 120 or 121")
+        if moe_backend == "TRTLLM":
+            sm = get_sm_version()
+            if sm in (120, 121):
+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")

1895-1899: Same: compress SM 120/121 guard.

Keep the message informative with the detected SM.

-        if moe_backend == "TRTLLM" and (get_sm_version() == 120
-                                        or get_sm_version() == 121):
-            pytest.skip(
-                "MOE TRTLLM backend does not support SM version 120 or 121")
+        if moe_backend == "TRTLLM":
+            sm = get_sm_version()
+            if sm in (120, 121):
+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")

2523-2527: Same: unify the TRTLLM MOE skip.

Minor style and efficiency improvement.

-        if moe_backend == "TRTLLM" and (get_sm_version() == 120
-                                        or get_sm_version() == 121):
-            pytest.skip(
-                "MOE TRTLLM backend does not support SM version 120 or 121")
+        if moe_backend == "TRTLLM":
+            sm = get_sm_version()
+            if sm in (120, 121):
+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")

2719-2723: Same: compress condition, show SM.

Align with other sites.

-        if moe_backend == "TRTLLM" and (get_sm_version() == 120
-                                        or get_sm_version() == 121):
-            pytest.skip(
-                "MOE TRTLLM backend does not support SM version 120 or 121")
+        if moe_backend == "TRTLLM":
+            sm = get_sm_version()
+            if sm in (120, 121):
+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")
tensorrt_llm/_torch/modules/gated_mlp.py (3)

32-34: New flags accepted; document behavior and defaults.

Expose a brief docstring (init param section) for use_cute_dsl_blockscaling_mm and disable_deep_gemm to make their interaction clear.

Apply:

 class GatedMLP(nn.Module):
@@
-    def __init__(self,
+    def __init__(self,
                  *,
                  hidden_size: int,
@@
-                 use_cute_dsl_blockscaling_mm: bool = False,
-                 disable_deep_gemm: bool = False):
+                 use_cute_dsl_blockscaling_mm: bool = False,
+                 disable_deep_gemm: bool = False):
+        """
+        Args:
+            use_cute_dsl_blockscaling_mm: Enable CUTE DSL block-scaling matmul path.
+            disable_deep_gemm: Force-disable deep-GEMM (e.g., on SM100) for this MLP.
+        """

108-118: Add a brief docstring for _apply_activation (new has_lora).

Clarify that FP8 is disabled when LoRA is active due to grouped_gemm limitation.

-    def _apply_activation(self, x, *, has_lora: bool = False):
+    def _apply_activation(self, x, *, has_lora: bool = False):
+        """Apply activation.
+        Args:
+            has_lora: If True, route to non-FP8 path until LoRA grouped_gemm supports FP8.
+        """

146-151: Explicitness nit: pass has_lora=False for readability.

No behavior change; improves clarity when scanning callsites.

-        h2 = self._apply_activation(h1)
+        h2 = self._apply_activation(h1, has_lora=False)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (5)

67-73: Generalize termination to detected remote size.

Avoid hard-coded dest=0/1; iterate over remote group.

 def mpi_send_termination_request(intercomm):
     if intercomm is not None:
-        # Send termination requests
-        intercomm.send(None, dest=0, tag=MPI_REQUEST)
-        intercomm.send(None, dest=1, tag=MPI_REQUEST)
-        print("Sent termination requests to the workers.")
+        # Send termination requests to all remote ranks
+        try:
+            for dest in range(intercomm.Get_remote_size()):
+                intercomm.send(None, dest=dest, tag=MPI_REQUEST)
+            print("Sent termination requests to the workers.", flush=True)
+        except Exception as e:
+            print(f"Termination request send error: {e}", flush=True)

87-96: Fix error messages: it's Lookup/Connect, not Publish.

Message text is misleading when Connect fails.

-    except MPI.Exception as e:
-        print(f"Error publishing port name: {e}")
-        raise e
-    except Exception as e:
-        print(f"Unexpected error publishing port name: {e}")
-        raise e
+    except MPI.Exception as e:
+        print(f"Error looking up or connecting to port name: {e}", flush=True)
+        raise
+    except Exception as e:
+        print(f"Unexpected error looking up or connecting to port name: {e}", flush=True)
+        raise

381-384: Preserve exception details in MPI except block.

-        except MPI.Exception as e:
-            print(f"MPI Error")
-            raise e
+        except MPI.Exception as e:
+            print(f"MPI Error: {e}", flush=True)
+            raise

485-488: Preserve exception details in MPI except block.

-        except MPI.Exception as e:
-            print(f"MPI Error")
-            raise e
+        except MPI.Exception as e:
+            print(f"MPI Error: {e}", flush=True)
+            raise

25-36: Optional: unique service name per test to avoid cross-test collisions.

Derive a unique service name (e.g., using PID/UUID) instead of hard-coding 'my_port'.

Example:

-MPI.Publish_name('my_port', port_name)
+service_name = f"trtllm_port_{os.getpid()}_{uuid.uuid4().hex}"
+MPI.Publish_name(service_name, port_name)

Propagate service_name to the lookup sites (the snippet assumes os and uuid are imported in the test module).

dominicshanshan (Collaborator, Author)

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator)

PR_Github #17639 [ run ] triggered by Bot

dominicshanshan and others added 8 commits September 7, 2025 19:32
Signed-off-by: Yuxian Qiu <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
…aph_config.max_batch_size is not provided in trtllm-bench (NVIDIA#7031)

Signed-off-by: Jiagan Cheng <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
…te (NVIDIA#7225)

* Why?

Some models (e.g. anything produced by Mistral) can have both sharded
safetensors and a consolidated safetensor in the same checkpoint
directory. In such cases, prefetching both to memory is a waste of time,
and memory.

* What?

This commit skips over consolidated safetensors when they are not the
only safetensor file present in the checkpoint directory.

Signed-off-by: William Zhang <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
…DIA#7267)

Signed-off-by: Balaram Buddharaju <[email protected]>
Co-authored-by: Larry <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
dominicshanshan (Collaborator, Author)

/bot run --skip-test

tensorrt-cicd (Collaborator)

PR_Github #17966 [ run ] triggered by Bot

tensorrt-cicd (Collaborator)

PR_Github #17966 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13466 (Partly Tested) completed with status: 'SUCCESS'

venkywonka and others added 6 commits September 7, 2025 22:16
… for nemotron-nas (NVIDIA#7231)

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Wanli Jiang <[email protected]>
Co-authored-by: Sharan Chetlur <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
…a_params_from_requests (NVIDIA#7203)

Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
dominicshanshan (Collaborator, Author)

/bot skip --comment "Drop erroneous change"

tensorrt-cicd (Collaborator)

PR_Github #17981 [ skip ] triggered by Bot

tensorrt-cicd (Collaborator)

PR_Github #17981 [ skip ] completed with state SUCCESS
Skipping testing for commit 74fc47c

nv-guomingz merged commit c9dca69 into NVIDIA:main on Sep 8, 2025
5 checks passed
jmydurant added a commit to jmydurant/TensorRT-LLM that referenced this pull request Sep 9, 2025
Signed-off-by: Nave Assaf <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: yechank <[email protected]>
Signed-off-by: Balaram Buddharaju <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]>
Signed-off-by: qqiao <[email protected]>
Signed-off-by: Superjomn <[email protected]>
Signed-off-by: Bo Deng <[email protected]>
Signed-off-by: Jin Li <[email protected]>
Signed-off-by: Yifei Zhang <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Erin Ho <[email protected]>
Signed-off-by: Chenfei Zhang <[email protected]>
Signed-off-by: Christina Zhang <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Pamela <[email protected]>
Signed-off-by: Hui Gao <[email protected]>
Signed-off-by: Alexandre Milesi <[email protected]>
Signed-off-by: Shixiaowei02 <[email protected]>
Signed-off-by: Michal Guzek <[email protected]>
Signed-off-by: peaceh <[email protected]>
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Wanli Jiang <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: ruodil <[email protected]>
Signed-off-by: Linda-Stadter <[email protected]>
Signed-off-by: Yuxian Qiu <[email protected]>
Signed-off-by: Jiagan Cheng <[email protected]>
Signed-off-by: William Zhang <[email protected]>
Signed-off-by: Dom Brown <[email protected]>
Co-authored-by: Nave Assaf <[email protected]>
Co-authored-by: Yechan Kim <[email protected]>
Co-authored-by: brb-nv <[email protected]>
Co-authored-by: Iman Tabrizian <[email protected]>
Co-authored-by: Emma Qiao <[email protected]>
Co-authored-by: Yan Chunwei <[email protected]>
Co-authored-by: Bo Deng <[email protected]>
Co-authored-by: Jin Li <[email protected]>
Co-authored-by: yifeizhang-c <[email protected]>
Co-authored-by: amitz-nv <[email protected]>
Co-authored-by: Erin <[email protected]>
Co-authored-by: chenfeiz0326 <[email protected]>
Co-authored-by: ChristinaZ <[email protected]>
Co-authored-by: Venky <[email protected]>
Co-authored-by: Pamela Peng <[email protected]>
Co-authored-by: HuiGao-NV <[email protected]>
Co-authored-by: milesial <[email protected]>
Co-authored-by: Shi Xiaowei <[email protected]>
Co-authored-by: Michal Guzek <[email protected]>
Co-authored-by: peaceh-nv <[email protected]>
Co-authored-by: Guoming Zhang <[email protected]>
Co-authored-by: Wanli Jiang <[email protected]>
Co-authored-by: pcastonguay <[email protected]>
Co-authored-by: ruodil <[email protected]>
Co-authored-by: Linda <[email protected]>
Co-authored-by: Zhanrui Sun <[email protected]>
Co-authored-by: Yuxian Qiu <[email protected]>
Co-authored-by: Jiagan Cheng <[email protected]>
Co-authored-by: William Zhang <[email protected]>
Co-authored-by: Larry <[email protected]>
Co-authored-by: Sharan Chetlur <[email protected]>
Co-authored-by: Dom Brown <[email protected]>
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025
Signed-off-by: Nave Assaf <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: yechank <[email protected]>
Signed-off-by: Balaram Buddharaju <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]>
Signed-off-by: qqiao <[email protected]>
Signed-off-by: Superjomn <[email protected]>
Signed-off-by: Bo Deng <[email protected]>
Signed-off-by: Jin Li <[email protected]>
Signed-off-by: Yifei Zhang <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Erin Ho <[email protected]>
Signed-off-by: Chenfei Zhang <[email protected]>
Signed-off-by: Christina Zhang <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Pamela <[email protected]>
Signed-off-by: Hui Gao <[email protected]>
Signed-off-by: Alexandre Milesi <[email protected]>
Signed-off-by: Shixiaowei02 <[email protected]>
Signed-off-by: Michal Guzek <[email protected]>
Signed-off-by: peaceh <[email protected]>
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Wanli Jiang <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: ruodil <[email protected]>
Signed-off-by: Linda-Stadter <[email protected]>
Signed-off-by: Yuxian Qiu <[email protected]>
Signed-off-by: Jiagan Cheng <[email protected]>
Signed-off-by: William Zhang <[email protected]>
Signed-off-by: Dom Brown <[email protected]>
Co-authored-by: Nave Assaf <[email protected]>
Co-authored-by: Yechan Kim <[email protected]>
Co-authored-by: brb-nv <[email protected]>
Co-authored-by: Iman Tabrizian <[email protected]>
Co-authored-by: Emma Qiao <[email protected]>
Co-authored-by: Yan Chunwei <[email protected]>
Co-authored-by: Bo Deng <[email protected]>
Co-authored-by: Jin Li <[email protected]>
Co-authored-by: yifeizhang-c <[email protected]>
Co-authored-by: amitz-nv <[email protected]>
Co-authored-by: Erin <[email protected]>
Co-authored-by: chenfeiz0326 <[email protected]>
Co-authored-by: ChristinaZ <[email protected]>
Co-authored-by: Venky <[email protected]>
Co-authored-by: Pamela Peng <[email protected]>
Co-authored-by: HuiGao-NV <[email protected]>
Co-authored-by: milesial <[email protected]>
Co-authored-by: Shi Xiaowei <[email protected]>
Co-authored-by: Michal Guzek <[email protected]>
Co-authored-by: peaceh-nv <[email protected]>
Co-authored-by: Guoming Zhang <[email protected]>
Co-authored-by: Wanli Jiang <[email protected]>
Co-authored-by: pcastonguay <[email protected]>
Co-authored-by: ruodil <[email protected]>
Co-authored-by: Linda <[email protected]>
Co-authored-by: Zhanrui Sun <[email protected]>
Co-authored-by: Yuxian Qiu <[email protected]>
Co-authored-by: Jiagan Cheng <[email protected]>
Co-authored-by: William Zhang <[email protected]>
Co-authored-by: Larry <[email protected]>
Co-authored-by: Sharan Chetlur <[email protected]>
Co-authored-by: Dom Brown <[email protected]>