[TRTLLM-7361][feat] KV cache transfer for uneven pp #7117
Conversation
📝 Walkthrough

Threads per-pipeline-parallel (per-PP) attention-layer counts through CacheState and CacheTransceiver APIs, buffer allocation/transfer logic, CUDA split/concat kernels, agent connection offsetting, (de)serialization, Python bindings, MP/PP collectives, and tests/configs.
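For orientation, here is a minimal host-side sketch (with hypothetical helper names, not the actual CacheState/CacheTransceiver API) of how an uneven attention-layer split per PP rank and its prefix sums could be derived; in the PR itself each rank's local layer count is gathered via pp_allgather, as shown in the diagrams below.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical illustration: split `totalLayers` attention layers across `ppSize`
// pipeline stages, allowing an uneven remainder, and build the prefix sums that a
// split/concat path can use to map a global layer id to a PP rank.
std::vector<int32_t> splitLayersPerPP(int32_t totalLayers, int32_t ppSize)
{
    std::vector<int32_t> layersPerPP(ppSize, totalLayers / ppSize);
    for (int32_t r = 0; r < totalLayers % ppSize; ++r)
    {
        layersPerPP[r] += 1; // earlier ranks absorb the remainder
    }
    return layersPerPP;
}

std::vector<int32_t> prefixSums(std::vector<int32_t> const& layersPerPP)
{
    std::vector<int32_t> prefix(layersPerPP.size() + 1, 0);
    std::partial_sum(layersPerPP.begin(), layersPerPP.end(), prefix.begin() + 1);
    return prefix; // prefix[r] == first global layer id owned by rank r
}
```

For example, splitLayersPerPP(30, 4) yields {8, 8, 7, 7} with prefix sums {0, 8, 16, 23, 30}.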
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Py as Python caller
participant Dist as Distributed (TP/PP)
participant Binder as BindKvCacheTransceiver (py)
participant Cpp as CacheTransceiver (C++)
participant State as CacheState
Py->>Binder: create_kv_cache_transceiver(mapping, dist, kv_cache_manager, ...)
Binder->>Dist: pp_allgather(pp_layer_num)
Dist-->>Binder: pp_layer_num_per_pp_rank
Binder->>Cpp: CacheTransceiver(..., attentionLayerNumPerPP=pp_layer_num_per_pp_rank, dtype, ...)
Cpp->>State: CacheState(..., attentionLayerNumPerPP, dtype, ...)
State-->>Cpp: constructed
Cpp-->>Binder: instance
Binder-->>Py: transceiver
```

```mermaid
sequenceDiagram
autonumber
participant Fmt as CacheFormatter/MLAFormatter
participant Buf as CacheTransBufferManager
participant Dev as Device Buffers
Fmt->>Fmt: compute per-target bufferEleSizes & LayerNumPerTarget
Fmt->>Buf: getOrAllocateSendBuffers(bufferId, targetNum, bufferEleSizes, BM)
Buf->>Buf: compute per-element coverage & allocate slices
Buf-->>Fmt: (tensors, bufferCoverTargetNum, onlyUseDynamicBuffer)
Fmt->>Dev: slice/copy/send per target using element sizes
Note over Fmt,Dev: Receive path mirrors with getOrAllocateRecvBuffers
```

```mermaid
sequenceDiagram
autonumber
participant Mgr as AgentConnectionManager
participant Conn as AgentConnection
participant Peer as Remote Agent
Mgr->>Mgr: compute per-peer domain layer counts
Mgr->>Conn: setSenderState(receiverDesc, validIdx, offsetRatio)
Conn->>Peer: send(buffer, dst_addr + offset)
```

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
/bot run --add-multi-gpu-test

PR_Github #16024 [ run ] triggered by Bot
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1)
76-76
: pybind11 typo: use py::class_ and add missing namespace alias
- py::classh should be py::class_ (underscore), otherwise it won’t compile.
- Missing namespace alias: py is used but namespace py = pybind11; is not declared locally (don’t rely on transitive aliases).
Apply this diff:
```diff
@@
-void tb::CacheTransceiverBindings::initBindings(py::module_& m)
+namespace py = pybind11;
+
+void tb::CacheTransceiverBindings::initBindings(py::module_& m)
@@
-    py::classh<tb::BaseCacheTransceiver, PyCacheTransceiver>(m, "BaseCacheTransceiver")
+    py::class_<tb::BaseCacheTransceiver, PyCacheTransceiver>(m, "BaseCacheTransceiver")
@@
-    py::classh<tb::CacheTransceiver, tb::BaseCacheTransceiver>(m, "CacheTransceiver")
+    py::class_<tb::CacheTransceiver, tb::BaseCacheTransceiver>(m, "CacheTransceiver")
```

Also applies to: 88-88
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
680-685
: Bug: cacheType shadowing prevents kSELFKONLY from taking effect

The inner 'auto cacheType' shadows the outer variable; cacheType remains kSELF, breaking K-only tests and coverage.
Fix:
```diff
-    CacheType cacheType = CacheType::kSELF;
-    if (kvFactor == 1)
-    {
-        auto cacheType = CacheType::kSELFKONLY;
-    }
+    CacheType cacheType = (kvFactor == 1) ? CacheType::kSELFKONLY : CacheType::kSELF;
```

cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
125-126
: Don’t hard-fail on zero blocks (uneven PP can yield 0 local layers)

With uneven PP, some ranks can legitimately hold zero attention layers. TLLM_CHECK(blockNum > 0) will abort such ranks. Prefer a no-op return.
```diff
-    TLLM_CHECK(blockNum > 0);
+    if (blockNum == 0)
+    {
+        // No KV data for this rank; nothing to send.
+        return;
+    }
```
524-602
: inquireSupport: duplicate/incorrect checks and misleading message
- Duplicate kvFactor equality check appears twice.
- Heads-per-layer “== 1” guard compares selfConfig twice; the second should check destConfig.
- Log message “only support MLA” on the heads-per-layer==1 guard is misleading.
```diff
@@
-    if ((selfConfig.getModelConfig().mNbKvHeadsPerLayer.at(0) != 1)
-        || (selfConfig.getModelConfig().mNbKvHeadsPerLayer.at(0) != 1))
-    {
-        TLLM_LOG_WARNING("MLACacheFormatter::inquireSupport: only support MLA");
-        return false;
-    }
+    if (selfConfig.getModelConfig().mNbKvHeadsPerLayer.at(0) != 1
+        || destConfig.getModelConfig().mNbKvHeadsPerLayer.at(0) != 1)
+    {
+        TLLM_LOG_WARNING("MLACacheFormatter::inquireSupport: only support nbKvHeadsPerLayer == 1 for MLA");
+        return false;
+    }
@@
-    if (selfConfig.getAttentionConfig().mKvFactor != destConfig.getAttentionConfig().mKvFactor)
-    {
-        TLLM_LOG_WARNING("MLACacheFormatter::inquireSupport: only support same kv factor");
-        return false;
-    }
```

Also rename agentConnnecion -> agentConnection for clarity elsewhere in this file.
🧹 Nitpick comments (23)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
714-732
: skip_less_device(4) vs 8-rank config: align GPU gating

The scenario launches 8 ranks; the test currently skips only if <4 GPUs. If the intent is to co-locate ctx PP=4 and gen TP=4 on the same 4 GPUs, keep as-is and confirm capacity. If not, consider requiring 8 GPUs for consistency with other 8-rank tests (e.g., ctxpp4_genpp4).
Proposed change (if 8 GPUs are required):
```diff
-@pytest.mark.skip_less_device(4)
+@pytest.mark.skip_less_device(8)
```

Optional: deduplicate the repeated symlink boilerplate used across tests. For example, add a small helper and use it here.
```python
def _ensure_symlink(src, dst_dir, name):
    dst = os.path.join(dst_dir, name)
    if not os.path.islink(dst):
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        os.symlink(src, dst, target_is_directory=True)
```

Then, within this test:
```diff
-    src_dst_dict = {
-        llama_model_root:
-        f"{llm_venv.get_working_directory()}/TinyLlama/TinyLlama-1.1B-Chat-v1.0",
-    }
-    for src, dst in src_dst_dict.items():
-        if not os.path.islink(dst):
-            os.makedirs(os.path.dirname(dst), exist_ok=True)
-            os.symlink(src, dst, target_is_directory=True)
+    _ensure_symlink(
+        llama_model_root,
+        llm_venv.get_working_directory(),
+        "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    )
```

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml (3)
1-6
: Clarify precedence of free_gpu_memory_fraction (top-level vs per-server)

You set `free_gpu_memory_fraction: 0.2` at the top level and also under each server's `kv_cache_config`. If both are respected by different components, this can confuse intent. Prefer a single source of truth (per-server is usually clearer for disaggregated setups).

```diff
 model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
 hostname: localhost
 port: 8000
 backend: "pytorch"
 cuda_graph_config: null
-free_gpu_memory_fraction: 0.2
```
7-21
: Context servers config: PP=4 on 4 GPUs — confirm colocation with generation

PP=4 implies 4 context ranks; generation TP=4 implies 4 gen ranks. If both run concurrently on the same 4 GPUs (as used in L0), confirm memory fits for duplicated weights+KV across processes for TinyLlama and that `--oversubscribe` won't cause timeouts. Otherwise consider gating to 8 GPUs. Also, minor style nit: align list indentation under `urls` with the surrounding mapping for readability.

```diff
 cache_transceiver_config:
   backend: DEFAULT
   urls:
-  - "localhost:8001"
+    - "localhost:8001"
```
22-36
: Generation servers config: sanity on max_batch_size and list indent

`max_batch_size: 256` is fine for TinyLlama but is generous for L0; if flakiness appears, consider reducing to match comparable tests (e.g., 64 or 128) to lower peak memory. Also align the `urls` list indentation as above for consistency.

```diff
 cache_transceiver_config:
   backend: DEFAULT
   urls:
-  - "localhost:8002"
+    - "localhost:8002"
```

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (2)
24-24
: Header guard requirement vs. pragma once.

Coding guidelines require header guards named TRTLLM__H for headers. Replace or complement #pragma once with an include guard.
```diff
-#pragma once
+#ifndef TRTLLM_CACHE_SPLIT_CONCAT_H
+#define TRTLLM_CACHE_SPLIT_CONCAT_H
+#pragma once
```

…and at file end:
```diff
-} // namespace tensorrt_llm::executor::kv_cache
+} // namespace tensorrt_llm::executor::kv_cache
+
+#endif // TRTLLM_CACHE_SPLIT_CONCAT_H
```
66-68
: Typo: ouputSplitBlocks → outputSplitBlocks (propagate to definition)

Parameter name is misspelled; fix here and in the corresponding definition to keep declaration/definition names consistent per guidelines.
```diff
-void splitKVCacheDispatch(std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> const& kVCacheBlocksPerWindow,
-    std::vector<runtime::ITensor::SharedPtr>& ouputSplitBlocks, kv_cache::CacheState const& peerCacheState,
+void splitKVCacheDispatch(std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> const& kVCacheBlocksPerWindow,
+    std::vector<runtime::ITensor::SharedPtr>& outputSplitBlocks, kv_cache::CacheState const& peerCacheState,
```
100-104
: Initialize PP communicator conditionally.

Call create_pp_comm() only when PP is actually enabled (or pp_size > 1). This avoids constructing unnecessary groups and reduces the chance of attribute errors in non-PP configs.
```diff
-        self.create_tp_comm()
-        self.create_pp_comm()
+        self.create_tp_comm()
+        if self.has_pp and self.pp_size > 1:
+            self.create_pp_comm()
+        else:
+            self.pp_comm = None
```
139-142
: Guard PP group creation and handle COMM_NULL.

Be defensive if mapping.pp_group is missing/empty or Create_group yields COMM_NULL.
```diff
-    def create_pp_comm(self):
-        new_group = mpi_comm().group.Incl(self.mapping.pp_group)
-        self.pp_comm = mpi_comm().Create_group(new_group)
+    def create_pp_comm(self):
+        pp_group = getattr(self.mapping, "pp_group", None)
+        if not pp_group:
+            self.pp_comm = None
+            return
+        new_group = mpi_comm().group.Incl(pp_group)
+        comm = mpi_comm().Create_group(new_group)
+        # mpi4py uses MPI.COMM_NULL for empty/invalid groups
+        self.pp_comm = comm if comm else None
```
115-122
: Expose per-PP layer distribution in toString for easier debugging

Now that ParallelConfig carries mAttentionLayerNumPerPP, include it in toString(). This materially helps diagnose uneven PP distributions over logs.
Example addition inside CacheState::toString():
sstring << "attnLayerNumPerPP:["; for (auto v : mParallelConfig.mAttentionLayerNumPerPP) { sstring << v << ","; } sstring << "]\n";Also applies to: 160-179
1-15
: Update copyright year

Other changed files use 2025; this header still says 2023-2024. Consider updating to 2025 for consistency.
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1)
1-16
: Align SPDX header year with other updated files

Other touched files use 2025. Consider updating 2022-2024 to 2025 for consistency.
```diff
- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (3)
324-329
: Fix logging typo and clarify fields

Minor: "dupliacete" -> "duplicate", and prefer consistent naming.
- " formatOutput bufferTargetNum: %d, targetNum: %d, peerDuplicateHeadFactor: %d dupliacete:%d " + " formatOutput bufferTargetNum: %d, targetNum: %d, peerDuplicateHeadFactor: %d duplicate:%d "
613-651
: Typos and stronger checks in receive-side per-target sizing
- Rename getTargetBufferEleSzie -> getTargetBufferEleSize; valideTpSize -> validTpSize.
- Use consistent formatting for size_t (%zu) and keep checks readable.
```diff
-    auto getTargetBufferEleSzie = [&]()
+    auto getTargetBufferEleSize = [&]()
     {
         if (outputBuffersPerWindow.size() > 1)
         {
             std::vector<size_t> bufferSizeForTarget(targetNum, 0);
             for (size_t i = 0; i < targetNum; i++)
             {
                 bufferSizeForTarget[i] = cacheBlockSizeSum / targetNum;
             }
             // TODO: LayerNumbufferTargetNum for VWSA
             return std::make_pair(bufferSizeForTarget, std::vector<SizeType32>(targetNum, 0));
         }
-        size_t valideTpSize = pickUpConnections.size() / targetInfo.mDomainPPSize;
-        TLLM_CHECK_WITH_INFO(cacheBlockSizeSum % valideTpSize == 0,
-            "cacheBlockSizeSum must be divisible by valideTpSize %ld", valideTpSize);
-        TLLM_CHECK_WITH_INFO((cacheBlockSizeSum % (selfAttentionLayerNum * valideTpSize)) == 0,
-            "cacheBlockSizeSum must be divisible by valideTpSize %ld * selfAttentionLayerNum %d", valideTpSize,
-            selfAttentionLayerNum);
+        size_t validTpSize = pickUpConnections.size() / targetInfo.mDomainPPSize;
+        TLLM_CHECK_WITH_INFO(cacheBlockSizeSum % validTpSize == 0,
+            "cacheBlockSizeSum must be divisible by validTpSize %zu", validTpSize);
+        TLLM_CHECK_WITH_INFO((cacheBlockSizeSum % (selfAttentionLayerNum * validTpSize)) == 0,
+            "cacheBlockSizeSum must be divisible by validTpSize %zu * selfAttentionLayerNum %d", validTpSize,
+            selfAttentionLayerNum);
         TLLM_CHECK(targetNum == pickUpConnections.size());
-        size_t baseEleSize = cacheBlockSizeSum / (valideTpSize * selfAttentionLayerNum);
+        size_t baseEleSize = cacheBlockSizeSum / (validTpSize * selfAttentionLayerNum);
@@
-    auto [bufferEleSizes, LayerNumbufferTargetNum] = getTargetBufferEleSzie();
+    auto [bufferEleSizes, LayerNumbufferTargetNum] = getTargetBufferEleSize();
```
706-716
: Use correct format specifiers for size_t in logs

processIdx and sizes are size_t; prefer %zu or cast to unsigned long long to avoid UB.
```diff
-    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " start recv bufferIdx: %d size:%ld", processIdx,
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " start recv bufferIdx: %zu size:%zu", processIdx,
         buffer->getSizeInBytes());
@@
-    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " recv bufferIdx: %d size:%ld", processIdx,
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " recv bufferIdx: %zu size:%zu", processIdx,
         buffer->getSizeInBytes());
```
1-15
: Update copyright year

Guidelines require current year in headers. This file still has 2023-2024; update to 2025.
17-17
: Prefer include guards over pragma once (repo guideline)

Coding guidelines specify named include guards for headers. Consider switching to an include guard (TRTLLM_BATCH_MANAGER_CACHETRANSCEIVER_H).
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3)
340-356
: Buffer coverage computation: avoid noisy warning when using dynamic buffers

bufferCoverTargetNum is computed before considering mOnlyUseDynamicBuffer; the warning logged later can be misleading when dynamic buffers are enabled.
Apply:
@@ - TLLM_LOG_DEBUG("getOrAllocateBuffers bufferCoverTargetNum:%d", bufferCoverTargetNum); - if (bufferCoverTargetNum < static_cast<size_t>(targetNum)) + TLLM_LOG_DEBUG("getOrAllocateBuffers bufferCoverTargetNum:%zu", bufferCoverTargetNum); + if (!mOnlyUseDynamicBuffer && bufferCoverTargetNum < static_cast<size_t>(targetNum))
357-365
: Format specifier mismatch for size_t

Use %zu for size_t to avoid UB on some platforms.
- TLLM_LOG_DEBUG("getOrAllocateBuffers bufferCoverTargetNum:%d", bufferCoverTargetNum); + TLLM_LOG_DEBUG("getOrAllocateBuffers bufferCoverTargetNum:%zu", bufferCoverTargetNum);
367-371
: Agent-provided bufferId: strengthen assumption or relax invariant

You require mBufferIndexFlag[bufferId]==1. When bufferId originates from AgentConnection, ensure that the CacheTransBufferManager actually marked that ID as acquired, otherwise freeBufferIndexForRecv will underflow mConcurrence. Suggest either:
- Document that AgentConnection must call assignBufferIndexForRecv internally before exposing the id; or
- Add a public method to mark an externally reserved ID, or relax the check behind a dedicated “trusted” path.
Please confirm AgentConnection guarantees the flag is 1 before returning getCacheBufferId(). If not, I can draft a small API change to make this explicit.
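For illustration, one possible shape for such an API change; everything below is hypothetical (the class name, method, and members are placeholders, not the existing CacheTransBufferManager interface):

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical sketch: allow an externally reserved buffer id (e.g., one handed out by
// AgentConnection) to be marked as acquired, so that a later free call does not
// underflow the concurrence counter.
class TransBufferBookkeeping
{
public:
    explicit TransBufferBookkeeping(std::size_t numBuffers)
        : mBufferIndexFlag(numBuffers, 0)
    {
    }

    void markBufferIndexAcquired(std::size_t bufferId)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mBufferIndexFlag.at(bufferId) == 0)
        {
            mBufferIndexFlag.at(bufferId) = 1;
            ++mConcurrence;
        }
    }

private:
    std::mutex mMutex;
    std::vector<int> mBufferIndexFlag;
    std::atomic<int> mConcurrence{0};
};
```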
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
1181-1186
: UCX env var comment typo“destroies” -> “destroys”; “communicatoers” -> “communicators.” Minor test log polish.
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
309-351
: Z-copy receive: require one connection, okay

The early path is coherent. Suggest adding a defensive check that outputBuffers is non-empty (mirrors send path).
41-55
: pickRecvConnections: index mapping assumption

You return the first mDomainPPSize indices, assuming connections.size() == mIRanks.size() and receive fan-in equals PP domain size. If mDomainTPSize > 1, ensure ConnectionManager provides exactly mIRanks.size() connections and the chosen subset is correct for MLA DP. Otherwise, derive indices directly from targetInfo.mIRanks.
If needed, I can refactor this to compute indices from targetInfo.mIRanks to be explicit.
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
1401-1401
: Minor inconsistency in size validation.

The size check uses `sizeof(uint64_t)` but should be consistent with the actual element size being stored. Apply this fix for consistency:
```diff
-    TLLM_CHECK(PtrsDeviceBuffer->getSizeInBytes() == cachePtrs.size() * sizeof(uint64_t));
+    TLLM_CHECK(PtrsDeviceBuffer->getSizeInBytes() == cachePtrs.size() * sizeof(cachePtrs[0]));
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (19)
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1 hunks)
- cpp/include/tensorrt_llm/executor/dataTransceiverState.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (8 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (4 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (2 hunks)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (5 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (39 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
- cpp/tensorrt_llm/executor/serialization.cpp (3 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (25 hunks)
- tensorrt_llm/_torch/distributed/communicator.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (4 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_disaggregated.py (2 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/distributed/communicator.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
tests/integration/defs/disaggregated/test_disaggregated.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/distributed/communicator.py
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
cpp/tensorrt_llm/executor/serialization.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
tests/integration/defs/disaggregated/test_disaggregated.py
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
cpp/tensorrt_llm/executor/serialization.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
cpp/tensorrt_llm/executor/serialization.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
cpp/tensorrt_llm/executor/serialization.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
🧠 Learnings (4)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/executor/serialization.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
🧬 Code graph analysis (10)
tensorrt_llm/_torch/distributed/communicator.py (2)
- tensorrt_llm/_utils.py (1): mpi_comm (475-476)
- tensorrt_llm/mapping.py (1): pp_group (368-369)

cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (2): CacheTransceiver (111-213), CacheTransceiver (215-222)

tensorrt_llm/_torch/pyexecutor/_util.py (1)
- tensorrt_llm/_torch/models/checkpoints/base_weight_mapper.py (1): mapping (152-153)

cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2): getOrAllocateRecvBuffers (305-311), getOrAllocateRecvBuffers (305-307)

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (4): targetIRanks (181-185), targetIRanks (181-182), concatKvCacheV2Dispatch (1619-1658), concatKvCacheV2Dispatch (1619-1622)

cpp/include/tensorrt_llm/executor/dataTransceiverState.h (3)
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1): nodiscard (67-90)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1): nodiscard (821-875)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1): nodiscard (524-602)

tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (3)
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1): tensorrt_llm (32-131)
- tensorrt_llm/_torch/distributed/communicator.py (2): Distributed (16-95), pp_allgather (152-153)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1): KVCacheManager (141-931)

cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2): targetIRanks (181-185), targetIRanks (181-182)

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (3): numLayers (289-391), numLayers (605-734), numLayers (605-607)

tests/integration/defs/disaggregated/test_disaggregated.py (1)
- tests/integration/defs/conftest.py (4): disaggregated_test_root (2339-2344), llm_venv (707-723), disaggregated_example_root (270-275), llama_model_root (964-1039)
🔇 Additional comments (39)
tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)
36-36
: Confirm 8-rank test in a 4-GPU lane won’t overload CI
`test_disaggregated_ctxpp4_gentp4` uses 8 ranks (see get_test_config) but is added under the 4-GPU H100 lane. With `--oversubscribe`, this may still pass for TinyLlama, but it doubles process density per GPU and could increase flakiness/timeout risk. Please confirm this is intentional for L0 and that memory fits, or gate it under an 8-GPU lane and bump the test's skip marker accordingly.

tests/integration/defs/disaggregated/test_disaggregated.py (1)
79-81
: Config file presence verified; mapping confirmed LGTM
- Confirmed the YAML exists at `tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml`.
- Verified references in `tests/integration/defs/disaggregated/test_disaggregated.py` (lines 79–81 and the `test_disaggregated_ctxpp4_gentp4` definition at line 717).

No further action needed—approving these changes.
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1)
42-50
: I’ve added scripts to locate and inspect the relevant declarations and usage in cacheSplitConcat.h—once we have the output, we can verify if the accessor needs const-safety, bounds checks, and whether the TP/PP ordering assumption is valid.

cpp/tensorrt_llm/executor/serialization.cpp (1)
555-556
: LGTM: serialization and size accounting include the new field.

The new field is serialized immediately after DPsize, and serializedSize reflects it. This keeps the struct's write/read symmetry intact within this revision.
Also applies to: 572-573
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (4)
31-35
: Plumbing dist through the factory is correct and keeps call sites tidy.

Signature change and call site update look good.
63-65
: Constructor wiring is consistent with the signature change.

Passing dist into BindKvCacheTransceiver keeps responsibility localized.
110-112
: Argument order to CacheTransceiverCpp matches the updated C++ constructors.

Ordering is world_config → attentionLayerNumPerPP → dtype → attention_type → config, which aligns with the C++ signatures in cacheTransceiver.h.
96-113
: Incorrect assumption about TorchDist.pp_allgather

The claim that "TorchDist has no pp_* methods" is not accurate. In tensorrt_llm/_torch/distributed/communicator.py (lines 152–154), the TorchDist class indeed implements pp_allgather, so dist.pp_allgather(...) will succeed at runtime for the PyTorch-based Distributed:
```python
def pp_allgather(self, obj):
    return self.pp_comm.allgather(obj)
```

Because TorchDist implements this method, the existing code does not "break for TorchDist" as stated. No refactoring or fallback is needed to support pp_allgather when using TorchDist.
Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/pyexecutor/_util.py (2)
553-555
: Call-site update to pass dist is correct.

This matches the new create_kv_cache_transceiver signature and preserves behavior when the transceiver is disabled.
518-536
: Scheduler capacity tweak with attention DP: good, but re-evaluate interaction with new PP gather.

Given the new PP allgather during KV transceiver init, ensure the warm-up path doesn't call into create_kv_cache_transceiver before the groups are ready. If initialization order can vary, guard against using dist before PP groups exist (fallback in kv_cache_transceiver.py will help).
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)
92-98
: LGTM – binding updated and Python call-sites verified
- Confirmed that in `tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py`, `CacheTransceiverCpp` is now invoked with nine arguments, including the new `pp_layer_num_per_pp_rank` parameter after `world_config`.
- Searched all Python files for any other `CacheTransceiver` instantiations or references and found no outdated eight-argument calls.

cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1)
90-95
: Constructor binding updated to include attention_layer_num_per_pp — ordering looks consistent

The new parameter is placed after world_config and before dtype, matching the C++ constructor. Good.
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
254-258
: Tightening zero-copy gating to exact TP/PP match is correct

Zero-copy requires identical layouts. Enforcing equality on TP and PP avoids subtle shape mismatches. LGTM.
804-806
: Switch to concatKvCacheV2Dispatch is consistent with V2 split/concat path

Matches the new V2 per-target vectorized path; OK.
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (3)
72-73
: Per-target sizes for send buffers — API change looks good

Moving from a single size to per-target sizes aligns with the new vectorized split/concat and buffer allocation logic.
76-77
: Per-target sizes for recv buffers — API change looks good

Consistent with send side; OK.
95-97
: All call sites migrated — no stale overloads remain

A project-wide search confirmed that only the new overloads are referenced:

- `getOrAllocateSendBuffers` / `getOrAllocateRecvBuffers` in
  - cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  - cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  - cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  - cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp/.h
- `getOrAllocateBuffers` in
  - cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp/.h

No stale overloads remain; no further action required.
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)
73-89
: Inspect factory and test helper constructor calls

I've provided a script to confirm that:

- The `CacheTransceiverFactory::createCacheTransceiver` invocation was updated to match the new constructor signature (i.e., passes `attentionLayerNumPerPP` before `dataType`).
- The two `setUpCacheTransceiver` helper definitions in the C++ tests call the constructor with the new argument order.

Once the output shows that both the factory and test helpers use the new `(…, worldConfig, attentionLayerNumPerPP, dataType, …)` ordering, no further call-site updates are required. If any calls still pass `dataType` immediately after `worldConfig`, you'll need to reorder those to insert `attentionLayerNumPerPP` first.

cpp/tests/batch_manager/cacheTransceiverTest.cpp (4)
616-638
: Good: helper to derive per-PP layer counts mirrors uneven PP

The getLayerNumPPRank() helper and mAttentionLayerNumPerPP wiring look correct and readable.
724-731
: Pass-through of per-PP layer vector to CacheState is correct

Both mCacheState and mContextCacheState propagate attentionLayerNumPerPP; matches production constructor signature.
962-983
: Zero-layer ranks: startLayerId computation handles uneven PP

Nice: startLayerId sums per-PP counts before mPpRank, and you special-case window attention. A small sketch of that computation follows.
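Under uneven PP the start layer id of a rank reduces to a prefix sum of the per-PP counts of the earlier ranks; a minimal sketch (illustrative names, not the test's exact code):

```cpp
#include <numeric>
#include <vector>

// startLayerId for a PP rank is the sum of the attention-layer counts of all earlier ranks.
int startLayerIdForRank(std::vector<int> const& attentionLayerNumPerPP, int ppRank)
{
    return std::accumulate(attentionLayerNumPerPP.begin(), attentionLayerNumPerPP.begin() + ppRank, 0);
}
```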
1419-1439
: MLA even-layer instantiations increase coverage

The added EvenLayer MLAs with/without DP help guard the new per-PP path.
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)
111-137
: Constructor forward of per-PP vector to CacheState looks correct

Forwarding attentionLayerNumPerPP into CacheState aligns with the rest of the PR.
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)
201-245
: Chunked send logic looks sound with per-target buffers

Using per-target sizes and copying into a selected buffer (pre-alloc vs split cache) makes sense. Good synchronization points.
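To make the per-target slicing concrete, a rough sketch of deriving each target's (offset, size) slice from per-target element sizes (illustrative only; the formatter additionally chooses between pre-allocated and dynamically split buffers):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Given per-target element counts, compute the (offset, size) slice of a flat send
// buffer that belongs to each target; each slice is then copied/sent independently.
std::vector<std::pair<std::size_t, std::size_t>> sliceOffsets(std::vector<std::size_t> const& bufferEleSizes)
{
    std::vector<std::pair<std::size_t, std::size_t>> slices;
    slices.reserve(bufferEleSizes.size());
    std::size_t offset = 0;
    for (auto size : bufferEleSizes)
    {
        slices.emplace_back(offset, size);
        offset += size;
    }
    return slices;
}
```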
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (15)
67-103
: LGTM! Comprehensive per-PP layer distribution calculation.

The implementation correctly calculates layer overlaps between peer and self PP ranks using per-PP attention layer counts. The logic properly handles uneven layer distribution across pipeline parallel ranks and validates the total layer count matches expectations.
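Conceptually the overlap calculation is an interval intersection over the global layer index space; a simplified host-side sketch (not the actual code in cacheSplitConcat.cu):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Number of attention layers shared between peer PP rank `peerRank` and self PP rank
// `selfRank`, where each side describes its ownership with a per-PP layer-count vector.
int layerOverlap(std::vector<int> const& peerLayersPerPP, int peerRank,
    std::vector<int> const& selfLayersPerPP, int selfRank)
{
    auto rangeOf = [](std::vector<int> const& layersPerPP, int rank)
    {
        int begin = 0;
        for (int r = 0; r < rank; ++r)
        {
            begin += layersPerPP[r];
        }
        return std::make_pair(begin, begin + layersPerPP[rank]);
    };
    auto [peerBegin, peerEnd] = rangeOf(peerLayersPerPP, peerRank);
    auto [selfBegin, selfEnd] = rangeOf(selfLayersPerPP, selfRank);
    return std::max(0, std::min(peerEnd, selfEnd) - std::max(peerBegin, selfBegin));
}
```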
520-543
: Well-designed device helper for per-PP layer mapping.

The `getLayerIdInDomainPPandRankInDomainPP` function efficiently maps a global layer ID to domain-PP local indices using shared memory optimization. The use of `__shared__` variables and `__syncthreads()` ensures thread synchronization within the block.
548-548
: Kernel signature updated correctly for per-PP support.

The addition of the `uint64_t* prefixLayerNumDevPtr` parameter enables kernels to access per-PP layer distribution information on the device.
565-569
: Proper integration of per-PP layer mapping in MLA kernel.

The kernel correctly calls the helper function to determine the appropriate PP rank and local layer indices for the current layer ID, enabling proper cache splitting for uneven PP configurations.
577-577
: Cache indexing updated for per-PP layer distribution.

The kernel now uses `rankInDomainPP` for cache indexing and `layerNumInSpecPP` for block size calculations, correctly adapting to variable layer counts per PP rank.

Also applies to: 583-583
618-618
: Standard KV cache kernel properly updated for per-PP support.

All the necessary changes have been applied consistently:

- Added the `prefixLayerNumDevPtr` parameter
- Integrated the per-PP layer mapping helper
- Updated cache indexing with `rankInDomainPP`
- Used `layerNumInSpecPP` for block size calculations

Also applies to: 644-648, 665-665, 670-670
812-812
: MLA concat kernel consistently updated.

The concat kernel mirrors the split kernel changes appropriately, maintaining consistency in per-PP layer handling between split and concat operations.
Also applies to: 827-831, 839-839, 844-844
873-873
: Standard concat kernel properly updated.

The concat kernel changes are consistent with the corresponding split kernel, ensuring proper reconstruction of KV cache blocks from the per-PP distributed format.
Also applies to: 890-894, 911-911, 916-916
1020-1020
: Buffer management updated for 64-bit pointers.

The change from `std::vector<T*>` to `std::vector<uint64_t>` for cache pointers accommodates the device pointer passing mechanism and aligns with the kernel's `uint64_t*` parameter type.

Also applies to: 1043-1043, 1051-1051
1053-1059
: Prefix sum calculation for per-PP layer distribution.

The prefix sum array (`prefixLayerNum`) enables efficient binary search-like operations in the device helper function to map global layer IDs to PP-local indices.
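A host-side equivalent of that lookup, assuming `prefixLayerNum[r]` holds the first global layer id owned by domain-PP rank r (the device helper performs the same search and caches the result in shared memory):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Map a global layer id to (rankInDomainPP, layerIdInDomainPP) using the prefix-sum array.
std::pair<int, int> mapLayerToDomainPP(std::vector<uint64_t> const& prefixLayerNum, uint64_t layerId)
{
    int rank = 0;
    while (rank + 1 < static_cast<int>(prefixLayerNum.size()) && prefixLayerNum[rank + 1] <= layerId)
    {
        ++rank;
    }
    return {rank, static_cast<int>(layerId - prefixLayerNum[rank])};
}
```

For example, with prefix sums {0, 8, 16, 23, 30}, global layer 20 maps to rank 2, local layer 4.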
1120-1126
: Device pointer setup correctly implemented.

The device pointers are properly segmented in the unified buffer: input blocks, output caches, and prefix layer numbers. The calculation of `numLayers` now uses per-PP attention layer counts instead of assuming uniform distribution.
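A simplified sketch of how the three segments might be packed into one flat uint64_t staging array before the host-to-device copy (illustrative only; the actual code also accounts for per-window layout and exact segment offsets):

```cpp
#include <cstdint>
#include <vector>

// Pack input block pointers, output cache pointers, and the prefix layer-count array into
// a single flat uint64_t vector; the kernel receives one device pointer and indexes into
// the three segments by known offsets.
std::vector<uint64_t> packKernelArgs(std::vector<void*> const& inputBlocks,
    std::vector<void*> const& outputCaches, std::vector<uint64_t> const& prefixLayerNum)
{
    std::vector<uint64_t> packed;
    packed.reserve(inputBlocks.size() + outputCaches.size() + prefixLayerNum.size());
    for (auto* p : inputBlocks)
    {
        packed.push_back(reinterpret_cast<uint64_t>(p));
    }
    for (auto* p : outputCaches)
    {
        packed.push_back(reinterpret_cast<uint64_t>(p));
    }
    packed.insert(packed.end(), prefixLayerNum.begin(), prefixLayerNum.end());
    return packed;
}
```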
1152-1152
: All kernel invocations consistently updated.

Every kernel launch site has been updated to pass the `prefixLayerNumDevPtr` parameter, ensuring consistent behavior across all data type and vectorization variants.

Also applies to: 1166-1166, 1176-1176, 1190-1190, 1204-1204, 1219-1219, 1237-1237, 1252-1252, 1266-1266, 1281-1281
1362-1362
: Concat function buffer management updated.

The concat path maintains consistency with the split path in buffer management, using `uint64_t` for cache pointers and proper device memory layout.

Also applies to: 1382-1382, 1390-1390
1392-1398
: Per-PP layer distribution properly handled in concat.

The concat function correctly reconstructs the prefix layer number array and derives the appropriate layer count for the current PP rank, maintaining consistency with the split operation.
Also applies to: 1444-1449
1478-1478
: All concat kernel invocations updated.

All concat kernel launch sites consistently pass the `prefixLayerNumDevPtr` parameter, completing the integration of per-PP layer distribution support.

Also applies to: 1492-1492, 1502-1502, 1516-1516, 1530-1530, 1545-1545, 1563-1563, 1578-1578, 1592-1592, 1607-1607
PR_Github #16024 [ run ] completed with state
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (1)
194-199
: Build fix: call the right comparator for StaticBatchingStats

The helper is named compareStaticBatchingStats, but the code calls compareStaticBatching, causing a compile error.
```diff
-    else if constexpr (std::is_same<T, texec::StaticBatchingStats>::value)
-    {
-        compareStaticBatching(val, val2);
-    }
+    else if constexpr (std::is_same<T, texec::StaticBatchingStats>::value)
+    {
+        compareStaticBatchingStats(val, val2);
+    }
```

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
382-382
: Wrong printf specifier for size_t in TLLM_LOG_INFO

defaultTransSize is size_t; using %d is UB on LP64. Use %zu (or cast to unsigned long long with %llu).
- TLLM_LOG_INFO("defaultTransSize: %d", defaultTransSize); + TLLM_LOG_INFO("defaultTransSize: %zu", static_cast<size_t>(defaultTransSize));
🧹 Nitpick comments (6)
cpp/tests/unit_tests/executor/agentCommTest.cpp (2)
150-153
: Avoid const_cast on connection; prefer const-correct API or a non-const handle

The test obtains a const pointer and strips constness to call sendRequestAndBufferInfo. This is brittle and could hide real const-correctness issues.
Consider one of:
- Make sendRequestAndBufferInfo const if it is logically non-mutating.
- Have AgentConnectionManager expose a non-const AgentConnection handle for operations that mutate connection state.
- Change getConnections(...) to return non-const pointers/refs when mutation is required by callers.
If changing API is out-of-scope for this PR, at least add a TODO to remove const_cast later.
```diff
-    auto agentConnection0 = const_cast<tensorrt_llm::executor::kv_cache::AgentConnection*>(
-        dynamic_cast<tensorrt_llm::executor::kv_cache::AgentConnection const*>(connection0));
+    // TODO(TRTLLM-7361): Avoid const_cast by returning a non-const handle or making the method const.
+    auto agentConnection0 = const_cast<tensorrt_llm::executor::kv_cache::AgentConnection*>(
+        dynamic_cast<tensorrt_llm::executor::kv_cache::AgentConnection const*>(connection0));
```
165-173
: Tighten async capture and device setup to reduce flakiness
- Capture only what you need by value; avoid implicit [&] to prevent accidental reference to a soon-to-be-invalid object.
- Optionally set the CUDA device on the main thread before host->device memcpy for clarity, even if device 0 is default.
```diff
-    auto future = std::async(std::launch::async,
-        [&]()
-        {
-            TLLM_CUDA_CHECK(cudaSetDevice(0));
-            connection1->send(dataContext, sendBuffer->data(), sendSize);
-        });
+    auto future = std::async(std::launch::async,
+        [conn = connection1, sendSize, dataPtr = sendBuffer->data(), rid = requestId]() {
+            TLLM_CUDA_CHECK(cudaSetDevice(0));
+            DataContext ctx{static_cast<int>(rid)};
+            conn->send(ctx, dataPtr, sendSize);
+        });
```

Optionally (non-blocking):
```diff
-    TLLM_CUDA_CHECK(cudaMemcpy(sendBuffer->data(), sendData.data(), sendSize, cudaMemcpyHostToDevice));
+    TLLM_CUDA_CHECK(cudaSetDevice(0));
+    TLLM_CUDA_CHECK(cudaMemcpy(sendBuffer->data(), sendData.data(), sendSize, cudaMemcpyHostToDevice));
```

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (4)
349-352
: Per-target sizing vector: add a negative test for mismatched lengths

Great update to per-target sizes. Consider adding a test where targetSizeVec.size() != targetNum to ensure the manager rejects/handles mismatches deterministically.
If desired, I can draft a small test that expects an error/empty optional when lengths differ.
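A minimal sketch of such a negative test, reusing the fixture members that the existing tests in this file already rely on; the fixture name and the exact failure mode (throw vs. TLLM check) are assumptions to confirm:

```cpp
// Hypothetical negative test: a targetSizeVec whose length differs from targetNum
// should be rejected rather than silently mis-sliced (exact failure mode assumed).
TEST_F(CacheTransBufferTest, MismatchedTargetSizeVecIsRejected)
{
    std::vector<size_t> targetSizeVec = {256, 256, 256}; // only 3 entries
    size_t const targetNum = 4;                          // but 4 targets requested
    EXPECT_ANY_THROW(
        mTransBufferManager->getOrAllocateSendBuffers(bufferId, targetNum, targetSizeVec, bufferManager));
}
```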
397-400
: Exercise heterogeneous per-target sizes

Current tests use identical sizes for all targets. Add a case with different sizes per target to verify the allocator respects per-target sizing and coverage math.
Example (outside this hunk):
```cpp
std::vector<size_t> targetSizeVec = {256, 1024, 512, 768};
auto [sendBuffers, covered, onlyDyn] = mTransBufferManager->getOrAllocateSendBuffers(
    bufferId, targetSizeVec.size(), targetSizeVec, bufferManager);
// Validate sizes individually and covered count.
```
412-415
: Reinitialization pattern is correct; consider asserting buffer reuse behavior

You rebuild targetSizeVec before each call, which is good. Add assertions to confirm whether the underlying static buffer is reused vs. dynamic allocation, to guard regressions in reuse logic.
424-427
: Coverage: add a case where accumulated per-target bytes exceed defaultTransSize by a small epsilon

This boundary test can catch off-by-one rounding when splitting defaultTransSize across targets.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (4 hunks)
- cpp/tests/unit_tests/executor/agentCommTest.cpp (1 hunks)
- cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (1 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_disaggregated.py (4 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
- tests/integration/defs/disaggregated/test_disaggregated.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
🧬 Code graph analysis (1)
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (3): numLayers (289-391), numLayers (605-734), numLayers (605-607)
🔇 Additional comments (1)
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
93-95
: Constructor update to CacheState looks correct for ppSize == 1

Passing attentionLayerNumPerPP as {numLayers} matches ppSize = 1 in this fixture. No functional issues spotted here.
aae4a4e to a7e30ba
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (2)
49-51
: Fix typo in public API: supportFbaricMemory → supportFabricMemory

Public API has a misspelling that will leak into downstream callers and bindings.
Apply this diff to correct the symbol and optionally provide a deprecated alias for transition:
```diff
-    static bool supportFbaricMemory();
+    // Returns whether fabric memory is supported on this platform.
+    static bool supportFabricMemory();
+
+    // TODO: remove in a future release. Kept temporarily to avoid breakage.
+    [[deprecated("Use supportFabricMemory()")]]
+    static bool supportFbaricMemory() { return supportFabricMemory(); }
```

I can scan the repo and update call sites if you want.
20-29
: Include required headers for std::unique_ptr and std::map in a public header. This header uses std::unique_ptr and std::map but doesn't include <memory> or <map>. Relying on transitive includes from other headers is brittle.
Apply:
#include "tensorrt_llm/runtime/iTensor.h" #include <atomic> #include <condition_variable> #include <cstddef> #include <optional> #include <unordered_map> #include <vector> +#include <map> +#include <memory>tensorrt_llm/_torch/distributed/communicator.py (1)
16-21
: Declare PP collectives in the Distributed interface. Callers (e.g., kv_cache_transceiver) use dist.pp_allgather; enforce presence via ABC to avoid runtime AttributeError.
class Distributed(ABC): @@ @abstractmethod def allgather(self, obj, root=0): pass + + # Pipeline-parallel collectives + @abstractmethod + def pp_allgather(self, obj): + """PP allgather. For pp_size==1, should return [obj].""" + pass + + @abstractmethod + def pp_gather(self, obj): + """PP gather. For pp_size==1, should return [obj].""" + pass + + @abstractmethod + def pp_broadcast(self, obj, root=0): + """PP broadcast. For pp_size==1, should return obj.""" + passtensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1)
139-144
: Python/C++ binding mismatch: pre_alloc_buffer_size now expects a dict (per-window), not an int. The C++ signature is preAllocBufferSize(std::map<SizeType32, SizeType32> const&, ...), but this wrapper still forwards an int, likely causing the pipeline failure.
Apply this backward-compatible wrapper:
- @staticmethod - def pre_alloc_buffer_size(kv_cache_size_per_token: int, - cache_transceiver_config: CacheTransceiverConfig): - return CacheTransBufferManagerCpp.pre_alloc_buffer_size( - kv_cache_size_per_token, cache_transceiver_config) + @staticmethod + def pre_alloc_buffer_size(kv_cache_size_bytes_per_token_per_window, + cache_transceiver_config: CacheTransceiverConfig): + """ + kv_cache_size_bytes_per_token_per_window: + - dict[window_size:int] = size_bytes_per_token:int (preferred), or + - int (deprecated): treated as a single-window mapping {0: value}. + """ + if isinstance(kv_cache_size_bytes_per_token_per_window, int): + # Backward-compat: assume single window (key 0). + mapping = {0: kv_cache_size_bytes_per_token_per_window} + else: + mapping = kv_cache_size_bytes_per_token_per_window + return CacheTransBufferManagerCpp.pre_alloc_buffer_size( + mapping, cache_transceiver_config)If you share how window keys are defined elsewhere, I’ll adapt the compat mapping to the canonical key.
♻️ Duplicate comments (5)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (3)
103-109
: Divisibility checks contradict uneven PP; validate per-PP distribution instead. Uneven PP is disallowed by requiring layerNum % pp == 0. Replace with validation against attentionLayerNumPerPP.
Apply:
- auto selfPPNum = selfConfig.getParallelConfig().mPipelineParallelism; - auto selfAllLayerNum = selfConfig.getModelConfig().mNbKvHeadsPerLayer.size(); - auto destPPNum = destConfig.getParallelConfig().mPipelineParallelism; - auto destAllLayerNum = destConfig.getModelConfig().mNbKvHeadsPerLayer.size(); - TLLM_CHECK_WITH_INFO(selfAllLayerNum % selfPPNum == 0, " For VWSA selfAllLayerNum must be divisible by selfPPNum"); - TLLM_CHECK_WITH_INFO(destAllLayerNum % destPPNum == 0, "For VWSA destAllLayerNum must be divisible by destPPNum"); + auto const selfPPNum = selfConfig.getParallelConfig().mPipelineParallelism; + auto const destPPNum = destConfig.getParallelConfig().mPipelineParallelism; + auto const selfAllLayerNum + = static_cast<SizeType32>(selfConfig.getModelConfig().mNbKvHeadsPerLayer.size()); + auto const destAllLayerNum + = static_cast<SizeType32>(destConfig.getModelConfig().mNbKvHeadsPerLayer.size()); + auto const& selfPerPP = selfConfig.getParallelConfig().mAttentionLayerNumPerPP; + auto const& destPerPP = destConfig.getParallelConfig().mAttentionLayerNumPerPP; + TLLM_CHECK_WITH_INFO(selfPerPP.size() == static_cast<size_t>(selfPPNum), + "self attentionLayerNumPerPP size (%zu) must equal selfPPNum (%d).", selfPerPP.size(), selfPPNum); + TLLM_CHECK_WITH_INFO(destPerPP.size() == static_cast<size_t>(destPPNum), + "dest attentionLayerNumPerPP size (%zu) must equal destPPNum (%d).", destPerPP.size(), destPPNum); + auto sumVec = [](auto const& v){ SizeType32 s=0; for (auto x: v){ TLLM_CHECK_WITH_INFO(x>0,"per-PP layer count must be >0"); s+=x;} return s; }; + TLLM_CHECK_WITH_INFO(sumVec(selfPerPP) == selfAllLayerNum, "Sum(self attentionLayerNumPerPP) must equal total layers."); + TLLM_CHECK_WITH_INFO(sumVec(destPerPP) == destAllLayerNum, "Sum(dest attentionLayerNumPerPP) must equal total layers.");
286-317
: Bounds-check mAttentionLayerNumPerPP indexing and enforce divisibility before integer divisions. Guard ppIdx indexing and add sanity checks for integer divisions to avoid silent truncation.
Apply:
- int selfAttentionLayerNum - = selfConfig.getParallelConfig() - .mAttentionLayerNumPerPP[selfIdx / selfConfig.getParallelConfig().mTensorParallelism]; + auto const& perPP = selfConfig.getParallelConfig().mAttentionLayerNumPerPP; + auto const tp = selfConfig.getParallelConfig().mTensorParallelism; + auto const ppIdx = selfIdx / tp; + TLLM_CHECK_WITH_INFO(ppIdx < perPP.size(), "ppIdx OOB: %d vs %zu", ppIdx, perPP.size()); + int selfAttentionLayerNum = perPP[ppIdx]; + TLLM_CHECK_WITH_INFO(selfAttentionLayerNum > 0, "selfAttentionLayerNum must be > 0"); @@ - auto getBufferSizeForTarget = [&]() + auto getBufferSizeForTarget = [&]() { - std::vector<size_t> bufferSizeForTarget(targetNum, 0); - std::vector<SizeType32> LayerNumbufferTargetNum(bufferTargetNum, 0); + std::vector<size_t> bufferSizeForTarget(targetNum, 0); + std::vector<SizeType32> layerNumPerTarget(bufferTargetNum, 0); if (inputKvCacheBlocks.size() > 1) { // for VWSA for (size_t i = 0; i < targetNum; i++) { bufferSizeForTarget[i] = allCacheBlockSize * peerDuplicateHeadFactor / targetNum; } - return std::make_pair(bufferSizeForTarget, LayerNumbufferTargetNum); + return std::make_pair(bufferSizeForTarget, layerNumPerTarget); } + TLLM_CHECK_WITH_INFO( + (allCacheBlockSize * static_cast<size_t>(peerDuplicateHeadFactor)) % static_cast<size_t>(targetInfo.mDomainTPSize) == 0, + "allCacheBlockSize*peerDuplicateHeadFactor must be divisible by domain TP size"); + TLLM_CHECK_WITH_INFO( + ((allCacheBlockSize * static_cast<size_t>(peerDuplicateHeadFactor)) / static_cast<size_t>(targetInfo.mDomainTPSize)) + % static_cast<size_t>(selfAttentionLayerNum) == 0, + "Base element size must be divisible by selfAttentionLayerNum"); for (size_t i = 0; i < targetNum; i++) { bufferSizeForTarget[i] = allCacheBlockSize * peerDuplicateHeadFactor / targetInfo.mDomainTPSize / selfAttentionLayerNum * targetInfo.getPeerPPDomainLayerNum(i); } for (size_t i = 0; i < bufferTargetNum; i++) { - LayerNumbufferTargetNum[i] = targetInfo.getPeerPPDomainLayerNum(i); + layerNumPerTarget[i] = targetInfo.getPeerPPDomainLayerNum(i); } - return std::make_pair(bufferSizeForTarget, LayerNumbufferTargetNum); + return std::make_pair(bufferSizeForTarget, layerNumPerTarget); }; - auto [bufferEleSizes, LayerNumbufferTargetNum] = getBufferSizeForTarget(); + auto [bufferEleSizes, layerNumPerTarget] = getBufferSizeForTarget();
348-391
: Use bufferIdx consistently in chunked-send branch and fix size_t formats. remainSendSize/needSendSize read from outputSplitCaches[processIdx] is incorrect; should be bufferIdx. Also prefer %zu.
- TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %ld", processIdx); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %zu", processIdx); @@ - TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %d bufferIdx: %d size:%ld", - processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %zu bufferIdx: %zu size:%zu", + processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); @@ - size_t remainSendSize = outputSplitCaches[processIdx]->getSize(); - size_t needSendSize = outputSplitCaches[processIdx]->getSize(); + size_t remainSendSize = outputSplitCaches[bufferIdx]->getSize(); + size_t needSendSize = outputSplitCaches[bufferIdx]->getSize(); @@ - TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " end send processIdx: %d bufferIdx: %d size:%ld", - processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " end send processIdx: %zu bufferIdx: %zu size:%zu", + processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes());tensorrt_llm/_torch/distributed/communicator.py (1)
152-160
: Short-circuit PP collectives when pp_size <= 1Avoid unnecessary comms and make degenerate PP robust.
def pp_allgather(self, obj): - return self.pp_comm.allgather(obj) + if self.mapping.pp_size <= 1: + return [obj] + return self.pp_comm.allgather(obj) def pp_gather(self, obj): - return self.pp_comm.gather(obj) + if self.mapping.pp_size <= 1: + return [obj] + return self.pp_comm.gather(obj) def pp_broadcast(self, obj, root=0): - return self.pp_comm.bcast(obj, root) + if self.mapping.pp_size <= 1: + return obj + return self.pp_comm.bcast(obj, root)cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)
297-303
: LGTM! Good API evolution for uneven PP support.The change from a single
targetBufferEleSize
to a vector of sizes is a sensible way to support uneven distribution of layers across pipeline parallel ranks.
🧹 Nitpick comments (15)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (4)
620-642
: Uneven PP distribution helper is good; minor cleanup and correctness nits.
- The distribution logic is solid. Nit: rename
layerNumthisRank
→layerNumThisRank
for readability.- Ensure
mAttentionLayerNumPerPP
andcontextAttentionLayerNumPerPP
always sum tonumLayers
; you already construct them that way, but a quick assert in tests helps catch regressions.Apply this rename:
- int layerNumthisRank = getLayerNumPPRank(numLayers, mPpRank, mPpSize); + int layerNumThisRank = getLayerNumPPRank(numLayers, mPpRank, mPpSize);- mManager = std::make_unique<KVCacheManager>(layerNumthisRank, numHeadsPerRank, sizePerHead, tokensPerBlock, + mManager = std::make_unique<KVCacheManager>(layerNumThisRank, numHeadsPerRank, sizePerHead, tokensPerBlock,Also applies to: 721-724
1050-1051
: Naming consistency nit: layerSizethisRank → layerSizeThisRank.Consistent casing improves readability and reduces copy/paste mistakes.
- int layerSizethisRank = blockData.getDimension<1>(); + int layerSizeThisRank = blockData.getDimension<1>();And update references within the function accordingly:
- for (int layerId = 0; layerId < layerSizethisRank; layerId++) + for (int layerId = 0; layerId < layerSizeThisRank; layerId++)
1408-1412
: Prefer INSTANTIATE_TEST_SUITE_P (gtest rename) for new suites.GTest renamed TEST_CASE → TEST_SUITE; older macro still works but is deprecated on newer toolchains and can warn under -Werror.
-INSTANTIATE_TEST_CASE_P(AsymmetricCaseTest1EvenLayer, AsymmetricalCacheTest, +INSTANTIATE_TEST_SUITE_P(AsymmetricCaseTest1EvenLayer, AsymmetricalCacheTest,-INSTANTIATE_TEST_CASE_P(AsymmetricCaseTest2EvenLayer, AsymmetricalCacheTest, +INSTANTIATE_TEST_SUITE_P(AsymmetricCaseTest2EvenLayer, AsymmetricalCacheTest,-INSTANTIATE_TEST_CASE_P(AsymmetricCaseTest1ForMLAEvenLayer, AsymmetricalCacheTestWithDP, +INSTANTIATE_TEST_SUITE_P(AsymmetricCaseTest1ForMLAEvenLayer, AsymmetricalCacheTestWithDP,-INSTANTIATE_TEST_CASE_P(AsymmetricCaseTest2ForMLAEvenLayer, AsymmetricalCacheTestWithDP, +INSTANTIATE_TEST_SUITE_P(AsymmetricCaseTest2ForMLAEvenLayer, AsymmetricalCacheTestWithDP,Please confirm the CI gtest version supports
INSTANTIATE_TEST_SUITE_P
; if not, keep as-is to avoid breaking older environments.Also applies to: 1414-1418, 1438-1443, 1445-1449
2-2
: Header year nit (guideline).The repository guideline asks for current year; the header shows 2023–2024. Consider updating to 2025 when touching the file, if the project enforces that check.
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (3)
71-77
: API change to per-target element sizes: add argument contract to guard misuseThe new vector parameter is good. Please enforce at the call boundary that:
- targetBufferEleSizes.size() >= static_cast<size_t>(targetNum)
- All entries > 0
- Document that sizes are in “elements” (not bytes) and must match ITensor dtype
Add precondition checks in the .cpp implementation and a brief comment here. If you prefer, change targetNum to size_t to avoid signed/unsigned mismatches:
- std::tuple<std::vector<runtime::ITensor::SharedPtr>, size_t, bool> getOrAllocateSendBuffers( - std::optional<int> bufferId, int targetNum, std::vector<size_t> const& targetBufferEleSizes, + std::tuple<std::vector<runtime::ITensor::SharedPtr>, size_t, bool> getOrAllocateSendBuffers( + std::optional<int> bufferId, size_t targetNum, std::vector<size_t> const& targetBufferEleSizes, runtime::BufferManager const& bufferManagerToUse);
94-97
: Keep helper signature consistent and clarify “element” semanticsMirror the public API strictly (size_t targetNum; const& vector) and add a short doc explaining “element sizes” vs bytes to prevent accidental mixing with getSizeInBytes().
- std::tuple<std::vector<runtime::ITensor::SharedPtr>, size_t, bool> getOrAllocateBuffers(std::optional<int> bufferId, - int targetNum, std::vector<size_t> const& targetBufferEleSizes, - runtime::BufferManager const& bufferManagerToUse, ConcurrenceResource& concurrenceResource); + // targetBufferEleSizes: per-target sizes measured in elements (not bytes). + std::tuple<std::vector<runtime::ITensor::SharedPtr>, size_t, bool> getOrAllocateBuffers( + std::optional<int> bufferId, size_t targetNum, std::vector<size_t> const& targetBufferEleSizes, + runtime::BufferManager const& bufferManagerToUse, ConcurrenceResource& concurrenceResource);
103-111
: Stale member name after API shift: mBufferEleSize is misleadingWith per-target sizes, a single mBufferEleSize is ambiguous. If it’s a default “small staging buffer element size,” name it accordingly and document when it’s used; otherwise, remove.
- size_t mBufferEleSize; + // Default element size used only when a single staging buffer is employed (bufferCoverTargetNum == 0). + size_t mDefaultStagingBufferEleSize;Confirm all usages in the .cpp align with the new meaning.
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
324-329
: Fix format specifiers and typo in debug logUse %zu for size_t and fix “dupliacete”.
- TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), - " formatOutput bufferTargetNum: %d, targetNum: %d, peerDuplicateHeadFactor: %d dupliacete:%d " - "bufferCoverTargetNum:%d connections.size():%ld", - bufferTargetNum, targetNum, peerDuplicateHeadFactor, targetInfo.mDupHeadFactor, bufferCoverTargetNum, - connections.size()); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), + "formatOutput bufferTargetNum: %d, targetNum: %zu, peerDuplicateHeadFactor: %d duplicate:%d " + "bufferCoverTargetNum:%zu connections.size():%zu", + bufferTargetNum, connections.size(), peerDuplicateHeadFactor, targetInfo.mDupHeadFactor, + bufferCoverTargetNum, connections.size());
706-716
: Use size_t format specifier in logsprocessIdx and sizes are size_t; fix %d/%ld to %zu to avoid UB on some platforms.
- TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " start recv bufferIdx: %d size:%ld", processIdx, + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " start recv bufferIdx: %zu size:%zu", processIdx, buffer->getSizeInBytes()); @@ - TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " recv bufferIdx: %d size:%ld", processIdx, + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " recv bufferIdx: %zu size:%zu", processIdx, buffer->getSizeInBytes());tensorrt_llm/_torch/distributed/communicator.py (1)
139-142
: Guard create_pp_comm for degenerate PP groupsIf pp_group is length 1 or misconfigured, ensure pp_comm is still valid and usable.
def create_pp_comm(self): - new_group = mpi_comm().group.Incl(self.mapping.pp_group) - self.pp_comm = mpi_comm().Create_group(new_group) + new_group = mpi_comm().group.Incl(self.mapping.pp_group) + self.pp_comm = mpi_comm().Create_group(new_group) + # Optional: verify communicator is valid; fall back to world if needed. + # assert self.pp_comm is not NoneConfirm mapping.pp_group is always populated under all launchers used by TRT-LLM. If not, I can add a safe fallback.
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1)
96-113
: Validate gathered PP layer counts and improve robustnessAdd sanity checks: vector length equals mapping.pp_size; sum equals total layer count; elements > 0.
pp_layer_num = len(kv_cache_manager.pp_layers) - pp_layer_num_per_pp_rank = dist.pp_allgather(pp_layer_num) + pp_layer_num_per_pp_rank = dist.pp_allgather(pp_layer_num) + if len(pp_layer_num_per_pp_rank) != mapping.pp_size: + raise RuntimeError(f"pp_allgather returned {len(pp_layer_num_per_pp_rank)} entries; expected {mapping.pp_size}") + if any(x <= 0 for x in pp_layer_num_per_pp_rank): + raise RuntimeError("attention_layer_num_per_pp must be > 0 per PP rank") + # Optional: validate sum equals model layers when available + try: + total_layers = len(kv_cache_manager.total_num_kv_heads_per_layer) + if sum(pp_layer_num_per_pp_rank) != total_layers: + logger.warning(f"Sum(attention_layer_num_per_pp)={sum(pp_layer_num_per_pp_rank)} != total_layers={total_layers}") + except Exception: + passcpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)
341-356
: Consider adding early exit and documenting buffer handling strategy. The cumulative buffer size calculation correctly stops when
mTransferBufferSize
is exceeded. However, there are two improvements to consider:
- Add an early exit when all buffers fit within the allocation
- Document the fallback behavior when some buffers exceed the pre-allocated size
Apply this diff to add early exit optimization:
size_t bufferCoverTargetNum = 0; size_t preBufferByteSize = 0; for (int i = 0; i < targetNum; i++) { preBufferByteSize += targetBufferEleSizes[i] * common::getDTypeSize(mDataType); if (preBufferByteSize > mTransferBufferSize) { break; } bufferCoverTargetNum++; + // Early exit if all buffers are covered + if (bufferCoverTargetNum == static_cast<size_t>(targetNum)) + { + break; + } }cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)
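As a standalone sketch of the coverage computation discussed above (not the PR's code; names and the budget parameter are assumptions), the idea is to count how many consecutive per-target buffers fit into a fixed pre-allocated transfer budget, with the remainder falling back to dynamic allocation:

#include <cstddef>
#include <vector>

size_t countCoveredTargets(std::vector<size_t> const& targetByteSizes, size_t transferBudgetBytes)
{
    size_t covered = 0;
    size_t accumulated = 0;
    for (size_t bytes : targetByteSizes)
    {
        accumulated += bytes;
        if (accumulated > transferBudgetBytes)
        {
            break; // remaining targets are not covered by the pre-allocated buffer
        }
        ++covered;
    }
    return covered;
}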
67-102
: Add validation for empty per-PP layer distributions.The function correctly validates that the per-PP layer vectors match the PP sizes and that the sum equals the self-PP's layers. However, consider adding validation that individual layer counts are non-zero to prevent division-by-zero or invalid indexing downstream.
Add validation after line 102:
int targetPeerPpLayerNumSum = std::accumulate(targetPeerPPLayerNum.begin(), targetPeerPPLayerNum.end(), 0); TLLM_CHECK(targetPeerPpLayerNumSum == selfNumLayerPerPP[selfPPRank]); + // Validate that all layer counts are positive + for (size_t i = 0; i < targetPeerPPLayerNum.size(); ++i) + { + TLLM_CHECK(targetPeerPPLayerNum[i] > 0); + }
520-543
: Device function correctly handles per-PP layer mapping, but could benefit from bounds checking. The new device helper function properly maps global layer IDs to domain-PP indices using the prefix array. The shared memory usage is appropriate for avoiding redundant computation.
Consider adding bounds checking for defensive programming:
__device__ __forceinline__ void getLayerIdInDomainPPandRankInDomainPP(int layerId, int DomainPPSize, uint64_t* prefixLayerNumDevPtr, int& layerIdInDomainPP, int& rankInDomainPP, int& layerNumInSpecPP) { __shared__ int sharedLayerIdInDomainPP; __shared__ int sharedRankInDomainPP; __shared__ int sharedLayerNumInSpecPP; + __shared__ bool sharedFoundValid; + if (threadIdx.x == 0) + { + sharedFoundValid = false; + } + __syncthreads(); + #pragma unroll 1 for (int ppRank = threadIdx.x; ppRank < DomainPPSize; ppRank += blockDim.x) { if (layerId >= prefixLayerNumDevPtr[ppRank] && layerId < prefixLayerNumDevPtr[ppRank + 1]) { sharedLayerIdInDomainPP = layerId - prefixLayerNumDevPtr[ppRank]; sharedRankInDomainPP = ppRank; sharedLayerNumInSpecPP = prefixLayerNumDevPtr[ppRank + 1] - prefixLayerNumDevPtr[ppRank]; + sharedFoundValid = true; break; } } __syncthreads(); + // In debug builds, could add assert(sharedFoundValid) here layerIdInDomainPP = sharedLayerIdInDomainPP; rankInDomainPP = sharedRankInDomainPP; layerNumInSpecPP = sharedLayerNumInSpecPP; }
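A host-side analogue may make the mapping easier to follow; the sketch below is illustrative only (the struct and function names are not from the PR) and assumes the prefix array has one more entry than there are domain-PP ranks:

#include <cstdint>
#include <vector>

struct LayerLocation
{
    int rankInDomainPP;
    int layerIdInDomainPP;
    int layerNumInSpecPP;
};

LayerLocation locateLayer(int layerId, std::vector<uint64_t> const& prefixLayerNum)
{
    // prefixLayerNum = {0, n0, n0+n1, ...}; rank i owns [prefix[i], prefix[i+1]).
    for (int rank = 0; rank + 1 < static_cast<int>(prefixLayerNum.size()); ++rank)
    {
        if (layerId >= static_cast<int>(prefixLayerNum[rank]) && layerId < static_cast<int>(prefixLayerNum[rank + 1]))
        {
            return {rank, layerId - static_cast<int>(prefixLayerNum[rank]),
                static_cast<int>(prefixLayerNum[rank + 1] - prefixLayerNum[rank])};
        }
    }
    return {-1, -1, -1}; // layerId out of range
}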
1020-1060
: Consider using size_t consistently for GPU pointer storage.The code uses
uint64_t
for storing GPU addresses which is correct, but mixinguint64_t
withreinterpret_cast<uint64_t>
for pointer conversion could be cleaner with a consistent type alias.Consider adding a type alias for clarity:
+ using GpuPtr = uint64_t; - std::vector<uint64_t> cachePtrs; + std::vector<GpuPtr> cachePtrs; // ... later in the code ... - cachePtrs.push_back(reinterpret_cast<uint64_t>((kvCacheBlock->data()))); + cachePtrs.push_back(reinterpret_cast<GpuPtr>(kvCacheBlock->data()));
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (23)
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
(1 hunks)cpp/include/tensorrt_llm/executor/dataTransceiverState.h
(2 hunks)cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
(8 hunks)cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
(4 hunks)cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
(2 hunks)cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
(2 hunks)cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
(5 hunks)cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
(39 hunks)cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
(1 hunks)cpp/tensorrt_llm/executor/serialization.cpp
(3 hunks)cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
(1 hunks)cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
(1 hunks)cpp/tests/batch_manager/cacheTransceiverTest.cpp
(24 hunks)cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
(4 hunks)cpp/tests/unit_tests/executor/agentCommTest.cpp
(2 hunks)cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
(1 hunks)tensorrt_llm/_torch/distributed/communicator.py
(3 hunks)tensorrt_llm/_torch/pyexecutor/_util.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
(4 hunks)tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml
(1 hunks)tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml
(1 hunks)tests/integration/defs/disaggregated/test_disaggregated.py
(4 hunks)tests/integration/test_lists/test-db/l0_dgx_h100.yml
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (15)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
- cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- tensorrt_llm/_torch/pyexecutor/_util.py
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
- cpp/tests/unit_tests/executor/agentCommTest.cpp
- tests/integration/defs/disaggregated/test_disaggregated.py
- cpp/tensorrt_llm/executor/serialization.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/include/tensorrt_llm/executor/dataTransceiverState.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/distributed/communicator.py
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/distributed/communicator.py
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
🧠 Learnings (5)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
📚 Learning: 2025-08-21T09:41:49.327Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.327Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
🧬 Code graph analysis (1)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (4)
numLayers (32-69), numLayers (32-33), numLayers (77-88), numLayers (77-77)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)
numLayers (290-392), numLayers (609-738), numLayers (609-611)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (5)
144-154
: Equality test update looks correct; keep explicit per-PP vector.Good:
attentionLayerNumPerPP
is sized topipelineParallelism
(2) with even split for 10 layers. Both constructors converge on the same state. No issues.
323-326
: LGTM: Symmetrical test passes per-PP vector consistent with pp=1.
{numLayers}
forpp=1
is correct and future-proof with the new API.
728-734
: LGTM: CacheState wiring now passes per-PP vectors for both context and generation.This aligns with the new constructor semantics and uneven PP layouts.
894-898
: DP variant CacheState copy: correct propagation of per-PP vector.You copy
mContextCacheState->getParallelConfig().mAttentionLayerNumPerPP
into the custom state — that keeps generation-side routing consistent.
1529-1531
: LGTM: Non-MD tests correctly construct per-PP vectors and propagate them.The revised
CacheStateNODP
andCacheStateContextDP
tests consistently passattentionLayerNumPerPP
matching their PP sizes and reinitialize vectors after PP changes, avoiding subtle stale-state bugs.Also applies to: 1539-1544, 1584-1584, 1611-1612, 1621-1624, 1625-1628, 1728-1735, 1754-1755, 1769-1770
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
254-258
: Zero-copy gating now requires exact PP/TP equality—verify intended trade-offChanging “<=” to “==” (for both PP and TP) can disable zcopy for otherwise compatible layouts (e.g., equal local shapes with different global layouts), potentially hurting performance.
Would you like me to enumerate shape equivalence conditions that still allow zcopy under uneven PP and gate on those instead?
804-806
: LGTM: V2 concat dispatch aligns with per-target/uneven PP pathUsing concatKvCacheV2Dispatch is the right call given the new split/recv model.
tensorrt_llm/_torch/distributed/communicator.py (1)
162-229
: TorchDist lacks PP collectives—ensure callers never receive TorchDist or add PPComm wiringcreate_kv_cache_transceiver now depends on dist.pp_allgather. TorchDist doesn’t implement PP ops, so passing a TorchDist instance will break.
Would you like me to add PP collectives to TorchDist using a new torch.distributed group (mapping.pp_group), mirroring TP/CP? I can provide a patch.
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)
92-98
: Binding signature update looks correct and consistent with Python callerNew parameter attention_layer_num_per_pp is placed between world_config and dtype and named accordingly. No issues.
Please ensure the corresponding C++ CacheTransceiver constructor has matching parameter order and that serialization includes mAttentionLayerNumPerPP.
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1)
31-35
: No internal call sites missingdist
A repo-wide search forcreate_kv_cache_transceiver(
found the only usage in
tensorrt_llm/_torch/pyexecutor/_util.py
(lines 554–556), which already passes thedist
argument.No further updates are required—this API change does not break any existing call sites.
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)
370-384
: Please verify thatpreBufferEleSize
accumulation cannot exceed the allocated buffer sizeI wasn’t able to find any existing tests covering the cumulative slicing logic in
cacheTransBuffer.cpp
. If the calculation ofbufferCoverTargetNum
doesn’t strictly enforce that the sum of the first N slice sizes fits within the actual buffer length, the call toITensor::slice(concurrenceResource.mBuffers[bufferId.value()], preBufferEleSize, targetBufferEleSizes[i])could read out‐of‐bounds.
Please double-check that:
bufferCoverTargetNum
is derived so that
sum(targetBufferEleSizes[0..bufferCoverTargetNum-1])
is always ≤ the size ofconcurrenceResource.mBuffers[bufferId]
.- If this invariant isn’t enforced earlier (e.g., in the buffer allocation routines), consider adding a guard or assertion.
- Ideally, add unit tests for boundary conditions where the cumulative slice size exactly matches or would exceed the buffer capacity.
File:
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
(around lines 370–384)cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
1053-1059
: Good implementation of prefix sum for per-PP layer distribution. The cumulative prefix array correctly enables efficient layer-to-PP mapping in the kernels.
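For readers new to this pattern, a short sketch of how such a prefix array can be built from per-PP layer counts (assumed shape; not the PR's exact code):

#include <cstdint>
#include <vector>

std::vector<uint64_t> buildLayerPrefix(std::vector<int> const& layersPerPP)
{
    // prefix[i] = number of layers owned by ranks 0..i-1, so prefix has size + 1 entries.
    std::vector<uint64_t> prefix(layersPerPP.size() + 1, 0);
    for (size_t i = 0; i < layersPerPP.size(); ++i)
    {
        prefix[i + 1] = prefix[i] + static_cast<uint64_t>(layersPerPP[i]);
    }
    return prefix;
}

// {6, 5, 5} -> {0, 6, 11, 16}; a layer id belongs to the rank whose half-open
// interval [prefix[i], prefix[i+1]) contains it.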
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (outdated review comment; resolved)
/bot run --add-multi-gpu-test |
PR_Github #16166 [ run ] triggered by Bot |
PR_Github #16166 [ run ] completed with state |
Force-pushed from eff60ba to 399252d (Compare)
/bot run --add-multi-gpu-test |
PR_Github #16368 [ run ] triggered by Bot |
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (1)
187-193
: Default-initialize mOffsetRatio to a safe divisor. mOffsetRatio defaults to {0,0} via value-initialization. send() divides by mOffsetRatio.second; if setSenderState wasn't called, this becomes a division-by-zero. Initialize to {0,1} for a safe default and document the invariant.
struct SenderState { MemoryDesc mReceiverBufferDesc{nullptr, 0, 0}; int validSegmentIdx{0}; - std::pair<size_t, size_t> mOffsetRatio; + // {offsetLayer, selfSendLayer}; selfSendLayer must be > 0 + std::pair<size_t, size_t> mOffsetRatio{0, 1}; SenderState() = default; };cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (1)
95-116
: send(): guard invariants and prevent OOB writes. send() computes offset = size / selfSendLayer * offsetLayer with no checks. Add:
- selfSendLayer > 0
- size % selfSendLayer == 0 (segment alignment)
- offset + size <= dstBaseDesc.getLen() (bounds)
Also prefer explicit names for clarity.
auto dstBaseDesc = mSenderState.mReceiverBufferDesc; - auto offset = size / mSenderState.mOffsetRatio.second * mSenderState.mOffsetRatio.first; - MemoryDesc dstDesc{dstBaseDesc.getAddr() + offset, size, dstBaseDesc.getDeviceId()}; + auto const offsetLayer = mSenderState.mOffsetRatio.first; + auto const selfSendLayer = mSenderState.mOffsetRatio.second; + TLLM_CHECK_WITH_INFO(selfSendLayer > 0, "selfSendLayer must be > 0 before send()"); + TLLM_CHECK_WITH_INFO(size % selfSendLayer == 0, "send size must be divisible by selfSendLayer"); + auto const segmentSize = size / selfSendLayer; + auto const offsetBytes = segmentSize * offsetLayer; + TLLM_CHECK_WITH_INFO(offsetBytes + size <= dstBaseDesc.getLen(), "send would exceed destination buffer"); + MemoryDesc dstDesc{dstBaseDesc.getAddr() + offsetBytes, size, dstBaseDesc.getDeviceId()};Minor: the debug log uses %ld for size_t; prefer %zu.
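A worked sketch of the guarded offset computation the comment asks for, under the assumed semantics that `size` bytes cover `selfSendLayer` layers and the data starts `offsetLayer` layers into the destination region (names are illustrative):

#include <cassert>
#include <cstddef>

size_t guardedOffset(size_t size, size_t selfSendLayer, size_t offsetLayer, size_t dstLen)
{
    assert(selfSendLayer > 0);         // avoid division by zero
    assert(size % selfSendLayer == 0); // segment alignment
    size_t const perLayerBytes = size / selfSendLayer;
    size_t const offset = perLayerBytes * offsetLayer;
    assert(offset + size <= dstLen);   // stay inside the destination buffer
    return offset;
}

// e.g. sending 3 MiB covering 3 layers, starting at destination layer 2:
// per-layer size 1 MiB, offset 2 MiB.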
cpp/include/tensorrt_llm/executor/dataTransceiverState.h (1)
76-87
: Mirror invariant checks in the uniform-layer-count constructor. Same rationale; ensure per-PP vector matches PP and sums to nbAttentionLayers.
Apply:
{ + auto const& perPP = mParallelConfig.mAttentionLayerNumPerPP; + auto const pp = mParallelConfig.mPipelineParallelism; + auto const numLayers = static_cast<SizeType32>(mModelConfig.mNbKvHeadsPerLayer.size()); + TLLM_CHECK_WITH_INFO(perPP.size() == static_cast<size_t>(pp), + "attentionLayerNumPerPP size (%zu) must equal pipelineParallelism (%d).", perPP.size(), pp); + SizeType32 sum = 0; + for (auto v : perPP) + { + TLLM_CHECK_WITH_INFO(v > 0, "attentionLayerNumPerPP entries must be > 0."); + sum += v; + } + TLLM_CHECK_WITH_INFO(sum == numLayers, + "Sum of attentionLayerNumPerPP (%d) must equal number of layers (%d).", sum, numLayers); }
♻️ Duplicate comments (10)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6)
103-109
: Remove divisibility checks that block uneven PP; validate attentionLayerNumPerPP instead. These modulus checks forbid uneven PP, which this PR aims to support. Replace them with validation against per-PP layer distributions from both self/dest configs: size equals PP size, entries > 0, and sums match total layer count. This also future-proofs checkAlternateWindow for heterogeneous PP layouts.
Apply:
- auto selfPPNum = selfConfig.getParallelConfig().mPipelineParallelism; - auto selfAllLayerNum = selfConfig.getModelConfig().mNbKvHeadsPerLayer.size(); - auto destPPNum = destConfig.getParallelConfig().mPipelineParallelism; - auto destAllLayerNum = destConfig.getModelConfig().mNbKvHeadsPerLayer.size(); - TLLM_CHECK_WITH_INFO(selfAllLayerNum % selfPPNum == 0, " For VWSA selfAllLayerNum must be divisible by selfPPNum"); - TLLM_CHECK_WITH_INFO(destAllLayerNum % destPPNum == 0, "For VWSA destAllLayerNum must be divisible by destPPNum"); + auto const selfPPNum = selfConfig.getParallelConfig().mPipelineParallelism; + auto const destPPNum = destConfig.getParallelConfig().mPipelineParallelism; + auto const selfAllLayerNum = static_cast<SizeType32>(selfConfig.getModelConfig().mNbKvHeadsPerLayer.size()); + auto const destAllLayerNum = static_cast<SizeType32>(destConfig.getModelConfig().mNbKvHeadsPerLayer.size()); + + auto const& selfPerPP = selfConfig.getParallelConfig().mAttentionLayerNumPerPP; + auto const& destPerPP = destConfig.getParallelConfig().mAttentionLayerNumPerPP; + TLLM_CHECK_WITH_INFO(selfPerPP.size() == static_cast<size_t>(selfPPNum), + "self attentionLayerNumPerPP size (%zu) must equal selfPPNum (%d).", selfPerPP.size(), selfPPNum); + TLLM_CHECK_WITH_INFO(destPerPP.size() == static_cast<size_t>(destPPNum), + "dest attentionLayerNumPerPP size (%zu) must equal destPPNum (%d).", destPerPP.size(), destPPNum); + auto sumVec = [](auto const& v){ SizeType32 s=0; for (auto x: v){ TLLM_CHECK_WITH_INFO(x>0,"per-PP layer count must be >0"); s+=x;} return s; }; + TLLM_CHECK_WITH_INFO(sumVec(selfPerPP) == selfAllLayerNum, "Sum(self attentionLayerNumPerPP) must equal selfAllLayerNum."); + TLLM_CHECK_WITH_INFO(sumVec(destPerPP) == destAllLayerNum, "Sum(dest attentionLayerNumPerPP) must equal destAllLayerNum.");
286-289
: Guard mAttentionLayerNumPerPP indexing and validate per-PP entryIndexing by selfIdx/tp can go OOB and zero layer count is invalid. Add bounds and >0 checks.
Apply:
- int selfAttentionLayerNum - = selfConfig.getParallelConfig() - .mAttentionLayerNumPerPP[selfIdx / selfConfig.getParallelConfig().mTensorParallelism]; + auto const& perPP = selfConfig.getParallelConfig().mAttentionLayerNumPerPP; + auto const tp = selfConfig.getParallelConfig().mTensorParallelism; + auto const ppIdx = selfIdx / tp; + TLLM_CHECK_WITH_INFO(ppIdx < perPP.size(), "ppIdx OOB: %d vs %zu", ppIdx, perPP.size()); + int selfAttentionLayerNum = perPP[ppIdx]; + TLLM_CHECK_WITH_INFO(selfAttentionLayerNum > 0, "selfAttentionLayerNum must be > 0");
290-317
: Per-target sizing: add divisibility checks, fix naming, and return consistent vectors
- Add divisibility checks before integer division.
- Rename LayerNumbufferTargetNum → layerNumPerTarget for clarity.
- Ensure returned vectors have lengths consistent with targetNum/bufferTargetNum as used later.
Apply:
- auto getBufferSizeForTarget = [&]() + auto getBufferSizeForTarget = [&]() { - std::vector<size_t> bufferSizeForTarget(targetNum, 0); - std::vector<SizeType32> LayerNumbufferTargetNum(bufferTargetNum, 0); + std::vector<size_t> bufferSizeForTarget(targetNum, 0); + std::vector<SizeType32> layerNumPerTarget(bufferTargetNum, 0); // // only first bufferTargetNum is used. if (inputKvCacheBlocks.size() > 1) { // for VWSA for (size_t i = 0; i < targetNum; i++) { bufferSizeForTarget[i] = allCacheBlockSize * peerDuplicateHeadFactor / targetNum; } - return std::make_pair(bufferSizeForTarget, LayerNumbufferTargetNum); + return std::make_pair(bufferSizeForTarget, layerNumPerTarget); } + // Validate divisibility before integer divisions below + TLLM_CHECK_WITH_INFO( + (allCacheBlockSize * static_cast<size_t>(peerDuplicateHeadFactor)) % static_cast<size_t>(targetInfo.mDomainTPSize) == 0, + "allCacheBlockSize*peerDuplicateHeadFactor must be divisible by domain TP size"); + TLLM_CHECK_WITH_INFO( + ((allCacheBlockSize * static_cast<size_t>(peerDuplicateHeadFactor)) / static_cast<size_t>(targetInfo.mDomainTPSize)) % static_cast<size_t>(selfAttentionLayerNum) == 0, + "Base element size must be divisible by selfAttentionLayerNum"); for (size_t i = 0; i < targetNum; i++) { bufferSizeForTarget[i] = allCacheBlockSize * peerDuplicateHeadFactor / targetInfo.mDomainTPSize / selfAttentionLayerNum * targetInfo.getPeerPPDomainLayerNum(i); } for (size_t i = 0; i < bufferTargetNum; i++) { - LayerNumbufferTargetNum[i] = targetInfo.getPeerPPDomainLayerNum(i); + layerNumPerTarget[i] = targetInfo.getPeerPPDomainLayerNum(i); } - return std::make_pair(bufferSizeForTarget, LayerNumbufferTargetNum); + return std::make_pair(bufferSizeForTarget, layerNumPerTarget); }; - auto [bufferEleSizes, LayerNumbufferTargetNum] = getBufferSizeForTarget(); + auto [bufferEleSizes, layerNumPerTarget] = getBufferSizeForTarget();
348-391
: Partial-send path: use bufferIdx consistently; fix size_t printfremainSendSize/needSendSize currently read from outputSplitCaches[processIdx] but slices use [bufferIdx]; this mismatch can cause OOB and incorrect transfers. Also fix %ld/%d for size_t.
Apply:
- TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %ld", processIdx); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %zu", processIdx); @@ - size_t size = outputSplitCaches[bufferIdx]->getSizeInBytes(); + size_t size = outputSplitCaches[bufferIdx]->getSizeInBytes(); @@ - TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %d bufferIdx: %d size:%ld", - processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %zu bufferIdx: %zu size:%zu", + processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); @@ - session.send( - processIdx, outputSplitCaches[bufferIdx]->data(), outputSplitCaches[bufferIdx]->getSizeInBytes()); + session.send(processIdx, outputSplitCaches[bufferIdx]->data(), + outputSplitCaches[bufferIdx]->getSizeInBytes()); @@ - TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " end send processIdx: %d bufferIdx: %d size:%ld", - processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " end send processIdx: %zu bufferIdx: %zu size:%zu", + processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); } else { @@ - size_t remainSendSize = outputSplitCaches[processIdx]->getSize(); - size_t needSendSize = outputSplitCaches[processIdx]->getSize(); + size_t remainSendSize = outputSplitCaches[bufferIdx]->getSize(); + size_t needSendSize = outputSplitCaches[bufferIdx]->getSize(); auto sendBufferIdx = bufferCoverTargetNum == 0 ? 0 : bufferIdx % bufferCoverTargetNum; @@ - auto copySlice = runtime::ITensor::slice( - outputSplitCaches[bufferIdx], needSendSize - remainSendSize, sendSize); + auto copySlice = runtime::ITensor::slice( + outputSplitCaches[bufferIdx], needSendSize - remainSendSize, sendSize);Also replicate the same consistency fix, if present, in the analogous branch later in this function.
610-613
: Remove stale equal-split assert; uneven PP intentionally violates cacheBlockSizeSum % targetNum == 0. Uneven per-target sizing makes this global divisibility check a false positive. Allocation logic already enforces per-target divisibility; drop this check.
Apply:
- TLLM_CHECK(cacheBlockSizeSum % targetNum == 0);
615-649
: Recv-side sizing: fix spelling, add guards to mirror send path, and validate vector sizes
- Rename getTargetBufferEleSzie → getTargetBufferEleSize and valideTpSize → validTpSize.
- Add divisibility checks and ensure returned vectors are sized to targetNum with positive per-target layer counts.
Apply:
- int selfAttentionLayerNum - = selfConfig.getParallelConfig() - .mAttentionLayerNumPerPP[selfIdx / selfConfig.getParallelConfig().mTensorParallelism]; - auto getTargetBufferEleSzie = [&]() + int selfAttentionLayerNum + = selfConfig.getParallelConfig() + .mAttentionLayerNumPerPP[selfIdx / selfConfig.getParallelConfig().mTensorParallelism]; + auto getTargetBufferEleSize = [&]() { if (outputBuffersPerWindow.size() > 1) { std::vector<size_t> bufferSizeForTarget(targetNum, 0); for (size_t i = 0; i < targetNum; i++) { bufferSizeForTarget[i] = cacheBlockSizeSum / targetNum; } // TODO: LayerNumbufferTargetNum for VWSA return std::make_pair(bufferSizeForTarget, std::vector<SizeType32>(targetNum, 0)); } - size_t valideTpSize = pickUpConnections.size() / targetInfo.mDomainPPSize; - TLLM_CHECK_WITH_INFO(cacheBlockSizeSum % valideTpSize == 0, - "cacheBlockSizeSum must be divisible by valideTpSize %ld", valideTpSize); - TLLM_CHECK_WITH_INFO((cacheBlockSizeSum % (selfAttentionLayerNum * valideTpSize)) == 0, - "cacheBlockSizeSum must be divisible by valideTpSize %ld * selfAttentionLayerNum %d", valideTpSize, - selfAttentionLayerNum); + TLLM_CHECK_WITH_INFO(pickUpConnections.size() % static_cast<size_t>(targetInfo.mDomainPPSize) == 0, + "pickUpConnections.size() must be divisible by domain PP size"); + size_t validTpSize = pickUpConnections.size() / targetInfo.mDomainPPSize; + TLLM_CHECK_WITH_INFO(cacheBlockSizeSum % validTpSize == 0, + "cacheBlockSizeSum must be divisible by validTpSize %zu", validTpSize); + TLLM_CHECK_WITH_INFO((cacheBlockSizeSum % (selfAttentionLayerNum * validTpSize)) == 0, + "cacheBlockSizeSum must be divisible by validTpSize %zu * selfAttentionLayerNum %d", validTpSize, + selfAttentionLayerNum); TLLM_CHECK(targetNum == pickUpConnections.size()); - size_t baseEleSize = cacheBlockSizeSum / (valideTpSize * selfAttentionLayerNum); + size_t baseEleSize = cacheBlockSizeSum / (validTpSize * selfAttentionLayerNum); std::vector<size_t> bufferEleSizes(targetNum, 0); - std::vector<SizeType32> LayerNumbufferTargetNum(targetNum, 0); + std::vector<SizeType32> LayerNumbufferTargetNum(targetNum, 0); for (size_t i = 0; i < targetNum; i++) { LayerNumbufferTargetNum[i] = targetInfo.getPeerPPDomainLayerNum(static_cast<SizeType32>(pickUpConnections[i])); bufferEleSizes[i] = baseEleSize * LayerNumbufferTargetNum[i]; + TLLM_CHECK_WITH_INFO(LayerNumbufferTargetNum[i] > 0, "per-target PP layer count must be > 0"); } return std::make_pair(bufferEleSizes, LayerNumbufferTargetNum); }; - auto [bufferEleSizes, LayerNumbufferTargetNum] = getTargetBufferEleSzie(); + auto [bufferEleSizes, LayerNumbufferTargetNum] = getTargetBufferEleSize();cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)
99-106
: Add invariant: Sum(attentionLayerNumPerPP) must equal total attention layersThis prevents downstream OOB and mis-sizing when PP layout and model config diverge.
Apply:
std::vector<SizeType32> attentionLayerNumPerPP(ppSize, 0); for (int pp_rank = 0; pp_rank < ppSize; pp_rank++) { attentionLayerNumPerPP[pp_rank] = modelConfig.getNbAttentionLayers(ppSize, pp_rank); } + auto total = std::accumulate(attentionLayerNumPerPP.begin(), attentionLayerNumPerPP.end(), SizeType32{0}); + TLLM_CHECK_WITH_INFO( + total == modelConfig.getNbAttentionLayers(), + "Sum(attentionLayerNumPerPP) (%d) must equal total attention layers (%d).", + total, modelConfig.getNbAttentionLayers());cpp/include/tensorrt_llm/executor/dataTransceiverState.h (3)
51-58
: Fix DP rank/size initialization: using TP fields is incorrectParallelConfig’s DP fields are currently filled from TensorParallel fields, which misroutes DP-aware logic. Use the Context/Data parallel accessors from WorldConfig (the third axis).
Apply:
- , mParallelConfig{worldConfig.getTensorParallelism(), worldConfig.getPipelineParallelism(), - worldConfig.getContextParallelism(), worldConfig.enableAttentionDP(), worldConfig.getTensorParallelRank(), - worldConfig.getTensorParallelism(), attentionLayerNumPerPP} + , mParallelConfig{worldConfig.getTensorParallelism(), worldConfig.getPipelineParallelism(), + worldConfig.getContextParallelism(), worldConfig.enableAttentionDP(), + /* DPrank */ worldConfig.getContextParallelRank(), + /* DPsize */ worldConfig.getContextParallelism(), + attentionLayerNumPerPP}If the API provides getDataParallelRank()/getDataParallelism(), use those instead.
51-61
: Enforce invariants on attentionLayerNumPerPP to avoid OOB and mis-sizingValidate once at construction: length == PP, entries > 0, and sum equals number of attention layers. Without this, downstream indexing like perPP[ppIdx] risks OOB.
Apply:
CacheState(ModelConfig modelConfig, runtime::WorldConfig const& worldConfig, std::vector<SizeType32> const& attentionLayerNumPerPP, nvinfer1::DataType dataType, AttentionType attentionType = AttentionType::kDEFAULT, int kvFactor = 2) : mModelConfig(std::move(modelConfig)) , mParallelConfig{worldConfig.getTensorParallelism(), worldConfig.getPipelineParallelism(), worldConfig.getContextParallelism(), worldConfig.enableAttentionDP(), /* DPrank */ worldConfig.getContextParallelRank(), /* DPsize */ worldConfig.getContextParallelism(), attentionLayerNumPerPP} , mDataType{dataType} , mAttentionConfig(attentionType, kvFactor) { + auto const& perPP = mParallelConfig.mAttentionLayerNumPerPP; + auto const pp = mParallelConfig.mPipelineParallelism; + auto const numLayers = static_cast<SizeType32>(mModelConfig.mNbKvHeadsPerLayer.size()); + TLLM_CHECK_WITH_INFO(perPP.size() == static_cast<size_t>(pp), + "attentionLayerNumPerPP size (%zu) must equal pipelineParallelism (%d).", perPP.size(), pp); + SizeType32 sum = 0; + for (auto v : perPP) + { + TLLM_CHECK_WITH_INFO(v > 0, "attentionLayerNumPerPP entries must be > 0."); + sum += v; + } + TLLM_CHECK_WITH_INFO(sum == numLayers, + "Sum of attentionLayerNumPerPP (%d) must equal number of layers (%d).", sum, numLayers); }Repeat for the other two constructors below.
63-74
: Mirror invariant checks in the size/vector-based constructorAdd the same per-PP validation block here after member initialization.
Apply:
{ + auto const& perPP = mParallelConfig.mAttentionLayerNumPerPP; + auto const pp = mParallelConfig.mPipelineParallelism; + auto const numLayers = static_cast<SizeType32>(mModelConfig.mNbKvHeadsPerLayer.size()); + TLLM_CHECK_WITH_INFO(perPP.size() == static_cast<size_t>(pp), + "attentionLayerNumPerPP size (%zu) must equal pipelineParallelism (%d).", perPP.size(), pp); + SizeType32 sum = 0; + for (auto v : perPP) + { + TLLM_CHECK_WITH_INFO(v > 0, "attentionLayerNumPerPP entries must be > 0."); + sum += v; + } + TLLM_CHECK_WITH_INFO(sum == numLayers, + "Sum of attentionLayerNumPerPP (%d) must equal number of layers (%d).", sum, numLayers); }
🧹 Nitpick comments (15)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
144-147
: Configuration Path and Registration Verified
- The YAML file exists at
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml
.
- The test
test_disaggregated_deepseek_v3_lite_fp8_ctxpp2_gentp2_one_mtp
is registered inl0_dgx_h100.yml
.
Optional refactor recommendation:
- For naming consistency, you may consider renaming the YAML filename prefix from
ctxtp1_gentp1
toctxpp2_gentp2
so it aligns with the config key.cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (3)
349-352
: Vectorized target sizes: add mismatch/zero-size guardrails in tests
Good adaptation to the new vector-based API. Please add a negative test where:
- targetSizeVec.size() != targetNum, and
- at least one element is 0,
to assert expected behavior (exception or graceful fallback). This will protect against subtle regressions in callers constructing the vector dynamically.
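A minimal sketch of what such a negative case might look like (the fixture members and the vector-based getOrAllocateSendBuffers signature are assumed from this test file rather than verified, and the expected failure mode should follow the allocator's actual contract):

```cpp
// Illustrative only: fixture/member names and the buffer-id helper are assumptions.
TEST_F(CacheTransBufferTest, RejectsMismatchedOrZeroTargetSizes)
{
    auto bufferId = mTransBufferManager->assignBufferIndexForSend();
    int const targetNum = 4;
    // Deliberately inconsistent: three entries for four targets, one of them zero.
    std::vector<size_t> targetSizeVec{1024, 0, 1024};
    EXPECT_ANY_THROW(mTransBufferManager->getOrAllocateSendBuffers(
        bufferId, targetNum, targetSizeVec, *mBufferManager));
}
```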
412-418
: Partial-cover semantics: verify mixed per-target sizes
You're exercising bufferCoverTargetNum == targetNum/2 for equal per-target sizes. Consider adding one case with heterogeneous sizes in targetSizeVec (e.g., {big, small, big, small}) to confirm partial covering logic stays correct when per-target sizes vary.
382-382
: Fix printf format for size_t
defaultTransSize is a size_t but the format specifier is %d. Use %zu (or cast to unsigned long and use %lu) to avoid UB on some platforms.
```diff
-    TLLM_LOG_INFO("defaultTransSize: %d", defaultTransSize);
+    TLLM_LOG_INFO("defaultTransSize: %zu", static_cast<size_t>(defaultTransSize));
```
178-178
: Parameter name typo and inconsistency with definition
The declaration uses valideSegmentIdx while the definition uses validSegmentIdx (connection.cpp Line 159). Our guidelines require parameter name consistency between declarations and definitions. Please fix the typo.
```diff
-    void setSenderState(MemoryDesc mReceiverBufferDesc, int valideSegmentIdx, std::pair<size_t, size_t> offsetRatio);
+    void setSenderState(MemoryDesc mReceiverBufferDesc, int validSegmentIdx, std::pair<size_t, size_t> offsetRatio);
```
207-209
: Refactor AgentConnectionManager constructor to avoid extra CacheState copy
Passing
CacheState
by value causes an unnecessary copy at each call site (especially when the caller has an lvalue). Update both the header and implementation to take it byconst&
and perform the copy in the initializer list.Affected locations:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
(lines 207–209)cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
(lines 177–181)Suggested diff:
--- a/cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h +++ b/cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h @@ -207,7 +207,7 @@ public: AgentConnectionManager( - batch_manager::kv_cache_manager::CacheTransBufferManager* cacheTransBufferManager, CacheState cacheState); + batch_manager::kv_cache_manager::CacheTransBufferManager* cacheTransBufferManager, CacheState const& cacheState); ~AgentConnectionManager();--- a/cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp +++ b/cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp @@ -177,7 +177,7 @@ AgentConnectionManager::AgentConnectionManager( batch_manager::kv_cache_manager::CacheTransBufferManager* cacheTransBufferManager, CacheState cacheState) : mRegMemDescs(MemoryType::kVRAM, {}) + , mCacheState(cacheState) { // ... } - AgentConnectionManager::AgentConnectionManager( - batch_manager::kv_cache_manager::CacheTransBufferManager* cacheTransBufferManager, CacheState cacheState) + AgentConnectionManager::AgentConnectionManager( + batch_manager::kv_cache_manager::CacheTransBufferManager* cacheTransBufferManager, CacheState const& cacheState) : mRegMemDescs(MemoryType::kVRAM, {}) , mCacheState(cacheState) { // ... }cpp/tests/unit_tests/executor/agentCommTest.cpp (3)
93-95
: Constructing CacheState with per-PP layers: assert PP-size alignment in test
Good: you now pass attentionLayerNumPerPP. Add an assertion that the vector size matches the pipeline parallelism used in world_config (here 1) to catch misconfigurations early.
Example:
```diff
 mCacheState = std::make_unique<CacheState>(
-    numLayers, numHeads, sizePerHead, tokensPerBlock, 1, 1, 1, std::vector<SizeType32>{numLayers}, dataType);
+    numLayers, numHeads, sizePerHead, tokensPerBlock, 1, 1, 1, std::vector<SizeType32>{numLayers}, dataType);
+ASSERT_EQ(mCacheState->getParallelConfig().mAttentionLayerNumPerPP.size(), 1);
```
111-112
: Passing CacheState into AgentConnectionManager: exercise non-zero offset cases
The current test uses validConnectionIdx = 0, which doesn't exercise the new offset ratio logic. Consider adding a sub-test with attentionLayerNumPerPP like {3,5} and validConnectionIdx=1 to validate non-zero offsets in send() (e.g., write a pattern and verify the destination address shift).
I can draft this extension if helpful.
124-125
: Duplicated construction OK, but consider a helper
Both connectionManager0 and connectionManager1 receive identical CacheState copies. For test brevity, a small helper factory could reduce duplication. Optional.
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2)
177-181
: Ctor takes CacheState by value in source as well—align with header change
If you adopt the const& in the header, adjust the source accordingly (and drop std::move). See the header comment for the diff.
359-367
: Nit: spelling in parameter name
connecitonInfo is misspelled (should be connectionInfo). Not critical, but worth fixing for readability.
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
324-329
: Use correct format specifiers for size_t in logs
connections.size() is size_t; use %zu or cast explicitly to avoid UB on LLP64/LP64 differences.
Apply:
- "bufferCoverTargetNum:%d connections.size():%ld", - bufferTargetNum, targetNum, peerDuplicateHeadFactor, targetInfo.mDupHeadFactor, bufferCoverTargetNum, - connections.size()); + "bufferCoverTargetNum:%d connections.size():%zu", + bufferTargetNum, targetNum, peerDuplicateHeadFactor, targetInfo.mDupHeadFactor, bufferCoverTargetNum, + static_cast<size_t>(connections.size()));
706-716
: Use %zu for size_t in recv logs; minor correctness/logging polish
Format specifiers should match size_t types. Also harmless but makes logs portable across platforms.
Apply:
```diff
-    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " start recv bufferIdx: %d size:%ld", processIdx,
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " start recv bufferIdx: %zu size:%zu", processIdx,
         buffer->getSizeInBytes());
@@
-    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " recv bufferIdx: %d size:%ld", processIdx,
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " recv bufferIdx: %zu size:%zu", processIdx,
         buffer->getSizeInBytes());
```
Also applies to: 719-741
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)
1-16
: Update copyright year to 2025
Project guidelines request current-year copyright headers on sources/headers.
Apply:
```diff
-/*
- * Copyright (c) 2023-2024, NVIDIA CORPORATION. All rights reserved.
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
1-16
: Update header year to 2025Align with coding guideline for current-year headers.
Apply:
```diff
-/*
- * Copyright (c) 2023-2024, NVIDIA CORPORATION. All rights reserved.
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (25)
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
(1 hunks)cpp/include/tensorrt_llm/executor/dataTransceiverState.h
(2 hunks)cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
(8 hunks)cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
(4 hunks)cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
(2 hunks)cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
(3 hunks)cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
(5 hunks)cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
(6 hunks)cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
(4 hunks)cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
(39 hunks)cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
(1 hunks)cpp/tensorrt_llm/executor/serialization.cpp
(3 hunks)cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
(1 hunks)cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
(4 hunks)cpp/tests/unit_tests/executor/agentCommTest.cpp
(4 hunks)cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
(1 hunks)cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
(25 hunks)tensorrt_llm/_torch/distributed/communicator.py
(3 hunks)tensorrt_llm/_torch/pyexecutor/_util.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
(4 hunks)tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml
(1 hunks)tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml
(1 hunks)tests/integration/defs/disaggregated/test_disaggregated.py
(4 hunks)tests/integration/test_lists/test-db/l0_dgx_h100.yml
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml
- tensorrt_llm/_torch/distributed/communicator.py
- cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml
- cpp/tensorrt_llm/executor/serialization.cpp
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
tests/integration/defs/disaggregated/test_disaggregated.py
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/disaggregated/test_disaggregated.py
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
🧠 Learnings (7)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Applied to files:
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
🧬 Code graph analysis (8)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2)
setSenderState
(158-164)setSenderState
(158-159)cpp/include/tensorrt_llm/executor/dataTransceiverState.h (1)
CacheState
(40-277)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
tests/integration/defs/conftest.py (5)
disaggregated_test_root
(2339-2344)llm_venv
(707-723)disaggregated_example_root
(270-275)llama_model_root
(964-1039)deepseek_v3_model_root
(621-636)
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2)
cpp/include/tensorrt_llm/executor/dataTransceiverState.h (2)
CacheState
(40-277)AttentionType
(45-277)cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (2)
CacheTransceiver
(111-213)CacheTransceiver
(215-222)
cpp/include/tensorrt_llm/executor/dataTransceiverState.h (4)
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)
nodiscard
(67-90)cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)
nodiscard
(821-883)cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)
nodiscard
(524-604)cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (3)
nodiscard
(61-112)nodiscard
(123-126)nodiscard
(133-138)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (1)
cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1)
requestInfo
(179-179)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)
numLayers
(290-392)numLayers
(609-738)numLayers
(609-611)
cpp/tests/unit_tests/executor/agentCommTest.cpp (2)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)
numLayers
(290-392)numLayers
(609-738)numLayers
(609-611)cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (5)
numLayers
(32-69)numLayers
(32-33)numLayers
(77-88)numLayers
(77-77)mTransBufferManager
(71-75)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (4)
targetIRanks
(181-185)targetIRanks
(181-182)concatKvCacheV2Dispatch
(1619-1658)concatKvCacheV2Dispatch
(1619-1622)
🔇 Additional comments (23)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
80-81
: Config and CI Registration Confirmed
Both verifications have passed:
- The YAML config
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml
is present (YAML OK).
- The test
test_disaggregated_ctxpp4_gentp4
is registered intests/integration/test_lists/test-db/l0_dgx_h100.yml
(line 36).No further action required.
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)
397-404
: LGTM on default-trans-size path with per-target sizes
The refactor to pass targetSizeVec aligns with the new API and preserves existing expectations (onlyUseDynamicBuffer=false, backing pre-alloc honored). No issues here.
424-431
: LGTM on full-cover case
Vectorized sizes with smaller chunks reaching full cover looks correct. No further action.
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
81-83
: Const placement nit is fine
Switching to “BlocksPerWindow const ...” is stylistic; compiles the same. No concerns.
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (1)
252-288
: recvConnectionAndRequestInfo(): assert optional CacheState presence before use
You call requestInfo.getTransState().getCacheState().value() without checking. If callers ever omit cache state, this will terminate. Add a check and better error.
```diff
-    auto offsetRatio = computeSendOffsetRatio(requestInfo.getTransState().getCacheState().value(),
+    TLLM_CHECK_WITH_INFO(requestInfo.getTransState().getCacheState().has_value(),
+        "Missing CacheState in RequestInfo TransState");
+    auto offsetRatio = computeSendOffsetRatio(requestInfo.getTransState().getCacheState().value(),
         requestInfo.getTransState().getCommState()->getSelfIdx(), mCacheState, validConnectionIdx);
```
254-258
: Zero-copy gating tightened to exact PP/TP equality — good
Requiring identical PP and TP for zcopy is correct since layout/strides must match bit-for-bit. This avoids silent corruption on hetero layouts.
318-323
: Possible mismatch: getOrAllocateSendBuffers targetNum vs bufferEleSizes.size()
You pass bufferTargetNum (connections.size()/dupFactor) as targetNum, but bufferEleSizes has length targetNum (connections.size()). Ensure the allocator expects per-bufferTarget sizing; otherwise this risks under-allocation and OOB when indexing outputSplitCaches by bufferIdx/processIdx.
Would you confirm the expected vector length in CacheTransBufferManager::getOrAllocateSendBuffers(..., int targetNum, const std::vector<size_t>& bufferEleSizes, ...)? If it expects bufferEleSizes.size() == targetNum, consider:
```diff
-    auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
-        cacheBufferId, static_cast<int>(bufferTargetNum), bufferEleSizes, bufferManager);
+    auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
+        cacheBufferId, static_cast<int>(bufferEleSizes.size()), bufferEleSizes, bufferManager);
```
If the allocator is deliberately keyed to bufferTargetNum, we should instead pre-aggregate sizes per bufferTargetNum entry and pass that vector. I can generate that refactor if needed.
804-806
: Switch to concatKvCacheV2Dispatch looks good
The new dispatcher aligns with per-window output map and self/dest configs. Assumes upstream split sizing fixes are applied.
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)
73-89
: Public API extended with attentionLayerNumPerPP — consistent and clear
Both constructors accept the per-PP vector before dataType, and the delegating ctor is updated accordingly. Signature ordering is consistent with implementation.
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)
187-190
: Agent backend now receives CacheState — good change
Passing CacheState to AgentConnectionManager allows it to reason about PP/TP domains. This matches the broader API changes.
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)
323-326
: SymmetricalCacheTest: attentionLayerNumPerPP seeded correctly
Using {numLayers} for PP=1 is consistent. No action needed.
620-636
: Good: replacing hard divisibility ASSERT with computed per-PP distribution
The helper getLayerNumPPRank and mAttentionLayerNumPerPP setup enable uneven PP cases correctly.
973-986
: Start layer offset derived from mAttentionLayerNumPerPP — correct for uneven PP
The cumulative sum over previous PP ranks is the right offset when windows are not used. Logic looks good.
Also applies to: 1051-1063
1390-1418
: New EvenLayer instantiations add coverage — nice addition
These parameterized cases help ensure both directions (ctxPP!=genPP) are exercised with even divisions. No issues.
Also applies to: 1438-1449, 1491-1496, 1508-1513
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (9)
67-103
: Well-implemented uneven PP layer distribution logic
The new approach correctly computes layer overlaps between self and peer PP ranks to handle uneven pipeline parallelism. The logic systematically:
- Calculates self layer ranges from per-PP layer counts
- Identifies overlapping peer PP ranks and their contribution to the domain
- Validates that total peer layer contributions match self layer count
This is a significant improvement over the previous fixed-division approach.
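For readers skimming the diff, the overlap computation boils down to interval intersection over per-rank layer ranges. A standalone sketch of the idea (names are illustrative and do not mirror the PR's internal helpers):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Given per-PP attention-layer counts for the local and peer configurations, find how
// many of the local rank's layers each peer PP rank holds (0 means no overlap).
std::vector<int> peerLayerCounts(std::vector<int> const& selfLayersPerPP, int selfPPRank,
    std::vector<int> const& peerLayersPerPP)
{
    int const selfBegin = std::accumulate(selfLayersPerPP.begin(), selfLayersPerPP.begin() + selfPPRank, 0);
    int const selfEnd = selfBegin + selfLayersPerPP[selfPPRank];

    std::vector<int> counts;
    int peerBegin = 0;
    for (int layers : peerLayersPerPP)
    {
        int const peerEnd = peerBegin + layers;
        int const overlap = std::max(0, std::min(selfEnd, peerEnd) - std::max(selfBegin, peerBegin));
        counts.push_back(overlap);
        peerBegin = peerEnd;
    }
    // The peer contributions must cover exactly the local rank's layers.
    assert(std::accumulate(counts.begin(), counts.end(), 0) == selfEnd - selfBegin);
    return counts;
}
```

For example, with context per-PP counts {8, 8} and generation per-PP counts {5, 6, 5}, generation rank 1 owns layers [5, 11) and therefore overlaps both context ranks, receiving 3 layers from each.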
520-543
: Device helper function correctly implements layer-to-PP mapping
The device helper efficiently maps global layer IDs to PP-specific information using shared memory for broadcast. The linear search through prefix sums is appropriate given the typical small PP sizes, and the thread-level parallelization with synchronization ensures correctness.
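The host-side equivalent of that lookup is just a scan over the prefix-sum array. A plain C++ mirror of the logic (the real device helper additionally broadcasts the result through shared memory; this is only an illustration, not the kernel code):

```cpp
#include <cstdint>
#include <utility>

// Map a global layer id to (ppRank, layerIdWithinThatRank) using exclusive prefix sums
// of per-PP layer counts. Assumes prefix has ppSize + 1 entries with prefix[0] == 0.
std::pair<int, int> layerToDomainPP(int globalLayerId, std::int64_t const* prefix, int ppSize)
{
    for (int rank = 0; rank < ppSize; ++rank)
    {
        if (globalLayerId >= prefix[rank] && globalLayerId < prefix[rank + 1])
        {
            return {rank, static_cast<int>(globalLayerId - prefix[rank])};
        }
    }
    return {-1, -1}; // out of range
}
```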
548-549
: Consistent kernel signature extension for uneven PP support
All affected kernels now correctly accept the
prefixLayerNumDevPtr
parameter to support per-PP layer count lookups. The parameter placement is consistent across all kernel variants (MLA and standard).Also applies to: 617-618, 812-813, 873-874
565-570
: Proper layer mapping integration in kernels
The kernels correctly use the new device helper
getLayerIdInDomainPPandRankInDomainPP
to map global layer IDs to PP-specific coordinates. This replaces the previous fixed-division approach and properly handles uneven PP distributions.Also applies to: 644-649, 827-831, 890-895
1020-1021
: Data type widening for GPU pointer storage
The change from
std::vector<T*>
tostd::vector<uint64_t>
allows storing both GPU pointers and prefix layer counts in a unified buffer. This is a practical solution that maintains type safety while enabling efficient GPU data transfer.Also applies to: 1043-1044, 1051-1052, 1362-1363, 1382-1383, 1390-1391
1053-1060
: Prefix sum computation and GPU data layout
The prefix sum computation correctly creates cumulative layer counts for efficient GPU lookups. The data layout appends prefix sums to the pointer array, and the device pointer arithmetic properly separates the different data sections.
Also applies to: 1120-1122, 1392-1399, 1444-1446
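As a rough mental model of that packing (illustrative only; the PR's exact buffer layout may differ), the host can stage pointers and prefix sums in one uint64_t vector so a single H2D copy serves both:

```cpp
#include <cstdint>
#include <vector>

// Pack GPU cache pointers followed by exclusive prefix sums of per-PP layer counts.
std::vector<std::uint64_t> packPtrsAndPrefix(
    std::vector<void*> const& cachePtrs, std::vector<int> const& layersPerPP)
{
    std::vector<std::uint64_t> packed;
    packed.reserve(cachePtrs.size() + layersPerPP.size() + 1);
    for (void* p : cachePtrs)
    {
        packed.push_back(reinterpret_cast<std::uint64_t>(p));
    }
    std::uint64_t running = 0;
    packed.push_back(running); // prefix[0] = 0
    for (int layers : layersPerPP)
    {
        running += static_cast<std::uint64_t>(layers);
        packed.push_back(running); // prefix[i + 1]
    }
    return packed;
}
```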
1124-1126
: Correct layer count derivation from per-PP configuration
The code now properly derives
numLayers
from the per-PP attention layer counts using bounds-checked access. This addresses the previous limitation where fixed division didn't work for uneven PP distributions.Also applies to: 1448-1450
1152-1153
: Comprehensive kernel call updates for uneven PP
All kernel invocations consistently pass the
prefixLayerNumDevPtr
parameter across all template specializations and code paths (split/concat, MLA/standard, different vector sizes). This ensures uniform support for uneven PP throughout the codebase.Also applies to: 1166-1167, 1176-1177, 1190-1191, 1204-1205, 1219-1220, 1237-1238, 1252-1253, 1266-1267, 1281-1282, 1478-1479, 1492-1493, 1502-1503, 1516-1517, 1530-1531, 1545-1546, 1563-1564, 1578-1579, 1592-1593, 1607-1608
1401-1402
: Fix size validation after data type change
The size check should be updated to reflect the new
uint64_t
element size rather than the oldT*
pointer size.- TLLM_CHECK(PtrsDeviceBuffer->getSizeInBytes() == cachePtrs.size() * sizeof(uint64_t)); + TLLM_CHECK(PtrsDeviceBuffer->getSizeInBytes() == cachePtrs.size() * sizeof(uint64_t));Actually, looking more carefully, the current check is already correct. The issue appears to be with the old comment at line 1064 that still references
T*
.Likely an incorrect or invalid review comment.
PR_Github #16368 [ run ] completed with state |
c0907a5 to 0ed9d94
/bot run --add-multi-gpu-test |
PR_Github #16433 [ run ] triggered by Bot |
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
Outdated
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
Outdated
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
Outdated
Left some more comments. Kindly have a look. |
Signed-off-by: Chuang Zhu <[email protected]>
/bot run |
PR_Github #17736 [ run ] triggered by Bot |
PR_Github #17736 [ run ] completed with state |
/bot run |
PR_Github #17741 [ run ] triggered by Bot |
PR_Github #17741 [ run ] completed with state |
/bot run |
PR_Github #17753 [ run ] triggered by Bot |
PR_Github #17753 [ run ] completed with state |
/bot run |
PR_Github #17761 [ run ] triggered by Bot |
PR_Github #17761 [ run ] completed with state |
/bot run |
PR_Github #17783 [ run ] triggered by Bot |
PR_Github #17783 [ run ] completed with state |
@pcastonguay could you approve this PR? |
Summary by CodeRabbit
New Features
Improvements
Tests
Description
Currently, the TRT-LLM cacheTransceiver does not support the case where layer_num % pp_size != 0.
Customers who want to use RTX 6kD (with PP) as the context server and H20 (with TP) as the generation server are unable to test DeepSeek and Qwen because of this limitation.
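As a concrete illustration of the problem (not TRT-LLM's own layer-assignment code), 61 attention layers cannot be split evenly across 4 pipeline stages; one common policy gives the remainder to the first ranks, so the transceiver must exchange per-rank counts like {16, 15, 15, 15} instead of assuming layer_num / pp_size everywhere:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical split policy shown only to motivate per-rank layer counts.
std::vector<int> splitLayers(int numLayers, int ppSize)
{
    std::vector<int> perRank(ppSize, numLayers / ppSize);
    for (int r = 0; r < numLayers % ppSize; ++r)
    {
        perRank[r] += 1;
    }
    return perRank;
}

int main()
{
    for (int count : splitLayers(61, 4)) // prints: 16 15 15 15
    {
        std::printf("%d ", count);
    }
    return 0;
}
```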
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provides a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.