[TRTLLM-7361][feat] KV cache transfer for uneven pp #7117
Conversation
📝 Walkthrough

Threads per-pipeline-parallel (per-PP) attention-layer counts through CacheState and CacheTransceiver APIs, buffer allocation/transfer logic, CUDA split/concat kernels, agent connection offsetting, (de)serialization, Python bindings, MP/PP collectives, and tests/configs.
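For orientation, here is a minimal host-side sketch (with hypothetical helper names, not the actual CacheState/CacheTransceiver API) of how an uneven attention-layer split per PP rank and its prefix sums could be derived; in the PR itself each rank's local layer count is gathered via pp_allgather, as shown in the diagrams below.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical illustration: split `totalLayers` attention layers across `ppSize`
// pipeline stages, allowing an uneven remainder, and build the prefix sums that a
// split/concat path can use to map a global layer id to a PP rank.
std::vector<int32_t> splitLayersPerPP(int32_t totalLayers, int32_t ppSize)
{
    std::vector<int32_t> layersPerPP(ppSize, totalLayers / ppSize);
    for (int32_t r = 0; r < totalLayers % ppSize; ++r)
    {
        layersPerPP[r] += 1; // earlier ranks absorb the remainder
    }
    return layersPerPP;
}

std::vector<int32_t> prefixSums(std::vector<int32_t> const& layersPerPP)
{
    std::vector<int32_t> prefix(layersPerPP.size() + 1, 0);
    std::partial_sum(layersPerPP.begin(), layersPerPP.end(), prefix.begin() + 1);
    return prefix; // prefix[r] == first global layer id owned by rank r
}
```

For example, splitLayersPerPP(30, 4) yields {8, 8, 7, 7} with prefix sums {0, 8, 16, 23, 30}.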
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Py as Python caller
participant Dist as Distributed (TP/PP)
participant Binder as BindKvCacheTransceiver (py)
participant Cpp as CacheTransceiver (C++)
participant State as CacheState
Py->>Binder: create_kv_cache_transceiver(mapping, dist, kv_cache_manager, ...)
Binder->>Dist: pp_allgather(pp_layer_num)
Dist-->>Binder: pp_layer_num_per_pp_rank
Binder->>Cpp: CacheTransceiver(..., attentionLayerNumPerPP=pp_layer_num_per_pp_rank, dtype, ...)
Cpp->>State: CacheState(..., attentionLayerNumPerPP, dtype, ...)
State-->>Cpp: constructed
Cpp-->>Binder: instance
Binder-->>Py: transceiver
```

```mermaid
sequenceDiagram
autonumber
participant Fmt as CacheFormatter/MLAFormatter
participant Buf as CacheTransBufferManager
participant Dev as Device Buffers
Fmt->>Fmt: compute per-target bufferEleSizes & LayerNumPerTarget
Fmt->>Buf: getOrAllocateSendBuffers(bufferId, targetNum, bufferEleSizes, BM)
Buf->>Buf: compute per-element coverage & allocate slices
Buf-->>Fmt: (tensors, bufferCoverTargetNum, onlyUseDynamicBuffer)
Fmt->>Dev: slice/copy/send per target using element sizes
Note over Fmt,Dev: Receive path mirrors with getOrAllocateRecvBuffers
```

```mermaid
sequenceDiagram
autonumber
participant Mgr as AgentConnectionManager
participant Conn as AgentConnection
participant Peer as Remote Agent
Mgr->>Mgr: compute per-peer domain layer counts
Mgr->>Conn: setSenderState(receiverDesc, validIdx, offsetRatio)
Conn->>Peer: send(buffer, dst_addr + offset)
```

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
/bot run --add-multi-gpu-test

PR_Github #16024 [ run ] triggered by Bot
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1)
76-76
: pybind11 typo: use py::class_ and add missing namespace alias
- py::classh should be py::class_ (underscore), otherwise it won’t compile.
- Missing namespace alias: py is used but namespace py = pybind11; is not declared locally (don’t rely on transitive aliases).
Apply this diff:
```diff
@@
-void tb::CacheTransceiverBindings::initBindings(py::module_& m)
+namespace py = pybind11;
+
+void tb::CacheTransceiverBindings::initBindings(py::module_& m)
@@
-    py::classh<tb::BaseCacheTransceiver, PyCacheTransceiver>(m, "BaseCacheTransceiver")
+    py::class_<tb::BaseCacheTransceiver, PyCacheTransceiver>(m, "BaseCacheTransceiver")
@@
-    py::classh<tb::CacheTransceiver, tb::BaseCacheTransceiver>(m, "CacheTransceiver")
+    py::class_<tb::CacheTransceiver, tb::BaseCacheTransceiver>(m, "CacheTransceiver")
```

Also applies to: 88-88
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
680-685
: Bug: cacheType shadowing prevents kSELFKONLY from taking effect

The inner 'auto cacheType' shadows the outer variable; cacheType remains kSELF, breaking K-only tests and coverage.
Fix:
```diff
-    CacheType cacheType = CacheType::kSELF;
-    if (kvFactor == 1)
-    {
-        auto cacheType = CacheType::kSELFKONLY;
-    }
+    CacheType cacheType = (kvFactor == 1) ? CacheType::kSELFKONLY : CacheType::kSELF;
```

cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
125-126
: Don’t hard-fail on zero blocks (uneven PP can yield 0 local layers)

With uneven PP, some ranks can legitimately hold zero attention layers. TLLM_CHECK(blockNum > 0) will abort such ranks. Prefer a no-op return.
```diff
-    TLLM_CHECK(blockNum > 0);
+    if (blockNum == 0)
+    {
+        // No KV data for this rank; nothing to send.
+        return;
+    }
```
524-602
: inquireSupport: duplicate/incorrect checks and misleading message
- Duplicate kvFactor equality check appears twice.
- Heads-per-layer “== 1” guard compares selfConfig twice; the second should check destConfig.
- Log message “only support MLA” on the heads-per-layer==1 guard is misleading.
```diff
@@
-    if ((selfConfig.getModelConfig().mNbKvHeadsPerLayer.at(0) != 1)
-        || (selfConfig.getModelConfig().mNbKvHeadsPerLayer.at(0) != 1))
-    {
-        TLLM_LOG_WARNING("MLACacheFormatter::inquireSupport: only support MLA");
-        return false;
-    }
+    if (selfConfig.getModelConfig().mNbKvHeadsPerLayer.at(0) != 1
+        || destConfig.getModelConfig().mNbKvHeadsPerLayer.at(0) != 1)
+    {
+        TLLM_LOG_WARNING("MLACacheFormatter::inquireSupport: only support nbKvHeadsPerLayer == 1 for MLA");
+        return false;
+    }
@@
-    if (selfConfig.getAttentionConfig().mKvFactor != destConfig.getAttentionConfig().mKvFactor)
-    {
-        TLLM_LOG_WARNING("MLACacheFormatter::inquireSupport: only support same kv factor");
-        return false;
-    }
```

Also rename agentConnnecion -> agentConnection for clarity elsewhere in this file.
🧹 Nitpick comments (23)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
714-732
: skip_less_device(4) vs 8-rank config: align GPU gating

The scenario launches 8 ranks; the test currently skips only if <4 GPUs. If the intent is to co-locate ctx PP=4 and gen TP=4 on the same 4 GPUs, keep as-is and confirm capacity. If not, consider requiring 8 GPUs for consistency with other 8-rank tests (e.g., ctxpp4_genpp4).
Proposed change (if 8 GPUs are required):
```diff
-@pytest.mark.skip_less_device(4)
+@pytest.mark.skip_less_device(8)
```

Optional: deduplicate the repeated symlink boilerplate used across tests. For example, add a small helper and use it here.
```python
def _ensure_symlink(src, dst_dir, name):
    dst = os.path.join(dst_dir, name)
    if not os.path.islink(dst):
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        os.symlink(src, dst, target_is_directory=True)
```

Then, within this test:
```diff
-    src_dst_dict = {
-        llama_model_root:
-        f"{llm_venv.get_working_directory()}/TinyLlama/TinyLlama-1.1B-Chat-v1.0",
-    }
-    for src, dst in src_dst_dict.items():
-        if not os.path.islink(dst):
-            os.makedirs(os.path.dirname(dst), exist_ok=True)
-            os.symlink(src, dst, target_is_directory=True)
+    _ensure_symlink(
+        llama_model_root,
+        llm_venv.get_working_directory(),
+        "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    )
```

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml (3)
1-6
: Clarify precedence of free_gpu_memory_fraction (top-level vs per-server)

You set `free_gpu_memory_fraction: 0.2` at the top level and also under each server's `kv_cache_config`. If both are respected by different components, this can confuse intent. Prefer a single source of truth (per-server is usually clearer for disaggregated setups).

```diff
 model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
 hostname: localhost
 port: 8000
 backend: "pytorch"
 cuda_graph_config: null
-free_gpu_memory_fraction: 0.2
```
7-21
: Context servers config: PP=4 on 4 GPUs — confirm colocation with generation

PP=4 implies 4 context ranks; generation TP=4 implies 4 gen ranks. If both run concurrently on the same 4 GPUs (as used in L0), confirm memory fits for duplicated weights+KV across processes for TinyLlama and that `--oversubscribe` won't cause timeouts. Otherwise consider gating to 8 GPUs. Also, minor style nit: align list indentation under `urls` with the surrounding mapping for readability.

```diff
 cache_transceiver_config:
   backend: DEFAULT
   urls:
-  - "localhost:8001"
+    - "localhost:8001"
```
22-36
: Generation servers config: sanity on max_batch_size and list indent

`max_batch_size: 256` is fine for TinyLlama but is generous for L0; if flakiness appears, consider reducing to match comparable tests (e.g., 64 or 128) to lower peak memory. Also align the `urls` list indentation as above for consistency.

```diff
 cache_transceiver_config:
   backend: DEFAULT
   urls:
-  - "localhost:8002"
+    - "localhost:8002"
```

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (2)
24-24
: Header guard requirement vs. pragma once.

Coding guidelines require header guards named TRTLLM__H for headers. Replace or complement #pragma once with an include guard.
```diff
-#pragma once
+#ifndef TRTLLM_CACHE_SPLIT_CONCAT_H
+#define TRTLLM_CACHE_SPLIT_CONCAT_H
+#pragma once
```

…and at file end:
```diff
-} // namespace tensorrt_llm::executor::kv_cache
+} // namespace tensorrt_llm::executor::kv_cache
+
+#endif // TRTLLM_CACHE_SPLIT_CONCAT_H
```
66-68
: Typo: ouputSplitBlocks → outputSplitBlocks (propagate to definition)

Parameter name is misspelled; fix here and in the corresponding definition to keep declaration/definition names consistent per guidelines.
```diff
-void splitKVCacheDispatch(std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> const& kVCacheBlocksPerWindow,
-    std::vector<runtime::ITensor::SharedPtr>& ouputSplitBlocks, kv_cache::CacheState const& peerCacheState,
+void splitKVCacheDispatch(std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> const& kVCacheBlocksPerWindow,
+    std::vector<runtime::ITensor::SharedPtr>& outputSplitBlocks, kv_cache::CacheState const& peerCacheState,
```
100-104
: Initialize PP communicator conditionally.

Call create_pp_comm() only when PP is actually enabled (or pp_size > 1). This avoids constructing unnecessary groups and reduces the chance of attribute errors in non-PP configs.
```diff
-        self.create_tp_comm()
-        self.create_pp_comm()
+        self.create_tp_comm()
+        if self.has_pp and self.pp_size > 1:
+            self.create_pp_comm()
+        else:
+            self.pp_comm = None
```
139-142
: Guard PP group creation and handle COMM_NULL.

Be defensive if mapping.pp_group is missing/empty or Create_group yields COMM_NULL.
```diff
-    def create_pp_comm(self):
-        new_group = mpi_comm().group.Incl(self.mapping.pp_group)
-        self.pp_comm = mpi_comm().Create_group(new_group)
+    def create_pp_comm(self):
+        pp_group = getattr(self.mapping, "pp_group", None)
+        if not pp_group:
+            self.pp_comm = None
+            return
+        new_group = mpi_comm().group.Incl(pp_group)
+        comm = mpi_comm().Create_group(new_group)
+        # mpi4py uses MPI.COMM_NULL for empty/invalid groups
+        self.pp_comm = comm if comm else None
```
115-122
: Expose per-PP layer distribution in toString for easier debugging

Now that ParallelConfig carries mAttentionLayerNumPerPP, include it in toString(). This materially helps diagnose uneven PP distributions over logs.
Example addition inside CacheState::toString():
sstring << "attnLayerNumPerPP:["; for (auto v : mParallelConfig.mAttentionLayerNumPerPP) { sstring << v << ","; } sstring << "]\n";Also applies to: 160-179
1-15
: Update copyright year

Other changed files use 2025; this header still says 2023-2024. Consider updating to 2025 for consistency.
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1)
1-16
: Align SPDX header year with other updated files

Other touched files use 2025. Consider updating 2022-2024 to 2025 for consistency.
```diff
- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (3)
324-329
: Fix logging typo and clarify fields

Minor: "dupliacete" -> "duplicate", and prefer consistent naming.
- " formatOutput bufferTargetNum: %d, targetNum: %d, peerDuplicateHeadFactor: %d dupliacete:%d " + " formatOutput bufferTargetNum: %d, targetNum: %d, peerDuplicateHeadFactor: %d duplicate:%d "
613-651
: Typos and stronger checks in receive-side per-target sizing
- Rename getTargetBufferEleSzie -> getTargetBufferEleSize; valideTpSize -> validTpSize.
- Use consistent formatting for size_t (%zu) and keep checks readable.
```diff
-    auto getTargetBufferEleSzie = [&]()
+    auto getTargetBufferEleSize = [&]()
     {
         if (outputBuffersPerWindow.size() > 1)
         {
             std::vector<size_t> bufferSizeForTarget(targetNum, 0);
             for (size_t i = 0; i < targetNum; i++)
             {
                 bufferSizeForTarget[i] = cacheBlockSizeSum / targetNum;
             }
             // TODO: LayerNumbufferTargetNum for VWSA
             return std::make_pair(bufferSizeForTarget, std::vector<SizeType32>(targetNum, 0));
         }
-        size_t valideTpSize = pickUpConnections.size() / targetInfo.mDomainPPSize;
-        TLLM_CHECK_WITH_INFO(cacheBlockSizeSum % valideTpSize == 0,
-            "cacheBlockSizeSum must be divisible by valideTpSize %ld", valideTpSize);
-        TLLM_CHECK_WITH_INFO((cacheBlockSizeSum % (selfAttentionLayerNum * valideTpSize)) == 0,
-            "cacheBlockSizeSum must be divisible by valideTpSize %ld * selfAttentionLayerNum %d", valideTpSize,
-            selfAttentionLayerNum);
+        size_t validTpSize = pickUpConnections.size() / targetInfo.mDomainPPSize;
+        TLLM_CHECK_WITH_INFO(cacheBlockSizeSum % validTpSize == 0,
+            "cacheBlockSizeSum must be divisible by validTpSize %zu", validTpSize);
+        TLLM_CHECK_WITH_INFO((cacheBlockSizeSum % (selfAttentionLayerNum * validTpSize)) == 0,
+            "cacheBlockSizeSum must be divisible by validTpSize %zu * selfAttentionLayerNum %d", validTpSize,
+            selfAttentionLayerNum);
         TLLM_CHECK(targetNum == pickUpConnections.size());
-        size_t baseEleSize = cacheBlockSizeSum / (valideTpSize * selfAttentionLayerNum);
+        size_t baseEleSize = cacheBlockSizeSum / (validTpSize * selfAttentionLayerNum);
@@
-    auto [bufferEleSizes, LayerNumbufferTargetNum] = getTargetBufferEleSzie();
+    auto [bufferEleSizes, LayerNumbufferTargetNum] = getTargetBufferEleSize();
```
706-716
: Use correct format specifiers for size_t in logs

processIdx and sizes are size_t; prefer %zu or cast to unsigned long long to avoid UB.
```diff
-    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " start recv bufferIdx: %d size:%ld", processIdx,
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " start recv bufferIdx: %zu size:%zu", processIdx,
         buffer->getSizeInBytes());
@@
-    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " recv bufferIdx: %d size:%ld", processIdx,
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " recv bufferIdx: %zu size:%zu", processIdx,
         buffer->getSizeInBytes());
```
1-15
: Update copyright year

Guidelines require current year in headers. This file still has 2023-2024; update to 2025.
17-17
: Prefer include guards over pragma once (repo guideline)

Coding guidelines specify named include guards for headers. Consider switching to an include guard (TRTLLM_BATCH_MANAGER_CACHETRANSCEIVER_H).
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3)
340-356
: Buffer coverage computation: avoid noisy warning when using dynamic buffers

bufferCoverTargetNum is computed before considering mOnlyUseDynamicBuffer; the warning logged later can be misleading when dynamic buffers are enabled.
Apply:
@@ - TLLM_LOG_DEBUG("getOrAllocateBuffers bufferCoverTargetNum:%d", bufferCoverTargetNum); - if (bufferCoverTargetNum < static_cast<size_t>(targetNum)) + TLLM_LOG_DEBUG("getOrAllocateBuffers bufferCoverTargetNum:%zu", bufferCoverTargetNum); + if (!mOnlyUseDynamicBuffer && bufferCoverTargetNum < static_cast<size_t>(targetNum))
357-365
: Format specifier mismatch for size_t

Use %zu for size_t to avoid UB on some platforms.
- TLLM_LOG_DEBUG("getOrAllocateBuffers bufferCoverTargetNum:%d", bufferCoverTargetNum); + TLLM_LOG_DEBUG("getOrAllocateBuffers bufferCoverTargetNum:%zu", bufferCoverTargetNum);
367-371
: Agent-provided bufferId: strengthen assumption or relax invariant

You require mBufferIndexFlag[bufferId]==1. When bufferId originates from AgentConnection, ensure that the CacheTransBufferManager actually marked that ID as acquired, otherwise freeBufferIndexForRecv will underflow mConcurrence. Suggest either:
- Document that AgentConnection must call assignBufferIndexForRecv internally before exposing the id; or
- Add a public method to mark an externally reserved ID, or relax the check behind a dedicated “trusted” path.
Please confirm AgentConnection guarantees the flag is 1 before returning getCacheBufferId(). If not, I can draft a small API change to make this explicit.
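For illustration, one possible shape for such an API change; everything below is hypothetical (the class name, method, and members are placeholders, not the existing CacheTransBufferManager interface):

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical sketch: allow an externally reserved buffer id (e.g., one handed out by
// AgentConnection) to be marked as acquired, so that a later free call does not
// underflow the concurrence counter.
class TransBufferBookkeeping
{
public:
    explicit TransBufferBookkeeping(std::size_t numBuffers)
        : mBufferIndexFlag(numBuffers, 0)
    {
    }

    void markBufferIndexAcquired(std::size_t bufferId)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mBufferIndexFlag.at(bufferId) == 0)
        {
            mBufferIndexFlag.at(bufferId) = 1;
            ++mConcurrence;
        }
    }

private:
    std::mutex mMutex;
    std::vector<int> mBufferIndexFlag;
    std::atomic<int> mConcurrence{0};
};
```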
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
1181-1186
: UCX env var comment typo“destroies” -> “destroys”; “communicatoers” -> “communicators.” Minor test log polish.
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
309-351
: Z-copy receive: require one connection, okay

The early path is coherent. Suggest adding a defensive check that outputBuffers is non-empty (mirrors send path).
41-55
: pickRecvConnections: index mapping assumption

You return the first mDomainPPSize indices, assuming connections.size() == mIRanks.size() and receive fan-in equals PP domain size. If mDomainTPSize > 1, ensure ConnectionManager provides exactly mIRanks.size() connections and the chosen subset is correct for MLA DP. Otherwise, derive indices directly from targetInfo.mIRanks.
If needed, I can refactor this to compute indices from targetInfo.mIRanks to be explicit.
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
1401-1401
: Minor inconsistency in size validation.

The size check uses `sizeof(uint64_t)` but should be consistent with the actual element size being stored. Apply this fix for consistency:
```diff
-    TLLM_CHECK(PtrsDeviceBuffer->getSizeInBytes() == cachePtrs.size() * sizeof(uint64_t));
+    TLLM_CHECK(PtrsDeviceBuffer->getSizeInBytes() == cachePtrs.size() * sizeof(cachePtrs[0]));
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (19)
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1 hunks)
- cpp/include/tensorrt_llm/executor/dataTransceiverState.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (8 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (4 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (2 hunks)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (5 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (39 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
- cpp/tensorrt_llm/executor/serialization.cpp (3 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (25 hunks)
- tensorrt_llm/_torch/distributed/communicator.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (4 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_disaggregated.py (2 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/distributed/communicator.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
tests/integration/defs/disaggregated/test_disaggregated.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/distributed/communicator.py
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
cpp/tensorrt_llm/executor/serialization.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
tests/integration/defs/disaggregated/test_disaggregated.py
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
cpp/tensorrt_llm/executor/serialization.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
cpp/tensorrt_llm/executor/serialization.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
cpp/tensorrt_llm/executor/serialization.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
🧠 Learnings (4)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/executor/serialization.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
🧬 Code graph analysis (10)
tensorrt_llm/_torch/distributed/communicator.py (2)
- tensorrt_llm/_utils.py (1): mpi_comm (475-476)
- tensorrt_llm/mapping.py (1): pp_group (368-369)

cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (2): CacheTransceiver (111-213), CacheTransceiver (215-222)

tensorrt_llm/_torch/pyexecutor/_util.py (1)
- tensorrt_llm/_torch/models/checkpoints/base_weight_mapper.py (1): mapping (152-153)

cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2): getOrAllocateRecvBuffers (305-311), getOrAllocateRecvBuffers (305-307)

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (4): targetIRanks (181-185), targetIRanks (181-182), concatKvCacheV2Dispatch (1619-1658), concatKvCacheV2Dispatch (1619-1622)

cpp/include/tensorrt_llm/executor/dataTransceiverState.h (3)
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1): nodiscard (67-90)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1): nodiscard (821-875)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1): nodiscard (524-602)

tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (3)
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1): tensorrt_llm (32-131)
- tensorrt_llm/_torch/distributed/communicator.py (2): Distributed (16-95), pp_allgather (152-153)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1): KVCacheManager (141-931)

cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2): targetIRanks (181-185), targetIRanks (181-182)

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (3): numLayers (289-391), numLayers (605-734), numLayers (605-607)

tests/integration/defs/disaggregated/test_disaggregated.py (1)
- tests/integration/defs/conftest.py (4): disaggregated_test_root (2339-2344), llm_venv (707-723), disaggregated_example_root (270-275), llama_model_root (964-1039)
🔇 Additional comments (39)
tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)
36-36
: Confirm 8-rank test in a 4-GPU lane won’t overload CI
`test_disaggregated_ctxpp4_gentp4` uses 8 ranks (see get_test_config) but is added under the 4-GPU H100 lane. With `--oversubscribe`, this may still pass for TinyLlama, but it doubles process density per GPU and could increase flakiness/timeout risk. Please confirm this is intentional for L0 and that memory fits, or gate it under an 8-GPU lane and bump the test's skip marker accordingly.

tests/integration/defs/disaggregated/test_disaggregated.py (1)
79-81
: Config file presence verified; mapping confirmed LGTM
- Confirmed the YAML exists at `tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml`.
- Verified references in `tests/integration/defs/disaggregated/test_disaggregated.py` (lines 79–81 and the `test_disaggregated_ctxpp4_gentp4` definition at line 717).

No further action needed—approving these changes.
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1)
42-50
: I’ve added scripts to locate and inspect the relevant declarations and usage in cacheSplitConcat.h—once we have the output, we can verify if the accessor needs const-safety, bounds checks, and whether the TP/PP ordering assumption is valid.

cpp/tensorrt_llm/executor/serialization.cpp (1)
555-556
: LGTM: serialization and size accounting include the new field.

The new field is serialized immediately after DPsize, and serializedSize reflects it. This keeps the struct's write/read symmetry intact within this revision.
Also applies to: 572-573
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (4)
31-35
: Plumbing dist through the factory is correct and keeps call sites tidy.

Signature change and call site update look good.
63-65
: Constructor wiring is consistent with the signature change.

Passing dist into BindKvCacheTransceiver keeps responsibility localized.
110-112
: Argument order to CacheTransceiverCpp matches the updated C++ constructors.

Ordering is world_config → attentionLayerNumPerPP → dtype → attention_type → config, which aligns with the C++ signatures in cacheTransceiver.h.
96-113
: Incorrect assumption about TorchDist.pp_allgather

The claim that "TorchDist has no pp_* methods" is not accurate. In tensorrt_llm/_torch/distributed/communicator.py (lines 152–154), the TorchDist class indeed implements pp_allgather, so dist.pp_allgather(...) will succeed at runtime for the PyTorch-based Distributed:
```python
def pp_allgather(self, obj):
    return self.pp_comm.allgather(obj)
```

Because TorchDist implements this method, the existing code does not "break for TorchDist" as stated. No refactoring or fallback is needed to support pp_allgather when using TorchDist.
Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/pyexecutor/_util.py (2)
553-555
: Call-site update to pass dist is correct.

This matches the new create_kv_cache_transceiver signature and preserves behavior when the transceiver is disabled.
518-536
: Scheduler capacity tweak with attention DP: good, but re-evaluate interaction with new PP gather.

Given the new PP allgather during KV transceiver init, ensure the warm-up path doesn't call into create_kv_cache_transceiver before the groups are ready. If initialization order can vary, guard against using dist before PP groups exist (fallback in kv_cache_transceiver.py will help).
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)
92-98
: LGTM – binding updated and Python call-sites verified
- Confirmed that in `tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py`, `CacheTransceiverCpp` is now invoked with nine arguments, including the new `pp_layer_num_per_pp_rank` parameter after `world_config`.
- Searched all Python files for any other `CacheTransceiver` instantiations or references and found no outdated eight-argument calls.

cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1)
90-95
: Constructor binding updated to include attention_layer_num_per_pp — ordering looks consistent

The new parameter is placed after world_config and before dtype, matching the C++ constructor. Good.
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
254-258
: Tightening zero-copy gating to exact TP/PP match is correct

Zero-copy requires identical layouts. Enforcing equality on TP and PP avoids subtle shape mismatches. LGTM.
804-806
: Switch to concatKvCacheV2Dispatch is consistent with V2 split/concat path

Matches the new V2 per-target vectorized path; OK.
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (3)
72-73
: Per-target sizes for send buffers — API change looks good

Moving from a single size to per-target sizes aligns with the new vectorized split/concat and buffer allocation logic.
76-77
: Per-target sizes for recv buffers — API change looks good

Consistent with send side; OK.
95-97
: All call sites migrated — no stale overloads remain

A project-wide search confirmed that only the new overloads are referenced:

- `getOrAllocateSendBuffers` / `getOrAllocateRecvBuffers` in
  - cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  - cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  - cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  - cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp/.h
- `getOrAllocateBuffers` in
  - cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp/.h

No stale overloads remain; no further action required.
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)
73-89
: Inspect factory and test helper constructor calls

I've provided a script to confirm that:

- The `CacheTransceiverFactory::createCacheTransceiver` invocation was updated to match the new constructor signature (i.e., passes `attentionLayerNumPerPP` before `dataType`).
- The two `setUpCacheTransceiver` helper definitions in the C++ tests call the constructor with the new argument order.

Once the output shows that both the factory and test helpers use the new `(…, worldConfig, attentionLayerNumPerPP, dataType, …)` ordering, no further call-site updates are required. If any calls still pass `dataType` immediately after `worldConfig`, you'll need to reorder those to insert `attentionLayerNumPerPP` first.

cpp/tests/batch_manager/cacheTransceiverTest.cpp (4)
616-638
: Good: helper to derive per-PP layer counts mirrors uneven PP

The getLayerNumPPRank() helper and mAttentionLayerNumPerPP wiring look correct and readable.
724-731
: Pass-through of per-PP layer vector to CacheState is correct

Both mCacheState and mContextCacheState propagate attentionLayerNumPerPP; matches production constructor signature.
962-983
: Zero-layer ranks: startLayerId computation handles uneven PP

Nice: startLayerId sums per-PP counts before mPpRank, and you special-case window attention. A small sketch of that computation follows.
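Under uneven PP the start layer id of a rank reduces to a prefix sum of the per-PP counts of the earlier ranks; a minimal sketch (illustrative names, not the test's exact code):

```cpp
#include <numeric>
#include <vector>

// startLayerId for a PP rank is the sum of the attention-layer counts of all earlier ranks.
int startLayerIdForRank(std::vector<int> const& attentionLayerNumPerPP, int ppRank)
{
    return std::accumulate(attentionLayerNumPerPP.begin(), attentionLayerNumPerPP.begin() + ppRank, 0);
}
```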
1419-1439
: MLA even-layer instantiations increase coverage

The added EvenLayer MLAs with/without DP help guard the new per-PP path.
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)
111-137
: Constructor forward of per-PP vector to CacheState looks correct

Forwarding attentionLayerNumPerPP into CacheState aligns with the rest of the PR.
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)
201-245
: Chunked send logic looks sound with per-target buffers

Using per-target sizes and copying into a selected buffer (pre-alloc vs split cache) makes sense. Good synchronization points.
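To make the per-target slicing concrete, a rough sketch of deriving each target's (offset, size) slice from per-target element sizes (illustrative only; the formatter additionally chooses between pre-allocated and dynamically split buffers):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Given per-target element counts, compute the (offset, size) slice of a flat send
// buffer that belongs to each target; each slice is then copied/sent independently.
std::vector<std::pair<std::size_t, std::size_t>> sliceOffsets(std::vector<std::size_t> const& bufferEleSizes)
{
    std::vector<std::pair<std::size_t, std::size_t>> slices;
    slices.reserve(bufferEleSizes.size());
    std::size_t offset = 0;
    for (auto size : bufferEleSizes)
    {
        slices.emplace_back(offset, size);
        offset += size;
    }
    return slices;
}
```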
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (15)
67-103
: LGTM! Comprehensive per-PP layer distribution calculation.

The implementation correctly calculates layer overlaps between peer and self PP ranks using per-PP attention layer counts. The logic properly handles uneven layer distribution across pipeline parallel ranks and validates the total layer count matches expectations.
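Conceptually the overlap calculation is an interval intersection over the global layer index space; a simplified host-side sketch (not the actual code in cacheSplitConcat.cu):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Number of attention layers shared between peer PP rank `peerRank` and self PP rank
// `selfRank`, where each side describes its ownership with a per-PP layer-count vector.
int layerOverlap(std::vector<int> const& peerLayersPerPP, int peerRank,
    std::vector<int> const& selfLayersPerPP, int selfRank)
{
    auto rangeOf = [](std::vector<int> const& layersPerPP, int rank)
    {
        int begin = 0;
        for (int r = 0; r < rank; ++r)
        {
            begin += layersPerPP[r];
        }
        return std::make_pair(begin, begin + layersPerPP[rank]);
    };
    auto [peerBegin, peerEnd] = rangeOf(peerLayersPerPP, peerRank);
    auto [selfBegin, selfEnd] = rangeOf(selfLayersPerPP, selfRank);
    return std::max(0, std::min(peerEnd, selfEnd) - std::max(peerBegin, selfBegin));
}
```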
520-543
: Well-designed device helper for per-PP layer mapping.

The `getLayerIdInDomainPPandRankInDomainPP` function efficiently maps a global layer ID to domain-PP local indices using shared memory optimization. The use of `__shared__` variables and `__syncthreads()` ensures thread synchronization within the block.
548-548
: Kernel signature updated correctly for per-PP support.

The addition of the `uint64_t* prefixLayerNumDevPtr` parameter enables kernels to access per-PP layer distribution information on the device.
565-569
: Proper integration of per-PP layer mapping in MLA kernel.

The kernel correctly calls the helper function to determine the appropriate PP rank and local layer indices for the current layer ID, enabling proper cache splitting for uneven PP configurations.
577-577
: Cache indexing updated for per-PP layer distribution.

The kernel now uses `rankInDomainPP` for cache indexing and `layerNumInSpecPP` for block size calculations, correctly adapting to variable layer counts per PP rank.

Also applies to: 583-583
618-618
: Standard KV cache kernel properly updated for per-PP support.

All the necessary changes have been applied consistently:

- Added the `prefixLayerNumDevPtr` parameter
- Integrated the per-PP layer mapping helper
- Updated cache indexing with `rankInDomainPP`
- Used `layerNumInSpecPP` for block size calculations

Also applies to: 644-648, 665-665, 670-670
812-812
: MLA concat kernel consistently updated.

The concat kernel mirrors the split kernel changes appropriately, maintaining consistency in per-PP layer handling between split and concat operations.
Also applies to: 827-831, 839-839, 844-844
873-873
: Standard concat kernel properly updated.

The concat kernel changes are consistent with the corresponding split kernel, ensuring proper reconstruction of KV cache blocks from the per-PP distributed format.
Also applies to: 890-894, 911-911, 916-916
1020-1020
: Buffer management updated for 64-bit pointers.

The change from `std::vector<T*>` to `std::vector<uint64_t>` for cache pointers accommodates the device pointer passing mechanism and aligns with the kernel's `uint64_t*` parameter type.

Also applies to: 1043-1043, 1051-1051
1053-1059
: Prefix sum calculation for per-PP layer distribution.

The prefix sum array (`prefixLayerNum`) enables efficient binary search-like operations in the device helper function to map global layer IDs to PP-local indices.
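A host-side equivalent of that lookup, assuming `prefixLayerNum[r]` holds the first global layer id owned by domain-PP rank r (the device helper performs the same search and caches the result in shared memory):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Map a global layer id to (rankInDomainPP, layerIdInDomainPP) using the prefix-sum array.
std::pair<int, int> mapLayerToDomainPP(std::vector<uint64_t> const& prefixLayerNum, uint64_t layerId)
{
    int rank = 0;
    while (rank + 1 < static_cast<int>(prefixLayerNum.size()) && prefixLayerNum[rank + 1] <= layerId)
    {
        ++rank;
    }
    return {rank, static_cast<int>(layerId - prefixLayerNum[rank])};
}
```

For example, with prefix sums {0, 8, 16, 23, 30}, global layer 20 maps to rank 2, local layer 4.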
1120-1126
: Device pointer setup correctly implemented.

The device pointers are properly segmented in the unified buffer: input blocks, output caches, and prefix layer numbers. The calculation of `numLayers` now uses per-PP attention layer counts instead of assuming uniform distribution.
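A simplified sketch of how the three segments might be packed into one flat uint64_t staging array before the host-to-device copy (illustrative only; the actual code also accounts for per-window layout and exact segment offsets):

```cpp
#include <cstdint>
#include <vector>

// Pack input block pointers, output cache pointers, and the prefix layer-count array into
// a single flat uint64_t vector; the kernel receives one device pointer and indexes into
// the three segments by known offsets.
std::vector<uint64_t> packKernelArgs(std::vector<void*> const& inputBlocks,
    std::vector<void*> const& outputCaches, std::vector<uint64_t> const& prefixLayerNum)
{
    std::vector<uint64_t> packed;
    packed.reserve(inputBlocks.size() + outputCaches.size() + prefixLayerNum.size());
    for (auto* p : inputBlocks)
    {
        packed.push_back(reinterpret_cast<uint64_t>(p));
    }
    for (auto* p : outputCaches)
    {
        packed.push_back(reinterpret_cast<uint64_t>(p));
    }
    packed.insert(packed.end(), prefixLayerNum.begin(), prefixLayerNum.end());
    return packed;
}
```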
1152-1152
: All kernel invocations consistently updated.

Every kernel launch site has been updated to pass the `prefixLayerNumDevPtr` parameter, ensuring consistent behavior across all data type and vectorization variants.

Also applies to: 1166-1166, 1176-1176, 1190-1190, 1204-1204, 1219-1219, 1237-1237, 1252-1252, 1266-1266, 1281-1281
1362-1362
: Concat function buffer management updated.

The concat path maintains consistency with the split path in buffer management, using `uint64_t` for cache pointers and proper device memory layout.

Also applies to: 1382-1382, 1390-1390
1392-1398
: Per-PP layer distribution properly handled in concat.

The concat function correctly reconstructs the prefix layer number array and derives the appropriate layer count for the current PP rank, maintaining consistency with the split operation.
Also applies to: 1444-1449
1478-1478
: All concat kernel invocations updated.

All concat kernel launch sites consistently pass the `prefixLayerNumDevPtr` parameter, completing the integration of per-PP layer distribution support.

Also applies to: 1492-1492, 1502-1502, 1516-1516, 1530-1530, 1545-1545, 1563-1563, 1578-1578, 1592-1592, 1607-1607
PR_Github #16024 [ run ] completed with state
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (1)
194-199
: Build fix: call the right comparator for StaticBatchingStats

The helper is named compareStaticBatchingStats, but the code calls compareStaticBatching, causing a compile error.
```diff
-    else if constexpr (std::is_same<T, texec::StaticBatchingStats>::value)
-    {
-        compareStaticBatching(val, val2);
-    }
+    else if constexpr (std::is_same<T, texec::StaticBatchingStats>::value)
+    {
+        compareStaticBatchingStats(val, val2);
+    }
```

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
382-382
: Wrong printf specifier for size_t in TLLM_LOG_INFO

defaultTransSize is size_t; using %d is UB on LP64. Use %zu (or cast to unsigned long long with %llu).
- TLLM_LOG_INFO("defaultTransSize: %d", defaultTransSize); + TLLM_LOG_INFO("defaultTransSize: %zu", static_cast<size_t>(defaultTransSize));
🧹 Nitpick comments (6)
cpp/tests/unit_tests/executor/agentCommTest.cpp (2)
150-153
: Avoid const_cast on connection; prefer const-correct API or a non-const handle

The test obtains a const pointer and strips constness to call sendRequestAndBufferInfo. This is brittle and could hide real const-correctness issues.
Consider one of:
- Make sendRequestAndBufferInfo const if it is logically non-mutating.
- Have AgentConnectionManager expose a non-const AgentConnection handle for operations that mutate connection state.
- Change getConnections(...) to return non-const pointers/refs when mutation is required by callers.
If changing API is out-of-scope for this PR, at least add a TODO to remove const_cast later.
```diff
-    auto agentConnection0 = const_cast<tensorrt_llm::executor::kv_cache::AgentConnection*>(
-        dynamic_cast<tensorrt_llm::executor::kv_cache::AgentConnection const*>(connection0));
+    // TODO(TRTLLM-7361): Avoid const_cast by returning a non-const handle or making the method const.
+    auto agentConnection0 = const_cast<tensorrt_llm::executor::kv_cache::AgentConnection*>(
+        dynamic_cast<tensorrt_llm::executor::kv_cache::AgentConnection const*>(connection0));
```
165-173
: Tighten async capture and device setup to reduce flakiness
- Capture only what you need by value; avoid implicit [&] to prevent accidental reference to a soon-to-be-invalid object.
- Optionally set the CUDA device on the main thread before host->device memcpy for clarity, even if device 0 is default.
```diff
-    auto future = std::async(std::launch::async,
-        [&]()
-        {
-            TLLM_CUDA_CHECK(cudaSetDevice(0));
-            connection1->send(dataContext, sendBuffer->data(), sendSize);
-        });
+    auto future = std::async(std::launch::async,
+        [conn = connection1, sendSize, dataPtr = sendBuffer->data(), rid = requestId]() {
+            TLLM_CUDA_CHECK(cudaSetDevice(0));
+            DataContext ctx{static_cast<int>(rid)};
+            conn->send(ctx, dataPtr, sendSize);
+        });
```

Optionally (non-blocking):
```diff
-    TLLM_CUDA_CHECK(cudaMemcpy(sendBuffer->data(), sendData.data(), sendSize, cudaMemcpyHostToDevice));
+    TLLM_CUDA_CHECK(cudaSetDevice(0));
+    TLLM_CUDA_CHECK(cudaMemcpy(sendBuffer->data(), sendData.data(), sendSize, cudaMemcpyHostToDevice));
```

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (4)
349-352
: Per-target sizing vector: add a negative test for mismatched lengths

Great update to per-target sizes. Consider adding a test where targetSizeVec.size() != targetNum to ensure the manager rejects/handles mismatches deterministically.
If desired, I can draft a small test that expects an error/empty optional when lengths differ.
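A minimal sketch of such a negative test, reusing the fixture members that the existing tests in this file already rely on; the fixture name and the exact failure mode (throw vs. TLLM check) are assumptions to confirm:

```cpp
// Hypothetical negative test: a targetSizeVec whose length differs from targetNum
// should be rejected rather than silently mis-sliced (exact failure mode assumed).
TEST_F(CacheTransBufferTest, MismatchedTargetSizeVecIsRejected)
{
    std::vector<size_t> targetSizeVec = {256, 256, 256}; // only 3 entries
    size_t const targetNum = 4;                          // but 4 targets requested
    EXPECT_ANY_THROW(
        mTransBufferManager->getOrAllocateSendBuffers(bufferId, targetNum, targetSizeVec, bufferManager));
}
```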
397-400
: Exercise heterogeneous per-target sizes

Current tests use identical sizes for all targets. Add a case with different sizes per target to verify the allocator respects per-target sizing and coverage math.
Example (outside this hunk):
```cpp
std::vector<size_t> targetSizeVec = {256, 1024, 512, 768};
auto [sendBuffers, covered, onlyDyn] = mTransBufferManager->getOrAllocateSendBuffers(
    bufferId, targetSizeVec.size(), targetSizeVec, bufferManager);
// Validate sizes individually and covered count.
```
412-415
: Reinitialization pattern is correct; consider asserting buffer reuse behavior

You rebuild targetSizeVec before each call, which is good. Add assertions to confirm whether the underlying static buffer is reused vs. dynamic allocation, to guard regressions in reuse logic.
424-427
: Coverage: add a case where accumulated per-target bytes exceed defaultTransSize by a small epsilon

This boundary test can catch off-by-one rounding when splitting defaultTransSize across targets.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (4 hunks)
- cpp/tests/unit_tests/executor/agentCommTest.cpp (1 hunks)
- cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (1 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_disaggregated.py (4 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
- tests/integration/defs/disaggregated/test_disaggregated.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
🧬 Code graph analysis (1)
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (3): numLayers (289-391), numLayers (605-734), numLayers (605-607)
🔇 Additional comments (1)
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
93-95
: Constructor update to CacheState looks correct for ppSize == 1

Passing attentionLayerNumPerPP as {numLayers} matches ppSize = 1 in this fixture. No functional issues spotted here.
aae4a4e to a7e30ba
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (2)
49-51
: Fix typo in public API: supportFbaricMemory → supportFabricMemory

Public API has a misspelling that will leak into downstream callers and bindings.
Apply this diff to correct the symbol and optionally provide a deprecated alias for transition:
```diff
-    static bool supportFbaricMemory();
+    // Returns whether fabric memory is supported on this platform.
+    static bool supportFabricMemory();
+
+    // TODO: remove in a future release. Kept temporarily to avoid breakage.
+    [[deprecated("Use supportFabricMemory()")]]
+    static bool supportFbaricMemory() { return supportFabricMemory(); }
```

I can scan the repo and update call sites if you want.
20-29
: Include required headers for std::unique_ptr and std::map in a public header. This header uses std::unique_ptr and std::map but doesn't include <memory> or <map>. Relying on transitive includes from other headers is brittle.
Apply:
#include "tensorrt_llm/runtime/iTensor.h" #include <atomic> #include <condition_variable> #include <cstddef> #include <optional> #include <unordered_map> #include <vector> +#include <map> +#include <memory>tensorrt_llm/_torch/distributed/communicator.py (1)
16-21
: Declare PP collectives in the Distributed interface. Callers (e.g., kv_cache_transceiver) use dist.pp_allgather; enforce presence via ABC to avoid runtime AttributeError.
class Distributed(ABC): @@ @abstractmethod def allgather(self, obj, root=0): pass + + # Pipeline-parallel collectives + @abstractmethod + def pp_allgather(self, obj): + """PP allgather. For pp_size==1, should return [obj].""" + pass + + @abstractmethod + def pp_gather(self, obj): + """PP gather. For pp_size==1, should return [obj].""" + pass + + @abstractmethod + def pp_broadcast(self, obj, root=0): + """PP broadcast. For pp_size==1, should return obj.""" + passtensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1)
139-144
: Python/C++ binding mismatch: pre_alloc_buffer_size now expects a dict (per-window), not an int. The C++ signature is preAllocBufferSize(std::map<SizeType32, SizeType32> const&, ...), but this wrapper still forwards an int, likely causing the pipeline failure.
Apply this backward-compatible wrapper:
- @staticmethod - def pre_alloc_buffer_size(kv_cache_size_per_token: int, - cache_transceiver_config: CacheTransceiverConfig): - return CacheTransBufferManagerCpp.pre_alloc_buffer_size( - kv_cache_size_per_token, cache_transceiver_config) + @staticmethod + def pre_alloc_buffer_size(kv_cache_size_bytes_per_token_per_window, + cache_transceiver_config: CacheTransceiverConfig): + """ + kv_cache_size_bytes_per_token_per_window: + - dict[window_size:int] = size_bytes_per_token:int (preferred), or + - int (deprecated): treated as a single-window mapping {0: value}. + """ + if isinstance(kv_cache_size_bytes_per_token_per_window, int): + # Backward-compat: assume single window (key 0). + mapping = {0: kv_cache_size_bytes_per_token_per_window} + else: + mapping = kv_cache_size_bytes_per_token_per_window + return CacheTransBufferManagerCpp.pre_alloc_buffer_size( + mapping, cache_transceiver_config)If you share how window keys are defined elsewhere, I’ll adapt the compat mapping to the canonical key.
♻️ Duplicate comments (5)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (3)
103-109
: Divisibility checks contradict uneven PP; validate per-PP distribution instead. Uneven PP is disallowed by requiring layerNum % pp == 0. Replace with validation against attentionLayerNumPerPP.
Apply:
- auto selfPPNum = selfConfig.getParallelConfig().mPipelineParallelism; - auto selfAllLayerNum = selfConfig.getModelConfig().mNbKvHeadsPerLayer.size(); - auto destPPNum = destConfig.getParallelConfig().mPipelineParallelism; - auto destAllLayerNum = destConfig.getModelConfig().mNbKvHeadsPerLayer.size(); - TLLM_CHECK_WITH_INFO(selfAllLayerNum % selfPPNum == 0, " For VWSA selfAllLayerNum must be divisible by selfPPNum"); - TLLM_CHECK_WITH_INFO(destAllLayerNum % destPPNum == 0, "For VWSA destAllLayerNum must be divisible by destPPNum"); + auto const selfPPNum = selfConfig.getParallelConfig().mPipelineParallelism; + auto const destPPNum = destConfig.getParallelConfig().mPipelineParallelism; + auto const selfAllLayerNum + = static_cast<SizeType32>(selfConfig.getModelConfig().mNbKvHeadsPerLayer.size()); + auto const destAllLayerNum + = static_cast<SizeType32>(destConfig.getModelConfig().mNbKvHeadsPerLayer.size()); + auto const& selfPerPP = selfConfig.getParallelConfig().mAttentionLayerNumPerPP; + auto const& destPerPP = destConfig.getParallelConfig().mAttentionLayerNumPerPP; + TLLM_CHECK_WITH_INFO(selfPerPP.size() == static_cast<size_t>(selfPPNum), + "self attentionLayerNumPerPP size (%zu) must equal selfPPNum (%d).", selfPerPP.size(), selfPPNum); + TLLM_CHECK_WITH_INFO(destPerPP.size() == static_cast<size_t>(destPPNum), + "dest attentionLayerNumPerPP size (%zu) must equal destPPNum (%d).", destPerPP.size(), destPPNum); + auto sumVec = [](auto const& v){ SizeType32 s=0; for (auto x: v){ TLLM_CHECK_WITH_INFO(x>0,"per-PP layer count must be >0"); s+=x;} return s; }; + TLLM_CHECK_WITH_INFO(sumVec(selfPerPP) == selfAllLayerNum, "Sum(self attentionLayerNumPerPP) must equal total layers."); + TLLM_CHECK_WITH_INFO(sumVec(destPerPP) == destAllLayerNum, "Sum(dest attentionLayerNumPerPP) must equal total layers.");
286-317
: Bounds-check mAttentionLayerNumPerPP indexing and enforce divisibility before integer divisions. Guard ppIdx indexing and add sanity checks for integer divisions to avoid silent truncation.
Apply:
- int selfAttentionLayerNum - = selfConfig.getParallelConfig() - .mAttentionLayerNumPerPP[selfIdx / selfConfig.getParallelConfig().mTensorParallelism]; + auto const& perPP = selfConfig.getParallelConfig().mAttentionLayerNumPerPP; + auto const tp = selfConfig.getParallelConfig().mTensorParallelism; + auto const ppIdx = selfIdx / tp; + TLLM_CHECK_WITH_INFO(ppIdx < perPP.size(), "ppIdx OOB: %d vs %zu", ppIdx, perPP.size()); + int selfAttentionLayerNum = perPP[ppIdx]; + TLLM_CHECK_WITH_INFO(selfAttentionLayerNum > 0, "selfAttentionLayerNum must be > 0"); @@ - auto getBufferSizeForTarget = [&]() + auto getBufferSizeForTarget = [&]() { - std::vector<size_t> bufferSizeForTarget(targetNum, 0); - std::vector<SizeType32> LayerNumbufferTargetNum(bufferTargetNum, 0); + std::vector<size_t> bufferSizeForTarget(targetNum, 0); + std::vector<SizeType32> layerNumPerTarget(bufferTargetNum, 0); if (inputKvCacheBlocks.size() > 1) { // for VWSA for (size_t i = 0; i < targetNum; i++) { bufferSizeForTarget[i] = allCacheBlockSize * peerDuplicateHeadFactor / targetNum; } - return std::make_pair(bufferSizeForTarget, LayerNumbufferTargetNum); + return std::make_pair(bufferSizeForTarget, layerNumPerTarget); } + TLLM_CHECK_WITH_INFO( + (allCacheBlockSize * static_cast<size_t>(peerDuplicateHeadFactor)) % static_cast<size_t>(targetInfo.mDomainTPSize) == 0, + "allCacheBlockSize*peerDuplicateHeadFactor must be divisible by domain TP size"); + TLLM_CHECK_WITH_INFO( + ((allCacheBlockSize * static_cast<size_t>(peerDuplicateHeadFactor)) / static_cast<size_t>(targetInfo.mDomainTPSize)) + % static_cast<size_t>(selfAttentionLayerNum) == 0, + "Base element size must be divisible by selfAttentionLayerNum"); for (size_t i = 0; i < targetNum; i++) { bufferSizeForTarget[i] = allCacheBlockSize * peerDuplicateHeadFactor / targetInfo.mDomainTPSize / selfAttentionLayerNum * targetInfo.getPeerPPDomainLayerNum(i); } for (size_t i = 0; i < bufferTargetNum; i++) { - LayerNumbufferTargetNum[i] = targetInfo.getPeerPPDomainLayerNum(i); + layerNumPerTarget[i] = targetInfo.getPeerPPDomainLayerNum(i); } - return std::make_pair(bufferSizeForTarget, LayerNumbufferTargetNum); + return std::make_pair(bufferSizeForTarget, layerNumPerTarget); }; - auto [bufferEleSizes, LayerNumbufferTargetNum] = getBufferSizeForTarget(); + auto [bufferEleSizes, layerNumPerTarget] = getBufferSizeForTarget();
348-391
: Use bufferIdx consistently in chunked-send branch and fix size_t formats. remainSendSize/needSendSize read from outputSplitCaches[processIdx] is incorrect; should be bufferIdx. Also prefer %zu.
- TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %ld", processIdx); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %zu", processIdx); @@ - TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %d bufferIdx: %d size:%ld", - processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %zu bufferIdx: %zu size:%zu", + processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); @@ - size_t remainSendSize = outputSplitCaches[processIdx]->getSize(); - size_t needSendSize = outputSplitCaches[processIdx]->getSize(); + size_t remainSendSize = outputSplitCaches[bufferIdx]->getSize(); + size_t needSendSize = outputSplitCaches[bufferIdx]->getSize(); @@ - TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " end send processIdx: %d bufferIdx: %d size:%ld", - processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " end send processIdx: %zu bufferIdx: %zu size:%zu", + processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes());tensorrt_llm/_torch/distributed/communicator.py (1)
152-160
: Short-circuit PP collectives when pp_size <= 1Avoid unnecessary comms and make degenerate PP robust.
def pp_allgather(self, obj): - return self.pp_comm.allgather(obj) + if self.mapping.pp_size <= 1: + return [obj] + return self.pp_comm.allgather(obj) def pp_gather(self, obj): - return self.pp_comm.gather(obj) + if self.mapping.pp_size <= 1: + return [obj] + return self.pp_comm.gather(obj) def pp_broadcast(self, obj, root=0): - return self.pp_comm.bcast(obj, root) + if self.mapping.pp_size <= 1: + return obj + return self.pp_comm.bcast(obj, root)cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)
297-303
: LGTM! Good API evolution for uneven PP support.The change from a single
targetBufferEleSize
to a vector of sizes is a sensible way to support uneven distribution of layers across pipeline parallel ranks.
🧹 Nitpick comments (15)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (4)
620-642
: Uneven PP distribution helper is good; minor cleanup and correctness nits.
- The distribution logic is solid. Nit: rename
layerNumthisRank
→layerNumThisRank
for readability.- Ensure
mAttentionLayerNumPerPP
andcontextAttentionLayerNumPerPP
always sum tonumLayers
; you already construct them that way, but a quick assert in tests helps catch regressions.Apply this rename:
- int layerNumthisRank = getLayerNumPPRank(numLayers, mPpRank, mPpSize); + int layerNumThisRank = getLayerNumPPRank(numLayers, mPpRank, mPpSize);- mManager = std::make_unique<KVCacheManager>(layerNumthisRank, numHeadsPerRank, sizePerHead, tokensPerBlock, + mManager = std::make_unique<KVCacheManager>(layerNumThisRank, numHeadsPerRank, sizePerHead, tokensPerBlock,Also applies to: 721-724
1050-1051
: Naming consistency nit: layerSizethisRank → layerSizeThisRank.Consistent casing improves readability and reduces copy/paste mistakes.
- int layerSizethisRank = blockData.getDimension<1>(); + int layerSizeThisRank = blockData.getDimension<1>();And update references within the function accordingly:
- for (int layerId = 0; layerId < layerSizethisRank; layerId++) + for (int layerId = 0; layerId < layerSizeThisRank; layerId++)
1408-1412
: Prefer INSTANTIATE_TEST_SUITE_P (gtest rename) for new suites.GTest renamed TEST_CASE → TEST_SUITE; older macro still works but is deprecated on newer toolchains and can warn under -Werror.
-INSTANTIATE_TEST_CASE_P(AsymmetricCaseTest1EvenLayer, AsymmetricalCacheTest, +INSTANTIATE_TEST_SUITE_P(AsymmetricCaseTest1EvenLayer, AsymmetricalCacheTest,-INSTANTIATE_TEST_CASE_P(AsymmetricCaseTest2EvenLayer, AsymmetricalCacheTest, +INSTANTIATE_TEST_SUITE_P(AsymmetricCaseTest2EvenLayer, AsymmetricalCacheTest,-INSTANTIATE_TEST_CASE_P(AsymmetricCaseTest1ForMLAEvenLayer, AsymmetricalCacheTestWithDP, +INSTANTIATE_TEST_SUITE_P(AsymmetricCaseTest1ForMLAEvenLayer, AsymmetricalCacheTestWithDP,-INSTANTIATE_TEST_CASE_P(AsymmetricCaseTest2ForMLAEvenLayer, AsymmetricalCacheTestWithDP, +INSTANTIATE_TEST_SUITE_P(AsymmetricCaseTest2ForMLAEvenLayer, AsymmetricalCacheTestWithDP,Please confirm the CI gtest version supports
INSTANTIATE_TEST_SUITE_P
; if not, keep as-is to avoid breaking older environments.Also applies to: 1414-1418, 1438-1443, 1445-1449
2-2
: Header year nit (guideline).The repository guideline asks for current year; the header shows 2023–2024. Consider updating to 2025 when touching the file, if the project enforces that check.
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (3)
71-77
: API change to per-target element sizes: add argument contract to guard misuseThe new vector parameter is good. Please enforce at the call boundary that:
- targetBufferEleSizes.size() >= static_cast<size_t>(targetNum)
- All entries > 0
- Document that sizes are in “elements” (not bytes) and must match ITensor dtype
Add precondition checks in the .cpp implementation and a brief comment here. If you prefer, change targetNum to size_t to avoid signed/unsigned mismatches:
- std::tuple<std::vector<runtime::ITensor::SharedPtr>, size_t, bool> getOrAllocateSendBuffers( - std::optional<int> bufferId, int targetNum, std::vector<size_t> const& targetBufferEleSizes, + std::tuple<std::vector<runtime::ITensor::SharedPtr>, size_t, bool> getOrAllocateSendBuffers( + std::optional<int> bufferId, size_t targetNum, std::vector<size_t> const& targetBufferEleSizes, runtime::BufferManager const& bufferManagerToUse);
94-97
: Keep helper signature consistent and clarify “element” semanticsMirror the public API strictly (size_t targetNum; const& vector) and add a short doc explaining “element sizes” vs bytes to prevent accidental mixing with getSizeInBytes().
- std::tuple<std::vector<runtime::ITensor::SharedPtr>, size_t, bool> getOrAllocateBuffers(std::optional<int> bufferId, - int targetNum, std::vector<size_t> const& targetBufferEleSizes, - runtime::BufferManager const& bufferManagerToUse, ConcurrenceResource& concurrenceResource); + // targetBufferEleSizes: per-target sizes measured in elements (not bytes). + std::tuple<std::vector<runtime::ITensor::SharedPtr>, size_t, bool> getOrAllocateBuffers( + std::optional<int> bufferId, size_t targetNum, std::vector<size_t> const& targetBufferEleSizes, + runtime::BufferManager const& bufferManagerToUse, ConcurrenceResource& concurrenceResource);
103-111
: Stale member name after API shift: mBufferEleSize is misleadingWith per-target sizes, a single mBufferEleSize is ambiguous. If it’s a default “small staging buffer element size,” name it accordingly and document when it’s used; otherwise, remove.
- size_t mBufferEleSize; + // Default element size used only when a single staging buffer is employed (bufferCoverTargetNum == 0). + size_t mDefaultStagingBufferEleSize;Confirm all usages in the .cpp align with the new meaning.
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
324-329
: Fix format specifiers and typo in debug logUse %zu for size_t and fix “dupliacete”.
- TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), - " formatOutput bufferTargetNum: %d, targetNum: %d, peerDuplicateHeadFactor: %d dupliacete:%d " - "bufferCoverTargetNum:%d connections.size():%ld", - bufferTargetNum, targetNum, peerDuplicateHeadFactor, targetInfo.mDupHeadFactor, bufferCoverTargetNum, - connections.size()); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), + "formatOutput bufferTargetNum: %d, targetNum: %zu, peerDuplicateHeadFactor: %d duplicate:%d " + "bufferCoverTargetNum:%zu connections.size():%zu", + bufferTargetNum, connections.size(), peerDuplicateHeadFactor, targetInfo.mDupHeadFactor, + bufferCoverTargetNum, connections.size());
706-716
: Use size_t format specifier in logsprocessIdx and sizes are size_t; fix %d/%ld to %zu to avoid UB on some platforms.
- TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " start recv bufferIdx: %d size:%ld", processIdx, + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " start recv bufferIdx: %zu size:%zu", processIdx, buffer->getSizeInBytes()); @@ - TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " recv bufferIdx: %d size:%ld", processIdx, + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " recv bufferIdx: %zu size:%zu", processIdx, buffer->getSizeInBytes());tensorrt_llm/_torch/distributed/communicator.py (1)
139-142
: Guard create_pp_comm for degenerate PP groupsIf pp_group is length 1 or misconfigured, ensure pp_comm is still valid and usable.
def create_pp_comm(self): - new_group = mpi_comm().group.Incl(self.mapping.pp_group) - self.pp_comm = mpi_comm().Create_group(new_group) + new_group = mpi_comm().group.Incl(self.mapping.pp_group) + self.pp_comm = mpi_comm().Create_group(new_group) + # Optional: verify communicator is valid; fall back to world if needed. + # assert self.pp_comm is not NoneConfirm mapping.pp_group is always populated under all launchers used by TRT-LLM. If not, I can add a safe fallback.
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1)
96-113
: Validate gathered PP layer counts and improve robustnessAdd sanity checks: vector length equals mapping.pp_size; sum equals total layer count; elements > 0.
pp_layer_num = len(kv_cache_manager.pp_layers) - pp_layer_num_per_pp_rank = dist.pp_allgather(pp_layer_num) + pp_layer_num_per_pp_rank = dist.pp_allgather(pp_layer_num) + if len(pp_layer_num_per_pp_rank) != mapping.pp_size: + raise RuntimeError(f"pp_allgather returned {len(pp_layer_num_per_pp_rank)} entries; expected {mapping.pp_size}") + if any(x <= 0 for x in pp_layer_num_per_pp_rank): + raise RuntimeError("attention_layer_num_per_pp must be > 0 per PP rank") + # Optional: validate sum equals model layers when available + try: + total_layers = len(kv_cache_manager.total_num_kv_heads_per_layer) + if sum(pp_layer_num_per_pp_rank) != total_layers: + logger.warning(f"Sum(attention_layer_num_per_pp)={sum(pp_layer_num_per_pp_rank)} != total_layers={total_layers}") + except Exception: + passcpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)
341-356
: Consider adding early exit and documenting buffer handling strategy. The cumulative buffer size calculation correctly stops when
mTransferBufferSize
is exceeded. However, there are two improvements to consider:
- Add an early exit when all buffers fit within the allocation
- Document the fallback behavior when some buffers exceed the pre-allocated size
Apply this diff to add early exit optimization:
size_t bufferCoverTargetNum = 0; size_t preBufferByteSize = 0; for (int i = 0; i < targetNum; i++) { preBufferByteSize += targetBufferEleSizes[i] * common::getDTypeSize(mDataType); if (preBufferByteSize > mTransferBufferSize) { break; } bufferCoverTargetNum++; + // Early exit if all buffers are covered + if (bufferCoverTargetNum == static_cast<size_t>(targetNum)) + { + break; + } }cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)
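As a standalone sketch of the coverage computation discussed above (not the PR's code; names and the budget parameter are assumptions), the idea is to count how many consecutive per-target buffers fit into a fixed pre-allocated transfer budget, with the remainder falling back to dynamic allocation:

#include <cstddef>
#include <vector>

size_t countCoveredTargets(std::vector<size_t> const& targetByteSizes, size_t transferBudgetBytes)
{
    size_t covered = 0;
    size_t accumulated = 0;
    for (size_t bytes : targetByteSizes)
    {
        accumulated += bytes;
        if (accumulated > transferBudgetBytes)
        {
            break; // remaining targets are not covered by the pre-allocated buffer
        }
        ++covered;
    }
    return covered;
}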
67-102
: Add validation for empty per-PP layer distributions.The function correctly validates that the per-PP layer vectors match the PP sizes and that the sum equals the self-PP's layers. However, consider adding validation that individual layer counts are non-zero to prevent division-by-zero or invalid indexing downstream.
Add validation after line 102:
int targetPeerPpLayerNumSum = std::accumulate(targetPeerPPLayerNum.begin(), targetPeerPPLayerNum.end(), 0); TLLM_CHECK(targetPeerPpLayerNumSum == selfNumLayerPerPP[selfPPRank]); + // Validate that all layer counts are positive + for (size_t i = 0; i < targetPeerPPLayerNum.size(); ++i) + { + TLLM_CHECK(targetPeerPPLayerNum[i] > 0); + }
520-543
: Device function correctly handles per-PP layer mapping, but could benefit from bounds checking. The new device helper function properly maps global layer IDs to domain-PP indices using the prefix array. The shared memory usage is appropriate for avoiding redundant computation.
Consider adding bounds checking for defensive programming:
__device__ __forceinline__ void getLayerIdInDomainPPandRankInDomainPP(int layerId, int DomainPPSize, uint64_t* prefixLayerNumDevPtr, int& layerIdInDomainPP, int& rankInDomainPP, int& layerNumInSpecPP) { __shared__ int sharedLayerIdInDomainPP; __shared__ int sharedRankInDomainPP; __shared__ int sharedLayerNumInSpecPP; + __shared__ bool sharedFoundValid; + if (threadIdx.x == 0) + { + sharedFoundValid = false; + } + __syncthreads(); + #pragma unroll 1 for (int ppRank = threadIdx.x; ppRank < DomainPPSize; ppRank += blockDim.x) { if (layerId >= prefixLayerNumDevPtr[ppRank] && layerId < prefixLayerNumDevPtr[ppRank + 1]) { sharedLayerIdInDomainPP = layerId - prefixLayerNumDevPtr[ppRank]; sharedRankInDomainPP = ppRank; sharedLayerNumInSpecPP = prefixLayerNumDevPtr[ppRank + 1] - prefixLayerNumDevPtr[ppRank]; + sharedFoundValid = true; break; } } __syncthreads(); + // In debug builds, could add assert(sharedFoundValid) here layerIdInDomainPP = sharedLayerIdInDomainPP; rankInDomainPP = sharedRankInDomainPP; layerNumInSpecPP = sharedLayerNumInSpecPP; }
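A host-side analogue may make the mapping easier to follow; the sketch below is illustrative only (the struct and function names are not from the PR) and assumes the prefix array has one more entry than there are domain-PP ranks:

#include <cstdint>
#include <vector>

struct LayerLocation
{
    int rankInDomainPP;
    int layerIdInDomainPP;
    int layerNumInSpecPP;
};

LayerLocation locateLayer(int layerId, std::vector<uint64_t> const& prefixLayerNum)
{
    // prefixLayerNum = {0, n0, n0+n1, ...}; rank i owns [prefix[i], prefix[i+1]).
    for (int rank = 0; rank + 1 < static_cast<int>(prefixLayerNum.size()); ++rank)
    {
        if (layerId >= static_cast<int>(prefixLayerNum[rank]) && layerId < static_cast<int>(prefixLayerNum[rank + 1]))
        {
            return {rank, layerId - static_cast<int>(prefixLayerNum[rank]),
                static_cast<int>(prefixLayerNum[rank + 1] - prefixLayerNum[rank])};
        }
    }
    return {-1, -1, -1}; // layerId out of range
}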
1020-1060
: Consider using size_t consistently for GPU pointer storage.The code uses
uint64_t
for storing GPU addresses which is correct, but mixinguint64_t
withreinterpret_cast<uint64_t>
for pointer conversion could be cleaner with a consistent type alias.Consider adding a type alias for clarity:
+ using GpuPtr = uint64_t; - std::vector<uint64_t> cachePtrs; + std::vector<GpuPtr> cachePtrs; // ... later in the code ... - cachePtrs.push_back(reinterpret_cast<uint64_t>((kvCacheBlock->data()))); + cachePtrs.push_back(reinterpret_cast<GpuPtr>(kvCacheBlock->data()));
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (23)
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
(1 hunks)cpp/include/tensorrt_llm/executor/dataTransceiverState.h
(2 hunks)cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
(8 hunks)cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
(4 hunks)cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
(2 hunks)cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
(2 hunks)cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
(5 hunks)cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
(39 hunks)cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
(1 hunks)cpp/tensorrt_llm/executor/serialization.cpp
(3 hunks)cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
(1 hunks)cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
(1 hunks)cpp/tests/batch_manager/cacheTransceiverTest.cpp
(24 hunks)cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
(4 hunks)cpp/tests/unit_tests/executor/agentCommTest.cpp
(2 hunks)cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
(1 hunks)tensorrt_llm/_torch/distributed/communicator.py
(3 hunks)tensorrt_llm/_torch/pyexecutor/_util.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
(4 hunks)tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml
(1 hunks)tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml
(1 hunks)tests/integration/defs/disaggregated/test_disaggregated.py
(4 hunks)tests/integration/test_lists/test-db/l0_dgx_h100.yml
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (15)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
- cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- tensorrt_llm/_torch/pyexecutor/_util.py
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
- cpp/tests/unit_tests/executor/agentCommTest.cpp
- tests/integration/defs/disaggregated/test_disaggregated.py
- cpp/tensorrt_llm/executor/serialization.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/include/tensorrt_llm/executor/dataTransceiverState.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/distributed/communicator.py
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/distributed/communicator.py
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
🧠 Learnings (5)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
📚 Learning: 2025-08-21T09:41:49.327Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.327Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
🧬 Code graph analysis (1)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (4)
numLayers (32-69), numLayers (32-33), numLayers (77-88), numLayers (77-77)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)
numLayers (290-392), numLayers (609-738), numLayers (609-611)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (5)
144-154
: Equality test update looks correct; keep explicit per-PP vector.Good:
attentionLayerNumPerPP
is sized topipelineParallelism
(2) with even split for 10 layers. Both constructors converge on the same state. No issues.
323-326
: LGTM: Symmetrical test passes per-PP vector consistent with pp=1.
{numLayers}
forpp=1
is correct and future-proof with the new API.
728-734
: LGTM: CacheState wiring now passes per-PP vectors for both context and generation.This aligns with the new constructor semantics and uneven PP layouts.
894-898
: DP variant CacheState copy: correct propagation of per-PP vector.You copy
mContextCacheState->getParallelConfig().mAttentionLayerNumPerPP
into the custom state — that keeps generation-side routing consistent.
1529-1531
: LGTM: Non-MD tests correctly construct per-PP vectors and propagate them.The revised
CacheStateNODP
andCacheStateContextDP
tests consistently passattentionLayerNumPerPP
matching their PP sizes and reinitialize vectors after PP changes, avoiding subtle stale-state bugs.Also applies to: 1539-1544, 1584-1584, 1611-1612, 1621-1624, 1625-1628, 1728-1735, 1754-1755, 1769-1770
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
254-258
: Zero-copy gating now requires exact PP/TP equality—verify intended trade-offChanging “<=” to “==” (for both PP and TP) can disable zcopy for otherwise compatible layouts (e.g., equal local shapes with different global layouts), potentially hurting performance.
Would you like me to enumerate shape equivalence conditions that still allow zcopy under uneven PP and gate on those instead?
804-806
: LGTM: V2 concat dispatch aligns with per-target/uneven PP pathUsing concatKvCacheV2Dispatch is the right call given the new split/recv model.
tensorrt_llm/_torch/distributed/communicator.py (1)
162-229
: TorchDist lacks PP collectives—ensure callers never receive TorchDist or add PPComm wiringcreate_kv_cache_transceiver now depends on dist.pp_allgather. TorchDist doesn’t implement PP ops, so passing a TorchDist instance will break.
Would you like me to add PP collectives to TorchDist using a new torch.distributed group (mapping.pp_group), mirroring TP/CP? I can provide a patch.
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)
92-98
: Binding signature update looks correct and consistent with Python callerNew parameter attention_layer_num_per_pp is placed between world_config and dtype and named accordingly. No issues.
Please ensure the corresponding C++ CacheTransceiver constructor has matching parameter order and that serialization includes mAttentionLayerNumPerPP.
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1)
31-35
: No internal call sites missingdist
A repo-wide search forcreate_kv_cache_transceiver(
found the only usage in
tensorrt_llm/_torch/pyexecutor/_util.py
(lines 554–556), which already passes thedist
argument.No further updates are required—this API change does not break any existing call sites.
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)
370-384
: Please verify thatpreBufferEleSize
accumulation cannot exceed the allocated buffer sizeI wasn’t able to find any existing tests covering the cumulative slicing logic in
cacheTransBuffer.cpp
. If the calculation ofbufferCoverTargetNum
doesn’t strictly enforce that the sum of the first N slice sizes fits within the actual buffer length, the call toITensor::slice(concurrenceResource.mBuffers[bufferId.value()], preBufferEleSize, targetBufferEleSizes[i])could read out‐of‐bounds.
Please double-check that:
bufferCoverTargetNum
is derived so that
sum(targetBufferEleSizes[0..bufferCoverTargetNum-1])
is always ≤ the size ofconcurrenceResource.mBuffers[bufferId]
.- If this invariant isn’t enforced earlier (e.g., in the buffer allocation routines), consider adding a guard or assertion.
- Ideally, add unit tests for boundary conditions where the cumulative slice size exactly matches or would exceed the buffer capacity.
File:
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
(around lines 370–384)cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
1053-1059
: Good implementation of prefix sum for per-PP layer distribution. The cumulative prefix array correctly enables efficient layer-to-PP mapping in the kernels.
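For readers new to this pattern, a short sketch of how such a prefix array can be built from per-PP layer counts (assumed shape; not the PR's exact code):

#include <cstdint>
#include <vector>

std::vector<uint64_t> buildLayerPrefix(std::vector<int> const& layersPerPP)
{
    // prefix[i] = number of layers owned by ranks 0..i-1, so prefix has size + 1 entries.
    std::vector<uint64_t> prefix(layersPerPP.size() + 1, 0);
    for (size_t i = 0; i < layersPerPP.size(); ++i)
    {
        prefix[i + 1] = prefix[i] + static_cast<uint64_t>(layersPerPP[i]);
    }
    return prefix;
}

// {6, 5, 5} -> {0, 6, 11, 16}; a layer id belongs to the rank whose half-open
// interval [prefix[i], prefix[i+1]) contains it.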
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (outdated review comment; resolved)
/bot run --add-multi-gpu-test |
PR_Github #16166 [ run ] triggered by Bot |
PR_Github #16166 [ run ] completed with state |
Force-pushed from eff60ba to 399252d (Compare)
/bot run --add-multi-gpu-test |
PR_Github #16368 [ run ] triggered by Bot |
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (1)
187-193
: Default-initialize mOffsetRatio to a safe divisor. mOffsetRatio defaults to {0,0} via value-initialization. send() divides by mOffsetRatio.second; if setSenderState wasn't called, this becomes a division-by-zero. Initialize to {0,1} for a safe default and document the invariant.
struct SenderState { MemoryDesc mReceiverBufferDesc{nullptr, 0, 0}; int validSegmentIdx{0}; - std::pair<size_t, size_t> mOffsetRatio; + // {offsetLayer, selfSendLayer}; selfSendLayer must be > 0 + std::pair<size_t, size_t> mOffsetRatio{0, 1}; SenderState() = default; };cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (1)
95-116
: send(): guard invariants and prevent OOB writes. send() computes offset = size / selfSendLayer * offsetLayer with no checks. Add:
- selfSendLayer > 0
- size % selfSendLayer == 0 (segment alignment)
- offset + size <= dstBaseDesc.getLen() (bounds)
Also prefer explicit names for clarity.
auto dstBaseDesc = mSenderState.mReceiverBufferDesc; - auto offset = size / mSenderState.mOffsetRatio.second * mSenderState.mOffsetRatio.first; - MemoryDesc dstDesc{dstBaseDesc.getAddr() + offset, size, dstBaseDesc.getDeviceId()}; + auto const offsetLayer = mSenderState.mOffsetRatio.first; + auto const selfSendLayer = mSenderState.mOffsetRatio.second; + TLLM_CHECK_WITH_INFO(selfSendLayer > 0, "selfSendLayer must be > 0 before send()"); + TLLM_CHECK_WITH_INFO(size % selfSendLayer == 0, "send size must be divisible by selfSendLayer"); + auto const segmentSize = size / selfSendLayer; + auto const offsetBytes = segmentSize * offsetLayer; + TLLM_CHECK_WITH_INFO(offsetBytes + size <= dstBaseDesc.getLen(), "send would exceed destination buffer"); + MemoryDesc dstDesc{dstBaseDesc.getAddr() + offsetBytes, size, dstBaseDesc.getDeviceId()};Minor: the debug log uses %ld for size_t; prefer %zu.
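A worked sketch of the guarded offset computation the comment asks for, under the assumed semantics that `size` bytes cover `selfSendLayer` layers and the data starts `offsetLayer` layers into the destination region (names are illustrative):

#include <cassert>
#include <cstddef>

size_t guardedOffset(size_t size, size_t selfSendLayer, size_t offsetLayer, size_t dstLen)
{
    assert(selfSendLayer > 0);         // avoid division by zero
    assert(size % selfSendLayer == 0); // segment alignment
    size_t const perLayerBytes = size / selfSendLayer;
    size_t const offset = perLayerBytes * offsetLayer;
    assert(offset + size <= dstLen);   // stay inside the destination buffer
    return offset;
}

// e.g. sending 3 MiB covering 3 layers, starting at destination layer 2:
// per-layer size 1 MiB, offset 2 MiB.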
cpp/include/tensorrt_llm/executor/dataTransceiverState.h (1)
76-87
: Mirror invariant checks in the uniform-layer-count constructor. Same rationale; ensure per-PP vector matches PP and sums to nbAttentionLayers.
Apply:
{ + auto const& perPP = mParallelConfig.mAttentionLayerNumPerPP; + auto const pp = mParallelConfig.mPipelineParallelism; + auto const numLayers = static_cast<SizeType32>(mModelConfig.mNbKvHeadsPerLayer.size()); + TLLM_CHECK_WITH_INFO(perPP.size() == static_cast<size_t>(pp), + "attentionLayerNumPerPP size (%zu) must equal pipelineParallelism (%d).", perPP.size(), pp); + SizeType32 sum = 0; + for (auto v : perPP) + { + TLLM_CHECK_WITH_INFO(v > 0, "attentionLayerNumPerPP entries must be > 0."); + sum += v; + } + TLLM_CHECK_WITH_INFO(sum == numLayers, + "Sum of attentionLayerNumPerPP (%d) must equal number of layers (%d).", sum, numLayers); }
♻️ Duplicate comments (10)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6)
103-109
: Remove divisibility checks that block uneven PP; validate attentionLayerNumPerPP instead. These modulus checks forbid uneven PP, which this PR aims to support. Replace them with validation against per-PP layer distributions from both self/dest configs: size equals PP size, entries > 0, and sums match total layer count. This also future-proofs checkAlternateWindow for heterogeneous PP layouts.
Apply:
- auto selfPPNum = selfConfig.getParallelConfig().mPipelineParallelism; - auto selfAllLayerNum = selfConfig.getModelConfig().mNbKvHeadsPerLayer.size(); - auto destPPNum = destConfig.getParallelConfig().mPipelineParallelism; - auto destAllLayerNum = destConfig.getModelConfig().mNbKvHeadsPerLayer.size(); - TLLM_CHECK_WITH_INFO(selfAllLayerNum % selfPPNum == 0, " For VWSA selfAllLayerNum must be divisible by selfPPNum"); - TLLM_CHECK_WITH_INFO(destAllLayerNum % destPPNum == 0, "For VWSA destAllLayerNum must be divisible by destPPNum"); + auto const selfPPNum = selfConfig.getParallelConfig().mPipelineParallelism; + auto const destPPNum = destConfig.getParallelConfig().mPipelineParallelism; + auto const selfAllLayerNum = static_cast<SizeType32>(selfConfig.getModelConfig().mNbKvHeadsPerLayer.size()); + auto const destAllLayerNum = static_cast<SizeType32>(destConfig.getModelConfig().mNbKvHeadsPerLayer.size()); + + auto const& selfPerPP = selfConfig.getParallelConfig().mAttentionLayerNumPerPP; + auto const& destPerPP = destConfig.getParallelConfig().mAttentionLayerNumPerPP; + TLLM_CHECK_WITH_INFO(selfPerPP.size() == static_cast<size_t>(selfPPNum), + "self attentionLayerNumPerPP size (%zu) must equal selfPPNum (%d).", selfPerPP.size(), selfPPNum); + TLLM_CHECK_WITH_INFO(destPerPP.size() == static_cast<size_t>(destPPNum), + "dest attentionLayerNumPerPP size (%zu) must equal destPPNum (%d).", destPerPP.size(), destPPNum); + auto sumVec = [](auto const& v){ SizeType32 s=0; for (auto x: v){ TLLM_CHECK_WITH_INFO(x>0,"per-PP layer count must be >0"); s+=x;} return s; }; + TLLM_CHECK_WITH_INFO(sumVec(selfPerPP) == selfAllLayerNum, "Sum(self attentionLayerNumPerPP) must equal selfAllLayerNum."); + TLLM_CHECK_WITH_INFO(sumVec(destPerPP) == destAllLayerNum, "Sum(dest attentionLayerNumPerPP) must equal destAllLayerNum.");
286-289
: Guard mAttentionLayerNumPerPP indexing and validate per-PP entryIndexing by selfIdx/tp can go OOB and zero layer count is invalid. Add bounds and >0 checks.
Apply:
- int selfAttentionLayerNum - = selfConfig.getParallelConfig() - .mAttentionLayerNumPerPP[selfIdx / selfConfig.getParallelConfig().mTensorParallelism]; + auto const& perPP = selfConfig.getParallelConfig().mAttentionLayerNumPerPP; + auto const tp = selfConfig.getParallelConfig().mTensorParallelism; + auto const ppIdx = selfIdx / tp; + TLLM_CHECK_WITH_INFO(ppIdx < perPP.size(), "ppIdx OOB: %d vs %zu", ppIdx, perPP.size()); + int selfAttentionLayerNum = perPP[ppIdx]; + TLLM_CHECK_WITH_INFO(selfAttentionLayerNum > 0, "selfAttentionLayerNum must be > 0");
290-317
: Per-target sizing: add divisibility checks, fix naming, and return consistent vectors
- Add divisibility checks before integer division.
- Rename LayerNumbufferTargetNum → layerNumPerTarget for clarity.
- Ensure returned vectors have lengths consistent with targetNum/bufferTargetNum as used later.
Apply:
- auto getBufferSizeForTarget = [&]() + auto getBufferSizeForTarget = [&]() { - std::vector<size_t> bufferSizeForTarget(targetNum, 0); - std::vector<SizeType32> LayerNumbufferTargetNum(bufferTargetNum, 0); + std::vector<size_t> bufferSizeForTarget(targetNum, 0); + std::vector<SizeType32> layerNumPerTarget(bufferTargetNum, 0); // // only first bufferTargetNum is used. if (inputKvCacheBlocks.size() > 1) { // for VWSA for (size_t i = 0; i < targetNum; i++) { bufferSizeForTarget[i] = allCacheBlockSize * peerDuplicateHeadFactor / targetNum; } - return std::make_pair(bufferSizeForTarget, LayerNumbufferTargetNum); + return std::make_pair(bufferSizeForTarget, layerNumPerTarget); } + // Validate divisibility before integer divisions below + TLLM_CHECK_WITH_INFO( + (allCacheBlockSize * static_cast<size_t>(peerDuplicateHeadFactor)) % static_cast<size_t>(targetInfo.mDomainTPSize) == 0, + "allCacheBlockSize*peerDuplicateHeadFactor must be divisible by domain TP size"); + TLLM_CHECK_WITH_INFO( + ((allCacheBlockSize * static_cast<size_t>(peerDuplicateHeadFactor)) / static_cast<size_t>(targetInfo.mDomainTPSize)) % static_cast<size_t>(selfAttentionLayerNum) == 0, + "Base element size must be divisible by selfAttentionLayerNum"); for (size_t i = 0; i < targetNum; i++) { bufferSizeForTarget[i] = allCacheBlockSize * peerDuplicateHeadFactor / targetInfo.mDomainTPSize / selfAttentionLayerNum * targetInfo.getPeerPPDomainLayerNum(i); } for (size_t i = 0; i < bufferTargetNum; i++) { - LayerNumbufferTargetNum[i] = targetInfo.getPeerPPDomainLayerNum(i); + layerNumPerTarget[i] = targetInfo.getPeerPPDomainLayerNum(i); } - return std::make_pair(bufferSizeForTarget, LayerNumbufferTargetNum); + return std::make_pair(bufferSizeForTarget, layerNumPerTarget); }; - auto [bufferEleSizes, LayerNumbufferTargetNum] = getBufferSizeForTarget(); + auto [bufferEleSizes, layerNumPerTarget] = getBufferSizeForTarget();
348-391
: Partial-send path: use bufferIdx consistently; fix size_t printfremainSendSize/needSendSize currently read from outputSplitCaches[processIdx] but slices use [bufferIdx]; this mismatch can cause OOB and incorrect transfers. Also fix %ld/%d for size_t.
Apply:
- TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %ld", processIdx); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %zu", processIdx); @@ - size_t size = outputSplitCaches[bufferIdx]->getSizeInBytes(); + size_t size = outputSplitCaches[bufferIdx]->getSizeInBytes(); @@ - TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %d bufferIdx: %d size:%ld", - processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " send processIdx: %zu bufferIdx: %zu size:%zu", + processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); @@ - session.send( - processIdx, outputSplitCaches[bufferIdx]->data(), outputSplitCaches[bufferIdx]->getSizeInBytes()); + session.send(processIdx, outputSplitCaches[bufferIdx]->data(), + outputSplitCaches[bufferIdx]->getSizeInBytes()); @@ - TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " end send processIdx: %d bufferIdx: %d size:%ld", - processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); + TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " end send processIdx: %zu bufferIdx: %zu size:%zu", + processIdx, bufferIdx, outputSplitCaches[bufferIdx]->getSizeInBytes()); } else { @@ - size_t remainSendSize = outputSplitCaches[processIdx]->getSize(); - size_t needSendSize = outputSplitCaches[processIdx]->getSize(); + size_t remainSendSize = outputSplitCaches[bufferIdx]->getSize(); + size_t needSendSize = outputSplitCaches[bufferIdx]->getSize(); auto sendBufferIdx = bufferCoverTargetNum == 0 ? 0 : bufferIdx % bufferCoverTargetNum; @@ - auto copySlice = runtime::ITensor::slice( - outputSplitCaches[bufferIdx], needSendSize - remainSendSize, sendSize); + auto copySlice = runtime::ITensor::slice( + outputSplitCaches[bufferIdx], needSendSize - remainSendSize, sendSize);Also replicate the same consistency fix, if present, in the analogous branch later in this function.
610-613
: Remove stale equal-split assert; uneven PP intentionally violates cacheBlockSizeSum % targetNum == 0. Uneven per-target sizing makes this global divisibility check a false positive. Allocation logic already enforces per-target divisibility; drop this check.
Apply:
- TLLM_CHECK(cacheBlockSizeSum % targetNum == 0);
615-649
: Recv-side sizing: fix spelling, add guards to mirror send path, and validate vector sizes
- Rename getTargetBufferEleSzie → getTargetBufferEleSize and valideTpSize → validTpSize.
- Add divisibility checks and ensure returned vectors are sized to targetNum with positive per-target layer counts.
Apply:
- int selfAttentionLayerNum - = selfConfig.getParallelConfig() - .mAttentionLayerNumPerPP[selfIdx / selfConfig.getParallelConfig().mTensorParallelism]; - auto getTargetBufferEleSzie = [&]() + int selfAttentionLayerNum + = selfConfig.getParallelConfig() + .mAttentionLayerNumPerPP[selfIdx / selfConfig.getParallelConfig().mTensorParallelism]; + auto getTargetBufferEleSize = [&]() { if (outputBuffersPerWindow.size() > 1) { std::vector<size_t> bufferSizeForTarget(targetNum, 0); for (size_t i = 0; i < targetNum; i++) { bufferSizeForTarget[i] = cacheBlockSizeSum / targetNum; } // TODO: LayerNumbufferTargetNum for VWSA return std::make_pair(bufferSizeForTarget, std::vector<SizeType32>(targetNum, 0)); } - size_t valideTpSize = pickUpConnections.size() / targetInfo.mDomainPPSize; - TLLM_CHECK_WITH_INFO(cacheBlockSizeSum % valideTpSize == 0, - "cacheBlockSizeSum must be divisible by valideTpSize %ld", valideTpSize); - TLLM_CHECK_WITH_INFO((cacheBlockSizeSum % (selfAttentionLayerNum * valideTpSize)) == 0, - "cacheBlockSizeSum must be divisible by valideTpSize %ld * selfAttentionLayerNum %d", valideTpSize, - selfAttentionLayerNum); + TLLM_CHECK_WITH_INFO(pickUpConnections.size() % static_cast<size_t>(targetInfo.mDomainPPSize) == 0, + "pickUpConnections.size() must be divisible by domain PP size"); + size_t validTpSize = pickUpConnections.size() / targetInfo.mDomainPPSize; + TLLM_CHECK_WITH_INFO(cacheBlockSizeSum % validTpSize == 0, + "cacheBlockSizeSum must be divisible by validTpSize %zu", validTpSize); + TLLM_CHECK_WITH_INFO((cacheBlockSizeSum % (selfAttentionLayerNum * validTpSize)) == 0, + "cacheBlockSizeSum must be divisible by validTpSize %zu * selfAttentionLayerNum %d", validTpSize, + selfAttentionLayerNum); TLLM_CHECK(targetNum == pickUpConnections.size()); - size_t baseEleSize = cacheBlockSizeSum / (valideTpSize * selfAttentionLayerNum); + size_t baseEleSize = cacheBlockSizeSum / (validTpSize * selfAttentionLayerNum); std::vector<size_t> bufferEleSizes(targetNum, 0); - std::vector<SizeType32> LayerNumbufferTargetNum(targetNum, 0); + std::vector<SizeType32> LayerNumbufferTargetNum(targetNum, 0); for (size_t i = 0; i < targetNum; i++) { LayerNumbufferTargetNum[i] = targetInfo.getPeerPPDomainLayerNum(static_cast<SizeType32>(pickUpConnections[i])); bufferEleSizes[i] = baseEleSize * LayerNumbufferTargetNum[i]; + TLLM_CHECK_WITH_INFO(LayerNumbufferTargetNum[i] > 0, "per-target PP layer count must be > 0"); } return std::make_pair(bufferEleSizes, LayerNumbufferTargetNum); }; - auto [bufferEleSizes, LayerNumbufferTargetNum] = getTargetBufferEleSzie(); + auto [bufferEleSizes, LayerNumbufferTargetNum] = getTargetBufferEleSize();cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)
99-106
: Add invariant: Sum(attentionLayerNumPerPP) must equal total attention layersThis prevents downstream OOB and mis-sizing when PP layout and model config diverge.
Apply:
std::vector<SizeType32> attentionLayerNumPerPP(ppSize, 0); for (int pp_rank = 0; pp_rank < ppSize; pp_rank++) { attentionLayerNumPerPP[pp_rank] = modelConfig.getNbAttentionLayers(ppSize, pp_rank); } + auto total = std::accumulate(attentionLayerNumPerPP.begin(), attentionLayerNumPerPP.end(), SizeType32{0}); + TLLM_CHECK_WITH_INFO( + total == modelConfig.getNbAttentionLayers(), + "Sum(attentionLayerNumPerPP) (%d) must equal total attention layers (%d).", + total, modelConfig.getNbAttentionLayers());cpp/include/tensorrt_llm/executor/dataTransceiverState.h (3)
51-58
: Fix DP rank/size initialization: using TP fields is incorrectParallelConfig’s DP fields are currently filled from TensorParallel fields, which misroutes DP-aware logic. Use the Context/Data parallel accessors from WorldConfig (the third axis).
Apply:
- , mParallelConfig{worldConfig.getTensorParallelism(), worldConfig.getPipelineParallelism(), - worldConfig.getContextParallelism(), worldConfig.enableAttentionDP(), worldConfig.getTensorParallelRank(), - worldConfig.getTensorParallelism(), attentionLayerNumPerPP} + , mParallelConfig{worldConfig.getTensorParallelism(), worldConfig.getPipelineParallelism(), + worldConfig.getContextParallelism(), worldConfig.enableAttentionDP(), + /* DPrank */ worldConfig.getContextParallelRank(), + /* DPsize */ worldConfig.getContextParallelism(), + attentionLayerNumPerPP}If the API provides getDataParallelRank()/getDataParallelism(), use those instead.
51-61
: Enforce invariants on attentionLayerNumPerPP to avoid OOB and mis-sizingValidate once at construction: length == PP, entries > 0, and sum equals number of attention layers. Without this, downstream indexing like perPP[ppIdx] risks OOB.
Apply:
CacheState(ModelConfig modelConfig, runtime::WorldConfig const& worldConfig, std::vector<SizeType32> const& attentionLayerNumPerPP, nvinfer1::DataType dataType, AttentionType attentionType = AttentionType::kDEFAULT, int kvFactor = 2) : mModelConfig(std::move(modelConfig)) , mParallelConfig{worldConfig.getTensorParallelism(), worldConfig.getPipelineParallelism(), worldConfig.getContextParallelism(), worldConfig.enableAttentionDP(), /* DPrank */ worldConfig.getContextParallelRank(), /* DPsize */ worldConfig.getContextParallelism(), attentionLayerNumPerPP} , mDataType{dataType} , mAttentionConfig(attentionType, kvFactor) { + auto const& perPP = mParallelConfig.mAttentionLayerNumPerPP; + auto const pp = mParallelConfig.mPipelineParallelism; + auto const numLayers = static_cast<SizeType32>(mModelConfig.mNbKvHeadsPerLayer.size()); + TLLM_CHECK_WITH_INFO(perPP.size() == static_cast<size_t>(pp), + "attentionLayerNumPerPP size (%zu) must equal pipelineParallelism (%d).", perPP.size(), pp); + SizeType32 sum = 0; + for (auto v : perPP) + { + TLLM_CHECK_WITH_INFO(v > 0, "attentionLayerNumPerPP entries must be > 0."); + sum += v; + } + TLLM_CHECK_WITH_INFO(sum == numLayers, + "Sum of attentionLayerNumPerPP (%d) must equal number of layers (%d).", sum, numLayers); }Repeat for the other two constructors below.
63-74
: Mirror invariant checks in the size/vector-based constructorAdd the same per-PP validation block here after member initialization.
Apply:
{ + auto const& perPP = mParallelConfig.mAttentionLayerNumPerPP; + auto const pp = mParallelConfig.mPipelineParallelism; + auto const numLayers = static_cast<SizeType32>(mModelConfig.mNbKvHeadsPerLayer.size()); + TLLM_CHECK_WITH_INFO(perPP.size() == static_cast<size_t>(pp), + "attentionLayerNumPerPP size (%zu) must equal pipelineParallelism (%d).", perPP.size(), pp); + SizeType32 sum = 0; + for (auto v : perPP) + { + TLLM_CHECK_WITH_INFO(v > 0, "attentionLayerNumPerPP entries must be > 0."); + sum += v; + } + TLLM_CHECK_WITH_INFO(sum == numLayers, + "Sum of attentionLayerNumPerPP (%d) must equal number of layers (%d).", sum, numLayers); }
🧹 Nitpick comments (15)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
144-147
: Configuration Path and Registration Verified
- The YAML file exists at
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml
.
- The test
test_disaggregated_deepseek_v3_lite_fp8_ctxpp2_gentp2_one_mtp
is registered inl0_dgx_h100.yml
.
Optional refactor recommendation:
- For naming consistency, you may consider renaming the YAML filename prefix from
ctxtp1_gentp1
toctxpp2_gentp2
so it aligns with the config key.cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (3)
349-352
: Vectorized target sizes: add mismatch/zero-size guardrails in tests
Good adaptation to the new vector-based API. Please add a negative test where:
- targetSizeVec.size() != targetNum, and
- at least one element is 0,
to assert expected behavior (exception or graceful fallback). This will protect against subtle regressions in callers constructing the vector dynamically.
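A minimal sketch of what such a negative case might look like (the fixture members and the vector-based getOrAllocateSendBuffers signature are assumed from this test file rather than verified, and the expected failure mode should follow the allocator's actual contract):

```cpp
// Illustrative only: fixture/member names and the buffer-id helper are assumptions.
TEST_F(CacheTransBufferTest, RejectsMismatchedOrZeroTargetSizes)
{
    auto bufferId = mTransBufferManager->assignBufferIndexForSend();
    int const targetNum = 4;
    // Deliberately inconsistent: three entries for four targets, one of them zero.
    std::vector<size_t> targetSizeVec{1024, 0, 1024};
    EXPECT_ANY_THROW(mTransBufferManager->getOrAllocateSendBuffers(
        bufferId, targetNum, targetSizeVec, *mBufferManager));
}
```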
412-418
: Partial-cover semantics: verify mixed per-target sizes
You're exercising bufferCoverTargetNum == targetNum/2 for equal per-target sizes. Consider adding one case with heterogeneous sizes in targetSizeVec (e.g., {big, small, big, small}) to confirm partial covering logic stays correct when per-target sizes vary.
382-382
: Fix printf format for size_t
defaultTransSize is a size_t but the format specifier is %d. Use %zu (or cast to unsigned long and use %lu) to avoid UB on some platforms.
```diff
-    TLLM_LOG_INFO("defaultTransSize: %d", defaultTransSize);
+    TLLM_LOG_INFO("defaultTransSize: %zu", static_cast<size_t>(defaultTransSize));
```
178-178
: Parameter name typo and inconsistency with definition
The declaration uses valideSegmentIdx while the definition uses validSegmentIdx (connection.cpp Line 159). Our guidelines require parameter name consistency between declarations and definitions. Please fix the typo.
```diff
-    void setSenderState(MemoryDesc mReceiverBufferDesc, int valideSegmentIdx, std::pair<size_t, size_t> offsetRatio);
+    void setSenderState(MemoryDesc mReceiverBufferDesc, int validSegmentIdx, std::pair<size_t, size_t> offsetRatio);
```
207-209
: Refactor AgentConnectionManager constructor to avoid extra CacheState copy
Passing
CacheState
by value causes an unnecessary copy at each call site (especially when the caller has an lvalue). Update both the header and implementation to take it byconst&
and perform the copy in the initializer list.Affected locations:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
(lines 207–209)cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
(lines 177–181)Suggested diff:
--- a/cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h +++ b/cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h @@ -207,7 +207,7 @@ public: AgentConnectionManager( - batch_manager::kv_cache_manager::CacheTransBufferManager* cacheTransBufferManager, CacheState cacheState); + batch_manager::kv_cache_manager::CacheTransBufferManager* cacheTransBufferManager, CacheState const& cacheState); ~AgentConnectionManager();--- a/cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp +++ b/cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp @@ -177,7 +177,7 @@ AgentConnectionManager::AgentConnectionManager( batch_manager::kv_cache_manager::CacheTransBufferManager* cacheTransBufferManager, CacheState cacheState) : mRegMemDescs(MemoryType::kVRAM, {}) + , mCacheState(cacheState) { // ... } - AgentConnectionManager::AgentConnectionManager( - batch_manager::kv_cache_manager::CacheTransBufferManager* cacheTransBufferManager, CacheState cacheState) + AgentConnectionManager::AgentConnectionManager( + batch_manager::kv_cache_manager::CacheTransBufferManager* cacheTransBufferManager, CacheState const& cacheState) : mRegMemDescs(MemoryType::kVRAM, {}) , mCacheState(cacheState) { // ... }cpp/tests/unit_tests/executor/agentCommTest.cpp (3)
93-95
: Constructing CacheState with per-PP layers: assert PP-size alignment in test
Good: you now pass attentionLayerNumPerPP. Add an assertion that the vector size matches the pipeline parallelism used in world_config (here 1) to catch misconfigurations early.
Example:
```diff
 mCacheState = std::make_unique<CacheState>(
-    numLayers, numHeads, sizePerHead, tokensPerBlock, 1, 1, 1, std::vector<SizeType32>{numLayers}, dataType);
+    numLayers, numHeads, sizePerHead, tokensPerBlock, 1, 1, 1, std::vector<SizeType32>{numLayers}, dataType);
+ASSERT_EQ(mCacheState->getParallelConfig().mAttentionLayerNumPerPP.size(), 1);
```
111-112
: Passing CacheState into AgentConnectionManager: exercise non-zero offset cases
The current test uses validConnectionIdx = 0, which doesn't exercise the new offset ratio logic. Consider adding a sub-test with attentionLayerNumPerPP like {3,5} and validConnectionIdx=1 to validate non-zero offsets in send() (e.g., write a pattern and verify the destination address shift).
I can draft this extension if helpful.
124-125
: Duplicated construction OK, but consider a helper
Both connectionManager0 and connectionManager1 receive identical CacheState copies. For test brevity, a small helper factory could reduce duplication. Optional.
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2)
177-181
: Ctor takes CacheState by value in source as well—align with header change
If you adopt the const& in the header, adjust the source accordingly (and drop std::move). See the header comment for the diff.
359-367
: Nit: spelling in parameter name
connecitonInfo is misspelled (should be connectionInfo). Not critical, but worth fixing for readability.
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
324-329
: Use correct format specifiers for size_t in logs
connections.size() is size_t; use %zu or cast explicitly to avoid UB on LLP64/LP64 differences.
Apply:
- "bufferCoverTargetNum:%d connections.size():%ld", - bufferTargetNum, targetNum, peerDuplicateHeadFactor, targetInfo.mDupHeadFactor, bufferCoverTargetNum, - connections.size()); + "bufferCoverTargetNum:%d connections.size():%zu", + bufferTargetNum, targetNum, peerDuplicateHeadFactor, targetInfo.mDupHeadFactor, bufferCoverTargetNum, + static_cast<size_t>(connections.size()));
706-716
: Use %zu for size_t in recv logs; minor correctness/logging polish
Format specifiers should match size_t types. Also harmless but makes logs portable across platforms.
Apply:
```diff
-    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " start recv bufferIdx: %d size:%ld", processIdx,
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " start recv bufferIdx: %zu size:%zu", processIdx,
         buffer->getSizeInBytes());
@@
-    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " recv bufferIdx: %d size:%ld", processIdx,
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " recv bufferIdx: %zu size:%zu", processIdx,
         buffer->getSizeInBytes());
```
Also applies to: 719-741
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)
1-16
: Update copyright year to 2025
Project guidelines request current-year copyright headers on sources/headers.
Apply:
```diff
-/*
- * Copyright (c) 2023-2024, NVIDIA CORPORATION. All rights reserved.
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
1-16
: Update header year to 2025Align with coding guideline for current-year headers.
Apply:
```diff
-/*
- * Copyright (c) 2023-2024, NVIDIA CORPORATION. All rights reserved.
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (25)
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
(1 hunks)cpp/include/tensorrt_llm/executor/dataTransceiverState.h
(2 hunks)cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
(8 hunks)cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
(4 hunks)cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
(2 hunks)cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
(3 hunks)cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
(5 hunks)cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
(6 hunks)cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
(4 hunks)cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
(39 hunks)cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
(1 hunks)cpp/tensorrt_llm/executor/serialization.cpp
(3 hunks)cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
(1 hunks)cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
(4 hunks)cpp/tests/unit_tests/executor/agentCommTest.cpp
(4 hunks)cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
(1 hunks)cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
(25 hunks)tensorrt_llm/_torch/distributed/communicator.py
(3 hunks)tensorrt_llm/_torch/pyexecutor/_util.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
(4 hunks)tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml
(1 hunks)tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml
(1 hunks)tests/integration/defs/disaggregated/test_disaggregated.py
(4 hunks)tests/integration/test_lists/test-db/l0_dgx_h100.yml
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml
- tensorrt_llm/_torch/distributed/communicator.py
- cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp1_gentp1_deepseek_v3_lite_one_mtp_ctxpp2_gentp2.yaml
- cpp/tensorrt_llm/executor/serialization.cpp
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
tests/integration/defs/disaggregated/test_disaggregated.py
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/disaggregated/test_disaggregated.py
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
🧠 Learnings (7)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Applied to files:
cpp/include/tensorrt_llm/executor/dataTransceiverState.h
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
🧬 Code graph analysis (8)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2)
setSenderState
(158-164)setSenderState
(158-159)cpp/include/tensorrt_llm/executor/dataTransceiverState.h (1)
CacheState
(40-277)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
tests/integration/defs/conftest.py (5)
disaggregated_test_root
(2339-2344)llm_venv
(707-723)disaggregated_example_root
(270-275)llama_model_root
(964-1039)deepseek_v3_model_root
(621-636)
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2)
cpp/include/tensorrt_llm/executor/dataTransceiverState.h (2)
CacheState
(40-277)AttentionType
(45-277)cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (2)
CacheTransceiver
(111-213)CacheTransceiver
(215-222)
cpp/include/tensorrt_llm/executor/dataTransceiverState.h (4)
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)
nodiscard
(67-90)cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)
nodiscard
(821-883)cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)
nodiscard
(524-604)cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (3)
nodiscard
(61-112)nodiscard
(123-126)nodiscard
(133-138)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (1)
cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1)
requestInfo
(179-179)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)
numLayers
(290-392)numLayers
(609-738)numLayers
(609-611)
cpp/tests/unit_tests/executor/agentCommTest.cpp (2)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)
numLayers
(290-392)numLayers
(609-738)numLayers
(609-611)cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (5)
numLayers
(32-69)numLayers
(32-33)numLayers
(77-88)numLayers
(77-77)mTransBufferManager
(71-75)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (4)
targetIRanks
(181-185)targetIRanks
(181-182)concatKvCacheV2Dispatch
(1619-1658)concatKvCacheV2Dispatch
(1619-1622)
🔇 Additional comments (23)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
80-81
: Config and CI Registration Confirmed
Both verifications have passed:
- The YAML config
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_gentp4.yaml
is present (YAML OK).
- The test
test_disaggregated_ctxpp4_gentp4
is registered intests/integration/test_lists/test-db/l0_dgx_h100.yml
(line 36).No further action required.
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)
397-404
: LGTM on default-trans-size path with per-target sizes
The refactor to pass targetSizeVec aligns with the new API and preserves existing expectations (onlyUseDynamicBuffer=false, backing pre-alloc honored). No issues here.
424-431
: LGTM on full-cover case
Vectorized sizes with smaller chunks reaching full cover looks correct. No further action.
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
81-83
: Const placement nit is fine
Switching to “BlocksPerWindow const ...” is stylistic; compiles the same. No concerns.
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (1)
252-288
: recvConnectionAndRequestInfo(): assert optional CacheState presence before use
You call requestInfo.getTransState().getCacheState().value() without checking. If callers ever omit cache state, this will terminate. Add a check and better error.
```diff
-    auto offsetRatio = computeSendOffsetRatio(requestInfo.getTransState().getCacheState().value(),
+    TLLM_CHECK_WITH_INFO(requestInfo.getTransState().getCacheState().has_value(),
+        "Missing CacheState in RequestInfo TransState");
+    auto offsetRatio = computeSendOffsetRatio(requestInfo.getTransState().getCacheState().value(),
         requestInfo.getTransState().getCommState()->getSelfIdx(), mCacheState, validConnectionIdx);
```
254-258
: Zero-copy gating tightened to exact PP/TP equality — good
Requiring identical PP and TP for zcopy is correct since layout/strides must match bit-for-bit. This avoids silent corruption on hetero layouts.
318-323
: Possible mismatch: getOrAllocateSendBuffers targetNum vs bufferEleSizes.size()
You pass bufferTargetNum (connections.size()/dupFactor) as targetNum, but bufferEleSizes has length targetNum (connections.size()). Ensure the allocator expects per-bufferTarget sizing; otherwise this risks under-allocation and OOB when indexing outputSplitCaches by bufferIdx/processIdx.
Would you confirm the expected vector length in CacheTransBufferManager::getOrAllocateSendBuffers(..., int targetNum, const std::vector<size_t>& bufferEleSizes, ...)? If it expects bufferEleSizes.size() == targetNum, consider:
```diff
-    auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
-        cacheBufferId, static_cast<int>(bufferTargetNum), bufferEleSizes, bufferManager);
+    auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
+        cacheBufferId, static_cast<int>(bufferEleSizes.size()), bufferEleSizes, bufferManager);
```
If the allocator is deliberately keyed to bufferTargetNum, we should instead pre-aggregate sizes per bufferTargetNum entry and pass that vector. I can generate that refactor if needed.
804-806
: Switch to concatKvCacheV2Dispatch looks good
The new dispatcher aligns with per-window output map and self/dest configs. Assumes upstream split sizing fixes are applied.
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)
73-89
: Public API extended with attentionLayerNumPerPP — consistent and clear
Both constructors accept the per-PP vector before dataType, and the delegating ctor is updated accordingly. Signature ordering is consistent with implementation.
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)
187-190
: Agent backend now receives CacheState — good change
Passing CacheState to AgentConnectionManager allows it to reason about PP/TP domains. This matches the broader API changes.
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)
323-326
: SymmetricalCacheTest: attentionLayerNumPerPP seeded correctly
Using {numLayers} for PP=1 is consistent. No action needed.
620-636
: Good: replacing hard divisibility ASSERT with computed per-PP distribution
The helper getLayerNumPPRank and mAttentionLayerNumPerPP setup enable uneven PP cases correctly.
973-986
: Start layer offset derived from mAttentionLayerNumPerPP — correct for uneven PP
The cumulative sum over previous PP ranks is the right offset when windows are not used. Logic looks good.
Also applies to: 1051-1063
1390-1418
: New EvenLayer instantiations add coverage — nice addition
These parameterized cases help ensure both directions (ctxPP!=genPP) are exercised with even divisions. No issues.
Also applies to: 1438-1449, 1491-1496, 1508-1513
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (9)
67-103
: Well-implemented uneven PP layer distribution logic
The new approach correctly computes layer overlaps between self and peer PP ranks to handle uneven pipeline parallelism. The logic systematically:
- Calculates self layer ranges from per-PP layer counts
- Identifies overlapping peer PP ranks and their contribution to the domain
- Validates that total peer layer contributions match self layer count
This is a significant improvement over the previous fixed-division approach.
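For readers skimming the diff, the overlap computation boils down to interval intersection over per-rank layer ranges. A standalone sketch of the idea (names are illustrative and do not mirror the PR's internal helpers):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Given per-PP attention-layer counts for the local and peer configurations, find how
// many of the local rank's layers each peer PP rank holds (0 means no overlap).
std::vector<int> peerLayerCounts(std::vector<int> const& selfLayersPerPP, int selfPPRank,
    std::vector<int> const& peerLayersPerPP)
{
    int const selfBegin = std::accumulate(selfLayersPerPP.begin(), selfLayersPerPP.begin() + selfPPRank, 0);
    int const selfEnd = selfBegin + selfLayersPerPP[selfPPRank];

    std::vector<int> counts;
    int peerBegin = 0;
    for (int layers : peerLayersPerPP)
    {
        int const peerEnd = peerBegin + layers;
        int const overlap = std::max(0, std::min(selfEnd, peerEnd) - std::max(selfBegin, peerBegin));
        counts.push_back(overlap);
        peerBegin = peerEnd;
    }
    // The peer contributions must cover exactly the local rank's layers.
    assert(std::accumulate(counts.begin(), counts.end(), 0) == selfEnd - selfBegin);
    return counts;
}
```

For example, with context per-PP counts {8, 8} and generation per-PP counts {5, 6, 5}, generation rank 1 owns layers [5, 11) and therefore overlaps both context ranks, receiving 3 layers from each.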
520-543
: Device helper function correctly implements layer-to-PP mapping
The device helper efficiently maps global layer IDs to PP-specific information using shared memory for broadcast. The linear search through prefix sums is appropriate given the typical small PP sizes, and the thread-level parallelization with synchronization ensures correctness.
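The host-side equivalent of that lookup is just a scan over the prefix-sum array. A plain C++ mirror of the logic (the real device helper additionally broadcasts the result through shared memory; this is only an illustration, not the kernel code):

```cpp
#include <cstdint>
#include <utility>

// Map a global layer id to (ppRank, layerIdWithinThatRank) using exclusive prefix sums
// of per-PP layer counts. Assumes prefix has ppSize + 1 entries with prefix[0] == 0.
std::pair<int, int> layerToDomainPP(int globalLayerId, std::int64_t const* prefix, int ppSize)
{
    for (int rank = 0; rank < ppSize; ++rank)
    {
        if (globalLayerId >= prefix[rank] && globalLayerId < prefix[rank + 1])
        {
            return {rank, static_cast<int>(globalLayerId - prefix[rank])};
        }
    }
    return {-1, -1}; // out of range
}
```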
548-549
: Consistent kernel signature extension for uneven PP support
All affected kernels now correctly accept the
prefixLayerNumDevPtr
parameter to support per-PP layer count lookups. The parameter placement is consistent across all kernel variants (MLA and standard).Also applies to: 617-618, 812-813, 873-874
565-570
: Proper layer mapping integration in kernels
The kernels correctly use the new device helper
getLayerIdInDomainPPandRankInDomainPP
to map global layer IDs to PP-specific coordinates. This replaces the previous fixed-division approach and properly handles uneven PP distributions.Also applies to: 644-649, 827-831, 890-895
1020-1021
: Data type widening for GPU pointer storage
The change from
std::vector<T*>
tostd::vector<uint64_t>
allows storing both GPU pointers and prefix layer counts in a unified buffer. This is a practical solution that maintains type safety while enabling efficient GPU data transfer.Also applies to: 1043-1044, 1051-1052, 1362-1363, 1382-1383, 1390-1391
1053-1060
: Prefix sum computation and GPU data layout
The prefix sum computation correctly creates cumulative layer counts for efficient GPU lookups. The data layout appends prefix sums to the pointer array, and the device pointer arithmetic properly separates the different data sections.
Also applies to: 1120-1122, 1392-1399, 1444-1446
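As a rough mental model of that packing (illustrative only; the PR's exact buffer layout may differ), the host can stage pointers and prefix sums in one uint64_t vector so a single H2D copy serves both:

```cpp
#include <cstdint>
#include <vector>

// Pack GPU cache pointers followed by exclusive prefix sums of per-PP layer counts.
std::vector<std::uint64_t> packPtrsAndPrefix(
    std::vector<void*> const& cachePtrs, std::vector<int> const& layersPerPP)
{
    std::vector<std::uint64_t> packed;
    packed.reserve(cachePtrs.size() + layersPerPP.size() + 1);
    for (void* p : cachePtrs)
    {
        packed.push_back(reinterpret_cast<std::uint64_t>(p));
    }
    std::uint64_t running = 0;
    packed.push_back(running); // prefix[0] = 0
    for (int layers : layersPerPP)
    {
        running += static_cast<std::uint64_t>(layers);
        packed.push_back(running); // prefix[i + 1]
    }
    return packed;
}
```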
1124-1126
: Correct layer count derivation from per-PP configuration
The code now properly derives
numLayers
from the per-PP attention layer counts using bounds-checked access. This addresses the previous limitation where fixed division didn't work for uneven PP distributions.Also applies to: 1448-1450
1152-1153
: Comprehensive kernel call updates for uneven PP
All kernel invocations consistently pass the
prefixLayerNumDevPtr
parameter across all template specializations and code paths (split/concat, MLA/standard, different vector sizes). This ensures uniform support for uneven PP throughout the codebase.Also applies to: 1166-1167, 1176-1177, 1190-1191, 1204-1205, 1219-1220, 1237-1238, 1252-1253, 1266-1267, 1281-1282, 1478-1479, 1492-1493, 1502-1503, 1516-1517, 1530-1531, 1545-1546, 1563-1564, 1578-1579, 1592-1593, 1607-1608
1401-1402
: Fix size validation after data type change
The size check should be updated to reflect the new
uint64_t
element size rather than the oldT*
pointer size.- TLLM_CHECK(PtrsDeviceBuffer->getSizeInBytes() == cachePtrs.size() * sizeof(uint64_t)); + TLLM_CHECK(PtrsDeviceBuffer->getSizeInBytes() == cachePtrs.size() * sizeof(uint64_t));Actually, looking more carefully, the current check is already correct. The issue appears to be with the old comment at line 1064 that still references
T*
.Likely an incorrect or invalid review comment.
PR_Github #16368 [ run ] completed with state |
c0907a5 to 0ed9d94
/bot run --add-multi-gpu-test |
PR_Github #16433 [ run ] triggered by Bot |
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
Outdated
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
Outdated
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
Outdated
Left some more comments. Kindly have a look. |
Signed-off-by: Chuang Zhu <[email protected]>
/bot run |
PR_Github #17736 [ run ] triggered by Bot |
PR_Github #17736 [ run ] completed with state |
/bot run |
PR_Github #17741 [ run ] triggered by Bot |
PR_Github #17741 [ run ] completed with state |
/bot run |
PR_Github #17753 [ run ] triggered by Bot |
PR_Github #17753 [ run ] completed with state |
/bot run |
PR_Github #17761 [ run ] triggered by Bot |
PR_Github #17761 [ run ] completed with state |
/bot run |
PR_Github #17783 [ run ] triggered by Bot |
PR_Github #17783 [ run ] completed with state |
@pcastonguay could you approve this PR? |
Summary by CodeRabbit
New Features
Improvements
Tests
Description
Currently, the TRT-LLM cacheTransceiver does not support the case where layer_num % pp_size != 0.
Customers who want to use RTX 6kD (with PP) as the context server and H20 (with TP) as the generation server are unable to test DeepSeek and Qwen because of this limitation.
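As a concrete illustration of the problem (not TRT-LLM's own layer-assignment code), 61 attention layers cannot be split evenly across 4 pipeline stages; one common policy gives the remainder to the first ranks, so the transceiver must exchange per-rank counts like {16, 15, 15, 15} instead of assuming layer_num / pp_size everywhere:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical split policy shown only to motivate per-rank layer counts.
std::vector<int> splitLayers(int numLayers, int ppSize)
{
    std::vector<int> perRank(ppSize, numLayers / ppSize);
    for (int r = 0; r < numLayers % ppSize; ++r)
    {
        perRank[r] += 1;
    }
    return perRank;
}

int main()
{
    for (int count : splitLayers(61, 4)) // prints: 16 15 15 15
    {
        std::printf("%d ", count);
    }
    return 0;
}
```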
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provides a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.