[None][feat] Update TargetInfo to accommodate CP in disagg #7224
Conversation
📝 Walkthrough
Adds a context-parallelism (CP) dimension to KV-cache rank/domain computation: decomposes ranks including CP, validates CP counts, partitions CP domains (mDomainCPSize, peerCPRankStart/End), extends rank generation across TP×CP×PP, returns CP-inclusive TargetRanksInfo, and updates CacheState construction and unit tests to exercise CP-aware transitions.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Test as Unit Test
participant WCc as WorldConfig (context)
participant WCg as WorldConfig (generation)
participant Func as TargetRanksInfoForDP
participant TRI as TargetRanksInfo
Test->>WCc: construct context WC (TP,PP,CP)
Test->>WCg: construct generation WC (TP,PP,CP)
Test->>Func: call TargetRanksInfoForDP(contextWC, genWC)
activate Func
note right of Func: decompose self/peer TP,PP,CP\nvalidate CP power-of-two
rect rgba(220,235,255,0.35)
note right of Func: compute domain sizes:\nmDomainPPSize, mDomainTPSize, mDomainCPSize
loop peerPPRankStart..peerPPRankEnd
loop peerCPRankStart..peerCPRankEnd
loop peerTPRankStart..peerTPRankEnd
Func->>Func: irank = k*(peerTPNum*peerCPNum) + j*peerTPNum + i
Func-->>Func: append irank
end
end
end
end
Func-->>Test: return TRI {mDomainPPSize,mDomainTPSize,mDomainCPSize,ranks,...}
deactivate Func
Test->>Test: assert ranks and domain sizes
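As a quick illustration of the enumeration shown in the diagram, here is a minimal, self-contained sketch that applies the quoted formula irank = k*(peerTPNum*peerCPNum) + j*peerTPNum + i; the function name and example sizes are assumptions for illustration, not the library API.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative only: enumerate peer ranks over a PP x CP x TP window using the
// composition from the diagram: irank = k * (peerTPNum * peerCPNum) + j * peerTPNum + i.
std::vector<int> enumeratePeerRanks(int peerTPNum, int peerCPNum, int ppStart, int ppEnd, int cpStart, int cpEnd,
    int tpStart, int tpEnd)
{
    std::vector<int> ranks;
    for (int k = ppStart; k < ppEnd; ++k)
    {
        for (int j = cpStart; j < cpEnd; ++j)
        {
            for (int i = tpStart; i < tpEnd; ++i)
            {
                ranks.push_back(k * (peerTPNum * peerCPNum) + j * peerTPNum + i);
            }
        }
    }
    return ranks;
}

int main()
{
    // Peer world TP=4, CP=2, PP=2; one source rank owning tp window [0,2), cp window [0,2), pp stage 0.
    auto ranks = enumeratePeerRanks(/*peerTPNum*/ 4, /*peerCPNum*/ 2, /*pp*/ 0, 1, /*cp*/ 0, 2, /*tp*/ 0, 2);
    // The rank count equals mDomainPPSize * mDomainCPSize * mDomainTPSize; the exact ordering
    // depends on the loop nesting, so compare as a set here.
    std::sort(ranks.begin(), ranks.end());
    assert(ranks == (std::vector<int>{0, 1, 4, 5}));
    return 0;
}
```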
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 0
🧹 Nitpick comments (4)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (4)
1441-1447
: Good abstraction: ParallelConfig alias + helper improve readability; consider adding cp>1 coverage
The alias and the small builder lambda make the scenarios concise and less error-prone. To exercise the CP dimension that this PR targets, please add at least one case with cpSize > 1 (e.g., contextCP=2/genCP=2 and a disagg case like contextCP=1/genCP=2) to ensure TargetRanksInfoForDP behavior and needSendCache remain correct when CP is involved. Right now all cases pin cpSize to 1, so CP handling is untested.
Would you like me to draft an additional verify block using cp=2 with the expected ranks/domains?
1448-1461
: verifyContext lambda looks solid; minor improvements for type-safety and failure diagnostics
- Type consistency: contextRank is an index into the parallel layout; elsewhere we use SizeType32. Using SizeType32 avoids accidental sign conversions.
- Better diffs on failures: prefer EXPECT_THAT(..., ElementsAreArray(...)) for vectors over EXPECT_EQ to get more informative mismatch output.
Apply the following diff inside this lambda:
- auto const verifyContext
-     = [&](int contextRank, ParallelConfig const& contextPC, ParallelConfig const& genPC,
+ auto const verifyContext
+     = [&](SizeType32 contextRank, ParallelConfig const& contextPC, ParallelConfig const& genPC,
          std::vector<int> const& expectRanks, int expectPPDomain, int expectTPDomain, bool expectNeedSend)
      {
@@
-     auto const contextTargetInfo
+     auto const contextTargetInfo
          = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, contextRank);
-     EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
+     EXPECT_THAT(contextTargetInfo.mIRanks, ::testing::ElementsAreArray(expectRanks));
      EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize);
      EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);
      EXPECT_EQ(expectNeedSend, MLACacheFormatter::needSendCache(contextCache, genCache, contextRank));
  };

Also applies to: 1462-1469
1471-1499
: Case A (ctx tp=4,pp=2 → gen tp=2,pp=2): expectations align with TP downscale and send-stride
The expected ranks and needSend pattern (true on TP ranks {0,2} and false on {1,3} per PP) correctly reflect the MLA needSend rule when selfTPNum > destTPNum. Nice coverage across all 8 ranks. Consider adding SCOPED_TRACE to make failures identify which contextRank failed faster.
- verifyContext(
+ SCOPED_TRACE("ctxPC=tp4,pp2,cp1 → genPC=tp2,pp2,cp1");
+ verifyContext(
      /*contextRank*/ 0, /*contextPC*/ contextPC, /*genPC*/ genPC,
      /*expectRanks*/ {0}, /*expectPPDomain*/ 1, /*expectTPDomain*/ 1, /*expectNeedSend*/ true);
1567-1574
: Typo: contextTragetInfo → contextTargetInfo
Unrelated to the new block above, but while you're here: fix the misspelling to keep test names and logs consistent.
- auto const contextTragetInfo
+ auto const contextTargetInfo
      = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, contextRank);
- EXPECT_EQ(expectRanks, contextTragetInfo.mIRanks);
- EXPECT_EQ(expectPPDomain, contextTragetInfo.mDomainPPSize);
- EXPECT_EQ(expectTPDomain, contextTragetInfo.mDomainTPSize);
+ EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
+ EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize);
+ EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);

- auto const contextTragetInfo
+ auto const contextTargetInfo
      = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(contextCache, genCache, generationRank);
- EXPECT_EQ(expectRanks, contextTragetInfo.mIRanks);
- EXPECT_EQ(expectPPDomain, contextTragetInfo.mDomainPPSize);
- EXPECT_EQ(expectTPDomain, contextTragetInfo.mDomainTPSize);
+ EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
+ EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize);
+ EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);

Also applies to: 1674-1680
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}
: Closing braces of namespaces should have a trailing comment naming the namespace (e.g., } // namespace foo)
Prefer const or constexpr variables over #define for constants
Variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside initialization; use named constexpr constants instead
Use Allman brace style
Put semicolon for empty for/while loop on a new line
Bodies of switch/while/do/for must be compound statements with braces
if and else must always use brace-delimited statements, even for single or empty statements
C++ filenames should be camelCase with first letter lowercase (e.g., thisIsAFilename.cpp) and case-insensitively unique within a target
Type names (classes, structs, etc.) use CamelCase starting with uppercase (e.g., FooBarClass)
Local variables, methods, and namespaces use camelCase starting lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not in anonymous namespace: prefix g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in anonymous namespace: prefix s (e.g., sMutableStaticGlobal)
Locally visible static variable names start with s (e.g., static std::once_flag sFlag;)
Class member variables use mPrefix camelCase (e.g., mNbFooValues); public members may omit but prefix encouraged for clarity
Constants (enums, global constants, static constants, function-scope magic numbers) use uppercase snake_case with k prefix (e.g., kDIGIT_NUM)
Function-scope constants that are not magic numbers/literals are named like non-constants (e.g., bool const pass = ...)
Avoid Hungarian notation except apps-Hungarian like nb for counts (e.g., mNbLayers)
If constructor parameter name conflicts with public member, add trailing underscore to parameter (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L, not 1234l)
Use spaces only; no tabs; indent by 4 spaces
Run clang-format (LLVM style) before submit...
Files:
cpp/tests/batch_manager/cacheTransceiverTest.cpp
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA Apache-2.0 copyright header with current year to all source files
Files:
cpp/tests/batch_manager/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2): TargetRanksInfoForDP (50-135), TargetRanksInfoForDP (50-51)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2): needSendCache (75-95), needSendCache (75-76)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2): needSendCache (56-85), needSendCache (56-57)
🔇 Additional comments (2)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (2)
1500-1515
: Case B (ctx tp=2,pp=2 → gen tp=4,pp=2): correct fan-out to two TP ranks and always-send rule
Expectations {0,1}, {2,3}, {4,5}, {6,7} and domainTP=2 match TP upscaling semantics; always-send is consistent with selfTPNum <= destTPNum. LGTM.
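To make the fan-out arithmetic in Case B concrete, here is a small hedged sketch (helper name and rank layout assumed for illustration only) that reproduces the {0,1}/{2,3}/{4,5}/{6,7} mapping for TP 2→4 with PP unchanged and CP=1:

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch of the TP 2 -> 4 fan-out discussed above (CP = 1, PP unchanged).
// Each context rank targets domainTP = genTP / ctxTP consecutive generation TP ranks
// within its own PP stage.
std::vector<int> fanOutTargets(int contextRank, int ctxTPNum, int genTPNum, int genCPNum)
{
    int const domainTP = genTPNum / ctxTPNum;
    int const tpRank = contextRank % ctxTPNum;
    int const ppRank = contextRank / ctxTPNum; // context CP assumed to be 1 here
    std::vector<int> targets;
    for (int i = tpRank * domainTP; i < (tpRank + 1) * domainTP; ++i)
    {
        targets.push_back(ppRank * (genTPNum * genCPNum) + i);
    }
    return targets;
}

int main()
{
    // ctx tp=2,pp=2 -> gen tp=4,pp=2 (cp=1): ranks 0..3 map to {0,1},{2,3},{4,5},{6,7}.
    assert(fanOutTargets(0, 2, 4, 1) == (std::vector<int>{0, 1}));
    assert(fanOutTargets(1, 2, 4, 1) == (std::vector<int>{2, 3}));
    assert(fanOutTargets(2, 2, 4, 1) == (std::vector<int>{4, 5}));
    assert(fanOutTargets(3, 2, 4, 1) == (std::vector<int>{6, 7}));
    return 0;
}
```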
1517-1529
: Case C (ctx tp=2,pp=1 → gen tp=4,pp=2): rank ordering verifies TP-major then PP-major concatenation
The expected order {0,4,1,5} exercises the nested loop order used to build IRanks (iterate TP ranks inside PP range). Domain sizes (PP=2, TP=2) and needSend=true look correct. Good addition.
If CP affects rank-global indexing in disagg mode, consider duplicating this case with cp=2 to ensure ordering remains stable under CP.
83643ae to 31628d0
Actionable comments posted: 1
🧹 Nitpick comments (2)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (2)
1441-1461
: Hoist shared ModelConfig out of the lambda to avoid repeated allocations
sharedModelConfig is identical across calls; building it once improves clarity and avoids repeated vector construction in this tight test loop.
Proposed refactor:
@@
- auto const verifyContext
-     = [&](int contextRank, tr::WorldConfig const& contextWC, tr::WorldConfig const& genWC,
-         std::vector<int> const& expectRanks, int expectPPDomain, int expectTPDomain, bool expectNeedSend)
+ // Build once and reuse in all assertions
+ auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+     std::vector<int>(numLayers, numHeads), sizePerHead, tokensPerBlock};
+
+ auto const verifyContext = [&](int contextRank, tr::WorldConfig const& contextWC, tr::WorldConfig const& genWC,
+         std::vector<int> const& expectRanks, int expectPPDomain, int expectTPDomain,
+         bool expectNeedSend)
  {
      auto attentionType = isMLA ? texec::kv_cache::CacheState::AttentionType::kMLA
                                 : texec::kv_cache::CacheState::AttentionType::kDEFAULT;
-
-     auto const sharedModelConfig
-         = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
      auto const contextCache
          = texec::kv_cache::CacheState(sharedModelConfig, contextWC, dataType, attentionType, kvFactor);
      auto const genCache = texec::kv_cache::CacheState(sharedModelConfig, genWC, dataType, attentionType, kvFactor);
1463-1490
: Add a CP>1 scenario to exercise disagg CP path
PR objective mentions accommodating CP; current blocks use cp=1. Add at least one cp=2 case to ensure TargetRanksInfoForDP and needSendCache remain consistent in the presence of context parallelism.
Example addition after the existing blocks:
+ {
+     // CP>1 should not alter TP/PP target mapping semantics.
+     tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 2, /*cpSize*/ 2};
+     tr::WorldConfig const genWC{/*tpSize*/ 4, /*ppSize*/ 2, /*cpSize*/ 2};
+     verifyContext(/*contextRank*/ 0, contextWC, genWC, /*expectRanks*/ {0, 1}, /*expectPPDomain*/ 1,
+         /*expectTPDomain*/ 2, /*expectNeedSend*/ true);
+     verifyContext(/*contextRank*/ 1, contextWC, genWC, /*expectRanks*/ {2, 3}, /*expectPPDomain*/ 1,
+         /*expectTPDomain*/ 2, /*expectNeedSend*/ true);
+     verifyContext(/*contextRank*/ 2, contextWC, genWC, /*expectRanks*/ {4, 5}, /*expectPPDomain*/ 1,
+         /*expectTPDomain*/ 2, /*expectNeedSend*/ true);
+     verifyContext(/*contextRank*/ 3, contextWC, genWC, /*expectRanks*/ {6, 7}, /*expectPPDomain*/ 1,
+         /*expectTPDomain*/ 2, /*expectNeedSend*/ true);
+ }

Also applies to: 1492-1507, 1509-1520
🔇 Additional comments (3)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)
1463-1490
: LGTM: TP downscale (4->2) mapping and send gating are correct
- IRanks mapping matches TargetRanksInfoForDP (self TP > peer TP → 1 target per rank, proper TP folding by /2).
- MLACacheFormatter::needSendCache agrees with expectation (send only when selfTpRank % (4/2) == 0).
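A compact restatement of the gating rule as these bullets describe it; this is an illustrative sketch of the rule, not the actual MLACacheFormatter::needSendCache implementation, and it ignores DP and CP:

```cpp
#include <cassert>

// Illustrative: on TP downscale only every (selfTP/destTP)-th TP rank sends;
// on TP upscale (or equal TP) every rank sends.
bool needSendSketch(int selfTpRank, int selfTPNum, int destTPNum)
{
    if (selfTPNum <= destTPNum)
    {
        return true;
    }
    return selfTpRank % (selfTPNum / destTPNum) == 0;
}

int main()
{
    // ctx TP=4 -> gen TP=2: TP ranks {0,2} send, {1,3} do not.
    assert(needSendSketch(0, 4, 2));
    assert(!needSendSketch(1, 4, 2));
    assert(needSendSketch(2, 4, 2));
    assert(!needSendSketch(3, 4, 2));
    // ctx TP=2 -> gen TP=4: always send.
    assert(needSendSketch(1, 2, 4));
    return 0;
}
```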
1492-1507
: LGTM: TP upscale (2->4) mapping and always-send behavior are correct
- DomainTPSize=2 and per-rank IRanks lists are accurate (two targets per source rank).
- With selfTP <= destTP, MLACacheFormatter always sends; assertions reflect this.
1509-1520
: LGTM: PP domain split validated and IRanks ordering matches i-then-j enumeration
- DomainPPSize=2 and DomainTPSize=2 are correct for PP 1->2 and TP 2->4.
- IRanks order {0,4,1,5} follows the nested loop over TP (outer) then PP (inner) in TargetRanksInfoForDP.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
114-122
: Clarify CP semantics for mIRanks; current implementation ignores CP and will conflict with new tests
The returned mIRanks are formed as PP×TP only (irank = j * peerTPNum + i). With CP differences (e.g., self CP=1, peer CP=2), the current code still returns a single rank. The new tests (see targetTest "CP grows from context to generation") expect two ranks {0, 1}. More importantly, elsewhere we assert mIRanks.size() == mDomainPPSize * mDomainTPSize (splitKVCache/split path), which would be violated if CP were added to mIRanks without also changing those invariants and the indexing logic (e.g., inputAllRankNum, and fillBlockInfo's pp/tp derivations).
Action needed: decide the contract.
- If mIRanks should stay PP×TP only: update the tests to expect a single rank for CP growth (see suggested diff in the test file comment).
- If mIRanks should enumerate CP as well: you must also update:
  - inputAllRankNum to include CP,
  - all rank → {pp,tp,cp} decompositions in fillBlockInfo (compute ppRank = (rank / tp) / cp etc.),
  - any site assuming size == PP×TP.
Given the breadth of the second option, I recommend aligning tests to the current invariant for this PR and planning a follow-up that carries CP through end-to-end.
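To make the two contract options easier to compare, here is a hedged sketch (hypothetical helper names, not the production code) of the rank ↔ (pp, cp, tp) decomposition and the size invariant that would have to change if CP were folded into mIRanks:

```cpp
#include <cassert>

// Hypothetical helpers: decompose a global rank laid out as (pp, cp, tp), i.e.
// rank = pp * (TP * CP) + cp * TP + tp, and the matching world size.
struct RankCoords
{
    int pp;
    int cp;
    int tp;
};

RankCoords decomposeRank(int rank, int tpNum, int cpNum)
{
    return RankCoords{rank / (tpNum * cpNum), (rank / tpNum) % cpNum, rank % tpNum};
}

int worldSize(int ppNum, int cpNum, int tpNum)
{
    // If mIRanks enumerates CP, downstream invariants must use PP x TP x CP, not PP x TP.
    return ppNum * cpNum * tpNum;
}

int main()
{
    // TP=4, CP=2: rank 13 = pp1 * 8 + cp1 * 4 + tp1.
    auto const c = decomposeRank(13, /*tpNum*/ 4, /*cpNum*/ 2);
    assert(c.pp == 1 && c.cp == 1 && c.tp == 1);
    assert(worldSize(/*pp*/ 2, /*cp*/ 2, /*tp*/ 4) == 16);
    return 0;
}
```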
♻️ Duplicate comments (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
1448-1450
: Fix invalid CTAD: std::vector(numLayers, numHeads) is ill-formed
The two-argument "count, value" constructor doesn't have a deduction guide. Specify the element type explicitly.
- auto const sharedModelConfig
-     = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
+ auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+     std::vector<int>(numLayers, numHeads), sizePerHead, tokensPerBlock};
🧹 Nitpick comments (4)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)
60-66
: Unused CP locals; mark them explicitly or remove until used
peerCPNum, selfCPNum, and selfCPRank are computed but never used. This will trigger -Wunused-variable on stricter builds and is misleading. If CP is not yet integrated into rank enumeration here, either remove them or mark them as intentionally unused.
Apply one of the following:
Option A — keep for future use but silence warnings:
- auto const peerCPNum = peerParConfig.mContextParallelism;
- auto const selfCPNum = selfParConfig.mContextParallelism;
+ [[maybe_unused]] auto const peerCPNum = peerParConfig.mContextParallelism;
+ [[maybe_unused]] auto const selfCPNum = selfParConfig.mContextParallelism;
@@
- auto const selfTPRank = selfRank % selfTPNum;
- auto const selfPPRank = selfRank / (selfTPNum * selfCPNum);
- auto const selfCPRank = (selfRank % (selfTPNum * selfCPNum)) / selfTPNum;
+ auto const selfTPRank = selfRank % selfTPNum;
+ auto const selfPPRank = selfRank / (selfTPNum * selfCPNum);
+ [[maybe_unused]] auto const selfCPRank = (selfRank % (selfTPNum * selfCPNum)) / selfTPNum;
Option B — remove until CP is fully wired into the mapping:
- auto const peerCPNum = peerParConfig.mContextParallelism;
- auto const selfCPNum = selfParConfig.mContextParallelism;
@@
- auto const selfTPRank = selfRank % selfTPNum;
- auto const selfPPRank = selfRank / (selfTPNum * selfCPNum);
- auto const selfCPRank = (selfRank % (selfTPNum * selfCPNum)) / selfTPNum;
+ auto const selfTPRank = selfRank % selfTPNum;
+ auto const selfPPRank = selfRank / (selfTPNum /* * selfCPNum intentionally omitted if CP is unused here */);
Please confirm whether CP is intended to affect the returned target ranks in this function in this PR. If yes, see my other comment about semantics mismatch with tests and downstream invariants. If not, Option A will keep the build clean.
70-74
: Also validate CP sizes (power-of-two) alongside PP
We check PP sizes are powers-of-two but not CP. For consistency and earlier failure, validate CP as well.
  for (auto val : {peerPPNum, selfPPNum})
  {
      TLLM_CHECK(isPowerOfTwo(val));
  }
+ for (auto val : {peerParConfig.mContextParallelism, selfParConfig.mContextParallelism})
+ {
+     TLLM_CHECK(isPowerOfTwo(val));
+ }
90-93
: Guard DP grouping with divisibility checks
When DP is enabled we divide TP by DP size. Add explicit divisibility checks to fail fast on misconfiguration instead of truncating.
- int const selfTPSizePerDPGroup = selfParConfig.mEnableAttentionDP ? selfTPNum / selfParConfig.mDPsize : selfTPNum;
- int const peerTPSizePerDPGroup = peerParConfig.mEnableAttentionDP ? peerTPNum / peerParConfig.mDPsize : peerTPNum;
+ if (selfParConfig.mEnableAttentionDP)
+ {
+     TLLM_CHECK(selfParConfig.mDPsize > 0 && selfTPNum % selfParConfig.mDPsize == 0);
+ }
+ if (peerParConfig.mEnableAttentionDP)
+ {
+     TLLM_CHECK(peerParConfig.mDPsize > 0 && peerTPNum % peerParConfig.mDPsize == 0);
+ }
+ int const selfTPSizePerDPGroup = selfParConfig.mEnableAttentionDP ? selfTPNum / selfParConfig.mDPsize : selfTPNum;
+ int const peerTPSizePerDPGroup = peerParConfig.mEnableAttentionDP ? peerTPNum / peerParConfig.mDPsize : peerTPNum;

cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
1441-1461
: Nice focused harness; minor suggestion: name captures to document expectations
The verifyContext lambda is a good consolidation. Consider renaming parameters to reflect semantics more clearly (expectedRanksPPxTP, expectedNeedSend) to avoid implying CP is enumerated here, unless you proceed with the broader CP contract change.
📜 Review details
📒 Files selected for processing (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1 hunks)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1 hunks)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (4)
320-323
: Bug: inputAllRankNum ignores CP, causing OOB writes when CP > 1
inputAllRankNum must reflect the peer/input world size. With CP introduced, retRanks now ranges over PP×TP×CP, but the backing array is sized with PP×TP only. Indexing blockInfos[oi * inputAllRankNum + iRank] will overflow for iRank ≥ PP×TP.
Apply this diff:
- int const inputAllRankNum = iParallelConfig.mPipelineParallelism * iParallelConfig.mTensorParallelism;
+ int const inputAllRankNum = iParallelConfig.mPipelineParallelism
+     * iParallelConfig.mTensorParallelism
+     * iParallelConfig.mContextParallelism;
328-338
: Bug: ppRank computation ignores CP; incorrect for global rank with CP > 1
For a global rank laid out as (pp, cp, tp), ppRank must be computed dividing by (TP*CP). Current code divides by TP only.
Apply this diff inside fillBlockInfo:
- int const tpRank = rank % parallelConfig.mTensorParallelism;
- int const ppRank = rank / parallelConfig.mTensorParallelism;
+ int const tpRank = rank % parallelConfig.mTensorParallelism;
+ // Decompose rank with CP in the denominator: rank = (pp * TP * CP) + (cp * TP) + tp
+ int const ppRank = rank / (parallelConfig.mTensorParallelism * parallelConfig.mContextParallelism);

Optionally capture cpRank if/when needed:
int const cpRank = (rank / parallelConfig.mTensorParallelism) % parallelConfig.mContextParallelism;
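A quick exhaustive round-trip check (illustrative only, not part of the proposed patch) confirming that these tpRank/ppRank/cpRank formulas invert the (pp, cp, tp) composition for every rank in a small world:

```cpp
#include <cassert>

int main()
{
    int const tpNum = 4;
    int const cpNum = 2;
    int const ppNum = 2;
    for (int pp = 0; pp < ppNum; ++pp)
    {
        for (int cp = 0; cp < cpNum; ++cp)
        {
            for (int tp = 0; tp < tpNum; ++tp)
            {
                int const rank = pp * (tpNum * cpNum) + cp * tpNum + tp;
                // The decomposition proposed in the diff above:
                assert(rank % tpNum == tp);
                assert(rank / (tpNum * cpNum) == pp);
                assert((rank / tpNum) % cpNum == cp);
            }
        }
    }
    return 0;
}
```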
929-933
: Assertion must include CP domain size
retRanks now has size mDomainPPSize * mDomainTPSize * mDomainCPSize. The current check will fail (or worse, silently mis-assume counts).
Apply this diff:
- TLLM_CHECK(targetRankInfo.mIRanks.size()
-     == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+ TLLM_CHECK(targetRankInfo.mIRanks.size() == static_cast<size_t>(
+     targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize * targetRankInfo.mDomainCPSize));
Follow-up: Re-evaluate whether outputCacheNum should still be independent of CP (likely yes), but the assertion must reflect the new rank enumeration.
1259-1263
: Same: assertion must include CP domain size
Mirror the fix in concatKVCache().
Apply this diff:
- TLLM_CHECK(targetRankInfo.mIRanks.size()
-     == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+ TLLM_CHECK(targetRankInfo.mIRanks.size() == static_cast<size_t>(
+     targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize * targetRankInfo.mDomainCPSize));
🧹 Nitpick comments (5)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (2)
35-43
: Add brief Doxygen for new CP domain field and confirm public ABI intent
Good addition. Please document what units and invariants mDomainCPSize follows (e.g., factor between peer/self CP, must divide peerCPNum, power-of-two expectation), since this is now part of the public struct.
Apply this diff to add a brief Doxygen trailing member doc:
  struct TargetRanksInfo
  {
      int mDomainPPSize;
      int mDomainTPSize;
-     int mDomainCPSize;
+     int mDomainCPSize; //!< Domain size along Context Parallelism (CP). Usually peerCPNum / selfCPNum; must be >= 1.
      std::vector<int> mIRanks;
      int mDupHeadFactor;
      int mPeerDupHeadFactor;
  };
24-24
: Missing include guard as per repo guidelines
Headers are required to use include guards with symbol TRTLLM_<FILE_NAME>_H. This file only has #pragma once. Consider adding guards (keep #pragma once if you like).
A guard matching the guideline would look like:
#ifndef TRTLLM_CACHE_SPLIT_CONCAT_H
#define TRTLLM_CACHE_SPLIT_CONCAT_H
// ... existing content ...
#endif // TRTLLM_CACHE_SPLIT_CONCAT_H

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)
127-129
: Replace std::cerr with repo logging
Direct stderr logging is noisy in library code and bypasses repo logging controls. Use TLLM_LOG_DEBUG instead.
Apply this diff:
- std::cerr << "peerCPRankStart: " << peerCPRankStart << ", peerCPRankEnd: " << peerCPRankEnd << ", selfCPRank: "
-           << selfCPRank << ", mDomainCPSize: " << mDomainCPSize << std::endl;
+ TLLM_LOG_DEBUG("CP domain — peerCPRankStart: %d, peerCPRankEnd: %d, selfCPRank: %d, mDomainCPSize: %d",
+     peerCPRankStart, peerCPRankEnd, selfCPRank, mDomainCPSize);
963-971
: Optional: clarify assumptions about CP and number of caches
The computation of outputCacheNum ignores CP, which may be correct if per-CP shards do not change cache multiplicity for split/concat. Please confirm this matches the intended data transfer topology; otherwise we may need to scale by mDomainCPSize or factor it into head duplication rules.
If CP should influence cache multiplicity, I can propose a concrete adjustment once we confirm the intended model.
146-155
: Minor naming/style: avoid m-prefixed locals
Locals named like members (mDomainPPSize, mDomainTPSize, mDupHeadFactor, etc.) are confusing per guideline (m- prefix for members only). Not a blocker; consider renaming in a follow-up.
Example:
- int mDomainPPSize = 1;
+ int domainPPSize = 1;
…and propagate accordingly.
📜 Review details
📒 Files selected for processing (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
🔇 Additional comments (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
133-141
: Rank enumeration includes CP — LGTM
The nesting order and irank computation are consistent with the decomposition used above.
156-156
: Return value now includes CP domain size — LGTM
Initializer order matches TargetRanksInfo fields after adding mDomainCPSize.
Actionable comments posted: 0
♻️ Duplicate comments (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
1448-1452
: Fix invalid CTAD: std::vector(numLayers, numHeads) doesn't deduce element type (won't compile)
The "count, value" constructor has no CTAD guide. Specify the element type explicitly.
Apply one of these diffs (first mirrors the earlier bot suggestion; second aligns with SizeType32):
- auto const sharedModelConfig
-     = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
+ auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+     std::vector<int>(numLayers, numHeads), sizePerHead, tokensPerBlock};
or
- auto const sharedModelConfig
-     = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
+ auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+     std::vector<SizeType32>(numLayers, numHeads), sizePerHead, tokensPerBlock};
🧹 Nitpick comments (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
1560-1565
: Fill in the commented test or add a negative-case for unsupported CP shrink
You've stubbed "TP shrinks while CP grows" but left it commented. Either implement it (good to lock ordering across mixed TP/CP transitions) or add a death/throw test for the explicitly unsupported case selfCPNum > peerCPNum (CP shrink), per TargetRanksInfoForDP's TLLM_THROW.
Example (CP shrink should throw):
- // // TP shrinks while CP grows from context to generation.
- // {
- //     tr::WorldConfig const contextWC{/*tpSize*/ 4, /*ppSize*/ 1, /*cpSize*/ 1};
- //     tr::WorldConfig const genWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 2};
- // }
+ // CP shrinks from context to generation is not supported; verify it throws.
+ {
+     tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 2};
+     tr::WorldConfig const genWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 1};
+     auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+         std::vector<int>(numLayers, numHeads), sizePerHead, tokensPerBlock};
+     auto const contextCache
+         = texec::kv_cache::CacheState(sharedModelConfig, contextWC, dataType, attentionType, kvFactor);
+     auto const genCache
+         = texec::kv_cache::CacheState(sharedModelConfig, genWC, dataType, attentionType, kvFactor);
+     EXPECT_THROW(
+         { (void) tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, /*contextRank*/ 0); },
+         std::runtime_error);
+ }
If TLLM_THROW maps to a different exception type in your build, replace std::runtime_error accordingly. Want me to push a version that also covers the mixed "TP shrinks while CP grows" positive case?
📜 Review details
📒 Files selected for processing (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🔇 Additional comments (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
1524-1582
: No remaining PP×TP-only assumptions—CP is correctly included
I ran a project-wide audit for any size checks or count initializations that multiply only mDomainPPSize × mDomainTPSize (with and without whitespace, and reversed operands) and found no instances. All occurrences of mDomainCPSize are already declared and used alongside mDomainPPSize and mDomainTPSize (in the header and in the existing tests).
It looks like the downstream invariants and callers have already been updated to include the CP dimension. There are no outstanding PP×TP-only assertions or loops to refactor. We can consider this covered.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)
322-324
: Bug: CP-aware ranks can index past the blockInfos buffer; fix inputAllRankNum and PP-rank decomposition
inputRanks now comes from targetIRanks with CP included, so iRank ∈ [0, PP×TP×CP). However, inputAllRankNum is still PP×TP, and fillBlockInfo computes ppRank = rank / TP, both ignoring CP. This can cause out-of-bounds writes into blockInfos and miscomputed ppRank when CP > 1.
Apply this patch:
@@
- int const inputAllRankNum = iParallelConfig.mPipelineParallelism * iParallelConfig.mTensorParallelism;
+ // Include CP to cover the full world of ranks
+ int const inputAllRankNum = iParallelConfig.mPipelineParallelism
+     * iParallelConfig.mTensorParallelism
+     * iParallelConfig.mContextParallelism;
@@
- int const tpRank = rank % parallelConfig.mTensorParallelism;
- int const ppRank = rank / parallelConfig.mTensorParallelism;
+ int const tpRank = rank % parallelConfig.mTensorParallelism;
+ // Decompose rank with CP in the denominator: rank = (pp * TP * CP) + (cp * TP) + tp
+ int const ppRank = rank / (parallelConfig.mTensorParallelism * parallelConfig.mContextParallelism);
  int const ppNum = parallelConfig.mPipelineParallelism;
@@
- int iRank = inputRanks[i];
- blockInfos[oi * inputAllRankNum + iRank]
+ int iRank = inputRanks[i];
+ TLLM_CHECK_WITH_INFO(0 <= iRank && iRank < inputAllRankNum,
+     "iRank (%d) out of range for PP*TP*CP world (%d).", iRank, inputAllRankNum);
+ blockInfos[oi * inputAllRankNum + iRank]
      = fillBlockInfo(iCacheState, inputBlocks[oi * iRankNum + i], iRank);
This aligns the buffer sizing and rank decomposition with the new CP-aware rank space and prevents OOB writes when CP > 1.
Also applies to: 330-341, 349-355
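For intuition, a small self-contained sketch (hypothetical buffer and sizes, not the real blockInfos type) of the indexing invariant this patch restores: the per-output-block row must be sized by the full PP×TP×CP world so every CP-aware iRank stays in bounds.

```cpp
#include <cassert>
#include <vector>

int main()
{
    // Hypothetical sizes: 2 output blocks, input world PP=2, TP=2, CP=2.
    int const numOutputBlocks = 2;
    int const ppNum = 2, tpNum = 2, cpNum = 2;
    int const inputAllRankNum = ppNum * tpNum * cpNum; // must include CP

    // blockInfos is laid out as [outputBlock][inputRank].
    std::vector<int> blockInfos(static_cast<size_t>(numOutputBlocks) * inputAllRankNum, 0);

    for (int oi = 0; oi < numOutputBlocks; ++oi)
    {
        for (int iRank = 0; iRank < inputAllRankNum; ++iRank)
        {
            size_t const idx = static_cast<size_t>(oi) * inputAllRankNum + iRank;
            assert(idx < blockInfos.size()); // would overflow if inputAllRankNum ignored CP
            blockInfos[idx] = 1;
        }
    }
    return 0;
}
```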
931-934
: Fix assertion: mIRanks now includes CP; update expected size to PP×TP×CP
The check still uses PP×TP and will fail when CP > 1. mIRanks is generated across PP×CP×TP.
- TLLM_CHECK(targetRankInfo.mIRanks.size()
-     == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+ TLLM_CHECK(targetRankInfo.mIRanks.size() == static_cast<size_t>(
+     targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize * targetRankInfo.mDomainCPSize));
1261-1263
: Same fix as above: include CP in the mIRanks size check
Mirror the PP×TP×CP expectation here too to avoid false negatives under CP.
- TLLM_CHECK(targetRankInfo.mIRanks.size()
-     == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+ TLLM_CHECK(targetRankInfo.mIRanks.size() == static_cast<size_t>(
+     targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize * targetRankInfo.mDomainCPSize));
♻️ Duplicate comments (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
60-66
: Guard CP invariants before using them in division; avoid potential div-by-zero and mis-decomposition
You compute selfPPRank using (selfTPNum * selfCPNum) before validating CP values. Move/add the CP checks before Lines 63–65 so we never divide by zero or decompose with invalid CP.
Apply this minimal reordering/safety patch:
@@
- auto const peerCPNum = peerParConfig.mContextParallelism;
- auto const selfCPNum = selfParConfig.mContextParallelism;
+ auto const peerCPNum = peerParConfig.mContextParallelism;
+ auto const selfCPNum = selfParConfig.mContextParallelism;
+
+ // Validate CP sizes before using them in denominators/decomposition
+ for (auto val : {peerCPNum, selfCPNum})
+ {
+     TLLM_CHECK(isPowerOfTwo(val));
+ }
@@
- auto const selfTPRank = selfRank % selfTPNum;
- auto const selfPPRank = selfRank / (selfTPNum * selfCPNum);
- auto const selfCPRank = (selfRank % (selfTPNum * selfCPNum)) / selfTPNum;
+ auto const selfTPRank = selfRank % selfTPNum;
+ auto const selfPPRank = selfRank / (selfTPNum * selfCPNum);
+ auto const selfCPRank = (selfRank % (selfTPNum * selfCPNum)) / selfTPNum;
+
+ // Optional: ensure selfRank falls into world bounds (PP*TP*CP). Helps catch mis-sized worlds early.
+ TLLM_CHECK_WITH_INFO(selfPPRank < selfPPNum, "selfRank out of range for PP*TP*CP world.");
And remove the duplicated CP power-of-two loop at Lines 117–120 (see follow-up patch below).
114-131
: CP domain mapping: enforce exact divisibility and clarify the unsupported downscale path
When selfCPNum <= peerCPNum you do integer division without verifying exact divisibility; when selfCPNum > peerCPNum you throw. Add a modulus check to prevent silent truncation; keep the explicit throw (or implement symmetric mapping if desired).
@@
- for (auto val : {peerCPNum, selfCPNum})
- {
-     TLLM_CHECK(isPowerOfTwo(val));
- }
+ // (Moved CP validation earlier; see Lines 60–66 patch)
@@
- if (selfCPNum <= peerCPNum)
+ if (selfCPNum <= peerCPNum)
  {
-     mDomainCPSize = peerCPNum / selfCPNum;
+     TLLM_CHECK_WITH_INFO(peerCPNum % selfCPNum == 0,
+         "peerCPNum (%d) must be divisible by selfCPNum (%d).", peerCPNum, selfCPNum);
+     mDomainCPSize = peerCPNum / selfCPNum;
      peerCPRankStart = selfCPRank * mDomainCPSize;
      peerCPRankEnd = (selfCPRank + 1) * mDomainCPSize;
  }
  else
  {
-     TLLM_THROW("selfCPNum > peerCPNum is not supported.");
+     TLLM_THROW("CP downscale (selfCPNum > peerCPNum) is not supported.");
  }
If CP downscale should mirror PP/TP behavior, I can provide a symmetric else-branch.
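For reference, a hedged sketch of the CP-domain partition this diff guards: each self CP rank owns a contiguous window of peer CP ranks of width peerCPNum / selfCPNum (illustrative standalone code, not the patched function).

```cpp
#include <cassert>

int main()
{
    int const selfCPNum = 2;
    int const peerCPNum = 4;
    assert(peerCPNum % selfCPNum == 0); // the divisibility check the diff adds

    int const domainCPSize = peerCPNum / selfCPNum;
    for (int selfCPRank = 0; selfCPRank < selfCPNum; ++selfCPRank)
    {
        int const peerCPRankStart = selfCPRank * domainCPSize;
        int const peerCPRankEnd = (selfCPRank + 1) * domainCPSize;
        // selfCPRank 0 -> peer CP ranks [0, 2), selfCPRank 1 -> [2, 4).
        assert(peerCPRankEnd - peerCPRankStart == domainCPSize);
        assert(peerCPRankEnd <= peerCPNum);
    }
    return 0;
}
```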
🧹 Nitpick comments (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)
49-50
: Update dimension comment to reflect CP is now part of the mapping
The comment still says [PP, TP], but CP is now included in the rank space.
- // [PP, TP]
+ // [PP, CP, TP]
1008-1016
: Verify DomainCP impact on split/concat shapes or document why CP does not change kernels
The kernels and host launch math only use DomainPPSize and DomainTPSize (no DomainCPSize). If CP does not alter per-rank head/layer tiling (tokens-only split), please add a brief comment here clarifying why DomainCPSize does not factor into grid/block or per-block shapes. Otherwise, wire DomainCPSize into the relevant calculations.
If helpful, I can draft comments or a small guard like:
- Assert that tokens-per-block already reflect CP partitioning, or
- Assert that CP is handled exclusively by rank routing (mIRanks) and not by kernel shapes.
Also applies to: 1043-1049, 1356-1362, 1364-1369
371-381
: Minor: use const for oPPNum/iPPNum; consider precomputing gridDimx for clarity
Non-blocking; current code is fine.
📜 Review details
📒 Files selected for processing (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
🔇 Additional comments (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)
135-141
: Rank composition with CP looks correct
irank = (k * peerTPNum * peerCPNum) + (j * peerTPNum) + i matches the decomposition rank = (pp * TP * CP) + (cp * TP) + tp. Loop order [TP -> CP -> PP] is consistent with that mapping.
311-317
: Consistency check on inputRanks routing
Good to assert inputRanks equals targetIRanks(...). With CP included, this is even more important to catch mismatches earlier.
158-159
: Aggregate-init check complete: no mismatched sites
- Used ripgrep across the cpp/ tree to find any brace-initializer, assignment-style or functional-cast constructions of TargetRanksInfo.
- Only occurrence is the return in cacheSplitConcat.cu (lines 158–159), which now correctly includes mDomainCPSize.
- No other aggregate-init sites or direct constructions of TargetRanksInfo were found.
All initializer sites are updated; no further action required.
0f33380 to e04c5f5
Actionable comments posted: 0
♻️ Duplicate comments (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
1448-1453
: Fix invalid CTAD: std::vector(numLayers, numHeads) won't compile
The test contains invalid CTAD syntax. std::vector(numLayers, numHeads) lacks an explicit element type and won't compile on conforming compilers. Specify the element type to use the "count, value" constructor properly.
Apply this diff to fix the compilation error:
- auto const sharedModelConfig
-     = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
+ auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+     std::vector<int>(numLayers, numHeads), sizePerHead, tokensPerBlock};
🧹 Nitpick comments (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
1679-1699
: Complex multi-dimensional scaling test - consider adding assertions for intermediate computations
While the test correctly verifies the final target ranks when TP and CP grow while PP shrinks, the complexity of this scenario might benefit from additional intermediate assertions to aid debugging.
Consider adding debug assertions or comments explaining the rank mapping formula for this complex case, especially how the PP shrink affects the rank distribution:
  // TP, CP grow while PP shrinks from context to generation.
  {
      tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 2, /*cpSize*/ 1};
      tr::WorldConfig const genWC{/*tpSize*/ 4, /*ppSize*/ 1, /*cpSize*/ 2};
+     // Context rank mapping: rank = ppRank * (tpSize * cpSize) + cpRank * tpSize + tpRank
+     // Generation expects: TP 2→4 (2x growth), PP 2→1 (collapse), CP 1→2 (2x growth)
      verifyContext(
          /*contextRank*/ 0, /*contextWC*/ contextWC, /*genWC*/ genWC,
          /*expectRanks*/ {0, 4, 1, 5}, /*expectPPDomain*/ 1, /*expectTPDomain*/ 2, /*expectCPDomain*/ 2,
          /*expectNeedSend*/ true);
📜 Review details
📒 Files selected for processing (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tests/batch_manager/cacheTransceiverTest.cpp
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tests/batch_manager/cacheTransceiverTest.cpp
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
cpp/tests/batch_manager/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (2)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  - TargetRanksInfoForDP (50-159)
  - TargetRanksInfoForDP (50-51)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
  - needSendCache (56-85)
  - needSendCache (56-57)
🔇 Additional comments (5)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (5)
1441-1444
: Update verifyContext signature - excellent CP domain addition!The lambda now properly accepts
expectCPDomain
as a parameter and validates it againstmDomainCPSize
, which correctly extends the test coverage to CP domain configurations. This aligns well with the PR's objective of accommodating CP in disaggregated scenarios.
1539-1555
: Well-structured CP growth test blockExcellent test coverage for CP growth from context to generation! The test correctly verifies that when CP increases from 1 to 2, the target ranks expand to include the additional CP dimension (e.g., {0, 2} for rank 0), and all ranks correctly report
expectNeedSend
as true.
1557-1577
: Comprehensive test for combined TP and CP growthGood coverage of the complex case where both TP and CP grow simultaneously. The expected ranks correctly reflect the cross-product of TP×CP domain expansion (e.g., {0, 4, 1, 5} represents TP growth from 2→4 combined with CP growth from 1→2).
1579-1595
: TP shrink with CP growth test correctly implementedThe test properly handles the case where TP shrinks (4→2) while CP grows (1→2), correctly identifying which ranks need to send cache based on the shrinking TP dimension.
1449-1453
: Config-driven CacheState construction aligns with PR objectivesThe migration to using
CacheState::ModelConfig
andWorldConfig
objects for constructing both context and generation caches is a clean design that supports the dynamic TP/PP/CP transitions mentioned in the PR summary. This config-driven approach provides better encapsulation and makes the tests more maintainable.
Force-pushed from e04c5f5 to 0eb2596 (Compare)
/bot run |
PR_Github #16495 [ run ] triggered by Bot |
Review thread on cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (outdated, resolved)
Actionable comments posted: 1
🧹 Nitpick comments (4)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)
1524-1537
: Avoid commented-out test code; track the unresolved case explicitlyThis block is commented out with a TODO, which goes against the repo guideline to avoid dead/commented code. Prefer one of:
- Extract this scenario into a separate disabled test (use GTest’s DISABLED_ prefix) with a reference to a tracking issue.
- Or remove the code and leave a concise TODO with an issue link.
Proposed minimal change (remove the commented block; leave a TODO):
- // // PP grows while TP shrinks from context to generation. - // { - // tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 1}; - // tr::WorldConfig const genWC{/*tpSize*/ 1, /*ppSize*/ 2, /*cpSize*/ 1}; - // verifyContext( - // /*contextRank*/ 0, /*contextWC*/ contextWC, /*genWC*/ genWC, /*expectRanks*/ {0, 1}, /*expectPPDomain*/ - // 2, - // /*expectTPDomain*/ 1, /*expectCPDomain*/ 1, /*expectNeedSend*/ true); - // // TODO: Figure why needSendCache is false here. - // verifyContext( - // /*contextRank*/ 1, /*contextWC*/ contextWC, /*genWC*/ genWC, /*expectRanks*/ {0, 1}, /*expectPPDomain*/ - // 2, - // /*expectTPDomain*/ 1, /*expectCPDomain*/ 1, /*expectNeedSend*/ false); - // } + // TODO: Add dedicated test for “PP grows while TP shrinks” once needSendCache semantics are finalized. Track in issue <ID>.If you want, I can draft a DISABLED_ test that reuses the same verifyContext helper.
1701-1715
: Second commented-out scenario: convert to a disabled test or removeSame concern as Lines 1524–1537: please avoid commented-out code; either move into a DISABLED_ test and reference a tracking issue or remove and leave a TODO.
- // // PP, CP grow while TP shrinks from context to generation. - // { - // tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 1}; - // tr::WorldConfig const genWC{/*tpSize*/ 1, /*ppSize*/ 2, /*cpSize*/ 4}; - // verifyContext( - // /*contextRank*/ 0, /*contextWC*/ contextWC, /*genWC*/ genWC, /*expectRanks*/ {0, 4, 1, 5, 2, 6, 3, 7}, - // /*expectPPDomain*/ 2, - // /*expectTPDomain*/ 1, /*expectCPDomain*/ 4, /*expectNeedSend*/ true); - // // TODO: Figure why needSendCache is false here. - // verifyContext( - // /*contextRank*/ 1, /*contextWC*/ contextWC, /*genWC*/ genWC, /*expectRanks*/ {0, 4, 1, 5, 2, 6, 3, 7}, - // /*expectPPDomain*/ 2, - // /*expectTPDomain*/ 1, /*expectCPDomain*/ 4, /*expectNeedSend*/ false); - // } + // TODO: Add dedicated test for “PP, CP grow while TP shrinks” once needSendCache semantics are finalized. Track in issue <ID>.
1753-1759
: Typo: rename contextTragetInfo → contextTargetInfoMinor readability/consistency fix.
- auto const contextTragetInfo + auto const contextTargetInfo = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, contextRank); - EXPECT_EQ(expectRanks, contextTragetInfo.mIRanks); - EXPECT_EQ(expectPPDomain, contextTragetInfo.mDomainPPSize); - EXPECT_EQ(expectTPDomain, contextTragetInfo.mDomainTPSize); + EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks); + EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize); + EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);
1844-1866
: Typo: rename verfiyGeneration → verifyGenerationMinor naming correction to match spelling of “verify”.
- auto const verfiyGeneration = [&](int contextRank, int generationRank, std::vector<int> const& expectRanks, + auto const verifyGeneration = [&](int contextRank, int generationRank, std::vector<int> const& expectRanks, int expectPPDomain, int expectTPDomain) @@ - verfiyGeneration( + verifyGeneration( /*contextRank*/ 0, /*generationRank*/ 0, /*expectRanks*/ {0}, /*expectPPDomain*/ 1, /*expectTPDomain*/ 1); - verfiyGeneration( + verifyGeneration( /*contextRank*/ 0, /*generationRank*/ 1, /*expectRanks*/ {0}, /*expectPPDomain*/ 1, /*expectTPDomain*/ 1); @@ - verfiyGeneration( + verifyGeneration( /*contextRank*/ 0, /*generationRank*/ 0, /*expectRanks*/ {0}, /*expectPPDomain*/ 1, /*expectTPDomain*/ 1); - verfiyGeneration( + verifyGeneration( /*contextRank*/ 0, /*generationRank*/ 1, /*expectRanks*/ {0}, /*expectPPDomain*/ 1, /*expectTPDomain*/ 1);
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
- cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧰 Additional context used
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  - TargetRanksInfoForDP (50-159)
  - TargetRanksInfoForDP (50-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (10)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (10)
1464-1492
: Scenario “TP shrinks”: expectations and needSend toggling look correctMapping and domain sizes match TargetRanksInfoForDP’s TP-downscale math; alternating needSend is consistent with duplicate coverage within the same peer rank. No issues.
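For illustration, one plausible gating rule that reproduces the alternating pattern described here; this is an assumption for the sketch only, not necessarily what mlaCacheFormatter's needSendCache actually implements:

```cpp
// Hypothetical gating rule: when TP shrinks, only the first of the context TP ranks that
// collapse onto the same peer TP rank sends the duplicated MLA KV cache.
#include <cassert>

bool needSendSketch(int selfTPRank, int selfTPNum, int peerTPNum)
{
    if (selfTPNum <= peerTPNum)
    {
        return true; // No duplication when TP stays the same or grows.
    }
    return selfTPRank % (selfTPNum / peerTPNum) == 0;
}

int main()
{
    // Context TP = 4 -> generation TP = 2: ranks 0 and 2 send, ranks 1 and 3 do not.
    assert(needSendSketch(0, 4, 2));
    assert(!needSendSketch(1, 4, 2));
    assert(needSendSketch(2, 4, 2));
    assert(!needSendSketch(3, 4, 2));
    return 0;
}
```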
1494-1510
: Scenario “TP grows”: correct rank ordering and domain sizingExpected ranks reflect i-major ordering (TP outermost, then CP, then PP) in TargetRanksInfoForDP. Looks good.
1512-1522
: Scenario “TP and PP grow”: correct 2×2 expansion, ordering {k over PP inside i over TP}The expected order {0,4,1,5} matches the nested loops (i TP outer, k PP inner).
1539-1555
: Scenario “CP grows”: ranks and DomainCPSize assertions align with CP fan-out. The expected ranks {0,2}, {1,3}, etc., correctly reflect irank = k*TP*CP + j*TP + i with CP=2.
1557-1577
: Scenario “TP and CP grow”: correct cross-product expansion and orderingThe sequence {0,4,1,5} matches the nested loop order (i over TP outer, j over CP inner).
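A standalone re-derivation of that ordering, under the loop nesting described in these comments (hypothetical variable names, not the code under review):

```cpp
// Enumerate peer ranks with TP outermost, CP in the middle, PP innermost, using
// irank = k * (peerTPNum * peerCPNum) + j * peerTPNum + i.
#include <cassert>
#include <vector>

int main()
{
    // Context rank 0 of {tp=2, pp=1, cp=1}; generation is {tp=4, pp=1, cp=2}.
    int const peerTPNum = 4, peerCPNum = 2;
    int const domainTP = 2, domainCP = 2, domainPP = 1; // 4/2, 2/1, 1/1
    int const tpStart = 0, cpStart = 0, ppStart = 0;    // Context rank 0 maps to the first slice.

    std::vector<int> ranks;
    for (int i = tpStart; i < tpStart + domainTP; ++i)
    {
        for (int j = cpStart; j < cpStart + domainCP; ++j)
        {
            for (int k = ppStart; k < ppStart + domainPP; ++k)
            {
                ranks.push_back(k * (peerTPNum * peerCPNum) + j * peerTPNum + i);
            }
        }
    }
    assert((ranks == std::vector<int>{0, 4, 1, 5})); // Matches the expectation in the test.
    return 0;
}
```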
1579-1595
: Scenario “TP shrinks while CP grows”: assertions consistent with mixed shrink/growAlternating needSend for pairs mapping to the same peer TP rank looks intentional and consistent.
1597-1617
: Scenario “PP and CP grow”: two-level expansion validatedExpected ranks {0,4,2,6}, etc., reflect k over PP as the innermost loop and j over CP in the middle.
1619-1647
: Scenario “PP shrinks while CP grows”: correct downscale of PP with CP fan-outRank expectations and domain sizes look correct for selfPP > peerPP and selfCP < peerCP.
1649-1677
: Scenario “TP and PP shrink while CP grows”: coverage and assertions look goodCorrect mapping and needSend alternation across the pairs that collapse onto the same peer TP rank.
1679-1699
: Scenario “TP, CP grow while PP shrinks”: correct mixed transitionsRanks and domain assertions are aligned with the TargetRanksInfoForDP nested-loop construction.
Force-pushed from 0eb2596 to 6c86b28 (Compare)
/bot run |
Actionable comments posted: 0
♻️ Duplicate comments (4)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)
1457-1461
: Add a sanity check that expectRanks size matches PP×TP×CP domain productThis catches mismatches early when editing expectations or changing rank ordering.
Apply:
```diff
     auto const contextTargetInfo
         = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, contextRank);
     EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
+    EXPECT_EQ(expectRanks.size(),
+        static_cast<size_t>(expectPPDomain) * static_cast<size_t>(expectTPDomain)
+            * static_cast<size_t>(expectCPDomain));
     EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize);
     EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);
     EXPECT_EQ(expectCPDomain, contextTargetInfo.mDomainCPSize);
```
1524-1537
: Avoid commented-out test code; either disable as a named test or track with an issue. Leaving large commented blocks hinders maintainability. Prefer a disabled test (GTest naming DISABLED_...) with a clear TODO/issue link, or remove and file an issue. This also aligns with the guideline to avoid dead code. If you want, I can draft a small TEST(targetTest, DISABLED_PPGrowsTPShrinks_needSendMismatch) encapsulating this case.
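A rough sketch of what such a disabled test could look like (hypothetical name and body; it assumes the includes, the tr namespace alias, and the expectations already present in this test file, and keeps the verifyContext calls as comments because that helper is local to the existing test):

```cpp
#include <gtest/gtest.h>
// Hypothetical sketch only; relies on the project headers and the `tr` namespace alias
// already pulled in by cacheTransceiverTest.cpp.

TEST(targetTest, DISABLED_PPGrowsTPShrinksNeedSendMismatch)
{
    // TODO(<issue-id>): enable once the needSendCache semantics for this transition are settled.
    tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 1};
    tr::WorldConfig const genWC{/*tpSize*/ 1, /*ppSize*/ 2, /*cpSize*/ 1};

    // Expectations copied from the commented-out block: ranks {0, 1}, PP domain 2, TP domain 1, CP domain 1.
    // verifyContext(/*contextRank*/ 0, contextWC, genWC, /*expectRanks*/ {0, 1},
    //     /*expectPPDomain*/ 2, /*expectTPDomain*/ 1, /*expectCPDomain*/ 1, /*expectNeedSend*/ true);
    // verifyContext(/*contextRank*/ 1, contextWC, genWC, /*expectRanks*/ {0, 1},
    //     /*expectPPDomain*/ 2, /*expectTPDomain*/ 1, /*expectCPDomain*/ 1, /*expectNeedSend*/ false);
    (void) contextWC; // Silence unused-variable warnings in this sketch.
    (void) genWC;
}
```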
1701-1714
: Same here: convert to a disabled test instead of commenting outCarry the rationale (“needSendCache is false here”) into the disabled test name or add a TODO with an issue id.
1448-1453
: Fix CTAD type deduction: std::vector(numLayers, numHeads) deduces std::vector
CacheState::ModelConfig::mNbKvHeadsPerLayer
expectsstd::vector<SizeType32>
. Usingstd::vector(numLayers, numHeads)
deducesstd::vector<int>
and risks narrowing or compilation errors on platforms whereSizeType32
!=int
. Also, cast the scalars to the expected type for consistency.Apply:
- auto const sharedModelConfig - = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock}; + auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{ + std::vector<SizeType32>(numLayers, static_cast<SizeType32>(numHeads)), + static_cast<SizeType32>(sizePerHead), + static_cast<SizeType32>(tokensPerBlock)};
🧹 Nitpick comments (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
135-141
: Rank enumeration order updated (TP outer × CP middle × PP inner) — verify consumers expect this orderThe nested loop order yields sequences like {0, 4, 1, 5} for TP growth with PP=2, which matches the updated tests. Ensure any code that previously assumed PP-major iteration is updated accordingly.
50-52
: Update comment to reflect the CP dimensionThe function-level comment still says “[PP, TP]”. With CP added, please update it to avoid confusion, e.g., “[PP, TP, CP]” and document the irank formula.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
- cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
🧰 Additional context used
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  - TargetRanksInfoForDP (50-159)
  - TargetRanksInfoForDP (50-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
1441-1444
: Good refactor: verifyContext signature now takes WorldConfig and packs asserts togetherThis improves readability and mirrors the production construction path. LGTM.
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)
63-66
: CP rank decomposition looks consistent with the irank layoutUsing:
- selfTPRank = selfRank % selfTPNum
- selfPPRank = selfRank / (selfTPNum × selfCPNum)
- selfCPRank = (selfRank % (selfTPNum × selfCPNum)) / selfTPNum
matches irank = (pp × TP × CP) + (cp × TP) + tp. LGTM.
Please confirm no callers still assume the old 2D [PP, TP] layout when iterating target ranks.
158-158
: Audit CP dimension in split/concat helpersPlease verify that downstream helpers and kernels in
cacheSplitConcat.cu
account for the newly added CP domain:
- In
splitKVCache
andconcatKVCache
(or similarly named wrappers), update any size‐check onmIRanks.size()
so it asserts against
mDomainPPSize * mDomainTPSize * mDomainCPSize
(instead of just PP * TP).- In the CUDA kernels (loop bounds, grid/block calculations, indexing math), ensure the third dimension (CP) is folded into any two-dimensional assumptions.
- Add or extend unit tests to cover non-trivial CP splits—i.e., cases where
mDomainCPSize > 1
.I didn’t find any existing
mIRanks.size() == mDomainPPSize * mDomainTPSize
checks in the repo, so please audit those functions and kernels and add the CP factor where appropriate. Let me know if you’d like assistance updating any of these sections!
114-131
: Symmetric CP downscale mapping missing (and divisibility already enforced by power-of-two checks)The CP branch currently only handles upscaling and throws on downscale, unlike the TP/PP section which implements symmetric mapping in both directions. Also, since both
peerCPNum
andselfCPNum
are asserted to be powers of two,peerCPNum/selfCPNum
is always an integer—no additional divisibility check is needed.Please choose one of the following:
- Implement symmetric downscale (mirroring the TP/PP logic):
```diff
     if (selfCPNum <= peerCPNum)
     {
         // up-scale (ratio of two powers-of-two is always integral)
         mDomainCPSize = peerCPNum / selfCPNum;
         peerCPRankStart = selfCPRank * mDomainCPSize;
         peerCPRankEnd = (selfCPRank + 1) * mDomainCPSize;
     }
     else
     {
-        TLLM_THROW("selfCPNum > peerCPNum is not supported.");
+        // down-scale: each self rank maps to one peer rank
+        // selfCPNum and peerCPNum are powers of two, so ratio is integral
+        mDomainCPSize = 1;
+        peerCPRankStart = selfCPRank / (selfCPNum / peerCPNum);
+        peerCPRankEnd = peerCPRankStart + mDomainCPSize;
     }
```
- Keep downscale unsupported: replace the else branch with a clear exception and a TODO, e.g.:
```diff
     else
     {
-        TLLM_THROW("selfCPNum > peerCPNum is not supported.");
+        TLLM_THROW("Downscale (selfCPNum > peerCPNum) is not supported yet; see ISSUE-1234 TODO");
     }
```
Either way, document the behavior and add a test to cover the downscale case.
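For reference, a self-contained sketch of the CP partition arithmetic under discussion; the upscale branch mirrors the behavior described above, while the downscale branch mirrors the reviewer's proposal and is hypothetical:

```cpp
#include <cassert>

struct CpRange
{
    int domainCPSize;
    int peerCPRankStart;
    int peerCPRankEnd;
};

CpRange mapCpRank(int selfCPRank, int selfCPNum, int peerCPNum)
{
    if (selfCPNum <= peerCPNum)
    {
        // Upscale: one self CP rank fans out to peerCPNum / selfCPNum peer ranks.
        int const domain = peerCPNum / selfCPNum;
        return {domain, selfCPRank * domain, (selfCPRank + 1) * domain};
    }
    // Downscale (hypothetical): several self CP ranks collapse onto one peer rank.
    int const start = selfCPRank / (selfCPNum / peerCPNum);
    return {1, start, start + 1};
}

int main()
{
    auto const up = mapCpRank(/*selfCPRank=*/1, /*selfCPNum=*/2, /*peerCPNum=*/4);
    assert(up.domainCPSize == 2 && up.peerCPRankStart == 2 && up.peerCPRankEnd == 4);

    auto const down = mapCpRank(/*selfCPRank=*/3, /*selfCPNum=*/4, /*peerCPNum=*/2);
    assert(down.domainCPSize == 1 && down.peerCPRankStart == 1 && down.peerCPRankEnd == 2);
    return 0;
}
```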
PR_Github #16498 [ run ] triggered by Bot |
PR_Github #16495 [ run ] completed with state |
PR_Github #16498 [ run ] completed with state |
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (4)
323-326
: Include CP in inputAllRankNum (correct indexing when CP>1 on input side)If input ranks include CP (decode path), current sizing under-allocates and risks OOB indexing.
- int const inputAllRankNum = iParallelConfig.mPipelineParallelism * iParallelConfig.mTensorParallelism; + int const inputAllRankNum = iParallelConfig.mPipelineParallelism + * iParallelConfig.mTensorParallelism * iParallelConfig.mContextParallelism;
331-346
: Derive ppRank with TP×CP strideRank decomposition here ignores CP; when CP>1, PP rank is miscomputed.
- int const tpRank = rank % parallelConfig.mTensorParallelism; - int const ppRank = rank / parallelConfig.mTensorParallelism; + int const tpRank = rank % parallelConfig.mTensorParallelism; + int const ppRank = rank / (parallelConfig.mTensorParallelism * parallelConfig.mContextParallelism);
932-934
: Account for CP in rank-count sanity checkWith CP added to rank generation, this invariant should include it.
- TLLM_CHECK(targetRankInfo.mIRanks.size() - == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize))); + TLLM_CHECK( + targetRankInfo.mIRanks.size() + == (static_cast<size_t>(targetRankInfo.mDomainPPSize) + * static_cast<size_t>(targetRankInfo.mDomainTPSize) + * static_cast<size_t>(targetRankInfo.mDomainCPSize)));
1261-1263
: Same: include CP in rank-count check (concat path)Mirror the fix for the split path.
- TLLM_CHECK(targetRankInfo.mIRanks.size() - == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize))); + TLLM_CHECK( + targetRankInfo.mIRanks.size() + == (static_cast<size_t>(targetRankInfo.mDomainPPSize) + * static_cast<size_t>(targetRankInfo.mDomainTPSize) + * static_cast<size_t>(targetRankInfo.mDomainCPSize)));
♻️ Duplicate comments (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
114-131
: Add explicit CP divisibility checks (guard silent truncation)Even with power-of-two validation, adding explicit divisibility checks improves diagnostics and future-proofs against relaxing the PoT invariant.
Apply:
for (auto val : {peerCPNum, selfCPNum}) { TLLM_CHECK(isPowerOfTwo(val)); } if (selfCPNum <= peerCPNum) { - mDomainCPSize = peerCPNum / selfCPNum; + TLLM_CHECK_WITH_INFO(peerCPNum % selfCPNum == 0, "peerCPNum must be divisible by selfCPNum."); + mDomainCPSize = peerCPNum / selfCPNum; peerCPRankStart = selfCPRank * mDomainCPSize; peerCPRankEnd = (selfCPRank + 1) * mDomainCPSize; } else { - peerCPRankStart = selfCPRank / (selfCPNum / peerCPNum); + TLLM_CHECK_WITH_INFO(selfCPNum % peerCPNum == 0, "selfCPNum must be divisible by peerCPNum."); + peerCPRankStart = selfCPRank / (selfCPNum / peerCPNum); peerCPRankEnd = peerCPRankStart + mDomainCPSize; }cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
1448-1453
: Fix ModelConfig element types; avoid wrong std::vector deduction
std::vector(numLayers, numHeads)
deducesstd::vector<int>
and may not matchSizeType32
. Also ensure all fields use the expected element type.- auto const sharedModelConfig - = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock}; + auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{ + std::vector<SizeType32>(numLayers, static_cast<SizeType32>(numHeads)), + static_cast<SizeType32>(sizePerHead), + static_cast<SizeType32>(tokensPerBlock)};
🧹 Nitpick comments (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
133-141
: Reserve rank vector capacityAvoids reallocations in the triple loop.
- std::vector<int> retRanks; + std::vector<int> retRanks; + retRanks.reserve(static_cast<size_t>(mDomainPPSize) * mDomainTPSize * mDomainCPSize);cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)
1457-1461
: Assert rank list size matches PP×TP×CP productStrengthens the test to catch mapping/ordering regressions.
EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize); EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize); EXPECT_EQ(expectCPDomain, contextTargetInfo.mDomainCPSize); + EXPECT_EQ(expectRanks.size(), + static_cast<size_t>(expectPPDomain) * static_cast<size_t>(expectTPDomain) * static_cast<size_t>(expectCPDomain));
1751-1757
: Typo: contextTragetInfo → contextTargetInfoMinor rename improves readability.
- auto const contextTragetInfo + auto const contextTargetInfo = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, contextRank); - - EXPECT_EQ(expectRanks, contextTragetInfo.mIRanks); - EXPECT_EQ(expectPPDomain, contextTragetInfo.mDomainPPSize); - EXPECT_EQ(expectTPDomain, contextTragetInfo.mDomainTPSize); + EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks); + EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize); + EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
- cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
🧰 Additional context used
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  - TargetRanksInfoForDP (50-160)
  - TargetRanksInfoForDP (50-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
60-66
: CP rank decomposition looks correct
selfTPRank
,selfPPRank = selfRank / (selfTPNum * selfCPNum)
, andselfCPRank
are consistent with the new TP×CP×PP layout.
159-160
: Manual Verification Recommended — Confirm all TargetRanksInfo usages updated to include mDomainCPSize
Automated searches did not find any directTargetRanksInfo{…}
initializer-list constructions or structured bindings in the codebase; all existing references appear to use member‐access (e.g.,mDomainCPSize
) or are returned by helper functions. However, absence of evidence is not proof of absence.• Search for any remaining brace‐initializer calls, e.g.:
TargetRanksInfo{…}and ensure they pass six arguments in the new order.
• Audit for structured bindings orstd::tie
patterns, for example:auto [pp, tp, cp, ranks, dupHead, peerDup] = targetIRanks(…);
and update them to include the
cp
element.
• Verify any other call sites (outside tests) oftargetIRanks
orTargetRanksInfoForDP
are handled correctly.cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
1680-1712
: Optional CP Downscale Test CasePlease add a scenario where the context’s CP count is greater than the generation’s CP count (e.g. contextCP=2, genCP=1) to exercise the
selfCPNum > peerCPNum
branch inTargetRanksInfoForDP
.• Location: cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp, within the “PP, CP grow while TP shrinks” block (around lines 1680–1712)
• Suggested diff:@@ // PP, CP grow while TP shrinks from context to generation. { tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 1}; tr::WorldConfig const genWC{/*tpSize*/ 1, /*ppSize*/ 2, /*cpSize*/ 4}; + // CP downscale case: contextCP > genCP + tr::WorldConfig const contextWC2{/*tpSize*/ 1, /*ppSize*/ 2, /*cpSize*/ 2}; + tr::WorldConfig const genWC2{/*tpSize*/ 1, /*ppSize*/ 2, /*cpSize*/ 1}; + verifyContext( + /*contextRank*/ 0, /*contextWC*/ contextWC2, /*genWC*/ genWC2, + /*expectRanks*/ { /* … appropriate expected ranks … */ }, + /*expectPPDomain*/ 2, /*expectTPDomain*/ 1, /*expectCPDomain*/ 1, + /*expectNeedSend*/ true); }After adding the new case, rebuild the test binary and run:
# Build your tests (e.g., via CMake/Ninja) # Then: ./cpp/build/tests/batch_manager/cacheTransceiverTest --gtest_filter="targetTest.CacheStateNODP"Verify that all tests, including the new downscale case, pass successfully.
PR_Github #16750 [ run ] completed with state |
Force-pushed from 39a8f4c to 6d8f43e (Compare)
/bot run |
PR_Github #16808 [ run ] triggered by Bot |
Actionable comments posted: 0
♻️ Duplicate comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
1441-1463
: Fix ModelConfig construction; add domain-size sanity check
- std::vector(numLayers, numHeads) deduces std::vector<int>. CacheState::ModelConfig expects std::vector<SizeType32>. Use an explicit element type and casts to avoid narrowing and a potential compile break.
- Add a sanity check that expectRanks.size() equals PP×TP×CP domain product computed by TargetRanksInfoForDP.
Apply this diff:
- auto const sharedModelConfig - = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock}; + auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{ + std::vector<SizeType32>(numLayers, static_cast<SizeType32>(numHeads)), + static_cast<SizeType32>(sizePerHead), + static_cast<SizeType32>(tokensPerBlock)}; @@ - EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks); + // Sanity: mIRanks count must equal the product of domain sizes. + EXPECT_EQ(expectRanks.size(), + static_cast<size_t>(contextTargetInfo.mDomainPPSize) + * static_cast<size_t>(contextTargetInfo.mDomainTPSize) + * static_cast<size_t>(contextTargetInfo.mDomainCPSize)); + EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);Run to catch any other places with accidental CTAD on std::vector for ModelConfig:
#!/bin/bash rg -nP --type=cpp 'CacheState::ModelConfig\\s*\\{\\s*std::vector\\s*\\('
🧹 Nitpick comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
1700-1712
: Add CP-shrink coverage (selfCP > peerCP) to exercise the else-branchCurrently only CP growth (1→2, 1→4) is tested. Add scenarios where CP shrinks (e.g., context CP=4 → generation CP=1) to validate peerCPRankStart computation and rank ordering. Also consider a mixed case with TP grow, PP grow, CP shrink to fully cover the mapping matrix.
I can append representative CP-shrink blocks mirroring the existing patterns if you want.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
- cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧰 Additional context used
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  - TargetRanksInfoForDP (50-160)
  - TargetRanksInfoForDP (50-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (11)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (11)
1464-1492
: TP shrink mapping and MLA send gating look correctExpected ranks and needSend flags match the DP-group modulo rule when TP shrinks (4→2) with CP=1.
1494-1510
: TP grow case is consistent with TP-major, CP-next, PP-slowest orderingRank ordering {0,4,1,5}… matches loops (TP inner-most push with PP stride).
1512-1522
: TP and PP grow: cross-product ordering matches implementationCombined domains (PP×TP) produce the expected interleaving.
1524-1536
: PP grows while TP shrinks: expectations align with domain computationSingle TP choice with two PP targets and correct needSend asymmetry.
1538-1554
: CP grow only: ranks step by TP across CP slices as expectedmDomainCPSize=2 reflected correctly in target ranks.
1556-1576
: TP and CP grow: ordering and cardinality are correctRank vectors reflect TP-fastest, CP-next for each PP.
1578-1594
: TP shrinks, CP grows: send gating and targets are correctNon-sending ranks (odd TP ranks) are captured for MLA.
1596-1616
: PP and CP grow with constant TP: mapping is consistentTwo PP and two CP destinations with correct ordering.
1618-1646
: PP shrinks, CP grows: expectations are coherentDomain sizes and rank lists match the mapping scheme.
1648-1676
: TP and PP shrink, CP grows: mixed shrink/grow case looks goodNeedSend false on “duplicate” TP senders is correct.
1678-1698
: TP, CP grow while PP shrinks: ordering verifiedInterleaving of TP within each PP destination is correct.
PR_Github #16808 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #16869 [ run ] triggered by Bot |
Signed-off-by: Balaram Buddharaju <[email protected]>
Force-pushed from 6d8f43e to e7865b3 (Compare)
/bot run --disable-fail-fast |
PR_Github #16893 [ run ] triggered by Bot |
PR_Github #16869 [ run ] completed with state |
Actionable comments posted: 0
♻️ Duplicate comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
1441-1452
: Fix ModelConfig vector construction to avoid type mismatch.The
std::vector(numLayers, numHeads)
deduces tostd::vector<int>
, butModelConfig
expectsstd::vector<SizeType32>
. This can cause compilation errors or narrowing conversion warnings.Apply this diff to fix the type issue:
- auto const sharedModelConfig - = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock}; + auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{ + std::vector<SizeType32>(numLayers, static_cast<SizeType32>(numHeads)), + static_cast<SizeType32>(sizePerHead), + static_cast<SizeType32>(tokensPerBlock)};
🧹 Nitpick comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
1457-1460
: Consider adding domain-size sanity check.While the CP domain testing is comprehensive, adding a sanity check that the size of
expectRanks
matches the product of domain sizes would catch configuration mismatches early.Add this sanity check after computing
contextTargetInfo
:auto const contextTargetInfo = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, contextRank); + // Sanity check: expectRanks size should match domain product + EXPECT_EQ(expectRanks.size(), + static_cast<size_t>(expectPPDomain) * static_cast<size_t>(expectTPDomain) * static_cast<size_t>(expectCPDomain)) + << "Expected ranks size mismatch for domain product"; EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
(3 hunks)cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
(1 hunks)cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh}
: Close namespaces with a trailing comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Variables not modified after initialization must be declared const
Only use non-trivial literals for initialization; use named constexprs instead of inline magic numbers (except 0, nullptr, true, false)
Use Allman brace style
Filename style for C++: camelCase starting lowercase (e.g., thisIsAFilename.cpp); must be case-insensitive unique within a target
Type names use CamelCase with uppercase first letter
Local variables, methods, and namespaces use lowerCamelCase
Non-static global (non-magic) variables use gPrefix (e.g., gCounter)
Static or anonymous-namespace globals use sPrefix (e.g., sMutableStaticGlobal)
Locally visible static variables use s prefix (e.g., static std::once_flag sFlag)
Member variables use mPrefix (e.g., mNbFooValues); public members may omit but prefer m when clarity benefits
Constants: enums and constant values use UPPER_SNAKE with k prefix (e.g., kDIGIT_NUM)
If macros must be used, name them UPPER_SNAKE (e.g., FOO_VERSION); prefer constants over #define
Run clang-format (LLVM style); max line length 120; only justify exceptions with clang-format off/on blocks
Use C++-style comments; C comments only for special inline cases; prefer // for single-line
Capitalize and punctuate full-sentence comments
Disable code with #if/#endif (or noop macros), not by commenting; avoid dead code
Do not throw exceptions across library boundaries
Use the least-forceful cast; avoid removing const/volatile; avoid C-style and functional casts; static_cast from void* to T*; reinterpret_cast as last resort; avoid dynamic_cast
Avoid large inline functions unless justified; class-defined functions are implicitly inline
Prefer anonymous namespaces over static for internal linkage of functions
Keep parameter names consistent between declarations and definitions
Only #define/#undef at g...
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces only; no tabs; indent by 4 spaces
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{h,hpp,hh,hxx,cuh,cpp,cc,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cuh,cpp,cc,cxx,cu}
: Follow Doxygen: //! for comments, //!< for members
All templates (classes/functions/members/static members) must be instantiated at least once
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cuh}
: If a class is not POD, keep data members private
Use include guards named TRTLLM_<FILE_NAME_UPPER> without leading/trailing underscores
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{cpp,cc,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cc,cxx,cu}
: Place the semicolon of an empty for/while loop on a new line
Bodies of switch/while/do/for must be compound (brace-delimited), including if/else
Use smart pointers for heap objects; prefer unique_ptr over shared_ptr; weak_ptr only exceptionally; avoid deprecated smart pointers
Annotate non-obvious call arguments with inline C comments (/* name = */ value)
Do not use assignment within subexpressions (e.g., if (x = y))
Switch over enums should cover all values without default to enable compile-time checks
Write well-structured switch statements; no chained case/goto-like constructs
No fall-through except between consecutive empty-to-nonempty cases; each case must end with break or throw; return not allowed
If a switch clause is compound, place break inside the braces
Every defined function must be used at least once (no unused methods)
Avoid C library functions; prefer C++ alternatives; do not memset non-POD objects; use brace init or std::fill_n
Files:
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
TargetRanksInfoForDP
(50-160)TargetRanksInfoForDP
(50-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (11)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1)
39-39
: LGTM! CP domain size properly added to TargetRanksInfo.The addition of
mDomainCPSize
follows the existing pattern and is correctly positioned aftermDomainTPSize
.cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (10)
1464-1493
: Excellent coverage for TP shrinking scenarios with CP.The test comprehensively covers TP shrinking from context to generation, correctly verifying CP domain size remains 1 and checking the
needSendCache
flag for MLA optimization.
1495-1510
: LGTM! TP growth scenario properly tested.The test correctly validates TP expansion with CP=1, demonstrating proper rank mapping across the TP domain.
1538-1554
: Well-structured CP growth test cases.The test properly validates CP expansion from 1 to 2 while maintaining TP and PP dimensions, correctly computing the expanded rank set and CP domain size of 2.
1556-1576
: Excellent coverage of combined TP and CP growth.The test effectively validates simultaneous TP and CP expansion, correctly computing the 2×2 domain expansion and resulting rank mappings.
1578-1594
: Good coverage for TP shrinking with CP growth.The test properly handles the complex scenario where TP decreases while CP increases, ensuring correct rank distribution and checking MLA-specific
needSendCache
behavior.
1596-1616
: Comprehensive PP and CP growth testing.The test correctly validates simultaneous PP and CP expansion, properly computing the 2×2 domain size and resulting rank mappings across pipeline and context parallel dimensions.
1618-1646
: Thorough coverage of PP shrinking with CP growth.The test properly validates PP reduction while CP increases, ensuring correct rank mappings across all context ranks. Good attention to detail in covering all 8 context ranks.
1648-1676
: Excellent coverage of complex TP/PP shrinking with CP growth.The test comprehensively covers the scenario where both TP and PP decrease while CP increases, properly handling MLA-specific
needSendCache
logic.
1678-1698
: Well-tested TP/CP growth with PP shrinking.The test properly validates the complex scenario of TP and CP expansion with PP reduction, ensuring correct 2×2 CP×TP domain expansion.
1700-1712: Comprehensive extreme scaling test case.
Excellent test coverage for significant PP and CP growth (2×4) with TP shrinking, properly validating the 8-rank domain expansion.
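To make the rank-mapping pattern these comments describe concrete, here is a hedged, self-contained C++ sketch of enumerating peer ranks across a TP x CP x PP domain with TP varying fastest; the function name, parameters, and flattening order are illustrative assumptions rather than the production logic in cacheSplitConcat.cu.

```cpp
#include <iostream>
#include <vector>

// Illustrative only: enumerate the peer ranks that one local rank targets when the
// generation side runs with peerTPNum x peerCPNum x peerPPNum workers. The bounds
// (tpStart/tpEnd and so on) would normally come from the domain-size computation;
// here they are passed in directly so the sketch stays self-contained.
std::vector<int> enumeratePeerRanks(
    int peerTPNum, int peerCPNum, int tpStart, int tpEnd, int cpStart, int cpEnd, int ppStart, int ppEnd)
{
    std::vector<int> ranks;
    for (int pp = ppStart; pp < ppEnd; ++pp)
    {
        for (int cp = cpStart; cp < cpEnd; ++cp)
        {
            for (int tp = tpStart; tp < tpEnd; ++tp)
            {
                // Flatten (pp, cp, tp) into one rank id: TP varies fastest, then CP, then PP.
                ranks.push_back(pp * (peerTPNum * peerCPNum) + cp * peerTPNum + tp);
            }
        }
    }
    return ranks;
}

int main()
{
    // Example: generation side has TP=2, CP=2, PP=1 and this local rank spans the whole domain.
    auto const ranks = enumeratePeerRanks(/* peerTPNum = */ 2, /* peerCPNum = */ 2, /* tpStart = */ 0,
        /* tpEnd = */ 2, /* cpStart = */ 0, /* cpEnd = */ 2, /* ppStart = */ 0, /* ppEnd = */ 1);
    for (int rank : ranks)
    {
        std::cout << rank << ' '; // prints: 0 1 2 3
    }
    std::cout << '\n';
    return 0;
}
```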
/bot run --disable-fail-fast
PR_Github #16907 [ run ] triggered by Bot
PR_Github #16893 [ run ] completed with state
PR_Github #16907 [ run ] completed with state
/bot run
PR_Github #17009 [ run ] triggered by Bot
PR_Github #17009 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #17015 [ run ] triggered by Bot
PR_Github #17015 [ run ] completed with state
Description
This MR updates the TargetRanksInfoForDP function to account for Context Parallelism, following the same mapping scheme from here. It also restructures TEST(targetTest, CacheStateNODP) for improved readability.
Assumptions:
Test Coverage
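As a rough, hypothetical illustration of the kind of case this coverage exercises (a CP 1 to 2 transition), a GoogleTest-style sketch might look like the following. The struct, helper, and expected rank values are invented stand-ins consistent with the flattening shown earlier, not code or numbers taken from the actual test file.

```cpp
#include <gtest/gtest.h>

#include <vector>

// Hypothetical stand-in for the information the real test would obtain from
// TargetRanksInfoForDP; stubbed here so the sketch compiles on its own.
struct FakeTargetInfo
{
    int domainTPSize;
    int domainCPSize;
    int domainPPSize;
    std::vector<int> ranks;
};

static FakeTargetInfo fakeTargetInfoForCPGrowth()
{
    // Context: TP=2, CP=1, PP=1; generation: TP=2, CP=2, PP=1.
    // Context rank 0 is assumed to fan out across the two generation CP ranks that hold
    // its tokens, so the CP domain size becomes 2 and (with TP fastest) the peer ranks
    // are 0 and 2. These concrete values are assumptions for illustration only.
    return FakeTargetInfo{/* domainTPSize = */ 1, /* domainCPSize = */ 2, /* domainPPSize = */ 1, /* ranks = */ {0, 2}};
}

// Link against gtest_main (or provide your own main) to run this sketch.
TEST(TargetInfoSketch, CPGrowthOneToTwo)
{
    auto const info = fakeTargetInfoForCPGrowth();
    EXPECT_EQ(info.domainCPSize, 2);
    EXPECT_EQ(info.domainTPSize, 1);
    EXPECT_EQ(info.ranks, (std::vector<int>{0, 2}));
}
```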
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Tests
Chores