Skip to content

Conversation

brb-nv
Copy link
Collaborator

@brb-nv brb-nv commented Aug 25, 2025

Description

This MR

  • Updates TargetRanksInfoForDP function to account for Context Parallelism. Follows the same mapping scheme from here.
  • Refactors TEST(targetTest, CacheStateNODP) for improved readability.

Assumptions:

  • CP>1 is valid only for decode phase. For prefill, CP is always 1. This shall be the case for prototyping of helix parallelism.

Test Coverage

$ ./cpp/build/tests/batch_manager/cacheTransceiverTest --gtest_filter="targetTest.CacheStateNODP"

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Summary by CodeRabbit

  • New Features

    • Added context-parallelism (CP) awareness to cache domain partitioning; CP is included in rank generation and the CP domain size is exposed in rank info.
  • Tests

    • Expanded unit tests to cover many TP/PP/CP transition scenarios, validating rank mappings, domain sizes (PP/TP/CP), and cache-send behavior.
  • Chores

    • Cache-state construction migrated to shared ModelConfig/WorldConfig API (new ModelConfig type and updated constructor).

Copy link
Contributor

coderabbitai bot commented Aug 25, 2025

📝 Walkthrough

Walkthrough

Adds a context-parallelism (CP) dimension to KV-cache rank/domain computation: decomposes ranks including CP, validates CP counts, partitions CP domains (mDomainCPSize, peerCPRankStart/End), extends rank generation across TP×CP×PP, returns CP-inclusive TargetRanksInfo, and updates CacheState construction and unit tests to exercise CP-aware transitions.

Changes

Cohort / File(s) Summary
Executor: CP-aware rank/domain logic
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
Adds peerCPNum/selfCPNum, derives selfCPRank, validates CP power-of-two, computes mDomainCPSize and peerCPRankStart/End, extends rank generation to iterate TP×CP×PP and compute irank, and returns mDomainCPSize in TargetRanksInfo.
Public struct expansion
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
Adds public int mDomainCPSize to TargetRanksInfo and updates TargetRanksInfoForDP return/init to include CP domain size.
Unit tests & API: WorldConfig/ModelConfig + CP scenarios
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
Replaces int-based CacheState construction with shared CacheState::ModelConfig + per-context/per-generation tr::WorldConfig; updates verifyContext signature and assertions to include mDomainCPSize; adds multiple TP/PP/CP transition scenarios and needSendCache checks.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Test as Unit Test
  participant WCc as WorldConfig (context)
  participant WCg as WorldConfig (generation)
  participant Func as TargetRanksInfoForDP
  participant TRI as TargetRanksInfo

  Test->>WCc: construct context WC (TP,PP,CP)
  Test->>WCg: construct generation WC (TP,PP,CP)
  Test->>Func: call TargetRanksInfoForDP(contextWC, genWC)
  activate Func
  note right of Func: decompose self/peer TP,PP,CP\nvalidate CP power-of-two
  rect rgba(220,235,255,0.35)
    note right of Func: compute domain sizes:\nmDomainPPSize, mDomainTPSize, mDomainCPSize
    loop peerPPRankStart..peerPPRankEnd
      loop peerCPRankStart..peerCPRankEnd
        loop peerTPRankStart..peerTPRankEnd
          Func->>Func: irank = k*(peerTPNum*peerCPNum) + j*peerTPNum + i
          Func-->>Func: append irank
        end
      end
    end
  end
  Func-->>Test: return TRI {mDomainPPSize,mDomainTPSize,mDomainCPSize,ranks,...}
  deactivate Func
  Test->>Test: assert ranks and domain sizes
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • MatthiasKohl
  • pcastonguay
  • schetlur-nv

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbit in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbit in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbit gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbit read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbit help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbit ignore or @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbit summary or @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbit or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (4)

1441-1447: Good abstraction: ParallelConfig alias + helper improve readability; consider adding cp>1 coverage

The alias and the small builder lambda make the scenarios concise and less error-prone. To exercise the CP dimension that this PR targets, please add at least one case with cpSize > 1 (e.g., contextCP=2/genCP=2 and a disagg case like contextCP=1/genCP=2) to ensure TargetRanksInfoForDP behavior and needSendCache remain correct when CP is involved. Right now all cases pin cpSize to 1, so CP handling is untested.

Would you like me to draft an additional verify block using cp=2 with the expected ranks/domains?


1448-1461: verifyContext lambda looks solid; minor improvements for type-safety and failure diagnostics

  • Type consistency: contextRank is an index into the parallel layout; elsewhere we use SizeType32. Using SizeType32 avoids accidental sign conversions.
  • Better diffs on failures: prefer EXPECT_THAT(..., ElementsAreArray(...)) for vectors over EXPECT_EQ to get more informative mismatch output.

Apply the following diff inside this lambda:

-    auto const verifyContext
-        = [&](int contextRank, ParallelConfig const& contextPC, ParallelConfig const& genPC,
+    auto const verifyContext
+        = [&](SizeType32 contextRank, ParallelConfig const& contextPC, ParallelConfig const& genPC,
               std::vector<int> const& expectRanks, int expectPPDomain, int expectTPDomain, bool expectNeedSend)
     {
@@
-        auto const contextTargetInfo
+        auto const contextTargetInfo
             = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, contextRank);
 
-        EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
+        EXPECT_THAT(contextTargetInfo.mIRanks, ::testing::ElementsAreArray(expectRanks));
         EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize);
         EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);
         EXPECT_EQ(expectNeedSend, MLACacheFormatter::needSendCache(contextCache, genCache, contextRank));
     };

Also applies to: 1462-1469


1471-1499: Case A (ctx tp=4,pp=2 → gen tp=2,pp=2): expectations align with TP downscale and send-stride

The expected ranks and needSend pattern (true on TP ranks {0,2} and false on {1,3} per PP) correctly reflect the MLA needSend rule when selfTPNum > destTPNum. Nice coverage across all 8 ranks. Consider adding SCOPED_TRACE to make failures identify which contextRank failed faster.

-        verifyContext(
+        SCOPED_TRACE("ctxPC=tp4,pp2,cp1 → genPC=tp2,pp2,cp1");
+        verifyContext(
             /*contextRank*/ 0, /*contextPC*/ contextPC, /*genPC*/ genPC, /*expectRanks*/ {0}, /*expectPPDomain*/ 1,
             /*expectTPDomain*/ 1, /*expectNeedSend*/ true);

1567-1574: Typo: contextTragetInfo → contextTargetInfo

Unrelated to the new block above, but while you’re here: fix the misspelling to keep test names and logs consistent.

-        auto const contextTragetInfo
+        auto const contextTargetInfo
             = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, contextRank);
 
-        EXPECT_EQ(expectRanks, contextTragetInfo.mIRanks);
-        EXPECT_EQ(expectPPDomain, contextTragetInfo.mDomainPPSize);
-        EXPECT_EQ(expectTPDomain, contextTragetInfo.mDomainTPSize);
+        EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
+        EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize);
+        EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);
-        auto const contextTragetInfo
+        auto const contextTargetInfo
             = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(contextCache, genCache, generationRank);
 
-        EXPECT_EQ(expectRanks, contextTragetInfo.mIRanks);
-        EXPECT_EQ(expectPPDomain, contextTragetInfo.mDomainPPSize);
-        EXPECT_EQ(expectTPDomain, contextTragetInfo.mDomainTPSize);
+        EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
+        EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize);
+        EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);

Also applies to: 1674-1680

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6a44e5b and 83643ae.

📒 Files selected for processing (1)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}: Closing braces of namespaces should have a trailing comment naming the namespace (e.g., } // namespace foo)
Prefer const or constexpr variables over #define for constants
Variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside initialization; use named constexpr constants instead
Use Allman brace style
Put semicolon for empty for/while loop on a new line
Bodies of switch/while/do/for must be compound statements with braces
if and else must always use brace-delimited statements, even for single or empty statements
C++ filenames should be camelCase with first letter lowercase (e.g., thisIsAFilename.cpp) and case-insensitively unique within a target
Type names (classes, structs, etc.) use CamelCase starting with uppercase (e.g., FooBarClass)
Local variables, methods, and namespaces use camelCase starting lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not in anonymous namespace: prefix g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in anonymous namespace: prefix s (e.g., sMutableStaticGlobal)
Locally visible static variable names start with s (e.g., static std::once_flag sFlag;)
Class member variables use mPrefix camelCase (e.g., mNbFooValues); public members may omit but prefix encouraged for clarity
Constants (enums, global constants, static constants, function-scope magic numbers) use uppercase snake_case with k prefix (e.g., kDIGIT_NUM)
Function-scope constants that are not magic numbers/literals are named like non-constants (e.g., bool const pass = ...)
Avoid Hungarian notation except apps-Hungarian like nb for counts (e.g., mNbLayers)
If constructor parameter name conflicts with public member, add trailing underscore to parameter (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L, not 1234l)
Use spaces only; no tabs; indent by 4 spaces
Run clang-format (LLVM style) before submit...

Files:

  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA Apache-2.0 copyright header with current year to all source files

Files:

  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-135)
  • TargetRanksInfoForDP (50-51)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
  • needSendCache (75-95)
  • needSendCache (75-76)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
  • needSendCache (56-85)
  • needSendCache (56-57)
🔇 Additional comments (2)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (2)

1500-1515: Case B (ctx tp=2,pp=2 → gen tp=4,pp=2): correct fan-out to two TP ranks and always-send rule

Expectations {0,1}, {2,3}, {4,5}, {6,7} and domainTP=2 match TP upscaling semantics; always-send is consistent with selfTPNum <= destTPNum. LGTM.


1517-1529: Case C (ctx tp=2,pp=1 → gen tp=4,pp=2): rank ordering verifies TP-major then PP-major concatenation

The expected order {0,4,1,5} exercises the nested loop order used to build IRanks (iterate TP ranks inside PP range). Domain sizes (PP=2, TP=2) and needSend=true look correct. Good addition.

If CP affects rank-global indexing in disagg mode, consider duplicating this case with cp=2 to ensure ordering remains stable under CP.

@brb-nv brb-nv force-pushed the user/brb/target-ranks-with-cp branch from 83643ae to 31628d0 Compare August 25, 2025 18:06
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (2)

1441-1461: Hoist shared ModelConfig out of the lambda to avoid repeated allocations

sharedModelConfig is identical across calls; building it once improves clarity and avoids repeated vector construction in this tight test loop.

Proposed refactor:

@@
-    auto const verifyContext
-        = [&](int contextRank, tr::WorldConfig const& contextWC, tr::WorldConfig const& genWC,
-              std::vector<int> const& expectRanks, int expectPPDomain, int expectTPDomain, bool expectNeedSend)
+    // Build once and reuse in all assertions
+    auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+        std::vector<int>(numLayers, numHeads), sizePerHead, tokensPerBlock};
+
+    auto const verifyContext = [&](int contextRank, tr::WorldConfig const& contextWC, tr::WorldConfig const& genWC,
+                                   std::vector<int> const& expectRanks, int expectPPDomain, int expectTPDomain,
+                                   bool expectNeedSend)
     {
         auto attentionType = isMLA ? texec::kv_cache::CacheState::AttentionType::kMLA
                                    : texec::kv_cache::CacheState::AttentionType::kDEFAULT;
-
-        auto const sharedModelConfig
-            = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
         auto const contextCache
             = texec::kv_cache::CacheState(sharedModelConfig, contextWC, dataType, attentionType, kvFactor);
         auto const genCache = texec::kv_cache::CacheState(sharedModelConfig, genWC, dataType, attentionType, kvFactor);

1463-1490: Add a CP>1 scenario to exercise disagg CP path

PR objective mentions accommodating CP; current blocks use cp=1. Add at least one cp=2 case to ensure TargetRanksInfoForDP and needSendCache remain consistent in the presence of context parallelism.

Example addition after the existing blocks:

+    {
+        // CP>1 should not alter TP/PP target mapping semantics.
+        tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 2, /*cpSize*/ 2};
+        tr::WorldConfig const genWC    {/*tpSize*/ 4, /*ppSize*/ 2, /*cpSize*/ 2};
+        verifyContext(/*contextRank*/ 0, contextWC, genWC, /*expectRanks*/ {0, 1}, /*expectPPDomain*/ 1,
+                      /*expectTPDomain*/ 2, /*expectNeedSend*/ true);
+        verifyContext(/*contextRank*/ 1, contextWC, genWC, /*expectRanks*/ {2, 3}, /*expectPPDomain*/ 1,
+                      /*expectTPDomain*/ 2, /*expectNeedSend*/ true);
+        verifyContext(/*contextRank*/ 2, contextWC, genWC, /*expectRanks*/ {4, 5}, /*expectPPDomain*/ 1,
+                      /*expectTPDomain*/ 2, /*expectNeedSend*/ true);
+        verifyContext(/*contextRank*/ 3, contextWC, genWC, /*expectRanks*/ {6, 7}, /*expectPPDomain*/ 1,
+                      /*expectTPDomain*/ 2, /*expectNeedSend*/ true);
+    }

Also applies to: 1492-1507, 1509-1520

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 83643ae and 31628d0.

📒 Files selected for processing (1)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}: Closing braces of namespaces should have a trailing comment naming the namespace (e.g., } // namespace foo)
Prefer const or constexpr variables over #define for constants
Variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside initialization; use named constexpr constants instead
Use Allman brace style
Put semicolon for empty for/while loop on a new line
Bodies of switch/while/do/for must be compound statements with braces
if and else must always use brace-delimited statements, even for single or empty statements
C++ filenames should be camelCase with first letter lowercase (e.g., thisIsAFilename.cpp) and case-insensitively unique within a target
Type names (classes, structs, etc.) use CamelCase starting with uppercase (e.g., FooBarClass)
Local variables, methods, and namespaces use camelCase starting lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not in anonymous namespace: prefix g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in anonymous namespace: prefix s (e.g., sMutableStaticGlobal)
Locally visible static variable names start with s (e.g., static std::once_flag sFlag;)
Class member variables use mPrefix camelCase (e.g., mNbFooValues); public members may omit but prefix encouraged for clarity
Constants (enums, global constants, static constants, function-scope magic numbers) use uppercase snake_case with k prefix (e.g., kDIGIT_NUM)
Function-scope constants that are not magic numbers/literals are named like non-constants (e.g., bool const pass = ...)
Avoid Hungarian notation except apps-Hungarian like nb for counts (e.g., mNbLayers)
If constructor parameter name conflicts with public member, add trailing underscore to parameter (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L, not 1234l)
Use spaces only; no tabs; indent by 4 spaces
Run clang-format (LLVM style) before submit...

Files:

  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA Apache-2.0 copyright header with current year to all source files

Files:

  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-135)
  • TargetRanksInfoForDP (50-51)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
  • needSendCache (75-95)
  • needSendCache (75-76)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
  • needSendCache (56-85)
  • needSendCache (56-57)
🔇 Additional comments (3)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)

1463-1490: LGTM: TP downscale (4->2) mapping and send gating are correct

  • IRanks mapping matches TargetRanksInfoForDP (self TP > peer TP → 1 target per rank, proper TP folding by /2).
  • MLACacheFormatter::needSendCache agrees with expectation (send only when selfTpRank % (4/2) == 0).

1492-1507: LGTM: TP upscale (2->4) mapping and always-send behavior are correct

  • DomainTPSize=2 and per-rank IRanks lists are accurate (two targets per source rank).
  • With selfTP <= destTP, MLACacheFormatter always sends; assertions reflect this.

1509-1520: LGTM: PP domain split validated and IRanks ordering matches i-then-j enumeration

  • DomainPPSize=2 and DomainTPSize=2 are correct for PP 1->2 and TP 2->4.
  • IRanks order {0,4,1,5} follows the nested loop over TP (outer) then PP (inner) in TargetRanksInfoForDP.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)

114-122: Clarify CP semantics for mIRanks; current implementation ignores CP and will conflict with new tests

The returned mIRanks are formed as PP×TP only (irank = j * peerTPNum + i). With CP differences (e.g., self CP=1, peer CP=2), the current code still returns a single rank. The new tests (see targetTest “CP grows from context to generation”) expect two ranks {0, 1}. More importantly, elsewhere we assert mIRanks.size() == mDomainPPSize * mDomainTPSize (splitKVCache/split path), which would be violated if CP were added to mIRanks without also changing those invariants and the indexing logic (e.g., inputAllRankNum, and fillBlockInfo’s pp/tp derivations).

Action needed: decide the contract.

  • If mIRanks should stay PP×TP only: update the tests to expect a single rank for CP growth (see suggested diff in the test file comment).
  • If mIRanks should enumerate CP as well: you must also update:
    • inputAllRankNum to include CP,
    • all rank{pp,tp,cp} decompositions in fillBlockInfo (compute ppRank = (rank / tp) / cp etc.),
    • any site assuming size == PP×TP.

Given the breadth of the second option, I recommend aligning tests to the current invariant for this PR and planning a follow-up that carries CP through end-to-end.

♻️ Duplicate comments (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

1448-1450: Fix invalid CTAD: std::vector(numLayers, numHeads) is ill-formed

The two-argument “count, value” constructor doesn’t have a deduction guide. Specify the element type explicitly.

-        auto const sharedModelConfig
-            = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
+        auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+            std::vector<int>(numLayers, numHeads), sizePerHead, tokensPerBlock};
🧹 Nitpick comments (4)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)

60-66: Unused CP locals; mark them explicitly or remove until used

peerCPNum, selfCPNum, and selfCPRank are computed but never used. This will trigger -Wunused-variable on stricter builds and is misleading. If CP is not yet integrated into rank enumeration here, either remove them or mark them as intentionally unused.

Apply one of the following:

Option A — keep for future use but silence warnings:

-    auto const peerCPNum = peerParConfig.mContextParallelism;
-    auto const selfCPNum = selfParConfig.mContextParallelism;
+    [[maybe_unused]] auto const peerCPNum = peerParConfig.mContextParallelism;
+    [[maybe_unused]] auto const selfCPNum = selfParConfig.mContextParallelism;
@@
-    auto const selfTPRank = selfRank % selfTPNum;
-    auto const selfPPRank = selfRank / (selfTPNum * selfCPNum);
-    auto const selfCPRank = (selfRank % (selfTPNum * selfCPNum)) / selfTPNum;
+    auto const selfTPRank = selfRank % selfTPNum;
+    auto const selfPPRank = selfRank / (selfTPNum * selfCPNum);
+    [[maybe_unused]] auto const selfCPRank = (selfRank % (selfTPNum * selfCPNum)) / selfTPNum;

Option B — remove until CP is fully wired into the mapping:

-    auto const peerCPNum = peerParConfig.mContextParallelism;
-    auto const selfCPNum = selfParConfig.mContextParallelism;
@@
-    auto const selfTPRank = selfRank % selfTPNum;
-    auto const selfPPRank = selfRank / (selfTPNum * selfCPNum);
-    auto const selfCPRank = (selfRank % (selfTPNum * selfCPNum)) / selfTPNum;
+    auto const selfTPRank = selfRank % selfTPNum;
+    auto const selfPPRank = selfRank / (selfTPNum /* * selfCPNum intentionally omitted if CP is unused here */);

Please confirm whether CP is intended to affect the returned target ranks in this function in this PR. If yes, see my other comment about semantics mismatch with tests and downstream invariants. If not, Option A will keep the build clean.


70-74: Also validate CP sizes (power-of-two) alongside PP

We check PP sizes are powers-of-two but not CP. For consistency and earlier failure, validate CP as well.

     for (auto val : {peerPPNum, selfPPNum})
     {
         TLLM_CHECK(isPowerOfTwo(val));
     }
+    for (auto val : {peerParConfig.mContextParallelism, selfParConfig.mContextParallelism})
+    {
+        TLLM_CHECK(isPowerOfTwo(val));
+    }

90-93: Guard DP grouping with divisibility checks

When DP is enabled we divide TP by DP size. Add explicit divisibility checks to fail fast on misconfiguration instead of truncating.

-    int const selfTPSizePerDPGroup = selfParConfig.mEnableAttentionDP ? selfTPNum / selfParConfig.mDPsize : selfTPNum;
-    int const peerTPSizePerDPGroup = peerParConfig.mEnableAttentionDP ? peerTPNum / peerParConfig.mDPsize : peerTPNum;
+    if (selfParConfig.mEnableAttentionDP)
+    {
+        TLLM_CHECK(selfParConfig.mDPsize > 0 && selfTPNum % selfParConfig.mDPsize == 0);
+    }
+    if (peerParConfig.mEnableAttentionDP)
+    {
+        TLLM_CHECK(peerParConfig.mDPsize > 0 && peerTPNum % peerParConfig.mDPsize == 0);
+    }
+    int const selfTPSizePerDPGroup = selfParConfig.mEnableAttentionDP ? selfTPNum / selfParConfig.mDPsize : selfTPNum;
+    int const peerTPSizePerDPGroup = peerParConfig.mEnableAttentionDP ? peerTPNum / peerParConfig.mDPsize : peerTPNum;
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

1441-1461: Nice focused harness; minor suggestion: name captures to document expectations

The verifyContext lambda is a good consolidation. Consider renaming parameters to reflect semantics more clearly (expectedRanksPPxTP, expectedNeedSend) to avoid implying CP is enumerated here, unless you proceed with the broader CP contract change.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 31628d0 and f339a87.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}: Closing braces of namespaces should have a trailing comment naming the namespace (e.g., } // namespace foo)
Prefer const or constexpr variables over #define for constants
Variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside initialization; use named constexpr constants instead
Use Allman brace style
Put semicolon for empty for/while loop on a new line
Bodies of switch/while/do/for must be compound statements with braces
if and else must always use brace-delimited statements, even for single or empty statements
C++ filenames should be camelCase with first letter lowercase (e.g., thisIsAFilename.cpp) and case-insensitively unique within a target
Type names (classes, structs, etc.) use CamelCase starting with uppercase (e.g., FooBarClass)
Local variables, methods, and namespaces use camelCase starting lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not in anonymous namespace: prefix g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in anonymous namespace: prefix s (e.g., sMutableStaticGlobal)
Locally visible static variable names start with s (e.g., static std::once_flag sFlag;)
Class member variables use mPrefix camelCase (e.g., mNbFooValues); public members may omit but prefix encouraged for clarity
Constants (enums, global constants, static constants, function-scope magic numbers) use uppercase snake_case with k prefix (e.g., kDIGIT_NUM)
Function-scope constants that are not magic numbers/literals are named like non-constants (e.g., bool const pass = ...)
Avoid Hungarian notation except apps-Hungarian like nb for counts (e.g., mNbLayers)
If constructor parameter name conflicts with public member, add trailing underscore to parameter (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L, not 1234l)
Use spaces only; no tabs; indent by 4 spaces
Run clang-format (LLVM style) before submit...

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
**/*.{cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

CUDA code must be compiled with CUDA compiler; device/global/managed/constant declarations and kernels must be in CUDA sources

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA Apache-2.0 copyright header with current year to all source files

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-138)
  • TargetRanksInfoForDP (50-51)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
  • needSendCache (75-95)
  • needSendCache (75-76)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
  • needSendCache (56-85)
  • needSendCache (56-57)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (4)

320-323: Bug: inputAllRankNum ignores CP causing OOB writes when CP > 1

inputAllRankNum must reflect the peer/input world size. With CP introduced, retRanks now ranges over PPTPCP, but the backing array is sized with PPTP only. Indexing blockInfos[oi * inputAllRankNum + iRank] will overflow for iRank ≥ PPTP.

Apply this diff:

-    int const inputAllRankNum = iParallelConfig.mPipelineParallelism * iParallelConfig.mTensorParallelism;
+    int const inputAllRankNum = iParallelConfig.mPipelineParallelism
+        * iParallelConfig.mTensorParallelism
+        * iParallelConfig.mContextParallelism;

328-338: Bug: ppRank computation ignores CP; incorrect for global rank with CP > 1

For a global rank laid out as (pp, cp, tp), ppRank must be computed dividing by (TP*CP). Current code divides by TP only.

Apply this diff inside fillBlockInfo:

-        int const tpRank = rank % parallelConfig.mTensorParallelism;
-        int const ppRank = rank / parallelConfig.mTensorParallelism;
+        int const tpRank = rank % parallelConfig.mTensorParallelism;
+        int const ppRank = rank / (parallelConfig.mTensorParallelism * parallelConfig.mContextParallelism);

Optionally capture cpRank if/when needed:

int const cpRank = (rank / parallelConfig.mTensorParallelism) % parallelConfig.mContextParallelism;

929-933: Assertion must include CP domain size

retRanks now has size mDomainPPSize * mDomainTPSize * mDomainCPSize. The current check will fail (or worse, silently mis-assume counts).

Apply this diff:

-    TLLM_CHECK(targetRankInfo.mIRanks.size()
-        == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+    TLLM_CHECK(targetRankInfo.mIRanks.size() == static_cast<size_t>(
+        targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize * targetRankInfo.mDomainCPSize));

Follow-up: Re-evaluate whether outputCacheNum should still be independent of CP (likely yes), but the assertion must reflect the new rank enumeration.


1259-1263: Same: assertion must include CP domain size

Mirror the fix in concatKVCache().

Apply this diff:

-    TLLM_CHECK(targetRankInfo.mIRanks.size()
-        == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+    TLLM_CHECK(targetRankInfo.mIRanks.size() == static_cast<size_t>(
+        targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize * targetRankInfo.mDomainCPSize));
🧹 Nitpick comments (5)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (2)

35-43: Add brief Doxygen for new CP domain field and confirm public ABI intent

Good addition. Please document what units and invariants mDomainCPSize follows (e.g., factor between peer/self CP, must divide peerCPNum, power-of-two expectation), since this is now part of the public struct.

Apply this diff to add a brief Doxygen trailing member doc:

 struct TargetRanksInfo
 {
     int mDomainPPSize;
     int mDomainTPSize;
-    int mDomainCPSize;
+    int mDomainCPSize; //!< Domain size along Context Parallelism (CP). Usually peerCPNum / selfCPNum; must be >= 1.
     std::vector<int> mIRanks;
     int mDupHeadFactor;
     int mPeerDupHeadFactor;
 };

24-24: Missing include guard as per repo guidelines

Headers are required to use include guards with symbol TRTLLM_<FILE_NAME>_H. This file only has #pragma once. Consider adding guards (keep #pragma once if you like).

A guard matching the guideline would look like:

#ifndef TRTLLM_CACHE_SPLIT_CONCAT_H
#define TRTLLM_CACHE_SPLIT_CONCAT_H
// ... existing content ...
#endif // TRTLLM_CACHE_SPLIT_CONCAT_H
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)

127-129: Replace std::cerr with repo logging

Direct stderr logging is noisy in library code and bypasses repo logging controls. Use TLLM_LOG_DEBUG instead.

Apply this diff:

-    std::cerr << "peerCPRankStart: " << peerCPRankStart << ", peerCPRankEnd: " << peerCPRankEnd << ", selfCPRank: "
-              << selfCPRank << ", mDomainCPSize: " << mDomainCPSize << std::endl;
+    TLLM_LOG_DEBUG("CP domain — peerCPRankStart: %d, peerCPRankEnd: %d, selfCPRank: %d, mDomainCPSize: %d",
+        peerCPRankStart, peerCPRankEnd, selfCPRank, mDomainCPSize);

963-971: Optional: clarify assumptions about CP and number of caches

The computation of outputCacheNum ignores CP, which may be correct if per-CP shards do not change cache multiplicity for split/concat. Please confirm this matches the intended data transfer topology; otherwise we may need to scale by mDomainCPSize or factor it into head duplication rules.

If CP should influence cache multiplicity, I can propose a concrete adjustment once we confirm the intended model.


146-155: Minor naming/style: avoid m-prefixed locals

Locals named like members (mDomainPPSize, mDomainTPSize, mDupHeadFactor, etc.) are confusing per guideline (m- prefix for members only). Not a blocker; consider renaming in a follow-up.

Example:

-    int mDomainPPSize = 1;
+    int domainPPSize = 1;

…and propagate accordingly.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f339a87 and c709489.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}: Closing braces of namespaces should have a trailing comment naming the namespace (e.g., } // namespace foo)
Prefer const or constexpr variables over #define for constants
Variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside initialization; use named constexpr constants instead
Use Allman brace style
Put semicolon for empty for/while loop on a new line
Bodies of switch/while/do/for must be compound statements with braces
if and else must always use brace-delimited statements, even for single or empty statements
C++ filenames should be camelCase with first letter lowercase (e.g., thisIsAFilename.cpp) and case-insensitively unique within a target
Type names (classes, structs, etc.) use CamelCase starting with uppercase (e.g., FooBarClass)
Local variables, methods, and namespaces use camelCase starting lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not in anonymous namespace: prefix g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in anonymous namespace: prefix s (e.g., sMutableStaticGlobal)
Locally visible static variable names start with s (e.g., static std::once_flag sFlag;)
Class member variables use mPrefix camelCase (e.g., mNbFooValues); public members may omit but prefix encouraged for clarity
Constants (enums, global constants, static constants, function-scope magic numbers) use uppercase snake_case with k prefix (e.g., kDIGIT_NUM)
Function-scope constants that are not magic numbers/literals are named like non-constants (e.g., bool const pass = ...)
Avoid Hungarian notation except apps-Hungarian like nb for counts (e.g., mNbLayers)
If constructor parameter name conflicts with public member, add trailing underscore to parameter (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L, not 1234l)
Use spaces only; no tabs; indent by 4 spaces
Run clang-format (LLVM style) before submit...

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Header files must use include guards with symbol TRTLLM_<FILE_NAME>_H (no leading/trailing underscores; directory names excluded)

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA Apache-2.0 copyright header with current year to all source files

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

CUDA code must be compiled with CUDA compiler; device/global/managed/constant declarations and kernels must be in CUDA sources

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🔇 Additional comments (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)

133-141: Rank enumeration includes CP — LGTM

The nesting order and irank computation are consistent with the decomposition used above.


156-156: Return value now includes CP domain size — LGTM

Initializer order matches TargetRanksInfo fields after adding mDomainCPSize.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

1448-1452: Fix invalid CTAD: std::vector(numLayers, numHeads) doesn’t deduce element type (won’t compile)

The “count, value” constructor has no CTAD guide. Specify the element type explicitly.

Apply one of these diffs (first mirrors the earlier bot suggestion; second aligns with SizeType32):

-        auto const sharedModelConfig
-            = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
+        auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+            std::vector<int>(numLayers, numHeads), sizePerHead, tokensPerBlock};

or

-        auto const sharedModelConfig
-            = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
+        auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+            std::vector<SizeType32>(numLayers, numHeads), sizePerHead, tokensPerBlock};
🧹 Nitpick comments (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

1560-1565: Fill in the commented test or add a negative-case for unsupported CP shrink

You’ve stubbed “TP shrinks while CP grows” but left it commented. Either implement it (good to lock ordering across mixed TP/CP transitions) or add a death/throw test for the explicitly unsupported case selfCPNum > peerCPNum (CP shrink), per TargetRanksInfoForDP’s TLLM_THROW.

Example (CP shrink should throw):

-    // // TP shrinks while CP grows from context to generation.
-    // {
-    //     tr::WorldConfig const contextWC{/*tpSize*/ 4, /*ppSize*/ 1, /*cpSize*/ 1};
-    //     tr::WorldConfig const genWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 2};
-    // }
+    // CP shrinks from context to generation is not supported; verify it throws.
+    {
+        tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 2};
+        tr::WorldConfig const genWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 1};
+        auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+            std::vector<int>(numLayers, numHeads), sizePerHead, tokensPerBlock};
+        auto const contextCache
+            = texec::kv_cache::CacheState(sharedModelConfig, contextWC, dataType, attentionType, kvFactor);
+        auto const genCache
+            = texec::kv_cache::CacheState(sharedModelConfig, genWC, dataType, attentionType, kvFactor);
+        EXPECT_THROW(
+            { (void) tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, /*contextRank*/ 0); },
+            std::runtime_error);
+    }

If TLLM_THROW maps to a different exception type in your build, replace std::runtime_error accordingly. Want me to push a version that also covers the mixed “TP shrinks while CP grows” positive case?

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c709489 and bd978af.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧰 Additional context used
🧬 Code graph analysis (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-161)
  • TargetRanksInfoForDP (50-51)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
  • needSendCache (56-85)
  • needSendCache (56-57)
🔇 Additional comments (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

1524-1582: No remaining PP×TP-only assumptions—CP is correctly included

I ran a project-wide audit for any size checks or count initializations that multiply only mDomainPPSize × mDomainTPSize (with and without whitespace, and reversed operands) and found no instances. All occurrences of mDomainCPSize are already declared and used alongside mDomainPPSize and mDomainTPSize (in the header and in the existing tests).

It looks like the downstream invariants and callers have already been updated to include the CP dimension. There are no outstanding PP×TP-only assertions or loops to refactor. We can consider this covered.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)

322-324: Bug: CP-aware ranks can index past blockInfos buffer; fix inputAllRankNum and PP-rank decomposition

inputRanks now comes from targetIRanks with CP included, so iRank ∈ [0, PPTPCP). However, inputAllRankNum is still PP*TP, and fillBlockInfo computes ppRank = rank / TP, both ignoring CP. This can cause out-of-bounds writes into blockInfos and miscomputed ppRank when CP > 1.

Apply this patch:

@@
-    int const inputAllRankNum = iParallelConfig.mPipelineParallelism * iParallelConfig.mTensorParallelism;
+    // Include CP to cover the full world of ranks
+    int const inputAllRankNum = iParallelConfig.mPipelineParallelism
+        * iParallelConfig.mTensorParallelism
+        * iParallelConfig.mContextParallelism;
@@
-        int const tpRank = rank % parallelConfig.mTensorParallelism;
-        int const ppRank = rank / parallelConfig.mTensorParallelism;
+        int const tpRank = rank % parallelConfig.mTensorParallelism;
+        // Decompose rank with CP in the denominator: rank = (pp * TP * CP) + (cp * TP) + tp
+        int const ppRank = rank / (parallelConfig.mTensorParallelism * parallelConfig.mContextParallelism);
         int const ppNum = parallelConfig.mPipelineParallelism;
@@
-            int iRank = inputRanks[i];
-            blockInfos[oi * inputAllRankNum + iRank]
+            int iRank = inputRanks[i];
+            TLLM_CHECK_WITH_INFO(0 <= iRank && iRank < inputAllRankNum,
+                "iRank (%d) out of range for PP*TP*CP world (%d).", iRank, inputAllRankNum);
+            blockInfos[oi * inputAllRankNum + iRank]
                 = fillBlockInfo(iCacheState, inputBlocks[oi * iRankNum + i], iRank);

This aligns the buffer sizing and rank decomposition with the new CP-aware rank space and prevents OOB writes when CP > 1.

Also applies to: 330-341, 349-355


931-934: Fix assertion: mIRanks now includes CP; update expected size to PPTPCP

The check still uses PP*TP and will fail when CP > 1. mIRanks is generated across PP×CP×TP.

-    TLLM_CHECK(targetRankInfo.mIRanks.size()
-        == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+    TLLM_CHECK(targetRankInfo.mIRanks.size()
+        == static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize * targetRankInfo.mDomainCPSize));

1261-1263: Same fix as above: include CP in mIRanks size check

Mirror the PPTPCP expectation here too to avoid false negatives under CP.

-    TLLM_CHECK(targetRankInfo.mIRanks.size()
-        == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+    TLLM_CHECK(targetRankInfo.mIRanks.size()
+        == static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize * targetRankInfo.mDomainCPSize));
♻️ Duplicate comments (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)

60-66: Guard CP invariants before using them in division; avoid potential div-by-zero and mis-decomposition

You compute selfPPRank using (selfTPNum * selfCPNum) before validating CP values. Move/add the CP checks before Lines 63–65 so we never divide by zero or decompose with invalid CP.

Apply this minimal reordering/safety patch:

@@
-    auto const peerCPNum = peerParConfig.mContextParallelism;
-    auto const selfCPNum = selfParConfig.mContextParallelism;
+    auto const peerCPNum = peerParConfig.mContextParallelism;
+    auto const selfCPNum = selfParConfig.mContextParallelism;
+
+    // Validate CP sizes before using them in denominators/decomposition
+    for (auto val : {peerCPNum, selfCPNum})
+    {
+        TLLM_CHECK(isPowerOfTwo(val));
+    }
@@
-    auto const selfTPRank = selfRank % selfTPNum;
-    auto const selfPPRank = selfRank / (selfTPNum * selfCPNum);
-    auto const selfCPRank = (selfRank % (selfTPNum * selfCPNum)) / selfTPNum;
+    auto const selfTPRank = selfRank % selfTPNum;
+    auto const selfPPRank = selfRank / (selfTPNum * selfCPNum);
+    auto const selfCPRank = (selfRank % (selfTPNum * selfCPNum)) / selfTPNum;
+
+    // Optional: ensure selfRank falls into world bounds (PP*TP*CP). Helps catch mis-sized worlds early.
+    TLLM_CHECK_WITH_INFO(selfPPRank < selfPPNum, "selfRank out of range for PP*TP*CP world.");

And remove the duplicated CP power-of-two loop at Lines 117–120 (see follow-up patch below).


114-131: CP domain mapping: enforce exact divisibility and clarify unsupported downscale path

When selfCPNum <= peerCPNum you do integer division without verifying exact divisibility; when selfCPNum > peerCPNum you throw. Add a modulus check to prevent silent truncation; keep the explicit throw (or implement symmetric mapping if desired).

@@
-    for (auto val : {peerCPNum, selfCPNum})
-    {
-        TLLM_CHECK(isPowerOfTwo(val));
-    }
+    // (Moved CP validation earlier; see Lines 60–66 patch)
@@
-    if (selfCPNum <= peerCPNum)
+    if (selfCPNum <= peerCPNum)
     {
-        mDomainCPSize = peerCPNum / selfCPNum;
+        TLLM_CHECK_WITH_INFO(peerCPNum % selfCPNum == 0,
+            "peerCPNum (%d) must be divisible by selfCPNum (%d).", peerCPNum, selfCPNum);
+        mDomainCPSize = peerCPNum / selfCPNum;
         peerCPRankStart = selfCPRank * mDomainCPSize;
         peerCPRankEnd = (selfCPRank + 1) * mDomainCPSize;
     }
     else
     {
-        TLLM_THROW("selfCPNum > peerCPNum is not supported.");
+        TLLM_THROW("CP downscale (selfCPNum > peerCPNum) is not supported.");
     }

If CP downscale should mirror PP/TP behavior, I can provide a symmetric else-branch.

🧹 Nitpick comments (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)

49-50: Update dimension comment to reflect CP is now part of the mapping

The comment still says [PP, TP], but CP is now included in the rank space.

-// [PP, TP]
+// [PP, CP, TP]

1008-1016: Verify DomainCP impact on split/concat shapes or document why CP does not change kernels

The kernels and host launch math only use DomainPPSize and DomainTPSize (no DomainCPSize). If CP does not alter per-rank head/layer tiling (tokens-only split), please add a brief comment here clarifying why DomainCPSize does not factor into grid/block or per-block shapes. Otherwise, wire DomainCPSize into the relevant calculations.

If helpful, I can draft comments or a small guard like:

  • Assert that tokens-per-block already reflect CP partitioning, or
  • Assert that CP is handled exclusively by rank routing (mIRanks) and not by kernel shapes.

Also applies to: 1043-1049, 1356-1362, 1364-1369


371-381: Minor: use const for oPPNum/iPPNum; consider precomputing gridDimx for clarity

Non-blocking; current code is fine.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6d18b93 and c75b046.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{c,cc,cpp,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu}: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🔇 Additional comments (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)

135-141: Rank composition with CP looks correct

irank = (k * peerTPNum * peerCPNum) + (j * peerTPNum) + i matches the decomposition rank = (pp * TP * CP) + (cp * TP) + tp. Loop order [TP -> CP -> PP] is consistent with that mapping.


311-317: Consistency check on inputRanks routing

Good to assert inputRanks equals targetIRanks(...). With CP included, this is even more important to catch mismatches earlier.


158-159: Aggregate-init check complete: no mismatched sites

  • Used ripgrep across the cpp/ tree to find any brace‐initializer, assignment‐style or functional‐cast constructions of TargetRanksInfo.
  • Only occurrence is the return in cacheSplitConcat.cu (lines 158–159), which now correctly includes mDomainCPSize.
  • No other aggregate‐init sites or direct constructions of TargetRanksInfo were found.

All initializer sites are updated; no further action required.

@brb-nv brb-nv force-pushed the user/brb/target-ranks-with-cp branch from 0f33380 to e04c5f5 Compare August 26, 2025 01:10
@brb-nv brb-nv changed the title Update TargetInfo to accommodate CP in disagg [None][feat] Update TargetInfo to accommodate CP in disagg Aug 26, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

1448-1453: Fix invalid CTAD: std::vector(numLayers, numHeads) won't compile

The test contains invalid CTAD syntax. std::vector(numLayers, numHeads) lacks an explicit element type and won't compile on conforming compilers. Specify the element type to use the "count, value" constructor properly.

Apply this diff to fix the compilation error:

-        auto const sharedModelConfig
-            = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
+        auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+            std::vector<int>(numLayers, numHeads), sizePerHead, tokensPerBlock};
🧹 Nitpick comments (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

1679-1699: Complex multi-dimensional scaling test - consider adding assertions for intermediate computations

While the test correctly verifies the final target ranks when TP and CP grow while PP shrinks, the complexity of this scenario might benefit from additional intermediate assertions to aid debugging.

Consider adding debug assertions or comments explaining the rank mapping formula for this complex case, especially how the PP shrink affects the rank distribution:

     // TP, CP grow while PP shrinks from context to generation.
     {
         tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 2, /*cpSize*/ 1};
         tr::WorldConfig const genWC{/*tpSize*/ 4, /*ppSize*/ 1, /*cpSize*/ 2};
+        // Context rank mapping: rank = ppRank * (tpSize * cpSize) + cpRank * tpSize + tpRank
+        // Generation expects: TP 2→4 (2x growth), PP 2→1 (collapse), CP 1→2 (2x growth)
         verifyContext(
             /*contextRank*/ 0, /*contextWC*/ contextWC, /*genWC*/ genWC, /*expectRanks*/ {0, 4, 1, 5},
             /*expectPPDomain*/ 1,
             /*expectTPDomain*/ 2, /*expectCPDomain*/ 2, /*expectNeedSend*/ true);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c75b046 and e04c5f5.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{c,cc,cpp,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu}: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)

Files:

  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...

Files:

  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-159)
  • TargetRanksInfoForDP (50-51)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
  • needSendCache (56-85)
  • needSendCache (56-57)
🔇 Additional comments (5)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (5)

1441-1444: Update verifyContext signature - excellent CP domain addition!

The lambda now properly accepts expectCPDomain as a parameter and validates it against mDomainCPSize, which correctly extends the test coverage to CP domain configurations. This aligns well with the PR's objective of accommodating CP in disaggregated scenarios.


1539-1555: Well-structured CP growth test block

Excellent test coverage for CP growth from context to generation! The test correctly verifies that when CP increases from 1 to 2, the target ranks expand to include the additional CP dimension (e.g., {0, 2} for rank 0), and all ranks correctly report expectNeedSend as true.


1557-1577: Comprehensive test for combined TP and CP growth

Good coverage of the complex case where both TP and CP grow simultaneously. The expected ranks correctly reflect the cross-product of TP×CP domain expansion (e.g., {0, 4, 1, 5} represents TP growth from 2→4 combined with CP growth from 1→2).


1579-1595: TP shrink with CP growth test correctly implemented

The test properly handles the case where TP shrinks (4→2) while CP grows (1→2), correctly identifying which ranks need to send cache based on the shrinking TP dimension.


1449-1453: Config-driven CacheState construction aligns with PR objectives

The migration to using CacheState::ModelConfig and WorldConfig objects for constructing both context and generation caches is a clean design that supports the dynamic TP/PP/CP transitions mentioned in the PR summary. This config-driven approach provides better encapsulation and makes the tests more maintainable.

@brb-nv brb-nv force-pushed the user/brb/target-ranks-with-cp branch from e04c5f5 to 0eb2596 Compare August 26, 2025 01:44
@brb-nv brb-nv requested a review from MatthiasKohl August 26, 2025 01:46
@brb-nv
Copy link
Collaborator Author

brb-nv commented Aug 26, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16495 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)

1524-1537: Avoid commented-out test code; track the unresolved case explicitly

This block is commented out with a TODO, which goes against the repo guideline to avoid dead/commented code. Prefer one of:

  • Extract this scenario into a separate disabled test (use GTest’s DISABLED_ prefix) with a reference to a tracking issue.
  • Or remove the code and leave a concise TODO with an issue link.

Proposed minimal change (remove the commented block; leave a TODO):

-    // // PP grows while TP shrinks from context to generation.
-    // {
-    //     tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 1};
-    //     tr::WorldConfig const genWC{/*tpSize*/ 1, /*ppSize*/ 2, /*cpSize*/ 1};
-    //     verifyContext(
-    //         /*contextRank*/ 0, /*contextWC*/ contextWC, /*genWC*/ genWC, /*expectRanks*/ {0, 1}, /*expectPPDomain*/
-    //         2,
-    //         /*expectTPDomain*/ 1, /*expectCPDomain*/ 1, /*expectNeedSend*/ true);
-    //     // TODO: Figure why needSendCache is false here.
-    //     verifyContext(
-    //         /*contextRank*/ 1, /*contextWC*/ contextWC, /*genWC*/ genWC, /*expectRanks*/ {0, 1}, /*expectPPDomain*/
-    //         2,
-    //         /*expectTPDomain*/ 1, /*expectCPDomain*/ 1, /*expectNeedSend*/ false);
-    // }
+    // TODO: Add dedicated test for “PP grows while TP shrinks” once needSendCache semantics are finalized. Track in issue <ID>.

If you want, I can draft a DISABLED_ test that reuses the same verifyContext helper.


1701-1715: Second commented-out scenario: convert to a disabled test or remove

Same concern as Lines 1524–1537: please avoid commented-out code; either move into a DISABLED_ test and reference a tracking issue or remove and leave a TODO.

-    // // PP, CP grow while TP shrinks from context to generation.
-    // {
-    //     tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 1};
-    //     tr::WorldConfig const genWC{/*tpSize*/ 1, /*ppSize*/ 2, /*cpSize*/ 4};
-    //     verifyContext(
-    //         /*contextRank*/ 0, /*contextWC*/ contextWC, /*genWC*/ genWC, /*expectRanks*/ {0, 4, 1, 5, 2, 6, 3, 7},
-    //         /*expectPPDomain*/ 2,
-    //         /*expectTPDomain*/ 1, /*expectCPDomain*/ 4, /*expectNeedSend*/ true);
-    //     // TODO: Figure why needSendCache is false here.
-    //     verifyContext(
-    //         /*contextRank*/ 1, /*contextWC*/ contextWC, /*genWC*/ genWC, /*expectRanks*/ {0, 4, 1, 5, 2, 6, 3, 7},
-    //         /*expectPPDomain*/ 2,
-    //         /*expectTPDomain*/ 1, /*expectCPDomain*/ 4, /*expectNeedSend*/ false);
-    // }
+    // TODO: Add dedicated test for “PP, CP grow while TP shrinks” once needSendCache semantics are finalized. Track in issue <ID>.

1753-1759: Typo: rename contextTragetInfo → contextTargetInfo

Minor readability/consistency fix.

-        auto const contextTragetInfo
+        auto const contextTargetInfo
             = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, contextRank);
 
-        EXPECT_EQ(expectRanks, contextTragetInfo.mIRanks);
-        EXPECT_EQ(expectPPDomain, contextTragetInfo.mDomainPPSize);
-        EXPECT_EQ(expectTPDomain, contextTragetInfo.mDomainTPSize);
+        EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
+        EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize);
+        EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);

1844-1866: Typo: rename verfiyGeneration → verifyGeneration

Minor naming correction to match spelling of “verify”.

-    auto const verfiyGeneration = [&](int contextRank, int generationRank, std::vector<int> const& expectRanks,
+    auto const verifyGeneration = [&](int contextRank, int generationRank, std::vector<int> const& expectRanks,
                                       int expectPPDomain, int expectTPDomain)
@@
-    verfiyGeneration(
+    verifyGeneration(
         /*contextRank*/ 0, /*generationRank*/ 0, /*expectRanks*/ {0}, /*expectPPDomain*/ 1, /*expectTPDomain*/ 1);
-    verfiyGeneration(
+    verifyGeneration(
         /*contextRank*/ 0, /*generationRank*/ 1, /*expectRanks*/ {0}, /*expectPPDomain*/ 1, /*expectTPDomain*/ 1);
@@
-    verfiyGeneration(
+    verifyGeneration(
         /*contextRank*/ 0, /*generationRank*/ 0, /*expectRanks*/ {0}, /*expectPPDomain*/ 1, /*expectTPDomain*/ 1);
-    verfiyGeneration(
+    verifyGeneration(
         /*contextRank*/ 0, /*generationRank*/ 1, /*expectRanks*/ {0}, /*expectPPDomain*/ 1, /*expectTPDomain*/ 1);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e04c5f5 and 0eb2596.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{c,cc,cpp,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu}: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-159)
  • TargetRanksInfoForDP (50-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (10)

1464-1492: Scenario “TP shrinks”: expectations and needSend toggling look correct

Mapping and domain sizes match TargetRanksInfoForDP’s TP-downscale math; alternating needSend is consistent with duplicate coverage within the same peer rank. No issues.


1494-1510: Scenario “TP grows”: correct rank ordering and domain sizing

Expected ranks reflect i-major ordering (TP outermost, then CP, then PP) in TargetRanksInfoForDP. Looks good.


1512-1522: Scenario “TP and PP grow”: correct 2×2 expansion, ordering {k over PP inside i over TP}

The expected order {0,4,1,5} matches the nested loops (i TP outer, k PP inner).


1539-1555: Scenario “CP grows”: ranks and DomainCPSize assertions align with CP fan-out

The expected ranks {0,2}, {1,3}, etc., correctly reflect irank = kTPCP + j*TP + i with CP=2.


1557-1577: Scenario “TP and CP grow”: correct cross-product expansion and ordering

The sequence {0,4,1,5} matches the nested loop order (i over TP outer, j over CP inner).


1579-1595: Scenario “TP shrinks while CP grows”: assertions consistent with mixed shrink/grow

Alternating needSend for pairs mapping to the same peer TP rank looks intentional and consistent.


1597-1617: Scenario “PP and CP grow”: two-level expansion validated

Expected ranks {0,4,2,6}, etc., reflect k over PP as the innermost loop and j over CP in the middle.


1619-1647: Scenario “PP shrinks while CP grows”: correct downscale of PP with CP fan-out

Rank expectations and domain sizes look correct for selfPP > peerPP and selfCP < peerCP.


1649-1677: Scenario “TP and PP shrink while CP grows”: coverage and assertions look good

Correct mapping and needSend alternation across the pairs that collapse onto the same peer TP rank.


1679-1699: Scenario “TP, CP grow while PP shrinks”: correct mixed transitions

Ranks and domain assertions are aligned with the TargetRanksInfoForDP nested-loop construction.

@brb-nv brb-nv force-pushed the user/brb/target-ranks-with-cp branch from 0eb2596 to 6c86b28 Compare August 26, 2025 02:02
@brb-nv
Copy link
Collaborator Author

brb-nv commented Aug 26, 2025

/bot run

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (4)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)

1457-1461: Add a sanity check that expectRanks size matches PP×TP×CP domain product

This catches mismatches early when editing expectations or changing rank ordering.

Apply:

         auto const contextTargetInfo
             = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, contextRank);

         EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
+        EXPECT_EQ(expectRanks.size(),
+            static_cast<size_t>(expectPPDomain) * static_cast<size_t>(expectTPDomain) * static_cast<size_t>(expectCPDomain));
         EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize);
         EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);
         EXPECT_EQ(expectCPDomain, contextTargetInfo.mDomainCPSize);

1524-1537: Avoid commented-out test code; either disable as a named test or track with an issue

Leaving large commented blocks hinders maintainability. Prefer a disabled test (GTest naming DISABLED_...) with a clear TODO/issue link, or remove and file an issue. This also aligns with the guideline to avoid dead code.

If you want, I can draft a small TEST(targetTest, DISABLED_PPGrowsTPShrinks_needSendMismatch) encapsulating this case.


1701-1714: Same here: convert to a disabled test instead of commenting out

Carry the rationale (“needSendCache is false here”) into the disabled test name or add a TODO with an issue id.


1448-1453: Fix CTAD type deduction: std::vector(numLayers, numHeads) deduces std::vector

CacheState::ModelConfig::mNbKvHeadsPerLayer expects std::vector<SizeType32>. Using std::vector(numLayers, numHeads) deduces std::vector<int> and risks narrowing or compilation errors on platforms where SizeType32 != int. Also, cast the scalars to the expected type for consistency.

Apply:

-        auto const sharedModelConfig
-            = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
+        auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+            std::vector<SizeType32>(numLayers, static_cast<SizeType32>(numHeads)),
+            static_cast<SizeType32>(sizePerHead),
+            static_cast<SizeType32>(tokensPerBlock)};
🧹 Nitpick comments (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)

135-141: Rank enumeration order updated (TP outer × CP middle × PP inner) — verify consumers expect this order

The nested loop order yields sequences like {0, 4, 1, 5} for TP growth with PP=2, which matches the updated tests. Ensure any code that previously assumed PP-major iteration is updated accordingly.


50-52: Update comment to reflect the CP dimension

The function-level comment still says “[PP, TP]”. With CP added, please update it to avoid confusion, e.g., “[PP, TP, CP]” and document the irank formula.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0eb2596 and 6c86b28.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{c,cc,cpp,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu}: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-159)
  • TargetRanksInfoForDP (50-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

1441-1444: Good refactor: verifyContext signature now takes WorldConfig and packs asserts together

This improves readability and mirrors the production construction path. LGTM.

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3)

63-66: CP rank decomposition looks consistent with the irank layout

Using:

  • selfTPRank = selfRank % selfTPNum
  • selfPPRank = selfRank / (selfTPNum × selfCPNum)
  • selfCPRank = (selfRank % (selfTPNum × selfCPNum)) / selfTPNum

matches irank = (pp × TP × CP) + (cp × TP) + tp. LGTM.

Please confirm no callers still assume the old 2D [PP, TP] layout when iterating target ranks.


158-158: Audit CP dimension in split/concat helpers

Please verify that downstream helpers and kernels in cacheSplitConcat.cu account for the newly added CP domain:

  • In splitKVCache and concatKVCache (or similarly named wrappers), update any size‐check on mIRanks.size() so it asserts against
    mDomainPPSize * mDomainTPSize * mDomainCPSize
    (instead of just PP * TP).
  • In the CUDA kernels (loop bounds, grid/block calculations, indexing math), ensure the third dimension (CP) is folded into any two-dimensional assumptions.
  • Add or extend unit tests to cover non-trivial CP splits—i.e., cases where mDomainCPSize > 1.

I didn’t find any existing mIRanks.size() == mDomainPPSize * mDomainTPSize checks in the repo, so please audit those functions and kernels and add the CP factor where appropriate. Let me know if you’d like assistance updating any of these sections!


114-131: Symmetric CP downscale mapping missing (and divisibility already enforced by power-of-two checks)

The CP branch currently only handles upscaling and throws on downscale, unlike the TP/PP section which implements symmetric mapping in both directions. Also, since both peerCPNum and selfCPNum are asserted to be powers of two, peerCPNum/selfCPNum is always an integer—no additional divisibility check is needed.

Please choose one of the following:

  • Implement symmetric downscale (mirroring the TP/PP logic):
         if (selfCPNum <= peerCPNum)
         {
             // up-scale (ratio of two powers-of-two is always integral)
             mDomainCPSize     = peerCPNum / selfCPNum;
             peerCPRankStart   = selfCPRank * mDomainCPSize;
             peerCPRankEnd     = (selfCPRank + 1) * mDomainCPSize;
         }
         else
         {
  •      TLLM_THROW("selfCPNum > peerCPNum is not supported.");
    
  •      // down-scale: each self rank maps to one peer rank
    
  •      // selfCPNum and peerCPNum are powers of two, so ratio is integral
    
  •      mDomainCPSize     = 1;
    
  •      peerCPRankStart   = selfCPRank / (selfCPNum / peerCPNum);
    
  •      peerCPRankEnd     = peerCPRankStart + mDomainCPSize;
     }
    
  • Keep downscale unsupported: replace the else with a clear exception and a TODO, e.g.:
         else
         {
  •      TLLM_THROW("selfCPNum > peerCPNum is not supported.");
    
  •      TLLM_THROW("Downscale (selfCPNum > peerCPNum) is not supported yet; see ISSUE-1234 TODO");
     }
    
    
    

Either way, document the behavior and add a test to cover the downscale case.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16498 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16495 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16498 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12392 completed with status: 'FAILURE'

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (4)

323-326: Include CP in inputAllRankNum (correct indexing when CP>1 on input side)

If input ranks include CP (decode path), current sizing under-allocates and risks OOB indexing.

-    int const inputAllRankNum = iParallelConfig.mPipelineParallelism * iParallelConfig.mTensorParallelism;
+    int const inputAllRankNum = iParallelConfig.mPipelineParallelism
+        * iParallelConfig.mTensorParallelism * iParallelConfig.mContextParallelism;

331-346: Derive ppRank with TP×CP stride

Rank decomposition here ignores CP; when CP>1, PP rank is miscomputed.

-        int const tpRank = rank % parallelConfig.mTensorParallelism;
-        int const ppRank = rank / parallelConfig.mTensorParallelism;
+        int const tpRank = rank % parallelConfig.mTensorParallelism;
+        int const ppRank = rank / (parallelConfig.mTensorParallelism * parallelConfig.mContextParallelism);

932-934: Account for CP in rank-count sanity check

With CP added to rank generation, this invariant should include it.

-    TLLM_CHECK(targetRankInfo.mIRanks.size()
-        == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+    TLLM_CHECK(
+        targetRankInfo.mIRanks.size()
+        == (static_cast<size_t>(targetRankInfo.mDomainPPSize)
+            * static_cast<size_t>(targetRankInfo.mDomainTPSize)
+            * static_cast<size_t>(targetRankInfo.mDomainCPSize)));

1261-1263: Same: include CP in rank-count check (concat path)

Mirror the fix for the split path.

-    TLLM_CHECK(targetRankInfo.mIRanks.size()
-        == (static_cast<size_t>(targetRankInfo.mDomainPPSize * targetRankInfo.mDomainTPSize)));
+    TLLM_CHECK(
+        targetRankInfo.mIRanks.size()
+        == (static_cast<size_t>(targetRankInfo.mDomainPPSize)
+            * static_cast<size_t>(targetRankInfo.mDomainTPSize)
+            * static_cast<size_t>(targetRankInfo.mDomainCPSize)));
♻️ Duplicate comments (2)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)

114-131: Add explicit CP divisibility checks (guard silent truncation)

Even with power-of-two validation, adding explicit divisibility checks improves diagnostics and future-proofs against relaxing the PoT invariant.

Apply:

     for (auto val : {peerCPNum, selfCPNum})
     {
         TLLM_CHECK(isPowerOfTwo(val));
     }
     if (selfCPNum <= peerCPNum)
     {
-        mDomainCPSize = peerCPNum / selfCPNum;
+        TLLM_CHECK_WITH_INFO(peerCPNum % selfCPNum == 0, "peerCPNum must be divisible by selfCPNum.");
+        mDomainCPSize = peerCPNum / selfCPNum;
         peerCPRankStart = selfCPRank * mDomainCPSize;
         peerCPRankEnd = (selfCPRank + 1) * mDomainCPSize;
     }
     else
     {
-        peerCPRankStart = selfCPRank / (selfCPNum / peerCPNum);
+        TLLM_CHECK_WITH_INFO(selfCPNum % peerCPNum == 0, "selfCPNum must be divisible by peerCPNum.");
+        peerCPRankStart = selfCPRank / (selfCPNum / peerCPNum);
         peerCPRankEnd = peerCPRankStart + mDomainCPSize;
     }
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

1448-1453: Fix ModelConfig element types; avoid wrong std::vector deduction

std::vector(numLayers, numHeads) deduces std::vector<int> and may not match SizeType32. Also ensure all fields use the expected element type.

-        auto const sharedModelConfig
-            = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
+        auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+            std::vector<SizeType32>(numLayers, static_cast<SizeType32>(numHeads)),
+            static_cast<SizeType32>(sizePerHead),
+            static_cast<SizeType32>(tokensPerBlock)};
🧹 Nitpick comments (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)

133-141: Reserve rank vector capacity

Avoids reallocations in the triple loop.

-    std::vector<int> retRanks;
+    std::vector<int> retRanks;
+    retRanks.reserve(static_cast<size_t>(mDomainPPSize) * mDomainTPSize * mDomainCPSize);
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)

1457-1461: Assert rank list size matches PP×TP×CP product

Strengthens the test to catch mapping/ordering regressions.

         EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize);
         EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);
         EXPECT_EQ(expectCPDomain, contextTargetInfo.mDomainCPSize);
+        EXPECT_EQ(expectRanks.size(),
+            static_cast<size_t>(expectPPDomain) * static_cast<size_t>(expectTPDomain) * static_cast<size_t>(expectCPDomain));

1751-1757: Typo: contextTragetInfo → contextTargetInfo

Minor rename improves readability.

-        auto const contextTragetInfo
+        auto const contextTargetInfo
             = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, contextRank);
-
-        EXPECT_EQ(expectRanks, contextTragetInfo.mIRanks);
-        EXPECT_EQ(expectPPDomain, contextTragetInfo.mDomainPPSize);
-        EXPECT_EQ(expectTPDomain, contextTragetInfo.mDomainTPSize);
+        EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
+        EXPECT_EQ(expectPPDomain, contextTargetInfo.mDomainPPSize);
+        EXPECT_EQ(expectTPDomain, contextTargetInfo.mDomainTPSize);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6c86b28 and 39a8f4c.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{c,cc,cpp,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu}: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-160)
  • TargetRanksInfoForDP (50-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)

60-66: CP rank decomposition looks correct

selfTPRank, selfPPRank = selfRank / (selfTPNum * selfCPNum), and selfCPRank are consistent with the new TP×CP×PP layout.


159-160: Manual Verification Recommended — Confirm all TargetRanksInfo usages updated to include mDomainCPSize
Automated searches did not find any direct TargetRanksInfo{…} initializer-list constructions or structured bindings in the codebase; all existing references appear to use member‐access (e.g., mDomainCPSize) or are returned by helper functions. However, absence of evidence is not proof of absence.

• Search for any remaining brace‐initializer calls, e.g.:

TargetRanksInfo{…}

and ensure they pass six arguments in the new order.
• Audit for structured bindings or std::tie patterns, for example:

auto [pp, tp, cp, ranks, dupHead, peerDup] = targetIRanks(…);

and update them to include the cp element.
• Verify any other call sites (outside tests) of targetIRanks or TargetRanksInfoForDP are handled correctly.

cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

1680-1712: Optional CP Downscale Test Case

Please add a scenario where the context’s CP count is greater than the generation’s CP count (e.g. contextCP=2, genCP=1) to exercise the selfCPNum > peerCPNum branch in TargetRanksInfoForDP.

• Location: cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp, within the “PP, CP grow while TP shrinks” block (around lines 1680–1712)
• Suggested diff:

@@ // PP, CP grow while TP shrinks from context to generation.
     {
         tr::WorldConfig const contextWC{/*tpSize*/ 2, /*ppSize*/ 1, /*cpSize*/ 1};
         tr::WorldConfig const genWC{/*tpSize*/ 1, /*ppSize*/ 2, /*cpSize*/ 4};
+        // CP downscale case: contextCP > genCP
+        tr::WorldConfig const contextWC2{/*tpSize*/ 1, /*ppSize*/ 2, /*cpSize*/ 2};
+        tr::WorldConfig const genWC2{/*tpSize*/ 1, /*ppSize*/ 2, /*cpSize*/ 1};
+        verifyContext(
+            /*contextRank*/ 0, /*contextWC*/ contextWC2, /*genWC*/ genWC2,
+            /*expectRanks*/ { /* … appropriate expected ranks … */ },
+            /*expectPPDomain*/ 2, /*expectTPDomain*/ 1, /*expectCPDomain*/ 1,
+            /*expectNeedSend*/ true);
     }

After adding the new case, rebuild the test binary and run:

# Build your tests (e.g., via CMake/Ninja)
# Then:
./cpp/build/tests/batch_manager/cacheTransceiverTest --gtest_filter="targetTest.CacheStateNODP"

Verify that all tests, including the new downscale case, pass successfully.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16750 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12572 completed with status: 'FAILURE'

@brb-nv brb-nv requested a review from chuangz0 August 28, 2025 05:45
@brb-nv brb-nv force-pushed the user/brb/target-ranks-with-cp branch from 39a8f4c to 6d8f43e Compare August 28, 2025 06:30
@brb-nv
Copy link
Collaborator Author

brb-nv commented Aug 28, 2025

/bot run

@brb-nv brb-nv enabled auto-merge (squash) August 28, 2025 06:30
@tensorrt-cicd
Copy link
Collaborator

PR_Github #16808 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

1441-1463: Fix ModelConfig construction; add domain-size sanity check

  • std::vector(numLayers, numHeads) deduces std::vector. CacheState::ModelConfig expects std::vector. Use explicit element type and casts to avoid narrowing and potential compile break.
  • Add a sanity check that expectRanks.size() equals PP×TP×CP domain product computed by TargetRanksInfoForDP.

Apply this diff:

-        auto const sharedModelConfig
-            = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
+        auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+            std::vector<SizeType32>(numLayers, static_cast<SizeType32>(numHeads)),
+            static_cast<SizeType32>(sizePerHead),
+            static_cast<SizeType32>(tokensPerBlock)};
@@
-        EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
+        // Sanity: mIRanks count must equal the product of domain sizes.
+        EXPECT_EQ(expectRanks.size(),
+            static_cast<size_t>(contextTargetInfo.mDomainPPSize)
+                * static_cast<size_t>(contextTargetInfo.mDomainTPSize)
+                * static_cast<size_t>(contextTargetInfo.mDomainCPSize));
+        EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);

Run to catch any other places with accidental CTAD on std::vector for ModelConfig:

#!/bin/bash
rg -nP --type=cpp 'CacheState::ModelConfig\\s*\\{\\s*std::vector\\s*\\('
🧹 Nitpick comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

1700-1712: Add CP-shrink coverage (selfCP > peerCP) to exercise the else-branch

Currently only CP growth (1→2, 1→4) is tested. Add scenarios where CP shrinks (e.g., context CP=4 → generation CP=1) to validate peerCPRankStart computation and rank ordering. Also consider a mixed case with TP grow, PP grow, CP shrink to fully cover the mapping matrix.

I can append representative CP-shrink blocks mirroring the existing patterns if you want.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 39a8f4c and 6d8f43e.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{c,cc,cpp,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu}: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-160)
  • TargetRanksInfoForDP (50-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (11)

1464-1492: TP shrink mapping and MLA send gating look correct

Expected ranks and needSend flags match the DP-group modulo rule when TP shrinks (4→2) with CP=1.


1494-1510: TP grow case is consistent with TP-major, CP-next, PP-slowest ordering

Rank ordering {0,4,1,5}… matches loops (TP inner-most push with PP stride).


1512-1522: TP and PP grow: cross-product ordering matches implementation

Combined domains (PP×TP) produce the expected interleaving.


1524-1536: PP grows while TP shrinks: expectations align with domain computation

Single TP choice with two PP targets and correct needSend asymmetry.


1538-1554: CP grow only: ranks step by TP across CP slices as expected

mDomainCPSize=2 reflected correctly in target ranks.


1556-1576: TP and CP grow: ordering and cardinality are correct

Rank vectors reflect TP-fastest, CP-next for each PP.


1578-1594: TP shrinks, CP grows: send gating and targets are correct

Non-sending ranks (odd TP ranks) are captured for MLA.


1596-1616: PP and CP grow with constant TP: mapping is consistent

Two PP and two CP destinations with correct ordering.


1618-1646: PP shrinks, CP grows: expectations are coherent

Domain sizes and rank lists match the mapping scheme.


1648-1676: TP and PP shrink, CP grows: mixed shrink/grow case looks good

NeedSend false on “duplicate” TP senders is correct.


1678-1698: TP, CP grow while PP shrinks: ordering verified

Interleaving of TP within each PP destination is correct.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16808 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12618 completed with status: 'FAILURE'

@brb-nv
Copy link
Collaborator Author

brb-nv commented Aug 28, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16869 [ run ] triggered by Bot

@brb-nv brb-nv force-pushed the user/brb/target-ranks-with-cp branch from 6d8f43e to e7865b3 Compare August 28, 2025 17:21
@brb-nv
Copy link
Collaborator Author

brb-nv commented Aug 28, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16893 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16869 [ run ] completed with state ABORTED

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

1441-1452: Fix ModelConfig vector construction to avoid type mismatch.

The std::vector(numLayers, numHeads) deduces to std::vector<int>, but ModelConfig expects std::vector<SizeType32>. This can cause compilation errors or narrowing conversion warnings.

Apply this diff to fix the type issue:

-        auto const sharedModelConfig
-            = texec::kv_cache::CacheState::ModelConfig{std::vector(numLayers, numHeads), sizePerHead, tokensPerBlock};
+        auto const sharedModelConfig = texec::kv_cache::CacheState::ModelConfig{
+            std::vector<SizeType32>(numLayers, static_cast<SizeType32>(numHeads)),
+            static_cast<SizeType32>(sizePerHead),
+            static_cast<SizeType32>(tokensPerBlock)};
🧹 Nitpick comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

1457-1460: Consider adding domain-size sanity check.

While the CP domain testing is comprehensive, adding a sanity check that the size of expectRanks matches the product of domain sizes would catch configuration mismatches early.

Add this sanity check after computing contextTargetInfo:

         auto const contextTargetInfo
             = tensorrt_llm::executor::kv_cache::TargetRanksInfoForDP(genCache, contextCache, contextRank);
 
+        // Sanity check: expectRanks size should match domain product
+        EXPECT_EQ(expectRanks.size(),
+            static_cast<size_t>(expectPPDomain) * static_cast<size_t>(expectTPDomain) * static_cast<size_t>(expectCPDomain))
+            << "Expected ranks size mismatch for domain product";
         EXPECT_EQ(expectRanks, contextTargetInfo.mIRanks);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6d8f43e and e7865b3.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (3 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh}: Close namespaces with a trailing comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Variables not modified after initialization must be declared const
Only use non-trivial literals for initialization; use named constexprs instead of inline magic numbers (except 0, nullptr, true, false)
Use Allman brace style
Filename style for C++: camelCase starting lowercase (e.g., thisIsAFilename.cpp); must be case-insensitive unique within a target
Type names use CamelCase with uppercase first letter
Local variables, methods, and namespaces use lowerCamelCase
Non-static global (non-magic) variables use gPrefix (e.g., gCounter)
Static or anonymous-namespace globals use sPrefix (e.g., sMutableStaticGlobal)
Locally visible static variables use s prefix (e.g., static std::once_flag sFlag)
Member variables use mPrefix (e.g., mNbFooValues); public members may omit but prefer m when clarity benefits
Constants: enums and constant values use UPPER_SNAKE with k prefix (e.g., kDIGIT_NUM)
If macros must be used, name them UPPER_SNAKE (e.g., FOO_VERSION); prefer constants over #define
Run clang-format (LLVM style); max line length 120; only justify exceptions with clang-format off/on blocks
Use C++-style comments; C comments only for special inline cases; prefer // for single-line
Capitalize and punctuate full-sentence comments
Disable code with #if/#endif (or noop macros), not by commenting; avoid dead code
Do not throw exceptions across library boundaries
Use the least-forceful cast; avoid removing const/volatile; avoid C-style and functional casts; static_cast from void* to T*; reinterpret_cast as last resort; avoid dynamic_cast
Avoid large inline functions unless justified; class-defined functions are implicitly inline
Prefer anonymous namespaces over static for internal linkage of functions
Keep parameter names consistent between declarations and definitions
Only #define/#undef at g...

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces only; no tabs; indent by 4 spaces

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{h,hpp,hh,hxx,cuh,cpp,cc,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cuh,cpp,cc,cxx,cu}: Follow Doxygen: //! for comments, //!< for members
All templates (classes/functions/members/static members) must be instantiated at least once

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cuh}: If a class is not POD, keep data members private
Use include guards named TRTLLM_<FILE_NAME_UPPER> without leading/trailing underscores

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{cpp,cc,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu}: Place the semicolon of an empty for/while loop on a new line
Bodies of switch/while/do/for must be compound (brace-delimited), including if/else
Use smart pointers for heap objects; prefer unique_ptr over shared_ptr; weak_ptr only exceptionally; avoid deprecated smart pointers
Annotate non-obvious call arguments with inline C comments (/* name = */ value)
Do not use assignment within subexpressions (e.g., if (x = y))
Switch over enums should cover all values without default to enable compile-time checks
Write well-structured switch statements; no chained case/goto-like constructs
No fall-through except between consecutive empty-to-nonempty cases; each case must end with break or throw; return not allowed
If a switch clause is compound, place break inside the braces
Every defined function must be used at least once (no unused methods)
Avoid C library functions; prefer C++ alternatives; do not memset non-POD objects; use brace init or std::fill_n

Files:

  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
🧬 Code graph analysis (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (2)
  • TargetRanksInfoForDP (50-160)
  • TargetRanksInfoForDP (50-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1)

39-39: LGTM! CP domain size properly added to TargetRanksInfo.

The addition of mDomainCPSize follows the existing pattern and is correctly positioned after mDomainTPSize.

cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (10)

1464-1493: Excellent coverage for TP shrinking scenarios with CP.

The test comprehensively covers TP shrinking from context to generation, correctly verifying CP domain size remains 1 and checking the needSendCache flag for MLA optimization.


1495-1510: LGTM! TP growth scenario properly tested.

The test correctly validates TP expansion with CP=1, demonstrating proper rank mapping across the TP domain.


1538-1554: Well-structured CP growth test cases.

The test properly validates CP expansion from 1 to 2 while maintaining TP and PP dimensions, correctly computing the expanded rank set and CP domain size of 2.


1556-1576: Excellent coverage of combined TP and CP growth.

The test effectively validates simultaneous TP and CP expansion, correctly computing the 2×2 domain expansion and resulting rank mappings.


1578-1594: Good coverage for TP shrinking with CP growth.

The test properly handles the complex scenario where TP decreases while CP increases, ensuring correct rank distribution and checking MLA-specific needSendCache behavior.


1596-1616: Comprehensive PP and CP growth testing.

The test correctly validates simultaneous PP and CP expansion, properly computing the 2×2 domain size and resulting rank mappings across pipeline and context parallel dimensions.


1618-1646: Thorough coverage of PP shrinking with CP growth.

The test properly validates PP reduction while CP increases, ensuring correct rank mappings across all context ranks. Good attention to detail in covering all 8 context ranks.


1648-1676: Excellent coverage of complex TP/PP shrinking with CP growth.

The test comprehensively covers the scenario where both TP and PP decrease while CP increases, properly handling MLA-specific needSendCache logic.


1678-1698: Well-tested TP/CP growth with PP shrinking.

The test properly validates the complex scenario of TP and CP expansion with PP reduction, ensuring correct 2×2 CP×TP domain expansion.


1700-1712: Comprehensive extreme scaling test case.

Excellent test coverage for PP and CP significant growth (2×4) with TP shrinking, properly validating the 8-rank domain expansion.

@brb-nv
Copy link
Collaborator Author

brb-nv commented Aug 28, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16907 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16893 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16907 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12702 completed with status: 'FAILURE'

@brb-nv
Copy link
Collaborator Author

brb-nv commented Aug 29, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17009 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17009 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12772 completed with status: 'FAILURE'

@brb-nv
Copy link
Collaborator Author

brb-nv commented Aug 29, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17015 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17015 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12778 completed with status: 'SUCCESS'

@brb-nv brb-nv merged commit 43cb50f into NVIDIA:main Aug 29, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants