

@eopXD eopXD commented Sep 2, 2025

Summary by CodeRabbit

  • Refactor

    • Simplified KV cache APIs by removing the onboard_blocks option; onboarding/offloading now handled automatically.
    • Updated C++ and Python constructor signatures (and property bindings) to exclude onboard_blocks; parameter order adjusted accordingly.
    • Removed onboard_blocks from serialization/pickling formats; saved state no longer includes this field.
  • Tests

    • Updated unit tests to align with the streamlined APIs and serialization changes.

Description

No functional change is intended in this MR.

Dead code elimination. The secondary block pool is created only when kv_cache_config::host_cache_size is specified, so whether a KV cache block should be onboarded/offloaded can be inferred from whether the manager has a secondary block pool. The onboardBlocks toggle itself only adds complication; this commit removes it.
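The derivation described above can be sketched in a few lines. This is a hypothetical Python illustration, not the actual TensorRT-LLM classes: `BlockManagerSketch`, `host_cache_size`, and `can_offload` are invented names standing in for the C++ `BlockManager`, `kv_cache_config::host_cache_size`, and the removed `onboardBlocks` flag.

```python
# Hypothetical sketch (not TRT-LLM code): the offload/onboard decision can be
# derived from whether a secondary (host) pool exists, which makes an explicit
# onboard_blocks toggle redundant.

class BlockManagerSketch:
    def __init__(self, host_cache_size=None):
        # A secondary pool is only created when a host cache size is given.
        self.secondary_pool = [] if host_cache_size else None

    @property
    def can_offload(self):
        # Derived property replaces the removed onboard_blocks flag.
        return self.secondary_pool is not None

mgr = BlockManagerSketch(host_cache_size=1 << 30)
assert mgr.can_offload
assert not BlockManagerSketch().can_offload
```

With the flag gone, there is no configuration in which a secondary pool exists but onboarding is disabled, which is exactly the dead code path being removed.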

Test Coverage

Since no functional change is intended, no test changes are needed.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

@eopXD eopXD requested review from thorjohnsen, mikeiovine and pcastonguay and removed request for thorjohnsen, mikeiovine and pcastonguay September 2, 2025 07:15
@eopXD eopXD self-assigned this Sep 2, 2025
@eopXD eopXD added the KV-Cache Management kv-cache management for efficient LLM inference label Sep 2, 2025
coderabbitai bot commented Sep 2, 2025

📝 Walkthrough

Removes the onboardBlocks parameter and related logic from KV cache components across headers, implementations, bindings, serialization, and tests. Constructor signatures and parameter ordering are updated accordingly. Offload/onboard gating tied to onboardBlocks is eliminated. Python bindings and serialization schemas drop the onboard_blocks field. Tests adjusted to new APIs.

Changes

Cohort / File(s) Summary
KV cache header API updates
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
Removed onboardBlocks from WindowBlockManager, BlockManager, KVCacheManager constructors; reordered parameters; removed member storing onboard policy.
Executor config header
cpp/include/tensorrt_llm/executor/executor.h
KvCacheConfig: removed onboardBlocks ctor param, getter/setter, and private member.
KV cache implementation
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
Purged onboarding gating; updated constructors and internal calls; adjusted logging; offload/onboard decisions no longer depend on onboard flag.
Inflight batching usage
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
Removed runtime guard requiring onboard blocks for certain FMHA; updated KVCacheManager construction to new signature.
Executor config impl
cpp/tensorrt_llm/executor/kvCacheConfig.cpp
KvCacheConfig ctor drops onboardBlocks; reorders params; removes getter/setter implementations and member init.
Serialization changes
cpp/tensorrt_llm/executor/serialization.cpp
Removed onboardBlocks from serialize/deserialize and size calculations; updated KvCacheConfig construction order.
Nanobind (C++/Python) executor config
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
Dropped onboard_blocks property and getstate element; note: setstate may still expect previous tuple size.
Pybind KV cache bindings
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
Removed onboard_blocks arg from exposed KVCacheManager ctors; adjusted py::init signatures (two bools → one before CacheType).
Pybind executor config
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
Removed onboard_blocks property; updated pickling tuple to exclude it.
Unit tests: KV cache
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
Updated ctor calls to omit onboardBlocks; one merge-conflict artifact present; logic otherwise unchanged.
Unit tests: serialization
cpp/tests/unit_tests/executor/serializeUtilsTest.cpp
Removed assertion on onboardBlocks in KvCacheConfig serialization test.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant KVCacheManager
  participant BlockManager
  participant PrimaryPool
  participant SecondaryPool

  Client->>KVCacheManager: requestBlock()
  KVCacheManager->>BlockManager: getFreeBlock()
  alt primary has free block
    BlockManager->>PrimaryPool: allocate()
    PrimaryPool-->>BlockManager: block
  else primary needs space
    BlockManager->>SecondaryPool: offload eligible blocks
    SecondaryPool-->>BlockManager: offloaded
    BlockManager->>PrimaryPool: allocate()
    PrimaryPool-->>BlockManager: block
  end
  BlockManager-->>KVCacheManager: block
  KVCacheManager-->>Client: block

  note over BlockManager,SecondaryPool: Onboarding/offloading no longer gated by onboardBlocks flag

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

KV-Cache Management

Suggested reviewers

  • tomeras91
  • Tabrizian
  • thorjohnsen
  • achartier
  • Funatiq

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (9)
cpp/tensorrt_llm/executor/serialization.cpp (1)

1173-1193: maxGpuTotalBytes is never serialized; value is lost across process boundaries.

KvCacheConfig exposes getMaxGpuTotalBytes()/ctor param, but serialize/deserialize/serializedSize omit it. This silently resets to 0 after (de)serialization.

Apply this diff to persist the field:

@@ KvCacheConfig Serialization::deserializeKvCacheConfig(std::istream& is)
-    auto attentionDpEventsGatherPeriodMs = su::deserialize<SizeType32>(is);
+    auto attentionDpEventsGatherPeriodMs = su::deserialize<SizeType32>(is);
+    auto maxGpuTotalBytes = su::deserialize<uint64_t>(is);
@@
-    return KvCacheConfig{enableBlockReuse, maxTokens, maxAttentionWindowVec, sinkTokenLength, freeGpuMemoryFraction,
-        hostCacheSize, crossKvCacheFraction, secondaryOffloadMinPriority, eventBufferMaxSize, enablePartialReuse,
-        copyOnPartialReuse, useUvm, attentionDpEventsGatherPeriodMs};
+    return KvCacheConfig{enableBlockReuse, maxTokens, maxAttentionWindowVec, sinkTokenLength, freeGpuMemoryFraction,
+        hostCacheSize, crossKvCacheFraction, secondaryOffloadMinPriority, eventBufferMaxSize, enablePartialReuse,
+        copyOnPartialReuse, useUvm, attentionDpEventsGatherPeriodMs, std::nullopt, maxGpuTotalBytes};
@@ void Serialization::serialize(KvCacheConfig const& kvCacheConfig, std::ostream& os)
     su::serialize(kvCacheConfig.getSecondaryOffloadMinPriority(), os);
     su::serialize(kvCacheConfig.getEventBufferMaxSize(), os);
     su::serialize(kvCacheConfig.getUseUvm(), os);
     su::serialize(kvCacheConfig.getAttentionDpEventsGatherPeriodMs(), os);
+    su::serialize(kvCacheConfig.getMaxGpuTotalBytes(), os);
@@ size_t Serialization::serializedSize(KvCacheConfig const& kvCacheConfig)
     totalSize += su::serializedSize(kvCacheConfig.getEventBufferMaxSize());
     totalSize += su::serializedSize(kvCacheConfig.getUseUvm());
     totalSize += su::serializedSize(kvCacheConfig.getAttentionDpEventsGatherPeriodMs());
+    totalSize += su::serializedSize(kvCacheConfig.getMaxGpuTotalBytes());
     return totalSize;

Note: adding this also changes the wire format; consider coupling with the versioning suggestion above.
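The versioning idea alluded to above can be sketched as a schema-version prefix on the serialized payload. This is a hypothetical Python illustration, not the TRT-LLM serializer: `SCHEMA_VERSION`, `serialize_config`, and `deserialize_config` are invented names, and only two of KvCacheConfig's fields are shown.

```python
# Hypothetical sketch: prefix the serialized config with a schema version so
# new fields (like maxGpuTotalBytes) can be added without silently producing
# garbage when old and new readers disagree on the wire format.
import struct

SCHEMA_VERSION = 2  # bumped when the new field was appended

def serialize_config(use_uvm, max_gpu_total_bytes):
    # "<I?Q": little-endian uint32 version, bool, uint64 field.
    return struct.pack("<I?Q", SCHEMA_VERSION, use_uvm, max_gpu_total_bytes)

def deserialize_config(data):
    (version,) = struct.unpack_from("<I", data)
    if version != SCHEMA_VERSION:
        # An old reader fails loudly instead of misinterpreting the bytes.
        raise ValueError(f"unsupported schema version {version}")
    _, use_uvm, max_bytes = struct.unpack("<I?Q", data)
    return use_uvm, max_bytes
```

A round trip preserves the field that the unversioned format would have dropped: `deserialize_config(serialize_config(True, 123))` returns `(True, 123)`.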

cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (3)

109-121: Pickle schema mismatch: getstate emits 14 fields, but setstate still requires 15.

This breaks round-trip pickling and will throw at runtime. Also keeps the removed onboard_blocks in the tuple layout.

Apply backward-compatible fix (accept 14 or 15; ignore deprecated onboard_blocks at index 6 when present):

-    auto kvCacheConfigSetstate = [](py::tuple const& state)
+    auto kvCacheConfigSetstate = [](py::tuple const& state)
     {
-        if (state.size() != 15)
+        if (state.size() != 14 && state.size() != 15)
         {
             throw std::runtime_error("Invalid state!");
         }
-        return tle::KvCacheConfig(state[0].cast<bool>(), state[1].cast<std::optional<SizeType32>>(),
-            state[2].cast<std::optional<std::vector<SizeType32>>>(), state[3].cast<std::optional<SizeType32>>(),
-            state[4].cast<std::optional<float>>(), state[5].cast<std::optional<size_t>>(), state[6].cast<bool>(),
-            state[7].cast<std::optional<float>>(), state[8].cast<std::optional<tle::RetentionPriority>>(),
-            state[9].cast<size_t>(), state[10].cast<bool>(), state[11].cast<bool>(), state[12].cast<bool>(),
-            state[13].cast<SizeType32>(), std::nullopt, state[14].cast<uint64_t>());
+        auto const shift = (state.size() == 15) ? 1 : 0; // ignore deprecated onboard_blocks at state[6]
+        return tle::KvCacheConfig(
+            state[0].cast<bool>(),
+            state[1].cast<std::optional<SizeType32>>(),
+            state[2].cast<std::optional<std::vector<SizeType32>>>(),
+            state[3].cast<std::optional<SizeType32>>(),
+            state[4].cast<std::optional<float>>(),
+            state[5].cast<std::optional<size_t>>(),
+            state[6 + shift].cast<std::optional<float>>(),
+            state[7 + shift].cast<std::optional<tle::RetentionPriority>>(),
+            state[8 + shift].cast<size_t>(),
+            state[9 + shift].cast<bool>(),
+            state[10 + shift].cast<bool>(),
+            state[11 + shift].cast<bool>(),
+            state[12 + shift].cast<SizeType32>(),
+            std::nullopt,
+            state[13 + shift].cast<uint64_t>());
     };

123-135: Constructor binding still exposes removed onboard_blocks parameter.

This contradicts the PR goal and likely won’t compile against the updated C++ API.

Remove the boolean from the ctor signature and the corresponding arg:

-        .def(py::init<bool, std::optional<SizeType32> const&, std::optional<std::vector<SizeType32>> const&,
-                 std::optional<SizeType32> const&, std::optional<float> const&, std::optional<size_t> const&, bool,
+        .def(py::init<bool, std::optional<SizeType32> const&, std::optional<std::vector<SizeType32>> const&,
+                 std::optional<SizeType32> const&, std::optional<float> const&, std::optional<size_t> const&,
                  std::optional<float> const&, std::optional<tle::RetentionPriority>, size_t const&, bool, bool, bool,
                  SizeType32, std::optional<RuntimeDefaults> const&, uint64_t const&>(),
@@
-            py::arg("free_gpu_memory_fraction") = py::none(), py::arg("host_cache_size") = py::none(),
-            py::arg("onboard_blocks") = true, py::arg("cross_kv_cache_fraction") = py::none(),
+            py::arg("free_gpu_memory_fraction") = py::none(), py::arg("host_cache_size") = py::none(),
+            py::arg("cross_kv_cache_fraction") = py::none(),

101-108: Align kvCacheConfig pickle getstate/setstate with ctor signature

  • kvCacheConfigGetstate returns 14 elements, but kvCacheConfigSetstate still checks for 15 (if (state.size() != 15)), so unpickling always fails.
  • setstate casts state[6] to bool (presumably for the removed onboard_blocks) and passes it into the C++ ctor’s crossKvCacheFraction parameter—mismapping both types and positions.
  • Update kvCacheConfigSetstate to expect 14 fields, correct the index offsets after removing onboard_blocks, and adjust the state.size() check and argument order to match the 15-parameter C++ ctor (with runtimeDefaults defaulted) exactly.
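The backward-compatible unpickling pattern suggested in the diff above boils down to an index shift over the state tuple. This is a hypothetical Python illustration of the pattern only: `restore_kv_cache_config` is an invented name, and the real bindings cast each element to a concrete C++ type rather than returning a tuple.

```python
# Hypothetical sketch: accept both the old 15-element pickle state (with the
# removed onboard_blocks at index 6) and the new 14-element one, skipping the
# deprecated field when present.

def restore_kv_cache_config(state):
    if len(state) not in (14, 15):
        raise RuntimeError("Invalid state!")
    shift = 1 if len(state) == 15 else 0  # ignore deprecated onboard_blocks
    # Fields before index 6 are unchanged; later fields move down by `shift`.
    return state[:6] + state[6 + shift:]

old_state = tuple(range(6)) + (True,) + tuple(range(6, 14))  # 15 elements
new_state = tuple(range(14))                                  # 14 elements
assert restore_kv_cache_config(old_state) == new_state
assert restore_kv_cache_config(new_state) == new_state
```

The same shift applies verbatim in the pybind and nanobind setstate lambdas; only the element casts differ.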
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (3)

117-131: Pickle schema mismatch (14 vs 15) and lingering onboard in setstate.

Same issue as pybind: runtime error on unpickle and stale flag handling.

-    auto kvCacheConfigSetstate = [](tle::KvCacheConfig& self, nb::tuple const& state)
+    auto kvCacheConfigSetstate = [](tle::KvCacheConfig& self, nb::tuple const& state)
     {
-        if (state.size() != 15)
+        if (state.size() != 14 && state.size() != 15)
         {
             throw std::runtime_error("Invalid state!");
         }
-        new (&self) tle::KvCacheConfig(nb::cast<bool>(state[0]), nb::cast<std::optional<SizeType32>>(state[1]),
-            nb::cast<std::optional<std::vector<SizeType32>>>(state[2]), nb::cast<std::optional<SizeType32>>(state[3]),
-            nb::cast<std::optional<float>>(state[4]), nb::cast<std::optional<size_t>>(state[5]),
-            nb::cast<bool>(state[6]), nb::cast<std::optional<float>>(state[7]),
-            nb::cast<std::optional<tle::RetentionPriority>>(state[8]), nb::cast<size_t>(state[9]),
-            nb::cast<bool>(state[10]), nb::cast<bool>(state[11]), nb::cast<bool>(state[12]),
-            nb::cast<SizeType32>(state[13]), std::nullopt, nb::cast<uint64_t>(state[14]));
+        int const shift = (state.size() == 15) ? 1 : 0; // ignore deprecated onboard_blocks
+        new (&self) tle::KvCacheConfig(
+            nb::cast<bool>(state[0]),
+            nb::cast<std::optional<SizeType32>>(state[1]),
+            nb::cast<std::optional<std::vector<SizeType32>>>(state[2]),
+            nb::cast<std::optional<SizeType32>>(state[3]),
+            nb::cast<std::optional<float>>(state[4]),
+            nb::cast<std::optional<size_t>>(state[5]),
+            nb::cast<std::optional<float>>(state[6 + shift]),
+            nb::cast<std::optional<tle::RetentionPriority>>(state[7 + shift]),
+            nb::cast<size_t>(state[8 + shift]),
+            nb::cast<bool>(state[9 + shift]),
+            nb::cast<bool>(state[10 + shift]),
+            nb::cast<bool>(state[11 + shift]),
+            nb::cast<SizeType32>(state[12 + shift]),
+            std::nullopt,
+            nb::cast<uint64_t>(state[13 + shift]));
     };

131-144: Constructor binding still includes onboard_blocks.

Remove the boolean and the nb::arg to reflect the C++ API.

-        .def(nb::init<bool, std::optional<SizeType32> const&, std::optional<std::vector<SizeType32>> const&,
-                 std::optional<SizeType32> const&, std::optional<float> const&, std::optional<size_t> const&, bool,
+        .def(nb::init<bool, std::optional<SizeType32> const&, std::optional<std::vector<SizeType32>> const&,
+                 std::optional<SizeType32> const&, std::optional<float> const&, std::optional<size_t> const&,
                  std::optional<float> const&, std::optional<tle::RetentionPriority>, size_t const&, bool, bool, bool,
                  SizeType32, std::optional<RuntimeDefaults> const&, uint64_t const&>(),
@@
-            nb::arg("free_gpu_memory_fraction") = nb::none(), nb::arg("host_cache_size") = nb::none(),
-            nb::arg("onboard_blocks") = true, nb::arg("cross_kv_cache_fraction") = nb::none(),
+            nb::arg("free_gpu_memory_fraction") = nb::none(), nb::arg("host_cache_size") = nb::none(),
+            nb::arg("cross_kv_cache_fraction") = nb::none(),

109-116: Align __getstate__ tuple with the updated KvCacheConfig constructor signature.
In cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (lines 109–116), __getstate__ currently returns 14 elements but the nanobind __init__ and C++ ctor expect 15 parameters (including the new runtime_defaults). Update the tuple (and adjust the __setstate__ size check) so it emits—and consumes—all fields in the exact constructor order to prevent mis-serialization.

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)

455-470: Fix pybind KVCacheManager constructor type list (stream/max_sequence_length types are wrong).

The py::init<> template uses bool,int64_t where stream and optional max_sequence_length are expected. This will break bindings.

-        .def(py::init<std::vector<SizeType32> const&, SizeType32, SizeType32,
-                 std::map<SizeType32, std::tuple<SizeType32, SizeType32>> const&, SizeType32, SizeType32,
-                 std::vector<SizeType32> const&, std::optional<tbk::TempAttentionWindowInputs> const&,
-                 nvinfer1::DataType, SizeType32, bool, int64_t, bool, tbk::CacheType,
+        .def(py::init<std::vector<SizeType32> const&, SizeType32, SizeType32,
+                 std::map<SizeType32, std::tuple<SizeType32, SizeType32>> const&, SizeType32, SizeType32,
+                 std::vector<SizeType32> const&, std::optional<tbk::TempAttentionWindowInputs> const&,
+                 nvinfer1::DataType, SizeType32, CudaStreamPtr, std::optional<SizeType32>, bool, tbk::CacheType,
                  std::optional<tensorrt_llm::executor::RetentionPriority>, std::shared_ptr<tbk::KVCacheEventManager>,
                  bool, bool, std::shared_ptr<tbc::KvCacheConnectorManager>>(),
@@
-            py::arg("sink_token_length"), py::arg("stream"), py::arg("max_sequence_length"),
+            py::arg("sink_token_length"), py::arg("stream"), py::arg("max_sequence_length"),

Also re-run all call sites in Python to ensure “onboard_blocks” kwargs are removed (see _torch.resource_manager).

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

1386-1391: Fix parameter name casing: copyOnpartialReuse → copyOnPartialReuse
Rename the parameter in all KVCacheManager overloads (cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h) at lines 1389, 1400, 1411, and 1420 to match the .cpp implementation and avoid inconsistencies in generated/binding code.

Apply:

--- a/cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
+++ b/cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
@@ -1386,7 +1386,7 @@
-        bool copyOnpartialReuse = true,
+        bool copyOnPartialReuse = true,
@@ -1397,7 +1397,7 @@
-        bool copyOnpartialReuse = true,
+        bool copyOnPartialReuse = true,
@@ -1409,7 +1409,7 @@
-        bool copyOnpartialReuse = true,
+        bool copyOnPartialReuse = true,
@@ -1419,7 +1419,7 @@
-        bool copyOnpartialReuse = true);
+        bool copyOnPartialReuse = true);
🧹 Nitpick comments (6)
cpp/include/tensorrt_llm/executor/executor.h (1)

1-15: Update copyright year range to include 2025.

Header shows 2022-2024; repository guideline asks for current year on touched files.

- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
cpp/tensorrt_llm/executor/kvCacheConfig.cpp (1)

197-201: Optional setter ergonomics.

getHostCacheSize() is optional but only a size_t setter exists. Consider an overload to clear the value.

Example (header + impl):

  • void setHostCacheSize(std::optional<size_t> hostCacheSize);
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)

2-2: Update SPDX year range to include 2025.

Keep headers consistent with other updated files.

- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)

548-551: Use 0 (SizeType32) instead of false for sinkTokenLength.

Passing a bool where SizeType32 is expected is confusing and risks overload mismatches. Use 0 for clarity and to match other sites.

-        beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kFP4,
-        false, stream, true);
+        beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kFP4,
+        0, stream, true);
@@
-        beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kHALF,
-        false, stream, true);
+        beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kHALF,
+        0, stream, true);
@@
-        beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kHALF,
-        false, stream, true);
+        beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kHALF,
+        0, stream, true);

Also applies to: 2101-2104, 2175-2178

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)

934-952: Onboarding event emission: align condition with offload path

Offload path checks mEventManager && blockInRadixTree(block). Onboard path checks only mEventManager. For consistency and to avoid spurious events for non-radix nodes, gate on blockInRadixTree as well.

-        if (mEventManager)
+        if (mEventManager && blockInRadixTree(offloadBlock))
         {
             mEventManager->enqueueUpdatedEvent(
                 tle::KVCacheUpdatedData(offloadBlock->getHash()).cacheLevelUpdated(kSecondaryLevel, kPrimaryLevel),
                 mWindowSize);
         }

2262-2271: Nit: log message spelling

"secondayBlocks" → "secondaryBlocks".

-        TLLM_LOG_INFO(
-            "[windowSize=%d] {.primaryBlocks=%d, .secondayBlocks=%d}", windowSize, primaryBlocks, secondayBlocks);
+        TLLM_LOG_INFO(
+            "[windowSize=%d] {.primaryBlocks=%d, .secondaryBlocks=%d}", windowSize, primaryBlocks, secondayBlocks);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between ff2439f and 6315ba1.

📒 Files selected for processing (11)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (6 hunks)
  • cpp/include/tensorrt_llm/executor/executor.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (12 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tensorrt_llm/executor/kvCacheConfig.cpp (1 hunks)
  • cpp/tensorrt_llm/executor/serialization.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (30 hunks)
  • cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (0 hunks)
💤 Files with no reviewable changes (1)
  • cpp/tests/unit_tests/executor/serializeUtilsTest.cpp

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/executor/serialization.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/executor/kvCacheConfig.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧬 Code graph analysis (1)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • KVCacheManager (142-1011)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (4)
  • numLayers (32-69)
  • numLayers (32-33)
  • numLayers (77-88)
  • numLayers (77-77)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (14)
cpp/include/tensorrt_llm/executor/executor.h (1)

1000-1011: KvCacheConfig ctor signature change looks consistent; verify downstream/bindings.

Order now: enableBlockReuse, maxTokens, maxAttentionWindowVec, sinkTokenLength, freeGpuMemoryFraction, hostCacheSize, crossKvCacheFraction, secondaryOffloadMinPriority, eventBufferMaxSize, enablePartialReuse, copyOnPartialReuse, useUvm, attentionDpEventsGatherPeriodMs, [runtimeDefaults], [maxGpuTotalBytes]. This matches serialization.cpp’s construction. Please confirm all callers (pybind/nanobind/tests) updated.

cpp/tensorrt_llm/executor/serialization.cpp (1)

1156-1176: Preserve serialization compatibility for KvCacheConfig
Either introduce a version tag for KvCacheConfig serialization or keep a reserved bool in place of the removed onboardBlocks (read and ignore) to maintain the original byte layout. At minimum, document this breaking change in the RELEASE notes.
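The reserved-placeholder idea can be sketched as follows. This is an illustrative, self-contained example only: the names (KvCacheConfigSketch, writeConfig, readConfig) are hypothetical and this is not the actual TensorRT-LLM Serialization API. It shows the pattern of writing a dummy bool where the removed onboardBlocks field used to live, and reading-and-discarding it on load, so the byte layout of previously serialized configs stays valid.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical, reduced stand-in for KvCacheConfig.
struct KvCacheConfigSketch
{
    bool enableBlockReuse{true};
    std::int64_t hostCacheSize{0};
};

void writeConfig(std::ostream& os, KvCacheConfigSketch const& c)
{
    os.write(reinterpret_cast<char const*>(&c.enableBlockReuse), sizeof(bool));
    // Placeholder for the removed onboardBlocks field: keeps the byte
    // layout identical to the pre-removal format.
    bool const reservedOnboardBlocks = true;
    os.write(reinterpret_cast<char const*>(&reservedOnboardBlocks), sizeof(bool));
    os.write(reinterpret_cast<char const*>(&c.hostCacheSize), sizeof(std::int64_t));
}

KvCacheConfigSketch readConfig(std::istream& is)
{
    KvCacheConfigSketch c;
    is.read(reinterpret_cast<char*>(&c.enableBlockReuse), sizeof(bool));
    bool ignoredOnboardBlocks{};
    // Read and ignore the reserved slot.
    is.read(reinterpret_cast<char*>(&ignoredOnboardBlocks), sizeof(bool));
    is.read(reinterpret_cast<char*>(&c.hostCacheSize), sizeof(std::int64_t));
    return c;
}
```

A version tag at the front of the stream is the cleaner long-term alternative, since it lets future removals drop the reserved slot entirely.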

cpp/tensorrt_llm/executor/kvCacheConfig.cpp (2)

24-31: Ctor reorder/removal of onboard flag is clean; validations preserved.

Parameter order matches header and serialization; runtimeDefaults/maxGpuTotalBytes handling remains intact.


69-71: Good: guard against non-positive gather period.

Runtime check keeps invalid configs from propagating.

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

683-691: Approve KVCacheManager ctor update: onboard flag has been removed and the argument sequence (enableBlockReuse, cacheType, secondaryOffloadMinPriority, eventManager, enablePartialReuse, copyOnPartialReuse) aligns with the updated constructors.

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)

126-131: BlockManager ctor updates look consistent.

The added maxAttentionWindowVec/temp inputs/dtype/sinkTokenLength arguments match other call sites. No issues spotted.

Also applies to: 211-215, 584-588, 861-865, 1057-1061, 1750-1753


1696-1699: KVCacheManager constructor signatures verified; no misplaced boolean args found
All KVCacheManager ctor calls align with the updated signature and no residual onboardBlocks or stray boolean args were detected.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)

856-866: BlockManager instantiation updated correctly
KVCacheManager is the sole consumer of BlockManager and now passes sinkBubbleLength before cacheType per the revised signature; no callers still pass the removed onboardBlocks flag.


535-542: No external WindowBlockManager usages; no constructor call sites to update
Ripgrep across the repository found no invocations of the updated constructor outside its own definition.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (5)

535-541: Call to WindowBlockManager reflects new signature

The try_emplace argument list aligns with the header reordering and removed onboard flag.


575-581: WindowBlockManager ctor def — signature matches header

No functional concerns; constructor order aligns with public declaration.


871-889: Guarding secondary-pool usage in getFreeBlock is correct

The added check for available secondary blocks prevents invalid offloads during primary reclamation. Good fix.
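The guard pattern approved here can be sketched in isolation. All names below (EvictionPolicySketch, offloadBlock) are hypothetical, not the real eviction-policy API: the point is checking that the secondary pool actually has a free block before offloading, and skipping the offload when it is exhausted instead of underflowing.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>

constexpr std::size_t kPrimaryLevel = 0;
constexpr std::size_t kSecondaryLevel = 1;

// Hypothetical two-level free-block bookkeeping.
struct EvictionPolicySketch
{
    std::deque<int> mFree[2]; // free block ids per cache level

    std::size_t getNumFreeBlocks(std::size_t level) const
    {
        return mFree[level].size();
    }

    int getFreeBlock(std::size_t level)
    {
        assert(!mFree[level].empty()); // callers must guard before calling
        int const id = mFree[level].front();
        mFree[level].pop_front();
        return id;
    }
};

// Returns the id of the secondary block the primary block's contents were
// offloaded to, or nullopt when the secondary pool is exhausted (offload
// skipped and the caller drops the contents).
std::optional<int> offloadBlock(EvictionPolicySketch& policy, int /*primaryBlockId*/)
{
    if (policy.getNumFreeBlocks(kSecondaryLevel) == 0)
    {
        return std::nullopt;
    }
    return policy.getFreeBlock(kSecondaryLevel);
}
```

The same shape applies to the offloadBlock concern raised later in this review: the guard belongs before the getFreeBlock(kSecondaryLevel) call, not after.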


500-509: No BlockManager call sites found
Search across the codebase returned no invocations of BlockManager, so there are no call-sites needing inline parameter comments.


959-979: Guard offloadBlock when no free secondary blocks
Add a check in WindowBlockManager::offloadBlock before calling getFreeBlock(kSecondaryLevel) to avoid underflow when the secondary pool is exhausted:

if (mEvictionPolicy->getNumFreeBlocks(kSecondaryLevel) == 0)
    return;
auto offloadBlock = std::get<0>(mEvictionPolicy->getFreeBlock(kSecondaryLevel));


@eopXD eopXD requested a review from lowsfer September 2, 2025 07:49
@eopXD eopXD force-pushed the remove-kv-cache-manager-onboard-switch branch 2 times, most recently from 4a1fda3 to 1b9163a Compare September 2, 2025 08:49
@eopXD
Collaborator Author

eopXD commented Sep 2, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #17338 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17338 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13031 completed with status: 'FAILURE'

@eopXD eopXD marked this pull request as draft September 2, 2025 11:31
@eopXD eopXD force-pushed the remove-kv-cache-manager-onboard-switch branch from 1b9163a to 3dcc00a Compare September 2, 2025 11:31
@eopXD
Collaborator Author

eopXD commented Sep 2, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #17361 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17361 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13049 completed with status: 'FAILURE'

@eopXD eopXD force-pushed the remove-kv-cache-manager-onboard-switch branch from 3dcc00a to 229b106 Compare September 3, 2025 05:54
@eopXD
Collaborator Author

eopXD commented Sep 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17484 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17484 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13139 completed with status: 'FAILURE'

@eopXD eopXD force-pushed the remove-kv-cache-manager-onboard-switch branch from 229b106 to a9e68ba Compare September 3, 2025 13:23
@eopXD
Collaborator Author

eopXD commented Sep 3, 2025

/bot run

1 similar comment
@eopXD
Collaborator Author

eopXD commented Sep 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17540 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17540 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13186 completed with status: 'FAILURE'

…k switch

Dead code elimination. The secondary block pool is derived when
kv_cache_config::host_cache_size is specified. Whether we
onboard/offload a kv cache block can be inferred from whether
the manager has a secondary block pool or not. The `onboardBlocks`
toggle itself only adds complication. This commit removes it.

Signed-off-by: eopXD <[email protected]>
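The inference the commit message describes can be sketched as below. Names (BlockManagerSketch, canOffload) are hypothetical, not the real BlockManager API: the onboardBlocks toggle is replaced by a predicate derived from whether a secondary (host) pool exists, which in turn follows from kv_cache_config::host_cache_size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, reduced stand-in for BlockManager.
struct BlockManagerSketch
{
    std::uint64_t mNumSecondaryBlocks{0};

    // Secondary pool size is derived from the configured host cache size.
    explicit BlockManagerSketch(std::uint64_t hostCacheSizeBytes, std::uint64_t bytesPerBlock)
        : mNumSecondaryBlocks(bytesPerBlock == 0 ? 0 : hostCacheSizeBytes / bytesPerBlock)
    {
    }

    // Replaces the removed onboardBlocks flag: onboarding/offloading is
    // possible exactly when a secondary block pool was configured.
    bool canOffload() const
    {
        return mNumSecondaryBlocks > 0;
    }
};
```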
@eopXD eopXD force-pushed the remove-kv-cache-manager-onboard-switch branch from a9e68ba to 299ca54 Compare September 5, 2025 06:16