
Conversation

@chang-l (Collaborator) commented Aug 21, 2025

Refer to vllm-project/vllm#16016 for context. This addresses security risks such as prompt theft attacks.
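For illustration, a minimal client-side sketch of how a request could scope KV-cache reuse with cache_salt via the OpenAI-compatible endpoint (the base URL, model name, and salt string below are placeholders):

import openai

# Placeholder endpoint/model; the salt string is arbitrary, but only requests that
# send the same salt are allowed to reuse each other's KV-cache blocks.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Summarize this confidential document ..."}],
    max_tokens=32,
    extra_body={"cache_salt": "tenant-a"},
)
print(response.choices[0].message.content)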

Summary by CodeRabbit

  • New Features

    • KV-cache salting: added optional cache_salt (string) and propagated cache_salt_id through request APIs, bindings, serialization and runtime; Request objects gain get/set accessors (default None).
    • Added get_cache_salt_id(cache_salt: str) helper and input validation to reject empty cache_salt.
  • Tests

    • Added unit and integration tests validating cache_salt governs KV-cache reuse and related metrics.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@chang-l chang-l requested review from a team as code owners August 21, 2025 04:33
@chang-l chang-l requested review from syuoni and lfr-0531 August 21, 2025 04:33
Contributor

coderabbitai bot commented Aug 21, 2025

📝 Walkthrough

Adds optional per-request KV-cache salting across C++ and Python: new CacheSaltID type, request fields/getters/setters, propagated through LlmRequest/Request constructors and bindings, mixed into BlockKey hashing for first-block seeding, utilities to derive salt IDs, and tests/benchmarks updated to exercise the feature.

Changes

  • Types & runtime alias (cpp/include/tensorrt_llm/executor/types.h, cpp/include/tensorrt_llm/runtime/common.h): Add CacheSaltIDType alias (std::uint64_t) in executor and runtime namespaces.
  • Executor Request API (C++) (cpp/include/tensorrt_llm/executor/executor.h, cpp/tensorrt_llm/executor/request.cpp, cpp/tensorrt_llm/executor/requestImpl.h, cpp/tensorrt_llm/executor/serialization.cpp): Extend Request constructor with optional cacheSaltID; add getCacheSaltID/setCacheSaltID; propagate into Impl and (de)serialization.
  • Batch-manager LLM requests (C++) (cpp/include/tensorrt_llm/batch_manager/llmRequest.h, cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h, cpp/tensorrt_llm/pybind/batch_manager/llmRequest.h, cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp, cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp): Add CacheSaltIDType alias, member mCacheSaltID, ctor param and getter; forward cacheSaltID through constructors and to tb::LlmRequest.
  • KV cache keying & hashing (cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h, cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp): Add optional cacheSaltID to BlockKey (member, ctor, ==, partialMatch); pass salt into BlockKey; mix salt into first-block hash; widen mmStartInBlock to 64-bit.
  • Bindings: nanobind / pybind (executor) (cpp/tensorrt_llm/nanobind/executor/request.cpp, cpp/tensorrt_llm/pybind/executor/request.cpp, cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp, cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp): Expose cache_salt_id property/arg in Python bindings; extend ctor bindings and (de)serialization/state tuples to include cache_salt_id; add property accessors.
  • Python executor surface & worker (tensorrt_llm/executor/executor.py, tensorrt_llm/executor/request.py, tensorrt_llm/executor/worker.py, tensorrt_llm/_torch/pyexecutor/llm_request.py): Add optional cache_salt_id to GenerationRequest/generate_async signatures; store/forward cache_salt_id into worker and into C++ Request construction.
  • Inputs utilities & export (tensorrt_llm/inputs/utils.py, tensorrt_llm/inputs/__init__.py): Add get_cache_salt_id(cache_salt: str) -> int using default_hasher; re-export in inputs package.
  • Serve / OpenAI protocol & server (tensorrt_llm/serve/openai_protocol.py, tensorrt_llm/serve/openai_server.py): Add optional cache_salt: Optional[str] field to ChatCompletionRequest with validator (non-empty string); forward cache_salt into generate_async call.
  • Benchmarks (C++) (benchmarks/cpp/disaggServerBenchmark.cpp, benchmarks/cpp/gptManagerBenchmark.cpp): Append trailing std::nullopt for new cacheSaltID parameter when constructing texec::Request.
  • Tests (unit & integration) (cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp, tests/unittest/llmapi/apps/_test_openai_cache_salt.py, tests/integration/defs/test_e2e.py): Add unit test validating block reuse semantics with/without cache salt; add OpenAI-focused unit and integration tests exercising cache_salt behavior and metrics.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant OpenAI_Server as OpenAI Server
  participant LLM as BaseLLM.generate_async
  participant PyExec as Python Executor
  participant CppReq as C++ Request
  participant KV as KV Cache Manager

  Client->>OpenAI_Server: ChatCompletionRequest(cache_salt?)
  OpenAI_Server->>LLM: generate_async(..., cache_salt)
  LLM->>LLM: if cache_salt -> get_cache_salt_id() -> cache_salt_id
  LLM->>PyExec: generate_async(..., cache_salt_id)
  PyExec->>CppReq: construct Request(..., cacheSaltID=cache_salt_id)
  CppReq->>KV: buildBlockKeys(cacheSaltID)
  Note over KV: first-block hash mixes cacheSaltID when present
  KV-->>CppReq: allocate/reuse blocks (salted keys)
  CppReq-->>PyExec: generation output
  PyExec-->>LLM: result
  LLM-->>OpenAI_Server: response
  OpenAI_Server-->>Client: completion
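For the in-process LLM API path in the flow above, a minimal usage sketch (the import path follows the usual tensorrt_llm LLM API; the model path, prompts, and result handling are illustrative):

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/path/to/model")  # placeholder model path
params = SamplingParams(max_tokens=16)

# Requests sharing a salt may reuse each other's KV-cache blocks for a common prefix;
# a request with a different salt (or no salt) will not match those blocks.
fut_a = llm.generate_async("Shared system prompt ...", params, cache_salt="tenant-a")
fut_b = llm.generate_async("Shared system prompt ...", params, cache_salt="tenant-b")
print(fut_a.result().outputs[0].text)
print(fut_b.result().outputs[0].text)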

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Community want to contribute, Generic Runtime

Suggested reviewers

  • syuoni
  • lfr-0531
  • netanel-haber
  • Funatiq

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 57ea549 and eb52529.

📒 Files selected for processing (1)
  • tensorrt_llm/serve/openai_protocol.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/serve/openai_protocol.py
Contributor

@coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
cpp/include/tensorrt_llm/executor/executor.h (1)

675-703: Misaligned constructor signature between declaration and definition

The implementation in cpp/tensorrt_llm/executor/request.cpp does not exactly match the declaration in cpp/include/tensorrt_llm/executor/executor.h:

  • In the definition (lines 28–100), the 21st parameter is named
    std::optional<LogitsPostProcessor> logitslogitsPostProcessor
    but the header declares it as std::optional<LogitsPostProcessor> logitsPostProcessor.
  • The priority parameter’s type is float in the .cpp, whereas the header uses PriorityType priority. Even if PriorityType is an alias for float, the signature must use the same identifier.

Please update the definition to align exactly with the declaration:

• Change the parameter name from logitslogitsPostProcessor to logitsPostProcessor.
• Use PriorityType priority (not float) in the implementation.

Everything else—the placement of new optional parameters and the Python/Nanobind bindings for cache_salt_id—looks correct.

cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1)

75-79: Bug: optEncoderInputTokens engaged with nullptr when mEncoderTokens is absent

Current construction wraps a nullptr in an engaged std::optional, which can lead downstream code to dereference a null shared_ptr. Use nullopt when there is no encoder input.

-    auto const encoderInputTokens = mEncoderTokens.has_value()
-        ? std::make_shared<std::vector<TokenIdType>>(*mEncoderTokens.value().get())
-        : nullptr;
-    auto const optEncoderInputTokens = std::optional<std::shared_ptr<std::vector<TokenIdType>>>(encoderInputTokens);
+    std::optional<std::shared_ptr<std::vector<TokenIdType>>> optEncoderInputTokens =
+        mEncoderTokens.has_value()
+            ? std::make_optional(
+                  std::make_shared<std::vector<TokenIdType>>(*mEncoderTokens.value().get()))
+            : std::nullopt;
cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp (1)

78-130: Critical: Binding passes cacheSaltID without header support

The pybind bridge at
cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp (lines 78–130)
now appends mCacheSaltID to the tb::LlmRequest constructor call, but:

• In cpp/include/tensorrt_llm/batch_manager/llmRequest.h, the LlmRequest class declaration and its constructors do not list a cacheSaltID (or similarly named) parameter.

This mismatch will break the build and ABI.

Please update the LlmRequest constructor signature (and its implementation) in the header and source to include the cacheSaltID parameter—positioned after mContextPhaseParams (or in the correct spot relative to any defaulted parameters)—or remove the extra argument from the bridge.

cpp/tensorrt_llm/pybind/executor/request.cpp (1)

518-555: Pickle get/set state does not persist language_adapter_uid/allotted_time_ms/cache_salt_id.

requestGetstate/requestSetstate still serialize 33 entries and don’t carry the three trailing fields. That silently drops cache_salt_id (and potentially language_adapter_uid/allotted_time_ms) across pickle boundaries.

Recommend:

  • Append these fields to the tuple returned by getstate.
  • Make setstate backward compatible by accepting both the legacy 33-tuple and a new 36-tuple.

Proposed patch (minimal and backward compatible):

@@
-    auto requestGetstate = [](tle::Request const& self)
+    auto requestGetstate = [](tle::Request const& self)
     {
-        return py::make_tuple(self.getInputTokenIds(), self.getMaxTokens(), self.getStreaming(),
+        return py::make_tuple(self.getInputTokenIds(), self.getMaxTokens(), self.getStreaming(),
             self.getSamplingConfig(), self.getOutputConfig(), self.getEndId(), self.getPadId(), self.getPositionIds(),
             self.getBadWords(), self.getStopWords(), self.getEmbeddingBias(), self.getExternalDraftTokensConfig(),
             self.getPromptTuningConfig(), self.getMultimodalInput(), self.getMultimodalEmbedding(),
             self.getMropeConfig(), self.getLoraConfig(), self.getLookaheadConfig(), self.getKvCacheRetentionConfig(),
             self.getLogitsPostProcessorName(), self.getLogitsPostProcessor(), self.getEncoderInputTokenIds(),
             self.getClientId(), self.getReturnAllGeneratedTokens(), self.getPriority(), self.getRequestType(),
             self.getContextPhaseParams(), self.getEncoderInputFeatures(), self.getEncoderOutputLength(),
-            self.getCrossAttentionMask(), self.getEagleConfig(), self.getSkipCrossAttnBlocks(),
-            self.getGuidedDecodingParams());
+            self.getCrossAttentionMask(), self.getEagleConfig(), self.getSkipCrossAttnBlocks(),
+            self.getGuidedDecodingParams(),
+            self.getLanguageAdapterUid(), self.getAllottedTimeMs(), self.getCacheSaltID());
     };
-    auto requestSetstate = [](py::tuple const& state)
+    auto requestSetstate = [](py::tuple const& state)
     {
-        if (state.size() != 33)
+        if (state.size() != 33 && state.size() != 36)
         {
             throw std::runtime_error("Invalid Request state!");
         }
-        return std::make_unique<tle::Request>(state[0].cast<VecTokens>(), state[1].cast<SizeType32>(),
+        // Legacy 33-field pickle (no language_adapter_uid/allotted_time_ms/cache_salt_id)
+        if (state.size() == 33)
+        {
+            return std::make_unique<tle::Request>(state[0].cast<VecTokens>(), state[1].cast<SizeType32>(),
             state[2].cast<bool>(), state[3].cast<tle::SamplingConfig>(), state[4].cast<tle::OutputConfig>(),
             state[5].cast<std::optional<SizeType32>>(), state[6].cast<std::optional<SizeType32>>(),
             state[7].cast<std::optional<std::vector<SizeType32>>>(),
             state[8].cast<std::optional<std::list<VecTokens>>>(), state[9].cast<std::optional<std::list<VecTokens>>>(),
             state[10].cast<std::optional<Tensor>>(), state[11].cast<std::optional<tle::ExternalDraftTokensConfig>>(),
             state[12].cast<std::optional<tle::PromptTuningConfig>>(),
             state[13].cast<std::optional<tle::MultimodalInput>>(), state[14].cast<std::optional<Tensor>>(),
             state[15].cast<std::optional<tle::MropeConfig>>(), state[16].cast<std::optional<tle::LoraConfig>>(),
             state[17].cast<std::optional<tle::LookaheadDecodingConfig>>(),
             state[18].cast<std::optional<tle::KvCacheRetentionConfig>>(), state[19].cast<std::optional<std::string>>(),
             state[20].cast<std::optional<tle::LogitsPostProcessor>>(), state[21].cast<std::optional<VecTokens>>(),
             state[22].cast<std::optional<IdType>>(), state[23].cast<bool>(), state[24].cast<tle::PriorityType>(),
             state[25].cast<tle::RequestType>(), state[26].cast<std::optional<tle::ContextPhaseParams>>(),
             state[27].cast<std::optional<tle::Tensor>>(), state[28].cast<std::optional<SizeType32>>(),
-            state[29].cast<std::optional<tle::Tensor>>(), 1, state[30].cast<std::optional<tle::EagleConfig>>(),
-            state[31].cast<std::optional<tle::Tensor>>(), state[32].cast<std::optional<tle::GuidedDecodingParams>>());
+            state[29].cast<std::optional<tle::Tensor>>(), 1,
+            state[30].cast<std::optional<tle::EagleConfig>>(),
+            state[31].cast<std::optional<tle::Tensor>>(),
+            state[32].cast<std::optional<tle::GuidedDecodingParams>>(),
+            /*languageAdapterUid*/ std::nullopt,
+            /*allottedTimeMs*/ std::nullopt,
+            /*cacheSaltID*/ std::nullopt);
+        }
+        // New 36-field pickle (includes language_adapter_uid, allotted_time_ms, cache_salt_id)
+        return std::make_unique<tle::Request>(state[0].cast<VecTokens>(), state[1].cast<SizeType32>(),
+            state[2].cast<bool>(), state[3].cast<tle::SamplingConfig>(), state[4].cast<tle::OutputConfig>(),
+            state[5].cast<std::optional<SizeType32>>(), state[6].cast<std::optional<SizeType32>>(),
+            state[7].cast<std::optional<std::vector<SizeType32>>>(),
+            state[8].cast<std::optional<std::list<VecTokens>>>(), state[9].cast<std::optional<std::list<VecTokens>>>(),
+            state[10].cast<std::optional<Tensor>>(), state[11].cast<std::optional<tle::ExternalDraftTokensConfig>>(),
+            state[12].cast<std::optional<tle::PromptTuningConfig>>(),
+            state[13].cast<std::optional<tle::MultimodalInput>>(), state[14].cast<std::optional<Tensor>>(),
+            state[15].cast<std::optional<tle::MropeConfig>>(), state[16].cast<std::optional<tle::LoraConfig>>(),
+            state[17].cast<std::optional<tle::LookaheadDecodingConfig>>(),
+            state[18].cast<std::optional<tle::KvCacheRetentionConfig>>(), state[19].cast<std::optional<std::string>>(),
+            state[20].cast<std::optional<tle::LogitsPostProcessor>>(), state[21].cast<std::optional<VecTokens>>(),
+            state[22].cast<std::optional<IdType>>(), state[23].cast<bool>(), state[24].cast<tle::PriorityType>(),
+            state[25].cast<tle::RequestType>(), state[26].cast<std::optional<tle::ContextPhaseParams>>(),
+            state[27].cast<std::optional<tle::Tensor>>(), state[28].cast<std::optional<SizeType32>>(),
+            state[29].cast<std::optional<tle::Tensor>>(), 1,
+            state[30].cast<std::optional<tle::EagleConfig>>(),
+            state[31].cast<std::optional<tle::Tensor>>(),
+            state[32].cast<std::optional<tle::GuidedDecodingParams>>(),
+            state[33].cast<std::optional<tle::SizeType32>>(),       // language_adapter_uid
+            state[34].cast<std::optional<tle::MillisecondsType>>(), // allotted_time_ms
+            state[35].cast<std::optional<tle::CacheSaltIDType>>()   // cache_salt_id
+        );
     };
🧹 Nitpick comments (43)
cpp/include/tensorrt_llm/runtime/common.h (1)

47-47: CacheSaltIDType alias addition — LGTM; add brief Doxygen for intent

Good place for the alias alongside other ID types. Please add a short Doxygen comment to clarify semantics and intended range.

Apply this diff to document the alias:

-using CacheSaltIDType = std::uint64_t;
+//! \brief Per-request KV cache salt used to namespace KV reuse.
+//! Requests with different salts must not match for reuse.
+using CacheSaltIDType = std::uint64_t;
cpp/include/tensorrt_llm/executor/types.h (1)

61-61: Avoid type drift: re-export CacheSaltIDType from runtime

To ensure the executor stays in sync with the runtime and prevent future divergence, please update the alias in cpp/include/tensorrt_llm/executor/types.h:

• In executor/types.h (around line 61), replace:

using CacheSaltIDType = std::uint64_t;

with:

#include "tensorrt_llm/runtime/common.h"  // add near other includes

using CacheSaltIDType = tensorrt_llm::runtime::CacheSaltIDType;

• This aligns executor/types.h with other modules (e.g. batch_manager/llmRequest.h and kvCacheManager.h) that already re-export runtime::CacheSaltIDType.

• After applying, rerun:

rg -nP --glob '!**/build/**' '\busing\s+CacheSaltIDType\s*=' -C2

to confirm no remaining direct definitions of std::uint64_t.

tensorrt_llm/executor/request.py (1)

86-101: API extension is backward compatible; document and ensure end-to-end plumbing

The new cache_salt_id argument is appended, so callsites using positional args are unaffected. Make sure this is plumbed through the pybind/nanobind layer and into Request::Impl::setCacheSaltID, and add a docstring entry so downstream users understand purpose and bounds (uint64).

If helpful, I can add/update the Google-style docstring for GenerationRequest.init.
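For reference, a sketch of what that docstring entry could look like (the class below is an illustrative stub, not the shipped GenerationRequest):

from typing import List, Optional


class GenerationRequest:  # illustrative stub, not the real class
    def __init__(self,
                 prompt_token_ids: List[int],
                 cache_salt_id: Optional[int] = None):
        """Hold per-request generation inputs.

        Args:
            prompt_token_ids: Tokenized prompt for the request.
            cache_salt_id: Optional uint64 identifier that scopes KV-cache reuse;
                requests sharing the same value may reuse cached blocks, while
                None means unsalted (unrestricted) reuse. Defaults to None.
        """
        self.prompt_token_ids = prompt_token_ids
        self.cache_salt_id = cache_salt_id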

cpp/tensorrt_llm/executor/requestImpl.h (1)

301-305: Add brief Doxygen for new accessors

Small clarity boost for public API.

Apply this diff:

-    [[nodiscard]] std::optional<CacheSaltIDType> getCacheSaltID() const
+    //! \brief Per-request KV cache salt if present. Same value used to namespace KV-cache reuse.
+    [[nodiscard]] std::optional<CacheSaltIDType> getCacheSaltID() const
     {
         return mCacheSaltID;
     }
@@
-    void setCacheSaltID(CacheSaltIDType cacheSaltID)
+    //! \brief Set per-request KV cache salt (namespaces KV reuse). Idempotent overwrite.
+    void setCacheSaltID(CacheSaltIDType cacheSaltID)
     {
         mCacheSaltID = cacheSaltID;
     }

Also applies to: 480-484

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

118-119: Add tokens-based BlockKey constructor overload for cacheSaltID

The existing C++ constructors in kvCacheManager.h are:

  • explicit BlockKey(VecTokens const& tokens, std::optional<LoraTaskIdType> loraTaskId = std::nullopt)
  • explicit BlockKey(bool usesExtraIds, std::optional<LoraTaskIdType> loraTaskId, VecUniqueTokens uniqueTokens, std::vector<MmKey> extraKeys = {}, std::optional<CacheSaltIDType> cacheSaltID = std::nullopt)

All salting-sensitive call sites in kvCacheManager.cpp already invoke the full overload and correctly pass llmRequest.getCacheSaltID(). No other C++ call sites use the tokens-only constructor. To reduce boilerplate where callers start from a token vector, add a convenience overload that accepts an optional cacheSaltID directly:

Diff in cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:

 class BlockKey
 {
 public:
     // … existing members …

+    // Convenience ctor: tokens + optional Lora task + optional cache salt
+    explicit BlockKey(
+        VecTokens const& tokens,
+        std::optional<LoraTaskIdType> loraTaskId = std::nullopt,
+        std::optional<CacheSaltIDType> cacheSaltID = std::nullopt)
+        : loraTaskId{loraTaskId}
+        , cacheSaltID{cacheSaltID}
+    {
+        uniqueTokens.reserve(tokens.size());
+        for (auto const& token : tokens)
+        {
+            uniqueTokens.push_back(UniqueToken{token, 0});
+        }
+    }
 
     // … existing full ctor already handles cacheSaltID …
 };

• Verified that in kvCacheManager.cpp all salting-sensitive sites pass through the full ctor with getCacheSaltID()
• No additional C++ construction sites use the tokens-only ctor; Python code in serve/router.py still uses the tokens-only binding—update Python bindings if salt support is needed there

cpp/include/tensorrt_llm/executor/executor.h (4)

673-674: Clarify parameter doc: the API accepts an ID, not a string.

The docstring says “limit the kv cache reuse to the requests with the same string,” but the parameter is a numeric CacheSaltIDType. Suggest rewording to avoid confusion and to note that None/absence means “unsalted/unrestricted reuse.”

Apply this diff to tighten the wording:

-    /// @param cacheSaltID Salt ID for KV cache blocks to limit the kv cache reuse to the requests with the same string.
+    /// @param cacheSaltID Salt identifier for KV cache blocks. When set, KV cache reuse is limited to requests using
+    /// the same salt identifier. Leave std::nullopt to allow unrestricted reuse (unsalted).

750-751: Add accessor docs; consider naming consistency with existing Id/ID usage.

  • Please add a brief Doxygen comment for getCacheSaltID() clarifying semantics (std::nullopt means unsalted).
  • Naming is mixed across the codebase (getClientId vs getCacheSaltID). Consider aligning to “Id” for API consistency in future cleanup.

Apply this diff to document the accessor:

-    [[nodiscard]] std::optional<CacheSaltIDType> getCacheSaltID() const;
+    /// @brief Returns the KV cache salt identifier if set; std::nullopt indicates unsalted reuse.
+    [[nodiscard]] std::optional<CacheSaltIDType> getCacheSaltID() const;

786-787: Offer a way to clear the salt at runtime.

There’s a setter but no way to clear (unset) the salt after construction besides rebuilding the Request. Consider adding either setCacheSaltID(std::nullopt) overload or a clear/reset method.

Here’s a minimal additive API:

     void setCacheSaltID(CacheSaltIDType cacheSaltID);
+    /// @brief Clear the KV cache salt identifier (revert to unsalted).
+    void clearCacheSaltID();

And in the Impl/definition:

+void Request::clearCacheSaltID()
+{
+    mImpl->setCacheSaltID(std::nullopt);
+}

1-15: Update copyright year.

Header still lists 2022–2024. Project guidelines ask to use the current year in source headers. Update to include 2025 if that’s the project-wide convention.

tensorrt_llm/executor/executor.py (2)

125-126: Document and validate cache_salt_id; allow broader int types.

  • Please document cache_salt_id in the method docstring.
  • Add a small validation/conversion to handle numpy scalars and ensure the ID fits uint64 (0..2^64-1). This avoids surprises coming from client code.

Apply this diff:

@@
-    def generate_async(
+    def generate_async(
         self,
         prompt_token_ids: List[int],
         sampling_params: SamplingParams,
         query_token_ids: Optional[Union[torch.Tensor, np.ndarray, list]] = None,
         lora_request: Optional[LoRARequest] = None,
         prompt_adapter_request: Optional[PromptAdapterRequest] = None,
         streaming: bool = False,
         kv_cache_retention_config: Optional[KvCacheRetentionConfig] = None,
         disaggregated_params: Optional[DisaggregatedParams] = None,
         postproc_params: Optional[PostprocParams] = None,
         multimodal_params: Optional[MultimodalParams] = None,
         scheduling_params: Optional[SchedulingParams] = None,
-        cache_salt_id: Optional[int] = None,
+        cache_salt_id: Optional[int] = None,
     ) -> GenerationResult:
-        """Generate output for the given prompt token ids in the asynchronous mode.
+        """Generate output for the given prompt token ids in the asynchronous mode.
         Asynchronous generation accepts single prompt only.
+
+        Args:
+            cache_salt_id: Optional uint64 identifier to isolate KV-cache reuse. Requests sharing the
+                same salt id may reuse KV cache; different ids prevent reuse. None means unsalted.
         """
@@
+        if cache_salt_id is not None:
+            # Accept plain int, numpy scalar, or types convertible to int
+            cache_salt_id = int(cache_salt_id)
+            if cache_salt_id < 0 or cache_salt_id >= (1 << 64):
+                raise ValueError("cache_salt_id must be in [0, 2**64 - 1]")

149-151: Propagation LGTM; consider adding cache_salt_id to synchronous generate() for API parity.

generate() doesn’t expose cache_salt_id, so there’s no way to use salting via the sync path. Consider adding an optional parameter there and forwarding it to generate_async (batched case could accept a single int or a list aligned with prompts).

Proposed minimal change:

@@ def generate(
-        disaggregated_params: Optional[DisaggregatedParams] = None,
+        disaggregated_params: Optional[DisaggregatedParams] = None,
+        cache_salt_id: Optional[Union[int, List[int]]] = None,
 ) -> Union[GenerationResult, List[GenerationResult]]:
@@
-            future = self.generate_async(
+            # Support per-request salting when batched (single int or a list matching prompts)
+            _salt = (cache_salt_id[i] if isinstance(cache_salt_id, list)
+                     else cache_salt_id)
+            future = self.generate_async(
                 p,
                 sampling_params=sp,
                 query_token_ids=query_token_ids,
                 lora_request=lora_req,
                 prompt_adapter_request=pa_req,
                 streaming=False,
-                disaggregated_params=disaggregated_params)
+                disaggregated_params=disaggregated_params,
+                cache_salt_id=_salt)
tensorrt_llm/inputs/utils.py (2)

605-610: Add docstring and minor guardrails.

This public API should have a short docstring per guidelines. Also, defensively strip the string to avoid accidental whitespace differences.

Apply this diff:

-def get_cache_salt_id(cache_salt: str) -> int:
-    b = cache_salt.encode("utf-8")
+def get_cache_salt_id(cache_salt: str) -> int:
+    """Return a 64-bit integer ID derived from a salt string.
+
+    The same input string deterministically maps to the same uint64. Different strings are unlikely
+    to collide. Use this to isolate KV-cache reuse across requests sharing the same salt.
+    """
+    b = cache_salt.strip().encode("utf-8")
     h = default_hasher(b).digest(length=8)
     return int.from_bytes(h, "little", signed=False)

Logic is otherwise sound and efficient.


1-15: Add header per project convention.

Python files typically include the NVIDIA copyright header. Consider adding it here.

tensorrt_llm/executor/worker.py (1)

502-504: Propagate cache_salt_id correctly; optional uint64 validation in Python

Propagation of cache_salt_id=request.cache_salt_id is already consistent:

  • All Python call sites to executor.Request include the new cache_salt_id keyword.
  • The nanobind/pybind layer exposes it via .def_property("cache_salt_id", &tle::Request::getCacheSaltID, &tle::Request::setCacheSaltID).

To guard against subtle runtime errors (e.g. passing a non‐int, negative value, or an integer > 2⁶⁴–1) before hitting the C++/nanobind boundary, you may optionally coerce and range‐check locally:

-                type=request_type,
-                cache_salt_id=request.cache_salt_id)
+                type=request_type,
+                cache_salt_id=(
+                    None if request.cache_salt_id is None
+                    else (lambda _s: (_s if (isinstance(_s, int) and 0 <= _s <= (1 << 64) - 1)
+                                      else (_ for _ in ()).throw(ValueError("cache_salt_id must be uint64"))))
+                         (int(request.cache_salt_id))
+                ))

To double‐check consistency across the codebase, you can run:

rg -nP --type=py -C2 'executor\.Request\([^)]*cache_salt_id\s*='
rg -nP --type=cpp -C3 'cacheSaltID|cache_salt_id|CacheSaltIDType' cpp/ tensorrt_llm/
benchmarks/cpp/disaggServerBenchmark.cpp (1)

546-547: Tidy optional pass-through and document new trailing cacheSaltID

The ternary is redundant for std::optional. Pass the optional directly; keep the explicit std::nullopt for the new cacheSaltID.

-            encoderInputTokenIds.has_value() ? encoderInputTokenIds : std::nullopt,
-            std::nullopt);   // cacheSaltID
+            encoderInputTokenIds,
+            std::nullopt);   // cacheSaltID

Architecture note: consider adding a CLI flag (e.g., --cache_salt_id) wired into these requests so benchmarks can exercise salted vs. unsalted reuse behavior end-to-end.

benchmarks/cpp/gptManagerBenchmark.cpp (1)

840-842: Simplify optional and add explicit cacheSaltID placeholder

Same cleanup applies here; the ternary is unnecessary when forwarding an std::optional.

-        encoderInputTokenIds.has_value() ? encoderInputTokenIds : std::nullopt,
-        std::nullopt);   // cacheSaltID
+        encoderInputTokenIds,
+        std::nullopt);   // cacheSaltID
tensorrt_llm/serve/openai_server.py (1)

50-51: Typo in formatter directive.

The comment reads “# yapf: enale”. Consider fixing to “enable” or remove if unused.

-# yapf: enale
+# yapf: enable
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (2)

54-54: Update parameter-count comment to match reality.

Constructor now has 50 parameters (including contextPhaseParams and cacheSaltID) but the comment still says 49.

-    // 49 parameters
+    // 50 parameters

88-90: Forwarding of cacheSaltID is correct; consider keeping comments and guards consistent.

  • The new std::optional cacheSaltID parameter and its forwarding into Base(...) look correct and maintain append-only ordering. Good.
  • Minor: coding guidelines require include guards for headers; this file uses #pragma once. If the codebase prefers #pragma once, ignore; otherwise consider adding TRTLLM_LLMREQUEST_H and removing #pragma once in a follow-up.

Also applies to: 150-152

tests/integration/defs/test_e2e.py (1)

1453-1463: Good targeted integration for cache-salt behavior.

Installing example requirements and running the dedicated _test_openai_cache_salt.py mirrors existing serve tests. LGTM.

If install time becomes an issue, consider factoring the pip install into a session-scoped fixture reused by related OpenAI serve tests to avoid repeated installs.
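A rough sketch of such a session-scoped fixture (the requirements path and fixture name are placeholders; the real suite would derive the path from its existing defs helpers):

import subprocess
import sys

import pytest


@pytest.fixture(scope="session")
def openai_example_requirements():
    """Install the OpenAI serve example requirements once per test session."""
    requirements = "examples/serve/requirements.txt"  # placeholder path
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "-r", requirements])
    yield


def test_openai_cache_salt(openai_example_requirements):
    ...  # tests that need the requirements simply depend on the fixture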

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)

1778-1784: Make assertions independent of a shared numBlocks variable across cases.

In Cases 2–4 you assert against blockManager.getNumAllocatedBlocks() using numBlocks computed in Case 1. While it currently works (same token lengths), it couples cases and can lead to fragile tests if sequence lengths diverge. Recompute per case or compute local constants.

Apply the following diffs to compute per-case block counts and avoid cross-case coupling:

@@
-    llmRequest1->addNewToken(3, beamIdx);
-    llmRequest1->addNewToken(4, beamIdx);
-    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks);
-    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks);
+    llmRequest1->addNewToken(3, beamIdx);
+    llmRequest1->addNewToken(4, beamIdx);
+    numTokens = llmRequest1->getNumTokens(beamIdx);
+    auto const numBlocks1 = tc::ceilDiv(numTokens, tokensPerBlock);
+    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks1);
+    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks1);
@@
-    llmRequest2->addNewToken(3, beamIdx);
-    llmRequest2->addNewToken(4, beamIdx);
-    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks);
-    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks);
+    llmRequest2->addNewToken(3, beamIdx);
+    llmRequest2->addNewToken(4, beamIdx);
+    numTokens = llmRequest2->getNumTokens(beamIdx);
+    auto const numBlocks2 = tc::ceilDiv(numTokens, tokensPerBlock);
+    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks2);
+    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks2);
@@
-    llmRequest3->addNewToken(5, beamIdx);
-    llmRequest3->addNewToken(6, beamIdx);
-    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks);
-    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks);
+    llmRequest3->addNewToken(5, beamIdx);
+    llmRequest3->addNewToken(6, beamIdx);
+    numTokens = llmRequest3->getNumTokens(beamIdx);
+    auto const numBlocks3 = tc::ceilDiv(numTokens, tokensPerBlock);
+    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks3);
+    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks3);
@@
-    llmRequest4->addNewToken(7, beamIdx);
-    numTokens = llmRequest4->getNumTokens(beamIdx);
-    numBlocks = tc::ceilDiv(numTokens, tokensPerBlock);
-    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks * 2);
-    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks * 2);
+    llmRequest4->addNewToken(7, beamIdx);
+    auto const numBlocksCase3 = tc::ceilDiv(llmRequest3->getNumTokens(beamIdx), tokensPerBlock);
+    numTokens = llmRequest4->getNumTokens(beamIdx);
+    auto const numBlocksCase5 = tc::ceilDiv(numTokens, tokensPerBlock);
+    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocksCase3 + numBlocksCase5);
+    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - (numBlocksCase3 + numBlocksCase5));

Also applies to: 1811-1815, 1843-1847, 1869-1874


1736-1743: Optional: Add EXPECT_NO_THROW wrappers around addSequence for consistency and clearer failures.

Other tests in this file occasionally wrap state-changing calls in EXPECT_NO_THROW. Mirroring that here keeps failure points crisp without affecting behavior.

Example for Case 1:

-    blockManager.addSequence(seq0, promptLen0, numContextBlocks0, *llmRequest0, maxAttentionWindow);
+    EXPECT_NO_THROW(blockManager.addSequence(seq0, promptLen0, numContextBlocks0, *llmRequest0, maxAttentionWindow));

Repeat similarly for seq1–seq4 additions in this test.

Also applies to: 1773-1779, 1804-1810, 1836-1842, 1863-1868


1758-1769: Consider asserting reuse metrics to complement block ID checks.

If available in this scope, asserting llmRequestX->getReusedBlocksPerRequest() or getKVCacheHitRatePerRequest() would make the test resilient to incidental ID allocation differences while still validating the core reuse/no-reuse semantics governed by cache_salt_id.

tensorrt_llm/serve/openai_protocol.py (3)

554-560: Clarify description and mark as prototype in schema.

The current description is slightly awkward and doesn’t advertise feature maturity. Recommend clearer wording and tagging the field as “prototype” in the schema to set expectations.

Apply this diff:

-    cache_salt: Optional[str] = Field(
-        default=None,
-        description=
-        ("If specified, KV cache will be salted with the provided string "
-         "to limit the kv cache reuse on with the requests having the same string."
-         ))
+    cache_salt: Optional[str] = Field(
+        default=None,
+        description=(
+            "Optional: scope KV-cache reuse to requests that provide the same salt string. "
+            "Use to isolate prompts or tenants and mitigate cross-request cache reuse."
+        ),
+        json_schema_extra={"status": "prototype"},
+    )

640-650: Trim and validate non-empty salt strings; normalize input in-place.

Accepting whitespace-only strings as valid salts is likely unintended and can lead to hard-to-debug behavior. Normalize with strip(), then validate non-empty.

Apply this diff:

 @model_validator(mode="before")
 @classmethod
 def check_cache_salt_support(cls, data):
-    if data.get("cache_salt") is not None:
-        if not isinstance(data["cache_salt"],
-                          str) or not data["cache_salt"]:
-            raise ValueError(
-                "Parameter 'cache_salt' must be a non-empty string if provided."
-            )
+    cache_salt = data.get("cache_salt")
+    if cache_salt is not None:
+        if not isinstance(cache_salt, str):
+            raise ValueError("Parameter 'cache_salt' must be a string if provided.")
+        cache_salt = cache_salt.strip()
+        if not cache_salt:
+            raise ValueError("Parameter 'cache_salt' must be a non-empty string after trimming.")
+        data["cache_salt"] = cache_salt
     return data

194-257: Ensure cache_salt parity between CompletionRequest and ChatCompletionRequest

CompletionRequest (in tensorrt_llm/serve/openai_protocol.py, lines 194–257) currently lacks the cache_salt field and the corresponding validation logic that ChatCompletionRequest provides (lines 554–647). To avoid confusion and maintain a consistent API surface, please choose one of the following:

• Add cache_salt to CompletionRequest, mirroring ChatCompletionRequest:

cache_salt: Optional[str] = Field(
    default=None,
    description=(
        "If specified, KV cache will be salted with the provided string "
        "to limit the kv cache reuse to the requests having the same string."
    ),
)

and register the same @root_validator(pre=True)–based check_cache_salt_support to enforce non‐empty strings.

• If salting is intentionally chat-only, add a brief note to CompletionRequest’s doc comments explaining why cache_salt isn’t supported for the Completions API.

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)

197-197: Expose a short docstring for cache_salt_id property.

Minor ergonomics: adding a docstring helps discoverability in Python help() and IDEs.

Apply this diff:

-        .def_property_readonly("cache_salt_id", &GenLlmReq::getCacheSaltID)
+        .def_property_readonly(
+            "cache_salt_id",
+            &GenLlmReq::getCacheSaltID,
+            "Optional uint64 salt that scopes KV-cache reuse; None means no salt.")
tests/unittest/llmapi/apps/_test_openai_cache_salt.py (5)

1-1: Add NVIDIA copyright header (2025) to comply with repo guidelines.

All source files must carry the current-year NVIDIA header. Please prepend the SPDX header before the module docstring.

+ # SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
 """Test cache_salt functionality in OpenAI API to ensure it prevents cache reuse"""

21-47: Avoid fixed temp filename to prevent collisions in parallel test runs.

Using a constant filename in the system temp directory can race when pytest runs in parallel. Prefer mkstemp/NamedTemporaryFile.

-    temp_dir = tempfile.gettempdir()
-    temp_file_path = os.path.join(temp_dir, "cache_salt_test_options.yaml")
+    fd, temp_file_path = tempfile.mkstemp(prefix="cache_salt_test_options_", suffix=".yaml")
+    os.close(fd)

140-162: Relax assertion to tolerate metric sampling noise.

Hit-rate snapshots may not strictly increase step-to-step. Use a tiny epsilon and include values in the failure message.

-    assert hit_rate_after_reuse >= initial_hit_rate, \
-        "Cache hit rate should increase when reusing cache without salt"
+    eps = 1e-6
+    assert hit_rate_after_reuse + eps >= initial_hit_rate, \
+        f"Expected non-decreasing hit rate on reuse; got initial={initial_hit_rate:.4f}, after={hit_rate_after_reuse:.4f}"

176-181: Prefer checking counter deltas over global hit-rate for salt/no-salt behavior.

A lower hit-rate is a proxy and can flake. More robust: compare deltas of reusedBlocks/missedBlocks before/after requests (expect more “missed” and no “reused” when salt changes).

I can refactor the test to record (reusedBlocks, missedBlocks) before/after each call and assert on those deltas. Want me to push a patch?
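A sketch of that delta-based approach (the /metrics endpoint shape, the reusedBlocks/missedBlocks field names, and the send_request helper are assumptions for illustration, not the shipped test):

import requests


def kv_metrics(base_url: str):
    """Return (reused_blocks, missed_blocks) from the server metrics; field names assumed."""
    stats = requests.get(f"{base_url}/metrics").json()
    return stats["reusedBlocks"], stats["missedBlocks"]


def assert_salt_prevents_reuse(base_url, send_request):
    reused_before, missed_before = kv_metrics(base_url)
    send_request(cache_salt="salt-A")  # warm the cache under one salt
    send_request(cache_salt="salt-B")  # identical prompt, different salt
    reused_after, missed_after = kv_metrics(base_url)
    # With a different salt, the second request should miss rather than reuse.
    assert reused_after == reused_before
    assert missed_after > missed_before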


223-231: Catch a specific OpenAI exception for empty salt validation.

Catching Exception is too broad; assert protocol semantics explicitly.

-    with pytest.raises(Exception) as exc_info:
+    with pytest.raises(openai.BadRequestError) as exc_info:
         client.chat.completions.create(
             model=model_name,
             messages=messages,
             max_tokens=max_tokens,
             extra_body={"cache_salt": ""}  # Empty string should be rejected
         )
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (4)

100-101: Public alias CacheSaltIDType: OK; add brief Doxygen for semantics.

Add a short comment clarifying what the salt represents (e.g., per-tenant or per-user cache domain) and its valid range/type as defined in runtime.


1770-1774: Getter getCacheSaltID(): add Doxygen.

Expose intended use in hashing/block-key derivation; note that absence implies unsalted behavior.


2053-2055: mCacheSaltID member: consider privacy and documentation.

Keeping it protected is fine for derived access. Recommend documenting lifecycle (e.g., copied to children via createChildRequest if applicable) and thread-safety expectations.

If child requests inherit salt, confirm in LlmRequest::createChildRequest implementation or tests.


1-3: Update header copyright year and consider include guards per guidelines.

File header still says 2022-2024; please bump to 2025. Repo guidelines also prefer explicit include guards (TRTLLM_BATCH_MANAGER_LLMREQUEST_H) over pragma once.

Would you like me to generate a guard template consistent with the project’s naming?

cpp/tensorrt_llm/executor/request.cpp (1)

422-425: New mutator setCacheSaltID — OK; minor style nit.

Delegation is consistent. Optional nit: most codebases avoid return expr; in void functions; if you prefer, change to a plain call for readability.

-void Request::setCacheSaltID(CacheSaltIDType cacheSaltID)
-{
-    return mImpl->setCacheSaltID(cacheSaltID);
-}
+void Request::setCacheSaltID(CacheSaltIDType cacheSaltID)
+{
+    mImpl->setCacheSaltID(cacheSaltID);
+}
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)

134-136: Widened multimodal start offset to 64-bit — ensure MmKey’s second type matches.

mmStartInBlock is now uint64_t and later hashed as 64-bit. Make sure the MmKey alias (and any containers) was also updated to use uint64_t as the second element to avoid truncation when emplacing. The local comment still says SizeType32 — please update it for accuracy.

-    std::vector<MmKey> extraKeys; // MmKey = std::pair<std::array<uint8_t, 32>, SizeType32>
+    std::vector<MmKey> extraKeys; // MmKey = std::pair<std::array<uint8_t, 32>, uint64_t>

170-178: Salt mixed into root-block hash only — design ACK; document and test.

Mixing the salt only when parentHash == 0 is sufficient, as descendants derive from the salted root hash. Two asks:

  • Document this in BlockKeyHasher’s header-level comment for future maintainers.
  • Ensure tests cover “same tokens, different salt → no reuse” and “same tokens, same salt → reuse,” including partial-reuse scenarios.

The implementation looks sound.
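To make the design explicit, a toy Python sketch of the chaining idea (this mirrors the intent only, not the actual C++ BlockKeyHasher):

import hashlib


def block_hash(tokens, parent_hash=0, cache_salt_id=None):
    """Toy block-key hash: the salt is mixed in only for the root block (parent_hash == 0)."""
    h = hashlib.sha256()
    h.update(parent_hash.to_bytes(32, "little"))
    if parent_hash == 0 and cache_salt_id is not None:
        h.update(cache_salt_id.to_bytes(8, "little"))
    for token in tokens:
        h.update(int(token).to_bytes(4, "little"))
    return int.from_bytes(h.digest(), "little")


# Descendant blocks chain on the (already salted) parent hash, so every block in a
# salted sequence differs from its unsalted counterpart even though only the root mixes the salt.
root_salted = block_hash([1, 2, 3], cache_salt_id=42)
root_plain = block_hash([1, 2, 3])
child_salted = block_hash([4, 5, 6], parent_hash=root_salted)
child_plain = block_hash([4, 5, 6], parent_hash=root_plain)
assert root_salted != root_plain and child_salted != child_plain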

cpp/tensorrt_llm/pybind/batch_manager/llmRequest.h (1)

55-90: Constructor extended with cacheSaltID — verify bindings parity and header defaults.

Forwarding cacheSaltID to Base is correct. Please verify:

  • pybind and nanobind call sites include the new kwarg at the tail to keep backward compatibility (looks like they do).
  • The underlying C++ Base constructors have defaulted trailing parameters so older code paths remain callable.

Also, consider updating the header’s SPDX year range to include 2025 to match project guidance.

- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tensorrt_llm/llmapi/llm.py (3)

319-320: Public API: add cache_salt to generate_async — nice; consider parity in generate().

Adding cache_salt to generate_async is great. For parity and ergonomics, consider also threading a cache_salt (or per-item list) through generate() to avoid forcing users into the async path for salting.


343-346: Derive cache_salt_id once — OK; add a small guard for empty strings?

Current logic treats "" as a valid salt. If you’d prefer “empty string = no salt,” add a truthy check before hashing. Otherwise, it’s fine as-is.

-        cache_salt_id = get_cache_salt_id(
-            cache_salt) if cache_salt is not None else None
+        cache_salt_id = get_cache_salt_id(cache_salt) if cache_salt else None

334-335: Docstring line too long (ruff E501) — wrap for 120 cols.

Reflow the new line to satisfy the project’s 120-col wrap.

-            cache_salt (str, optional): If specified, KV cache will be salted with the provided string to limit the kv cache reuse to the requests with the same string. Defaults to None.
+            cache_salt (str, optional): If specified, KV cache will be salted with
+                the provided string to limit KV-cache reuse to requests with the
+                same string. Defaults to None.

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
cpp/tensorrt_llm/nanobind/executor/request.cpp (2)

579-605: Backward-incompatible pickle shape and ctor arg mismatch; fix setstate and restore compatibility.

  • The size check forces 34, breaking older pickles.
  • The placement-new omits languageAdapterUid and allottedTimeMs and passes cache_salt_id in their slot, which is a constructor arity/order mismatch and will not bind to the intended overload.

Make __setstate__ accept 33 (legacy), 34 (intermediate: only cache_salt_id), and 35 (current: allotted_time_ms + cache_salt_id), and pass std::nullopt for languageAdapterUid, then the extracted allottedTimeMs and cacheSaltID in order.

Apply this diff:

-    auto requestSetstate = [](tle::Request& self, nb::tuple const& state)
+    auto requestSetstate = [](tle::Request& self, nb::tuple const& state)
     {
-        if (state.size() != 34)
+        if (state.size() != 33 && state.size() != 34 && state.size() != 35)
         {
             throw std::runtime_error("Invalid Request state!");
         }
+        // Backward compatibility:
+        // 33: legacy (no allotted_time_ms, no cache_salt_id)
+        // 34: intermediate (cache_salt_id only)
+        // 35: current (allotted_time_ms and cache_salt_id)
+        std::optional<tle::MillisecondsType> allottedTimeMs = std::nullopt;
+        std::optional<tle::CacheSaltIDType> cacheSaltID = std::nullopt;
+        if (state.size() == 34)
+        {
+            cacheSaltID = nb::cast<std::optional<tle::CacheSaltIDType>>(state[33]);
+        }
+        else if (state.size() == 35)
+        {
+            allottedTimeMs = nb::cast<std::optional<tle::MillisecondsType>>(state[33]);
+            cacheSaltID = nb::cast<std::optional<tle::CacheSaltIDType>>(state[34]);
+        }
         new (&self) tle::Request(nb::cast<VecTokens>(state[0]), nb::cast<SizeType32>(state[1]),
             nb::cast<bool>(state[2]), nb::cast<tle::SamplingConfig>(state[3]), nb::cast<tle::OutputConfig>(state[4]),
             nb::cast<std::optional<SizeType32>>(state[5]), nb::cast<std::optional<SizeType32>>(state[6]),
             nb::cast<std::optional<std::vector<SizeType32>>>(state[7]),
             nb::cast<std::optional<std::list<VecTokens>>>(state[8]),
             nb::cast<std::optional<std::list<VecTokens>>>(state[9]), nb::cast<std::optional<Tensor>>(state[10]),
             nb::cast<std::optional<tle::ExternalDraftTokensConfig>>(state[11]),
             nb::cast<std::optional<tle::PromptTuningConfig>>(state[12]),
             nb::cast<std::optional<tle::MultimodalInput>>(state[13]), nb::cast<std::optional<Tensor>>(state[14]),
             nb::cast<std::optional<tle::MropeConfig>>(state[15]), nb::cast<std::optional<tle::LoraConfig>>(state[16]),
             nb::cast<std::optional<tle::LookaheadDecodingConfig>>(state[17]),
             nb::cast<std::optional<tle::KvCacheRetentionConfig>>(state[18]),
             nb::cast<std::optional<std::string>>(state[19]),
             nb::cast<std::optional<tle::LogitsPostProcessor>>(state[20]), nb::cast<std::optional<VecTokens>>(state[21]),
             nb::cast<std::optional<IdType>>(state[22]), nb::cast<bool>(state[23]),
             nb::cast<tle::PriorityType>(state[24]), nb::cast<tle::RequestType>(state[25]),
             nb::cast<std::optional<tle::ContextPhaseParams>>(state[26]),
             nb::cast<std::optional<tle::Tensor>>(state[27]), nb::cast<std::optional<SizeType32>>(state[28]),
             nb::cast<std::optional<tle::Tensor>>(state[29]), 1, nb::cast<std::optional<tle::EagleConfig>>(state[30]),
-            nb::cast<std::optional<tle::Tensor>>(state[31]),
-            nb::cast<std::optional<tle::GuidedDecodingParams>>(state[32]),
-            nb::cast<std::optional<tle::CacheSaltIDType>>(state[33]));
+            nb::cast<std::optional<tle::Tensor>>(state[31]),
+            nb::cast<std::optional<tle::GuidedDecodingParams>>(state[32]),
+            std::nullopt,                  // languageAdapterUid not serialized today
+            allottedTimeMs,
+            cacheSaltID);
     };

575-576: Allotted time is not serialized; pickling drops allotted_time_ms.

__getstate__ appends cache_salt_id but omits allotted_time_ms, so a Request pickled with a non-null allotted time will silently lose it upon unpickling. Append getAllottedTimeMs() to the state for persistence.

Apply this diff:

-            self.getCrossAttentionMask(), self.getEagleConfig(), self.getSkipCrossAttnBlocks(),
-            self.getGuidedDecodingParams(), self.getCacheSaltID());
+            self.getCrossAttentionMask(), self.getEagleConfig(), self.getSkipCrossAttnBlocks(),
+            self.getGuidedDecodingParams(),
+            self.getAllottedTimeMs(),          // NEW: preserve allotted_time_ms
+            self.getCacheSaltID());            // NEW: preserve cache_salt_id
🧹 Nitpick comments (1)
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)

684-686: New kw-only args are exposed clearly; consider documenting persistence behavior.

language_adapter_uid, allotted_time_ms, and cache_salt_id are now constructible from Python. Note that only allotted_time_ms and cache_salt_id are intended to be pickled; language_adapter_uid is not. Please confirm this is intentional and document it in the Python API docs to avoid surprises.
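If that behavior is documented, a small round-trip check could accompany it; a sketch (the bindings module path and constructor kwargs are assumed from the bindings above):

import pickle

from tensorrt_llm.bindings import executor as tllm  # assumed module path for the binding layer

req = tllm.Request(input_token_ids=[1, 2, 3], max_tokens=8, cache_salt_id=1234)
restored = pickle.loads(pickle.dumps(req))
# cache_salt_id (and allotted_time_ms once serialized) should survive the round trip;
# language_adapter_uid is intentionally not pickled per the note above.
assert restored.cache_salt_id == 1234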

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 66c4dd9 and 4d3fa49.

📒 Files selected for processing (4)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (3 hunks)
  • cpp/tensorrt_llm/nanobind/executor/request.cpp (5 hunks)
  • cpp/tensorrt_llm/pybind/executor/request.cpp (5 hunks)
  • tensorrt_llm/inputs/utils.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/inputs/utils.py
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/pybind/executor/request.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/nanobind/executor/request.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/nanobind/executor/request.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/nanobind/executor/request.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/nanobind/executor/request.cpp
🔇 Additional comments (2)
cpp/tensorrt_llm/nanobind/executor/request.cpp (2)

644-646: Ctor signature addition looks good and consistent with Python exposure.

The trailing optional parameters languageAdapterUid, allottedTimeMs, and cacheSaltID are ordered correctly and, because they are exposed as kw-only in Python, existing positional arguments remain stable. No further action needed.


728-729: Property exposure for cache_salt_id is correct.

RW property maps cleanly to the new accessors. Once __getstate__/__setstate__ are fixed, round-tripping via pickle should preserve this value.

@chang-l
Copy link
Collaborator Author

chang-l commented Aug 22, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16229 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (9)
cpp/include/tensorrt_llm/runtime/common.h (1)

47-47: Avoid duplicating CacheSaltIDType across namespaces

There is a parallel alias in executor/types.h. To prevent drift and include-order surprises, consider centralizing the typedef in one header (e.g., runtime/common.h) and aliasing from executor (using CacheSaltIDType = runtime::CacheSaltIDType;).

-// in cpp/include/tensorrt_llm/executor/types.h
-using CacheSaltIDType = std::uint64_t;
+using CacheSaltIDType = tensorrt_llm::runtime::CacheSaltIDType;
cpp/tensorrt_llm/executor/serialization.cpp (2)

716-726: Outdated parameter count comment

Comment says “35 parameters” but an extra argument was added. Either update the count or remove the comment to avoid future drift.

-    // 35 parameters
+    // 36 parameters

714-726: Serialization parity verified

The cacheSaltID field is correctly included at the end of both RequestImpl::serialize() and RequestImpl::serializedSize() via the visitMembers invocation (see the final lambda(mCacheSaltID); in requestImpl.h), and it is deserialized last in Serialization::deserializeRequest (lines 714–726 in serialization.cpp). No misalignment between writer and reader was found.

Optional improvement:

  • If you require cross-version IPC resilience, consider adding a minor version tag to the stream or performing a feature-flagged read (e.g. peeking for EOF before attempting to deserialize cacheSaltID).
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)

292-293: Document cache_salt_id in nanobind bindings

Parity between the nanobind and pybind constructors is correct: both take a trailing cache_salt_id parameter defaulting to nullopt/None.

• File: cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
– At the .def("__init__", … nb::arg("cache_salt_id") = std::nullopt) invocation (around line 359), add a short docstring explaining that cache_salt_id implements per-request KV cache salting and that None means “no salt.”

tensorrt_llm/executor/executor.py (1)

125-126: New cache_salt_id parameter — good, but update the docstring

Add a short docstring entry explaining its purpose and expected type/range (uint64). This improves discoverability for downstream users.

     def generate_async(
         self,
         prompt_token_ids: List[int],
         sampling_params: SamplingParams,
         query_token_ids: Optional[Union[torch.Tensor, np.ndarray, list]] = None,
         lora_request: Optional[LoRARequest] = None,
         prompt_adapter_request: Optional[PromptAdapterRequest] = None,
         streaming: bool = False,
         kv_cache_retention_config: Optional[KvCacheRetentionConfig] = None,
         disaggregated_params: Optional[DisaggregatedParams] = None,
         postproc_params: Optional[PostprocParams] = None,
         multimodal_params: Optional[MultimodalParams] = None,
         scheduling_params: Optional[SchedulingParams] = None,
-        cache_salt_id: Optional[int] = None,
+        cache_salt_id: Optional[int] = None,
     ) -> GenerationResult:
-        """Generate output for the given prompt token ids in the asynchronous mode.
-        Asynchronous generation accepts single prompt only.
-        """
+        """Generate output for the given prompt token ids in the asynchronous mode.
+        Asynchronous generation accepts single prompt only.
+
+        Args:
+            cache_salt_id: Optional 64-bit integer used to salt KV-cache keys for this request.
+                Use the same salt to enable reuse across trusted requests; use different salts to isolate reuse.
+                When None, no additional salting is applied.
+        """
tensorrt_llm/llmapi/llm.py (4)

330-331: Expose cache_salt in synchronous generate() for API parity

generate_async() now accepts cache_salt, but generate() does not pass it through. This makes KV-cache salting inaccessible via the sync API and diverges from the async behavior. Suggest adding an optional cache_salt (scalar or per-request sequence) to generate() and forwarding it per item, mirroring how other per-request params are handled.

Apply this diff to add the parameter, docstring, and pass-through:

@@
     def generate(
         self,
         inputs: Union[PromptInputs, Sequence[PromptInputs]],
         sampling_params: Optional[Union[SamplingParams,
                                         List[SamplingParams]]] = None,
         use_tqdm: bool = True,
         lora_request: Optional[Union[LoRARequest,
                                      Sequence[LoRARequest]]] = None,
         prompt_adapter_request: Optional[Union[
             PromptAdapterRequest, Sequence[PromptAdapterRequest]]] = None,
         kv_cache_retention_config: Optional[Union[
             KvCacheRetentionConfig, Sequence[KvCacheRetentionConfig]]] = None,
         disaggregated_params: Optional[Union[
             DisaggregatedParams, Sequence[DisaggregatedParams]]] = None,
         scheduling_params: Optional[Union[SchedulingParams,
                                           List[SchedulingParams]]] = None,
+        cache_salt: Optional[Union[str, Sequence[str]]] = None,
     ) -> Union[RequestOutput, List[RequestOutput]]:
@@
             scheduling_params (tensorrt_llm.scheduling_params.SchedulingParams, List[tensorrt_llm.scheduling_params.SchedulingParams], optional):
                 Scheduling parameters. Defaults to None.
+            cache_salt (str or Sequence[str], optional):
+                If specified, salts the KV cache so reuse is limited to requests that use the same salt.
+                For batched inputs, pass a sequence of salts aligned with the batch. Defaults to None.
@@
         for i, request_inputs in enumerate(inputs):
             future = self.generate_async(
                 request_inputs,
                 sampling_params=_item_at(sampling_params, i),
                 lora_request=_item_at(lora_request, i),
                 prompt_adapter_request=_item_at(prompt_adapter_request, i),
                 kv_cache_retention_config=_item_at(kv_cache_retention_config,
                                                    i),
                 disaggregated_params=_item_at(disaggregated_params, i),
                 scheduling_params=_item_at(scheduling_params, i),
-                streaming=False)
+                streaming=False,
+                cache_salt=_item_at(cache_salt, i))

Also applies to: 238-256, 258-276, 293-305
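
If the pass-through above is adopted, batched usage could look like the following sketch (model path is illustrative; the cache_salt keyword only exists with this suggestion applied):

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(
    ["What is the capital of France?", "Summarize the attention mechanism."],
    sampling_params=SamplingParams(max_tokens=64),
    cache_salt=["tenant-a", "tenant-b"],  # per-request salts aligned with the batch
)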


345-345: Wrap docstring line to satisfy E501 (<=120 chars)

Line length exceeds the configured limit; wrap for readability and to keep linters quiet.

-            cache_salt (str, optional): If specified, KV cache will be salted with the provided string to limit the kv cache reuse to the requests with the same string. Defaults to None.
+            cache_salt (str, optional):
+                If specified, salts the KV cache so reuse is limited to requests that use the same string.
+                Defaults to None.

355-357: Normalize empty/whitespace salt to None to avoid accidental global bucket

Treating "" or " " as a valid salt creates a single common salt bucket that callers might hit by mistake. Normalizing blank/whitespace-only salts to None reduces surprise and aligns with the default behavior.

-        cache_salt_id = get_cache_salt_id(
-            cache_salt) if cache_salt is not None else None
+        # Normalize blank/whitespace-only salts to None
+        cache_salt_id = (
+            get_cache_salt_id(cache_salt.strip())
+            if (cache_salt and cache_salt.strip())
+            else None
+        )
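
Illustrative behavior of the suggested normalization (assuming get_cache_salt_id hashes any non-empty string):

from tensorrt_llm.inputs import get_cache_salt_id

def normalize_salt(cache_salt):
    # Blank or whitespace-only salts fall back to None (no salting).
    return get_cache_salt_id(cache_salt.strip()) if (cache_salt and cache_salt.strip()) else None

assert normalize_salt(None) is None
assert normalize_salt("") is None
assert normalize_salt("   ") is None
assert normalize_salt(" tenant-a ") == get_cache_salt_id("tenant-a")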

1-1: Repository guideline: ensure NVIDIA copyright header present

Per coding guidelines, source files should carry the NVIDIA copyright header (current year). This file appears to be missing it.

If the project’s standard header is the canonical NVIDIA header, prepend it here (or confirm it’s managed by a repo-wide tooling step).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4d3fa49 and 57ea549.

📒 Files selected for processing (31)
  • benchmarks/cpp/disaggServerBenchmark.cpp (1 hunks)
  • benchmarks/cpp/gptManagerBenchmark.cpp (1 hunks)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3 hunks)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (15 hunks)
  • cpp/include/tensorrt_llm/executor/executor.h (4 hunks)
  • cpp/include/tensorrt_llm/executor/types.h (1 hunks)
  • cpp/include/tensorrt_llm/runtime/common.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3 hunks)
  • cpp/tensorrt_llm/executor/request.cpp (5 hunks)
  • cpp/tensorrt_llm/executor/requestImpl.h (7 hunks)
  • cpp/tensorrt_llm/executor/serialization.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (5 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (3 hunks)
  • cpp/tensorrt_llm/nanobind/executor/request.cpp (5 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (5 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/llmRequest.h (3 hunks)
  • cpp/tensorrt_llm/pybind/executor/request.cpp (5 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/executor/executor.py (2 hunks)
  • tensorrt_llm/executor/request.py (2 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
  • tensorrt_llm/inputs/__init__.py (2 hunks)
  • tensorrt_llm/inputs/utils.py (2 hunks)
  • tensorrt_llm/llmapi/llm.py (5 hunks)
  • tensorrt_llm/serve/openai_protocol.py (2 hunks)
  • tensorrt_llm/serve/openai_server.py (1 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/unittest/llmapi/apps/_test_openai_cache_salt.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (25)
  • tensorrt_llm/inputs/utils.py
  • tensorrt_llm/executor/worker.py
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/executor/types.h
  • tensorrt_llm/serve/openai_protocol.py
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp
  • tensorrt_llm/inputs/__init__.py
  • cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
  • benchmarks/cpp/disaggServerBenchmark.cpp
  • tensorrt_llm/executor/request.py
  • cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp
  • benchmarks/cpp/gptManagerBenchmark.cpp
  • cpp/tensorrt_llm/pybind/batch_manager/llmRequest.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/executor/request.cpp
  • tensorrt_llm/serve/openai_server.py
  • cpp/tensorrt_llm/executor/requestImpl.h
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/executor/request.cpp
  • tests/unittest/llmapi/apps/_test_openai_cache_salt.py
  • tests/integration/defs/test_e2e.py
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/tensorrt_llm/nanobind/executor/request.cpp
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/include/tensorrt_llm/runtime/common.h
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
  • cpp/tensorrt_llm/executor/serialization.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/include/tensorrt_llm/runtime/common.h
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
  • cpp/tensorrt_llm/executor/serialization.cpp
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H without underscore prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/runtime/common.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/include/tensorrt_llm/runtime/common.h
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
  • tensorrt_llm/executor/executor.py
  • cpp/tensorrt_llm/executor/serialization.cpp
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/llmapi/llm.py
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
  • cpp/tensorrt_llm/executor/serialization.cpp
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/llmapi/llm.py
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py

345-345: Line too long (186 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
cpp/include/tensorrt_llm/runtime/common.h (1)

47-47: Add CacheSaltIDType alias (uint64_t) — looks good

Fixed-width unsigned fits the hashing/salting use-case and matches other IDs in this header. No functional concerns here.

tensorrt_llm/_torch/pyexecutor/llm_request.py (2)

547-549: Threading cache_salt_id into LlmRequest — correct and consistent

Forwarding executor_request.cache_salt_id matches the new binding signature and keeps the parameter optional. This maintains backward compatibility.


547-549: cache_salt_id is fully exposed and propagated end-to-end — no further changes needed

  • Python bindings
    – executor_request.cache_salt_id is surfaced in worker.py (line 503), request.py (line 99), and executor.py (line 150).
    – Nanobind/pybind both expose cache_salt_id as a read/write property on tle::Request.

  • C++ Request serialization
    – The C++ Request constructor and its Impl (in requestImpl.h lines 89–92) store the salt in mCacheSaltID.
    – Serialization/deserialization in executor/serialization.cpp (lines 714–725) carries cacheSaltID through the stream.

  • KV cache key hashing
    – BlockKey is constructed with llmRequest.getCacheSaltID() (kvCacheManager.cpp line 154) and stored in the key.
    – Unit tests in kvCacheManagerTest.cpp verify that requests with different cache_salt_id produce distinct cache blocks (Test Case 2), confirming the salt is mixed into the key hash.

All three verification points pass; this review comment can be resolved.

cpp/tensorrt_llm/executor/serialization.cpp (1)

714-715: Deserializing optional cacheSaltID — correct placement

Reading std::optional at the end of Request’s stream is straightforward and keeps older fields untouched.

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)

192-193: Expose cache_salt_id on GenericLlmRequest — good Python ergonomics

Read-only property mapped to GenLlmReq::getCacheSaltID is appropriate. Optional maps naturally to Python’s None.

tensorrt_llm/llmapi/llm.py (1)

30-31: LGTM: proper import of cache-salt utilities

Importing create_input_processor_with_hash and get_cache_salt_id from tensorrt_llm.inputs cleanly wires the new salting and MM hashing paths into this module.
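
As a quick sanity check of the salting path (the exact hashing scheme is an implementation detail; determinism is assumed within a single process):

from tensorrt_llm.inputs import get_cache_salt_id

# Identical salt strings map to the same ID, so those requests may share KV-cache blocks;
# different strings isolate reuse (barring hash collisions).
assert get_cache_salt_id("tenant-a") == get_cache_salt_id("tenant-a")
assert get_cache_salt_id("tenant-a") != get_cache_salt_id("tenant-b")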

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16229 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12200 completed with status: 'FAILURE'

Copy link
Collaborator

@Superjomn Superjomn left a comment


Overall LGTM on the llmapi changes.

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
@chang-l
Copy link
Collaborator Author

chang-l commented Sep 4, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17624 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17624 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13248 completed with status: 'FAILURE'

Copy link
Collaborator

@thorjohnsen thorjohnsen left a comment


LGTM

Signed-off-by: Chang Liu <[email protected]>
@chang-l chang-l requested a review from a team as a code owner September 4, 2025 20:14
Copy link
Collaborator

@QiJune QiJune left a comment


LGTM

@chang-l
Copy link
Collaborator Author

chang-l commented Sep 5, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17718 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17718 [ run ] completed with state DISABLED
L0 testing is limited to prioritized users. User chang-l is not in the prioritized list. L0 testing cannot be triggered.

@chang-l
Copy link
Collaborator Author

chang-l commented Sep 5, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17803 [ run ] triggered by Bot

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
@chang-l chang-l enabled auto-merge (squash) September 5, 2025 22:49
@tensorrt-cicd
Copy link
Collaborator

PR_Github #17803 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13330 completed with status: 'SUCCESS'

@chang-l
Copy link
Collaborator Author

chang-l commented Sep 6, 2025

/bot reuse-pipeline

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17898 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17898 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #17803 for commit e8855be

@chang-l chang-l merged commit 23500b5 into NVIDIA:main Sep 6, 2025
5 checks passed
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025