[TRTLLM-7398][feat] Support KV cache salting for secure KV cache reuse #7106
Conversation
📝 Walkthrough

Adds optional per-request KV-cache salting across C++ and Python: a new CacheSaltID type, request fields with getters and setters, propagation through the LlmRequest/Request constructors and bindings, mixing into BlockKey hashing for first-block seeding, utilities to derive salt IDs, and tests/benchmarks updated to exercise the feature.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant OpenAI_Server as OpenAI Server
    participant LLM as BaseLLM.generate_async
    participant PyExec as Python Executor
    participant CppReq as C++ Request
    participant KV as KV Cache Manager
    Client->>OpenAI_Server: ChatCompletionRequest(cache_salt?)
    OpenAI_Server->>LLM: generate_async(..., cache_salt)
    LLM->>LLM: if cache_salt -> get_cache_salt_id() -> cache_salt_id
    LLM->>PyExec: generate_async(..., cache_salt_id)
    PyExec->>CppReq: construct Request(..., cacheSaltID=cache_salt_id)
    CppReq->>KV: buildBlockKeys(cacheSaltID)
    Note over KV: first-block hash mixes cacheSaltID when present
    KV-->>CppReq: allocate/reuse blocks (salted keys)
    CppReq-->>PyExec: generation output
    PyExec-->>LLM: result
    LLM-->>OpenAI_Server: response
    OpenAI_Server-->>Client: completion
```
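The flow in the diagram above can be sketched in plain Python. This is an illustrative model only, not the TensorRT-LLM implementation: the hash function (blake2b) and the exact mixing scheme are stand-ins for whatever the C++ BlockKey hashing actually uses; the point is that the salt is folded into the first block's key, so requests with different salts can never match for reuse.

```python
import hashlib
from typing import List, Optional


def cache_salt_to_id(cache_salt: str) -> int:
    # Derive a deterministic uint64 from the salt string
    # (stand-in hasher; the real code may use a different one).
    digest = hashlib.blake2b(cache_salt.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little", signed=False)


def first_block_hash(tokens: List[int], cache_salt_id: Optional[int]) -> int:
    # Mix the salt into the first block's key so that requests with
    # different salts never produce matching keys (no cross-salt reuse).
    h = hashlib.blake2b(digest_size=8)
    if cache_salt_id is not None:
        h.update(cache_salt_id.to_bytes(8, "little"))
    for t in tokens:
        h.update(t.to_bytes(4, "little", signed=True))
    return int.from_bytes(h.digest(), "little")


# Same tokens + same salt -> same key (reuse possible);
# same tokens + different salt -> different key (no reuse).
tokens = [1, 2, 3, 4]
a = first_block_hash(tokens, cache_salt_to_id("tenant-a"))
b = first_block_hash(tokens, cache_salt_to_id("tenant-a"))
c = first_block_hash(tokens, cache_salt_to_id("tenant-b"))
assert a == b and a != c
```

The names `cache_salt_to_id` and `first_block_hash` are hypothetical; only the overall shape (salt string → uint64 → mixed into first-block hash) follows the walkthrough.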
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
cpp/include/tensorrt_llm/executor/executor.h (1)

675-703: Misaligned constructor signature between declaration and definition

The implementation in `cpp/tensorrt_llm/executor/request.cpp` does not exactly match the declaration in `cpp/include/tensorrt_llm/executor/executor.h`:

- In the definition (lines 28–100), the 21st parameter is named `std::optional<LogitsPostProcessor> logitslogitsPostProcessor`, but the header declares it as `std::optional<LogitsPostProcessor> logitsPostProcessor`.
- The `priority` parameter's type is `float` in the .cpp, whereas the header uses `PriorityType priority`. Even if `PriorityType` is an alias for `float`, the signature must use the same identifier.

Please update the definition to align exactly with the declaration:

- Change the parameter name from `logitslogitsPostProcessor` to `logitsPostProcessor`.
- Use `PriorityType priority` (not `float`) in the implementation.

Everything else—the placement of the new optional parameters and the Python/Nanobind bindings for `cache_salt_id`—looks correct.

cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1)
75-79: Bug: optEncoderInputTokens engaged with nullptr when mEncoderTokens is absent

The current construction wraps a nullptr in an engaged `std::optional`, which can lead downstream code to dereference a null shared_ptr. Use `std::nullopt` when there is no encoder input.

```diff
-    auto const encoderInputTokens = mEncoderTokens.has_value()
-        ? std::make_shared<std::vector<TokenIdType>>(*mEncoderTokens.value().get())
-        : nullptr;
-    auto const optEncoderInputTokens
-        = std::optional<std::shared_ptr<std::vector<TokenIdType>>>(encoderInputTokens);
+    std::optional<std::shared_ptr<std::vector<TokenIdType>>> optEncoderInputTokens =
+        mEncoderTokens.has_value()
+            ? std::make_optional(
+                std::make_shared<std::vector<TokenIdType>>(*mEncoderTokens.value().get()))
+            : std::nullopt;
```

cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp (1)
78-130: Critical: Binding passes `cacheSaltID` without header support

The pybind bridge at `cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp` (lines 78–130) now appends `mCacheSaltID` to the `tb::LlmRequest` constructor call, but in `cpp/include/tensorrt_llm/batch_manager/llmRequest.h` the `LlmRequest` class declaration and its constructors do not list a `cacheSaltID` (or similarly named) parameter. This mismatch will break the build and ABI.

Please update the `LlmRequest` constructor signature (and its implementation) in the header and source to include the `cacheSaltID` parameter—positioned after `mContextPhaseParams` (or in the correct spot relative to any defaulted parameters)—or remove the extra argument from the bridge.

cpp/tensorrt_llm/pybind/executor/request.cpp (1)
518-555: Pickle get/set state does not persist language_adapter_uid/allotted_time_ms/cache_salt_id

`requestGetstate`/`requestSetstate` still serialize 33 entries and don't carry the three trailing fields. That silently drops `cache_salt_id` (and potentially `language_adapter_uid`/`allotted_time_ms`) across pickle boundaries.

Recommend:

- Append these fields to the tuple returned by `getstate`.
- Make `setstate` backward compatible by accepting both the legacy 33-tuple and a new 36-tuple.

Proposed patch (minimal and backward compatible):
```diff
 auto requestGetstate = [](tle::Request const& self)
 {
-    return py::make_tuple(self.getInputTokenIds(), self.getMaxTokens(), self.getStreaming(),
+    return py::make_tuple(self.getInputTokenIds(), self.getMaxTokens(), self.getStreaming(),
         self.getSamplingConfig(), self.getOutputConfig(), self.getEndId(), self.getPadId(),
         self.getPositionIds(), self.getBadWords(), self.getStopWords(), self.getEmbeddingBias(),
         self.getExternalDraftTokensConfig(), self.getPromptTuningConfig(), self.getMultimodalInput(),
         self.getMultimodalEmbedding(), self.getMropeConfig(), self.getLoraConfig(), self.getLookaheadConfig(),
         self.getKvCacheRetentionConfig(), self.getLogitsPostProcessorName(), self.getLogitsPostProcessor(),
         self.getEncoderInputTokenIds(), self.getClientId(), self.getReturnAllGeneratedTokens(),
         self.getPriority(), self.getRequestType(), self.getContextPhaseParams(),
         self.getEncoderInputFeatures(), self.getEncoderOutputLength(),
-        self.getCrossAttentionMask(), self.getEagleConfig(), self.getSkipCrossAttnBlocks(),
-        self.getGuidedDecodingParams());
+        self.getCrossAttentionMask(), self.getEagleConfig(), self.getSkipCrossAttnBlocks(),
+        self.getGuidedDecodingParams(),
+        self.getLanguageAdapterUid(), self.getAllottedTimeMs(), self.getCacheSaltID());
 };

 auto requestSetstate = [](py::tuple const& state)
 {
-    if (state.size() != 33)
+    if (state.size() != 33 && state.size() != 36)
     {
         throw std::runtime_error("Invalid Request state!");
     }
-    return std::make_unique<tle::Request>(state[0].cast<VecTokens>(), state[1].cast<SizeType32>(),
-        state[2].cast<bool>(), state[3].cast<tle::SamplingConfig>(), state[4].cast<tle::OutputConfig>(),
+    // Legacy 33-field pickle (no language_adapter_uid/allotted_time_ms/cache_salt_id)
+    if (state.size() == 33)
+    {
+        return std::make_unique<tle::Request>(state[0].cast<VecTokens>(), state[1].cast<SizeType32>(),
+            state[2].cast<bool>(), state[3].cast<tle::SamplingConfig>(), state[4].cast<tle::OutputConfig>(),
+            state[5].cast<std::optional<SizeType32>>(), state[6].cast<std::optional<SizeType32>>(),
+            state[7].cast<std::optional<std::vector<SizeType32>>>(),
+            state[8].cast<std::optional<std::list<VecTokens>>>(), state[9].cast<std::optional<std::list<VecTokens>>>(),
+            state[10].cast<std::optional<Tensor>>(), state[11].cast<std::optional<tle::ExternalDraftTokensConfig>>(),
+            state[12].cast<std::optional<tle::PromptTuningConfig>>(),
+            state[13].cast<std::optional<tle::MultimodalInput>>(), state[14].cast<std::optional<Tensor>>(),
+            state[15].cast<std::optional<tle::MropeConfig>>(), state[16].cast<std::optional<tle::LoraConfig>>(),
+            state[17].cast<std::optional<tle::LookaheadDecodingConfig>>(),
+            state[18].cast<std::optional<tle::KvCacheRetentionConfig>>(), state[19].cast<std::optional<std::string>>(),
+            state[20].cast<std::optional<tle::LogitsPostProcessor>>(), state[21].cast<std::optional<VecTokens>>(),
+            state[22].cast<std::optional<IdType>>(), state[23].cast<bool>(), state[24].cast<tle::PriorityType>(),
+            state[25].cast<tle::RequestType>(), state[26].cast<std::optional<tle::ContextPhaseParams>>(),
+            state[27].cast<std::optional<tle::Tensor>>(), state[28].cast<std::optional<SizeType32>>(),
+            state[29].cast<std::optional<tle::Tensor>>(), 1,
+            state[30].cast<std::optional<tle::EagleConfig>>(),
+            state[31].cast<std::optional<tle::Tensor>>(),
+            state[32].cast<std::optional<tle::GuidedDecodingParams>>(),
+            /*languageAdapterUid*/ std::nullopt,
+            /*allottedTimeMs*/ std::nullopt,
+            /*cacheSaltID*/ std::nullopt);
+    }
+    // New 36-field pickle (includes language_adapter_uid, allotted_time_ms, cache_salt_id)
+    return std::make_unique<tle::Request>(state[0].cast<VecTokens>(), state[1].cast<SizeType32>(),
+        state[2].cast<bool>(), state[3].cast<tle::SamplingConfig>(), state[4].cast<tle::OutputConfig>(),
+        state[5].cast<std::optional<SizeType32>>(), state[6].cast<std::optional<SizeType32>>(),
+        state[7].cast<std::optional<std::vector<SizeType32>>>(),
+        state[8].cast<std::optional<std::list<VecTokens>>>(), state[9].cast<std::optional<std::list<VecTokens>>>(),
+        state[10].cast<std::optional<Tensor>>(), state[11].cast<std::optional<tle::ExternalDraftTokensConfig>>(),
+        state[12].cast<std::optional<tle::PromptTuningConfig>>(),
+        state[13].cast<std::optional<tle::MultimodalInput>>(), state[14].cast<std::optional<Tensor>>(),
+        state[15].cast<std::optional<tle::MropeConfig>>(), state[16].cast<std::optional<tle::LoraConfig>>(),
+        state[17].cast<std::optional<tle::LookaheadDecodingConfig>>(),
+        state[18].cast<std::optional<tle::KvCacheRetentionConfig>>(), state[19].cast<std::optional<std::string>>(),
+        state[20].cast<std::optional<tle::LogitsPostProcessor>>(), state[21].cast<std::optional<VecTokens>>(),
+        state[22].cast<std::optional<IdType>>(), state[23].cast<bool>(), state[24].cast<tle::PriorityType>(),
+        state[25].cast<tle::RequestType>(), state[26].cast<std::optional<tle::ContextPhaseParams>>(),
+        state[27].cast<std::optional<tle::Tensor>>(), state[28].cast<std::optional<SizeType32>>(),
+        state[29].cast<std::optional<tle::Tensor>>(), 1,
+        state[30].cast<std::optional<tle::EagleConfig>>(),
+        state[31].cast<std::optional<tle::Tensor>>(),
+        state[32].cast<std::optional<tle::GuidedDecodingParams>>(),
+        state[33].cast<std::optional<tle::SizeType32>>(),       // language_adapter_uid
+        state[34].cast<std::optional<tle::MillisecondsType>>(), // allotted_time_ms
+        state[35].cast<std::optional<tle::CacheSaltIDType>>()   // cache_salt_id
+    );
 };
```
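The version-tolerant deserialization recommended here (accept both the legacy and the extended tuple, defaulting the new fields) is the same pattern commonly used on the Python side of a pickle boundary. A minimal sketch with hypothetical field names, not the actual `tle::Request` fields:

```python
import pickle


class Request:
    def __init__(self, tokens, max_tokens, cache_salt_id=None):
        self.tokens = tokens
        self.max_tokens = max_tokens
        self.cache_salt_id = cache_salt_id

    def __getstate__(self):
        # Always serialize the newest layout.
        return (self.tokens, self.max_tokens, self.cache_salt_id)

    def __setstate__(self, state):
        # Accept both the legacy 2-tuple and the new 3-tuple.
        if len(state) == 2:
            tokens, max_tokens = state
            cache_salt_id = None  # legacy pickles carry no salt
        elif len(state) == 3:
            tokens, max_tokens, cache_salt_id = state
        else:
            raise ValueError("Invalid Request state!")
        self.__init__(tokens, max_tokens, cache_salt_id)


# New-format round trip preserves the salt.
new = pickle.loads(pickle.dumps(Request([1, 2], 8, cache_salt_id=42)))
assert new.cache_salt_id == 42

# A legacy 2-field state is still accepted, with the salt defaulted.
old = Request([0], 1)
old.__setstate__(([3], 4))
assert old.cache_salt_id is None
```

The design choice mirrors the C++ patch: serialize only the newest layout, but branch on tuple length when restoring, so old pickles keep loading.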
🧹 Nitpick comments (43)
cpp/include/tensorrt_llm/runtime/common.h (1)

47-47: CacheSaltIDType alias addition — LGTM; add brief Doxygen for intent

Good place for the alias alongside other ID types. Please add a short Doxygen comment to clarify semantics and intended range. Apply this diff to document the alias:

```diff
-using CacheSaltIDType = std::uint64_t;
+//! \brief Per-request KV cache salt used to namespace KV reuse.
+//! Requests with different salts must not match for reuse.
+using CacheSaltIDType = std::uint64_t;
```

cpp/include/tensorrt_llm/executor/types.h (1)
61-61: Avoid type drift: re-export CacheSaltIDType from runtime

To ensure the executor stays in sync with the runtime and prevent future divergence, please update the alias in `cpp/include/tensorrt_llm/executor/types.h`.

In executor/types.h (around line 61), replace:

```cpp
using CacheSaltIDType = std::uint64_t;
```

with:

```cpp
#include "tensorrt_llm/runtime/common.h" // add near other includes
using CacheSaltIDType = tensorrt_llm::runtime::CacheSaltIDType;
```

This aligns executor/types.h with other modules (e.g. batch_manager/llmRequest.h and kvCacheManager.h) that already re-export `runtime::CacheSaltIDType`. After applying, rerun:

```shell
rg -nP --glob '!**/build/**' '\busing\s+CacheSaltIDType\s*=' -C2
```

to confirm no remaining direct definitions of `std::uint64_t`.

tensorrt_llm/executor/request.py (1)
86-101: API extension is backward compatible; document and ensure end-to-end plumbing

The new `cache_salt_id` argument is appended, so callsites using positional args are unaffected. Make sure this is plumbed through the pybind/nanobind layer and into `Request::Impl::setCacheSaltID`, and add a docstring entry so downstream users understand purpose and bounds (uint64).

If helpful, I can add/update the Google-style docstring for `GenerationRequest.__init__`.
cpp/tensorrt_llm/executor/requestImpl.h (1)

301-305: Add brief Doxygen for new accessors

Small clarity boost for the public API. Apply this diff:

```diff
-    [[nodiscard]] std::optional<CacheSaltIDType> getCacheSaltID() const
+    //! \brief Per-request KV cache salt if present. Same value used to namespace KV-cache reuse.
+    [[nodiscard]] std::optional<CacheSaltIDType> getCacheSaltID() const
     {
         return mCacheSaltID;
     }
@@
-    void setCacheSaltID(CacheSaltIDType cacheSaltID)
+    //! \brief Set per-request KV cache salt (namespaces KV reuse). Idempotent overwrite.
+    void setCacheSaltID(CacheSaltIDType cacheSaltID)
     {
         mCacheSaltID = cacheSaltID;
     }
```

Also applies to: 480-484
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

118-119: Add tokens-based BlockKey constructor overload for cacheSaltID

The existing C++ constructors in `kvCacheManager.h` are:

- `explicit BlockKey(VecTokens const& tokens, std::optional<LoraTaskIdType> loraTaskId = std::nullopt)`
- `explicit BlockKey(bool usesExtraIds, std::optional<LoraTaskIdType> loraTaskId, VecUniqueTokens uniqueTokens, std::vector<MmKey> extraKeys = {}, std::optional<CacheSaltIDType> cacheSaltID = std::nullopt)`

All salting-sensitive call sites in `kvCacheManager.cpp` already invoke the full overload and correctly pass `llmRequest.getCacheSaltID()`. No other C++ call sites use the tokens-only constructor. To reduce boilerplate where callers start from a token vector, add a convenience overload that accepts an optional `cacheSaltID` directly.

Diff in `cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`:

```diff
 class BlockKey
 {
 public:
     // … existing members …
+    // Convenience ctor: tokens + optional Lora task + optional cache salt
+    explicit BlockKey(
+        VecTokens const& tokens,
+        std::optional<LoraTaskIdType> loraTaskId = std::nullopt,
+        std::optional<CacheSaltIDType> cacheSaltID = std::nullopt)
+        : loraTaskId{loraTaskId}
+        , cacheSaltID{cacheSaltID}
+    {
+        uniqueTokens.reserve(tokens.size());
+        for (auto const& token : tokens)
+        {
+            uniqueTokens.push_back(UniqueToken{token, 0});
+        }
+    }
     // … existing full ctor already handles cacheSaltID …
 };
```

- Verified that in `kvCacheManager.cpp` all salting-sensitive sites pass through the full ctor with `getCacheSaltID()`.
- No additional C++ construction sites use the tokens-only ctor; Python code in `serve/router.py` still uses the tokens-only binding—update the Python bindings if salt support is needed there.

cpp/include/tensorrt_llm/executor/executor.h (4)
673-674: Clarify parameter doc: the API accepts an ID, not a string

The docstring says "limit the kv cache reuse to the requests with the same string," but the parameter is a numeric CacheSaltIDType. Suggest rewording to avoid confusion and to note that absence means "unsalted/unrestricted reuse." Apply this diff to tighten the wording:

```diff
-    /// @param cacheSaltID Salt ID for KV cache blocks to limit the kv cache reuse to the requests with the same string.
+    /// @param cacheSaltID Salt identifier for KV cache blocks. When set, KV cache reuse is limited to requests using
+    /// the same salt identifier. Leave std::nullopt to allow unrestricted reuse (unsalted).
```
750-751: Add accessor docs; consider naming consistency with existing Id/ID usage

- Please add a brief Doxygen comment for getCacheSaltID() clarifying semantics (std::nullopt means unsalted).
- Naming is mixed across the codebase (getClientId vs getCacheSaltID). Consider aligning to "Id" for API consistency in a future cleanup.

Apply this diff to document the accessor:

```diff
-    [[nodiscard]] std::optional<CacheSaltIDType> getCacheSaltID() const;
+    /// @brief Returns the KV cache salt identifier if set; std::nullopt indicates unsalted reuse.
+    [[nodiscard]] std::optional<CacheSaltIDType> getCacheSaltID() const;
```
786-787: Offer a way to clear the salt at runtime

There's a setter but no way to clear (unset) the salt after construction besides rebuilding the Request. Consider adding either a setCacheSaltID(std::nullopt) overload or a clear/reset method. Here's a minimal additive API:

```diff
     void setCacheSaltID(CacheSaltIDType cacheSaltID);
+    /// @brief Clear the KV cache salt identifier (revert to unsalted).
+    void clearCacheSaltID();
```

And in the Impl/definition:

```diff
+void Request::clearCacheSaltID()
+{
+    mImpl->setCacheSaltID(std::nullopt);
+}
```
1-15: Update copyright year

The header still lists 2022–2024. Project guidelines ask to use the current year in source headers; update to include 2025 if that's the project-wide convention.
tensorrt_llm/executor/executor.py (2)
125-126: Document and validate cache_salt_id; allow broader int types

- Please document cache_salt_id in the method docstring.
- Add a small validation/conversion to handle numpy scalars and ensure the ID fits uint64 (0..2^64-1). This avoids surprises coming from client code.

Apply this diff:

```diff
     def generate_async(
         self,
         prompt_token_ids: List[int],
         sampling_params: SamplingParams,
         query_token_ids: Optional[Union[torch.Tensor, np.ndarray, list]] = None,
         lora_request: Optional[LoRARequest] = None,
         prompt_adapter_request: Optional[PromptAdapterRequest] = None,
         streaming: bool = False,
         kv_cache_retention_config: Optional[KvCacheRetentionConfig] = None,
         disaggregated_params: Optional[DisaggregatedParams] = None,
         postproc_params: Optional[PostprocParams] = None,
         multimodal_params: Optional[MultimodalParams] = None,
         scheduling_params: Optional[SchedulingParams] = None,
+        cache_salt_id: Optional[int] = None,
     ) -> GenerationResult:
-        """Generate output for the given prompt token ids in the asynchronous mode.
+        """Generate output for the given prompt token ids in the asynchronous mode.
         Asynchronous generation accepts single prompt only.
+
+        Args:
+            cache_salt_id: Optional uint64 identifier to isolate KV-cache reuse. Requests sharing the
+                same salt id may reuse KV cache; different ids prevent reuse. None means unsalted.
         """
 @@
+        if cache_salt_id is not None:
+            # Accept plain int, numpy scalar, or types convertible to int
+            cache_salt_id = int(cache_salt_id)
+            if cache_salt_id < 0 or cache_salt_id >= (1 << 64):
+                raise ValueError("cache_salt_id must be in [0, 2**64 - 1]")
```
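To make the uint64 bound concrete, here is the validation logic from the diff above factored into a standalone function. The helper name `validate_cache_salt_id` is hypothetical; the range check itself matches the suggested patch:

```python
from typing import Optional


def validate_cache_salt_id(cache_salt_id: Optional[int]) -> Optional[int]:
    # Accept None (unsalted), plain ints, or int-convertible scalars;
    # reject anything outside the uint64 range [0, 2**64 - 1].
    if cache_salt_id is None:
        return None
    cache_salt_id = int(cache_salt_id)
    if cache_salt_id < 0 or cache_salt_id >= (1 << 64):
        raise ValueError("cache_salt_id must be in [0, 2**64 - 1]")
    return cache_salt_id


assert validate_cache_salt_id(None) is None
assert validate_cache_salt_id(0) == 0
assert validate_cache_salt_id(2**64 - 1) == 2**64 - 1
try:
    validate_cache_salt_id(2**64)
except ValueError:
    pass
else:
    raise AssertionError("out-of-range salt id should be rejected")
```

Factoring it out would also let the sync `generate()` path (discussed in the next item) reuse the same check.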
149-151: Propagation LGTM; consider adding cache_salt_id to synchronous generate() for API parity

generate() doesn't expose cache_salt_id, so there's no way to use salting via the sync path. Consider adding an optional parameter there and forwarding it to generate_async (the batched case could accept a single int or a list aligned with prompts). Proposed minimal change:

```diff
 @@ def generate(
-        disaggregated_params: Optional[DisaggregatedParams] = None,
+        disaggregated_params: Optional[DisaggregatedParams] = None,
+        cache_salt_id: Optional[Union[int, List[int]]] = None,
     ) -> Union[GenerationResult, List[GenerationResult]]:
 @@
+            # Support per-request salting when batched (single int or a list matching prompts)
+            _salt = (cache_salt_id[i] if isinstance(cache_salt_id, list)
+                     else cache_salt_id)
             future = self.generate_async(
                 p,
                 sampling_params=sp,
                 query_token_ids=query_token_ids,
                 lora_request=lora_req,
                 prompt_adapter_request=pa_req,
                 streaming=False,
-                disaggregated_params=disaggregated_params)
+                disaggregated_params=disaggregated_params,
+                cache_salt_id=_salt)
```

tensorrt_llm/inputs/utils.py (2)
605-610: Add docstring and minor guardrails

This public API should have a short docstring per guidelines. Also, defensively strip the string to avoid accidental whitespace differences. Apply this diff:

```diff
-def get_cache_salt_id(cache_salt: str) -> int:
-    b = cache_salt.encode("utf-8")
+def get_cache_salt_id(cache_salt: str) -> int:
+    """Return a 64-bit integer ID derived from a salt string.
+
+    The same input string deterministically maps to the same uint64. Different strings are unlikely
+    to collide. Use this to isolate KV-cache reuse across requests sharing the same salt.
+    """
+    b = cache_salt.strip().encode("utf-8")
     h = default_hasher(b).digest(length=8)
     return int.from_bytes(h, "little", signed=False)
```

The logic is otherwise sound and efficient.
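To see why trimming matters: without normalization, salts differing only in whitespace map to different IDs and silently disable reuse between requests that were meant to share a cache scope. A quick self-contained check, using `hashlib.blake2b` as a stand-in for the module's `default_hasher` (which may be a different hash; the normalization argument is independent of the hasher):

```python
import hashlib


def salt_id(cache_salt: str, normalize: bool = False) -> int:
    # Stand-in for get_cache_salt_id; blake2b here is an assumption,
    # not necessarily the project's default_hasher.
    if normalize:
        cache_salt = cache_salt.strip()
    digest = hashlib.blake2b(cache_salt.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little", signed=False)


# Without trimming, trailing whitespace yields a different salt ID.
assert salt_id("tenant-a") != salt_id("tenant-a ")
# With trimming, both normalize to the same ID, and it fits in uint64.
assert salt_id("tenant-a", normalize=True) == salt_id("tenant-a ", normalize=True)
assert salt_id("tenant-a") < 2**64
```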
1-15: Add header per project convention

Python files typically include the NVIDIA copyright header. Consider adding it here.
tensorrt_llm/executor/worker.py (1)
502-504: Propagate `cache_salt_id` correctly; optional uint64 validation in Python

Propagation of `cache_salt_id=request.cache_salt_id` is already consistent:

- All Python call sites to `executor.Request` include the new `cache_salt_id` keyword.
- The nanobind/pybind layer exposes it via `.def_property("cache_salt_id", &tle::Request::getCacheSaltID, &tle::Request::setCacheSaltID)`.

To guard against subtle runtime errors (e.g. passing a non-int, a negative value, or an integer > 2⁶⁴–1) before hitting the C++/nanobind boundary, you may optionally coerce and range-check locally:

```diff
-            type=request_type,
-            cache_salt_id=request.cache_salt_id)
+            type=request_type,
+            cache_salt_id=(
+                None if request.cache_salt_id is None
+                else (lambda _s: (_s if (isinstance(_s, int) and 0 <= _s <= (1 << 64) - 1)
+                                  else (_ for _ in ()).throw(ValueError("cache_salt_id must be uint64"))))
+                     (int(request.cache_salt_id))
+            ))
```

To double-check consistency across the codebase, you can run:

```shell
rg -nP --type=py -C2 'executor\.Request\([^)]*cache_salt_id\s*='
rg -nP --type=cpp -C3 'cacheSaltID|cache_salt_id|CacheSaltIDType' cpp/ tensorrt_llm/
```

benchmarks/cpp/disaggServerBenchmark.cpp (1)
546-547: Tidy optional pass-through and document new trailing cacheSaltID

The ternary is redundant for `std::optional`. Pass the optional directly; keep the explicit `std::nullopt` for the new cacheSaltID.

```diff
-        encoderInputTokenIds.has_value() ? encoderInputTokenIds : std::nullopt,
-        std::nullopt); // cacheSaltID
+        encoderInputTokenIds,
+        std::nullopt); // cacheSaltID
```

Architecture note: consider adding a CLI flag (e.g., `--cache_salt_id`) wired into these requests so benchmarks can exercise salted vs. unsalted reuse behavior end-to-end.

benchmarks/cpp/gptManagerBenchmark.cpp (1)
840-842: Simplify optional and add explicit cacheSaltID placeholder

The same cleanup applies here; the ternary is unnecessary when forwarding an `std::optional`.

```diff
-        encoderInputTokenIds.has_value() ? encoderInputTokenIds : std::nullopt,
-        std::nullopt); // cacheSaltID
+        encoderInputTokenIds,
+        std::nullopt); // cacheSaltID
```

tensorrt_llm/serve/openai_server.py (1)
50-51: Typo in formatter directive

The comment reads "# yapf: enale". Fix it to "enable" or remove it if unused.

```diff
-# yapf: enale
+# yapf: enable
```

cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (2)
54-54: Update parameter-count comment to match reality

The constructor now has 50 parameters (including contextPhaseParams and cacheSaltID) but the comment still says 49.

```diff
-    // 49 parameters
+    // 50 parameters
```

88-90: Forwarding of cacheSaltID is correct; consider keeping comments and guards consistent

- The new `std::optional` cacheSaltID parameter and its forwarding into Base(...) look correct and maintain append-only ordering. Good.
- Minor: coding guidelines require include guards for headers; this file uses #pragma once. If the codebase prefers #pragma once, ignore; otherwise consider adding TRTLLM_LLMREQUEST_H and removing #pragma once in a follow-up.

Also applies to: 150-152
tests/integration/defs/test_e2e.py (1)
1453-1463: Good targeted integration for cache-salt behavior

Installing the example requirements and running the dedicated _test_openai_cache_salt.py mirrors the existing serve tests. LGTM.

If install time becomes an issue, consider factoring the pip install into a session-scoped fixture reused by related OpenAI serve tests to avoid repeated installs.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
1778-1784: Make assertions independent of a shared numBlocks variable across cases

In Cases 2–4 you assert against `blockManager.getNumAllocatedBlocks()` using `numBlocks` computed in Case 1. While it currently works (same token lengths), it couples cases and can lead to fragile tests if sequence lengths diverge. Recompute per case or compute local constants. Apply the following diffs to compute per-case block counts and avoid cross-case coupling:

```diff
 @@
     llmRequest1->addNewToken(3, beamIdx);
     llmRequest1->addNewToken(4, beamIdx);
-    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks);
-    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks);
+    numTokens = llmRequest1->getNumTokens(beamIdx);
+    auto const numBlocks1 = tc::ceilDiv(numTokens, tokensPerBlock);
+    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks1);
+    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks1);
 @@
     llmRequest2->addNewToken(3, beamIdx);
     llmRequest2->addNewToken(4, beamIdx);
-    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks);
-    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks);
+    numTokens = llmRequest2->getNumTokens(beamIdx);
+    auto const numBlocks2 = tc::ceilDiv(numTokens, tokensPerBlock);
+    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks2);
+    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks2);
 @@
     llmRequest3->addNewToken(5, beamIdx);
     llmRequest3->addNewToken(6, beamIdx);
-    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks);
-    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks);
+    numTokens = llmRequest3->getNumTokens(beamIdx);
+    auto const numBlocks3 = tc::ceilDiv(numTokens, tokensPerBlock);
+    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks3);
+    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks3);
 @@
     llmRequest4->addNewToken(7, beamIdx);
-    numTokens = llmRequest4->getNumTokens(beamIdx);
-    numBlocks = tc::ceilDiv(numTokens, tokensPerBlock);
-    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocks * 2);
-    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - numBlocks * 2);
+    auto const numBlocksCase3 = tc::ceilDiv(llmRequest3->getNumTokens(beamIdx), tokensPerBlock);
+    numTokens = llmRequest4->getNumTokens(beamIdx);
+    auto const numBlocksCase5 = tc::ceilDiv(numTokens, tokensPerBlock);
+    EXPECT_EQ(blockManager.getNumAllocatedBlocks(), numBlocksCase3 + numBlocksCase5);
+    EXPECT_EQ(blockManager.getNumFreeBlocks(), blocksInPrimaryPool - (numBlocksCase3 + numBlocksCase5));
```

Also applies to: 1811-1815, 1843-1847, 1869-1874

1736-1743: Optional: Add EXPECT_NO_THROW wrappers around addSequence for consistency and clearer failures

Other tests in this file occasionally wrap state-changing calls in EXPECT_NO_THROW. Mirroring that here keeps failure points crisp without affecting behavior. Example for Case 1:

```diff
-    blockManager.addSequence(seq0, promptLen0, numContextBlocks0, *llmRequest0, maxAttentionWindow);
+    EXPECT_NO_THROW(blockManager.addSequence(seq0, promptLen0, numContextBlocks0, *llmRequest0, maxAttentionWindow));
```

Repeat similarly for seq1–seq4 additions in this test.

Also applies to: 1773-1779, 1804-1810, 1836-1842, 1863-1868

1758-1769: Consider asserting reuse metrics to complement block ID checks

If available in this scope, asserting `llmRequestX->getReusedBlocksPerRequest()` or `getKVCacheHitRatePerRequest()` would make the test resilient to incidental ID allocation differences while still validating the core reuse/no-reuse semantics governed by cache_salt_id.
554-560
: Clarify description and mark as prototype in schema.The current description is slightly awkward and doesn’t advertise feature maturity. Recommend clearer wording and tagging the field as “prototype” in the schema to set expectations.
Apply this diff:
- cache_salt: Optional[str] = Field( - default=None, - description= - ("If specified, KV cache will be salted with the provided string " - "to limit the kv cache reuse on with the requests having the same string." - )) + cache_salt: Optional[str] = Field( + default=None, + description=( + "Optional: scope KV-cache reuse to requests that provide the same salt string. " + "Use to isolate prompts or tenants and mitigate cross-request cache reuse." + ), + json_schema_extra={"status": "prototype"}, + )
640-650
: Trim and validate non-empty salt strings; normalize input in-place.Accepting whitespace-only strings as valid salts is likely unintended and can lead to hard-to-debug behavior. Normalize with strip(), then validate non-empty.
Apply this diff:
@model_validator(mode="before") @classmethod def check_cache_salt_support(cls, data): - if data.get("cache_salt") is not None: - if not isinstance(data["cache_salt"], - str) or not data["cache_salt"]: - raise ValueError( - "Parameter 'cache_salt' must be a non-empty string if provided." - ) + cache_salt = data.get("cache_salt") + if cache_salt is not None: + if not isinstance(cache_salt, str): + raise ValueError("Parameter 'cache_salt' must be a string if provided.") + cache_salt = cache_salt.strip() + if not cache_salt: + raise ValueError("Parameter 'cache_salt' must be a non-empty string after trimming.") + data["cache_salt"] = cache_salt return data
194-257: Ensure cache_salt parity between CompletionRequest and ChatCompletionRequest

CompletionRequest (in tensorrt_llm/serve/openai_protocol.py, lines 194–257) currently lacks the `cache_salt` field and the corresponding validation logic that ChatCompletionRequest provides (lines 554–647). To avoid confusion and maintain a consistent API surface, please choose one of the following:

- Add `cache_salt` to CompletionRequest, mirroring ChatCompletionRequest:

```python
cache_salt: Optional[str] = Field(
    default=None,
    description=(
        "If specified, KV cache will be salted with the provided string "
        "to limit the kv cache reuse to the requests having the same string."
    ),
)
```

and register the same `@root_validator(pre=True)`-based `check_cache_salt_support` to enforce non-empty strings.

- If salting is intentionally chat-only, add a brief note to CompletionRequest's doc comments explaining why `cache_salt` isn't supported for the Completions API.

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)
197-197: Expose a short docstring for cache_salt_id property

Minor ergonomics: adding a docstring helps discoverability in Python help() and IDEs. Apply this diff:

```diff
-        .def_property_readonly("cache_salt_id", &GenLlmReq::getCacheSaltID)
+        .def_property_readonly(
+            "cache_salt_id",
+            &GenLlmReq::getCacheSaltID,
+            "Optional uint64 salt that scopes KV-cache reuse; None means no salt.")
```

tests/unittest/llmapi/apps/_test_openai_cache_salt.py (5)
1-1: Add NVIDIA copyright header (2025) to comply with repo guidelines.

All source files must carry the current-year NVIDIA header. Please prepend the SPDX header before the module docstring:

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 """Test cache_salt functionality in OpenAI API to ensure it prevents cache reuse"""
```
21-47: Avoid a fixed temp filename to prevent collisions in parallel test runs.

Using a constant filename in the system temp directory can race when pytest runs in parallel. Prefer mkstemp/NamedTemporaryFile:

```diff
-    temp_dir = tempfile.gettempdir()
-    temp_file_path = os.path.join(temp_dir, "cache_salt_test_options.yaml")
+    fd, temp_file_path = tempfile.mkstemp(prefix="cache_salt_test_options_", suffix=".yaml")
+    os.close(fd)
```
140-162: Relax the assertion to tolerate metric sampling noise.

Hit-rate snapshots may not strictly increase step-to-step. Use a tiny epsilon and include the values in the failure message:

```diff
-    assert hit_rate_after_reuse >= initial_hit_rate, \
-        "Cache hit rate should increase when reusing cache without salt"
+    eps = 1e-6
+    assert hit_rate_after_reuse + eps >= initial_hit_rate, \
+        f"Expected non-decreasing hit rate on reuse; got initial={initial_hit_rate:.4f}, after={hit_rate_after_reuse:.4f}"
```
176-181: Prefer checking counter deltas over the global hit rate for salt/no-salt behavior.

A lower hit rate is a proxy and can flake. More robust: compare deltas of reusedBlocks/missedBlocks before and after requests (expect more "missed" and no "reused" when the salt changes).

I can refactor the test to record (reusedBlocks, missedBlocks) before/after each call and assert on those deltas. Want me to push a patch?
223-231: Catch a specific OpenAI exception for empty-salt validation.

Catching Exception is too broad; assert the protocol semantics explicitly:

```diff
-    with pytest.raises(Exception) as exc_info:
+    with pytest.raises(openai.BadRequestError) as exc_info:
         client.chat.completions.create(
             model=model_name,
             messages=messages,
             max_tokens=max_tokens,
             extra_body={"cache_salt": ""}  # Empty string should be rejected
         )
```

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (4)
100-101: Public alias `CacheSaltIDType`: OK; add brief Doxygen for semantics.

Add a short comment clarifying what the salt represents (e.g., a per-tenant or per-user cache domain) and its valid range/type as defined in runtime.
1770-1774: Getter `getCacheSaltID()`: add Doxygen.

Document its intended use in hashing/block-key derivation; note that absence implies unsalted behavior.
2053-2055: `mCacheSaltID` member: consider privacy and documentation.

Keeping it protected is fine for derived access. Recommend documenting its lifecycle (e.g., whether it is copied to children via createChildRequest) and thread-safety expectations.

If child requests inherit the salt, confirm this in the LlmRequest::createChildRequest implementation or in tests.
1-3: Update the header copyright year and consider include guards per guidelines.

The file header still says 2022-2024; please bump it to 2025. Repo guidelines also prefer explicit include guards (TRTLLM_BATCH_MANAGER_LLMREQUEST_H) over `#pragma once`.

Would you like me to generate a guard template consistent with the project's naming?
cpp/tensorrt_llm/executor/request.cpp (1)
422-425: New mutator setCacheSaltID: OK; minor style nit.

Delegation is consistent. Optional nit: most codebases avoid `return expr;` in `void` functions; if you prefer, change it to a plain call for readability:

```diff
-void Request::setCacheSaltID(CacheSaltIDType cacheSaltID)
-{
-    return mImpl->setCacheSaltID(cacheSaltID);
-}
+void Request::setCacheSaltID(CacheSaltIDType cacheSaltID)
+{
+    mImpl->setCacheSaltID(cacheSaltID);
+}
```

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)
134-136: Widened multimodal start offset to 64-bit: ensure MmKey's second type matches.

`mmStartInBlock` is now `uint64_t` and is later hashed as 64-bit. Make sure the `MmKey` alias (and any containers) was also updated to use `uint64_t` as the second element to avoid truncation when emplacing. The local comment still says `SizeType32`; please update it for accuracy:

```diff
-    std::vector<MmKey> extraKeys; // MmKey = std::pair<std::array<uint8_t, 32>, SizeType32>
+    std::vector<MmKey> extraKeys; // MmKey = std::pair<std::array<uint8_t, 32>, uint64_t>
```
170-178: Salt mixed into the root-block hash only: design acknowledged; document and test.

Mixing the salt only when `parentHash == 0` is sufficient, as descendants derive from the salted root hash. Two asks:

- Document this in `BlockKeyHasher`'s header-level comment for future maintainers.
- Ensure tests cover "same tokens, different salt → no reuse" and "same tokens, same salt → reuse," including partial-reuse scenarios.

The implementation looks sound.
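The root-only salting argument is easy to check with a toy hash chain (illustrative Python; the real `BlockKeyHasher` uses its own hash and key layout, so the function shape here is an assumption):

```python
import hashlib


def block_hashes(token_blocks, salt_id=None):
    """Chain block hashes; mix the salt into the first (root) block only."""
    hashes, parent = [], 0
    for block in token_blocks:
        h = hashlib.sha256()
        h.update(parent.to_bytes(8, "little"))
        if parent == 0 and salt_id is not None:
            h.update(salt_id.to_bytes(8, "little"))  # salt the root block only
        for tok in block:
            h.update(tok.to_bytes(4, "little"))
        parent = int.from_bytes(h.digest()[:8], "little")
        hashes.append(parent)
    return hashes


blocks = [[1, 2, 3], [4, 5, 6]]
# Same tokens + same salt -> identical keys, so reuse is possible.
assert block_hashes(blocks, 42) == block_hashes(blocks, 42)
# Same tokens + different salt -> every hash differs, descendants included,
# because each child hash is derived from the salted root.
a, b = block_hashes(blocks, 42), block_hashes(blocks, 43)
assert all(x != y for x, y in zip(a, b))
```

Because descendants chain off the root hash, salting the root is enough to separate entire sequences, which is exactly the property the review asks tests to cover.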
cpp/tensorrt_llm/pybind/batch_manager/llmRequest.h (1)
55-90: Constructor extended with cacheSaltID: verify bindings parity and header defaults.

Forwarding `cacheSaltID` to `Base` is correct. Please verify:

- pybind and nanobind call sites include the new kwarg at the tail to keep backward compatibility (it looks like they do).
- The underlying C++ Base constructors have defaulted trailing parameters so older code paths remain callable.

Also, consider updating the header's SPDX year range to include 2025 to match project guidance:

```diff
- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

tensorrt_llm/llmapi/llm.py (3)
319-320: Public API: adding cache_salt to generate_async is nice; consider parity in generate().

Adding `cache_salt` to `generate_async` is great. For parity and ergonomics, consider also threading a `cache_salt` (or a per-item list) through `generate()` to avoid forcing users into the async path for salting.
343-346: Derive cache_salt_id once: OK; add a small guard for empty strings?

Current logic treats `""` as a valid salt. If you'd prefer "empty string = no salt," add a truthy check before hashing. Otherwise, it's fine as-is:

```diff
-        cache_salt_id = get_cache_salt_id(
-            cache_salt) if cache_salt is not None else None
+        cache_salt_id = get_cache_salt_id(cache_salt) if cache_salt else None
```
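For reference, one plausible shape for the hashing helper (a sketch; the actual `get_cache_salt_id` in `tensorrt_llm.inputs.utils` may be implemented differently) is a stable string-to-uint64 digest:

```python
import hashlib


def get_cache_salt_id(cache_salt: str) -> int:
    """Hypothetical sketch: map a salt string to a stable 64-bit ID."""
    digest = hashlib.sha256(cache_salt.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


a = get_cache_salt_id("tenant-a")
assert a == get_cache_salt_id("tenant-a")  # deterministic across calls
assert a != get_cache_salt_id("tenant-b")  # distinct salts diverge
assert 0 <= a < 2**64                      # fits a uint64 CacheSaltIDType
```

A content-derived ID like this keeps the salt stable across processes, which matters because the ID is what ends up in the block-key hash on the C++ side.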
334-335: Docstring line too long (ruff E501): wrap for 120 columns.

Reflow the new line to satisfy the project's 120-column limit:

```diff
-            cache_salt (str, optional): If specified, KV cache will be salted with the provided string to limit the kv cache reuse to the requests with the same string. Defaults to None.
+            cache_salt (str, optional): If specified, KV cache will be salted with
+                the provided string to limit KV-cache reuse to requests with the
+                same string. Defaults to None.
```
Actionable comments posted: 0
♻️ Duplicate comments (2)
cpp/tensorrt_llm/nanobind/executor/request.cpp (2)
579-605: Backward-incompatible pickle shape and ctor arg mismatch; fix __setstate__ and restore compatibility.

- The size check forces `34`, breaking older pickles.
- The placement-new omits `languageAdapterUid` and `allottedTimeMs` and passes `cache_salt_id` in their slot, which is a constructor arity/order mismatch and will not bind to the intended overload.

Make `__setstate__` accept 33 (legacy), 34 (intermediate: only `cache_salt_id`), and 35 (current: `allotted_time_ms` + `cache_salt_id`), and pass `std::nullopt` for `languageAdapterUid`, then the extracted `allottedTimeMs` and `cacheSaltID` in order.

Apply this diff:

```diff
-    auto requestSetstate = [](tle::Request& self, nb::tuple const& state)
+    auto requestSetstate = [](tle::Request& self, nb::tuple const& state)
     {
-        if (state.size() != 34)
+        if (state.size() != 33 && state.size() != 34 && state.size() != 35)
         {
             throw std::runtime_error("Invalid Request state!");
         }
+        // Backward compatibility:
+        // 33: legacy (no allotted_time_ms, no cache_salt_id)
+        // 34: intermediate (cache_salt_id only)
+        // 35: current (allotted_time_ms and cache_salt_id)
+        std::optional<tle::MillisecondsType> allottedTimeMs = std::nullopt;
+        std::optional<tle::CacheSaltIDType> cacheSaltID = std::nullopt;
+        if (state.size() == 34)
+        {
+            cacheSaltID = nb::cast<std::optional<tle::CacheSaltIDType>>(state[33]);
+        }
+        else if (state.size() == 35)
+        {
+            allottedTimeMs = nb::cast<std::optional<tle::MillisecondsType>>(state[33]);
+            cacheSaltID = nb::cast<std::optional<tle::CacheSaltIDType>>(state[34]);
+        }
         new (&self) tle::Request(nb::cast<VecTokens>(state[0]), nb::cast<SizeType32>(state[1]),
             nb::cast<bool>(state[2]), nb::cast<tle::SamplingConfig>(state[3]), nb::cast<tle::OutputConfig>(state[4]),
             nb::cast<std::optional<SizeType32>>(state[5]), nb::cast<std::optional<SizeType32>>(state[6]),
             nb::cast<std::optional<std::vector<SizeType32>>>(state[7]),
             nb::cast<std::optional<std::list<VecTokens>>>(state[8]),
             nb::cast<std::optional<std::list<VecTokens>>>(state[9]), nb::cast<std::optional<Tensor>>(state[10]),
             nb::cast<std::optional<tle::ExternalDraftTokensConfig>>(state[11]),
             nb::cast<std::optional<tle::PromptTuningConfig>>(state[12]),
             nb::cast<std::optional<tle::MultimodalInput>>(state[13]), nb::cast<std::optional<Tensor>>(state[14]),
             nb::cast<std::optional<tle::MropeConfig>>(state[15]), nb::cast<std::optional<tle::LoraConfig>>(state[16]),
             nb::cast<std::optional<tle::LookaheadDecodingConfig>>(state[17]),
             nb::cast<std::optional<tle::KvCacheRetentionConfig>>(state[18]),
             nb::cast<std::optional<std::string>>(state[19]),
             nb::cast<std::optional<tle::LogitsPostProcessor>>(state[20]),
             nb::cast<std::optional<VecTokens>>(state[21]), nb::cast<std::optional<IdType>>(state[22]),
             nb::cast<bool>(state[23]), nb::cast<tle::PriorityType>(state[24]), nb::cast<tle::RequestType>(state[25]),
             nb::cast<std::optional<tle::ContextPhaseParams>>(state[26]), nb::cast<std::optional<tle::Tensor>>(state[27]),
             nb::cast<std::optional<SizeType32>>(state[28]), nb::cast<std::optional<tle::Tensor>>(state[29]), 1,
             nb::cast<std::optional<tle::EagleConfig>>(state[30]),
-            nb::cast<std::optional<tle::Tensor>>(state[31]),
-            nb::cast<std::optional<tle::GuidedDecodingParams>>(state[32]),
-            nb::cast<std::optional<tle::CacheSaltIDType>>(state[33]));
+            nb::cast<std::optional<tle::Tensor>>(state[31]),
+            nb::cast<std::optional<tle::GuidedDecodingParams>>(state[32]),
+            std::nullopt, // languageAdapterUid not serialized today
+            allottedTimeMs,
+            cacheSaltID);
     };
```
575-576: Allotted time is not serialized; pickling drops allotted_time_ms.

`__getstate__` appends `cache_salt_id` but omits `allotted_time_ms`, so a Request pickled with a non-null allotted time will silently lose it upon unpickling. Append `getAllottedTimeMs()` to the state for persistence.

Apply this diff:

```diff
-            self.getCrossAttentionMask(), self.getEagleConfig(), self.getSkipCrossAttnBlocks(),
-            self.getGuidedDecodingParams(), self.getCacheSaltID());
+            self.getCrossAttentionMask(), self.getEagleConfig(), self.getSkipCrossAttnBlocks(),
+            self.getGuidedDecodingParams(),
+            self.getAllottedTimeMs(), // NEW: preserve allotted_time_ms
+            self.getCacheSaltID());   // NEW: preserve cache_salt_id
```
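The versioned-state scheme recommended in both comments above can be prototyped with plain Python pickling; the `Request` class here is a hypothetical stand-in for the bound C++ type, kept to three fields for clarity:

```python
import pickle


class Request:
    def __init__(self, tokens, allotted_time_ms=None, cache_salt_id=None):
        self.tokens = tokens
        self.allotted_time_ms = allotted_time_ms
        self.cache_salt_id = cache_salt_id

    def __getstate__(self):
        # Current layout: new fields are appended at the end of the tuple.
        return (self.tokens, self.allotted_time_ms, self.cache_salt_id)

    def __setstate__(self, state):
        allotted, salt = None, None
        if len(state) == 1:        # legacy: neither field serialized
            pass
        elif len(state) == 2:      # intermediate: cache_salt_id only
            salt = state[1]
        elif len(state) == 3:      # current: both fields
            allotted, salt = state[1], state[2]
        else:
            raise RuntimeError("Invalid Request state!")
        self.tokens, self.allotted_time_ms, self.cache_salt_id = state[0], allotted, salt


# Round-trip through the current format preserves both new fields.
r = pickle.loads(pickle.dumps(Request([1, 2], allotted_time_ms=50, cache_salt_id=7)))
assert (r.allotted_time_ms, r.cache_salt_id) == (50, 7)

# A state produced by an older writer (no allotted time) still loads.
old = Request([1, 2])
old.__setstate__(([1, 2], 7))
assert (old.allotted_time_ms, old.cache_salt_id) == (None, 7)
```

Appending new fields at the tail and branching on the state length is the same idea the suggested C++ diff applies to the nanobind `__setstate__`.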
🧹 Nitpick comments (1)
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)
684-686: New kw-only args are exposed clearly; consider documenting persistence behavior.

`language_adapter_uid`, `allotted_time_ms`, and `cache_salt_id` are now constructible from Python. Note that only `allotted_time_ms` and `cache_salt_id` are intended to be pickled; `language_adapter_uid` is not. Please confirm this is intentional and document it in the Python API docs to avoid surprises.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (3 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (5 hunks)
- cpp/tensorrt_llm/pybind/executor/request.cpp (5 hunks)
- tensorrt_llm/inputs/utils.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/inputs/utils.py
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h
- cpp/tensorrt_llm/pybind/executor/request.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/nanobind/executor/request.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/nanobind/executor/request.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/nanobind/executor/request.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/nanobind/executor/request.cpp
🔇 Additional comments (2)
cpp/tensorrt_llm/nanobind/executor/request.cpp (2)
644-646: Ctor signature addition looks good and is consistent with the Python exposure.

The trailing optional parameters `languageAdapterUid`, `allottedTimeMs`, and `cacheSaltID` are ordered correctly and keep ABI stability for positional args in Python (kw-only). No further action.
728-729: Property exposure for cache_salt_id is correct.

The RW property maps cleanly to the new accessors. Once `__getstate__`/`__setstate__` are fixed, round-tripping via pickle should preserve this value.
4d3fa49 to 57ea549
/bot run --disable-fail-fast |
PR_Github #16229 [ run ] triggered by Bot |
Actionable comments posted: 2
🧹 Nitpick comments (9)
cpp/include/tensorrt_llm/runtime/common.h (1)
47-47: Avoid duplicating CacheSaltIDType across namespaces.

There is a parallel alias in executor/types.h. To prevent drift and include-order surprises, consider centralizing the typedef in one header (e.g., runtime/common.h) and aliasing from executor (`using CacheSaltIDType = runtime::CacheSaltIDType;`):

```diff
-// in cpp/include/tensorrt_llm/executor/types.h
-using CacheSaltIDType = std::uint64_t;
+using CacheSaltIDType = tensorrt_llm::runtime::CacheSaltIDType;
```

cpp/tensorrt_llm/executor/serialization.cpp (2)
716-726: Outdated parameter-count comment.

The comment says "35 parameters," but an extra argument was added. Either update the count or remove the comment to avoid future drift:

```diff
-    // 35 parameters
+    // 36 parameters
```
714-726: Serialization parity verified.

The `cacheSaltID` field is correctly included at the end of both `RequestImpl::serialize()` and `RequestImpl::serializedSize()` via the `visitMembers` invocation (see the final `lambda(mCacheSaltID);` in `requestImpl.h`), and it is deserialized last in `Serialization::deserializeRequest` (lines 714-726 in `serialization.cpp`). No misalignment between writer and reader was found.

Optional improvement:

- If you require cross-version IPC resilience, consider adding a minor version tag to the stream or performing a feature-flagged read (e.g., peeking for EOF before attempting to deserialize `cacheSaltID`).
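The feature-flagged read suggested above can be sketched with a minimal byte stream (illustrative only; the real serializer's binary layout differs, and the function names are assumptions):

```python
import io
import struct


def write_request(buf, token_count, cache_salt_id=None):
    """Write a fixed prefix, then the optional salt appended at the tail."""
    buf.write(struct.pack("<I", token_count))
    if cache_salt_id is not None:  # appended last, so older readers can ignore it
        buf.write(struct.pack("<Q", cache_salt_id))


def read_request(buf):
    """Read the prefix, then peek for trailing bytes before reading the salt."""
    (token_count,) = struct.unpack("<I", buf.read(4))
    tail = buf.read(8)
    cache_salt_id = struct.unpack("<Q", tail)[0] if len(tail) == 8 else None
    return token_count, cache_salt_id


new = io.BytesIO()
write_request(new, 3, 99)
new.seek(0)
assert read_request(new) == (3, 99)

old = io.BytesIO()
write_request(old, 3)  # stream from an older writer without the salt field
old.seek(0)
assert read_request(old) == (3, None)
```

Placing the new field last and treating end-of-stream as "field absent" is what lets a newer reader accept streams from an older writer.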
292-293
: Documentcache_salt_id
in nanobind bindingsParity between the nanobind and pybind constructors is correct: both take a trailing
cache_salt_id
parameter defaulting tonullopt
/None
.• File:
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
– At the.def("__init__", … nb::arg("cache_salt_id") = std::nullopt)
invocation (around line 359), add a short docstring explaining thatcache_salt_id
implements per-request KV cache salting and thatNone
means “no salt.”tensorrt_llm/executor/executor.py (1)
125-126
: New cache_salt_id parameter — good, but update the docstringAdd a short docstring entry explaining purpose and expected type/range (uint64). This helps downstream users discoverability.
def generate_async( self, prompt_token_ids: List[int], sampling_params: SamplingParams, query_token_ids: Optional[Union[torch.Tensor, np.ndarray, list]] = None, lora_request: Optional[LoRARequest] = None, prompt_adapter_request: Optional[PromptAdapterRequest] = None, streaming: bool = False, kv_cache_retention_config: Optional[KvCacheRetentionConfig] = None, disaggregated_params: Optional[DisaggregatedParams] = None, postproc_params: Optional[PostprocParams] = None, multimodal_params: Optional[MultimodalParams] = None, scheduling_params: Optional[SchedulingParams] = None, - cache_salt_id: Optional[int] = None, + cache_salt_id: Optional[int] = None, ) -> GenerationResult: - """Generate output for the given prompt token ids in the asynchronous mode. - Asynchronous generation accepts single prompt only. - """ + """Generate output for the given prompt token ids in the asynchronous mode. + Asynchronous generation accepts single prompt only. + + Args: + cache_salt_id: Optional 64-bit integer used to salt KV-cache keys for this request. + Use the same salt to enable reuse across trusted requests; use different salts to isolate reuse. + When None, no additional salting is applied. + """tensorrt_llm/llmapi/llm.py (4)
330-331: Expose cache_salt in synchronous generate() for API parity.

`generate_async()` now accepts `cache_salt`, but `generate()` does not pass it through. This makes KV-cache salting inaccessible via the sync API and diverges from the async behavior. Suggest adding an optional `cache_salt` (scalar or per-request sequence) to `generate()` and forwarding it per item, mirroring how other per-request params are handled.

Apply this diff to add the parameter, docstring, and pass-through:

```diff
@@ def generate(
         self,
         inputs: Union[PromptInputs, Sequence[PromptInputs]],
         sampling_params: Optional[Union[SamplingParams, List[SamplingParams]]] = None,
         use_tqdm: bool = True,
         lora_request: Optional[Union[LoRARequest, Sequence[LoRARequest]]] = None,
         prompt_adapter_request: Optional[Union[PromptAdapterRequest, Sequence[PromptAdapterRequest]]] = None,
         kv_cache_retention_config: Optional[Union[KvCacheRetentionConfig, Sequence[KvCacheRetentionConfig]]] = None,
         disaggregated_params: Optional[Union[DisaggregatedParams, Sequence[DisaggregatedParams]]] = None,
         scheduling_params: Optional[Union[SchedulingParams, List[SchedulingParams]]] = None,
+        cache_salt: Optional[Union[str, Sequence[str]]] = None,
     ) -> Union[RequestOutput, List[RequestOutput]]:
@@
             scheduling_params (tensorrt_llm.scheduling_params.SchedulingParams, List[tensorrt_llm.scheduling_params.SchedulingParams], optional): Scheduling parameters. Defaults to None.
+            cache_salt (str or Sequence[str], optional):
+                If specified, salts the KV cache so reuse is limited to requests that use the same salt.
+                For batched inputs, pass a sequence of salts aligned with the batch. Defaults to None.
@@
         for i, request_inputs in enumerate(inputs):
             future = self.generate_async(
                 request_inputs,
                 sampling_params=_item_at(sampling_params, i),
                 lora_request=_item_at(lora_request, i),
                 prompt_adapter_request=_item_at(prompt_adapter_request, i),
                 kv_cache_retention_config=_item_at(kv_cache_retention_config, i),
                 disaggregated_params=_item_at(disaggregated_params, i),
                 scheduling_params=_item_at(scheduling_params, i),
-                streaming=False)
+                streaming=False,
+                cache_salt=_item_at(cache_salt, i))
```

Also applies to: 238-256, 258-276, 293-305
345-345: Wrap the docstring line to satisfy E501 (<=120 chars).

The line length exceeds the configured limit; wrap it for readability and to keep linters quiet:

```diff
-            cache_salt (str, optional): If specified, KV cache will be salted with the provided string to limit the kv cache reuse to the requests with the same string. Defaults to None.
+            cache_salt (str, optional):
+                If specified, salts the KV cache so reuse is limited to requests that use the same string.
+                Defaults to None.
```
355-357: Normalize empty/whitespace salts to None to avoid an accidental global bucket.

Treating "" or " " as a valid salt creates a single common salt bucket that callers might hit by mistake. Normalizing blank/whitespace-only salts to None reduces surprise and aligns with the default behavior:

```diff
-        cache_salt_id = get_cache_salt_id(
-            cache_salt) if cache_salt is not None else None
+        # Normalize blank/whitespace-only salts to None
+        cache_salt_id = (
+            get_cache_salt_id(cache_salt.strip())
+            if (cache_salt and cache_salt.strip())
+            else None
+        )
```
1-1: Repository guideline: ensure the NVIDIA copyright header is present.

Per coding guidelines, source files should carry the NVIDIA copyright header (current year). This file appears to be missing it. If the project's standard header is the canonical NVIDIA header, prepend it here (or confirm it's managed by a repo-wide tooling step).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (31)
- benchmarks/cpp/disaggServerBenchmark.cpp (1 hunks)
- benchmarks/cpp/gptManagerBenchmark.cpp (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (15 hunks)
- cpp/include/tensorrt_llm/executor/executor.h (4 hunks)
- cpp/include/tensorrt_llm/executor/types.h (1 hunks)
- cpp/include/tensorrt_llm/runtime/common.h (1 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3 hunks)
- cpp/tensorrt_llm/executor/request.cpp (5 hunks)
- cpp/tensorrt_llm/executor/requestImpl.h (7 hunks)
- cpp/tensorrt_llm/executor/serialization.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (5 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (3 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (5 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (5 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/llmRequest.h (3 hunks)
- cpp/tensorrt_llm/pybind/executor/request.cpp (5 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/executor/executor.py (2 hunks)
- tensorrt_llm/executor/request.py (2 hunks)
- tensorrt_llm/executor/worker.py (1 hunks)
- tensorrt_llm/inputs/__init__.py (2 hunks)
- tensorrt_llm/inputs/utils.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (5 hunks)
- tensorrt_llm/serve/openai_protocol.py (2 hunks)
- tensorrt_llm/serve/openai_server.py (1 hunks)
- tests/integration/defs/test_e2e.py (1 hunks)
- tests/unittest/llmapi/apps/_test_openai_cache_salt.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (25)
- tensorrt_llm/inputs/utils.py
- tensorrt_llm/executor/worker.py
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/executor/types.h
- tensorrt_llm/serve/openai_protocol.py
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp
- tensorrt_llm/inputs/__init__.py
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- benchmarks/cpp/disaggServerBenchmark.cpp
- tensorrt_llm/executor/request.py
- cpp/tensorrt_llm/pybind/batch_manager/llmRequest.cpp
- benchmarks/cpp/gptManagerBenchmark.cpp
- cpp/tensorrt_llm/pybind/batch_manager/llmRequest.h
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/pybind/executor/request.cpp
- tensorrt_llm/serve/openai_server.py
- cpp/tensorrt_llm/executor/requestImpl.h
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
- cpp/tensorrt_llm/executor/request.cpp
- tests/unittest/llmapi/apps/_test_openai_cache_salt.py
- tests/integration/defs/test_e2e.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/include/tensorrt_llm/executor/executor.h
- cpp/tensorrt_llm/nanobind/executor/request.cpp
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/include/tensorrt_llm/runtime/common.h
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
cpp/tensorrt_llm/executor/serialization.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/include/tensorrt_llm/runtime/common.h
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
cpp/tensorrt_llm/executor/serialization.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/runtime/common.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/include/tensorrt_llm/runtime/common.h
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
tensorrt_llm/executor/executor.py
cpp/tensorrt_llm/executor/serialization.cpp
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/llmapi/llm.py
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
cpp/tensorrt_llm/executor/serialization.cpp
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
- Python code must target Python 3.8+
- Python indentation: 4 spaces, no tabs
- Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
- Python file names use snake_case
- Python class names use PascalCase
- Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
- Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
- Constants use UPPER_SNAKE_CASE in Python
- Avoid shadowing variables from outer scopes in Python
- Initialize all externally visible members of a Python class in __init__
- Prefer docstrings for interfaces used outside a file; comments for local code
- Use Google-style docstrings for classes and functions (Sphinx-parsable)
- Document attributes/variables inline with short docstrings
- Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
- In try/except, catch the narrowest exceptions possible
- For duck-typing with try/except, keep try body minimal and put logic in else

Files:
- tensorrt_llm/executor/executor.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/llmapi/llm.py
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py
345-345: Line too long (186 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
cpp/include/tensorrt_llm/runtime/common.h (1)
47-47
: Add CacheSaltIDType alias (uint64_t) — looks good
Fixed-width unsigned fits the hashing/salting use-case and matches other IDs in this header. No functional concerns here.
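The PR maps a user-supplied `cache_salt` string to this 64-bit ID via `get_cache_salt_id` in `tensorrt_llm.inputs`. As a minimal sketch of one plausible derivation (the real helper may differ; `derive_cache_salt_id` here is hypothetical):

```python
import hashlib


def derive_cache_salt_id(cache_salt: str) -> int:
    """Hypothetical derivation: hash the salt string down to a uint64 ID."""
    digest = hashlib.sha256(cache_salt.encode("utf-8")).digest()
    # Truncate to 8 bytes so the value fits CacheSaltIDType (uint64_t).
    return int.from_bytes(digest[:8], "little")


salt_a = derive_cache_salt_id("tenant-a")
salt_b = derive_cache_salt_id("tenant-b")
assert 0 <= salt_a < 2**64 and salt_a != salt_b
```

Any stable string-to-uint64 mapping works here; the only requirements are determinism (same salt, same ID) and enough entropy that distinct salts rarely collide.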
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
547-549
: Threading cache_salt_id into LlmRequest — correct and consistent
Forwarding executor_request.cache_salt_id matches the new binding signature and keeps the parameter optional. This maintains backward compatibility.
547-549
: cache_salt_id is fully exposed and propagated end-to-end — no further changes needed

Python bindings
– executor_request.cache_salt_id is surfaced in worker.py (line 503), request.py (line 99), and executor.py (line 150).
– Nanobind/pybind both expose cache_salt_id as a read/write property on tle::Request.

C++ Request serialization
– The C++ Request constructor and its Impl (in requestImpl.h lines 89–92) store the salt in mCacheSaltID.
– Serialization/deserialization in executor/serialization.cpp (lines 714–725) carries cacheSaltID through the stream.

KV cache key hashing
– BlockKey is constructed with llmRequest.getCacheSaltID() (kvCacheManager.cpp line 154) and stored in the key.
– Unit tests in kvCacheManagerTest.cpp verify that requests with different cache_salt_id produce distinct cache blocks (Test Case 2), confirming the salt is mixed into the key hash.

All three verification points pass; this review comment can be resolved.
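The behavior those tests exercise can be modeled in a few lines: the salt seeds the first block's hash, so identical token prefixes under different salts land in different cache blocks and cannot be reused across salt groups. A Python sketch of the idea (an illustration only, not the C++ BlockKey implementation):

```python
import hashlib
from typing import Optional, Sequence


def first_block_hash(tokens: Sequence[int], cache_salt_id: Optional[int]) -> int:
    """Illustrative block-key hash: seed with the salt, then fold in tokens."""
    h = hashlib.blake2b(digest_size=8)
    if cache_salt_id is not None:
        h.update(cache_salt_id.to_bytes(8, "little"))  # salt perturbs the seed
    for tok in tokens:
        h.update(tok.to_bytes(4, "little"))
    return int.from_bytes(h.digest(), "little")


prefix = [101, 2023, 2003, 1037]
# Same salt -> same key, so reuse is still possible within a salt group...
assert first_block_hash(prefix, 7) == first_block_hash(prefix, 7)
# ...but different salts (or no salt) give distinct keys: no cross-user reuse.
assert first_block_hash(prefix, 7) != first_block_hash(prefix, 8)
assert first_block_hash(prefix, 7) != first_block_hash(prefix, None)
```

Because only the first block mixes in the salt while later blocks chain from their predecessor's hash, perturbing that one seed is enough to partition the entire prefix tree per salt.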
cpp/tensorrt_llm/executor/serialization.cpp (1)
714-715
: Deserializing optional cacheSaltID — correct placement
Reading the std::optional at the end of Request’s stream is straightforward and keeps older fields untouched.
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)
192-193
: Expose cache_salt_id on GenericLlmRequest — good Python ergonomics
Read-only property mapped to GenLlmReq::getCacheSaltID is appropriate. Optional maps naturally to Python’s None.
tensorrt_llm/llmapi/llm.py (1)
30-31
: LGTM: proper import of cache-salt utilities
Importing create_input_processor_with_hash and get_cache_salt_id from tensorrt_llm.inputs cleanly wires the new salting and MM hashing paths into this module.
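End to end, a client opts into salted reuse by attaching a salt to its request; per the walkthrough, the OpenAI-compatible server accepts a cache_salt field on chat-completion requests. A hedged sketch of such a request body (field names other than cache_salt follow the standard OpenAI chat schema; the model name is a placeholder):

```python
import json

# Hypothetical request payload; "cache_salt" scopes KV-cache reuse so that
# only requests carrying the same salt can share cached prefix blocks.
payload = {
    "model": "my-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize our Q3 report."}],
    "cache_salt": "org-1234",  # per-tenant secret salt
}
body = json.dumps(payload)
assert json.loads(body)["cache_salt"] == "org-1234"
```

Keeping the salt secret per tenant (or per user) is what makes this a defense: an attacker who cannot guess the salt cannot construct a request whose prefix blocks collide with the victim's.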
PR_Github #16229 [ run ] completed with state |
Overall LGTM on the llmapi changes.
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Force-pushed from eb52529 to 0b7e78c
/bot run |
PR_Github #17624 [ run ] triggered by Bot |
PR_Github #17624 [ run ] completed with state |
LGTM
Signed-off-by: Chang Liu <[email protected]>
LGTM
/bot run |
PR_Github #17718 [ run ] triggered by Bot |
PR_Github #17718 [ run ] completed with state |
/bot run |
PR_Github #17803 [ run ] triggered by Bot |
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
PR_Github #17803 [ run ] completed with state |
/bot reuse-pipeline |
PR_Github #17898 [ reuse-pipeline ] triggered by Bot |
PR_Github #17898 [ reuse-pipeline ] completed with state |
NVIDIA#7106) Signed-off-by: Chang Liu (Enterprise Products) <[email protected]> Signed-off-by: Chang Liu <[email protected]>
Refer to vllm-project/vllm#16016 for context. Salting the KV cache mitigates security risks such as prompt theft attacks, in which an attacker infers another user's prompt by probing whether shared prefix-cache blocks are reused.
Summary by CodeRabbit
New Features
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional) pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.

skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.