[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) #7294
Conversation
📝 Walkthrough
Removed the public `enable_mixed_sampler` option; per-request sampling strategies are now grouped and sampled in batches (see the sequence diagrams below).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant TRTExec as TRT Executor
    participant Sampler as TorchSampler
    participant Gen as Generator
    participant Req as LlmRequest[]
    Client->>TRTExec: sample_async(requests)
    TRTExec->>Sampler: _process_requests(requests, logits, states)
    Sampler->>Sampler: _group_requests_by_sampling_strategy(requests)
    Sampler->>Gen: get_generator(device)
    par Per-strategy batched sampling
        Sampler->>Sampler: pack indices (torch_multi_arange / indexers)
        alt embedding-bias present
            Sampler->>Sampler: _apply_embedding_bias
        end
        Sampler->>Sampler: sample(strategy, logits_batch, generator, softmax_indices?)
    end
    Sampler-->>TRTExec: _BatchedSamplingResult (tokens, softmax, mappings)
    TRTExec->>TRTExec: unbatch -> update per-request states
    TRTExec-->>Client: next tokens (+ optional probs)
```
```mermaid
sequenceDiagram
    autonumber
    participant Drafter as ModelDrafter
    participant Sampler as TorchSampler
    participant Req as LlmRequest(draft)
    Drafter->>Sampler: should_provide_draft_probs(request)
    Drafter->>Req: create(exclude_last_generation_logits=True, return_generation_logits=from sampler)
    Drafter->>Sampler: sample_async(sample_state)
    Sampler-->>Drafter: draft tokens (+/- draft probs)
    Drafter->>Drafter: forward draft logits into target request (py_draft_logits)
```
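To make the per-strategy sampling step in the first diagram concrete, here is a minimal, hypothetical dispatch sketch in Python. The strategy encoding and function name are assumptions for illustration, not the PR's actual `sample()` implementation.

```python
import torch


def sample_one_strategy(strategy, logits, generator=None):
    """Batched sampling for one strategy group (illustrative sketch only).

    `strategy` is assumed to be a tuple like ("greedy",) or ("top_p", 0.9);
    `logits` is [rows, vocab] for all requests that share this strategy.
    """
    kind, *params = strategy
    if kind == "greedy":
        return torch.argmax(logits, dim=-1)
    if kind == "top_p":
        (top_p,) = params
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        exclusive_cumsum = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
        sorted_probs = torch.where(exclusive_cumsum < top_p, sorted_probs,
                                   torch.zeros_like(sorted_probs))
        sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
        choice = torch.multinomial(sorted_probs, 1, generator=generator)
        return sorted_idx.gather(-1, choice).squeeze(-1)
    raise ValueError(f"unhandled strategy: {kind}")
```

Each strategy group is sampled once over its whole batch, which is the point of grouping requests by strategy before sampling.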
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Possibly related PRs
Suggested reviewers
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
Lines 1-8: Missing NVIDIA license header. Add the standard 2025 NVIDIA SPDX header.

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
```

tensorrt_llm/_torch/modules/rms_norm.py (1)
Lines 56-62: FlashInfer path returns stale residual; semantics diverge from the non-Flash path. When residual is provided, non-Flash returns (normed, input+residual), while FlashInfer currently returns (normed, original residual). This breaks downstream skip/residual chaining.

Apply this fix to compute and return the updated residual consistently:

```diff
     if IS_FLASHINFER_AVAILABLE:
         from ..custom_ops import (flashinfer_fused_add_rmsnorm,
                                   flashinfer_rmsnorm)
-        if residual is not None:
-            flashinfer_fused_add_rmsnorm(hidden_states, residual,
-                                         self.weight, self.variance_epsilon)
+        if residual is not None:
+            # Keep semantics consistent with the non-Flash path:
+            # residual_out = (input + residual) in input dtype.
+            input_dtype = hidden_states.dtype
+            new_residual = (hidden_states.to(torch.float32) +
+                            residual.to(torch.float32)).to(input_dtype)
+            flashinfer_fused_add_rmsnorm(hidden_states, residual,
+                                         self.weight, self.variance_epsilon)
+            residual = new_residual
         else:
             hidden_states = flashinfer_rmsnorm(hidden_states, self.weight,
                                                self.variance_epsilon)
@@
-        if residual is None:
+        if residual is None:
             return hidden_states
         else:
             return hidden_states, residual
```

Also applies to: 74-78
tensorrt_llm/_torch/pyexecutor/_util.py (2)
Lines 1-8: Missing NVIDIA license header. Add the standard 2025 NVIDIA SPDX header.

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
```
Lines 576-587: Remove stale `enable_mixed_sampler` reference in API stability tests. The `create_torch_sampler_args` signature change is correctly reflected in code, and all internal callers have been updated. However, the API-stability specification still lists the removed `enable_mixed_sampler` field:
- tests/unittest/api_stability/references/llm.yaml: lines 110-112 reference `enable_mixed_sampler`

Please remove or update that entry so the stability tests remain in sync with the new `TorchSampler.Args`.
🧹 Nitpick comments (6)
tensorrt_llm/_torch/speculative/eagle3.py (1)
Lines 17-19: TYPE_CHECKING import: LGTM. The static-only import avoids runtime dependency cycles. Optionally enable postponed annotations to drop quotes.

Apply this small tweak (optional):

```diff
+from __future__ import annotations
```
tests/unittest/llmapi/test_llm_pytorch.py (1)
Lines 229-233: Broaden coverage: parametrize enable_logprobs for both True/False. The current test only covers logprobs=True. Include False to exercise the non-logprobs path with embedding_bias.

Apply:

```diff
-@pytest.mark.parametrize(
-    "enable_logprobs",
-    [True],
-)
+@pytest.mark.parametrize("enable_logprobs", [False, True])
```

Also applies to: 253-259
tensorrt_llm/_torch/pyexecutor/sampler.py (4)
Lines 166-192: Consider extracting magic values as constants. The batched sampling functions use hardcoded default values; extracting them helps maintainability.

```diff
+# Sampling defaults
+DEFAULT_TOP_K = 50
+DEFAULT_TOP_P = 0.9
+DEFAULT_TEMPERATURE = 1.0
+MIN_TEMPERATURE = 1e-5
+
 def top_k_sampling_batch(
     logits,
-    top_k=50,
+    top_k=DEFAULT_TOP_K,
     generator: Optional[torch.Generator] = None
 ) -> tuple[torch.Tensor, torch.Tensor]:

 def top_p_sampling_batch(
     logits: torch.Tensor,
-    top_p: float = 0.9,
-    temperature: float = 1.0,
+    top_p: float = DEFAULT_TOP_P,
+    temperature: float = DEFAULT_TEMPERATURE,
     generator: Optional[torch.Generator] = None
 ) -> tuple[torch.Tensor, torch.Tensor]:
```
Lines 353-361: Add type hints for improved maintainability. The function would benefit from explicit return type hints.

```diff
 def _group_requests_by_sampling_strategy(
-        requests: Iterable[LlmRequest]) -> dict[Strategy, torch.IntTensor]:
+        requests: Iterable[LlmRequest]) -> dict[Strategy, torch.IntTensor]:
     strategy_dict: dict[Strategy, list[int]] = defaultdict(list)
```
Lines 1052-1190: Well-structured batched sampling pipeline. The `_sample_batched_by_strategy` method effectively groups requests by strategy and processes them in batches, and the draft-logits indices for speculation are properly tracked. One concern: the d2t validation at lines 1071-1075 might be better placed in `ModelDrafter.prepare_draft_tokens`, as suggested by the FIXME comment, to maintain better separation of concerns.
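For intuition, a minimal sketch of the grouping step that feeds this pipeline (hypothetical helper and strategy keys; the actual logic lives in `_group_requests_by_sampling_strategy`):

```python
from collections import defaultdict

import torch


def group_requests_by_strategy(requests, strategy_of):
    """Map each sampling strategy to the indices of the requests that use it.

    `strategy_of` is assumed to return a hashable strategy key per request;
    the index tensors let every strategy group be sampled in one batched call.
    """
    groups: dict[object, list[int]] = defaultdict(list)
    for i, request in enumerate(requests):
        groups[strategy_of(request)].append(i)
    return {
        strategy: torch.tensor(indices, dtype=torch.int32)
        for strategy, indices in groups.items()
    }
```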
Lines 1332-1333: Fix line length violations. These lines exceed the 120-character limit.

```diff
-        # range(next_context_req_offsets_cuda[i] - req_num_steps_fictitious_cuda[i], next_context_req_offsets_cuda[i])
-        # And if generation requests are present, those tensors include a trailing entries selecting all tokens/logits
+        # range(next_context_req_offsets_cuda[i] - req_num_steps_fictitious_cuda[i],
+        #       next_context_req_offsets_cuda[i])
+        # And if generation requests are present, those tensors include a trailing
+        # entries selecting all tokens/logits
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
- tensorrt_llm/_torch/auto_deploy/llm_args.py (0 hunks)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (0 hunks)
- tensorrt_llm/_torch/modules/rms_norm.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (0 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (23 hunks)
- tensorrt_llm/_torch/speculative/eagle3.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (6 hunks)
- tensorrt_llm/llmapi/llm_args.py (0 hunks)
- tensorrt_llm/scaffolding/worker.py (0 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
💤 Files with no reviewable changes (5)
- tensorrt_llm/scaffolding/worker.py
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/auto_deploy/llm_args.py
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/speculative/eagle3.py
tests/unittest/llmapi/test_llm_pytorch.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/modules/rms_norm.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/speculative/model_drafter.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/speculative/eagle3.py
tests/unittest/llmapi/test_llm_pytorch.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/modules/rms_norm.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/speculative/model_drafter.py
🧬 Code graph analysis (6)
tensorrt_llm/_torch/speculative/eagle3.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
EagleDecodingConfig
(418-444)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tests/unittest/llmapi/test_llm.py (1)
llm_test_harness
(109-136)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
max_seq_len
(558-568)max_seq_len
(571-575)
tensorrt_llm/_torch/modules/rms_norm.py (4)
tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py (2)
flashinfer_fused_add_rmsnorm
(32-35)flashinfer_rmsnorm
(21-23)tensorrt_llm/_torch/models/modeling_utils.py (1)
skip_forward
(148-159)tensorrt_llm/_torch/modules/layer_norm.py (1)
skip_forward
(102-120)tensorrt_llm/_torch/modules/decoder_layer.py (1)
skip_forward
(23-34)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
LlmRequest
(282-422)append
(79-98)append
(125-142)get_draft_token_length
(556-567)tensorrt_llm/_torch/speculative/mtp.py (1)
sample_async
(271-318)tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
ScheduledRequests
(18-39)all_requests
(38-39)
tensorrt_llm/_torch/speculative/model_drafter.py (5)
tensorrt_llm/_torch/pyexecutor/sampler.py (20)
Sampler
(52-83)SampleState
(43-49)should_provide_draft_probs
(81-83)should_provide_draft_probs
(1413-1415)beam_width
(72-75)is_generation_model
(78-79)is_generation_model
(109-110)is_generation_model
(162-163)is_generation_model
(718-719)is_generation_model
(1449-1450)sample_async
(63-65)sample_async
(93-96)sample_async
(134-141)sample_async
(940-956)sample_async
(1596-1684)update_requests
(68-69)update_requests
(99-106)update_requests
(144-159)update_requests
(901-925)update_requests
(1688-1701)tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
ScheduledRequests
(18-39)all_requests
(38-39)tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
LlmRequest
(282-422)append
(79-98)append
(125-142)generation_logits
(214-223)tensorrt_llm/_torch/pyexecutor/handle_logits.py (1)
HandleLogits
(11-81)tensorrt_llm/_torch/speculative/mtp.py (2)
sample_async
(271-318)update_requests
(245-269)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py
1332-1332: Line too long (125 > 120)
(E501)
1333-1333: Line too long (122 > 120)
(E501)
🔇 Additional comments (18)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
Lines 366-373: Type annotation addition: LGTM. The optional 1D cache for embedding bias is clearly typed and initialized.
tensorrt_llm/_torch/modules/rms_norm.py (1)
Lines 49-53: No Ellipsis passed to `RMSNorm` residual. I ran targeted searches for any use of the literal `...` passed, either positionally or via `residual=...`, to `RMSNorm` (and its implicit `__call__` to `forward`) and found zero occurrences. This confirms that no call sites propagate `Ellipsis` into the new `residual=None` API. The change is safe to merge.

tensorrt_llm/_torch/pyexecutor/_util.py (1)
Lines 593-596: No change needed: `engine.max_seq_len` already covers both KV-cache and no-cache cases. The `max_seq_len` property on `TRTLLMEngine` is implemented to:
- return the KVCacheManager's maximum sequence length when KV caching is enabled
- fall back to the internally stored `_max_seq_len_storage` for no-cache attention

Thus, passing `engine.max_seq_len` into `create_torch_sampler_args` already applies the correct cap in both scenarios; no further modifications are required here.

Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/speculative/model_drafter.py (6)
Line 59: LGTM! Improved clarity. The error message change from f-string to plain string is appropriate since there are no variables to interpolate.
Lines 85-86: Good comment explaining overlap scheduling. The comment explaining `exclude_last_generation_logits=True` for overlap scheduling provides valuable context.
Lines 268-285: Simplified sampling logic with proper error handling. The refactored `_sample_async` method:
- Correctly computes `num_context_logits_prefix_sum`, accounting for `py_return_context_logits`
- Properly delegates to `HandleLogits` and the sampler's `sample_async`
- Has appropriate error handling that returns `None` on failure

The removal of optional sampler handling simplifies the control flow.
Lines 300-301: LGTM! Simplified request update logic. The removal of the None-check guard for the sampler is consistent with the overall simplification where the sampler is now always required.
Line 317: Properly handles None propagation for draft logits. The comment indicates that `None` values are correctly forwarded when draft logits are not available, which maintains backward compatibility.
Lines 76-90: Draft logit policy integration verified. The `should_provide_draft_probs` API is declared in `Sampler` (returns `True` by default) and correctly overridden in `TRTLLMSampler` to disable draft logits under greedy sampling. A search shows its sole invocation for draft requests in `model_drafter.py`, ensuring consistent, per-request logic without residual flags.

tensorrt_llm/_torch/pyexecutor/sampler.py (9)
Lines 81-83: Conservative default for draft probability handling. The base `should_provide_draft_probs` method returns `True` as a conservative default, ensuring backward compatibility. Derived classes can override it for optimization.
Lines 276-285: Efficient greedy search with optional softmax filtering. The `softmax_indices` parameter enables selective softmax computation, skipping unnecessary calculations for greedy sampling; a nice performance optimization.
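As an illustration of that idea, a hedged sketch (assumed helper, not the PR's actual `sample()` signature): argmax over every row, softmax only for the rows whose requests need probabilities.

```python
from typing import Optional

import torch


def greedy_with_selective_softmax(
        logits: torch.Tensor,
        softmax_indices: Optional[torch.Tensor] = None):
    """Greedy tokens for all rows; probabilities only for selected rows.

    `softmax_indices` is assumed to be a 1-D int32 index tensor (or None to
    skip the softmax entirely), mirroring the optimization described above.
    """
    next_tokens = torch.argmax(logits, dim=-1)
    softmax = None
    if softmax_indices is not None:
        # index_select accepts IntTensor indices, so int32 is fine here.
        selected = torch.index_select(logits, 0, softmax_indices)
        softmax = torch.softmax(selected, dim=-1)
    return next_tokens, softmax
```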
Lines 288-310: Fixed potential issue with padded draft tokens. Good catch! The comment at lines 291-292 correctly notes that `draft_probs` is not padded while `draft_tokens` may be padded by `ModelDrafter._pad_to_max_draft_tokens`. The implementation correctly handles this by slicing `draft_tokens[:num_draft_tokens]`.
Lines 422-477: Excellent implementation of the multi-arange utility. The `torch_multi_arange` function provides an efficient batched implementation with clear documentation and algorithm explanation. The use of `torch.repeat_interleave()` and `torch.cumsum()` to avoid loops is clever.
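For readers unfamiliar with the trick, a minimal sketch of the general technique (not the PR's `torch_multi_arange` code): concatenate `arange(start_i, end_i)` for many ranges without a Python loop.

```python
import torch


def multi_arange(starts: torch.Tensor, ends: torch.Tensor) -> torch.Tensor:
    """Concatenation of arange(starts[i], ends[i]) for all i, loop-free sketch.

    Assumes 1-D integer tensors with ends >= starts elementwise.
    """
    lengths = ends - starts
    # Which source range each output element belongs to.
    seg_ids = torch.repeat_interleave(
        torch.arange(starts.numel(), device=starts.device), lengths)
    # Exclusive prefix sum = offset of each range inside the output.
    offsets = torch.cumsum(lengths, dim=0) - lengths
    # Position within the owning range, shifted by that range's start value.
    pos = torch.arange(int(lengths.sum()), device=starts.device) - offsets[seg_ids]
    return starts[seg_ids] + pos


# multi_arange(torch.tensor([2, 10]), torch.tensor([4, 13]))
# -> tensor([ 2,  3, 10, 11, 12])
```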
Line 767: Good device consistency check. Adding the assertion ensures the generator is on the expected device, preventing subtle CUDA errors.
Lines 965-1050: Sophisticated embedding bias application. The `_apply_embedding_bias` implementation:
- Handles per-request bias tensors efficiently
- Uses smart batching to minimize memory operations
- Includes detailed comments explaining the trade-offs
- Correctly uses non-blocking transfers

The complexity is justified given the performance considerations.
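As an illustration of the batching pattern (a hedged sketch with assumed shapes and names, not the actual `_apply_embedding_bias`): stack the deduplicated bias tensors once, gather one bias row per biased logits row, and add in a single operation.

```python
import torch


def apply_embedding_bias_sketch(logits: torch.Tensor,
                                unique_biases: torch.Tensor,
                                row_bias_indices: torch.Tensor,
                                bias_row_mask: torch.Tensor) -> torch.Tensor:
    """Add per-row vocab biases to the logits rows selected by `bias_row_mask`.

    Assumed inputs (names are illustrative):
      logits:           [num_rows, vocab], on the sampling device
      unique_biases:    [num_unique, vocab], deduplicated bias tensors (pinned CPU)
      row_bias_indices: [num_biased_rows], index into unique_biases per biased row
      bias_row_mask:    [num_rows] bool, True where a bias applies
    """
    biases = unique_biases.to(logits.device, non_blocking=True)
    indices = row_bias_indices.to(logits.device, non_blocking=True)
    # One gathered bias row per biased logits row.
    gathered = torch.index_select(biases, 0, indices)
    logits[bias_row_mask] += gathered
    return logits
```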
Lines 1192-1286: Complex but necessary unbatching logic. The `_unbatch_sampling_results` method correctly handles:
- Dimension ordering verification
- Efficient scatter operations using custom indexers
- Proper handling of log probabilities with different tensor layouts
- Non-blocking D2H transfers for performance

The FIXMEs at lines 1256 and 1260 suggest potential optimizations for memory layout that could be addressed in future iterations.
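To illustrate the core scatter step in isolation, a simplified sketch (the flat destination indices are assumed to be precomputed by the custom indexers mentioned above):

```python
import torch


def unbatch_tokens_sketch(batched_tokens: torch.Tensor,
                          dest_flat_indices: torch.Tensor,
                          num_steps: int, num_slots: int) -> torch.Tensor:
    """Scatter batched sampling results into a [steps, slots] token buffer.

    `dest_flat_indices[i]` is assumed to give the flat output position of
    batched row i; dtype and layout handling are simplified on purpose.
    """
    out = batched_tokens.new_zeros(num_steps * num_slots)
    out.scatter_(0, dest_flat_indices.long(), batched_tokens)
    return out.view(num_steps, num_slots)
```

A non-blocking `.to("cpu", non_blocking=True)` copy of the result, synchronized before the host reads it, would correspond to the D2H transfer noted above.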
Lines 1344-1410: Main processing pipeline successfully refactored. The `_process_requests` method integrates all the batched sampling components effectively:
- Computes per-request steps
- Selects generated logits
- Applies embedding bias
- Performs batched sampling by strategy
- Unbatches results

This is the core of the performance improvements, replacing the old "fast_path" code.
Lines 1413-1415: Good optimization for greedy sampling. The override correctly avoids requesting draft probabilities for greedy sampling, which does not require rejection sampling.
Force-pushed from 2b209dc to 6e89d3b (Compare)
Force-pushed from 6e89d3b to f750bba (Compare)
/bot run
PR_Github #16682 [ run ] triggered by Bot
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/speculative/model_drafter.py (1)
Line 1: Add required NVIDIA copyright header. Per repo guidelines, prepend the NVIDIA copyright header (current year).

Apply:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 from __future__ import annotations
```
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
Lines 288-310: get_rejected_indices(): fix index dtypes and shape bug for the single-step case.
- Index tensors must be int64.
- Removing squeeze(0) avoids breaking 2D indexing when num_steps == 1.

Apply:

```diff
-    token_idx = torch.arange(num_draft_tokens,
-                             dtype=torch.int32,
-                             device=generator.device)
-    draft_tokens_cuda = torch.tensor(draft_tokens, dtype=torch.int32).to(
+    token_idx = torch.arange(num_draft_tokens,
+                             dtype=torch.long,
+                             device=generator.device)
+    draft_tokens_cuda = torch.tensor(draft_tokens, dtype=torch.long).to(
         device=generator.device, non_blocking=True)
     p = draft_probs[token_idx, draft_tokens_cuda]
-    q = target_probs.squeeze(0)[token_idx, draft_tokens_cuda]
+    q = target_probs[token_idx, draft_tokens_cuda]
```
Lines 763-777: Support multi-device by storing per-device RNG. Asserting a single device can break in multi-GPU contexts. Keep one deterministic generator per device.

Apply:

```diff
 def __init__(self, args: Args):
@@
-        self._generator = None
+        self._generators: dict[torch.device, torch.Generator] = {}
@@
-    def get_generator(self, device: torch.device) -> torch.Generator:
+    def get_generator(self, device: torch.device) -> torch.Generator:
@@
-        if self._generator is None:
-            # Fallback to a default seed if not set
-            self._generator = torch.Generator(device=device)
-            self._generator.manual_seed(self._global_seed)
-        assert self._generator.device == device
-        return self._generator
+        gen = self._generators.get(device)
+        if gen is None or gen.device != device:
+            gen = torch.Generator(device=device)
+            gen.manual_seed(self._global_seed)
+            self._generators[device] = gen
+        return gen
```
🧹 Nitpick comments (2)
tensorrt_llm/_torch/speculative/model_drafter.py (1)
Lines 315-319: Forwarding generation logits: guard against tensor-device mismatches. Assigning req.py_result.generation_logits directly to target_model_req.py_draft_logits is fine; ensure both tensors reside on the same device (expected CUDA). If PyResult ever stores CPU tensors (e.g., for CPU-only tests), consider moving to the target device or documenting the assumption.
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
Lines 860-898: Consistent attribute: use py_seq_slot instead of seq_slot. Elsewhere the code uses py_seq_slot; mixing both can cause subtle bugs if they ever diverge.

Apply:

```diff
-                new_tokens[i, request.seq_slot, self.BEAM] = new_token
+                new_tokens[i, request.py_seq_slot, self.BEAM] = new_token
@@
-            new_tokens[num_accepted, request.seq_slot, self.BEAM] = new_token
+            new_tokens[num_accepted, request.py_seq_slot, self.BEAM] = new_token
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
- tensorrt_llm/_torch/auto_deploy/llm_args.py (0 hunks)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (0 hunks)
- tensorrt_llm/_torch/modules/rms_norm.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (0 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (23 hunks)
- tensorrt_llm/_torch/speculative/eagle3.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (6 hunks)
- tensorrt_llm/llmapi/llm_args.py (0 hunks)
- tensorrt_llm/scaffolding/worker.py (0 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
💤 Files with no reviewable changes (5)
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/_torch/auto_deploy/llm_args.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/scaffolding/worker.py
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🚧 Files skipped from review as they are similar to previous changes (5)
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/speculative/eagle3.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/modules/rms_norm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/speculative/model_drafter.py
tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/speculative/model_drafter.py
tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/speculative/model_drafter.py (6)
tensorrt_llm/_torch/pyexecutor/sampler.py (20)
Sampler
(52-83)SampleState
(43-49)should_provide_draft_probs
(81-83)should_provide_draft_probs
(1433-1435)beam_width
(72-75)is_generation_model
(78-79)is_generation_model
(109-110)is_generation_model
(162-163)is_generation_model
(727-728)is_generation_model
(1469-1470)sample_async
(63-65)sample_async
(93-96)sample_async
(134-141)sample_async
(951-967)sample_async
(1616-1704)update_requests
(68-69)update_requests
(99-106)update_requests
(144-159)update_requests
(911-935)update_requests
(1708-1721)tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
ScheduledRequests
(18-39)all_requests
(38-39)tensorrt_llm/llmapi/llm_args.py (1)
DecodingBaseConfig
(349-404)tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
LlmRequest
(282-424)append
(79-98)append
(125-142)generation_logits
(214-223)tensorrt_llm/_torch/pyexecutor/handle_logits.py (1)
HandleLogits
(11-81)tensorrt_llm/_torch/speculative/mtp.py (2)
sample_async
(271-318)update_requests
(245-269)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
LlmRequest
(282-424)get_draft_token_length
(558-569)tensorrt_llm/_torch/speculative/mtp.py (1)
sample_async
(271-318)tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
ScheduledRequests
(18-39)all_requests
(38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py
1353-1353: Line too long (125 > 120)
(E501)
🔇 Additional comments (3)
tensorrt_llm/_torch/speculative/model_drafter.py (1)
Lines 271-282: Prefix-sum logic: confirm 1-logit accounting when context logits are disabled. You count 1 per context request when return_context_logits is False, to keep the last-token logit for potential generation logits. Please verify this matches the upstream model output layout for all decoders.
Would you like a quick script to assert that lengths of num_context_logits_prefix_sum align with actual logits slices during tests?
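For reference, a minimal sketch of the accounting in question (attribute names such as `context_len` are assumptions for illustration; only `py_return_context_logits` is taken from the code under review):

```python
from itertools import accumulate


def context_logits_prefix_sum_sketch(requests) -> list[int]:
    """Prefix sum over per-request context-logit counts (illustrative only).

    A request contributes one logit per context token when it asks for
    context logits, otherwise just one (the last-token logit kept for
    potential generation logits).
    """
    counts = [
        req.context_len if req.py_return_context_logits else 1
        for req in requests
    ]
    return [0, *accumulate(counts)]
```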
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
Lines 1433-1436: Good: draft probs suppressed for greedy. Skipping draft probabilities when greedy is selected avoids unnecessary softmax work.
Lines 166-173: Use LongTensor for indices and adjust type hints. Indexing tensors in PyTorch requires int64 (LongTensor). Using IntTensor/int32 for indices will raise runtime errors in advanced indexing and index_select.

Apply:

```diff
-def top_k_sampling_batch(
+def top_k_sampling_batch(
     logits,
     top_k=50,
     generator: Optional[torch.Generator] = None
 ) -> tuple[torch.Tensor, torch.Tensor]:
```

And below (see related comments), ensure all index tensors are int64; also change any annotations using torch.IntTensor to torch.Tensor and cast to long at use sites.
Likely an incorrect or invalid review comment.
PR_Github #16682 [ run ] completed with state
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)
Lines 230-253: Force deterministic greedy sampling and avoid numeric overflow in embedding bias. The test currently relies on default sampling behavior (temperature=None → C++ default 1.0f) and an extreme bias value (`torch.finfo(torch.float32).max`), which can lead to non-greedy sampling and potential numeric instabilities (infs or NaNs). To ensure determinism and stability across backends, please update the test as follows.

In tests/unittest/llmapi/test_llm_pytorch.py at test_embedding_bias_with_torch_sampler_strategies() (around line 243):
- Replace the use of `torch.finfo(torch.float32).max` with a large but finite bias (e.g. 1e6).
- Add `temperature=0.0` (to force greedy sampling) and `top_k=1` (to restrict choices to the top token).

Suggested diff:

```diff
 def test_embedding_bias_with_torch_sampler_strategies():
@@
-    vocab_size_padded = 32000
-    embedding_bias = torch.zeros(vocab_size_padded)
-    embedding_bias[biased_word_id] = torch.finfo(torch.float32).max
+    vocab_size_padded = 32000  # consider deriving from tokenizer/model if it varies
+    embedding_bias = torch.zeros(vocab_size_padded)
+    embedding_bias[biased_word_id] = 1e6  # large finite bias to avoid overflow
@@
-    sampling_kwargs = {
-        "max_tokens": 6,
-        "embedding_bias": embedding_bias,
-    }
+    sampling_kwargs = {
+        "max_tokens": 6,
+        "temperature": 0.0,  # force deterministic greedy sampling
+        "top_k": 1,  # restrict to the highest-scoring token
+        "embedding_bias": embedding_bias,
+    }
```

These changes will make the test behavior consistent and robust across different PyTorch builds and hardware.

tensorrt_llm/_torch/modules/rms_norm.py (1)

Lines 63-71: Inconsistent residual semantics between FLASHINFER and non-FLASH paths. Non-FLASH returns the updated residual (pre-norm sum), FLASHINFER returns the original residual. This diverges behavior and can corrupt downstream fused-add pipelines.

Apply:

```diff
 if IS_FLASHINFER_AVAILABLE:
     from ..custom_ops import (flashinfer_fused_add_rmsnorm,
                               flashinfer_rmsnorm)
-    if residual is not None:
-        flashinfer_fused_add_rmsnorm(hidden_states, residual,
-                                     self.weight, self.variance_epsilon)
+    if residual is not None:
+        # Preserve parity with non-FLASH path: return pre-norm sum as residual
+        input_dtype = hidden_states.dtype
+        _updated_residual = (hidden_states.to(torch.float32) +
+                             residual.to(torch.float32)).to(input_dtype)
+        flashinfer_fused_add_rmsnorm(hidden_states, residual,
+                                     self.weight, self.variance_epsilon)
+        residual = _updated_residual
     else:
         hidden_states = flashinfer_rmsnorm(hidden_states, self.weight,
                                            self.variance_epsilon)
```
🧹 Nitpick comments (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)
Lines 234-241: Consider deriving vocab_size_padded instead of hard-coding 32000. Hard-coding may break with model/tokenizer variants that pad differently.
I can propose a helper to compute padded size from tokenizer vocab and alignment used by the sampler if you confirm the pad multiple.
tensorrt_llm/_torch/modules/rms_norm.py (1)
Lines 75-78: Clarify intent: residual holds the pre-norm sum (not the original residual). This is correct for fused-add pipelines; consider a brief comment to avoid future regressions.

Apply:

```diff
-        if residual is not None:
+        if residual is not None:
+            # Update residual to the pre-norm sum(hidden_states, residual) for downstream fused-add
             hidden_states = hidden_states + residual.to(torch.float32)
             residual = hidden_states.to(input_dtype)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- tensorrt_llm/_torch/modules/rms_norm.py (5 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tests/unittest/llmapi/test_llm_pytorch.py
tensorrt_llm/_torch/modules/rms_norm.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tests/unittest/llmapi/test_llm_pytorch.py
tensorrt_llm/_torch/modules/rms_norm.py
🧬 Code graph analysis (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tests/unittest/llmapi/test_llm.py (1)
llm_test_harness
(109-136)
tensorrt_llm/_torch/modules/rms_norm.py (3)
tensorrt_llm/_torch/models/modeling_utils.py (1)
skip_forward
(148-159)tensorrt_llm/_torch/modules/layer_norm.py (1)
skip_forward
(102-120)tensorrt_llm/_torch/modules/decoder_layer.py (1)
skip_forward
(23-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/rms_norm.py (1)
Lines 63-71: In-place semantics confirmed for flashinfer_fused_add_rmsnorm. Verified that `flashinfer_fused_add_rmsnorm` is registered as a custom op with `mutates_args=("input", "residual")` and its Python signature returns `None`, meaning it mutates `hidden_states` (and `residual`) in-place. No changes required.
/bot run
PR_Github #16697 [ run ] triggered by Bot
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- tests/integration/test_lists/test-db/l0_dgx_b200.yml (2 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_dgx_h100.yml
tests/integration/test_lists/test-db/l0_dgx_b200.yml
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/integration/test_lists/test-db/l0_dgx_h100.yml
tests/integration/test_lists/test-db/l0_dgx_b200.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)
Lines 97-101: LGTM: standardized lowercase backend specifiers for GPTOSS post_merge. Consistent with the broader rename pattern in this PR.
tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)
Lines 154-159: LGTM: backend tag casing normalized (CUTLASS/TRITON → cutlass/triton). Matches the naming scheme adopted elsewhere; should reduce param-id drift across files.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/rms_norm.py (1)
Lines 64-73: Residual return semantics diverge between FLASHINFER and fallback paths. The fallback returns the pre-norm sum as `residual` (line 78), while the FLASHINFER branch returns the original input `residual` unchanged, breaking parity and likely confusing downstream consumers expecting the updated residual. Fix by computing the summed residual for return in the FLASHINFER path without altering the inputs to the fused op.

```diff
 if IS_FLASHINFER_AVAILABLE:
     from ..custom_ops import (flashinfer_fused_add_rmsnorm,
                               flashinfer_rmsnorm)
-    if residual is not None:
-        flashinfer_fused_add_rmsnorm(hidden_states, residual,
-                                     self.weight, self.variance_epsilon)
+    if residual is not None:
+        # Maintain the same return semantics as the non-FLASHINFER path:
+        # return the pre-norm sum as residual.
+        input_dtype = hidden_states.dtype
+        residual_out = (hidden_states.to(torch.float32) +
+                        residual.to(torch.float32)).to(input_dtype)
+        flashinfer_fused_add_rmsnorm(hidden_states,  # normalized in-place/return
+                                     residual,
+                                     self.weight,
+                                     self.variance_epsilon)
+        residual = residual_out
     else:
         hidden_states = flashinfer_rmsnorm(hidden_states, self.weight,
                                            self.variance_epsilon)
```
🧹 Nitpick comments (4)
tensorrt_llm/_torch/modules/rms_norm.py (4)
Lines 28-30: Make the sentinel definition clearer and immutable. Minor readability/typing tweak: define the alias before the constant and mark the sentinel `Final` so type-checkers treat it as immutable.

```diff
-from typing import Optional, Tuple, TypeAlias, Union, cast
+from typing import Final, Optional, Tuple, TypeAlias, Union, cast
@@
-    _ARGUMENT_NOT_SPECIFIED_SENTINEL = ...
-    _ArgumentNotSpecifiedSentinelType: TypeAlias = EllipsisType
+    _ArgumentNotSpecifiedSentinelType: TypeAlias = EllipsisType
+    _ARGUMENT_NOT_SPECIFIED_SENTINEL: Final[_ArgumentNotSpecifiedSentinelType] = ...
```

Lines 55-63: Streamline the "unspecified residual" logic. An equivalent but tighter boolean makes the intent obvious and avoids the extra flag mutation.

```diff
-        return_residual = True
-        if residual is self._ARGUMENT_NOT_SPECIFIED_SENTINEL:
-            return_residual = False
-            residual = None
+        return_residual = residual is not self._ARGUMENT_NOT_SPECIFIED_SENTINEL
+        if not return_residual:
+            residual = None
```

Lines 76-79: Avoid an extra allocation in the fallback add. You can do the add in-place on the FP32 buffer to reduce one temporary tensor.

```diff
-        if residual is not None:
-            hidden_states = hidden_states + residual.to(torch.float32)
-            residual = hidden_states.to(input_dtype)
+        if residual is not None:
+            hidden_states.add_(residual.to(torch.float32))
+            residual = hidden_states.to(input_dtype)
```

Note: ensure there are no autograd requirements or in-place aliasing concerns in this inference path.

Lines 110-116: Fix the mutable default and Optional types in the group_rms_norm signature. `weights` uses a mutable default `[]`, and `eps`/`weight_bias` are annotated `Optional` but never accept `None`. Clean this up to avoid surprises.

```diff
-def group_rms_norm(
-    inputs: list[torch.Tensor],
-    weights: Optional[list[torch.Tensor]] = [],
-    eps: Optional[float] = 1e-5,
-    weight_bias: Optional[float] = 0.0,
+def group_rms_norm(
+    inputs: list[torch.Tensor],
+    weights: Optional[list[torch.Tensor]] = None,
+    eps: float = 1e-5,
+    weight_bias: float = 0.0,
     kernel: GroupRMSNormKernelSelection = GroupRMSNormKernelSelection.heuristic,
     outputs: Optional[list[torch.Tensor]] = None) -> list[torch.Tensor]:
@@
-    out = outputs
+    out = outputs
     if out is None:
         out = [torch.empty_like(input) for input in inputs]
+    if weights is None:
+        weights = []
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- tensorrt_llm/_torch/modules/rms_norm.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/rms_norm.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/rms_norm.py
🧠 Learnings (3)
📚 Learning: 2025-08-27T14:41:56.613Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:96-99
Timestamp: 2025-08-27T14:41:56.613Z
Learning: In tensorrt_llm/_torch/modules/rms_norm.py, the RMSNorm class uses a custom sentinel (_ARGUMENT_NOT_SPECIFIED_SENTINEL) instead of Ellipsis (...) for detecting unspecified optional arguments. Other modules in the codebase may use Ellipsis as a sentinel but do not forward it to RMSNorm methods, so there's no need for backward compatibility with Ellipsis in RMSNorm.
Applied to files:
tensorrt_llm/_torch/modules/rms_norm.py
📚 Learning: 2025-08-27T14:23:55.545Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.545Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.
Applied to files:
tensorrt_llm/_torch/modules/rms_norm.py
📚 Learning: 2025-08-25T22:42:47.612Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.612Z
Learning: Applies to **/*.py : Code must target Python 3.8+
Applied to files:
tensorrt_llm/_torch/modules/rms_norm.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/modules/rms_norm.py (3)
tensorrt_llm/_torch/models/modeling_utils.py (1)
skip_forward
(148-159)tensorrt_llm/_torch/modules/layer_norm.py (1)
skip_forward
(102-120)tensorrt_llm/_torch/modules/decoder_layer.py (1)
skip_forward
(23-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/modules/rms_norm.py (3)
Lines 17-18: Python 3.10+ typing usage is fine; ensure packaging/CI reflect this floor. Using `TypeAlias` and `EllipsisType` is correct given the project's effective Python ≥3.10 policy. Please confirm `python_requires` and CI images align (≥3.10) to avoid drift.

Lines 85-88: Return contract looks good. A tuple is returned only when the caller explicitly provided a `residual` (including `None`), matching the sentinel contract.

Lines 93-101: skip_forward parity with sentinel handling is correct. Behavior mirrors `forward`: unspecified → single tensor; specified (even `None`) → tuple. Good.
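To illustrate the sentinel contract being verified here, a simplified stand-in module (not the actual RMSNorm; normalization is intentionally elided):

```python
import torch

_NOT_SPECIFIED = ...  # stand-in for the module's _ARGUMENT_NOT_SPECIFIED_SENTINEL


class SentinelNormSketch(torch.nn.Module):
    """Illustrates the return contract only; normalization is omitted."""

    def forward(self, hidden_states, residual=_NOT_SPECIFIED):
        out = hidden_states  # a real module would RMS-normalize here
        if residual is _NOT_SPECIFIED:
            return out  # caller did not pass residual -> single tensor
        if residual is not None:
            residual = hidden_states + residual  # pre-norm sum, as in forward()
        return out, residual  # caller passed residual (even None) -> tuple
```

So `y = norm(x)` yields a tensor, while `y, r = norm(x, residual=None)` yields a tuple; `skip_forward` mirrors the same contract.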
/bot run
PR_Github #16697 [ run ] completed with state
PR_Github #16712 [ run ] triggered by Bot
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
Lines 769-783: Make generators per-device; the current assert breaks multi-GPU. A single cached generator with a hard device-assert will fail when sampling across devices.

```diff
-        self._global_seed = 42
-        self._generator = None
+        self._global_seed = 42
+        self._generators: dict[torch.device, torch.Generator] = {}
@@
-    def get_generator(self, device: torch.device) -> torch.Generator:
+    def get_generator(self, device: torch.device) -> torch.Generator:
@@
-        if self._generator is None:
-            # Fallback to a default seed if not set
-            self._generator = torch.Generator(device=device)
-            self._generator.manual_seed(self._global_seed)
-        assert self._generator.device == device
-        return self._generator
+        gen = self._generators.get(device)
+        if gen is None:
+            gen = torch.Generator(device=device)
+            gen.manual_seed(self._global_seed)
+            self._generators[device] = gen
+        return gen
```

Lines 885-897: Use py_seq_slot consistently (fixes wrong writes with the binding seq_slot). TorchSampler elsewhere uses py_seq_slot; mixing in seq_slot can misroute writes.

```diff
-                new_tokens[i, request.seq_slot, self.BEAM] = new_token
+                new_tokens[i, request.py_seq_slot, self.BEAM] = new_token
@@
-            new_tokens[num_accepted, request.seq_slot, self.BEAM] = new_token
+            new_tokens[num_accepted, request.py_seq_slot, self.BEAM] = new_token
```

Also applies to: 899-904
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
Lines 374-397: Int32 softmax_indices and filtering behavior LGTM. The type hint and greedy softmax-skipping logic align with established TRT-LLM conventions.

Lines 1-2: Add required NVIDIA copyright header. Per repo guidelines, prepend the NVIDIA header.

Apply:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import dataclasses
 import enum
```

Lines 1018-1064: _apply_embedding_bias(): use id()-based dedup and Long indices for index_select. Tensor keys are unsafe/expensive; index_select requires Long indices.

```diff
-        # Indices of unique bias tensors
-        bias_to_index: dict[torch.Tensor,
-                            int] = defaultdict(provision_bias_index)
+        # Indices of unique bias tensors (group by identity, not contents)
+        bias_to_index: dict[int, int] = defaultdict(provision_bias_index)
+        biases: list[torch.Tensor] = []
@@
-            if req_bias is not None:
+            if req_bias is not None:
                 logits_bias_mask[i:(i + steps)] = True
-                req_bias_index = bias_to_index[req_bias]
+                key = id(req_bias)
+                req_bias_index = bias_to_index[key]
+                if req_bias_index == len(biases):
+                    biases.append(req_bias)
                 bias_gather_indices.extend(repeat(req_bias_index, steps))
@@
-        bias_gather_indices_cuda = torch.tensor(bias_gather_indices,
-                                                pin_memory=True,
-                                                dtype=torch.int32).to(
-                                                    logits.device,
-                                                    non_blocking=True)
+        bias_gather_indices_cuda = torch.tensor(
+            bias_gather_indices, pin_memory=True, dtype=torch.long
+        ).to(logits.device, non_blocking=True)
@@
-        biases_tensor = torch.empty((len(bias_to_index), *req_bias.shape),
-                                    pin_memory=True)
-        biases_tensor = torch.stack(
-            tuple(bias_to_index.keys()),
-            out=biases_tensor,
-        )
+        biases_tensor = torch.stack(
+            biases, out=torch.empty((len(biases), *req_bias.shape), pin_memory=True)
+        )
@@
-        biases_tensor_cuda = torch.index_select(biases_tensor_cuda, 0,
-                                                bias_gather_indices_cuda)
+        biases_tensor_cuda = torch.index_select(
+            biases_tensor_cuda, 0, bias_gather_indices_cuda
+        )
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
Lines 39-56: Confirm the minimum supported Python version (uses 3.10+ features). This file uses kw_only dataclasses, PEP 634 pattern matching (match/case), and PEP 604 unions (X | Y), which require Python 3.10+. The repo guideline says "Python 3.8+"; please confirm the runtime targets 3.10+ or adjust.
Would you like a codemod to replace match/case with if/elif and kw_only with runtime-safe alternatives if 3.8/3.9 must be supported?
Also applies to: 119-133, 374-397
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- tensorrt_llm/_torch/pyexecutor/sampler.py (24 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (9)
📚 Learning: 2025-08-25T22:42:47.612Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.612Z
Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:03:57.127Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.127Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:52:26.324Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:985-1064
Timestamp: 2025-08-27T14:52:26.324Z
Learning: In TensorRT-LLM's _apply_embedding_bias() method, tensor deduplication should use id() instead of the tensor itself as a dict key to avoid scanning tensor contents and for better performance.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:47:49.339Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1229-1238
Timestamp: 2025-08-27T14:47:49.339Z
Learning: torch.Tensor.dim_order is a public API in PyTorch that returns the dimension order of a tensor, documented at https://docs.pytorch.org/docs/stable/generated/torch.Tensor.dim_order.html. It supports an ambiguity_check parameter that can be a boolean or list of memory formats.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2): LlmRequest (282-424), get_draft_token_length (558-569)
tensorrt_llm/_torch/speculative/mtp.py (1): sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2): ScheduledRequests (18-39), all_requests (38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py
1359-1359: Line too long (125 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
1235-1244: dim_order() usage OK. Public API; assertions ensure canonical layout before view/scatter.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
885-897: Use py_seq_slot consistently; seq_slot may be different/undefined here.
Other TorchSampler paths use py_seq_slot. Mixing the two risks indexing the wrong slot or AttributeError.
Apply:
- new_tokens[i, request.seq_slot, self.BEAM] = new_token
+ new_tokens[i, request.py_seq_slot, self.BEAM] = new_token
@@
- new_tokens[num_accepted, request.seq_slot, self.BEAM] = new_token
+ new_tokens[num_accepted, request.py_seq_slot, self.BEAM] = new_token
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
1-2: Add required NVIDIA copyright header (current year).
Repository guideline requires the header on all source files.
Apply:
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
  import dataclasses
  import enum
🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/sampler.py (5)
184-191: Guard top-k against k > vocab_size.
Prevents runtime errors when configs overspecify k.
Apply:
- if top_k > 0:
-     values, indices = torch.topk(logits, top_k, dim=-1)
+ if top_k > 0:
+     top_k = min(top_k, vocab_size)
+     values, indices = torch.topk(logits, top_k, dim=-1)
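To see why the clamp matters outside the diff context, here is a small self-contained sketch (function and variable names are illustrative, not taken from sampler.py): torch.topk raises an error when k exceeds the size of the last dimension, so clamping k to the vocabulary size keeps over-specified configs from failing.

```python
from typing import Optional

import torch


def top_k_sample(logits: torch.Tensor, top_k: int,
                 generator: Optional[torch.Generator] = None) -> torch.Tensor:
    """Illustrative top-k sampling with the k <= vocab_size guard."""
    vocab_size = logits.size(-1)
    top_k = min(top_k, vocab_size)  # guard: torch.topk errors if k > vocab_size
    values, indices = torch.topk(logits, top_k, dim=-1)
    probs = torch.softmax(values, dim=-1)
    picked = torch.multinomial(probs, num_samples=1, generator=generator)
    return indices.gather(-1, picked).squeeze(-1)


logits = torch.randn(4, 32000)
tokens = top_k_sample(logits, top_k=50)  # shape: (4,)
```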
249-256: Same guard for top-k in top-k+top-p path.
Apply:
- if top_k > 0:
-     values, _ = torch.topk(logits, top_k, dim=-1)
+ if top_k > 0:
+     top_k = min(top_k, vocab_size)
+     values, _ = torch.topk(logits, top_k, dim=-1)
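For completeness, a sketch of how the clamped top-k filter typically composes with top-p (nucleus) filtering. This is a generic formulation that assumes masked logits are set to -inf; it is not the exact code in sampler.py.

```python
import torch


def top_k_top_p_filter(logits: torch.Tensor, top_k: int, top_p: float) -> torch.Tensor:
    """Mask logits outside the top-k set, then outside the top-p nucleus."""
    vocab_size = logits.size(-1)
    if top_k > 0:
        top_k = min(top_k, vocab_size)  # same clamp as suggested above
        kth_value = torch.topk(logits, top_k, dim=-1).values[..., -1:]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    if 0.0 < top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        probs = torch.softmax(sorted_logits, dim=-1)
        cumulative = probs.cumsum(dim=-1)
        # Drop a token if the mass of the tokens ranked before it already exceeds top_p.
        drop_sorted = (cumulative - probs) > top_p
        drop = torch.zeros_like(drop_sorted).scatter(-1, sorted_idx, drop_sorted)
        logits = logits.masked_fill(drop, float("-inf"))
    return logits


filtered = top_k_top_p_filter(torch.randn(2, 100), top_k=50, top_p=0.9)
```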
778-783: Make generator device-robust (multi-GPU).
Reinitialize when device changes instead of asserting.
Apply:
- if self._generator is None:
-     # Fallback to a default seed if not set
-     self._generator = torch.Generator(device=device)
-     self._generator.manual_seed(self._global_seed)
- assert self._generator.device == device
+ if self._generator is None or self._generator.device != device:
+     self._generator = torch.Generator(device=device)
+     self._generator.manual_seed(self._global_seed)
  return self._generator
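A minimal sketch of the suggested pattern, assuming a simple per-sampler cache (names are hypothetical): the generator is rebuilt whenever the requested device changes instead of asserting. Note that rebuilding also resets the RNG state to the seed, which may or may not be the desired semantics.

```python
import torch


class _GeneratorCache:
    """Illustrative: lazily create a seeded generator, rebuild it if the device changes."""

    def __init__(self, seed: int = 0):
        self._seed = seed
        self._generator = None

    def get(self, device) -> torch.Generator:
        device = torch.device(device)
        if self._generator is None or self._generator.device != device:
            self._generator = torch.Generator(device=device)
            self._generator.manual_seed(self._seed)
        return self._generator


cache = _GeneratorCache(seed=1234)
gen_cpu = cache.get("cpu")
sample = torch.rand(3, generator=gen_cpu)  # reproducible on CPU
```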
1026-1066: Prefer id()-based dedup for clarity; avoid relying on tensor keys.
Identity hashing of tensors is fine, but explicit id() clarifies intent and decouples from tensor-object key semantics.
Apply:
- bias_to_index: dict[torch.Tensor,
-                     int] = defaultdict(provision_bias_index)
+ bias_to_index: dict[int, int] = defaultdict(provision_bias_index)
+ biases: list[torch.Tensor] = []
@@
- if req_bias is not None:
+ if req_bias is not None:
      logits_bias_mask[i:(i + steps)] = True
-     req_bias_index = bias_to_index[req_bias]
+     key = id(req_bias)
+     req_bias_index = bias_to_index[key]
+     if req_bias_index == len(biases):
+         biases.append(req_bias)
      bias_gather_indices.extend(repeat(req_bias_index, steps))
@@
- biases_tensor = torch.empty((len(bias_to_index), *req_bias.shape),
-                             pin_memory=True)
- biases_tensor = torch.stack(
-     tuple(bias_to_index.keys()),
-     out=biases_tensor,
- )
+ biases_tensor = torch.empty((len(biases), *req_bias.shape), pin_memory=True)
+ biases_tensor = torch.stack(tuple(biases), out=biases_tensor)
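A small self-contained sketch of id()-based deduplication (helper name and shapes are illustrative): biases shared between requests map to a single row of the stacked tensor, and gather indices record which row each request uses. Keeping the list of unique tensors alive matters, since id() keys are only valid while the objects exist.

```python
import torch


def dedup_biases(per_request_biases):
    """Deduplicate (possibly shared) bias tensors by object identity and
    build gather indices mapping each request back to its unique bias."""
    id_to_index = {}    # id(tensor) -> row in the stacked bias tensor
    unique_biases = []  # kept alive so the id() keys stay valid
    gather_indices = []
    for bias in per_request_biases:
        key = id(bias)
        if key not in id_to_index:
            id_to_index[key] = len(unique_biases)
            unique_biases.append(bias)
        gather_indices.append(id_to_index[key])
    stacked = torch.stack(unique_biases)  # (num_unique, vocab)
    return stacked, torch.tensor(gather_indices)


shared = torch.randn(8)
other = torch.randn(8)
stacked, idx = dedup_biases([shared, other, shared])
assert stacked.shape == (2, 8) and idx.tolist() == [0, 1, 0]
```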
52-56: Annotate optional fields as Optional[...] for type correctness.
Current defaults are None but annotations are non-optional.
Apply:
  class SampleState:
      scheduled_requests: ScheduledRequests
-
-     device: SampleStateTensors = None
-     host: SampleStateTensors = None
-
-     sampler_event: torch.cuda.Event = None
+     device: Optional[SampleStateTensors] = None
+     host: Optional[SampleStateTensors] = None
+     sampler_event: Optional[torch.cuda.Event] = None
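A minimal illustration of the suggested annotation style, using a toy dataclass rather than the real SampleState:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class SampleStateSketch:
    """Illustrative only: Optional[...] matches the None defaults."""
    device: Optional[torch.Tensor] = None
    host: Optional[torch.Tensor] = None
    sampler_event: Optional[torch.cuda.Event] = None
```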
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/sampler.py
(24 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (9)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-25T22:42:47.612Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.612Z
Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:03:57.127Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.127Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:52:26.324Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:985-1064
Timestamp: 2025-08-27T14:52:26.324Z
Learning: In TensorRT-LLM's _apply_embedding_bias() method, tensor deduplication should use id() instead of the tensor itself as a dict key to avoid scanning tensor contents and for better performance.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:47:49.339Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1229-1238
Timestamp: 2025-08-27T14:47:49.339Z
Learning: torch.Tensor.dim_order is a public API in PyTorch that returns the dimension order of a tensor, documented at https://docs.pytorch.org/docs/stable/generated/torch.Tensor.dim_order.html. It supports an ambiguity_check parameter that can be a boolean or list of memory formats.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:37:38.228Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:991-1070
Timestamp: 2025-08-27T15:37:38.228Z
Learning: PyTorch tensors are hashable as dict keys because hash(torch.Tensor) is equivalent to id(torch.Tensor) and uses identity-based hashing, not content-based hashing. This means tensor hashing doesn't scan tensor contents and tensors can be safely used as dictionary keys.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2): LlmRequest (282-424), get_draft_token_length (558-569)
tensorrt_llm/_torch/speculative/mtp.py (1): sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2): ScheduledRequests (18-39), all_requests (38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py
1362-1362: Line too long (125 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
359-371: Int32 indexing usage is consistent with established sampler design.
Acknowledging prior discussion: keeping int32 for index tensors here is intentional and works in this codepath.
Also applies to: 282-292, 1188-1194, 1301-1305, 430-488, 504-523, 1238-1247, 1260-1267
1229-1247: Use of torch.Tensor.dim_order is valid.
Public API; the check is correct for canonical row-major ordering.
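For reference, a tiny illustration of the layout check on toy tensors (not the sampler's); dim_order() is available in recent PyTorch releases:

```python
import torch

x = torch.empty(4, 8, 16)          # contiguous, row-major
assert x.dim_order() == (0, 1, 2)  # canonical layout

y = x.transpose(0, 1)              # same storage, permuted strides
print(y.dim_order())               # (1, 0, 2) -> no longer canonical
```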
PR_Github #16712 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #16722 [ run ] triggered by Bot |
PR_Github #16722 [ run ] completed with state |
Looks good to me.
approved from qwen-dev side
Signed-off-by: ixlmar <[email protected]>
Head branch was pushed to by a user without write access
4c98e8b to b0981e2
/bot run |
PR_Github #19563 [ run ] triggered by Bot |
LGTM
LGTM on the llmapi changes.
LGTM
PR_Github #19563 [ run ] completed with state |
/bot run |
PR_Github #19638 [ run ] triggered by Bot |
/bot run |
PR_Github #19666 [ run ] triggered by Bot |
PR_Github #19638 [ run ] completed with state |
PR_Github #19666 [ run ] completed with state |
Description
This PR removes the enable_mixed_sampler option, including the "fast_path" code path controlled by it. Meanwhile, the changes aim at improving the performance of the general code path to avoid any significant performance degradation in the case of greedy sampling without drafting, previously covered by "fast_path".
Removing enable_mixed_sampler should also avoid non-intuitive behavior, as evidenced in Investigate TRTLLM runtime repetitive issue #5254.
However, the revised code is designed to also streamline request processing via batching for non-greedy sampling, as well as in the presence of draft tokens --- i.e., in situations previously not covered by "fast_path".
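To make the batching-by-strategy idea concrete, here is a rough, self-contained sketch. It is illustrative only: the function name, the tuple-based strategy descriptors, and the temperature-only non-greedy path are simplified stand-ins, not the actual TorchSampler implementation. It also shows why the greedy path can skip softmax entirely: argmax over logits selects the same token, since softmax is monotonic.

```python
from collections import defaultdict

import torch


def sample_batched_by_strategy(logits: torch.Tensor, strategies: list) -> torch.Tensor:
    """Group rows by sampling strategy and sample each group with one batched call.

    `strategies` holds one hashable descriptor per row, e.g. ("greedy",) or
    ("temperature", 0.8) -- simplified stand-ins for the real strategy keys.
    """
    groups = defaultdict(list)
    for row, strategy in enumerate(strategies):
        groups[strategy].append(row)

    next_tokens = torch.empty(logits.size(0), dtype=torch.long)
    for strategy, rows in groups.items():
        idx = torch.tensor(rows)
        group_logits = logits[idx]
        if strategy[0] == "greedy":
            # No softmax needed: argmax over logits equals argmax over probabilities.
            tokens = group_logits.argmax(dim=-1)
        else:
            _, temperature = strategy
            probs = torch.softmax(group_logits / temperature, dim=-1)
            tokens = torch.multinomial(probs, num_samples=1).squeeze(-1)
        next_tokens[idx] = tokens
    return next_tokens


logits = torch.randn(5, 100)
strategies = [("greedy",), ("temperature", 0.8), ("greedy",),
              ("temperature", 0.8), ("greedy",)]
print(sample_batched_by_strategy(logits, strategies))
```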
This PR opportunistically makes changes to vectorize TorchSampler._apply_embedding_bias.
This PR opportunistically makes changes to skip softmax computation when it is not needed (e.g., for greedy sampling).
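As a loose illustration of what a vectorized bias application can look like (the helper below and its argument layout are hypothetical, not the signature of TorchSampler._apply_embedding_bias): deduplicated biases are stacked once and added to all biased rows with a single indexed addition instead of a per-request Python loop.

```python
import torch


def apply_embedding_bias(logits: torch.Tensor,
                         bias_table: torch.Tensor,
                         bias_indices: torch.Tensor,
                         row_mask: torch.Tensor) -> torch.Tensor:
    """Add per-request vocab biases to the rows that have one, without a Python loop.

    bias_table:   (num_unique_biases, vocab) deduplicated biases
    bias_indices: (num_biased_rows,) maps each biased row to a bias_table row
    row_mask:     (num_rows,) bool, True where a bias applies
    """
    logits = logits.clone()
    logits[row_mask] += bias_table[bias_indices]
    return logits


logits = torch.zeros(4, 10)
bias_table = torch.stack([torch.full((10,), 2.0), torch.full((10,), -1.0)])
row_mask = torch.tensor([True, False, True, True])
bias_indices = torch.tensor([0, 0, 1])  # for rows 0, 2, 3 in mask order
out = apply_embedding_bias(logits, bias_table, bias_indices, row_mask)
```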
Add @torch.inference_mode() in TorchSampler.
Lastly, the PR includes some minor code improvements (type annotations etc.).
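For readers unfamiliar with the decorator, a minimal sketch of what wrapping sampling code in torch.inference_mode() looks like (the class below is a toy, not TorchSampler): it disables autograd tracking inside the method, which removes per-op bookkeeping overhead for pure inference work.

```python
import torch


class TinySampler:
    """Illustrative only: inference_mode disables autograd tracking for
    everything executed inside the decorated method."""

    @torch.inference_mode()
    def sample(self, logits: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(logits, dim=-1)
        return torch.multinomial(probs, num_samples=1).squeeze(-1)


tokens = TinySampler().sample(torch.randn(2, 50))
assert not tokens.requires_grad
```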
Test Coverage
enable_mixed_sampler option removal.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Refactor
Tests
Chores