draft: DO NOT MERGE #6172
Conversation
Signed-off-by: Jinyang Yuan <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
…ck. (#4264) Signed-off-by: Bo Li <[email protected]>
…4522) Signed-off-by: Zheng Duan <[email protected]>
…uting unit test (#5065) Signed-off-by: Christina Zhang <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]> Signed-off-by: Zhanrui Sun <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Dongxu Yang <[email protected]> Signed-off-by: ShiXiaowei02 <[email protected]> Signed-off-by: Enwei Zhu <[email protected]> Signed-off-by: Jun Yang <[email protected]> Co-authored-by: Dongxu Yang <[email protected]> Co-authored-by: ShiXiaowei02 <[email protected]> Co-authored-by: Jun Yang <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ruodil <[email protected]> Co-authored-by: Larry <[email protected]>
Signed-off-by: Jin Li <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: xinhe-nv <[email protected]>
Signed-off-by: Tyler Burt <[email protected]>
Signed-off-by: Po-Wei Wang (Vincent)
Signed-off-by: Zheng Duan <[email protected]>
Signed-off-by: Zheng Duan <[email protected]>
#4635) Signed-off-by: Hui Gao <[email protected]>
Signed-off-by: Netanel Haber <[email protected]>
…5106) Signed-off-by: Lucas Liebenwein <[email protected]>
…a) + extras ✨ (#5066) Signed-off-by: Venky Ganesh <[email protected]> Signed-off-by: Venky <[email protected]>
…4961) Signed-off-by: moraxu <[email protected]>
Signed-off-by: root <[email protected]> Co-authored-by: root <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
…d more cases for llama_v3.1/3.3 70b fp8 models (#5149) Signed-off-by: ruodil <[email protected]> Co-authored-by: Larry <[email protected]>
…4933) Signed-off-by: moraxu <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Fanrong Li <[email protected]>
Signed-off-by: yizhan <[email protected]>
Signed-off-by: ruodil <[email protected]>
Signed-off-by: Pengyun Lin <[email protected]>
Signed-off-by: Robin Kobus <[email protected]>
…o max attention window (#5874) Signed-off-by: Netanel Haber <[email protected]> Signed-off-by: Netanel Haber <[email protected]>
…on store_cubin error (#5865) Signed-off-by: Zhenhuan Chen <[email protected]>
Signed-off-by: Balaram Buddharaju <[email protected]>
Signed-off-by: Amir Klein <[email protected]>
… to 0.75 to prevent OOM on CI. (#5896) Signed-off-by: Bo Li <[email protected]>
Signed-off-by: nv-guomingz <[email protected]> Signed-off-by: Yingge He <[email protected]> Signed-off-by: Martin Marciniszyn Mehringer <[email protected]> Signed-off-by: Kaiyu Xie <[email protected]> Co-authored-by: nv-guomingz <[email protected]> Co-authored-by: Yingge He <[email protected]> Co-authored-by: Martin Marciniszyn Mehringer <[email protected]> Co-authored-by: Kaiyu Xie <[email protected]> Co-authored-by: zpatel <[email protected]>
Signed-off-by: Superjomn <[email protected]>
Signed-off-by: Nikita Korobov <[email protected]>
Signed-off-by: zhengd-nv <[email protected]>
…5947) Signed-off-by: Fanrong Li <[email protected]>
Signed-off-by: Yi Zhang <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]> Signed-off-by: Iman Tabrizian <[email protected]>
#6039) Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]> Co-authored-by: Iman Tabrizian <[email protected]>
Signed-off-by: junq <[email protected]> Signed-off-by: Sharan Chetlur <[email protected]> Signed-off-by: QI JUN <[email protected]> Co-authored-by: Sharan Chetlur <[email protected]> Co-authored-by: Yanchao Lu <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Yanchao Lu <[email protected]> Signed-off-by: qqiao <[email protected]> Co-authored-by: Emma Qiao <[email protected]>
Signed-off-by: bhsueh <[email protected]>
Walkthrough

This update introduces major enhancements and refactoring across the batch manager, executor, runtime, and kernel components. Key changes include support for a new NIXL communication backend in cache transceivers, extensive interface and buffer management updates for multimodal and MLA (Multi-head Latent Attention) features, new CUDA kernel implementations, and significant improvements in MoE benchmarking infrastructure. Build configurations, documentation, and test utilities are also updated to support these new features and maintain compatibility.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Executor
    participant BatchManager
    participant CacheTransceiver
    participant CacheFormatter
    participant BufferManager
    participant ConnectionManager
    User->>Executor: Submit multimodal/generation request
    Executor->>BatchManager: Prepare request (multimodal, draft, MLA, etc.)
    BatchManager->>CacheTransceiver: Initiate cache communication (MPI/UCX/NIXL)
    CacheTransceiver->>BufferManager: Allocate send/recv buffers (optionally FabricMemory)
    CacheTransceiver->>ConnectionManager: Establish connections (including NIXL)
    CacheTransceiver->>CacheFormatter: Format cache blocks (duplication, MLA, etc.)
    CacheFormatter-->>CacheTransceiver: Provide block ranges/connections
    CacheTransceiver->>CacheTransceiver: Send/receive cache blocks
    CacheTransceiver->>BatchManager: Notify completion
    BatchManager->>Executor: Return results
    Executor->>User: Deliver output (with multimodal/MLA support)
```
Actionable comments posted: 32
🔭 Outside diff range comments (1)
cpp/kernels/fmha_v2/src/fmha/utils.h (1)
Lines 2889-2905: Replace integer PTX max with floating-point max for correctness

Using `max.s16x2` on reinterpreted float bits can mix halves from different operands and yield invalid results. Switch to the proper FP max instructions (or the `fmaxf` intrinsic) to guarantee correct behavior:

• File: cpp/kernels/fmha_v2/src/fmha/utils.h
• Lines: 2897–2903

```diff
-    int32_t tmp;
-    asm volatile("max.s16x2 %0, %1, %2;\n" : "=r"(tmp) : "r"(a_), "r"(b_));
-    asm volatile("max.s16x2 %0, %0, %1;\n" : "+r"(tmp) : "r"(tmp), "r"(c_));
-    res = reinterpret_cast<float const&>(tmp);
+    float tmp_fp;
+    asm volatile("max.f32 %0, %1, %2;\n" : "=f"(tmp_fp) : "f"(a), "f"(b));
+    asm volatile("max.f32 %0, %0, %1;\n" : "+f"(tmp_fp) : "f"(c));
+    res = tmp_fp;
```

Alternatively, drop the inline PTX and simply do:

```cpp
res = fmaxf(a, fmaxf(b, c));
```

This ensures correct IEEE-754 behavior across all positive float inputs.
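To make the failure mode concrete, here is a small host-side C++ sketch (an illustration only, not the kernel code) that simulates the per-halfword semantics of `max.s16x2` on two positive floats. The result mixes the high half of one operand with the low half of the other and lands below the true maximum. It assumes a little-endian host, and the constants are chosen purely for demonstration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Host-side simulation of PTX max.s16x2: per-halfword signed 16-bit max.
// Assumes a little-endian host (x86/ARM), so the low two bytes are the low halfword.
static uint32_t max_s16x2(uint32_t a, uint32_t b)
{
    int16_t alo, ahi, blo, bhi;
    std::memcpy(&alo, &a, 2);
    std::memcpy(&ahi, reinterpret_cast<char const*>(&a) + 2, 2);
    std::memcpy(&blo, &b, 2);
    std::memcpy(&bhi, reinterpret_cast<char const*>(&b) + 2, 2);
    uint16_t const rlo = static_cast<uint16_t>(std::max(alo, blo));
    uint16_t const rhi = static_cast<uint16_t>(std::max(ahi, bhi));
    return static_cast<uint32_t>(rlo) | (static_cast<uint32_t>(rhi) << 16);
}

static float bitsToFloat(uint32_t u)
{
    float f;
    std::memcpy(&f, &u, 4);
    return f;
}

int main()
{
    // Two positive floats whose 16-bit halves "cross": the per-half max takes the
    // high half of a and the low half of b, producing a value below the true max.
    uint32_t const a = 0x3F80FFFFu; // ~1.0078124
    uint32_t const b = 0x3F7F0001u; // ~0.9960939
    uint32_t const r = max_s16x2(a, b);
    std::printf("a = %.7f, b = %.7f\n", bitsToFloat(a), bitsToFloat(b));
    std::printf("simulated max.s16x2 = %.7f (bits 0x%08X)\n", bitsToFloat(r), r);
    std::printf("fmaxf               = %.7f\n", std::fmax(bitsToFloat(a), bitsToFloat(b)));
    return 0;
}
```

Running it prints a simulated maximum of about 1.0000001 versus the correct 1.0078124, which is exactly the kind of silent error the suggested `fmaxf` (or `max.f32`) fix avoids.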
🧹 Nitpick comments (41)
.github/workflows/blossom-ci.yml (1)
Lines 42-44: Deduplicate and externalise the author list

"ixlmar" now appears twice in the hard-coded JSON array. Beyond duplicates, this single ultra-long line is becoming unmanageable. Consider moving the list into:

- A YAML sequence (multi-line) and using `join()`
- Or a separate JSON file checked into the repo and read with `cat`/`jq`

to keep the workflow readable and reduce merge-conflict risk.
README.md (1)
Lines 20-33: Heading style violates project markdown rules

`## Tech Blogs` and `## Latest News` are ATX headings, while the rest of the file uses Setext style (`===`). To silence markdownlint (MD003) and keep consistency, convert these two headings to Setext or disable the rule locally:

```diff
-## Tech Blogs
+Tech Blogs
+----------
```

Same for Latest News.
constraints.txt (1)
Lines 6-9: Good proactive pinning – add a "why not upper-bounds" note

The lower-bound WAR pins are clear, but without an upper bound the environment may pull incompatible major versions in the future. Add a brief comment explaining that only minimum versions are intended and are expected to stay compatible with the base image.

```diff
-# WAR against … tornado>=6.5.0
+# WAR against … tornado>=6.5.0  # no upper-bound; rely on image tests
```

cpp/kernels/fmha_v2/README.md (1)
Line 5: Fix punctuation as suggested by static analysis.

The static analysis tool correctly identified a missing period after "etc." in American English.

```diff
-FMHA_v2 is just a bunch of Multi-head Attention kernels that we've enabled for known cases. It's not built as a library (cuBLAS, cuDNN, HazyResearch's MHA, etc) that is supposed to deliver good perf for all cases. End users will get access to FMHA through products or libraries, not directly through FMHA_v2.
+FMHA_v2 is just a bunch of Multi-head Attention kernels that we've enabled for known cases. It's not built as a library (cuBLAS, cuDNN, HazyResearch's MHA, etc.) that is supposed to deliver good perf for all cases. End users will get access to FMHA through products or libraries, not directly through FMHA_v2.
```

CONTRIBUTING.md (1)
Line 666: Fix markdown linting issue with bare URL.

The static analysis tool flagged a bare URL that should be properly formatted.

```diff
-    "Allowed values are: fp8, fp4, wfp4afp8, int4, int8, float, half, bfloat16\n"
+    "Allowed values are: fp8, fp4, wfp4afp8, int4, int8, float, half, bfloat16\\n"
```

cpp/micro_benchmarks/CMakeLists.txt (1)
Line 56: Fix typo in comment.

```diff
-# Temporary opend-sourced version. Will be daleted when open-sourced moe_gemm
+# Temporary open-sourced version. Will be deleted when open-sourced moe_gemm
```

.github/workflows/label_community_pr.yml (1)
Line 28: Fix grammatical error in label text.

The label text "Community want to contribute" has a grammatical error. It should be "Community wants to contribute" or "Community contribution".

```diff
-  COMMUNITY_LABEL: "Community want to contribute"
+  COMMUNITY_LABEL: "Community wants to contribute"
```

cpp/libnuma_conan.py (2)
Lines 24-38: Good architecture-specific library path handling.

The multi-candidate approach for library paths is excellent, covering RHEL/CentOS, Ubuntu/Debian, and various architectures. However, consider adding more architecture variants and improving the fallback logic:

```diff
-        elif arch in ["armv8", "aarch64"]:
+        elif arch in ["armv8", "aarch64", "armv8-a"]:
             # Debian/Ubuntu aarch64
             lib_candidates.append("/usr/lib/aarch64-linux-gnu/libnuma.so")
+        elif arch.startswith("arm"):
+            # Generic ARM fallback
+            lib_candidates.append("/usr/lib/arm-linux-gnueabihf/libnuma.so")
         else:
             self.output.info(
-                f"Unrecognized architecture: {arch}, falling back to /usr/lib/libnuma.so"
+                f"Unrecognized architecture: {arch}, trying common paths"
             )
             lib_candidates.append("/usr/lib/libnuma.so")
+            lib_candidates.append("/usr/local/lib/libnuma.so")
```

Lines 39-46: Enhance error handling with more descriptive messages.

The error handling is good, but could provide more helpful information for debugging.

```diff
         else:
-            raise ConanInvalidConfiguration(
-                "libnuma.so not found on system")
+            searched_paths = ", ".join(lib_candidates)
+            raise ConanInvalidConfiguration(
+                f"libnuma.so not found on system. Searched paths: {searched_paths}. "
+                f"Please install libnuma-dev or numactl-devel package.")
```

cpp/kernels/xqa/test/warmup.cu (1)
Lines 7-11: Potential issue with clock comparison logic

The while loop condition `tic + cycles < clock64()` could be problematic if clock values wrap around or if there are precision issues. Consider using a more robust comparison.

```diff
-    uint64_t const tic = clock64();
-    while (tic + cycles < clock64())
+    uint64_t const tic = clock64();
+    uint64_t elapsed = 0;
+    while (elapsed < cycles)
     {
+        elapsed = clock64() - tic;
     }
```

.github/scripts/label_community_user.py (3)
Lines 26-27: Consider more robust rate limiting

The fixed 0.5-second sleep might not be sufficient during high API usage. Consider implementing exponential backoff or checking rate limit headers from the API response.

```diff
-    time.sleep(0.5)
     response = requests.get(url, headers=HEADERS)
+
+    # Check rate limit headers
+    if 'X-RateLimit-Remaining' in response.headers:
+        remaining = int(response.headers['X-RateLimit-Remaining'])
+        if remaining < 10:
+            reset_time = int(response.headers.get('X-RateLimit-Reset', 0))
+            sleep_time = max(reset_time - time.time() + 1, 1)
+            time.sleep(sleep_time)
+    else:
+        time.sleep(0.5)
```

Lines 36-39: Fix line length to comply with style guide

Line 38 exceeds the 120-character limit. Split the error message for better readability.

```diff
         raise RuntimeError(
             f"Forbidden (403) when fetching members for 'NVIDIA'. "
-            f"This may be due to insufficient token permissions or rate limits. Details: {error_message}. Cannot fetch members."
+            f"This may be due to insufficient token permissions or rate limits. "
+            f"Details: {error_message}. Cannot fetch members."
         )
```

Lines 54-56: Consider more specific exception handling

The generic `Exception` catch might mask unexpected errors. Consider catching specific exceptions to better handle different failure scenarios.

```diff
-    except Exception as e:
+    except requests.exceptions.RequestException as e:
         print(f"Error fetching NVIDIA members: {e}")
         return []
+    except (KeyError, ValueError) as e:
+        print(f"Error parsing API response: {e}")
+        return []
```

benchmarks/cpp/utils/prepare_synthetic_data.py (1)
Lines 10-32: Good refactoring to consolidate duplicate logic.

The helper function effectively consolidates task ID generation and LoRA configuration logic. The warning for inconsistent argument usage is helpful for users. Consider adding type hints to improve code clarity:

```diff
-def _generate_task_ids_and_lora_config(root_args, num_reqs):
+def _generate_task_ids_and_lora_config(root_args, num_reqs: int) -> tuple[list[int], list[int] | None, dict[str, str] | None]:
```

cpp/kernels/xqa/mla_sm120.cuh (1)
Lines 78-89: Consider documenting the hash function's purpose and limitations

The `hashRegData` function uses a simple XOR hash. While functional, it has limitations:

- XOR of all elements means different inputs can produce same hash
- Only works for 4-byte types as enforced by static_assert

Consider adding a comment explaining the hash function's purpose and that it's not cryptographically secure:

```diff
+// Simple XOR-based hash for debugging/verification purposes.
+// Not suitable for cryptographic use or collision-resistant hashing.
 template <typename T, uint32_t n>
 __device__ inline uint32_t hashRegData(Vec<T, n> const& data)
```
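For the first limitation, a minimal host-side C++ sketch of an XOR-fold over 4-byte words (an assumed stand-in for what `hashRegData` does, not the device code) shows how easily collisions arise: any permutation of the same words, and any extra pair of equal words, leaves the hash unchanged.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical host-side stand-in for an XOR-fold hash over 4-byte words.
static uint32_t xorFoldHash(std::vector<uint32_t> const& words)
{
    uint32_t h = 0;
    for (uint32_t w : words)
    {
        h ^= w; // XOR is order-insensitive, so permutations collide
    }
    return h;
}

int main()
{
    std::vector<uint32_t> a{0x1111, 0x2222, 0x4444};
    std::vector<uint32_t> b{0x4444, 0x1111, 0x2222};                 // permutation of a
    std::vector<uint32_t> c{0x1111, 0x2222, 0x4444, 0xABCD, 0xABCD}; // extra equal pair cancels out
    std::printf("%08x %08x %08x\n", xorFoldHash(a), xorFoldHash(b), xorFoldHash(c));
    return 0;
}
```

All three vectors print the same value (0x00007777), which is fine for the debugging/verification purpose suggested above but underlines that the hash is not collision-resistant.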
cpp/include/tensorrt_llm/deep_gemm/jit_utils.cuh (1)

Line 65: Consider adding parameter documentation.

The new `swap_ab` parameter is added to both functions but lacks documentation. Consider adding doxygen-style comments to explain:

- What `swap_ab` means (swapping A and B matrix order/layout)
- How it affects the computation
- When it should be used

Also applies to: line 68
cpp/include/tensorrt_llm/deep_gemm/compiler.cuh (1)
Lines 247-254: Consider documenting swapAB limitations.

The swapAB mode only supports `Normal` and `GroupedWithOffset` gemm types. Consider adding a comment explaining why other types are unsupported or plan for future support.

```diff
     else
     {
+        // SwapAB mode currently supports only Normal and GroupedWithOffset types
+        // due to specific scheduling requirements for transposed operations
         switch (gemm_type)
         {
         case deep_gemm::GemmType::Normal: input_type = "NormalSchedulerInputSwapAB"; break;
         case deep_gemm::GemmType::GroupedWithOffset: input_type = "GroupedWithOffsetSchedulerInputSwapAB"; break;
         default: throw std::runtime_error("Unsupported gemm type");
         }
     }
```

cpp/include/tensorrt_llm/deep_gemm/fp8_gemm_impl.cuh (1)
Lines 764-772: Consider caching scale computations.

The scale computations in the inner loop could benefit from pre-computation outside the WGMMA section if the scales don't change frequently.

```diff
     // Each thread reads consecutive two b scales, each thread needs to read WGMMA::N / 4 * 2 b
     // scales
     float scale_0_0[WGMMA::kNumAccum / 4], scale_0_1[WGMMA::kNumAccum / 4];
+    // Pre-compute scales to reduce redundant operations
 #pragma unroll
     for (int i = 0; i < WGMMA::kNumAccum / 4; ++i)
     {
         float2 scale_b = ld_shared(reinterpret_cast<const float2*>(smem_scales_b[s] + i * 8 + scale_offset));
         scale_0_0[i] = scale_a_0 * scale_b.x;
         scale_0_1[i] = scale_a_0 * scale_b.y;
     }
```
cpp/kernels/fmha_v2/setup.py (1)
Lines 3053-3059: Consider making the cubin usage conditions configurable

The hardcoded conditions for using cubin headers (sm90 with head_size 128 and sm89 with e4m3 dtype) should ideally be configurable through environment variables or a configuration file. This would make it easier to modify the behavior without changing the source code. Consider adding configuration options:

```diff
 def use_cubin_header(sm, head_size, dtype):
+    # Check if there's an override in environment
+    if os.environ.get('FMHA_FORCE_CUBIN_HEADER'):
+        return os.environ['FMHA_FORCE_CUBIN_HEADER'].lower() == 'true'
+
     return (sm == 90 and head_size == 128) or (sm == 89 and 'e4m3' in dtype)
```

cpp/include/tensorrt_llm/executor/executor.h (1)
Lines 285-304: Consider performance optimizations for MultimodalInput class

The implementation is functionally correct, but consider these performance improvements:

- Pass constructor parameters by const reference to avoid unnecessary copies
- Return const references from getter methods to avoid copying vectors

```diff
-    explicit MultimodalInput(std::vector<std::vector<SizeType32>> multimodalHashes,
-        std::vector<SizeType32> multimodalPositions, std::vector<SizeType32> multimodalLengths);
+    explicit MultimodalInput(std::vector<std::vector<SizeType32>> const& multimodalHashes,
+        std::vector<SizeType32> const& multimodalPositions, std::vector<SizeType32> const& multimodalLengths);

-    [[nodiscard]] std::vector<std::vector<SizeType32>> getMultimodalHashes() const;
-    [[nodiscard]] std::vector<SizeType32> getMultimodalPositions() const;
-    [[nodiscard]] std::vector<SizeType32> getMultimodalLengths() const;
+    [[nodiscard]] std::vector<std::vector<SizeType32>> const& getMultimodalHashes() const;
+    [[nodiscard]] std::vector<SizeType32> const& getMultimodalPositions() const;
+    [[nodiscard]] std::vector<SizeType32> const& getMultimodalLengths() const;
```

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
2259-2267
: Simplify multimodal parameter initializationThe ternary operator chains for converting optional values to shared_ptrs reduce readability. Consider extracting these conversions to helper functions or local variables.
- multimodalHashes.has_value() - ? std::make_shared<std::vector<std::vector<SizeType32>>>(std::move(multimodalHashes.value())) - : std::optional<std::shared_ptr<std::vector<std::vector<SizeType32>>>>(std::nullopt), - multimodalPositions.has_value() - ? std::make_shared<std::vector<SizeType32>>(std::move(multimodalPositions.value())) - : std::optional<std::shared_ptr<std::vector<SizeType32>>>(std::nullopt), - multimodalLengths.has_value() - ? std::make_shared<std::vector<SizeType32>>(std::move(multimodalLengths.value())) - : std::optional<std::shared_ptr<std::vector<SizeType32>>>(std::nullopt), + convertToSharedPtr(std::move(multimodalHashes)), + convertToSharedPtr(std::move(multimodalPositions)), + convertToSharedPtr(std::move(multimodalLengths)),Where
convertToSharedPtr
is a helper template function that handles the conversion.cpp/include/tensorrt_llm/deep_gemm/fp8_gemm.cuh (1)
291-333
: Well-implemented SwapAB variant with clear parameter documentationThe function correctly implements the swapped semantics with N-dimension offsets. The inline comments (/* weight*/, /* act*/) help clarify the swapped parameter roles.
Consider adding similar clarifying comments to the other SwapAB functions for consistency.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncherOss.cu (2)
1-19
: Temporary file notice needs clarification.The comment indicates this is a "temporary open-sourced version" that will be deleted when "open-sourced moe_gemm support MXFP4". This should be tracked properly to ensure cleanup happens.
Would you like me to create an issue to track the removal of this temporary file once MXFP4 support is added to the open-sourced moe_gemm?
294-487
: Consider refactoring this large function for better maintainability.The
argGenLoadFile
function is over 190 lines and handles multiple responsibilities including JSON parsing, routing configuration management, data type filtering, and benchmark argument generation. Consider breaking it into smaller, focused functions.Split into functions like:
parseRoutingConfig()
- Handle routing configuration parsingfilterByDataType()
- Handle data type filtering logicparseTactics()
- Handle tactic ID parsinggenerateBenchmarkArgs()
- Handle final argument generationcpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixtureOss.h (3)
44-44
: Fix typos in comment.-// Temporary opend-sourced version. Will be daleted when open-sourced moe_gemm support MXFP4 +// Temporary open-sourced version. Will be deleted when open-sourced moe_gemm supports MXFP4
133-133
: Consider making random seeds configurable.Fixed seeds (0xD5) make benchmarks reproducible but may not represent real-world variance. Consider making seeds configurable.
struct RandomDistributionRoutingConfig : public RoutingConfig { using ElementType = float; - std::mt19937_64 twister{0xD5}; + std::mt19937_64 twister; // ... other members ... RandomDistributionRoutingConfig(std::vector<ElementType> const& in_probabilities, int64_t num_experts, int64_t k, - std::string name = "random_distribution") + std::string name = "random_distribution", uint64_t seed = 0xD5) - : probabilities(std::move(in_probabilities)) + : twister(seed) + , probabilities(std::move(in_probabilities))Also applies to: 213-213
796-805
: Fix incomplete label string construction.The stringstream builds a string with all parameters but only uses the routing config name as the label.
Either use the full string or remove the unused construction:
- std::stringstream ss; - ss << "Experts,K,Hidden,Inter,TP,EP,Rank,Tokens,Bias,Scale,Actfn,Tactic,Routing="; - for (auto v : {num_experts, top_k, hidden_size, inter_size, tp_size, ep_size, world_rank, num_tokens, - (int) mUseBias, (int) mUseFinalScale, (int) mActType, tactic_idx1, tactic_idx2}) - { - ss << v << ","; - } - ss << routingConfigCache.at(routing_config)->getName(); - // state.SetLabel(ss.str()); state.SetLabel(routingConfigCache.at(routing_config)->getName());cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (5)
73-73
: Remove commented out code.- // int selfTpRank = selfIdx % selfConfig.getParallelConfig().mTensorParallelism; auto targetInfo = executor::kv_cache::targetIRanks(destConfig, selfConfig, selfIdx);
228-228
: Fix typo in variable name.- auto* agentConnnecion = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]); + auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
516-516
: Fix typo in variable name.- auto* agentConnnecion + auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(pickUpConnections[0]);
516-527
: Simplify cacheBufferId assignment logic.The conditional assignment makes the code harder to follow.
- auto* agentConnnecion - = dynamic_cast<executor::kv_cache::AgentConnection const*>(pickUpConnections[0]); - if (agentConnnecion != nullptr) - { - cacheBufferId = agentConnnecion->getCacheBufferId(); - TLLM_CHECK(cacheBufferId.has_value()); - } - else - { - cacheBufferId = mCacheTransBufferManager->assignBufferIndexForRecv(); - } - TLLM_CHECK(cacheBufferId.has_value()); + auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(pickUpConnections[0]); + cacheBufferId = agentConnection != nullptr + ? agentConnection->getCacheBufferId() + : mCacheTransBufferManager->assignBufferIndexForRecv(); + TLLM_CHECK(cacheBufferId.has_value(), "Failed to obtain cache buffer ID");
688-754
: Optimize validation order for early exit.Check simpler conditions first for better performance.
Reorder checks from cheapest to most expensive:
[[nodiscard]] bool CacheFormatter::inquireSupport(CacheState const& selfConfig, CacheState const& destConfig) const { // Check simple conditions first if (selfConfig.getDataType() != destConfig.getDataType()) { TLLM_LOG_WARNING("CacheFormatter::inquireSupport: selfConfig.getDataType() != destConfig.getDataType()"); return false; } if (selfConfig.getAttentionConfig().mAttentionType != destConfig.getAttentionConfig().mAttentionType) { TLLM_LOG_WARNING("CacheFormatter::inquireSupport: only support same attention type"); return false; } // Then check MLA early since it's not supported if (selfConfig.getAttentionConfig().mAttentionType == CacheState::AttentionType::kMLA) { TLLM_LOG_WARNING("CacheFormatter::inquireSupport: only support non-MLA"); return false; } // More expensive checks last...cpp/include/tensorrt_llm/executor/transferAgent.h (1)
Lines 235-236: Enhance error message with library name.

Include the library name in the error message for better debugging.

```diff
-    void* funcPtr = dlSym(handle, funcName.c_str());
-    TLLM_CHECK_WITH_INFO(funcPtr, funcName + " function is not open correctly.");
+    void* funcPtr = dlSym(handle, funcName.c_str());
+    TLLM_CHECK_WITH_INFO(funcPtr, "Failed to load function '" + funcName + "' from library '" + libName + "'");
```

cpp/kernels/xqa/utils.cuh (2)
927-985
: Well-implemented cluster synchronization utilities.The cluster query functions and synchronization primitives are correctly implemented using appropriate PTX instructions.
Consider adding brief documentation for the
mapa
function to explain its purpose:/** * Maps a pointer from the current CTA's address space to another CTA's address space within the cluster. * This enables direct memory access across CTAs in a cluster. */ template <typename T> __device__ inline T* mapa(T* src, uint32_t clusterCtaRank)
1033-1058
: Timer class is useful but be cautious with printf in kernels.The Timer implementation is clean, but be aware that
printf
in CUDA kernels can significantly impact performance and should only be used for debugging.Consider adding a compile-time flag to disable timer output in production:
#ifdef ENABLE_KERNEL_PROFILING printf("%s: %u (block={%u, %u, %u})\n", name, toc - mTic, blockIdx.x, blockIdx.y, blockIdx.z); #endifcpp/include/tensorrt_llm/deep_gemm/scheduler.cuh (1)
739-757
: Consider documenting why only Normal and GroupedWithOffset are supported.The
SchedulerSelectorSwapAB
correctly implements selection for the two supported types with appropriate compile-time validation. However, it would be helpful to add a comment explaining whyStridedBatched
,GroupedContiguous
, andGroupedMasked
types are not supported for SwapAB operations.cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (1)
74-80
: Consider adding documentation for the complex return tuple.The expanded
operator()
signature appropriately handles the increased complexity with clear parameter types. However, the 4-element return tuple would benefit from documentation explaining what each component represents.Add a brief comment above the method documenting the return value components:
// Returns: tuple of <tensorPtr, samplingConfigs, tensorPtrs, lookaheadConfigs>
cpp/kernels/xqa/mla_sm120.cu (3)
245-273
: Consider documenting the ping-pong synchronization pattern.The
PingPongMutex
implementation is clever and correct, but the synchronization pattern would benefit from documentation explaining how the two barriers alternate to ensure mutual exclusion between groups.Add a comment explaining the pattern:
// PingPongMutex uses two barriers alternating between groups to ensure // only one group uses tensor cores at a time while the other prepares data
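As a rough host-side analogy of the pattern described above (not the actual CUDA mbarrier implementation), two binary semaphores can model the alternation: each "group" waits for its own token before the shared compute phase and then hands the token to the other group. The group count, iteration count, and names below are made up for illustration.

```cpp
#include <cstdio>
#include <functional>
#include <semaphore>
#include <thread>

// Host-side analogy of a ping-pong mutex between two groups:
// holding the token means the group may use the shared unit (e.g. tensor cores).
std::binary_semaphore turn0{1}; // group 0 starts with the token
std::binary_semaphore turn1{0};

void group(std::binary_semaphore& mine, std::binary_semaphore& other, int id, int iters)
{
    for (int i = 0; i < iters; ++i)
    {
        // ... prepare data for this iteration (can overlap with the other group's compute) ...
        mine.acquire();  // wait until it is this group's turn on the shared unit
        std::printf("group %d uses the shared unit (iter %d)\n", id, i);
        other.release(); // hand the turn to the other group
    }
}

int main()
{
    std::thread t0(group, std::ref(turn0), std::ref(turn1), 0, 4);
    std::thread t1(group, std::ref(turn1), std::ref(turn0), 1, 4);
    t0.join();
    t1.join();
    return 0;
}
```

The two semaphores play the role of the two alternating barriers: at most one group is in the compute phase at any time, and the handoff order is strictly ping-pong.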
344-350
: Static assertion may fail in RTC compilation.The
#ifndef __CUDACC_RTC__
guard around the static assertion is good, but consider adding a runtime assertion as backup for NVRTC compilation scenarios.#ifdef __CUDACC_RTC__ assert(nbBars == nbBarsRef); #endif
1882-1885
: Consider more descriptive error messages.The beam width check throws a generic "not implemented" error. Consider providing more specific error messages to help users understand the limitation.
if (beamWidth != 1) { throw std::runtime_error("MLA kernel currently only supports beam width of 1, got " + std::to_string(beamWidth)); }cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
1315-1334
: Comprehensive enhancement of block allocation strategy.The
calculateMaxNumBlocks
method has been significantly improved to:
- Consider per-window memory requirements through
windowSizeToLayers
parameter- Account for both primary and secondary memory allocation
- Support the new
BlocksPerWindow
return type for granular allocationThis enables more efficient memory utilization across different attention window configurations and layer distributions.
Consider documenting the memory allocation algorithm in detail, especially the weighting strategy used for distributing blocks across window sizes. This will help future maintainers understand the design decisions.
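If such documentation is added, a tiny illustration helps. The sketch below is purely hypothetical — it assumes a proportional weighting by window size times the number of layers using that window, which may or may not match the actual strategy in `calculateMaxNumBlocks` — and only shows how a block budget could be split across window sizes.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// Hypothetical illustration: split a total block budget across attention window
// sizes in proportion to windowSize * number of layers using that window.
std::map<uint32_t, uint64_t> splitBlocksPerWindow(
    uint64_t totalBlocks, std::map<uint32_t, std::vector<uint32_t>> const& windowSizeToLayers)
{
    double totalWeight = 0.0;
    for (auto const& [windowSize, layers] : windowSizeToLayers)
    {
        totalWeight += static_cast<double>(windowSize) * layers.size();
    }

    std::map<uint32_t, uint64_t> blocksPerWindow;
    for (auto const& [windowSize, layers] : windowSizeToLayers)
    {
        double const weight = static_cast<double>(windowSize) * layers.size() / totalWeight;
        blocksPerWindow[windowSize] = static_cast<uint64_t>(weight * totalBlocks);
    }
    return blocksPerWindow;
}

int main()
{
    // Example: a 4096-token window used by 24 layers and a 512-token window used by 8 layers.
    std::map<uint32_t, std::vector<uint32_t>> windowSizeToLayers{
        {4096, std::vector<uint32_t>(24)}, {512, std::vector<uint32_t>(8)}};
    for (auto const& [windowSize, blocks] : splitBlocksPerWindow(10000, windowSizeToLayers))
    {
        std::printf("window %u -> %llu blocks\n", windowSize, static_cast<unsigned long long>(blocks));
    }
    return 0;
}
```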
Missing required header for size_t (cpp/include/tensorrt_llm/executor/cacheCommunicator.h)

The file uses `size_t` in multiple method signatures (lines 49, 51, 64) but removes `<cstddef>`, which provides this type. While it may compile if `size_t` is transitively included through `serialization.h`, this creates a fragile dependency. Include both headers to ensure all used types are properly declared:

```diff
+#include <cstddef>
 #include "tensorrt_llm/executor/serialization.h"
```
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/executor/cacheCommunicator.h at line 19, the header
<cstddef> is missing, which is required to properly declare the size_t type used
in method signatures on lines 49, 51, and 64. Add #include <cstddef> alongside
the existing #include "tensorrt_llm/executor/serialization.h" to explicitly
include the definition of size_t and avoid fragile dependencies on transitive
includes.
🛠️ Refactor suggestion
Improve FAQ formatting and avoid hardcoded line numbers.
The FAQ content is helpful but has formatting issues:
- Triple quotes (`'''`) are not standard markdown syntax
- Hardcoded line numbers in setup.py references may become outdated

```diff
-What's the difference between cubins and cu files?
-
-'''
-Cubins are precompiled (from the internal fmha_v2 repo) binary files and take a lot of space, cu files are generated directly from this repo. Now we replace most of the kernels with cu files and delete unused cubins.
-You can modify code in this repo to change or create your own kernels and run.
-Now there are some kernels still running in cubins. See use_cubin_header(setup.py#L3055) and modify_cubin_header(setup.py#L3413) for details.
-'''
+What's the difference between cubins and cu files?
+
+Cubins are precompiled (from the internal fmha_v2 repo) binary files and take a lot of space, cu files are generated directly from this repo. Now we replace most of the kernels with cu files and delete unused cubins.
+You can modify code in this repo to change or create your own kernels and run.
+Now there are some kernels still running in cubins. See `use_cubin_header()` and `modify_cubin_header()` functions in setup.py for details.
```
🤖 Prompt for AI Agents
In cpp/kernels/fmha_v2/README.md around lines 84 to 90, replace the triple
quotes used for the FAQ explanation with proper markdown formatting such as
backticks for code or blockquotes for text. Remove the hardcoded line numbers in
the references to setup.py and instead describe the relevant sections or
functions by name only, so the documentation remains accurate even if the file
changes.
Update to latest action version.

The `actions/setup-python@v3` action is outdated and may not work properly on current GitHub Actions runners.

```diff
- uses: actions/setup-python@v3
+ uses: actions/setup-python@v4
```
🧰 Tools
🪛 actionlint (1.7.7)
15-15: the runner of "actions/setup-python@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🤖 Prompt for AI Agents
In .github/workflows/label_community_pr.yml around lines 15 to 17, the GitHub
Action uses an outdated version actions/setup-python@v3 which may cause
compatibility issues. Update the action to the latest stable version, such as
actions/setup-python@v4, to ensure proper functionality on current GitHub
Actions runners.
Update to latest action version.

The `actions/checkout@v3` action is outdated and may not work properly on current GitHub Actions runners.

```diff
- uses: actions/checkout@v3
+ uses: actions/checkout@v4
```
🧰 Tools
🪛 actionlint (1.7.7)
12-12: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🤖 Prompt for AI Agents
In .github/workflows/label_community_pr.yml at line 12, the actions/checkout@v3
is outdated. Update the version to the latest stable release, such as
actions/checkout@v4, to ensure compatibility and proper functionality with
current GitHub Actions runners.
🛠️ Refactor suggestion
Verify clockRate unit assumption and improve grid configuration
The code assumes `clockRate` is in kHz (as stated in the comment), but this should be verified. Additionally, the hardcoded grid configuration (16, 128) might not be optimal for all GPU architectures. Consider making the grid configuration adaptive:

```diff
-    uint64_t const nbCycles = std::round(prop.clockRate * ms); // clockRate is in kHz
-    kernel_warmup<<<16, 128, 0, stream>>>(nbCycles);
+    uint64_t const nbCycles = std::round(prop.clockRate * ms); // clockRate is in kHz
+    int const blocks = std::min(16, prop.multiProcessorCount);
+    int const threads = std::min(128, prop.maxThreadsPerBlock);
+    kernel_warmup<<<blocks, threads, 0, stream>>>(nbCycles);
```
🤖 Prompt for AI Agents
In cpp/kernels/xqa/test/warmup.cu around lines 15 to 16, verify that
prop.clockRate is indeed in kHz before using it to calculate nbCycles. Replace
the hardcoded kernel launch configuration (16, 128) with a dynamic calculation
based on the GPU's properties, such as the number of multiprocessors and maximum
threads per block, to optimize performance across different architectures.
🛠️ Refactor suggestion
Add documentation for complex index calculation.
The buffer index calculation logic is complex and needs explanation.
```diff
+    // Calculate buffer index considering duplication factors:
+    // - processIdx: current process index in the loop
+    // - ppDomainSize: pipeline parallelism domain size
+    // - peerDuplicateHeadFactor: how many peers share the same data
+    // - bufferTpRank: tensor parallel rank after accounting for duplication
+    // - bufferIdx: final buffer index combining TP and PP dimensions
     size_t ppDomainSize = targetInfo.mDomainPPSize;
     size_t bufferTpRank = (processIdx / ppDomainSize) / peerDuplicateHeadFactor;
     size_t bufferIdx = (bufferTpRank * ppDomainSize) + (processIdx % ppDomainSize);
```
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 254 to 257,
the calculation of bufferIdx involves multiple steps and variables, making it
hard to understand. Add a clear comment above this code block explaining the
purpose of each variable (ppDomainSize, bufferTpRank, bufferIdx), the rationale
behind the division and modulo operations, and how these relate to the indexing
logic. This will improve code readability and maintainability.
Fix incorrect index in buffer copy.

Using `processIdx` instead of `bufferIdx` when copying to pre-allocated buffer.

```diff
         else if (bufferCoverTargetNum > 0)
         {
             // copy buffer allocated by cudaMallocAsync to buffer allocated by cudaMalloc before sending
             auto sendBufferIdx = bufferIdx % bufferCoverTargetNum;
-            bufferManager.copy(*outputSplitCaches[processIdx], *outputSplitCaches.at(sendBufferIdx));
+            bufferManager.copy(*outputSplitCaches[bufferIdx], *outputSplitCaches.at(sendBufferIdx));
```
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp at lines 267-268, the
buffer copy uses processIdx as the source index instead of bufferIdx. Replace
processIdx with bufferIdx in the first argument of bufferManager.copy to
correctly reference the intended buffer for copying.
🛠️ Refactor suggestion
Move agent connection validation before buffer allocation.
The check for agent connections requiring pre-allocated buffers happens after allocation, which could waste resources.
```diff
+    auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
+    if (agentConnection != nullptr)
+    {
+        // Agent connections require pre-allocated buffers
+        // Validate this early to avoid unnecessary allocations
+    }
+
     auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
         cacheBufferId, bufferTargetNum, targetBufferSize, bufferManager);
     auto& outputSplitCaches = std::get<0>(result);
     auto& bufferCoverTargetNum = std::get<1>(result);
     auto& onlyUseDynamicBuffer = std::get<2>(result);
-    auto* agentConnnecion = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
-    if (agentConnnecion != nullptr)
+    if (agentConnection != nullptr)
     {
         TLLM_CHECK_WITH_INFO(bufferCoverTargetNum == bufferTargetNum, "Agent need all buffer pre-allocated");
         TLLM_CHECK(onlyUseDynamicBuffer == false);
     }
```
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 228 to 233,
move the validation that checks if the agent connection requires all buffers to
be pre-allocated before any buffer allocation occurs. This prevents unnecessary
allocation of buffers when the agent connection condition is not met. Adjust the
code so the dynamic_cast and related TLLM_CHECK validations happen prior to
buffer allocation logic.
Add validation for requestBlockNum.

The code assumes `requestBlockNum` is valid but doesn't check if it exceeds the available blocks.

```diff
     if (requestBlockNum < blockRange.size() && requestBlockNum > 0)
     {
         // handle block reuse, the prefix blocks are reused
         // TODO(zhengd): pass the hashes directly instead of from llmRequest; use hash instead of block num
         auto const& ids = blockRange.getBlockIds();
+        TLLM_CHECK(requestBlockNum <= ids.size(), "Requested block number exceeds available blocks");
         blockRange.setBlockIds({ids.end() - requestBlockNum, ids.end()});
     }
```
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 50 to 57, the
code uses requestBlockNum without validating if it exceeds the number of
available blocks in blockRange. Add a validation step before the if condition to
ensure requestBlockNum is not greater than the size of blockRange and is
positive. If invalid, handle the case appropriately, such as returning an empty
blockRange or throwing an error, to prevent out-of-bounds access.
Review context (cpp/kernels/xqa/barriers.cuh, lines 151–173; diff hunk as rendered, with removed and added lines interleaved):

```cpp
template <Scope scope = defaultScope, ArriveOrder order = ArriveOrder::RELEASE>
__device__ inline mha::conditional_t<scope == Scope::CTA, ArrivalToken, void> arrive_tx_relaxed(uint32_t txCount)
{
    if (arriveCount == 1)
#if __CUDA_ARCH__ >= 900
    if constexpr (scope == Scope::CTA)
    {
        ArrivalToken token;
        asm volatile("mbarrier.arrive.expect_tx.release.cta.shared::cta.b64 %0, [%1], %2;\n"
        asm volatile("mbarrier.arrive.expect_tx.relaxed.cta.b64 %0, [%1], %2;\n"
            : "=l"(token)
            : "r"(addr()), "r"(txCount));
            : "l"(addr()), "r"(txCount)
            : "memory");
        return token;
    }
    else
    {
        asm volatile("mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;\n" ::"r"(addr()), "r"(txCount));
        return arrive(arriveCount);
        asm volatile("mbarrier.arrive.expect_tx.relaxed.cluster.b64 _, [%0], %1;\n" ::"l"(addr()), "r"(txCount)
            : "memory");
        return;
    }
#else
    asm volatile("trap;\n");
#endif
}
```
Fix template parameter shadowing.

In `arrive_tx_relaxed`, the template parameter `order` shadows the method name but is never used. This appears to be a copy-paste error.

```diff
-template <Scope scope = defaultScope, ArriveOrder order = ArriveOrder::RELEASE>
+template <Scope scope = defaultScope>
 __device__ inline mha::conditional_t<scope == Scope::CTA, ArrivalToken, void> arrive_tx_relaxed(uint32_t txCount)
```
🤖 Prompt for AI Agents
In cpp/kernels/xqa/barriers.cuh around lines 151 to 173, the template parameter
'order' shadows the method name 'arrive_tx_relaxed' but is unused, likely a
copy-paste mistake. Remove the unused 'order' template parameter from the
function template declaration to fix the shadowing and clean up the code.
Signed-off-by: Pengyun Lin <[email protected]>
Served its purpose!
PR created to look at diff between releases, not intended to be merged