[fix]: disable cutlass block scaled group gemm for EP #20781
Conversation
Signed-off-by: Duncan Moss <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @djmmoss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a critical issue where the cutlass block scaled group gemm operation leads to device-side assertions when `expert_parallel` is enabled. The changes introduce a mechanism to detect the presence of `expert_parallel` and, if active, disable the problematic Cutlass implementation, ensuring a stable fallback to Triton for MoE operations.
Highlights

- Bug Fix for Expert Parallelism: Implemented a fix to prevent device-side assertions that occur when the cutlass block scaled group gemm is used in conjunction with `expert_parallel` enabled, particularly on SM100 architectures.
- Conditional Disablement of Cutlass: Modified the `_valid_cutlass_block_scaled_grouped_gemm` function to accept an `expert_map` argument. If `expert_map` is present (indicating `expert_parallel` is active), the function now explicitly disables the Cutlass-based GEMM and logs a warning.
- Fallback to Triton: Ensures that when the Cutlass implementation is disabled because `expert_parallel` is active, the system gracefully falls back to a compatible Triton implementation for Mixture of Experts (MoE) operations, maintaining stability and preventing runtime errors (a dispatch sketch follows below).
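For illustration only, here is a minimal sketch of the dispatch behavior described above. The function name and string labels are hypothetical and do not match the actual vLLM code, which performs this selection inside the fused MoE dispatch (see the diff excerpt later in this thread):

```python
from typing import Optional

import torch


def select_block_scaled_moe_impl(use_cutlass: bool,
                                 expert_map: Optional[torch.Tensor]) -> str:
    """Return the name of the MoE GEMM path to use (illustrative sketch).

    An expert_map implies expert parallelism (EP); in that case the SM100
    Cutlass block-scaled grouped GEMM is skipped and the Triton path is used.
    """
    if use_cutlass and expert_map is None:
        return "cutlass_block_scaled_grouped_gemm"
    return "triton_fused_moe"
```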
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request correctly fixes a device-side assertion by disabling the `cutlass_block_scaled_grouped_gemm` kernel when expert parallelism is enabled. The implementation is clean and directly addresses the issue. My only recommendation is to add a unit test to verify this new logic and prevent potential regressions.
```python
if expert_map is not None:
    logger.warning(
        "CutlassBlockScaledGroupedGemm disabled: expert_parallel is"
        " not supported.")
    return False
```
This change correctly disables the kernel when expert parallelism is active. However, the pull request lacks tests to verify this new behavior. Adding a unit test is crucial for bug fixes to prevent regressions and ensure the logic is sound.

A simple unit test could be added to `tests/kernels/moe/test_cutlass_moe.py` to confirm that `_valid_cutlass_block_scaled_grouped_gemm` returns `False` when an `expert_map` is provided and `True` otherwise (assuming other conditions pass).
Here's an example test case:

```python
import pytest
import torch

from vllm.model_executor.layers.fused_moe.cutlass_moe import _valid_cutlass_block_scaled_grouped_gemm


def test_valid_cutlass_block_scaled_grouped_gemm_ep_logic():
    # Create tensors that would otherwise be valid for the kernel.
    # Shapes must be multiples of 128 for this kernel.
    N, K = 128, 128
    w1 = torch.empty(1, 2 * N, K, dtype=torch.float8_e4m3fn, device="cuda")
    w2 = torch.empty(1, K, N, dtype=torch.float8_e4m3fn, device="cuda")

    # When expert_map is provided, the kernel should be disabled.
    expert_map = torch.tensor([0], device="cuda")
    assert not _valid_cutlass_block_scaled_grouped_gemm(w1, w2, expert_map)

    # When expert_map is None, the kernel should be considered valid.
    assert _valid_cutlass_block_scaled_grouped_gemm(w1, w2, None)
```
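If a test along these lines is adopted, it would likely need to be gated on GPU availability, since the FP8 weight tensors are allocated on a CUDA device. A minimal sketch of such a guard, assuming the standard pytest skip mechanism:

```python
import pytest
import torch

# Skip the test on hosts without a CUDA device, since the FP8 weight
# tensors in the example above are allocated on the GPU.
requires_cuda = pytest.mark.skipif(not torch.cuda.is_available(),
                                   reason="requires a CUDA device")
```

The marker could then be applied to the test function with `@requires_cuda`.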
```diff
     )
     elif (allow_cutlass_block_scaled_grouped_gemm and use_fp8_w8a8
-          and _valid_cutlass_block_scaled_grouped_gemm(w1, w2)):
+          and _valid_cutlass_block_scaled_grouped_gemm(w1, w2, expert_map)):
```
I think this function should check a few extra things actually: activation, apply_router_weight_on_input, expert_map and probably inplace
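A sketch of what those additional guards might look like, with a hypothetical helper name; the parameter list follows the comment above, and the exact signature, messages, and supported activation in the merged code may differ:

```python
from typing import Optional

import torch


def _valid_for_cutlass_sketch(activation: str,
                              apply_router_weight_on_input: bool,
                              inplace: bool,
                              expert_map: Optional[torch.Tensor]) -> bool:
    # Hypothetical extension of the validity check, following the reviewer's
    # list of conditions; the real function also validates weight shapes/dtypes.
    if activation != "silu":  # assumption: only the default SiLU path is handled
        return False
    if apply_router_weight_on_input:
        return False
    if inplace:
        return False
    if expert_map is not None:  # expert parallelism is not supported
        return False
    return True
```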
Signed-off-by: Duncan Moss <[email protected]>
Thanks, LGTM!
" apply_router_weight_on_input is not supported.") | ||
return False | ||
|
||
if inplace: |
Hi @djmmoss, why do we need to disable when `inplace` is True?
Essential Elements of an Effective PR Description Checklist

- (Optional) Documentation update, such as supported_models.md and examples for a new model.

Purpose
Currently the block scaled group gemm for SM100 doesn't support `enable_expert_parallel`. If the feature is enabled, it can result in device-side assertions. This change falls back to Triton if `enable_expert_parallel=True`.
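For context, a hedged example of how expert parallelism is typically enabled through vLLM's engine arguments; the model name below is a placeholder and the exact argument set depends on the model and hardware:

```python
from vllm import LLM

# enable_expert_parallel toggles expert parallelism for MoE models; with this
# change, the SM100 Cutlass block-scaled grouped GEMM is skipped in EP mode
# and the Triton MoE kernels are used instead.
llm = LLM(model="deepseek-ai/DeepSeek-V3",  # placeholder MoE model
          tensor_parallel_size=8,
          enable_expert_parallel=True)
```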
Test Plan
N/A
Test Result
N/A
(Optional) Documentation Update