[DP/EP][GPTOSS] Use triton matmul-ogs kernels for GPTOSS DP/EP #24588
Conversation
Code Review
This pull request introduces support for Triton matmul-ogs kernels for GPT-OSS models with Data Parallelism and Expert Parallelism, specifically for mxfp4 quantization. This is done by adding a new OAITritonExperts class. The PR also includes substantial changes to dependency management, removing pinned versions of PyTorch and related packages across various requirement files. My review identified some leftover debugging print statements that should be removed before merging.
Status: getting an IMA (illegal memory access) during model startup.
Update: it appears the DeepEP kernels don't support the gpt-oss hidden size (2880). I have a PR in the DeepEP repo to fix this: deepseek-ai/DeepEP#408. I see correctness with this fix.
You can add padding here so we don't need to change the DeepEP side: vllm/vllm/model_executor/layers/fused_moe/layer.py, lines 837 to 848 in 7ae9887.
This pull request has merge conflicts that must be resolved before it can be merged.
Nice idea 🙌 I have updated the code to add this padding. PTAL! Thanks 👍
if hidden_size_bytes % 512 == 0:
    return hidden_size

hidden_size_bytes = round_up(hidden_size_bytes, xfer_atom_size)
return hidden_size_bytes // dtype.itemsize
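For context, here is a minimal self-contained sketch of this padding logic and how it plays out for gpt-oss, assuming a 512-byte transfer granularity and bf16 activations; the helper names are illustrative, not the exact vLLM code:

```python
# Illustrative sketch of the hidden-size padding idea discussed above.
# Assumptions: DeepEP moves data in 512-byte atoms, and gpt-oss uses
# hidden_size=2880 with bf16 activations. Function names are hypothetical.
import torch


def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple of `multiple`."""
    return ((x + multiple - 1) // multiple) * multiple


def padded_hidden_size(hidden_size: int, dtype: torch.dtype,
                       xfer_atom_size: int = 512) -> int:
    hidden_size_bytes = hidden_size * dtype.itemsize
    if hidden_size_bytes % xfer_atom_size == 0:
        return hidden_size
    hidden_size_bytes = round_up(hidden_size_bytes, xfer_atom_size)
    return hidden_size_bytes // dtype.itemsize


# gpt-oss: 2880 elements * 2 bytes = 5760 bytes, and 5760 % 512 == 128,
# so the activation is padded up to 6144 bytes == 3072 bf16 elements.
print(padded_hidden_size(2880, torch.bfloat16))  # -> 3072
```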
Can this be replaced with cdiv(hidden_size_bytes, xfer_atom_size)?
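For reference, a small sketch of the equivalence being suggested, assuming cdiv is ceiling division as in vllm.utils:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: smallest integer >= a / b."""
    return (a + b - 1) // b


# round_up(x, m) == cdiv(x, m) * m, so the padded element count could
# equivalently be computed as:
#     cdiv(hidden_size_bytes, xfer_atom_size) * xfer_atom_size // dtype.itemsize
assert cdiv(5760, 512) * 512 == 6144
```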
This pull request has merge conflicts that must be resolved before it can be merged.
else:
    # TODO (bnell): This is a hack to get test_mixtral_moe to work
    # since model_config is not set in the pytest test.
    moe_in_dtype = params_dtype
We need a better way to infer the activation datatype entering the MoE layers. At the moment, all of the MoE code interprets model_config.dtype as the activation datatype, as can be seen in the construction of the FusedMoEConfig object. I think this is very brittle and prone to errors.
cc @mgoin @bnellnm is there a better way that you know of? Thanks.
I think I wrote this bit of code. I don't know of a better way to get the type, but I agree that it would be good to make this more robust. The else clause is literally only for one test. I wouldn't worry too much about whether or not that is brittle.
The issue, I think, is that it really depends on the output dtype of the layer before the MoE layer (usually attention). If that layer chooses to quantize the activations, then this would be wrong 😟
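To make the concern concrete, here is a hedged sketch of the inference pattern being discussed; ModelConfig and the field names below are simplified stand-ins, not the exact vLLM classes:

```python
import torch
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    # In the current MoE code path, this global model dtype is treated as
    # the dtype of the activations entering the MoE layer.
    dtype: torch.dtype


def infer_moe_in_dtype(model_config: Optional[ModelConfig],
                       params_dtype: torch.dtype) -> torch.dtype:
    if model_config is not None:
        # Brittle assumption: the layer feeding the MoE block (usually
        # attention) emits activations in model_config.dtype. If that layer
        # quantized its output, this inference would be wrong.
        return model_config.dtype
    # Test-only fallback when no model_config is constructed (see the TODO
    # in the quoted snippet above).
    return params_dtype
```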
I have little clue beyond this
Purpose
Enable DP/EP for the gpt-oss model via DeepEPHighThroughput and triton_kernels' matmul_ogs.
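For illustration only, a hedged sketch of how this configuration might be exercised offline; the all2all backend environment variable and engine arguments below are assumptions based on vLLM's engine options and may differ from the exact invocation used for this PR:

```python
import os

# Assumption: select the DeepEP high-throughput all2all backend via the
# VLLM_ALL2ALL_BACKEND environment variable before importing vLLM.
os.environ["VLLM_ALL2ALL_BACKEND"] = "deepep_high_throughput"

from vllm import LLM, SamplingParams

# Assumption: data_parallel_size / enable_expert_parallel are forwarded to
# the engine arguments; adjust to the flags actually used in your deployment.
llm = LLM(
    model="openai/gpt-oss-20b",
    data_parallel_size=2,
    enable_expert_parallel=True,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
for out in outputs:
    print(out.outputs[0].text)
```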
Test Plan
Run gpt-oss eval from link
Test Result
Benchmark
Please find benchmark results here
TL;DR: good TTFT numbers; bad TPOT numbers, as there are no CUDA graphs for decode.
Essential Elements of an Effective PR Description Checklist
(Optional) Documentation update, including supported_models.md and examples for a new model.