[ROCm][Kernel] MoE weights padding #14454
Conversation
Signed-off-by: Gregory Shtrasberg <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
QQ - would we ever not want to do this if we are on ROCm for MoE?
It has been mostly tested on Mixtral; other MoE models, especially those with custom MoE implementations, may fail due to improper padding handling.
I think this feature should be improved so it generally satisfies the FusedMoE interface. It seems like a footgun if it fails on common MoEs other than Mixtral. Could you give an example of a custom MoE impl that would fail with this?
Hi, this feature should work for any model that extends the FusedMoE class. However, if you are only importing the fused_moe kernel to plug it into a custom layer, then it requires some caution.
We could do the same condition check as for the FP8 padding.
There is a way to avoid this: we can also pad the weight tensor and then do a slice operation on it, just like what we did in the FP8 padding PR #13231. If we do so, there is no need to have the
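A minimal sketch of that pad-then-slice approach, modeled on the FP8 padding in #13231; the 256-byte granularity and the contiguity guard here are assumptions, not necessarily the exact conditions this PR ships:

```python
import torch
import torch.nn.functional as F


def pad_and_slice(weight: torch.Tensor, pad_bytes: int = 256) -> torch.Tensor:
    """Pad the innermost dim of `weight`, then slice back to the original
    width. The logical shape stays the same, but the storage (and therefore
    the stride of the second-to-last dim) is padded."""
    # Mirror the FP8-padding style guard: only touch row-major weights;
    # anything else is returned unchanged.
    if weight.stride(-1) != 1:
        return weight
    num_pad = pad_bytes // weight.element_size()
    padded = F.pad(weight, (0, num_pad), "constant", 0)[..., :-num_pad]
    # As in the FP8 padding path, eagerly release cached blocks after the copy.
    torch.cuda.empty_cache()
    return padded
```

Because the slice only changes the view, callers that check `.shape` still see the original dimensions, while the kernel picks up the padded layout through the tensor's strides, so nothing extra has to be threaded through the kernel call.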
Signed-off-by: charlifu <[email protected]>
Force-pushed from bddc6c3 to fa2b8d1
Signed-off-by: charlifu <[email protected]>
layer.register_parameter("w2_weight", w2_weight)
set_weight_attrs(w2_weight, extra_weight_attrs)

def add_padding_to_weight(self, weight: torch.Tensor) -> torch.Tensor:
Maybe call it maybe_pad_weight?
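For context, a sketch of where such a helper would typically be invoked. `process_weights_after_loading` is the existing post-load hook on vLLM quant-method classes, and `w13_weight`/`w2_weight` are the parameters FusedMoE registers, but the body below is an illustration under those assumptions, not the PR's exact code:

```python
from typing import Callable

import torch


class PaddedMoEMethodSketch:
    """Illustration only: shows where a maybe_pad_weight helper would be
    invoked after the FusedMoE weights are loaded."""

    def __init__(self,
                 maybe_pad_weight: Callable[[torch.Tensor], torch.Tensor]):
        self.maybe_pad_weight = maybe_pad_weight

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # w13_weight (fused gate/up projection) and w2_weight (down
        # projection, as in the hunk quoted above) are re-wrapped after
        # padding so downstream code keeps seeing plain Parameters.
        layer.w13_weight = torch.nn.Parameter(
            self.maybe_pad_weight(layer.w13_weight.data), requires_grad=False)
        layer.w2_weight = torch.nn.Parameter(
            self.maybe_pad_weight(layer.w2_weight.data), requires_grad=False)
```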
"VLLM_ROCM_FP8_PADDING": | ||
lambda: bool(int(os.getenv("VLLM_ROCM_FP8_PADDING", "1"))), | ||
# Divisor for dynamic key scale factor calculation for FP8 KV Cache | ||
|
Why is this not enabled by default?
It used to be enabled by default.
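For comparison, here is how a MoE-padding toggle could be declared next to the FP8 one quoted above. The variable name `VLLM_ROCM_MOE_PADDING` and the default value (off, following the PR description) are assumptions, not verified against the merged code:

```python
import os

# Hypothetical toggle mirroring the VLLM_ROCM_FP8_PADDING entry quoted above;
# defaulting to "0" follows the PR description ("disabled by default").
environment_variables = {
    "VLLM_ROCM_MOE_PADDING":
    lambda: bool(int(os.getenv("VLLM_ROCM_MOE_PADDING", "0"))),
}

if __name__ == "__main__":
    print("MoE padding enabled:",
          environment_variables["VLLM_ROCM_MOE_PADDING"]())
```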
Signed-off-by: charlifu <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: charlifu <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: charlifu <[email protected]> Co-authored-by: charlifu <[email protected]>
Optimization ported over from ROCm/vllm.
Applying weight padding for MoE.
The principle and rationale are similar to the FP8 padding in #13231, except here it applies to the half-precision types.
The optimization is more experimental and does not apply to every MoE model, so it is disabled by default.
Expanded unit tests to cover the padding case (a sketch of the invariant they check follows below).
Performance-wise, up to a 10% latency improvement can be observed with this feature enabled on mistralai/Mixtral-8x22B-Instruct-v0.1 in the following configuration: bs=64; in=256; out=256; tp=8.
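Not this PR's actual test, but a minimal sketch of the invariant the expanded padding tests need to hold: padding plus slicing changes only the storage stride, never the numerics (float32 on CPU here for portability; the real tests exercise half precision and the fused MoE kernels on ROCm):

```python
import torch
import torch.nn.functional as F


def test_padded_weight_matches_original():
    """Padding plus slicing must leave the logical shape and the numerics
    unchanged; only the underlying row stride differs."""
    torch.manual_seed(0)
    x = torch.randn(8, 512)      # [tokens, hidden]
    w = torch.randn(1024, 512)   # [intermediate, hidden]

    num_pad = 256 // w.element_size()
    w_padded = F.pad(w, (0, num_pad), "constant", 0)[..., :-num_pad]

    assert w_padded.shape == w.shape          # logical shape unchanged
    assert w_padded.stride(0) != w.stride(0)  # row stride is now padded

    torch.testing.assert_close(x @ w.t(), x @ w_padded.t())
```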