[ROCm][Kernel] MoE weights padding #14454
Conversation
Signed-off-by: Gregory Shtrasberg <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
QQ - would we ever not want to do this if we are on ROCm for MoE?
It has been mostly tested on Mixtral; other MoE models, especially those with custom MoE implementations, may fail due to improper padding handling.
I think this feature should be improved so it generally satisfies the FusedMoE interface. It seems like a footgun if it fails on common MoEs other than Mixtral. Could you give an example of a custom MoE impl that would fail with this?
Hi, this feature should work for any model that extends the FusedMoE class. However, if you are only importing the fused_moe kernel to plug it into a custom layer, then it requires some caution.
We could do the same condition check as for the FP8 padding.
There is a way to avoid this: we can also pad the weight tensor and then do a slice operation on it, just like what we did in the FP8 padding PR #13231. If we do so, there is no need to have the
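A minimal sketch of that pad-then-slice approach, modeled on the FP8 padding in #13231; the 256-byte granularity and the contiguity guard here are assumptions, not necessarily the exact conditions this PR ships:

```python
import torch
import torch.nn.functional as F


def pad_and_slice(weight: torch.Tensor, pad_bytes: int = 256) -> torch.Tensor:
    """Pad the innermost dim of `weight`, then slice back to the original
    width. The logical shape stays the same, but the storage (and therefore
    the stride of the second-to-last dim) is padded."""
    # Mirror the FP8-padding style guard: only touch row-major weights;
    # anything else is returned unchanged.
    if weight.stride(-1) != 1:
        return weight
    num_pad = pad_bytes // weight.element_size()
    padded = F.pad(weight, (0, num_pad), "constant", 0)[..., :-num_pad]
    # As in the FP8 padding path, eagerly release cached blocks after the copy.
    torch.cuda.empty_cache()
    return padded
```

Because the slice only changes the view, callers that check `.shape` still see the original dimensions, while the kernel picks up the padded layout through the tensor's strides, so nothing extra has to be threaded through the kernel call.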
Signed-off-by: charlifu <[email protected]>
Force-pushed from bddc6c3 to fa2b8d1
Signed-off-by: charlifu <[email protected]>
layer.register_parameter("w2_weight", w2_weight)
set_weight_attrs(w2_weight, extra_weight_attrs)

def add_padding_to_weight(self, weight: torch.Tensor) -> torch.Tensor:
Maybe call it maybe_pad_weight?
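For context, a sketch of where such a helper would typically be invoked. `process_weights_after_loading` is the existing post-load hook on vLLM quant-method classes, and `w13_weight`/`w2_weight` are the parameters FusedMoE registers, but the body below is an illustration under those assumptions, not the PR's exact code:

```python
from typing import Callable

import torch


class PaddedMoEMethodSketch:
    """Illustration only: shows where a maybe_pad_weight helper would be
    invoked after the FusedMoE weights are loaded."""

    def __init__(self,
                 maybe_pad_weight: Callable[[torch.Tensor], torch.Tensor]):
        self.maybe_pad_weight = maybe_pad_weight

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # w13_weight (fused gate/up projection) and w2_weight (down
        # projection, as in the hunk quoted above) are re-wrapped after
        # padding so downstream code keeps seeing plain Parameters.
        layer.w13_weight = torch.nn.Parameter(
            self.maybe_pad_weight(layer.w13_weight.data), requires_grad=False)
        layer.w2_weight = torch.nn.Parameter(
            self.maybe_pad_weight(layer.w2_weight.data), requires_grad=False)
```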
"VLLM_ROCM_FP8_PADDING": | ||
lambda: bool(int(os.getenv("VLLM_ROCM_FP8_PADDING", "1"))), | ||
# Divisor for dynamic key scale factor calculation for FP8 KV Cache | ||
|
Why is this not enabled by default?
It used to be enabled by default.
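For comparison, here is how a MoE-padding toggle could be declared next to the FP8 one quoted above. The variable name `VLLM_ROCM_MOE_PADDING` and the default value (off, following the PR description) are assumptions, not verified against the merged code:

```python
import os

# Hypothetical toggle mirroring the VLLM_ROCM_FP8_PADDING entry quoted above;
# defaulting to "0" follows the PR description ("disabled by default").
environment_variables = {
    "VLLM_ROCM_MOE_PADDING":
    lambda: bool(int(os.getenv("VLLM_ROCM_MOE_PADDING", "0"))),
}

if __name__ == "__main__":
    print("MoE padding enabled:",
          environment_variables["VLLM_ROCM_MOE_PADDING"]())
```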
Signed-off-by: charlifu <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: charlifu <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: charlifu <[email protected]> Co-authored-by: charlifu <[email protected]>
Optimization ported over from ROCm/vllm.
Applying weight padding for MoE.
The principle and rationale are similar to the FP8 padding in #13231, except here it applies to the half-precision types.
The optimization is more experimental and does not apply to every MoE model, so it is disabled by default.
Expanded unit tests to cover the padding case (a sketch of the invariant they check follows below).
Performance-wise, up to a 10% latency improvement can be observed with this feature enabled on mistralai/Mixtral-8x22B-Instruct-v0.1 in the following configuration: bs=64; in=256; out=256; tp=8.
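Not this PR's actual test, but a minimal sketch of the invariant the expanded padding tests need to hold: padding plus slicing changes only the storage stride, never the numerics (float32 on CPU here for portability; the real tests exercise half precision and the fused MoE kernels on ROCm):

```python
import torch
import torch.nn.functional as F


def test_padded_weight_matches_original():
    """Padding plus slicing must leave the logical shape and the numerics
    unchanged; only the underlying row stride differs."""
    torch.manual_seed(0)
    x = torch.randn(8, 512)      # [tokens, hidden]
    w = torch.randn(1024, 512)   # [intermediate, hidden]

    num_pad = 256 // w.element_size()
    w_padded = F.pad(w, (0, num_pad), "constant", 0)[..., :-num_pad]

    assert w_padded.shape == w.shape          # logical shape unchanged
    assert w_padded.stride(0) != w.stride(0)  # row stride is now padded

    torch.testing.assert_close(x @ w.t(), x @ w_padded.t())
```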