[Misc] Add unit tests for MoE ModularKernel combinations + Profiling utility #20449
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging. 🚀
Summary of Changes
Hello @varun-sundar-rabindranath, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the testing infrastructure for Mixture-of-Experts (MoE) ModularKernels. It introduces a comprehensive unit test suite to validate various combinations of `PrepareFinalize` and `FusedExperts` implementations in both single-GPU and multi-GPU (single-node) environments. Additionally, it integrates a utility for generating PyTorch profiler traces, aiding performance analysis and debugging of these MoE configurations.
Highlights
- New MoE Combination Test Suite: A new file, `test_modular_kernel_combinations.py`, provides extensive unit tests for different `PrepareFinalize` and `FusedExperts` combinations, including various quantization settings (FP8, block-quantized) and `topk` values. These tests cover both single-GPU and multi-GPU (single-node) scenarios.
- Distributed Test Utility Enhancement: The `parallel_utils.py` file is updated with `parallel_launch_with_config`, a new helper function that simplifies launching distributed tests with specific vLLM configurations and environment variables, crucial for testing MoE setups.
- MoE Initialization Refactoring: The `FusedMoEMethodBase` class in `layer.py` is refactored to introduce a static method, `maybe_make_prepare_finalize`, centralizing the logic for creating `PrepareAndFinalize` objects. This simplifies the initialization process and improves modularity.
- Profiling Utility Integration: The new test script includes an option (`--do-profile`) to generate detailed PyTorch profiler traces for specific MoE kernel executions, enabling in-depth performance analysis (a generic sketch of this kind of profiling follows below).
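For context on the `--do-profile` option mentioned in the last highlight, here is a minimal, generic sketch of the kind of `torch.profiler` usage that producing such traces typically involves. The profiled function and the output path are placeholders, not the PR's actual code.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def run_moe_kernel() -> torch.Tensor:
    # Placeholder for the MoE modular-kernel invocation being profiled.
    a = torch.randn(128, 256)
    return a @ a.t()


# Only record CUDA activity when a GPU is actually available.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    run_moe_kernel()

# Chrome-trace JSON, viewable in chrome://tracing or Perfetto.
prof.export_chrome_trace("moe_trace.json")
```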
Code Review
This pull request introduces a comprehensive test suite for MoE ModularKernel combinations, which is a valuable addition for ensuring code quality and correctness. The ability to profile different combinations is also a great feature.
I've found a couple of issues that could affect the reliability of the tests. Specifically, a hardcoded port in the parallel utilities could lead to flaky tests, and there's a potential argument swap in the weight generation logic that could cause incorrect behavior. Addressing these points will help solidify this excellent contribution.
need some help in verifying if there are more quant configs we should consider. cc @robertgshaw2-redhat @tlrmchlsmth @mgoin Thanks 🙌
We need to leave placeholders for NVFP4 as well
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 1eb40d7 to ba71fd2
cruft: set below as well
match the batched case
refactor: move prepare-finalize init to a staticmethod that can be invoked from the tests.
Force-pushed from ac43e90 to 9283aa9
verify if some combination / config is valid.
Mostly looks good once it's working smoothly -- one thing I ran into when running an example from the PR description:
python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type "BatchedTritonExperts"
and hit the following assert:
(EngineCore_1 pid=835) AssertionError: with expert map, -1 id is used for
(EngineCore_1 pid=835) non-local token; this causes error when casting ids to the
(EngineCore_1 pid=835) topk_indices_dtype() uint32
...which looks like a good assert to me. Expert maps + pplx kernels shouldn't be combined IMO
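For context on why the assert matters: casting a -1 non-local-token sentinel to an unsigned 32-bit index silently wraps to a huge value instead of failing, which can then be misread as a valid expert id. A tiny NumPy illustration of the wraparound (NumPy is used here only for demonstration):

```python
import numpy as np

# A -1 "non-local token" sentinel reinterpreted as an unsigned 32-bit
# index wraps to the largest uint32 value instead of raising an error.
topk_ids = np.array([3, 7, -1], dtype=np.int64)
print(topk_ids.astype(np.uint32))  # -> [3 7 4294967295]
```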
Thanks @tlrmchlsmth. The error should be fixed by #20714. I have noticed that the …
Force-pushed from 1142bf8 to 670e76a
Head branch was pushed to by a user without write access
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Force-pushed from 438001f to 6f1bf3e
…utility (vllm-project#20449) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]>
…utility (vllm-project#20449) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
…utility (vllm-project#20449) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Paul Pak <[email protected]>
…utility (vllm-project#20449) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
Purpose
The ModularKernel framework is very useful for mixing and matching different PrepareFinalize objects with FusedExperts implementations. The catch is that it is hard to test the various combinations of these operations. This PR adds a `test_modular_kernel_combinations` unit test that exercises the various combinations in both multi-GPU (single-node) and single-GPU settings.
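To make the mix-and-match idea concrete, below is a deliberately simplified, generic sketch of the composition pattern. The class and method names are illustrative stand-ins, not vLLM's actual interfaces.

```python
from abc import ABC, abstractmethod

import torch


class PrepareFinalize(ABC):
    """Illustrative stand-in: token dispatch/combine around expert compute."""

    @abstractmethod
    def prepare(self, x: torch.Tensor) -> torch.Tensor: ...

    @abstractmethod
    def finalize(self, x: torch.Tensor) -> torch.Tensor: ...


class FusedExperts(ABC):
    """Illustrative stand-in: the fused expert computation."""

    @abstractmethod
    def apply(self, x: torch.Tensor) -> torch.Tensor: ...


class ModularKernel:
    """Any PrepareFinalize can be paired with any FusedExperts."""

    def __init__(self, pf: PrepareFinalize, experts: FusedExperts):
        self.pf = pf
        self.experts = experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pf.finalize(self.experts.apply(self.pf.prepare(x)))
```

The surface the tests need to cover is essentially the cross product of concrete PrepareFinalize and FusedExperts implementations, plus the quantization settings.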
Design
All the tooling lives under `tests/kernels/moe/modular_kernel_tools`:
- `tests/kernels/moe/modular_kernel_tools/mk_objects.py` defines all high-level collections, like all prepare-finalize types, all fused-experts types and all quant configs.
- `tests/kernels/moe/modular_kernel_tools/common.py` defines all high-level utilities, mainly the functions `make_modular_kernel` and `run_modular_kernel`.
- `tests/kernels/moe/test_modular_kernel_combinations.py`, the profiling code and the feature-matrix generator code all leverage the `make_modular_kernel` / `run_modular_kernel` functions (a rough sketch of the resulting test structure follows below).
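For illustration only, here is a self-contained sketch of what such a combination sweep can look like. The `make_modular_kernel` / `run_modular_kernel` stand-ins below are placeholders with assumed signatures; the real helpers live in `common.py`, and the real type collections come from `mk_objects.py`.

```python
import itertools

import pytest
import torch

# Assumed example names purely for illustration.
PF_TYPES = ["PplxPrepareAndFinalize", "DeepEPHTPrepareAndFinalize"]
EXPERTS_TYPES = ["BatchedTritonExperts", "TritonExperts"]
QUANT_CONFIGS = ["fp8", "fp8-block"]


def make_modular_kernel(pf_type: str, experts_type: str, quant: str):
    # Placeholder: the real helper constructs the actual kernel objects.
    return (pf_type, experts_type, quant)


def run_modular_kernel(mk) -> tuple[torch.Tensor, torch.Tensor]:
    # Placeholder: the real helper runs the kernel and a reference impl.
    out = torch.randn(8, 16)
    return out, out.clone()


@pytest.mark.parametrize(
    "pf_type,experts_type,quant",
    list(itertools.product(PF_TYPES, EXPERTS_TYPES, QUANT_CONFIGS)),
)
def test_combination(pf_type: str, experts_type: str, quant: str):
    mk = make_modular_kernel(pf_type, experts_type, quant)
    out, ref = run_modular_kernel(mk)
    torch.testing.assert_close(out, ref)
```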
Restrictions
- The multi-GPU tests only exercise the `--data-parallel-size=2` and `--tensor-parallel-size=1` case.
- The tests require the `pplx`, `deep_ep` and `deep_gemm` packages to run. This is a harsh requirement that can be relaxed.
Features
- `test_modular_kernel_combinations.py` can be run as a standalone script to test specific PrepareAndFinalize and FusedExperts combinations.

Profiling command example:
python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts --torch-trace-dir-path /home/varun/code/vllm/torch_trace_files/
Feature Matrix Generation command example:
python3 -m tests.kernels.moe.modular_kernel_tools.make_feature_matrix -f feature_matrices/feature_matrix.csv
feature_matrix.csv
Test Plan
Machine: H100
pytest: `test_modular_kernel_combinations.py` passes locally
e2e tests:
VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_DEEP_GEMM=1 vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --enable-expert-parallel --data-parallel-size 2 --port 9010
VLLM_ALL2ALL_BACKEND="pplx" vllm serve deepseek-ai/DeepSeek-V2-Lite --data-parallel-size 2 --enable-expert-parallel --port 9020 --trust-remote-code
Test Result
(Optional) Documentation Update