[BugFix][torch.compile] Fix fused_scaled_matmul_reduce_scatter signature for PyTorch 2.8 #26038
Conversation
Code Review
This pull request addresses a breaking change in the torch.ops.symm_mem.fused_scaled_matmul_reduce_scatter function signature introduced in PyTorch 2.8. The changes correctly update the function calls in ScaledMMReduceScatterPattern and CutlassScaledMMReduceScatterPattern to use the new positional arguments, ensuring compatibility with the latest PyTorch version. The implementation is a direct and accurate adaptation of the new API. The changes are correct and necessary. I have no further comments.
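For readers following along, here is a minimal sketch of the call-shape change described above. The parameter names (orig_scatter_dim, scatter_dim_after_maybe_reshape, output_shape, bias, result_scale, out_dtype, use_fast_accum) come from this PR; the exact argument order, the "avg" reduce op, and the wrapper function itself are illustrative assumptions rather than the authoritative vLLM patch, so check torch/distributed/_symmetric_memory/__init__.py in PyTorch 2.8 for the definitive signature.

```python
import torch


def scaled_mm_reduce_scatter_pt28(
    input: torch.Tensor,   # (M, K) fp8 activations
    mat2: torch.Tensor,    # (K, N) fp8 weights
    scale_a: torch.Tensor,
    scale_b: torch.Tensor,
    scatter_dim: int,
    group_name: str,
    out_dtype: torch.dtype,
) -> torch.Tensor:
    # PyTorch 2.7.1-era callers passed the trailing optional parameters as
    # keyword arguments; PyTorch 2.8 requires the extra positional
    # parameters listed in this PR.
    output_shape = [*input.shape[:-1], mat2.shape[1]]
    return torch.ops.symm_mem.fused_scaled_matmul_reduce_scatter(
        input,
        mat2,
        scale_a,
        scale_b,
        "avg",        # reduce op (assumed)
        scatter_dim,  # orig_scatter_dim
        scatter_dim,  # scatter_dim_after_maybe_reshape (assumed equal here)
        group_name,
        output_shape,
        None,         # bias
        None,         # result_scale
        out_dtype,    # out_dtype
        False,        # use_fast_accum
    )
```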
@cascade812 can you also review?
Can we add the test file to PyTorch Compilation Unit Tests in test-pipeline.yml?
For example, in this CI run https://buildkite.com/vllm/ci/builds/32809/steps/canvas?jid=019995c2-21ec-4c60-ba64-5052e2ada32f it outputs:
Not sure why some of these say PASSED despite having the output from the skip:
cc: @louiswang524 re: #24393
@jasonlizhengjian thanks, LGTM
Regarding the skipped test, it needs to be run on multiple GPUs. Can we move it to the Distributed Tests (2 GPUs) section? Alternatively, a new PyTorch Compilation Unit Tests (with 2 GPUs) section would also work.
@jasonlizhengjian Thank you for taking care of this, LGTM!
Yeah, please move the test to distributed tests for now; we can rearrange later.
Updated torch.ops.symm_mem.fused_scaled_matmul_reduce_scatter calls to match the new PyTorch API signature. The function signature changed from PyTorch 2.7.1 to require additional positional parameters.

Changes:
- Added orig_scatter_dim and scatter_dim_after_maybe_reshape as positional parameters
- Added output_shape calculation: [*input.shape[:-1], mat2.shape[1]]
- Changed all optional parameters (bias, result_scale, out_dtype, use_fast_accum) from keyword arguments to positional arguments to match PyTorch's torch._inductor implementation

References:
- PyTorch function definition: torch/distributed/_symmetric_memory/__init__.py:454-461
- PyTorch test usage: test/distributed/test_symmetric_memory.py:579-590
- PyTorch inductor usage: torch/_inductor/fx_passes/micro_pipeline_tp.py:816-834

Signed-off-by: jasonlizhengjian <[email protected]>
Moved compile/test_async_tp.py from Compilation Tests to the Distributed Tests (2 GPUs) section as it requires 2 GPUs to run (@multi_gpu_test decorator). Also added tests/compile/test_async_tp.py to source_file_dependencies.

Signed-off-by: jasonlizhengjian <[email protected]>
Signed-off-by: Jason Li <[email protected]>
Signed-off-by: jasonlizhengjian <[email protected]>
Moved compile/test_sequence_parallelism.py from Compilation Tests to the Distributed Tests (2 GPUs) section as it requires 2 GPUs to run (@multi_gpu_test decorator). Also added tests/compile/test_sequence_parallelism.py to source_file_dependencies.

Signed-off-by: jasonlizhengjian <[email protected]>
The test_async_tp.py uses PyTorch's symmetric memory operations (torch.ops.symm_mem.fused_matmul_reduce_scatter and fused_all_gather_matmul) which require Hopper architecture (H100/H200) or newer GPUs with multicast and fabric support. L4 GPUs (Ada Lovelace, compute capability 8.9) lack these features, causing "invalid device ordinal" errors during symmetric memory rendezvous when the test runs on the default L4 runners.

Changes:
- Removed test_async_tp.py from the "Distributed Tests (2 GPUs)" section, which runs on L4 GPUs
- Added test_async_tp.py to the "Distributed Tests (H200)" section, which runs on H200 GPUs with full symmetric memory support
- test_sequence_parallelism.py remains on L4 as it uses standard NCCL collectives that don't require symmetric memory

Signed-off-by: jasonlizhengjian <[email protected]>
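As a rough illustration of the hardware constraint above (Hopper is compute capability 9.0, while L4/Ada Lovelace is 8.9), a capability guard like the following could skip the symmetric-memory tests on unsupported GPUs. This is a hypothetical sketch, not part of the PR; the actual fix here is to move the tests to the H200 runners in test-pipeline.yml.

```python
import pytest
import torch


def supports_symm_mem_fused_ops() -> bool:
    """True only on GPUs that can run the symm_mem fused kernels.

    The fused_all_gather_matmul / fused_matmul_reduce_scatter paths need
    multicast and fabric support, available on Hopper (SM 9.0) and newer.
    """
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (9, 0)


# Hypothetical marker: L4 (SM 8.9) would be skipped, H100/H200 (SM 9.0) would run.
requires_hopper_symm_mem = pytest.mark.skipif(
    not supports_symm_mem_fused_ops(),
    reason="symm_mem fused ops require Hopper (H100/H200) or newer",
)
```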
@ProExpertProg the Async TP test was failing in distributed tests: https://buildkite.com/vllm/ci/builds/33859/steps/canvas?sid=0199bf91-e585-47ed-adff-0300604a3c12#0199bf91-e674-4667-b280-e1c472cebec7/165-4049
Both test_async_tp.py and test_sequence_parallelism.py use PyTorch's symmetric memory operations which require Hopper architecture (H100/H200) or newer GPUs with multicast and fabric support. L4 GPUs (Ada Lovelace, compute capability 8.9) lack these features, causing "invalid device ordinal" errors during symmetric memory rendezvous when the tests run on the default L4 runners.

Changes:
- Removed test_async_tp.py from the "Distributed Tests (2 GPUs)" section, which runs on L4 GPUs
- Removed test_sequence_parallelism.py from the "Distributed Tests (2 GPUs)" section, which runs on L4 GPUs
- Added both tests to the "Distributed Tests (H200)" section, which runs on H200 GPUs with full symmetric memory support

Signed-off-by: jasonlizhengjian <[email protected]>
@ProExpertProg I had to move the seq parallel test to the H200 section as well due to the fp8 parts.
Signed-off-by: Luka Govedič <[email protected]>
I reordered tests because CP is failing - let's see that the async TP and seq par tests pass ✅
Seems like the async TP and seq par tests are passing now.
Purpose
Fixes #24376 part 2.d
Updated torch.ops.symm_mem.fused_scaled_matmul_reduce_scatter calls to match the new PyTorch API signature. The function signature changed from PyTorch 2.7.1 to require additional positional parameters.
Changes:
- Added orig_scatter_dim and scatter_dim_after_maybe_reshape as positional parameters
- Added output_shape calculation: [*input.shape[:-1], mat2.shape[1]]
- Changed all optional parameters (bias, result_scale, out_dtype, use_fast_accum) from keyword arguments to positional arguments to match PyTorch's torch._inductor implementation

References:
- PyTorch function definition: torch/distributed/_symmetric_memory/__init__.py:454-461
- PyTorch test usage: test/distributed/test_symmetric_memory.py:579-590
- PyTorch inductor usage: torch/_inductor/fx_passes/micro_pipeline_tp.py:816-834
Test Plan
tests/compile/test_async_tp.py
Test Result
pytest -s tests/compile/test_async_tp.py on 2 x B200