
Conversation

sanandaraj5597 (Contributor)

This PR adds support for gradient fusion for MCore FSDP.

Selvaraj Anandaraj and others added 2 commits September 20, 2025 22:31
@timmoon10 (Collaborator) left a comment


I don't think this makes sense. If you configure a TE module with fuse_wgrad_accumulation=True (e.g. here), the correct behavior is to fuse wgrad accumulation. If Mcore FSDP doesn't support it, then it should be Mcore's responsibility to not set that arg.
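
For reference, a minimal sketch (not from this PR) of what that configuration implies, assuming a CUDA device and a Megatron-style main_grad buffer owned by the caller:

    import torch
    import transformer_engine.pytorch as te

    # Caller opts in to fused wgrad accumulation when building the module.
    layer = te.Linear(1024, 1024, fuse_wgrad_accumulation=True)

    # The caller (e.g. Mcore DDP) owns the persistent FP32 main_grad buffer.
    layer.weight.main_grad = torch.zeros_like(layer.weight, dtype=torch.float32)

    x = torch.randn(8, 1024, device="cuda", requires_grad=True)
    layer(x).sum().backward()
    # The wgrad GEMM accumulates directly into layer.weight.main_grad
    # instead of populating layer.weight.grad.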

@timmoon10 (Collaborator)

The root problem is that Mcore DDP and FSDP have different behaviors and require different contracts with TE:

  • DDP uses persistent main_grad buffers and expects TE to accumulate into them. To adhere to this contract, Mcore zeroes out main_grad before the first microbatch step.
  • FSDP uses temporary main_grad buffers and expects TE to overwrite them (sketched below).
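
A minimal sketch of the two contracts (illustrative code, not actual TE or Mcore internals):

    import torch

    def write_wgrad(weight: torch.nn.Parameter, wgrad: torch.Tensor, contract: str) -> None:
        """Illustrative only: where the wgrad lands under each contract."""
        if contract == "ddp":
            # Persistent buffer, zeroed by Mcore before the first microbatch;
            # TE accumulates so gradients sum across microbatches.
            weight.main_grad += wgrad
        else:  # "fsdp"
            # Temporary buffer handed out for this backward pass; TE overwrites it.
            weight.main_grad.copy_(wgrad)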

I don't like this PR's approach of switching between these two cases based on whether Mcore is using DDP or FSDP, since the DDP/FSDP distinction is not what actually matters; the accumulate-versus-overwrite behavior is. It also needlessly blocks some possible optimizations (DDP might want to overwrite main_grads in the first microbatch, FSDP might want to accumulate into main_grads if a weight is shared).

There are a few possible redesigns:

  1. Deprecate the fuse_wgrad_accumulation kwarg in favor of something like output_wgrad_to_main_grad. Then check param flags to decide whether to overwrite or accumulate into the main_grad:
    # Decide where the wgrad GEMM writes and whether it accumulates.
    grad_weight: torch.Tensor
    accumulate: bool = False
    if output_wgrad_to_main_grad:
        # Write directly into the param's main_grad buffer.
        if getattr(weight, "get_main_grad", None) is not None:
            grad_weight = weight.get_main_grad()
        else:
            grad_weight = weight.main_grad
        # Accumulate unless the param is flagged to be overwritten
        # (defaulting to accumulate preserves the current fused behavior).
        accumulate = not getattr(weight, "_overwrite_main_grad", False)
    else:
        grad_weight = torch.empty(...)

    gemm(..., out=grad_weight, accumulate=accumulate)

Ensuring backward compatibility will be tricky.

  2. Have separate kwargs for fuse_wgrad_accumulation and overwrite_wgrad_main_grad. This means that the two cases are separate code paths and backward compatibility is easier to maintain. However, it also means we can't change behavior between steps.
  3. Keep the fuse_wgrad_accumulation kwarg and purely control behavior with param flags (a sketch of how Mcore could set such flags follows this list). This is basically the approach used in this PR, although it could be improved by using better names rather than just checking weight.__fsdp_param__. One problem is that fuse_wgrad_accumulation will no longer be an accurate name.
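
To make the param-flag approach concrete (options 1 and 3), here is a hypothetical sketch of how Mcore could tag its params; `_overwrite_main_grad` is the illustrative flag from the snippet above, and `ddp_module`, `fsdp_module`, and `get_temporary_grad_buffer` are placeholders, not existing Mcore APIs:

    import torch

    # Mcore DDP: persistent buffers, zeroed once per optimizer step; TE accumulates.
    for param in ddp_module.parameters():
        param.main_grad = torch.zeros_like(param, dtype=torch.float32)
        param._overwrite_main_grad = False

    # Mcore FSDP: temporary buffers reissued each backward pass; TE overwrites.
    for param in fsdp_module.parameters():
        param.main_grad = get_temporary_grad_buffer(param)  # placeholder helper
        param._overwrite_main_grad = True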
