Support complicated use cases with TiedLayerSpec #7208


Merged — 2 commits into deepspeedai:master, Apr 9, 2025

Conversation

@limjcst (Contributor) commented Apr 8, 2025

I want to reuse a composed module in a pipeline. For example, the following MyModule has a member linear, which is itself a module.

class MyModule(torch.nn.Module):
    def __init__(self, n_in: int, n_out: int):
        super().__init__()
        self.linear = torch.nn.Linear(n_in, n_out)
        self.layer_norm = torch.nn.LayerNorm(n_out)

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        hidden = self.linear(data)
        hidden = self.layer_norm(hidden)
        return hidden

MyModule.linear.weight should be synchronized among the related ranks, so I add linear.weight to TiedLayerSpec.tied_weight_attr.
I generate the whole tied_weight_attr list with the following expression.

tied_weight_attr = [name for name, p in layer.named_parameters() if p.numel() > 1]

However, the built-in getattr used by PipelineModule fails to resolve a nested attribute path like linear.weight.
Hence, this PR first extends the built-in getattr to a recursive version, PipelineModule._recursive_getattr, which resolves each dot-separated attribute segment in turn.
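The idea can be sketched outside DeepSpeed with functools.reduce (a minimal standalone sketch; the actual PipelineModule._recursive_getattr may differ in detail, and the plain classes below merely stand in for the nested torch modules):

```python
import functools

def recursive_getattr(obj, attr_path: str):
    """Resolve a dotted attribute path such as "linear.weight",
    walking one segment at a time: obj -> obj.linear -> obj.linear.weight."""
    return functools.reduce(getattr, attr_path.split("."), obj)

# Plain-object stand-ins for the nested module structure above.
class Linear:
    def __init__(self):
        self.weight = "W"

class MyModule:
    def __init__(self):
        self.linear = Linear()

m = MyModule()
print(recursive_getattr(m, "linear.weight"))  # prints W
# In contrast, getattr(m, "linear.weight") raises AttributeError,
# because "linear.weight" is treated as a single attribute name.
```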

Meanwhile, the order in which tied weights are synchronized matters: if ranks process them in different orders, their collective operations mismatch and training hangs. This PR therefore sorts tie_keys in PipelineModule._index_tied_modules to guarantee a deterministic order.
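The effect of sorting can be illustrated with plain Python (a sketch under the assumption that tie_keys comes from an unordered collection whose iteration order may differ between processes; the key names below are hypothetical):

```python
# Hypothetical tied-weight keys as two ranks might enumerate them.
# The contents are identical, but iteration order of an unordered
# collection is not guaranteed to match across processes.
rank0_keys = {"embed", "head", "linear.weight"}
rank1_keys = {"linear.weight", "embed", "head"}

# Sorting yields one deterministic order on every rank, so the
# synchronizing collectives for each tied weight line up.
order0 = sorted(rank0_keys)
order1 = sorted(rank1_keys)
print(order0)  # ['embed', 'head', 'linear.weight']
assert order0 == order1
```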

Extend the builtin `getattr` to a recursive version `PipelineModule._recursive_getattr`
for nested tied weights, e.g., "linear.weight".
Meanwhile, sort tie_keys in `PipelineModule._index_tied_modules` to avoid hanging.

Signed-off-by: Mingjie Li <[email protected]>
@limjcst limjcst requested review from loadams and tohtana as code owners April 8, 2025 06:33
@tohtana (Contributor) commented Apr 8, 2025

Thank you @limjcst for the contribution! This is a significant improvement.

@loadams loadams added this pull request to the merge queue Apr 8, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 8, 2025
@loadams loadams added this pull request to the merge queue Apr 9, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 9, 2025
@limjcst (Contributor, Author) commented Apr 9, 2025

nv-accelerate-v100 failed, raising "invalid command 'bdist_wheel'". However, this job succeeded in another run.

Note that the failed job used "cached wheel-0.46.1-py3-none-any.whl.metadata".
@agronholm suggested setuptools >= 70.1 for wheel==0.46.0. Is this suggestion meaningful for DeepSpeed?

pypa/wheel#660 (comment)

@loadams (Collaborator) commented Apr 9, 2025

> nv-accelerate-v100 failed, raising "invalid command 'bdist_wheel'". However, this job succeeded in another run.
>
> Note that the failed job used "cached wheel-0.46.1-py3-none-any.whl.metadata". @agronholm suggested setuptools >= 70.1 for wheel==0.46.0. Is this suggestion meaningful for DeepSpeed?
>
> pypa/wheel#660 (comment)

Thanks @limjcst - I saw this failure on another PR and will take a look and merge the fixes into your PR when ready.

@loadams loadams added this pull request to the merge queue Apr 9, 2025
@loadams (Collaborator) commented Apr 9, 2025


@limjcst - it looks like the wheel team has yanked the problematic wheel version, so tests should be passing again.

@agronholm commented
Nevertheless, upgrading to setuptools >= 70.1 should prevent any future issues. I noticed that this project lacks a standard pyproject.toml which would be able to specify build requirements.

@loadams (Collaborator) commented Apr 9, 2025

> Nevertheless, upgrading to setuptools >= 70.1 should prevent any future issues. I noticed that this project lacks a standard pyproject.toml which would be able to specify build requirements.

@agronholm - yes, we have a PR for one here which we will prioritize merging as we know this is needed.
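For reference, a minimal build-system table along the lines agronholm describes might look like the following (a sketch, not the contents of the DeepSpeed PR; the version floor reflects his suggestion that setuptools >= 70.1 avoids the bdist_wheel failure):

```toml
[build-system]
# setuptools >= 70.1 ships its own bdist_wheel command, so the build
# no longer depends on a separately installed wheel package.
requires = ["setuptools>=70.1"]
build-backend = "setuptools.build_meta"
```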

Merged via the queue into deepspeedai:master with commit 185330c Apr 9, 2025
10 checks passed