
[Model Averaging] Add a unit test that launches hierarchical SGD by PostLocalSGDOptimizer #74668


Closed
wants to merge 3 commits into from


wayi1
Contributor

@wayi1 wayi1 commented Mar 24, 2022

As title.

The added unit test requires 4 GPUs. Please add ciflow/all to enable this test.

Proposal: #73382
Parent proposal: #71325
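
To illustrate what "hierarchical SGD" means for the 4-GPU test described above, here is a minimal single-process sketch of hierarchical model averaging. It is a simulation under stated assumptions, not the PyTorch implementation: each worker is reduced to one float, and the function name `hierarchical_average` and the period-to-group-size mapping are hypothetical, loosely modeled on the idea of averaging within small groups often and across all workers less often.

```python
from collections import OrderedDict


def hierarchical_average(params, step, period_group_size):
    """Average `params` (one float per simulated worker) in place.

    `period_group_size` maps an averaging period to a group size.
    The largest period that divides `step` wins, so at most one
    averaging level runs per step. Returns the group size used,
    or None when no averaging happens at this step.
    """
    for period in sorted(period_group_size, reverse=True):
        if step % period == 0:
            size = period_group_size[period]
            # Average each contiguous block of `size` workers.
            for start in range(0, len(params), size):
                group = params[start:start + size]
                mean = sum(group) / len(group)
                params[start:start + size] = [mean] * len(group)
            return size
    return None


# Hypothetical schedule: average pairs every 2 steps, all 4 workers every 4 steps.
schedule = OrderedDict([(2, 2), (4, 4)])
workers = [1.0, 3.0, 5.0, 7.0]

hierarchical_average(workers, step=2, period_group_size=schedule)
# workers is now [2.0, 2.0, 6.0, 6.0]: averaged within each pair

hierarchical_average(workers, step=4, period_group_size=schedule)
# workers is now [4.0, 4.0, 4.0, 4.0]: averaged across all four
```

With four workers and two levels, this is the smallest setup in which the two averaging levels differ, which is consistent with the test needing 4 GPUs.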

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 24, 2022

💊 CI failures summary and remediations

As of commit 4fe68da (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


@facebook-github-bot added the "oncall: distributed" label Mar 24, 2022
@wayi1
Contributor Author

wayi1 commented Mar 25, 2022

The failure on Windows seems to be unrelated to this PR.

@facebook-github-bot
Contributor

@mrshenli has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@wayi1 wayi1 requested a review from awgu as a code owner March 29, 2022 06:02
@wayi1 wayi1 requested a review from mrshenli March 29, 2022 06:04
@facebook-github-bot
Contributor

@mrshenli has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Mar 30, 2022
…ostLocalSGDOptimizer (#74668)

Summary:
As title.

The added unit test requires 4 GPUs. Please add `ciflow/all` to enable this test.

Proposal: #73382

Pull Request resolved: #74668

Reviewed By: albanD

Differential Revision: D35173938

Pulled By: mrshenli

fbshipit-source-id: b6d61822bfa12c793050af96a8baa4fc92f6b120
@github-actions
Contributor

Hey @wayi1.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@zengk95
Contributor

zengk95 commented Mar 31, 2022

@pytorchbot revert this

@pytorchmergebot
Collaborator

Reverting PR 74668 failed due to Can't revert PR that was landed via phabricator as D35173938
Raised by https://github.com/pytorch/pytorch/actions/runs/2068592927

@wayi1
Contributor Author

wayi1 commented Apr 1, 2022

@zengk95 No need to revert PR #74668. Just adding @skip_if_rocm to these tests should work. These tests require 4 processes, but even with 4 ROCm GPUs, the test environment somehow provided only 3 processes.

Curious why these failures were not detected when the PR was submitted. The PR was already labeled "ci/master" and "ci/all". cc: @mrshenli @rohan-varma
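
For readers unfamiliar with the skip-decorator approach suggested above, here is a rough stdlib-only sketch of how a test can be skipped when the environment provides fewer workers than it needs. This is not PyTorch's `skip_if_rocm`; the decorator `skip_if_fewer_than`, the stand-in `fake_worker_count`, and the test class are all hypothetical names used only for illustration.

```python
import functools
import unittest


def skip_if_fewer_than(required, available):
    """Hypothetical helper: skip a test unless `available()` reports
    at least `required` workers."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            found = available()
            if found < required:
                # unittest records this as a skip, not a failure.
                raise unittest.SkipTest(
                    f"needs {required} workers, found {found}"
                )
            return test_fn(*args, **kwargs)
        return wrapper
    return decorator


def fake_worker_count():
    # Stand-in for a real environment query; returns 3 to mimic the
    # runner described above that provided only 3 processes.
    return 3


class HierarchicalSGDTest(unittest.TestCase):
    @skip_if_fewer_than(4, fake_worker_count)
    def test_needs_four_workers(self):
        self.fail("should not run with only 3 workers")


def run_suite():
    """Run the test case and return the unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(HierarchicalSGDTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Calling `run_suite()` should report the test as skipped rather than failed, which is the behavior wanted on a runner that cannot supply all 4 processes.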

@pytorchmergebot
Collaborator

Reverting PR 74668 failed due to Comment @zengk95 No need to revert PR #74668. Just add @skip_if_rocm to these tests should work.
Curious why these failure were not detected when the PR was submitted. The PR has already labeled "ci/master" and "ci/all". cc: @mrshenli @rohan-varma does not seem to be a valid revert command
Raised by https://github.com/pytorch/pytorch/actions/runs/2075308562

@zengk95
Contributor

zengk95 commented Apr 1, 2022

@wayi1 Oh hmm. I was looking at https://hud.pytorch.org/pytorch/pytorch/pull/74668?sha=3491f4c36f63b174a768354af4f1edb8f66f4d38, which had some failures (my mistake). I don't think it got reverted anyway.

As for why they didn't run, I think they did run, since the PR had trunk workflows.

Labels
cla signed, oncall: distributed, open source
7 participants