
Add test c10d ucc tests #88110


Closed
Fuzzkatt wants to merge 39 commits

Conversation

Fuzzkatt
Collaborator

@Fuzzkatt Fuzzkatt commented Oct 31, 2022

Creates the UCC equivalent of the c10d tests in https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_gloo.py and https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_nccl.py. Uses test_c10d_gloo.py as the reference and adds all the common ops. A more detailed comparison of the available ops is here: https://docs.google.com/document/d/1yPsa_X9EiEiqo-j2Yn7ierhccBtEjwoqC-B7-amI0MI/edit?usp=sharing

Also removes an extra line for the barrier blocking wait in ProcessGroupUCC.cpp that was duplicated when merging #85047.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu

@pytorch-bot

pytorch-bot bot commented Oct 31, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88110

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit e86c068:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Oct 31, 2022
@ezyang ezyang added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 31, 2022
Collaborator

@zasdfgbnm zasdfgbnm left a comment

LGTM

Contributor

@kit1980 kit1980 left a comment

All existing #38095 issues were fixed some time ago and the issue was closed, please don't re-introduce.

@Fuzzkatt
Collaborator Author

Fuzzkatt commented Apr 3, 2023

> All existing #38095 issues were fixed some time ago and the issue was closed, please don't re-introduce.

Sorry about that. The comments referencing #38095 were outdated (this PR is a couple of months old), so I removed them. There are no actual issues relating to #38095.

@zasdfgbnm zasdfgbnm requested a review from kit1980 April 3, 2023 22:11
@Fuzzkatt
Collaborator Author

Fuzzkatt commented Apr 5, 2023

Ready for review

@zasdfgbnm
Collaborator

@kit1980 Do you still have changes to request, or are you OK with the current state?

@Fuzzkatt
Collaborator Author

Fuzzkatt commented Apr 5, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 5, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@Fuzzkatt
Collaborator Author

Fuzzkatt commented Apr 6, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


pytorchmergebot pushed a commit that referenced this pull request Apr 10, 2023
…not available (#98576)

After the recent change in #88110 to add a new c10d test for the UCC backend, the test started failing on the ROCm distributed job. I guess ROCm doesn't support that backend yet, so I went ahead and disabled the test there. Please let me know if ROCm support is coming, and I will close this PR accordingly. For now it is failing in ROCm trunk with `AssertionError: Unknown c10d backend type UCC`, for example https://hud.pytorch.org/pytorch/pytorch/commit/4adba70cc6fa273f210a94a82b337bbddffc3c1d

Pull Request resolved: #98576
Approved by: https://github.com/Fuzzkatt, https://github.com/jithunnair-amd, https://github.com/malfet, https://github.com/ZainRizvi
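
For readers unfamiliar with the pattern, the kind of guard described above (skipping a test when a backend is not available) might look roughly like the sketch below. It is not the code from #98576. The helper `ucc_available`, the class `UccSmokeTest`, and `run_suite` are hypothetical names, and `torch.distributed.is_ucc_available` is looked up defensively on the assumption that it exists in recent builds; without torch installed the helper simply reports the backend as unavailable.

```python
import unittest


def ucc_available() -> bool:
    """Best-effort check: True only if torch imports and reports UCC support."""
    try:
        import torch.distributed as dist
    except ImportError:
        return False
    # Assumption: recent builds expose is_ucc_available; fall back to False otherwise.
    checker = getattr(dist, "is_ucc_available", None)
    return bool(dist.is_available() and checker is not None and checker())


class UccSmokeTest(unittest.TestCase):  # hypothetical test class
    @unittest.skipUnless(ucc_available(), "c10d was not built with the UCC backend")
    def test_backend_is_present(self):
        # Only runs on builds where the check above succeeded.
        self.assertTrue(ucc_available())


def run_suite() -> unittest.TestResult:
    """Run the hypothetical test class and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UccSmokeTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

On a build without UCC (such as the ROCm job described above), the test is reported as skipped rather than failing with an unknown-backend assertion.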
Fuzzkatt added a commit to Fuzzkatt/pytorch that referenced this pull request Apr 20, 2023
pytorchmergebot pushed a commit that referenced this pull request Apr 28, 2023
…8110 (#99654)

* Adds an extra test_allgather_base in UccProcessGroupWithDispatchedCollectivesTests; the rest of the nccl and gloo tests there don't work on ucc
* Adds cpu tests for [op]_work_wait_gpu tests
* Adds a single tensor input test for allgather_basics; multi tensor input still doesn't seem to be supported by ucc
Pull Request resolved: #99654
Approved by: https://github.com/kwen2501
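
As a rough illustration of what a single-tensor all-gather is expected to produce, here is a plain-Python simulation. It makes no torch.distributed calls and is not taken from the tests in #99654. The function name `simulate_allgather_base` is hypothetical, and the sketch assumes the usual semantics: every rank contributes the same number of elements, and the output is their concatenation in rank order.

```python
from typing import List, Sequence


def simulate_allgather_base(inputs_per_rank: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate equal-length per-rank inputs, in rank order, into one flat output."""
    if not inputs_per_rank:
        raise ValueError("need at least one rank")
    chunk = len(inputs_per_rank[0])
    if any(len(data) != chunk for data in inputs_per_rank):
        raise ValueError("every rank must contribute the same number of elements")
    output = [0.0] * (chunk * len(inputs_per_rank))
    for rank, data in enumerate(inputs_per_rank):
        # Rank r owns the slice [r * chunk, (r + 1) * chunk) of the output.
        output[rank * chunk:(rank + 1) * chunk] = data
    return output
```

For example, with three simulated ranks each contributing two copies of their rank number, the flat result lists rank 0's values first, then rank 1's, then rank 2's.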
Labels
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
open source
release notes: distributed (c10d) (release notes category)
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)

10 participants