[c10d] Working async version of AllGather, test fix and compiler warnings, and CI #10932
Conversation
Looking at the code, was the problem that the test case wasn't allocating enough memory for the output tensors?
@pietern The biggest issue was that each big (flattened) tensor needs to be on a different GPU; having them on the same device was what caused the error.
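The fix described above, placing each rank's flattened output tensor on a distinct GPU, can be sketched in plain Python. This is an illustrative sketch only: the function and names below are hypothetical and not part of the actual c10d test code.

```python
# Illustrative sketch (hypothetical helper, not the real c10d code):
# when gathering across N ranks, each rank's flattened output buffer
# must live on its own GPU; sharing a device caused the test failure.

def assign_output_devices(world_size, num_gpus):
    """Map each rank's flattened output tensor to its own GPU."""
    if world_size > num_gpus:
        raise ValueError("need at least one GPU per flattened output tensor")
    # One distinct device index per rank, never shared.
    return {rank: rank % num_gpus for rank in range(world_size)}

# Example: 2 ranks, 2 GPUs -> outputs on device 0 and device 1.
devices = assign_output_devices(world_size=2, num_gpus=2)
assert len(set(devices.values())) == 2  # one distinct GPU per output
```

With two ranks and two GPUs this yields `{0: 0, 1: 1}`, so no two flattened outputs ever share a device.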
teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…nd CI (pytorch#10932)

Summary: The previous NCCL allgather didn't work as expected. This is a fully working async version, tested on both the C++ and Python frontends.

Multi-node:

```
tengli@learnfair042:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ TMPFILE="/private/home/tengli/temp/tengli-test" RANK=0 WORLD_SIZE=2 ./ProcessGroupNCCLTest
Multi-node world size: 2 rank: 0
Allreduce test successful
Broadcast test successful
Reduce test successful
Allgather test successful
tengli@learnfair117:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ TMPFILE="/private/home/tengli/temp/tengli-test" RANK=1 WORLD_SIZE=2 ./ProcessGroupNCCLTest
Multi-node world size: 2 rank: 1
Allreduce test successful
Broadcast test successful
Reduce test successful
Allgather test successful
```

CI test:

```
test_set_get (__main__.FileStoreTest) ... ok
test_set_get (__main__.PrefixFileStoreTest) ... ok
test_set_get (__main__.PrefixTCPStoreTest) ... ok
test_allreduce_ops (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_ops (__main__.ProcessGroupGlooTest) ... ok
test_allgather_ops (__main__.ProcessGroupNCCLTest) ... ok
test_allreduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_broadcast_ops (__main__.ProcessGroupNCCLTest) ... ok
test_reduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_common_errors (__main__.RendezvousFileTest) ... ok
test_nominal (__main__.RendezvousFileTest) ... ok
test_common_errors (__main__.RendezvousTCPTest) ... ok
test_nominal (__main__.RendezvousTCPTest) ... ok
test_unknown_handler (__main__.RendezvousTest) ... ok
test_set_get (__main__.TCPStoreTest) ... ok
```

Pull Request resolved: pytorch#10932
Differential Revision: D9542067
Pulled By: teng-li
fbshipit-source-id: 25513eddcc3119fd736875d69dfb631b10f4ac86
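The "fully working async version" follows the work-handle pattern used by c10d collectives: the call returns immediately and `wait()` blocks until the operation completes. Below is a toy sketch of that pattern in plain Python, using a thread in place of NCCL streams; the class and function names here are illustrative, not the real c10d API surface.

```python
# Toy sketch of the async work-handle pattern (hypothetical names,
# not the actual c10d classes): the collective call returns a Work
# object right away, and wait() blocks until completion.
import threading

class Work:
    def __init__(self, fn):
        self._done = threading.Event()
        self._thread = threading.Thread(target=self._run, args=(fn,))
        self._thread.start()

    def _run(self, fn):
        fn()
        self._done.set()

    def wait(self):
        # Block until the background operation has finished.
        self._thread.join()
        return self._done.is_set()

def allgather_async(outputs, value):
    """Toy allgather: copy this rank's value into every output slot."""
    def do_gather():
        for i in range(len(outputs)):
            outputs[i] = value
    return Work(do_gather)

outputs = [None, None]
work = allgather_async(outputs, 42)  # returns immediately
assert work.wait()                   # blocks until the gather finishes
assert outputs == [42, 42]
```

The real implementation enqueues NCCL kernels on CUDA streams rather than spawning threads, but the caller-facing contract is the same: issue the collective, keep computing, and call `wait()` before reading the output tensors.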