
Feature distributed fused adam #184


Merged
4 commits merged into master from feature_distributed_fused_adam on Mar 21, 2025

Conversation

amd-sriram
Collaborator

Ported distributed_fused_adam from NVIDIA/apex.

Ref: https://ontrack-internal.amd.com/browse/SWDEV-519796

Updated feature of distributed fused adam from upstream. Updated its dependencies - fused adam, distributed adam. Updated the unit test case for distributed fused adam.
Raise Exception when nccl user buffer / cuda graph is used in distributed fused adam. Skipped these particular UTs
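
For orientation, the ported optimizer is used like any other torch optimizer. Below is a minimal sketch, assuming the usual apex.contrib import path from upstream NVIDIA/apex and a process group launched one process per GPU (the backend name, model, and hyperparameters are placeholders, not part of this PR):

```python
import torch
import torch.distributed as dist
from apex.contrib.optimizers.distributed_fused_adam import DistributedFusedAdam

# Placeholder setup: one process per GPU, launched e.g. via torchrun.
# On ROCm the "nccl" backend maps to RCCL.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = DistributedFusedAdam(model.parameters(), lr=1e-3)

# Ordinary training step; the optimizer shards its state across ranks and
# handles gradient synchronization internally.
out = model(torch.randn(8, 1024, device="cuda"))
out.sum().backward()
optimizer.step()
optimizer.zero_grad()
```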
@amd-sriram self-assigned this Mar 19, 2025
amd-sriram added a commit to ROCm/pytorch that referenced this pull request Mar 19, 2025
Altering the flag to use the correct streamType for CUDAPluggableAllocator. This is impacting Distributed Fused Adam in Rocm/APEX. See PR ROCm/apex#184
amd-sriram added a commit to ROCm/pytorch that referenced this pull request Mar 20, 2025
Altering the flag to use the correct streamType for CUDAPluggableAllocator. This is impacting Distributed Fused Adam in Rocm/APEX.

See PR ROCm/apex#184
@pruthvistony

@amd-sriram ,
Assuming it is tested and no breakages.

@pruthvistony

Please cherry-pick this into the release/1.6.0 branch.

@pruthvistony merged commit 6fd8b50 into master Mar 21, 2025
@pruthvistony deleted the feature_distributed_fused_adam branch March 21, 2025 21:29
@amd-sriram
Collaborator Author

@amd-sriram , Assuming it is tested and no breakages.

@pruthvistony: Yes, it was tested in ROCm. I also checked against upstream NVIDIA/apex to confirm that the test outcomes in ROCm/apex match.

There is one PR in PyTorch that impacts one UT and the nccl_ub feature in distributed_fused_adam.

ROCm/pytorch#1984

Kindly review this as well.

amd-sriram added a commit that referenced this pull request Mar 24, 2025
* Updated feature of distributed fused adam from upstream. Updated its dependencies - fused adam, distributed adam. Updated the unit test case for distributed fused adam.

* Raise Exception when nccl user buffer / cuda graph is used in distributed fused adam. Skipped these particular UTs

* Adding support for rccl_ub in distributed_fused_adam

* build nccl_allocator module when cuda_ext flag is mentioned
pruthvistony pushed a commit that referenced this pull request Mar 24, 2025
* Feature distributed fused adam (#184)

* Updated feature of distributed fused adam from upstream. Updated its dependencies - fused adam, distributed adam. Updated the unit test case for distributed fused adam.

* Raise Exception when nccl user buffer / cuda graph is used in distributed fused adam. Skipped these particular UTs

* Adding support for rccl_ub in distributed_fused_adam

* build nccl_allocator module when cuda_ext flag is mentioned

* Ported distributed fused lamb from upstream repo. Add support for parameters - fused_norm, full_ar, set_param_views_to_flat_buffer, skip_allgather, fuse_scale, param_order, Nccl_allgather_channels (#185)

* for distributed fused adam, add condition to remove nccl_allocator only if it is mentioned explicitly
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Mar 24, 2025
Altering the flag to use the correct streamType for
CUDAPluggableAllocator. This is impacting Distributed Fused Adam in
Rocm/APEX.

See PR ROCm/apex#184

Related Issue : https://ontrack-internal.amd.com/browse/SWDEV-519796
amd-sriram added a commit that referenced this pull request Mar 25, 2025
* Updated feature of distributed fused adam from upstream. Updated its dependencies - fused adam, distributed adam. Updated the unit test case for distributed fused adam.

* Raise Exception when nccl user buffer / cuda graph is used in distributed fused adam. Skipped these particular UTs

* Adding support for rccl_ub in distributed_fused_adam

* build nccl_allocator module when cuda_ext flag is mentioned
amd-sriram added a commit to amd-sriram/pytorch that referenced this pull request Mar 26, 2025

Altering the flag to use the correct streamType for
CUDAPluggableAllocator. This is impacting Distributed Fused Adam in
Rocm/APEX.

See PR ROCm/apex#184

Related Issue : https://ontrack-internal.amd.com/browse/SWDEV-519796
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Apr 1, 2025
Altering the flag to use the correct streamType in CUDAPluggableAllocator class for ROCm gpu. The flag TORCH_HIP_VERSION does not work for ROCm as intended. This flag is replaced with USE_ROCM. This is impacting Distributed Fused Adam in Rocm/APEX when using nccl_ub feature. This has been tested with rocm/apex.

See PR ROCm/apex#184

Pull Request resolved: #150010
Approved by: https://github.com/jeffdaily
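
For context, CUDAPluggableAllocator is the mechanism PyTorch exposes for swapping in a custom device allocator from Python, and the commit messages above indicate the nccl_allocator / nccl_ub path in distributed_fused_adam depends on it. A minimal sketch of that public entry point follows; the shared-library path and exported symbol names are placeholders, not apex's actual nccl_allocator code:

```python
import torch

# Hypothetical allocator library: the .so path and the exported C symbol
# names below are placeholders for illustration only.
allocator = torch.cuda.memory.CUDAPluggableAllocator(
    "./libcustom_alloc.so",  # compiled allocator library (placeholder)
    "custom_malloc",         # exported allocation function (placeholder)
    "custom_free",           # exported free function (placeholder)
)

# The custom allocator must be installed before any device memory is
# handed out by the caching allocator.
torch.cuda.memory.change_current_allocator(allocator)
```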
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
Altering the flag to use the correct streamType in CUDAPluggableAllocator class for ROCm gpu. The flag TORCH_HIP_VERSION does not work for ROCm as intended. This flag is replaced with USE_ROCM. This is impacting Distributed Fused Adam in Rocm/APEX when using nccl_ub feature. This has been tested with rocm/apex.

See PR ROCm/apex#184

Pull Request resolved: pytorch#150010
Approved by: https://github.com/jeffdaily
amd-sriram added a commit to amd-sriram/pytorch that referenced this pull request Apr 25, 2025

Altering the flag to use the correct streamType for
CUDAPluggableAllocator. This is impacting Distributed Fused Adam in
Rocm/APEX.

See PR ROCm/apex#184

Related Issue : https://ontrack-internal.amd.com/browse/SWDEV-519796
pytorchbot pushed a commit to pytorch/pytorch that referenced this pull request May 20, 2025
Altering the flag to use the correct streamType in CUDAPluggableAllocator class for ROCm gpu. The flag TORCH_HIP_VERSION does not work for ROCm as intended. This flag is replaced with USE_ROCM. This is impacting Distributed Fused Adam in Rocm/APEX when using nccl_ub feature. This has been tested with rocm/apex.

See PR ROCm/apex#184

Pull Request resolved: #150010
Approved by: https://github.com/jeffdaily

(cherry picked from commit a19b667)
atalman pushed a commit to pytorch/pytorch that referenced this pull request May 22, 2025
[ROCm] Update CUDAPluggableAllocator.h (#1984) (#150010)

Altering the flag to use the correct streamType in CUDAPluggableAllocator class for ROCm gpu. The flag TORCH_HIP_VERSION does not work for ROCm as intended. This flag is replaced with USE_ROCM. This is impacting Distributed Fused Adam in Rocm/APEX when using nccl_ub feature. This has been tested with rocm/apex.

See PR ROCm/apex#184

Pull Request resolved: #150010
Approved by: https://github.com/jeffdaily

(cherry picked from commit a19b667)

Co-authored-by: Sriram Kumar <[email protected]>