Conversation

@kvshbg-aws kvshbg-aws commented Aug 19, 2025

This PR implements pipeline parallelism support for XLA devices with cross-host metadata communication capabilities. It has been tested on NEURON devices.

Key points:

  • Adds cross-host metadata communication for XLA devices, used by pipeline parallelism training
  • Tested with the pipelining test suite (test_basic_pipelining.py)
  • Adds an XLA pipeline stage coordinator module (xla_pipeline_stage_coordinator.py) that handles cross-host communication and is used during shape inference in the torch pipeline parallelism code

Corresponding PR on pytorch - pytorch/pytorch#161017
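To make the third bullet concrete, here is a minimal sketch of the kind of metadata packing a cross-host stage coordinator needs during shape inference: before real tensors flow between pipeline stages on different hosts, each stage must tell its peer the shape and dtype of its outputs, typically by serializing that metadata into a fixed-format payload that a collective or point-to-point primitive can carry. The encoding scheme and all names below are illustrative assumptions, not the actual implementation in this PR.

```python
# Hypothetical sketch (not the PR's code): encode tensor metadata as a
# flat integer list so it can be exchanged between hosts with a simple
# fixed-format send/recv, then decoded by the receiving pipeline stage.

DTYPE_CODES = {"float32": 0, "bfloat16": 1, "int64": 2}
CODE_DTYPES = {v: k for k, v in DTYPE_CODES.items()}


def pack_tensor_meta(shape, dtype):
    """Encode (shape, dtype) as [dtype_code, rank, dim0, dim1, ...]."""
    return [DTYPE_CODES[dtype], len(shape), *shape]


def unpack_tensor_meta(payload):
    """Inverse of pack_tensor_meta: recover (shape, dtype)."""
    dtype = CODE_DTYPES[payload[0]]
    rank = payload[1]
    shape = tuple(payload[2:2 + rank])
    return shape, dtype


if __name__ == "__main__":
    # A stage that outputs a bf16 activation of shape (8, 128, 512)
    # would send this payload to the next stage on another host.
    meta = pack_tensor_meta((8, 128, 512), "bfloat16")
    print(meta)                      # [1, 3, 8, 128, 512]
    print(unpack_tensor_meta(meta))  # ((8, 128, 512), 'bfloat16')
```

In a real run, the payload would travel over the device mesh (e.g. via a broadcast or send/recv between stage ranks) rather than stay in-process; the point is only that shape inference requires an explicit, serializable metadata exchange before activations themselves can be communicated.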

@qihqi qihqi requested review from bfolie and bhavya01 August 21, 2025 18:43
@bfolie bfolie requested a review from pgmoka August 25, 2025 17:57
@rpsilva-aws rpsilva-aws self-requested a review August 25, 2025 19:28

bfolie commented Aug 26, 2025

The approach seems good. However, I don't think this should be merged until the corresponding PyTorch PR has been merged and these tests can be activated.


@pgmoka pgmoka left a comment

LGTM pending the follow-up from #9570 (comment).
