Conversation

@kvshbg-aws kvshbg-aws commented Aug 19, 2025

This PR implements pipeline parallelism support for XLA devices with cross-host metadata communication capabilities. It has been tested on NEURON devices.

Key points:

  • Adds cross-host metadata communication for XLA devices, used by pipeline parallelism training
  • Tested with the pipelining test suite (test_basic_pipelining.py)
  • Adds an XLA pipeline stage coordinator module (xla_pipeline_stage_coordinator.py) that handles cross-host communication and is used during shape inference in the torch pipeline parallelism code

Corresponding PR on pytorch - pytorch/pytorch#161017
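To make the third bullet concrete, here is a minimal sketch of the kind of metadata packing a cross-host stage coordinator needs during shape inference: before real tensors flow between pipeline stages on different hosts, each stage must tell its peer the shape and dtype of its outputs, typically by serializing that metadata into a fixed-format payload that a collective or point-to-point primitive can carry. The encoding scheme and all names below are illustrative assumptions, not the actual implementation in this PR.

```python
# Hypothetical sketch (not the PR's code): encode tensor metadata as a
# flat integer list so it can be exchanged between hosts with a simple
# fixed-format send/recv, then decoded by the receiving pipeline stage.

DTYPE_CODES = {"float32": 0, "bfloat16": 1, "int64": 2}
CODE_DTYPES = {v: k for k, v in DTYPE_CODES.items()}


def pack_tensor_meta(shape, dtype):
    """Encode (shape, dtype) as [dtype_code, rank, dim0, dim1, ...]."""
    return [DTYPE_CODES[dtype], len(shape), *shape]


def unpack_tensor_meta(payload):
    """Inverse of pack_tensor_meta: recover (shape, dtype)."""
    dtype = CODE_DTYPES[payload[0]]
    rank = payload[1]
    shape = tuple(payload[2:2 + rank])
    return shape, dtype


if __name__ == "__main__":
    # A stage that outputs a bf16 activation of shape (8, 128, 512)
    # would send this payload to the next stage on another host.
    meta = pack_tensor_meta((8, 128, 512), "bfloat16")
    print(meta)                      # [1, 3, 8, 128, 512]
    print(unpack_tensor_meta(meta))  # ((8, 128, 512), 'bfloat16')
```

In a real run, the payload would travel over the device mesh (e.g. via a broadcast or send/recv between stage ranks) rather than stay in-process; the point is only that shape inference requires an explicit, serializable metadata exchange before activations themselves can be communicated.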

@qihqi qihqi requested review from bfolie and bhavya01 August 21, 2025 18:43
@bfolie bfolie requested a review from pgmoka August 25, 2025 17:57
@rpsilva-aws rpsilva-aws self-requested a review August 25, 2025 19:28

bfolie commented Aug 26, 2025

The approach seems good. However, I don't think this should be merged until the corresponding PyTorch PR has been merged and these tests can be activated.


@pgmoka pgmoka left a comment

LGTM pending the follow-up from #9570 (comment).
