
@frost-intel (Contributor) commented on Sep 18, 2025:

See #2076; this is a cherry-pick for the 2.9 release.

This is a high-impact bug: in many distributed applications it causes a large memory leak that results in an OOM error (see #2084).

frost-intel changed the title from "[2.9] Revert tracking of Work status for FlightRecorder in ProcessGroupXCCL" to "[release/2.9] Revert tracking of Work status for FlightRecorder in ProcessGroupXCCL" on Sep 18, 2025.
@riverliuintel (Contributor) commented:

@CuiYifeng add this fix to the torch-xpu-ops cherry-pick fix list for PT 2.9.

@CuiYifeng (Contributor) commented:

> @CuiYifeng add this fix to the torch-xpu-ops cherry-pick fix list for PT 2.9.

Added.

@zhangxiaoli73 (Contributor) commented:

@frost-intel The issue happens for two reasons: 1) you capture "work" in your callback; 2) you use the value of the Future argument in your callback, so the output tensors are referenced again.
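
To make point 1) concrete, here is a minimal, self-contained sketch of how capturing a shared_ptr in a callback extends the object's lifetime. The Work struct below is a hypothetical stand-in, not the real ProcessGroupXCCL class; the assumption is only that the real Work keeps its output tensors alive for as long as it exists.

    #include <functional>
    #include <iostream>
    #include <memory>

    // Hypothetical stand-in for ProcessGroupXCCL's Work object; the point
    // is only that whatever Work owns stays alive as long as Work does.
    struct Work {
      ~Work() { std::cout << "Work (and anything it owns) released\n"; }
    };

    int main() {
      std::function<void()> callback;
      {
        auto work = std::make_shared<Work>();
        // Capturing the shared_ptr by value ties the Work's lifetime to
        // the callback's lifetime -- the leak pattern described above.
        callback = [work] { (void)work; /* pretend to read work's status */ };
      }  // "work" goes out of scope, but the Work object is still alive
      std::cout << "callback still holds the Work\n";
      callback = nullptr;  // only now is the Work destroyed
    }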

For 1), I think you don't need the whole work object, just its status. Since the work status is already updated before returning, there is no need to do this in the callback:

    setEnqueuedPgStatus(work);

For 2), PyTorch provides the "uses_future" option on addCallback to indicate whether the callback really needs the Future argument.

So my suggested change is just:

    auto id = work->trace_id_;
    work->future_->addCallback(
        [id](at::ivalue::Future&) {
          FlightRecorderXCCL::get()->retire_id(id, /*compute_duration=*/false);
        },
        /*uses_future=*/false);
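
For contrast, this is roughly the shape of the reverted pattern that leaks, assumed from the discussion above rather than quoted from the actual diff: capturing work itself, and taking the Future argument with the default uses_future=true, keeps the Work object (and potentially the output tensors) alive until the callback is dropped.

    // Assumed shape of the reverted callback; illustrative only.
    work->future_->addCallback(
        [work](at::ivalue::Future& fut) {
          // "work" is captured, so the Work object stays alive; with the
          // default uses_future=true, touching fut's value references the
          // output tensors again (see the explanation above).
          FlightRecorderXCCL::get()->retire_id(
              work->trace_id_, /*compute_duration=*/false);
        });  // uses_future defaults to true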

@frost-intel (Contributor, Author) commented:

@zhangxiaoli73 Used your suggested change.
@EikanWang This should be ready for review/merge.

riverliuintel merged commit 789f59d into release/2.9 on Sep 22, 2025 (24 checks passed).
riverliuintel deleted the frost/fr_memory_revert_xccl branch on September 22, 2025 at 01:44.