
@frost-intel (Contributor) commented on Sep 18, 2025:

See #2076; this is a cherry-pick for the 2.9 release.

This is a high-impact bug: in many distributed applications it causes a large memory leak that results in an OOM error (see #2084).

frost-intel changed the title from "[2.9] Revert tracking of Work status for FlightRecorder in ProcessGroupXCCL" to "[release/2.9] Revert tracking of Work status for FlightRecorder in ProcessGroupXCCL" on Sep 18, 2025.
@riverliuintel (Contributor) commented:

@CuiYifeng add this fix to the torch-xpu-ops cherry-pick fix list for PT 2.9.

@CuiYifeng (Contributor) commented:

> @CuiYifeng add this fix to the torch-xpu-ops cherry-pick fix list for PT 2.9.

Added.

@zhangxiaoli73 (Contributor) commented:

@frost-intel The issue happens for two reasons: 1) you capture "work" in your callback; 2) you use the value of the Future argument in your callback, so the output tensors are referenced again.
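
To make point 1) concrete, here is a minimal, self-contained sketch of how capturing a shared_ptr in a callback extends the object's lifetime. The Work struct below is a hypothetical stand-in, not the real ProcessGroupXCCL class; the assumption is only that the real Work keeps its output tensors alive for as long as it exists.

    #include <functional>
    #include <iostream>
    #include <memory>

    // Hypothetical stand-in for ProcessGroupXCCL's Work object; the point
    // is only that whatever Work owns stays alive as long as Work does.
    struct Work {
      ~Work() { std::cout << "Work (and anything it owns) released\n"; }
    };

    int main() {
      std::function<void()> callback;
      {
        auto work = std::make_shared<Work>();
        // Capturing the shared_ptr by value ties the Work's lifetime to
        // the callback's lifetime -- the leak pattern described above.
        callback = [work] { (void)work; /* pretend to read work's status */ };
      }  // "work" goes out of scope, but the Work object is still alive
      std::cout << "callback still holds the Work\n";
      callback = nullptr;  // only now is the Work destroyed
    }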

For 1), I think you don't need the whole work object, just its status. Since the work status is already updated before returning, there is no need to do this in the callback:

    setEnqueuedPgStatus(work);

For 2), PyTorch provides the "uses_future" option on addCallback to indicate whether the callback really needs the Future argument.

So my suggested change is just:

    auto id = work->trace_id_;
    work->future_->addCallback(
        [id](at::ivalue::Future&) {
          FlightRecorderXCCL::get()->retire_id(id, /*compute_duration=*/false);
        },
        /*uses_future=*/false);
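
For contrast, this is roughly the shape of the reverted pattern that leaks, assumed from the discussion above rather than quoted from the actual diff: capturing work itself, and taking the Future argument with the default uses_future=true, keeps the Work object (and potentially the output tensors) alive until the callback is dropped.

    // Assumed shape of the reverted callback; illustrative only.
    work->future_->addCallback(
        [work](at::ivalue::Future& fut) {
          // "work" is captured, so the Work object stays alive; with the
          // default uses_future=true, touching fut's value references the
          // output tensors again (see the explanation above).
          FlightRecorderXCCL::get()->retire_id(
              work->trace_id_, /*compute_duration=*/false);
        });  // uses_future defaults to true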

@frost-intel (Contributor, Author) commented:

@zhangxiaoli73 Used your suggested change.
@EikanWang This should be ready for review/merge.

riverliuintel merged commit 789f59d into release/2.9 on Sep 22, 2025 (24 checks passed).
riverliuintel deleted the frost/fr_memory_revert_xccl branch on September 22, 2025 at 01:44.