[release/2.9] Revert tracking of Work status for FlightRecorder in ProcessGroupXCCL #2077
@CuiYifeng Please add this fix to the torch-xpu-ops cherry-pick fix list for PT2.9.
Added.
@frost-intel The issue arises because: 1) you use work in your callback, and 2) you use the value from the Future argument in your callback, so the output tensor is referenced again. For 1), I think you don't need the whole work object, only some of its status. The work status is already updated before returning, so you don't need to put it in the callback. torch-xpu-ops/src/xccl/ProcessGroupXCCL.cpp Line 784 in 77cc792
For 2), PyTorch provides the option "use_future" to decide whether you really need the Future argument in the callback. So my suggested change is only:
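To illustrate point 1), here is a minimal sketch of how a callback that captures the whole work object can keep an output tensor alive. It does not use the real c10d or torch-xpu-ops types: `Tensor`, `Future`, `Work`, `leaky_capture`, `status_only_capture`, and `check_sketch` are hypothetical stand-ins, and the sketch assumes the leak comes from a shared-pointer reference cycle (work owns future, future owns callback, callback owns work). It does not model point 2) or the "use_future" option.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-ins for illustration only; not the real PyTorch classes.
struct Tensor {
  std::vector<float> data;
};

struct Future {
  std::shared_ptr<Tensor> value;                 // result held by the future
  std::vector<std::function<void()>> callbacks;  // stored, never run in this sketch
};

struct Work {
  bool completed = false;
  std::shared_ptr<Future> future = std::make_shared<Future>();
};

// Leaky pattern: the callback owns the Work, and the Work owns the Future
// that stores the callback. The cycle keeps the tensor alive forever.
std::weak_ptr<Tensor> leaky_capture() {
  auto out = std::make_shared<Tensor>();
  out->data.resize(1024);
  std::weak_ptr<Tensor> probe = out;

  auto work = std::make_shared<Work>();
  work->future->value = out;
  work->completed = true;
  work->future->callbacks.push_back([work] { (void)work->completed; });
  return probe;  // local shared_ptrs die here, but the cycle remains
}

// Suggested pattern: copy only the status the callback needs.
std::weak_ptr<Tensor> status_only_capture() {
  auto out = std::make_shared<Tensor>();
  out->data.resize(1024);
  std::weak_ptr<Tensor> probe = out;

  auto work = std::make_shared<Work>();
  work->future->value = out;
  work->completed = true;
  const bool done = work->completed;  // status is already final at this point
  work->future->callbacks.push_back([done] { (void)done; });
  return probe;  // work, future, and tensor are all released here
}

// Checks that the first pattern retains the tensor and the second frees it.
void check_sketch() {
  assert(!leaky_capture().expired());
  assert(status_only_capture().expired());
}
```

Calling `check_sketch()` from a `main` function exercises both patterns. In the leaky variant the tensor outlives every handle the caller holds, which is the kind of retention that accumulates across many collective calls.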
@zhangxiaoli73 I applied your suggested change.
See #2076; this is a cherry-pick for the 2.9 release.
This is a high-impact bug: in many distributed applications it causes a large memory leak that results in OOM errors (see #2084).