Only calls destroy_process_group if the trainer exits successfully #1342
@@ -551,10 +551,11 @@ def close(self) -> None:
             logger.info("Created seed checkpoint")
         else:
             trainer.train()
-    finally:
+    except Exception:
         if trainer:
             trainer.close()
-
-        if torch.distributed.is_initialized():
-            torch.distributed.destroy_process_group()
-            logger.info("Process group destroyed.")
+        raise
+    else:
+        trainer.close()
+        torch.distributed.destroy_process_group()
Is destroy_process_group() causing the hang, or trainer.close()? Ideally, calling destroy_process_group() itself would not hang; if it does, that seems like another bug we should look into.

Hmm. Cc @kwen2501, is that supposed to happen?

There is no guarantee that "destroy_process_group would not hang in whatever situation".

This is the right behavior upon collective mismatch.

My question is: why does the user program mute the exception and not re-throw? Does it believe it is recoverable? In this case it does not seem so.

ok, then it sounds like we are actually using

The docs here look correct to me, but I think we could add an example of how to do this kind of exception handling on exit the recommended way. I'll make a PR tomorrow.

Good question, @kwen2501. I had the same impression as @wconstab. I thought This PR should be the right way to call
+        logger.info("Process group destroyed.")
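The control flow in the diff above can be sketched in isolation. This is a minimal illustration, not the project's code: `FakeTrainer` and the `destroy_process_group` callable passed to `run` are hypothetical stand-ins, so the sketch runs without `torch` or a real process group. It assumes the intent is to always close the trainer, but to tear down the process group only when training finished without an exception.

```python
import logging

logger = logging.getLogger(__name__)


class FakeTrainer:
    """Hypothetical stand-in for the real trainer; it only records calls."""

    def __init__(self, fail):
        self.fail = fail
        self.closed = False

    def train(self):
        if self.fail:
            raise RuntimeError("simulated training failure")

    def close(self):
        self.closed = True


def run(trainer, destroy_process_group):
    """Always close the trainer; tear down the process group only on success."""
    try:
        trainer.train()
    except Exception:
        # Failure path: release local resources, skip the process-group
        # teardown (the thread reports it can hang here), then re-raise
        # so the error is not muted.
        trainer.close()
        raise
    else:
        # Success path: training returned normally on this rank.
        trainer.close()
        destroy_process_group()
        logger.info("Process group destroyed.")
```

In real use the second argument would presumably be `torch.distributed.destroy_process_group`; passing it in as a callable here is only a device to keep the sketch self-contained.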
What is the reason for wrapping almost the entire program in try-except?
It seems trainer.close() just closes a file? May I say that Python will automatically close a file even if the program ends due to an unhandled exception?
Here is what will happen upon program exit (exception or not): the __del__() destructor runs, and the __del__() method for files flushes the buffers and closes the file from the operating system’s point of view.
We need a close() to ensure that child processes exit and finish correctly; Checkpointer currently relies on this. Because of the GIL issue, there may be background processes running, not necessarily just Checkpointer's. A proper close() is required to ensure that no data is lost.
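A rough sketch of why an explicit close() matters when work happens in the background. `BackgroundWriter` is a hypothetical class, not the project's Checkpointer, and it uses a thread rather than a child process purely to stay self-contained. The point it illustrates is the one made above: close() signals the worker to stop and then waits for it, so queued work is finished before the program moves on.

```python
import queue
import threading


class BackgroundWriter:
    """Hypothetical component that does its writes on a background worker."""

    _STOP = object()  # sentinel telling the worker to exit

    def __init__(self):
        self.written = []
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            # Stands in for a slow write, such as saving a checkpoint.
            self.written.append(item)

    def submit(self, item):
        self._queue.put(item)

    def close(self):
        # The sentinel is queued behind all pending items, so the worker
        # drains them first; join() then waits for the worker to exit.
        self._queue.put(self._STOP)
        self._worker.join()
```

Without the call to close(), a daemon worker like this one could be killed at interpreter exit with items still queued, which is the kind of data loss the comment above is concerned with.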