Only calls destroy_process_group if the trainer exits successfully #1342
base: main
Conversation
```python
        raise
    else:
        trainer.close()
        torch.distributed.destroy_process_group()
```
Is destroy_process_group() causing the hang, or trainer.close()?
Because ideally calling destroy_process_group() itself would not hang; if it does, that seems like another bug we should look into.
destroy_process_group(); I attached the py-spy result in the summary.
Hmm. Cc @kwen2501 is that supposed to happen?
There is no guarantee that destroy_process_group will never hang. From the NCCL documentation for ncclCommDestroy:
> This function is an intra-node collective call, which all ranks on the same node should call to avoid a hang.
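To make the failure mode concrete, here is a minimal sketch of the mismatch being discussed (not taken from the PR; the file name, the 2-rank single-node setup, and the simulated failure are assumptions): one rank raises before a collective while the other still has it in flight, and both ranks then call destroy_process_group from a finally block.

```python
# repro_sketch.py -- hypothetical repro, run with:
#   torchrun --nproc_per_node=2 repro_sketch.py
# Only an illustration of the mismatch being discussed; exact behavior depends
# on the PyTorch/NCCL versions and NCCL async error handling settings.
import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)  # single-node sketch: global rank == local rank
    try:
        if rank == 0:
            # rank 0 fails before ever issuing the collective below
            raise RuntimeError("simulated failure on rank 0")
        # rank 1 issues the all_reduce; the NCCL kernel waits for a peer
        # that never arrives
        t = torch.ones(1, device="cuda")
        dist.all_reduce(t)
    finally:
        # unconditional teardown, i.e. the pattern this PR removes: with an
        # unfinished collective in flight and teardown being collective-like,
        # the hang can show up here rather than as a clean error
        dist.destroy_process_group()


if __name__ == "__main__":
    main()
```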
torchrun should crash the other ranks in case one rank crashed. It seems it failed to do so here.
> eventually timeout

This is the right behavior upon collective mismatch.
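For reference, a hedged sketch of where that timeout comes from: init_process_group accepts a timeout argument that bounds how long mismatched collectives wait before the job is torn down. The 10-minute value below is only illustrative, not torchtitan's actual configuration.

```python
from datetime import timedelta

import torch.distributed as dist

# With the NCCL backend, the watchdog uses this timeout to abort collectives
# whose peers never arrive, so a mismatch eventually surfaces as an error
# instead of hanging forever (assuming async error handling is enabled, which
# recent PyTorch releases typically do by default).
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```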
My q is, why does the user program mute the exception and not re-throw? Does it believe it is recoverable? In this case it does not seem so?
OK, then it sounds like we are actually using destroy_process_group wrong in the torchtitan scripts. We should only call it if we are on the clean exit path.
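For concreteness, a minimal sketch of that clean-exit-path pattern, mirroring the structure this PR adopts. The Trainer class below is a stand-in rather than torchtitan's real trainer, and gloo is used only so the sketch runs without GPUs.

```python
import torch.distributed as dist


class Trainer:
    """Stand-in for the real trainer; not torchtitan's actual class."""

    def train(self) -> None:
        ...  # training loop would go here

    def close(self) -> None:
        ...  # release checkpointer / file handles here


def main() -> None:
    dist.init_process_group(backend="gloo")  # gloo so this sketch runs CPU-only
    trainer = None
    try:
        trainer = Trainer()
        trainer.train()
    except Exception:
        # Failure path: release local resources only, then re-raise so the
        # launcher (torchrun) sees a non-zero exit and tears down other ranks.
        if trainer:
            trainer.close()
        raise
    else:
        # Clean exit path: every rank reached this point, so the
        # collective-like teardown in destroy_process_group is safe.
        trainer.close()
        dist.destroy_process_group()


if __name__ == "__main__":
    main()
```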
The docs here look correct to me, but I think we could add an example of how to do this kind of exception handling on exit the recommended way: https://docs.pytorch.org/docs/stable/distributed.html#shutdown
I'll make a PR tomorrow.
> My q is, why does the user program mute the exception and not re-throw? Does it believe it is recoverable? In this case it does not seem so?

Good question, @kwen2501. I had the same impression as @wconstab: I thought destroy_process_group was a purely local call. That's why I wrapped it in finally:, and it didn't go wrong until this week when we were debugging CP issues. This PR should be the right way to call destroy_process_group().
Stamp to unblock, but we should prioritize debugging the hang in destroy, fix that, and then revert this, IMO.
This PR looks like the best practice; we were doing the wrong thing before.
```diff
-    finally:
+    except Exception:
         if trainer:
             trainer.close()
 
-        if torch.distributed.is_initialized():
-            torch.distributed.destroy_process_group()
-            logger.info("Process group destroyed.")
+        raise
+    else:
+        trainer.close()
+        torch.distributed.destroy_process_group()
+        logger.info("Process group destroyed.")
```
What is the reason for wrapping almost the entire program in try-except? It seems trainer.close() just closes a file?
```python
def close(self) -> None:
    if self.checkpointer:
        self.checkpointer.close()
```
May I point out that Python will automatically close a file even if the program ends due to an unhandled exception?
Here is what will happen upon program exit (exception or not):
- CPython dereferences all objects, and all objects have their destructors called, even if the program ends due to an unhandled exception.
- When the reference count hits zero, no Python code can reach the object anymore, so the object gets deallocated. When it gets deallocated, Python calls its __del__() destructor.
- Python's __del__() method for file objects flushes the buffers and closes the file from the operating system's point of view.
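A tiny sketch of that claim (CPython-specific behavior; the language spec does not guarantee when destructors run, and the file name is arbitrary):

```python
# The write below goes into a buffer, and the exception is never handled.
f = open("demo_output.txt", "w")
f.write("some buffered data")
raise RuntimeError("unhandled exception")
# The process prints a traceback and exits, yet on CPython the file object is
# finalized during interpreter shutdown: its buffer is flushed and the file
# descriptor is closed, so "some buffered data" still ends up on disk.
```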
If we call destroy_process_group() while some trainers have hit exceptions and others are still performing collectives, the cleanup itself will cause a deadlock.
stacktrace: