Only calls destroy_process_group if the trainer exits successfully #1342


Open · wants to merge 3 commits into main

Conversation

@fegin (Contributor) commented on Jun 26, 2025

If we call destroy_process_group() while some trainers have hit exceptions and others are still running collectives, the cleanup itself deadlocks.

py-spy stack trace:

Thread 0x7F81445A8440 (active): "MainThread"
    destroy_process_group (torch/distributed/distributed_c10d.py:2184)
    <module> (torchtitan/train.py:554)
    _run_code (runpy.py:86)
    _run_module_as_main (runpy.py:196)
Thread 0x7F7E83CFF640 (active): "Thread-1 (_read_thread)"
    _recv_msg (torch/_inductor/compile_worker/subproc_pool.py:61)
    _read_thread (torch/_inductor/compile_worker/subproc_pool.py:195)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0x7F7D9CFF9640 (idle): "Thread-2"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
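
A minimal repro sketch of the failure mode described above, assuming a two-rank NCCL job launched with torchrun; the script, the tensor, and the simulated error are illustrative only, not torchtitan code:

    import torch
    import torch.distributed as dist

    def main() -> None:
        dist.init_process_group("nccl")
        rank = dist.get_rank()
        torch.cuda.set_device(rank)
        try:
            if rank == 1:
                # hypothetical failure on one rank before it reaches the collective
                raise RuntimeError("simulated trainer failure")
            # the surviving rank blocks here waiting for rank 1's matching all_reduce
            dist.all_reduce(torch.ones(1, device="cuda"))
        finally:
            # cleanup in finally: runs on the failing rank while the other rank is
            # still inside the collective; this is the destroy_process_group hang
            # captured in the py-spy trace above
            dist.destroy_process_group()

    if __name__ == "__main__":
        main()  # e.g. torchrun --nproc_per_node=2 repro.py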

@fegin requested review from tianyu-l and wwwjn as code owners on June 26, 2025 02:17
@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jun 26, 2025
+        raise
+    else:
+        trainer.close()
+        torch.distributed.destroy_process_group()
Contributor

Is destroy_process_group() causing the hang, or trainer.close()?

Because ideally calling destroy_process_group() itself would not hang; if it does, that seems like another bug we should look into.

Contributor Author

destroy_process_group(). I attached the py-spy result in the summary.

Contributor

Hmm. Cc @kwen2501: is that supposed to happen?

Contributor

There is no guarantee that destroy_process_group will never hang, whatever the situation.
From the NCCL docs on ncclCommDestroy:

> This function is an intra-node collective call, which all ranks on the same node should call to avoid a hang.

Contributor

torchrun should crash the other ranks when one rank crashes. It seems it failed to do so here.

Contributor

> eventually timeout

This is the right behavior upon collective mismatch.

@kwen2501 (Contributor) commented on Jun 27, 2025

My question is: why does the user program swallow the exception and not re-throw it? Does it believe it is recoverable? In this case it does not seem so.

Contributor

OK, then it sounds like we are actually using destroy_process_group wrong in the torchtitan scripts. We should only call it on the clean exit path.

Contributor

The docs here look correct to me, but I think we could add an example of the recommended way to do this kind of exception handling on exit:
https://docs.pytorch.org/docs/stable/distributed.html#shutdown

I'll make a PR tomorrow.
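
For reference, a sketch of the shutdown pattern under discussion, destroying the process group only on the clean exit path; the Trainer stub and file names here are placeholders for illustration, not torchtitan's actual entry point:

    import torch
    import torch.distributed as dist

    class Trainer:
        """Stand-in trainer; only the lifecycle matters for this sketch."""

        def __init__(self) -> None:
            dist.init_process_group("gloo")  # gloo just to keep the sketch CPU-only

        def train(self) -> None:
            dist.all_reduce(torch.ones(1))  # placeholder training step

        def close(self) -> None:
            pass  # release checkpointer / file handles, etc.

    def main() -> None:
        trainer = None
        try:
            trainer = Trainer()
            trainer.train()
        except Exception:
            if trainer:
                trainer.close()
            # re-raise and let torchrun tear down the remaining ranks; skipping
            # destroy_process_group() here avoids blocking behind ranks that are
            # still inside a collective
            raise
        else:
            # clean exit path: every rank reached this point, so collectively
            # shutting down the process group is safe
            trainer.close()
            dist.destroy_process_group()

    if __name__ == "__main__":
        main()  # e.g. torchrun --nproc_per_node=2 clean_exit.py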

Contributor Author

> My question is: why does the user program swallow the exception and not re-throw it? Does it believe it is recoverable? In this case it does not seem so.

Good question, @kwen2501. I had the same impression as @wconstab: I thought destroy_process_group was a purely local call. That's why I wrapped it in finally:, and it didn't go wrong until this week when we were debugging CP issues.

This PR should be the right way to call destroy_process_group().

@fegin changed the title from "Only perform cleanup if the trainer exits successfully" to "Only calls destroy_process_group if the trainer exits successfully" on Jun 26, 2025
@wconstab (Contributor) left a comment

Stamp to unblock, but we should prioritize debugging the hang in destroy, fix that, and then revert this, IMO.
This PR looks like the best practice; we were doing the wrong thing before.

Comment on lines -228 to +235
-    finally:
+    except Exception:
         if trainer:
             trainer.close()
-
-        if torch.distributed.is_initialized():
-            torch.distributed.destroy_process_group()
-            logger.info("Process group destroyed.")
+        raise
+    else:
+        trainer.close()
+        torch.distributed.destroy_process_group()
+        logger.info("Process group destroyed.")
Contributor

What is the reason for wrapping almost the entire program in try/except?
It seems trainer.close() just closes a file?

    def close(self) -> None:
        if self.checkpointer:
            self.checkpointer.close()

May I say that Python will automatically close a file even if the program ends due to an unhandled exception?

Here is what will happen upon program exit (exception or not):

  1. CPython dereferences all objects, and all objects have their destructors called, even if the program ends due to an unhandled exception.
  2. When the reference count hits zero, no Python code can reach the object anymore, so the object gets deallocated. And when it gets deallocated, Python calls the __del__() destructor.
  3. Python’s __del__() method for files flushes the buffers and closes the file from the operating system’s point of view.
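
A small illustration of that point (the file name is made up): even when the script dies with an unhandled exception, CPython's shutdown deallocates the file object and its destructor flushes and closes it, so the write still reaches disk.

    # Let the RuntimeError go unhandled, then inspect example.txt: the buffered
    # "hello" is flushed and the file is closed at interpreter shutdown.
    f = open("example.txt", "w")
    f.write("hello")
    raise RuntimeError("unhandled on purpose")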
