Closed
Description
Question: what's the longest a distributed operation should reasonably take?
How long would it take to "all-gather" a large amount of memory (like 80 GB)?
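A rough back-of-the-envelope estimate may help here. The sketch below assumes a ring all-gather, where each rank receives the other ranks' shards, so the transfer time is about `(N - 1) / N * total_bytes / bandwidth`. The function name, the 8-rank world size, and the 25 GB/s per-link bandwidth are illustrative assumptions, not measurements from torchrunx or any particular cluster. It also ignores latency, stragglers, and contention.

```python
def estimate_ring_all_gather_seconds(
    total_bytes: float, world_size: int, bandwidth_bytes_per_s: float
) -> float:
    """Bandwidth-only estimate for a ring all-gather (hypothetical helper).

    total_bytes is the size of the fully gathered result on each rank.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    # Each rank already holds 1/N of the data and must receive the rest.
    fraction_received = (world_size - 1) / world_size
    return fraction_received * total_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Assumed numbers: 80 GB gathered result, 8 ranks, 25 GB/s per link.
    seconds = estimate_ring_all_gather_seconds(80e9, 8, 25e9)
    print(f"estimated all-gather time: {seconds:.1f} s")
```

Under these assumptions the transfer itself takes seconds rather than minutes. Slower interconnects or skew between ranks could increase that considerably.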
Let's set a smaller default timeout, maybe 180 seconds?
We can then add an argument to override it.
torchrunx/src/torchrunx/agent.py, lines 83 to 85 in cd1a895:

    dist.init_process_group(
        backend=backend, world_size=worker_args.world_size, rank=worker_args.rank, store=store
    )
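One way the proposal could look is sketched below. `torch.distributed.init_process_group` accepts a `timeout` argument as a `datetime.timedelta`. The names `DEFAULT_TIMEOUT_SECONDS`, `resolve_timeout`, `init_group`, and the `timeout_seconds` parameter are hypothetical and are not part of torchrunx. The 180-second default is just the value floated above.

```python
from datetime import timedelta
from typing import Optional

# Hypothetical default, taken from the 180 seconds suggested in this issue.
DEFAULT_TIMEOUT_SECONDS = 180


def resolve_timeout(timeout_seconds: Optional[float] = None) -> timedelta:
    """Return the user's override if given, otherwise the default."""
    seconds = DEFAULT_TIMEOUT_SECONDS if timeout_seconds is None else timeout_seconds
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return timedelta(seconds=seconds)


def init_group(backend, world_size, rank, store, timeout_seconds=None):
    """Sketch of the call from agent.py with an overridable timeout."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend=backend,
        world_size=world_size,
        rank=rank,
        store=store,
        timeout=resolve_timeout(timeout_seconds),
    )
```

A caller expecting unusually long operations could then pass, for example, `timeout_seconds=1800`, while everyone else gets the shorter default.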