Open
Description
My understanding is that there are always collective operations between replication groups to allreduce the gradients (whenever any form of DDP or HSDP is used). If one node in a replication group fails, all the other groups will hang until they time out, because their allreduce can never complete. Since the lighthouse already knows that the node has failed, should the pending allreduce be aborted instead of waiting for the timeout?
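
To make the question concrete, here is a minimal, self-contained sketch of the behavior I have in mind. It is a pure-Python simulation, not the library's API: `wait_for_allreduce`, `simulate`, and `AbortedError` are hypothetical names, the `done` event stands in for a collective that never completes because a peer died, and the `abort` event stands in for a failure notification from the lighthouse.

```python
import threading
import time


class AbortedError(RuntimeError):
    """Raised when a pending collective is cancelled before it completes."""


def wait_for_allreduce(done: threading.Event, abort: threading.Event,
                       timeout_s: float, poll_s: float = 0.01) -> float:
    """Block until the collective finishes, is aborted, or times out.

    Returns the number of seconds waited if the collective completed.
    """
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if done.is_set():
            return elapsed
        if abort.is_set():
            # Hypothetical fast path: a failure notice cancels the wait early.
            raise AbortedError(f"allreduce aborted after {elapsed:.2f}s")
        if elapsed >= timeout_s:
            raise TimeoutError(f"allreduce timed out after {elapsed:.2f}s")
        time.sleep(poll_s)


def simulate(notify_failure: bool, timeout_s: float = 1.0) -> str:
    """Simulate a healthy group waiting on an allreduce with a dead peer."""
    done = threading.Event()   # never set: the failed node never contributes
    abort = threading.Event()  # set when the (simulated) lighthouse reports failure
    if notify_failure:
        # Assumption: the failure notice arrives well before the timeout.
        threading.Timer(0.05, abort.set).start()
    try:
        wait_for_allreduce(done, abort, timeout_s)
        return "completed"
    except AbortedError:
        return "aborted"
    except TimeoutError:
        return "timeout"
```

Without the notification, `simulate(False)` only returns after the full timeout has elapsed; with it, `simulate(True)` returns as soon as the failure notice arrives. The question is whether the real allreduce could take the second path.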