[lighthouse] fast failure on missing heartbeat instead of timeout #164

Open
@rualark

Description

My understanding is that there are always some collective operations between replication groups to allreduce the gradients (whenever any form of DDP or HSDP is used). If a node fails in one replication group, all the other groups will time out because their allreduce can never complete. Since the lighthouse already knows that the node failed (its heartbeat is missing), should the allreduce be aborted right away instead of waiting for the timeout?
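To make the request concrete, here is a minimal, standard-library-only sketch of the behavior I have in mind. It is not torchft or lighthouse code: `HeartbeatTracker`, `CollectiveAborted`, and `wait_for_allreduce` are hypothetical names, and the allreduce is simulated with a `threading.Event` that never fires when a peer has died. The point is only that the waiter polls liveness and fails as soon as a heartbeat goes stale, rather than sleeping until the much longer collective timeout.

```python
import threading
import time
from typing import Dict, List, Optional


class HeartbeatTracker:
    """Hypothetical stand-in for the lighthouse's view of replica liveness."""

    def __init__(self, heartbeat_timeout: float) -> None:
        self.heartbeat_timeout = heartbeat_timeout
        self._last_seen: Dict[str, float] = {}
        self._lock = threading.Lock()

    def beat(self, replica_id: str, now: Optional[float] = None) -> None:
        # Record the latest heartbeat time for this replica.
        with self._lock:
            self._last_seen[replica_id] = time.monotonic() if now is None else now

    def dead_replicas(self, now: Optional[float] = None) -> List[str]:
        # A replica counts as dead once its last heartbeat is older than the timeout.
        now = time.monotonic() if now is None else now
        with self._lock:
            return sorted(
                rid
                for rid, seen in self._last_seen.items()
                if now - seen > self.heartbeat_timeout
            )


class CollectiveAborted(RuntimeError):
    """Raised when a pending collective is abandoned because a peer is known dead."""


def wait_for_allreduce(
    done: threading.Event,
    tracker: HeartbeatTracker,
    collective_timeout: float,
    poll_interval: float = 0.01,
) -> None:
    """Wait for a (simulated) allreduce, failing fast on a missing heartbeat."""
    deadline = time.monotonic() + collective_timeout
    while time.monotonic() < deadline:
        if done.wait(poll_interval):
            return  # the collective finished normally
        dead = tracker.dead_replicas()
        if dead:
            # Fail fast: do not wait for the full collective timeout.
            raise CollectiveAborted(f"missing heartbeat from {dead}")
    raise TimeoutError("allreduce timed out")
```

With a heartbeat timeout of 50 ms and a collective timeout of 5 s, a waiter whose peers stop sending heartbeats gets `CollectiveAborted` after roughly the heartbeat timeout instead of blocking for the full 5 s. In a real setup the abort would presumably have to tear down the underlying communicator as well, which this sketch does not attempt to model.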

Labels: enhancement (New feature or request), lighthouse (Lighthouse and quorum related)
