Add a test that when groupcoordinator dies, the consumer will pick up the new coordinator #1134

Open
@jeffwidman

I just had a failure case reported at work where a service endlessly spun:

2017-06-29 08:50:11,407 WARNING         base                    __call__:661    10627   139637316585216 Coordinator unknown during heartbeat -- will retry
2017-06-29 08:50:11,407 WARNING         base                    _handle_heartbeat_failure:692   10627   139637316585216 Heartbeat failed ([Error 15] GroupCoordinatorNotAvailableError); retrying
2017-06-29 08:50:11,508 WARNING         base                    __call__:661    10627   139637316585216 Coordinator unknown during heartbeat -- will retry
2017-06-29 08:50:11,508 WARNING         base                    _handle_heartbeat_failure:692   10627   139637316585216 Heartbeat failed ([Error 15] GroupCoordinatorNotAvailableError); retrying

Normally this indicates a cluster failure. However, the ticket description suggests the cluster became healthy again, yet the consumer never recovered and kept logging these warnings for half an hour. Restarting the process immediately fixed the issue.

I wasn't directly involved; I was called in as the Kafka expert after the fact, so it will likely be impossible to verify that the cluster was fully healthy.

However, we should have an end-to-end test of this scenario that brings up a cluster and consumer group with two processes, kills the broker that is the group coordinator, and verifies that the consumers rejoin successfully once the cluster moves the coordinator to a new broker.
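To make the shape of such a test concrete, here is a rough sketch of the control flow only. It does not use the real kafka-python API or the project's test fixtures: `check_coordinator_failover`, `wait_until`, and the `cluster`/`consumer` methods (`coordinator_for`, `kill_broker`, `poll`, `is_member`) are hypothetical names, and `FakeCluster`/`FakeConsumer` are in-memory stand-ins so the sketch runs without a broker. A real test would swap the fakes for actual broker fixtures and consumers.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.1):
    """Poll predicate() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()


def check_coordinator_failover(cluster, consumers, group_id,
                               timeout=30.0, interval=0.1):
    """Kill the group's coordinator; return True if every consumer rejoins.

    Hypothetical harness interface (not a real library API):
      cluster.coordinator_for(group_id) -> broker id, or None if unknown
      cluster.kill_broker(broker_id)    -> stops that broker
      consumer.poll()                   -> drives heartbeats / rejoin
      consumer.is_member()              -> True once rejoined
    """
    old = cluster.coordinator_for(group_id)
    assert old is not None, "group has no coordinator before the test starts"
    cluster.kill_broker(old)

    def recovered():
        for consumer in consumers:
            consumer.poll()
        new = cluster.coordinator_for(group_id)
        return (new is not None and new != old
                and all(c.is_member() for c in consumers))

    return wait_until(recovered, timeout=timeout, interval=interval)


class FakeCluster:
    """In-memory stand-in: elects a new coordinator a few lookups after a kill."""

    def __init__(self, broker_ids, election_delay=3):
        self.alive = list(broker_ids)
        self.coordinator = self.alive[0]
        self.election_delay = election_delay
        self._pending = 0

    def coordinator_for(self, group_id):
        if self.coordinator is None and self.alive:
            self._pending -= 1
            if self._pending <= 0:
                self.coordinator = self.alive[0]
        return self.coordinator

    def kill_broker(self, broker_id):
        self.alive.remove(broker_id)
        if self.coordinator == broker_id:
            self.coordinator = None
            self._pending = self.election_delay


class FakeConsumer:
    """In-memory stand-in that rediscovers the coordinator on every poll."""

    def __init__(self, cluster, group_id):
        self.cluster = cluster
        self.group_id = group_id
        self.joined_to = cluster.coordinator_for(group_id)

    def poll(self):
        self.joined_to = self.cluster.coordinator_for(self.group_id)

    def is_member(self):
        return (self.joined_to is not None
                and self.joined_to == self.cluster.coordinator)


if __name__ == "__main__":
    cluster = FakeCluster([0, 1, 2])
    group = [FakeConsumer(cluster, "test-group") for _ in range(2)]
    print(check_coordinator_failover(cluster, group, "test-group",
                                     timeout=2.0, interval=0.01))
```

The important part is the `recovered` predicate: it requires both that the cluster reports a different coordinator and that every consumer is a member again, so a consumer stuck retrying against the dead coordinator (the behavior in the logs above) makes the check time out instead of passing.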
