Description
I just had a failure case reported at work where a service endlessly spun:
2017-06-29 08:50:11,407 WARNING base __call__:661 10627 139637316585216 Coordinator unknown during heartbeat -- will retry
2017-06-29 08:50:11,407 WARNING base _handle_heartbeat_failure:692 10627 139637316585216 Heartbeat failed ([Error 15] GroupCoordinatorNotAvailableError); retrying
2017-06-29 08:50:11,508 WARNING base __call__:661 10627 139637316585216 Coordinator unknown during heartbeat -- will retry
2017-06-29 08:50:11,508 WARNING base _handle_heartbeat_failure:692 10627 139637316585216 Heartbeat failed ([Error 15] GroupCoordinatorNotAvailableError); retrying
Normally this indicates a cluster failure. However, from the ticket description it appears the cluster became healthy again, but the consumer never recovered and just kept logging this message for half an hour. Restarting the process immediately fixed the issue.
I wasn't directly involved; I was only called in as the Kafka expert after the fact, so it will likely be impossible to verify that the cluster was fully healthy at the time.
However, we should have an end-to-end test of this scenario: bring up a cluster and a consumer group with two processes, kill the broker that is the group coordinator, and verify that the consumers rejoin successfully once the cluster moves the coordinator to a new broker.
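A rough sketch of what such a test could look like, assuming a hypothetical `kafka_cluster` pytest fixture that can start a multi-broker cluster and kill individual brokers (names `bootstrap_servers` and `kill_broker` are made up here). The consumer side uses kafka-python's public `KafkaConsumer`/`KafkaProducer` API; reading the coordinator id goes through the private `_coordinator` attribute, so that part is illustrative only:

```python
import time

from kafka import KafkaConsumer, KafkaProducer

TOPIC = "coordinator-failover-test"
GROUP = "coordinator-failover-group"


def make_consumer(bootstrap_servers):
    """One member of the consumer group."""
    return KafkaConsumer(
        TOPIC,
        bootstrap_servers=bootstrap_servers,
        group_id=GROUP,
        auto_offset_reset="earliest",
    )


def wait_for_assignment(consumer, timeout=60):
    """Poll until the group rebalance completes and partitions are assigned."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        consumer.poll(timeout_ms=1000)
        if consumer.assignment():
            return True
    return False


def test_consumers_rejoin_after_coordinator_failover(kafka_cluster):
    # kafka_cluster is a hypothetical fixture exposing bootstrap_servers,
    # and kill_broker(broker_id); the real harness would provide equivalents.
    servers = kafka_cluster.bootstrap_servers

    consumers = [make_consumer(servers) for _ in range(2)]
    for c in consumers:
        assert wait_for_assignment(c), "initial group join failed"

    # Find the broker currently acting as group coordinator.
    # NOTE: _coordinator is a private kafka-python attribute; a real test
    # might instead determine the coordinator via the broker metadata.
    coordinator_id = consumers[0]._coordinator.coordinator_id
    kafka_cluster.kill_broker(coordinator_id)

    # The cluster should elect a new coordinator and both consumers should
    # rejoin without a restart -- exactly what failed in the reported case.
    producer = KafkaProducer(bootstrap_servers=servers)
    producer.send(TOPIC, b"after-failover")
    producer.flush()

    got_messages = False
    deadline = time.time() + 120
    while time.time() < deadline and not got_messages:
        for c in consumers:
            if c.poll(timeout_ms=1000):
                got_messages = True
    assert got_messages, "consumers never recovered after coordinator failover"
```

The key assertion is the last one: after the coordinator broker goes away, the consumers must eventually resume fetching on their own rather than spinning on the heartbeat retry loop shown in the logs above.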