Description
I have Python workers in Docker Image A (kafka-python). Four workers connect to Docker Image B (kafka-server), which runs kafka-server. If Docker Image B (kafka-server) goes down, the workers in Docker Image A loop indefinitely on DNS lookups until Docker Image B (kafka-server) comes back online.
Here's part of the log:
2023-02-17 15:48:32,489 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/client_async.py:331 - Node 1 connection failed -- refreshing metadata
2023-02-17 15:48:33,430 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:33,430 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
2023-02-17 15:48:34,323 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:34,323 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
2023-02-17 15:48:35,110 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:35,110 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
2023-02-17 15:48:35,955 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:35,955 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
2023-02-17 15:48:36,795 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:36,795 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
When Docker Image B (kafka-server) comes back online, the workers reconnect. But because of timeouts, only one worker connects, which causes kafka-server to start the topic with 1 partition instead of the expected 4.
It would be nice for the workers to actually give up trying to connect and return execution to the main loop, so I can handle the event when Docker Image B (kafka-server) goes offline.
What I've been seeing is that when kafka-server comes back online, 1 worker reconnects, 2 connect but are not assigned a partition, and 1 gets a wakeup socket error:
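As a rough illustration of the behavior I'm asking for, here is a minimal sketch of a bounded DNS check that the application could run itself whenever it has control (for example before creating the consumer). It uses only the standard library and does not change kafka-python's internal retry loop. The function name broker_resolvable and its parameters are hypothetical, not part of kafka-python.

```python
import socket
import time


def broker_resolvable(host, port, attempts=5, delay=1.0,
                      resolver=socket.getaddrinfo, sleep=time.sleep):
    """Return True once host:port resolves, False after `attempts` failed lookups.

    Hypothetical helper: `resolver` and `sleep` are injectable only so the
    function can be exercised without a real DNS server.
    """
    for attempt in range(1, attempts + 1):
        try:
            # Same kind of lookup that fails in the log above (AF_UNSPEC).
            resolver(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
            return True
        except socket.gaierror:
            # Wait between attempts, but not after the final failure.
            if attempt < attempts:
                sleep(delay)
    return False
```

In the main loop this would be used along the lines of "if not broker_resolvable('kafka-server', 19092): handle the offline case", where the offline handling is whatever the application needs (back off, alert, exit).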
https://github.com/dpkp/kafka-python/blob/4d598055dab7da99e41bfcceffa8462b32931cdd/kafka/client_async.py#L937
versions
$ python3 --version
Python 3.6.8
$ cat /usr/local/lib/python3.8/site-packages/kafka/version.py
__version__ = '2.0.2'
Also, an unrelated comment: this line should return a value but is just an empty return.
https://github.com/dpkp/kafka-python/blob/4d598055dab7da99e41bfcceffa8462b32931cdd/kafka/conn.py#L323
I'm sure I'm missing some details, but at least this will get a conversation started about what I'm observing.