DNS Lookup Loop Never Quits #2354

@ja-softdevel

I have Python workers (kafka-python) running in Docker Image A. Four workers connect to Docker Image B (kafka-server), which runs the Kafka server. If Docker Image B (kafka-server) goes down, the workers in Docker Image A loop on DNS lookups indefinitely until Docker Image B (kafka-server) comes back online.

Here is part of the log:

2023-02-17 15:48:32,489 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/client_async.py:331 - Node 1 connection failed -- refreshing metadata
2023-02-17 15:48:33,430 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:33,430 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
2023-02-17 15:48:34,323 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:34,323 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
2023-02-17 15:48:35,110 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:35,110 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
2023-02-17 15:48:35,955 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:35,955 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)
2023-02-17 15:48:36,795 [WARNING] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:1527 - DNS lookup failed for kafka-server:19092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
2023-02-17 15:48:36,795 [ERROR] elasticsearch_worker_0 /usr/local/lib/python3.8/site-packages/kafka/conn.py:315 - DNS lookup failed for kafka-server:19092 (AddressFamily.AF_UNSPEC)

When Docker Image B (kafka-server) comes back online, the workers reconnect. Because of timeouts, however, only one worker connects, which causes kafka-server to start the topic with 1 partition instead of the expected 4.

It would be nice if the workers actually gave up trying to connect and returned execution to the main loop, so that I can handle the event when Docker Image B (kafka-server) goes offline.
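To illustrate the kind of behavior I mean, here is a minimal sketch of a bounded check that the main loop could run itself. It uses only the standard library and does not change how kafka-python retries internally. The helper names (`broker_resolvable`, `wait_for_broker`) and the attempt and delay values are hypothetical, picked for illustration only.

```python
import socket
import time


def broker_resolvable(host, port):
    """Return True if the broker host name currently resolves, else False."""
    try:
        socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
        return True
    except socket.gaierror:
        # Same class of failure as "[Errno -2] Name or service not known".
        return False


def wait_for_broker(host, port, max_attempts=5, delay_s=1.0,
                    resolver=broker_resolvable, sleep=time.sleep):
    """Try to resolve the broker a bounded number of times.

    Returns True as soon as resolution succeeds, or False after
    max_attempts failures, so the caller regains control instead of
    looping forever. resolver and sleep are injectable (hypothetical
    parameters) to make the helper easy to test.
    """
    for attempt in range(1, max_attempts + 1):
        if resolver(host, port):
            return True
        if attempt < max_attempts:
            sleep(delay_s)
    return False
```

In the main loop, a worker could call `wait_for_broker("kafka-server", 19092)` before creating its consumer, or after a stretch of failed polls, and treat a `False` result as the "broker offline" event to handle.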

What I've been seeing is that when kafka-server comes back online, 1 worker reconnects, 2 connect but are not assigned a partition, and 1 gets a wakeup socket error:
https://github.com/dpkp/kafka-python/blob/4d598055dab7da99e41bfcceffa8462b32931cdd/kafka/client_async.py#L937

versions

$ python3 --version
Python 3.6.8
$ cat /usr/local/lib/python3.8/site-packages/kafka/version.py 
__version__ = '2.0.2'

Also, as a side note, this line should return a value, but it is just a bare return:
https://github.com/dpkp/kafka-python/blob/4d598055dab7da99e41bfcceffa8462b32931cdd/kafka/conn.py#L323

I'm sure I'm missing some details but at least this will get a thread/conversation started about what I'm observing.
