[Bug]: [Errno 98] error while attempting to bind on address ('0.0.0.0', 8000): address already in use #8204

@youkaichao

🐛 Describe the bug

This is a bug we hit a lot in our CI, e.g. https://buildkite.com/vllm/ci-aws/builds/8098#0191bf43-446d-411d-80c7-3ba10bc392e8/192-1557

I have been tracking this for months and have added more logging to help debug it.

From the logs:

[2024-09-05T00:38:34Z] INFO: Started server process [60858]
[2024-09-05T00:38:34Z] INFO: Waiting for application startup.
[2024-09-05T00:38:34Z] INFO: Application startup complete.
[2024-09-05T00:38:34Z] ERROR: [Errno 98] error while attempting to bind on address ('0.0.0.0', 44319): [errno 98] address already in use
[2024-09-05T00:38:34Z] INFO: Waiting for application shutdown.
[2024-09-05T00:38:34Z] INFO: Application shutdown complete.
[2024-09-05T00:38:34Z] DEBUG 09-04 17:38:34 launcher.py:64] port 44319 is used by process psutil.Process(pid=60914, name='pt_main_thread', status='sleeping', started='17:37:05') launched with command:
[2024-09-05T00:38:34Z] DEBUG 09-04 17:38:34 launcher.py:64] /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=16, pipe_handle=18) --multiprocessing-fork

We can see that the server process is PID 60858, while port 44319 is held by process 60914. Scrolling up a little, we find:

[2024-09-05T00:37:05Z] INFO 09-04 17:37:05 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/b6851f4d-4d78-46b8-baba-ae179b0088c2 for RPC Path.
[2024-09-05T00:37:05Z] INFO 09-04 17:37:05 api_server.py:176] Started engine process with PID 60914

So process 60914, the one holding the port, is the engine process.

I think the problem is that we only bind the port after the engine is ready. During engine setup, the engine might take some ports for Ray or for distributed communication, and one of them can turn out to be the port the API server binds later.
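To make the failure mode concrete, here is a minimal standalone sketch (not vLLM code) in which one socket takes a port first and a later bind to the same port fails. The `try_bind` helper and the `squatter` socket are hypothetical names for illustration; the squatter stands in for whatever the engine opens during setup.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Return a bound socket, or the errno of the OSError if binding fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        return exc.errno
    return sock


# Stand-in for the engine: something grabs a port during "engine setup".
squatter = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
squatter.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
squatter.listen()
port = squatter.getsockname()[1]

# Stand-in for the API server: it tries to bind only after setup is done.
late_bind_result = try_bind(port)
squatter.close()

print(late_bind_result == errno.EADDRINUSE)
```

On Linux `errno.EADDRINUSE` is 98, which matches the `[Errno 98]` in the log above.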

There are two possible solutions:

  1. The API server binds to the port immediately after it starts, and reports an unready status when a client queries the `/health` endpoint.
  2. The API server binds the port immediately (via `socket.socket(socket.AF_INET, socket.SOCK_STREAM).bind(("", uvicorn_kwargs["port"]))`), and once the engine is up, it releases the port and binds again to serve requests.
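A rough sketch of what option 2 could look like, using only the standard library. The `reserve_port` helper is a hypothetical name, and the loopback address and OS-assigned port are assumptions for the demo. Note that the placeholder socket has to be kept in a variable: in CPython, a socket object with no remaining reference is closed when it is garbage collected, which would release the port again.

```python
import socket


def reserve_port(port, host=""):
    """Bind a placeholder socket and return it so the caller keeps it alive."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    return sock


# Reserve the port before any slow setup. Port 0 (OS-assigned) is used here
# only so the demo never collides with a real service.
placeholder = reserve_port(0, "127.0.0.1")
port = placeholder.getsockname()[1]

# ... engine setup would happen here; other binds to `port` now fail ...

# Release the placeholder, then bind again with the real serving socket.
placeholder.close()
server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.bind(("127.0.0.1", port))
server_sock.listen()
rebound_port = server_sock.getsockname()[1]
server_sock.close()
```

There is still a short window between `placeholder.close()` and the second `bind` in which another process could take the port, so this narrows the race rather than removing it.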

I think option 1 is better. With option 2, clients would get no HTTP response at all (not even a 404) before the engine is up, because the placeholder is just a raw socket with nothing behind it to answer requests.
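A minimal sketch of option 1, written with the standard library's `http.server` rather than the uvicorn-based server vLLM actually uses. The `engine_ready` flag, the `HealthHandler` class, the `health_status` helper, and the `/health` path are all assumptions for illustration. The server binds before any slow setup and answers 503 until the flag is set.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

engine_ready = threading.Event()  # hypothetical "engine is up" flag


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404)
        elif engine_ready.is_set():
            self.send_response(200)
        else:
            self.send_response(503)  # bound and reachable, but not ready yet
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def health_status(port):
    """Return the HTTP status code of GET /health on the local server."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/health")
        return conn.getresponse().status
    finally:
        conn.close()


# The constructor binds and listens immediately, before any engine setup.
server = ThreadingHTTPServer(("127.0.0.1", 0), HealthHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

status_before = health_status(port)
engine_ready.set()  # stand-in for "engine finished starting"
status_after = health_status(port)

server.shutdown()
server.server_close()
```

Because the port is held for the whole lifetime of the process, nothing started during engine setup can take it, and clients get a well-defined answer while they wait.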

cc @robertgshaw2-neuralmagic @njhill @joerunde

Also cc @richardliaw @rkooo567: how can we turn on verbose Ray logging, so that we can verify whether the port is indeed being used by Ray?
