Description
Your current environment
The output of `python collect_env.py`
Your output of `python collect_env.py` here
🐛 Describe the bug
This is a bug we encounter a lot in our CI, e.g. https://buildkite.com/vllm/ci-aws/builds/8098#0191bf43-446d-411d-80c7-3ba10bc392e8/192-1557
I have been tracking this for months and have been adding more logging to help with debugging.
From the logging information:
[2024-09-05T00:38:34Z] INFO: Started server process [60858]
| [2024-09-05T00:38:34Z] INFO: Waiting for application startup.
| [2024-09-05T00:38:34Z] INFO: Application startup complete.
| [2024-09-05T00:38:34Z] ERROR: [Errno 98] error while attempting to bind on address ('0.0.0.0', 44319): [errno 98] address already in use
| [2024-09-05T00:38:34Z] INFO: Waiting for application shutdown.
| [2024-09-05T00:38:34Z] INFO: Application shutdown complete.
| [2024-09-05T00:38:34Z] DEBUG 09-04 17:38:34 launcher.py:64] port 44319 is used by process psutil.Process(pid=60914, name='pt_main_thread', status='sleeping', started='17:37:05') launched with command:
| [2024-09-05T00:38:34Z] DEBUG 09-04 17:38:34 launcher.py:64] /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=16, pipe_handle=18) --multiprocessing-fork
We can see that the server process is PID 60858, and port 44319 is used by process 60914. Scrolling up a little, we find:
[2024-09-05T00:37:05Z] INFO 09-04 17:37:05 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/b6851f4d-4d78-46b8-baba-ae179b0088c2 for RPC Path.
| [2024-09-05T00:37:05Z] INFO 09-04 17:37:05 api_server.py:176] Started engine process with PID 60914
It becomes clear that process 60914 is the engine process.
I think the problem is that we only bind the port after the engine is ready. During engine setup, the engine might use some ports for Ray or for distributed communication, and one of them can turn out to be the port the API server binds later.
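The suspected failure mode can be reproduced in a few lines. This is a minimal sketch, not vLLM code: `try_bind` is a hypothetical helper, both sockets live in one process for simplicity (in the CI log they belong to two different processes), and the "engine" side is assumed to request an OS-assigned port by binding to port 0.

```python
import errno
import socket


def try_bind(port: int):
    """Bind a TCP socket to 127.0.0.1:port; return (socket, None) or (None, errno)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


# Stand-in for the engine: it asks the OS for any free port (port 0) during setup.
engine_sock, _ = try_bind(0)
engine_port = engine_sock.getsockname()[1]

# Stand-in for the API server: it binds late and happens to want that same port.
server_sock, late_bind_errno = try_bind(engine_port)

print(late_bind_errno == errno.EADDRINUSE)
engine_sock.close()
```

The late bind fails with the same "address already in use" error that appears in the log above.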
There are two possible solutions:
1. The API server binds to the port immediately after start, and returns an unready status when a client queries the `/healthy` endpoint.
2. The API server binds the port immediately (via `socket.socket(socket.AF_INET, socket.SOCK_STREAM).bind(("", uvicorn_kwargs["port"]))`), and after the engine is up, it releases the port and binds again to serve requests.
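The first approach could look roughly like the sketch below. It is only an illustration under stated assumptions: the standard-library `http.server` stands in for uvicorn, `engine_ready` and `health_status` are hypothetical names, 503 is an assumed choice for the "unready" status, and port 0 is used only so the demo does not collide with anything.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

engine_ready = threading.Event()  # hypothetical flag, set once engine setup finishes


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthy":
            self.send_error(404)
            return
        # 503 while the engine is still starting, 200 once it is ready
        self.send_response(200 if engine_ready.is_set() else 503)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass


def health_status(port: int) -> int:
    """Query /healthy on localhost and return the HTTP status code."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/healthy")
        return conn.getresponse().status
    finally:
        conn.close()


# Bind first: the server owns the port before any engine setup runs.
server = ThreadingHTTPServer(("127.0.0.1", 0), HealthHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

status_during_setup = health_status(port)  # engine not ready yet
engine_ready.set()                         # stand-in for "engine is up"
status_after_setup = health_status(port)

server.shutdown()
server.server_close()
print(status_during_setup, status_after_setup)
```

Because the listening socket exists from the start, nothing launched during engine setup can take the port, and clients always get a well-formed HTTP answer.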
I think option 1 might be better. Option 2 has the drawback that clients cannot get a proper HTTP response before the engine is up, because the placeholder is just a raw socket that never answers (a connection attempt simply fails).
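A small sketch of the second approach shows that drawback. It assumes a placeholder socket that is bound but never listens, as in the one-liner above; the variable names are hypothetical and port 0 is used only for the demo.

```python
import socket

HOST = "127.0.0.1"

# Reserve the port with a placeholder socket that is bound but not listening.
placeholder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
placeholder.bind((HOST, 0))
port = placeholder.getsockname()[1]

# A client that connects during engine setup gets no HTTP response at all.
try:
    socket.create_connection((HOST, port), timeout=2).close()
    refused_before = False
except ConnectionRefusedError:
    refused_before = True

# Engine is up: release the placeholder, then bind again for real serving.
# Note the short gap between close() and bind() where the port is free again.
placeholder.close()
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, port))
server.listen()

client = socket.create_connection((HOST, port), timeout=2)
connected_after = True
client.close()
server.close()
print(refused_before, connected_after)
```

Besides the missing response during startup, the release-and-rebind step reopens a small window in which another process could take the port.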
cc @robertgshaw2-neuralmagic @njhill @joerunde
Also cc @richardliaw @rkooo567: how can we turn on verbose Ray logging, so that we can verify whether the port is indeed used by Ray?