[Bug]: "Address already in use" for 1 minute after crash (since 0.6.2) #9737

@hibukipanim

🐛 Describe the bug

Since version 0.6.2 (it also happens in 0.6.3.post1), after the server dies (due to an exception/crash or hitting Ctrl-C), it fails to start again for about a minute with:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/home/user/code/debug/.venv/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/user/code/debug/.venv/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 544, in run_server
    sock.bind(("", args.port))
OSError: [Errno 98] Address already in use

This prolongs recovery from crashes. For example, when Kubernetes restarts the container immediately after a crash, the server previously started loading the model again right away, but now it goes through several crash/restart loops until the port is freed.

I verified that it also happens with --disable-frontend-multiprocessing.

To reproduce it, start vllm with default args, for example:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0

and then send at least one chat or completion request to it (the issue does not reproduce without this).
Then hit Ctrl-C to kill the server.
Starting vllm again should fail with the "Address already in use" error.
This doesn't happen with vllm <= 0.6.1.
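
For the "send at least one request" step, a minimal sketch using only the Python standard library could look like the following. It assumes the server listens on localhost:8000 (the port seen in the netstat output in this report) and exposes the OpenAI-compatible /v1/completions route. The helper names `build_completion_request` and `send_one_request` are made up for illustration.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str = "http://localhost:8000",  # assumed default host and port
    model: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
) -> urllib.request.Request:
    """Build a small completion request for the OpenAI-compatible server."""
    payload = {"model": model, "prompt": "Hello", "max_tokens": 8}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_one_request() -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_completion_request(), timeout=60) as resp:
        return resp.status
```

Calling `send_one_request()` once while the server is running, then pressing Ctrl-C in the server terminal, should be enough to follow the steps above.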

I tried to find out why the port is busy. Interestingly, the vllm process is dead during this ~1 minute and no other process is listening on the port. However, I noticed that there are still sockets open on port 8000. They can be seen via:

netstat | grep  ':8000'

which would show something like:

tcp        0      0 localhost:8000          localhost:40452         TIME_WAIT   -
tcp        0      0 localhost:8000          localhost:56324         TIME_WAIT   -
tcp        0      0 localhost:8000          localhost:40466         TIME_WAIT   -

After a minute these entries disappear, and then vllm manages to start again.
I couldn't attribute them to a PID, even with various netstat or lsof flags. Maybe they remain open in the kernel due to an unclean process exit?
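
A possible explanation, offered as an assumption rather than a confirmed diagnosis of what changed in 0.6.2: TIME_WAIT is the normal state the kernel keeps for the side of a TCP connection that closed first. Such entries no longer belong to any process, which would explain the missing PID, and on Linux they last about 60 seconds. While they exist, a plain bind() on the same port fails with "Address already in use" unless SO_REUSEADDR is set on the new socket before binding. The sketch below shows that option with the standard socket module. The function name `create_server_socket` is hypothetical and this is not vllm's actual code.

```python
import socket


def create_server_socket(port: int, host: str = "") -> socket.socket:
    """Create a TCP socket that can bind despite leftover TIME_WAIT entries."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEADDR has to be set before bind(). On Linux it lets bind()
    # succeed while old connections on this port are still in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    return sock
```

On Linux, SO_REUSEADDR does not allow two sockets to listen on the same address and port at the same time, so it should not mask a second server that is genuinely still running.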
