Description
Hi vLLM team,
I started a vLLM server (OpenAI API) to serve LLaMA-7b and had multiple processes sending requests to it simultaneously to saturate the GPU (I tried both 1xA100 40G and 1xA40 40G).
However, after 5-10 minutes the vLLM server hangs indefinitely (no new requests get handled), with no error messages. The most recent stats show: "INFO 10-27 20:44:35 llm_engine.py:624] Avg prompt throughput: 642.0 tokens/s, Avg generation throughput: 61.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 20 reqs, GPU KV cache usage: 98.7%, CPU KV cache usage: 0.0%".
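For reference, here is a minimal sketch of the client-side load pattern (the server URL, model name, concurrency level, and prompts are placeholders, not my exact setup; I use separate processes, but threads are shown here for brevity):

```python
# Sketch of the concurrent load against the local vLLM OpenAI-compatible server,
# using the openai 0.x Python client.
import concurrent.futures

import openai

openai.api_base = "http://localhost:8000/v1"  # local vLLM server (example port)
openai.api_key = "EMPTY"                      # vLLM does not require a real key

def send_request(i: int) -> str:
    resp = openai.Completion.create(
        model="llama-7b",  # placeholder; use the served model name
        prompt=f"Request {i}: write a short story about a robot.",
        max_tokens=256,
    )
    return resp["choices"][0]["text"]

# Keep many requests in flight to saturate the GPU / KV cache.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    futures = [pool.submit(send_request, i) for i in range(1000)]
    for fut in concurrent.futures.as_completed(futures):
        try:
            fut.result()
        except openai.error.APIError as e:
            print("APIError:", e)
```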
After the hang occurs, the v1/models endpoint still works (it gives correct responses), but chat completion and completion requests receive openai.error.APIError: Invalid response object from API: 'Internal Server Error' (HTTP response code was 500), and there are NO error messages on the vLLM side.
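To illustrate what I observe after the hang (endpoint paths are the standard OpenAI-compatible routes; the port and model name are just examples):

```python
# Quick check of the server state after the hang, using plain HTTP requests.
import requests

base = "http://localhost:8000"

# /v1/models still responds normally and lists the served model.
print(requests.get(f"{base}/v1/models", timeout=10).status_code)  # 200

# Completion requests come back as HTTP 500 ("Internal Server Error"),
# which the openai client surfaces as openai.error.APIError.
resp = requests.post(
    f"{base}/v1/completions",
    json={"model": "llama-7b", "prompt": "Hello", "max_tokens": 16},
    timeout=30,
)
print(resp.status_code, resp.text)  # 500 Internal Server Error
```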
Any idea what might cause this? Is it because there are too many requests to be handled?
Thanks!