
vLLM hangs after 10 minutes without any error message #1492

@xingyaoww

Hi vLLM team,

I started a vLLM server (OpenAI-compatible API) to serve LLaMA-7B and had multiple processes send requests to it simultaneously to saturate the GPU (I tried both 1xA100 40G and 1xA40).
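For reference, here is roughly how I'm generating the load (a minimal sketch, not my exact script; the model path, prompt, request parameters, and process count below are placeholders, and it assumes the pre-1.0 `openai` Python client):

```python
# Server side (vLLM's OpenAI-compatible entrypoint), launched separately:
#   python -m vllm.entrypoints.openai.api_server --model huggyllama/llama-7b
# The model id above is a placeholder for whatever LLaMA-7B weights are used.

import multiprocessing

import openai

openai.api_base = "http://localhost:8000/v1"  # vLLM's OpenAI-compatible endpoint
openai.api_key = "EMPTY"                      # vLLM does not check the key

def worker(worker_id: int) -> None:
    # Each process sends completion requests in a tight loop to keep the
    # GPU saturated and the pending queue non-empty.
    while True:
        resp = openai.Completion.create(
            model="huggyllama/llama-7b",
            prompt=f"[worker {worker_id}] Tell me a story about",
            max_tokens=256,
        )
        print(worker_id, len(resp["choices"][0]["text"]))

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```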

However, after 5-10 minutes, the vLLM server hangs indefinitely (no new requests get handled) and prints no error messages. The most recent stats show: "INFO 10-27 20:44:35 llm_engine.py:624] Avg prompt throughput: 642.0 tokens/s, Avg generation throughput: 61.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 20 reqs, GPU KV cache usage: 98.7%, CPU KV cache usage: 0.0%".

After the hang happens, the /v1/models endpoint still works (gives correct responses), but chat completion and completion requests receive `openai.error.APIError: Invalid response object from API: 'Internal Server Error' (HTTP response code was 500)`, and there are NO error messages on the vLLM side.
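Concretely, this is what the client side looks like after the hang (again a sketch; the endpoint URL and model name are assumed to match the setup above):

```python
import openai
import requests

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"

# /v1/models still responds correctly...
print(requests.get("http://localhost:8000/v1/models").json())

# ...but any completion request fails with a 500, and nothing is
# logged on the server side.
try:
    openai.Completion.create(
        model="huggyllama/llama-7b",  # placeholder model id
        prompt="hello",
        max_tokens=16,
    )
except openai.error.APIError as e:
    # Invalid response object from API: 'Internal Server Error'
    # (HTTP response code was 500)
    print(e)
```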

Any idea what might cause this? Is it because there are too many pending requests to handle?

Thanks!
