Description
Hi vLLM team,
I started a vLLM server (OpenAI API) to serve LLaMA-7b and had multiple processes sending requests to it simultaneously to saturate the GPU (I tried both 1xA100 40G and 1xA40 40G).
However, after 5-10 minutes the vLLM server hangs indefinitely (no new requests get handled), with no error messages. The most recent stats show: "INFO 10-27 20:44:35 llm_engine.py:624] Avg prompt throughput: 642.0 tokens/s, Avg generation throughput: 61.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 20 reqs, GPU KV cache usage: 98.7%, CPU KV cache usage: 0.0%".
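For reference, here is a minimal sketch of the client-side load pattern (the server URL, model name, concurrency level, and prompts are placeholders, not my exact setup; I use separate processes, but threads are shown here for brevity):

```python
# Sketch of the concurrent load against the local vLLM OpenAI-compatible server,
# using the openai 0.x Python client.
import concurrent.futures

import openai

openai.api_base = "http://localhost:8000/v1"  # local vLLM server (example port)
openai.api_key = "EMPTY"                      # vLLM does not require a real key

def send_request(i: int) -> str:
    resp = openai.Completion.create(
        model="llama-7b",  # placeholder; use the served model name
        prompt=f"Request {i}: write a short story about a robot.",
        max_tokens=256,
    )
    return resp["choices"][0]["text"]

# Keep many requests in flight to saturate the GPU / KV cache.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    futures = [pool.submit(send_request, i) for i in range(1000)]
    for fut in concurrent.futures.as_completed(futures):
        try:
            fut.result()
        except openai.error.APIError as e:
            print("APIError:", e)
```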
After the hang occurs, the v1/models endpoint still works (it gives correct responses), but chat completion and completion requests receive openai.error.APIError: Invalid response object from API: 'Internal Server Error' (HTTP response code was 500), and there are NO error messages on the vLLM side.
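To illustrate what I observe after the hang (endpoint paths are the standard OpenAI-compatible routes; the port and model name are just examples):

```python
# Quick check of the server state after the hang, using plain HTTP requests.
import requests

base = "http://localhost:8000"

# /v1/models still responds normally and lists the served model.
print(requests.get(f"{base}/v1/models", timeout=10).status_code)  # 200

# Completion requests come back as HTTP 500 ("Internal Server Error"),
# which the openai client surfaces as openai.error.APIError.
resp = requests.post(
    f"{base}/v1/completions",
    json={"model": "llama-7b", "prompt": "Hello", "max_tokens": 16},
    timeout=30,
)
print(resp.status_code, resp.text)  # 500 Internal Server Error
```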
Any idea what might cause this? Is it because there are too many requests to be handled?
Thanks!