
[Bug]: Server crashes when kv cache exhausted #8738

@thies1006

Description

Your current environment

GPUs: 8x NVIDIA L4
vLLM: v0.6.1 (Docker)
Model: neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8-dynamic

Options:
VLLM_ATTENTION_BACKEND=FLASHINFER
--tensor-parallel-size 8 --max_model_len 50000 --max-num-batched-tokens 50000 --gpu-memory-utilization 0.90 --enable-chunked-prefill false
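
For reference, a rough Python-API equivalent of the flags above (a sketch only: the actual deployment uses the Docker server entrypoint, and the kwarg names follow EngineArgs, so they may differ slightly across versions):

import os

# Attention backend is selected via the environment, as in the Docker run.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

# Sketch of the same settings through the offline LLM entrypoint; this does not
# reproduce the async-server crash path, it only documents the configuration.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8-dynamic",
    tensor_parallel_size=8,
    max_model_len=50000,
    max_num_batched_tokens=50000,
    gpu_memory_utilization=0.90,
    enable_chunked_prefill=False,
)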

Model Input Dumps

No response

🐛 Describe the bug

The model runs fine under light load. As the load increases, KV cache usage eventually reaches 100% and vLLM crashes with the traceback below.
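
A rough sketch of the kind of client load that triggers it, using the OpenAI-compatible endpoint (the URL, concurrency, and prompt/output lengths below are assumptions for illustration, not the exact workload):

import asyncio

from openai import AsyncOpenAI

# Hypothetical load generator: enough long concurrent requests that each one
# holds many KV cache blocks, driving cache usage to 100%.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> None:
    await client.completions.create(
        model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8-dynamic",
        prompt="word " * 4000,  # long prompt so each request occupies many blocks
        max_tokens=512,
    )

async def main() -> None:
    await asyncio.gather(*(one_request() for _ in range(200)))

asyncio.run(main())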

Exception in callback functools.partial(<function _log_task_completion at 0x7f704067b2e0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f703c90f2f0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f704067b2e0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f703c90f2f0>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 112, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1584, in execute_model
    model_input.async_callback()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1438, in _process_model_outputs
    self.do_log_stats(scheduler_outputs, outputs, finished_before)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1748, in do_log_stats
    stats = self._get_stats(scheduler_outputs, model_output,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1860, in _get_stats
    latency = seq_group.get_last_latency(now)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/sequence.py", line 686, in get_last_latency
    raise ValueError(
ValueError: seq_group.get_last_latency() should not be called if the seq_group is in prefill phase.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 52, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 869, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 809, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 344, in step_async
    outputs = await self.model_executor.execute_model_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 125, in _wrapper
    pickle.dump(dumped_inputs, filep)
TypeError: cannot pickle 'flashinfer._prefill.BatchPrefillWithPagedKVCachePyTorchWrapper' object

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 64, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
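
Reading the chained tracebacks: the real failure is the ValueError from get_last_latency() in the stats-logging path, but the debug wrapper in model_runner_base.py then tries to pickle the model inputs, and the FlashInfer prefill wrapper they contain is not picklable, so the engine dies on the TypeError instead. A minimal standalone illustration of that masking pattern (class and function names here are made up for the demo, not vLLM's code):

import io
import pickle
import threading

class UnpicklableWrapper:
    """Stand-in for the FlashInfer kernel wrapper: holds a lock, so pickle fails."""
    def __init__(self) -> None:
        self._lock = threading.Lock()

def execute_model(model_input: dict) -> None:
    # Stand-in for the stats path hitting a sequence group still in prefill.
    raise ValueError("should not be called if the seq_group is in prefill phase")

def dump_on_error_wrapper(model_input: dict) -> None:
    try:
        execute_model(model_input)
    except Exception:
        # Debug path: serialize the inputs for a bug report. Because one input is
        # unpicklable, pickle.dump raises TypeError while handling the ValueError,
        # and the TypeError is what propagates, hiding the original error.
        pickle.dump(model_input, io.BytesIO())
        raise

dump_on_error_wrapper({"attn_wrapper": UnpicklableWrapper()})

Running this prints the same "During handling of the above exception, another exception occurred" chain as above, ending in the TypeError rather than the original ValueError.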

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Labels: bug