
[Bug]: Server crashes when kv cache exhausted #8738

@thies1006

Description

Your current environment

GPUs: 8x NVIDIA L4
vLLM: v0.6.1 (Docker)
Model: neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8-dynamic

Options:
VLLM_ATTENTION_BACKEND=FLASHINFER
--tensor-parallel-size 8 --max_model_len 50000 --max-num-batched-tokens 50000 --gpu-memory-utilization 0.90 --enable-chunked-prefill false
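
For reference, a rough Python-API equivalent of the flags above (a sketch only: the actual deployment uses the Docker server entrypoint, and the kwarg names follow EngineArgs, so they may differ slightly across versions):

import os

# Attention backend is selected via the environment, as in the Docker run.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

# Sketch of the same settings through the offline LLM entrypoint; this does not
# reproduce the async-server crash path, it only documents the configuration.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8-dynamic",
    tensor_parallel_size=8,
    max_model_len=50000,
    max_num_batched_tokens=50000,
    gpu_memory_utilization=0.90,
    enable_chunked_prefill=False,
)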

Model Input Dumps

No response

🐛 Describe the bug

The model runs fine under light load. As the load increases, KV cache usage eventually reaches 100% and vLLM crashes with the traceback below.
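
A rough sketch of the kind of client load that triggers it, using the OpenAI-compatible endpoint (the URL, concurrency, and prompt/output lengths below are assumptions for illustration, not the exact workload):

import asyncio

from openai import AsyncOpenAI

# Hypothetical load generator: enough long concurrent requests that each one
# holds many KV cache blocks, driving cache usage to 100%.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> None:
    await client.completions.create(
        model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8-dynamic",
        prompt="word " * 4000,  # long prompt so each request occupies many blocks
        max_tokens=512,
    )

async def main() -> None:
    await asyncio.gather(*(one_request() for _ in range(200)))

asyncio.run(main())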

Exception in callback functools.partial(<function _log_task_completion at 0x7f704067b2e0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f703c90f2f0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f704067b2e0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f703c90f2f0>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 112, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1584, in execute_model
    model_input.async_callback()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1438, in _process_model_outputs
    self.do_log_stats(scheduler_outputs, outputs, finished_before)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1748, in do_log_stats
    stats = self._get_stats(scheduler_outputs, model_output,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1860, in _get_stats
    latency = seq_group.get_last_latency(now)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/sequence.py", line 686, in get_last_latency
    raise ValueError(
ValueError: seq_group.get_last_latency() should not be called if the seq_group is in prefill phase.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 52, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 869, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 809, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 344, in step_async
    outputs = await self.model_executor.execute_model_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 125, in _wrapper
    pickle.dump(dumped_inputs, filep)
TypeError: cannot pickle 'flashinfer._prefill.BatchPrefillWithPagedKVCachePyTorchWrapper' object

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 64, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
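
Reading the chained tracebacks: the real failure is the ValueError from get_last_latency() in the stats-logging path, but the debug wrapper in model_runner_base.py then tries to pickle the model inputs, and the FlashInfer prefill wrapper they contain is not picklable, so the engine dies on the TypeError instead. A minimal standalone illustration of that masking pattern (class and function names here are made up for the demo, not vLLM's code):

import io
import pickle
import threading

class UnpicklableWrapper:
    """Stand-in for the FlashInfer kernel wrapper: holds a lock, so pickle fails."""
    def __init__(self) -> None:
        self._lock = threading.Lock()

def execute_model(model_input: dict) -> None:
    # Stand-in for the stats path hitting a sequence group still in prefill.
    raise ValueError("should not be called if the seq_group is in prefill phase")

def dump_on_error_wrapper(model_input: dict) -> None:
    try:
        execute_model(model_input)
    except Exception:
        # Debug path: serialize the inputs for a bug report. Because one input is
        # unpicklable, pickle.dump raises TypeError while handling the ValueError,
        # and the TypeError is what propagates, hiding the original error.
        pickle.dump(model_input, io.BytesIO())
        raise

dump_on_error_wrapper({"attn_wrapper": UnpicklableWrapper()})

Running this prints the same "During handling of the above exception, another exception occurred" chain as above, ending in the TypeError rather than the original ValueError.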

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Labels: bug