Your current environment
Using the latest vLLM off of main.
🐛 Describe the bug
When running the online server with a model that has an MLP speculator, sending a request that asks for prompt logprobs causes the server to crash with an AssertionError.
Stacktrace:
Traceback (most recent call last):
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 125, in generate
    async for request_output in results_generator:
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 1054, in generate
    async for output in await self.add_request(
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 114, in generator
    raise result
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 920, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 863, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 332, in step_async
    output = await self.model_executor.execute_model_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/executor/gpu_executor.py", line 170, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/spec_decode/spec_decode_worker.py", line 387, in execute_model
    return self._run_no_spec(execute_model_req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/spec_decode/spec_decode_worker.py", line 481, in _run_no_spec
    self.previous_hidden_states.update(
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/sequence.py", line 1199, in update
    assert len(seq_group_metadata_list) == len(hidden_states)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
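
For context, the check that fires is the length assertion in the update method of the speculative decoder's hidden-states tracker (vllm/sequence.py, line 1199 above). A minimal sketch of the invariant, with the class name assumed from the traceback and the body simplified rather than copied verbatim:

class HiddenStates:
    """Sketch only: tracks the previous step's hidden states that the
    MLP speculator conditions on; names assumed from the traceback."""

    def update(self, seq_group_metadata_list, hidden_states):
        # Invariant: exactly one hidden-state entry per sequence group.
        # A request for prompt logprobs appears to break this one-to-one
        # pairing, which is the assert that fails above.
        assert len(seq_group_metadata_list) == len(hidden_states)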
To Reproduce
Run a server with an MLP speculator, e.g. one of IBM's Granite models:
vllm serve ibm-granite/granite-3b-code-instruct --speculative-model ibm-granite/granite-3b-code-instruct-accelerator --use-v2-block-manager --enforce-eager
Send an echo request with logprobs requested for the prompt tokens:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "ibm-granite/granite-3b-code-instruct",
        "prompt": "Hello World",
        "echo": 1,
        "logprobs": 1,
        "temperature": 0
    }'
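
The same request can also be sent with the OpenAI Python client (a sketch; assumes the server above is running on localhost:8000 and the openai package is installed):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# echo combined with logprobs asks for logprobs on the prompt tokens,
# which is what triggers the server-side AssertionError.
completion = client.completions.create(
    model="ibm-granite/granite-3b-code-instruct",
    prompt="Hello World",
    echo=True,
    logprobs=1,
    temperature=0,
)
print(completion)

Either way, the request crashes the engine loop with the AssertionError above instead of returning prompt logprobs.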