Description
System Info
CentOS 9 - CPU only
remote-vllm image from docker.io
Information
- The official example scripts
- My own modified scripts
🐛 Describe the bug
I'm running a llama-stack server using the docker.io/llamastack/distribution-remote-vllm image and getting a BadRequestError on chat completions. The last image tag that worked for me was 0.1.9; every tag after that fails with the same error.
From the client side, I'm just using curl like this:
$ curl http://localhost:8321/v1/inference/chat-completion -H "Content-Type: application/json" -d '{
"model_id": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Write me a limerick about Llama Stack."}],
"max_tokens": 100,
"temperature": 0
}'
This same request works perfectly against the 0.1.9 tag of the container.
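For reference, here is the equivalent request through the Python client (a minimal sketch assuming the llama_stack_client package; sampling parameters omitted for brevity). Since the error is raised server-side, it presumably fails the same way on the newer tags:

# Minimal sketch using the llama_stack_client package (an assumption; the curl
# call above is what I actually ran). Talks to the same server on port 8321.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a limerick about Llama Stack."},
    ],
)
print(response.completion_message.content)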
And below is how I start the server:
podman run -it --privileged --rm -p 8321:8321 docker.io/llamastack/distribution-remote-vllm:latest --port 8321 --env INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct --env VLLM_URL=$VLLM_URL --env VLLM_API_TOKEN=$VLLM_API_TOKEN --env VLLM_MAX_TOKENS=200 --env LLAMA_STACK_PORT=8321
Error logs
BadRequestError: Error code: 400 - {'object': 'error', 'message': "[{'type': 'list_type', 'loc': ('body',
'tools'), 'msg': 'Input should be a valid list', 'input': {}}]", 'type': 'BadRequestError', 'param': None,
'code': 400}
INFO: ::1:54744 - "POST /v1/inference/chat-completion HTTP/1.1" 500 Internal Server Error
09:25:37.793 [END] /v1/inference/chat-completion [StatusCode.OK] (1393.66ms)
09:25:37.791 [ERROR] Error executing endpoint route='/v1/inference/chat-completion' method='post'
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 201, in endpoint
return await maybe_await(value)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 161, in maybe_await
return await value
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 324, in chat_completion
response = await provider.chat_completion(**params)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 307, in chat_completion
return await self._nonstream_chat_completion(request, self.client)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 313, in _nonstream_chat_completion
r = await client.chat.completions.create(**params)
File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py", line 2002, in create
return await self._post(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1767, in post
return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1461, in request
return await self._request(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1524, in _request
return await self._retry_request(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1594, in _retry_request
return await self._request(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1562, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "[{'type': 'list_type', 'loc': ('body', 'tools'), 'msg': 'Input should be a valid list', 'input': {}}]", 'type': 'BadRequestError', 'param': None, 'code': 400}
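Judging from the validation error (loc ('body', 'tools'), input {}), the provider now seems to forward a tools field as an empty dict instead of omitting it or sending a list, even though my request passes no tools at all. Below is a minimal sketch that reproduces the same 400 by posting directly to the vLLM OpenAI-compatible endpoint; it assumes VLLM_URL is the server's /v1 base URL and is purely illustrative, not taken from the llama-stack code:

# Hypothetical reproduction of the 400 directly against the vLLM
# OpenAI-compatible endpoint. Assumes VLLM_URL is the /v1 base URL
# (e.g. http://vllm-host:8000/v1) and VLLM_API_TOKEN is its API key.
import os
import requests

base_url = os.environ["VLLM_URL"].rstrip("/")
headers = {"Authorization": f"Bearer {os.environ.get('VLLM_API_TOKEN', '')}"}
body = {
    # Model name as registered with vLLM; adjust if the served model id differs.
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a limerick about Llama Stack."},
    ],
    "max_tokens": 100,
    "temperature": 0,
}

# Without "tools" the request succeeds.
ok = requests.post(f"{base_url}/chat/completions", json=body, headers=headers)
print(ok.status_code, ok.json()["choices"][0]["message"]["content"])

# With "tools" set to an empty dict instead of a list, vLLM's request
# validation rejects it with the same list_type / ('body', 'tools') error
# shown in the logs above.
bad = requests.post(
    f"{base_url}/chat/completions",
    json={**body, "tools": {}},
    headers=headers,
)
print(bad.status_code, bad.text)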
Expected behavior
On the working version (0.1.9), the server replies just fine:
$ curl http://localhost:8321/v1/inference/chat-completion -H "Content-Type: application/json" -d '{
"model_id": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Write me a limerick about Llama Stack."}],
"max_tokens": 100,
"temperature": 0
}'
{"metrics":[{"metric":"prompt_tokens","value":32,"unit":null},{"metric":"completion_tokens","value":47,"unit":null},{"metric":"total_tokens","value":79,"unit":null}],"completion_message":{"role":"assistant","content":"There once was a Llama Stack high,\nBuilt with blocks that touched the sky,\nIt stood with great care,\nAnd a gentle air,\nThis Llama's tower reached on by.","stop_reason":"end_of_turn","tool_calls":[]},"logprobs":null}