
BadRequest error using remote-vllm on inference #1955

@danalsan

Description

System Info

CentOS 9 - CPU only
remote-vllm image from docker.io

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

I'm running a llama-stack server using the docker.io/llamastack/distribution-remote-vllm image and getting a BadRequest error. The last image that worked for me is the 0.1.9 tag; every tag after that fails with the same error.

From the client side, I'm just using curl like this:

$ curl http://localhost:8321/v1/inference/chat-completion \
    -H "Content-Type: application/json" \
    -d '{
        "model_id": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write me a limerick about Llama Stack."}
        ],
        "max_tokens": 100,
        "temperature": 0
    }'
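
For reference, here is the same request in Python (a quick sketch using requests; it just mirrors the curl payload above, nothing llama-stack-specific is assumed):

import requests

resp = requests.post(
    "http://localhost:8321/v1/inference/chat-completion",
    json={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write me a limerick about Llama Stack."},
        ],
        "max_tokens": 100,
        "temperature": 0,
    },
)
print(resp.status_code, resp.json())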

This same request works perfectly using the 0.1.9 tag of the container.
And below is how I start the server:

podman run -it --privileged --rm -p 8321:8321 \
    docker.io/llamastack/distribution-remote-vllm:latest \
    --port 8321 \
    --env INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct \
    --env VLLM_URL=$VLLM_URL \
    --env VLLM_API_TOKEN=$VLLM_API_TOKEN \
    --env VLLM_MAX_TOKENS=200 \
    --env LLAMA_STACK_PORT=8321

Error logs

BadRequestError: Error code: 400 - {'object': 'error', 'message': "[{'type': 'list_type', 'loc': ('body', 'tools'), 'msg': 'Input should be a valid list', 'input': {}}]", 'type': 'BadRequestError', 'param': None, 'code': 400}

INFO:     ::1:54744 - "POST /v1/inference/chat-completion HTTP/1.1" 500 Internal Server Error
09:25:37.793 [END] /v1/inference/chat-completion [StatusCode.OK] (1393.66ms)
 09:25:37.791 [ERROR] Error executing endpoint route='/v1/inference/chat-completion' method='post'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 201, in endpoint
    return await maybe_await(value)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 161, in maybe_await
    return await value
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
    result = await method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 324, in chat_completion
    response = await provider.chat_completion(**params)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
    result = await method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 307, in chat_completion
    return await self._nonstream_chat_completion(request, self.client)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 313, in _nonstream_chat_completion
    r = await client.chat.completions.create(**params)
  File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py", line 2002, in create
    return await self._post(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1767, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1461, in request
    return await self._request(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1524, in _request
    return await self._retry_request(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1594, in _retry_request
    return await self._request(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1562, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "[{'type': 'list_type', 'loc': ('body', 'tools'), 'msg': 'Input should be a valid list', 'input': {}}]", 'type': 'BadRequestError', 'param': None, 'code': 400}
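
The validation error suggests the stack is forwarding "tools" to vLLM as an empty object ({}) instead of a list (or omitting the key), which vLLM's request validation rejects. Below is a rough sketch that appears to reproduce the same 400 by posting directly to vLLM's OpenAI-compatible endpoint; VLLM_URL and VLLM_API_TOKEN are just the env vars from the podman command above, and the "tools": {} value is my assumption about what the provider ends up sending:

import os
import requests

base = os.environ["VLLM_URL"]            # assumed to point at the /v1 base, as in the server config above
token = os.environ.get("VLLM_API_TOKEN", "")

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 100,
    "temperature": 0,
    "tools": {},  # assumed culprit: the server expects a list here, hence the 'list_type' error
}

r = requests.post(
    f"{base}/chat/completions",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(r.status_code, r.text)  # expect a 400 with the same "Input should be a valid list" message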

Expected behavior

With the 0.1.9 image, the same curl request shown above replies just fine:

{"metrics":[{"metric":"prompt_tokens","value":32,"unit":null},{"metric":"completion_tokens","value":47,"unit":null},{"metric":"total_tokens","value":79,"unit":null}],"completion_message":{"role":"assistant","content":"There once was a Llama Stack high,\nBuilt with blocks that touched the sky,\nIt stood with great care,\nAnd a gentle air,\nThis Llama's tower reached on by.","stop_reason":"end_of_turn","tool_calls":[]},"logprobs":null}
