Misc. bug: server: Usage statistics in chat streams added to slightly different chunk from OpenAI Streaming API #15443

@TeoZosa

Description

Name and Version

$./llama-server --version
version: 6210 (a094f38)
built with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

curl -X POST -H "Content-Type: application/json"  http://localhost:8080/v1/chat/completions  \
-d '{
"stream": true,
"stream_options": {"include_usage": true}, 
"model": "LiquidAI/LFM2-1.2",
"messages": [{"role": "user", "content": "What is an interesting example for this GitHub issue?"}]
}'

Problem description & steps to reproduce

The llama-server streaming response differs from the OpenAI Streaming API spec. From the OpenAI API docs on choices for completion chunk objects (emphasis mine):

choices [array]
A list of chat completion choices. Can contain more than one elements if n is greater than 1. Can also be empty for the last chunk if you set stream_options: {"include_usage": true}.

llama-server streaming response

Currently, usage (and timings) are included in the final llama-server chat.completion.chunk, i.e. the chunk whose single-element choices array carries the "stop" finish_reason and an empty delta object.

Example
...
data: {
  "choices": [
    {
      "finish_reason": null,
      "index": 0,
      "delta": {
        "content": "?"
      }
    }
  ],
  "created": 1755667673,
  "id": "chatcmpl-DwWJriZJ4TnyeMBHO3WmhUJhYUqQg1RP",
  "model": "LiquidAI/LFM2-1.2",
  "system_fingerprint": "b6210-a094f381",
  "object": "chat.completion.chunk"
}
data: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "delta": {}
    }
  ],
  "created": 1755667673,
  "id": "chatcmpl-DwWJriZJ4TnyeMBHO3WmhUJhYUqQg1RP",
  "model": "LiquidAI/LFM2-1.2",
  "system_fingerprint": "b6210-a094f381",
  "object": "chat.completion.chunk",
  "usage": {
    "completion_tokens": 11,
    "prompt_tokens": 14,
    "total_tokens": 25
  },
  "timings": {
    "prompt_n": 14,
    "prompt_ms": 53.734,
    "prompt_per_token_ms": 3.838142857142857,
    "prompt_per_second": 260.5426731678267,
    "predicted_n": 11,
    "predicted_ms": 44.285,
    "predicted_per_token_ms": 4.02590909090909,
    "predicted_per_second": 248.3911030823078
  }
}
data: [DONE]
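To make the placement concrete, here is a minimal sketch (not from the report; the helper name `extract_usage` and the trimmed chunks are illustrative) of a lenient client that accepts usage on any chunk, which is what it takes to pick up llama-server's placement:

```python
import json

def extract_usage(sse_lines):
    """Collect usage from 'data: ...' SSE lines, accepting it on any
    chunk. llama-server currently places it on the same final chunk
    that carries finish_reason, so a lenient scan like this finds it."""
    usage = None
    for line in sse_lines:
        payload = line.removeprefix("data: ").strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        # "usage" is absent (or null) on intermediate chunks
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage

# Trimmed-down llama-server-style stream: usage rides on the
# finish_reason chunk, and no empty-choices chunk follows.
stream = [
    'data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}]}',
    'data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],'
    '"usage":{"completion_tokens":11,"prompt_tokens":14,"total_tokens":25}}',
    'data: [DONE]',
]
print(extract_usage(stream))
```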

OpenAI streaming response

OpenAI sends usage in a separate final chunk with an empty choices array, after the chunk containing the finish_reason.

Example
data: {
  "id": "chatcmpl-C6XHQgbtRbhg8LRSyYGGXixdAGuZn",
  "object": "chat.completion.chunk",
  "created": 1755673932,
  "model": "gpt-4.1-mini-2025-04-14",
  "service_tier": "default",
  "system_fingerprint": "fp_37c45ea698",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "?"
      },
      "logprobs": null,
      "finish_reason": null
    }
  ],
  "usage": null,
  "obfuscation": "A9QOO2mww"
}
data: {
  "id": "chatcmpl-C6XHQgbtRbhg8LRSyYGGXixdAGuZn",
  "object": "chat.completion.chunk",
  "created": 1755673932,
  "model": "gpt-4.1-mini-2025-04-14",
  "service_tier": "default",
  "system_fingerprint": "fp_37c45ea698",
  "choices": [
    {
      "index": 0,
      "delta": {},
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": null,
  "obfuscation": "dZtx"
}
data: {
  "id": "chatcmpl-C6XHQgbtRbhg8LRSyYGGXixdAGuZn",
  "object": "chat.completion.chunk",
  "created": 1755673932,
  "model": "gpt-4.1-mini-2025-04-14",
  "service_tier": "default",
  "system_fingerprint": "fp_37c45ea698",
  "choices": [],
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 9,
    "total_tokens": 28,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "obfuscation": "JjMN5JASlm"
}

data: [DONE]
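This difference matters to strict clients. A sketch of the pattern the OpenAI docs describe (read usage only from the chunk whose choices array is empty; the helper name and the trimmed chunks below are illustrative, not from the report) silently misses llama-server's usage:

```python
import json

def usage_openai_pattern(sse_lines):
    """Read usage only from a chunk with an empty choices array,
    as the OpenAI streaming docs describe for include_usage."""
    for line in sse_lines:
        payload = line.removeprefix("data: ").strip()
        if payload == "[DONE]":
            continue
        chunk = json.loads(payload)
        if not chunk["choices"] and chunk.get("usage"):
            return chunk["usage"]
    return None

# OpenAI-style: usage arrives in a trailing empty-choices chunk.
openai_stream = [
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":null}',
    'data: {"choices":[],"usage":{"prompt_tokens":19,"completion_tokens":9,"total_tokens":28}}',
    'data: [DONE]',
]
# llama-server-style: usage rides on the finish_reason chunk itself.
llama_stream = [
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":14,"completion_tokens":11,"total_tokens":25}}',
    'data: [DONE]',
]
print(usage_openai_pattern(openai_stream))  # usage dict is found
print(usage_openai_pattern(llama_stream))   # None: usage is missed
```

A client written strictly against the documented OpenAI behavior therefore reports no usage at all when pointed at llama-server.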

First Bad Commit

I don't think there was a bad commit per se; this appears to be how it was implemented from the beginning (it took me quite a while to track down).

a0a08ee (lines 2329-2350)

Relevant log output

(from `Problem description & steps to reproduce`)

