Name and Version
$ ./llama-server --version
version: 6210 (a094f38)
built with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0
Operating systems
Mac
Which llama.cpp modules do you know to be affected?
llama-server
Command line
curl -X POST -H "Content-Type: application/json" http://localhost:8080/v1/chat/completions \
-d '{
"stream": true,
"stream_options": {"include_usage": true},
"model": "LiquidAI/LFM2-1.2",
"messages": [{"role": "user", "content": "What is an interesting example for this GitHub issue?"}]
}'
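
For reference, the same request can be made with the official `openai` Python client (a minimal sketch; assumes a client version recent enough to support `stream_options`, and the placeholder API key is arbitrary since llama-server does not check it by default):

```python
from openai import OpenAI

# Point the official client at the local llama-server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

stream = client.chat.completions.create(
    model="LiquidAI/LFM2-1.2",
    messages=[{"role": "user", "content": "What is an interesting example for this GitHub issue?"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    # Against OpenAI, `usage` arrives in a trailing chunk whose `choices` is empty;
    # against llama-server, it is attached to the same chunk as the stop finish_reason.
    print(chunk.choices, chunk.usage)
```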
Problem description & steps to reproduce
The `llama-server` streaming response differs from the OpenAI streaming API spec. From the OpenAI API docs on `choices` for chat completion chunk objects (emphasis mine):

> **choices** (array)
>
> A list of chat completion choices. Can contain more than one elements if `n` is greater than 1. **Can also be empty for the last chunk if you set `stream_options: {"include_usage": true}`.**
`llama-server` streaming response
Currently, `usage` and `timings` are included in the final `llama-server` `chat.completion.chunk`, which contains a singleton `choices` array holding the `stop` `finish_reason` and an empty `delta` object.
Example
...
data: {
"choices": [
{
"finish_reason": null,
"index": 0,
"delta": {
"content": "?"
}
}
],
"created": 1755667673,
"id": "chatcmpl-DwWJriZJ4TnyeMBHO3WmhUJhYUqQg1RP",
"model": "LiquidAI/LFM2-1.2",
"system_fingerprint": "b6210-a094f381",
"object": "chat.completion.chunk"
}
data: {
"choices": [
{
"finish_reason": "stop",
"index": 0,
"delta": {}
}
],
"created": 1755667673,
"id": "chatcmpl-DwWJriZJ4TnyeMBHO3WmhUJhYUqQg1RP",
"model": "LiquidAI/LFM2-1.2",
"system_fingerprint": "b6210-a094f381",
"object": "chat.completion.chunk",
"usage": {
"completion_tokens": 11,
"prompt_tokens": 14,
"total_tokens": 25
},
"timings": {
"prompt_n": 14,
"prompt_ms": 53.734,
"prompt_per_token_ms": 3.838142857142857,
"prompt_per_second": 260.5426731678267,
"predicted_n": 11,
"predicted_ms": 44.285,
"predicted_per_token_ms": 4.02590909090909,
"predicted_per_second": 248.3911030823078
}
}
data: [DONE]
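
Why this matters: a client that follows the OpenAI convention to the letter will only look for `usage` in a chunk whose `choices` array is empty, and so silently drops the usage data llama-server sends. A minimal sketch of such a consumer (the helper name and the raw-SSE framing are illustrative, not from any particular SDK):

```python
import json

def usage_from_sse(lines):
    """Collect `usage` the way an OpenAI-spec client would: only from a
    chunk whose `choices` array is empty."""
    usage = None
    for line in lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        if not chunk.get("choices") and chunk.get("usage"):
            usage = chunk["usage"]
    return usage

# Fed the llama-server stream above, this returns None, because `usage`
# rides on the same chunk as the non-empty `choices` array.
```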
OpenAI streaming response
Here, `usage` is sent in a final chunk with an empty `choices` array, after the chunk containing the `finish_reason`.
Example
data: {
"id": "chatcmpl-C6XHQgbtRbhg8LRSyYGGXixdAGuZn",
"object": "chat.completion.chunk",
"created": 1755673932,
"model": "gpt-4.1-mini-2025-04-14",
"service_tier": "default",
"system_fingerprint": "fp_37c45ea698",
"choices": [
{
"index": 0,
"delta": {
"content": "?"
},
"logprobs": null,
"finish_reason": null
}
],
"usage": null,
"obfuscation": "A9QOO2mww"
}
data: {
"id": "chatcmpl-C6XHQgbtRbhg8LRSyYGGXixdAGuZn",
"object": "chat.completion.chunk",
"created": 1755673932,
"model": "gpt-4.1-mini-2025-04-14",
"service_tier": "default",
"system_fingerprint": "fp_37c45ea698",
"choices": [
{
"index": 0,
"delta": {},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": null,
"obfuscation": "dZtx"
}
data: {
"id": "chatcmpl-C6XHQgbtRbhg8LRSyYGGXixdAGuZn",
"object": "chat.completion.chunk",
"created": 1755673932,
"model": "gpt-4.1-mini-2025-04-14",
"service_tier": "default",
"system_fingerprint": "fp_37c45ea698",
"choices": [],
"usage": {
"prompt_tokens": 19,
"completion_tokens": 9,
"total_tokens": 28,
"prompt_tokens_details": {
"cached_tokens": 0,
"audio_tokens": 0
},
"completion_tokens_details": {
"reasoning_tokens": 0,
"audio_tokens": 0,
"accepted_prediction_tokens": 0,
"rejected_prediction_tokens": 0
}
},
"obfuscation": "JjMN5JASlm"
}
data: [DONE]
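
For comparison with the llama-server output above, a hedged sketch (hypothetical helper, not actual llama.cpp code) of how the non-compliant final chunk could be split into the OpenAI-style ordering, moving `usage` (and the llama.cpp-specific `timings`) into a trailing chunk with empty `choices`:

```python
def split_final_chunk(chunk: dict) -> tuple[dict, dict]:
    """Split a llama-server final chunk into the OpenAI-style pair:
    the finish chunk without usage, then a usage-only chunk with empty choices."""
    finish = {k: v for k, v in chunk.items() if k not in ("usage", "timings")}
    trailer = {k: v for k, v in chunk.items() if k not in ("choices", "usage", "timings")}
    trailer["choices"] = []
    trailer["usage"] = chunk.get("usage")
    if "timings" in chunk:
        trailer["timings"] = chunk["timings"]
    return finish, trailer
```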
First Bad Commit
I don't think there was a bad commit per se; this looks to be how it was implemented from the beginning (it took me quite a while to track down).
Relevant log output
See the `llama-server` streaming example under `Problem description & steps to reproduce` above.