Name and Version
>llama-server --version
version: 4689 (90e4dba4)
built with MSVC 19.42.34436.0 for x64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa
Problem description & steps to reproduce
Using "response_format" to request structured output doesn't work with the OpenAI-compatible "v1/chat/completions" API.
The server keeps returning the error: Either "json_schema" or "grammar" can be specified, but not both.
I've tried several different models from HF, and this issue happens no matter which model is loaded.
The model used in the samples below is this one: https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B
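For reference, the same failing request can be reproduced from Python. This is a minimal sketch assuming llama-server is listening on localhost:1234 as started with the command line above; the helper names (build_chat_payload, send_chat_request) are just for illustration:

```python
import json
import urllib.error
import urllib.request

def build_chat_payload(model: str) -> dict:
    """Build the /v1/chat/completions body that triggers the 400 error."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "hello"}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "chat_response",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"response": {"type": "string"}},
                    "required": ["response"],
                    "additionalProperties": False,
                },
            },
        },
    }

def send_chat_request(payload: dict) -> dict:
    """POST the payload; returns the parsed JSON body (here: the 400 error)."""
    req = urllib.request.Request(
        "http://localhost:1234/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as e:
        # The server answers 400 with the error body shown below
        return json.loads(e.read())
```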
Request:
curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}'
Response:
{
"error": {
"code": 400,
"message": "Either \"json_schema\" or \"grammar\" can be specified, but not both",
"type": "invalid_request_error"
}
}
I've tried changing the response_format to various values like the ones below, but it keeps returning the same error.
"response_format": {
"type": "json_schema", // either "json_schema" or "json_object" shows the same error
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
"response_format": {
"type": "json_schema", // either "json_schema" or "json_object" shows the same error
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
Even using the example from the documentation ({"type": "json_object"}) returns the same error:
{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {"type": "json_object"}
}
To add, I tried the POST /completion API with the same GGUF model, passing the schema via the top-level json_schema field, and it does return output following the defined JSON schema:
Request:
curl --location 'http://localhost:1234/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"prompt": "<|im_start|>user\nhello<|im_end|>",
"json_schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}'
Response:
{
"index": 0,
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}",
"tokens": [],
"id_slot": 0,
"stop": true,
"model": "hermes-3-llama-3.1-8b",
"tokens_predicted": 17,
"tokens_evaluated": 6,
"generation_settings": {
"n_predict": -1,
"seed": 4294967295,
"temperature": 0.800000011920929,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"top_k": 40,
"top_p": 0.949999988079071,
"min_p": 0.05000000074505806,
"xtc_probability": 0.0,
"xtc_threshold": 0.10000000149011612,
"typical_p": 1.0,
"repeat_last_n": 64,
"repeat_penalty": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"dry_multiplier": 0.0,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_penalty_last_n": 4096,
"dry_sequence_breakers": [
"\n",
":",
"\"",
"*"
],
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.10000000149011612,
"stop": [],
"max_tokens": -1,
"n_keep": 0,
"n_discard": 0,
"ignore_eos": false,
"stream": false,
"logit_bias": [],
"n_probs": 0,
"min_keep": 0,
"grammar": "char ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nresponse-kv ::= \"\\\"response\\\"\" space \":\" space string\nroot ::= \"{\" space response-kv \"}\" space\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n",
"grammar_trigger_words": [],
"grammar_trigger_tokens": [],
"preserved_tokens": [],
"samplers": [
"penalties",
"dry",
"top_k",
"typ_p",
"top_p",
"min_p",
"xtc",
"temperature"
],
"speculative.n_max": 16,
"speculative.n_min": 5,
"speculative.p_min": 0.8999999761581421,
"timings_per_token": false,
"post_sampling_probs": false,
"lora": []
},
"prompt": "<|begin_of_text|><|im_start|>user\nhello<|im_end|>",
"has_new_line": true,
"truncated": false,
"stop_type": "eos",
"stopping_word": "",
"tokens_cached": 22,
"timings": {
"prompt_n": 6,
"prompt_ms": 1098.932,
"prompt_per_token_ms": 183.15533333333335,
"prompt_per_second": 5.459846469117288,
"predicted_n": 17,
"predicted_ms": 7322.017,
"predicted_per_token_ms": 430.7068823529412,
"predicted_per_second": 2.3217646175910276
}
}
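As a sanity check, the content field of the /completion response above can be parsed and checked against the requested schema using only the standard library. A minimal sketch; the validation is hand-rolled for this one simple object schema, not a general JSON Schema validator:

```python
import json

def matches_response_schema(content: str) -> bool:
    """Check that content is a JSON object with exactly one string field
    "response", mirroring the schema sent in the request above
    (required: ["response"], additionalProperties: false)."""
    try:
        obj = json.loads(content)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    if set(obj.keys()) != {"response"}:
        return False
    return isinstance(obj["response"], str)

# The "content" field from the /completion response above:
content = "{\n  \"response\": \"Hello! How can I assist you today?\"\n}"
print(matches_response_schema(content))  # prints True
```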
First Bad Commit
No response