Name and Version
>llama-server --version
version: 4689 (90e4dba4)
built with MSVC 19.42.34436.0 for x64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa
Problem description & steps to reproduce
Using "response_format" to request structured output doesn't work with the OpenAI-compatible "v1/chat/completions" API.
The server keeps returning the error: Either "json_schema" or "grammar" can be specified, but not both.
I've tried several different models from HF, and this issue happens no matter which model is loaded.
The model used in the samples below is this one: https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B
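For reference, the same failing request can be reproduced from Python. This is a minimal sketch assuming llama-server is listening on localhost:1234 as started with the command line above; the helper names (build_chat_payload, send_chat_request) are just for illustration:

```python
import json
import urllib.error
import urllib.request

def build_chat_payload(model: str) -> dict:
    """Build the /v1/chat/completions body that triggers the 400 error."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "hello"}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "chat_response",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"response": {"type": "string"}},
                    "required": ["response"],
                    "additionalProperties": False,
                },
            },
        },
    }

def send_chat_request(payload: dict) -> dict:
    """POST the payload; returns the parsed JSON body (here: the 400 error)."""
    req = urllib.request.Request(
        "http://localhost:1234/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as e:
        # The server answers 400 with the error body shown below
        return json.loads(e.read())
```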
Request:
curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}'
Response:
{
"error": {
"code": 400,
"message": "Either \"json_schema\" or \"grammar\" can be specified, but not both",
"type": "invalid_request_error"
}
}
I've tried changing the response_format to various values like the ones below, but it keeps returning the same error.
"response_format": {
"type": "json_schema", // either "json_schema" or "json_object" shows the same error
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
"response_format": {
"type": "json_schema", // either "json_schema" or "json_object" shows the same error
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
Even using the example from the documentation ({"type": "json_object"}) returns the same error:
{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {"type": "json_object"}
}
To add, I tried the POST /completion API with the same GGUF model, passing the schema via the top-level json_schema field, and it does return output following the defined JSON schema:
Request:
curl --location 'http://localhost:1234/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"prompt": "<|im_start|>user\nhello<|im_end|>",
"json_schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}'
Response:
{
"index": 0,
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}",
"tokens": [],
"id_slot": 0,
"stop": true,
"model": "hermes-3-llama-3.1-8b",
"tokens_predicted": 17,
"tokens_evaluated": 6,
"generation_settings": {
"n_predict": -1,
"seed": 4294967295,
"temperature": 0.800000011920929,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"top_k": 40,
"top_p": 0.949999988079071,
"min_p": 0.05000000074505806,
"xtc_probability": 0.0,
"xtc_threshold": 0.10000000149011612,
"typical_p": 1.0,
"repeat_last_n": 64,
"repeat_penalty": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"dry_multiplier": 0.0,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_penalty_last_n": 4096,
"dry_sequence_breakers": [
"\n",
":",
"\"",
"*"
],
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.10000000149011612,
"stop": [],
"max_tokens": -1,
"n_keep": 0,
"n_discard": 0,
"ignore_eos": false,
"stream": false,
"logit_bias": [],
"n_probs": 0,
"min_keep": 0,
"grammar": "char ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nresponse-kv ::= \"\\\"response\\\"\" space \":\" space string\nroot ::= \"{\" space response-kv \"}\" space\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n",
"grammar_trigger_words": [],
"grammar_trigger_tokens": [],
"preserved_tokens": [],
"samplers": [
"penalties",
"dry",
"top_k",
"typ_p",
"top_p",
"min_p",
"xtc",
"temperature"
],
"speculative.n_max": 16,
"speculative.n_min": 5,
"speculative.p_min": 0.8999999761581421,
"timings_per_token": false,
"post_sampling_probs": false,
"lora": []
},
"prompt": "<|begin_of_text|><|im_start|>user\nhello<|im_end|>",
"has_new_line": true,
"truncated": false,
"stop_type": "eos",
"stopping_word": "",
"tokens_cached": 22,
"timings": {
"prompt_n": 6,
"prompt_ms": 1098.932,
"prompt_per_token_ms": 183.15533333333335,
"prompt_per_second": 5.459846469117288,
"predicted_n": 17,
"predicted_ms": 7322.017,
"predicted_per_token_ms": 430.7068823529412,
"predicted_per_second": 2.3217646175910276
}
}
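As a sanity check, the content field of the /completion response above can be parsed and checked against the requested schema using only the standard library. A minimal sketch; the validation is hand-rolled for this one simple object schema, not a general JSON Schema validator:

```python
import json

def matches_response_schema(content: str) -> bool:
    """Check that content is a JSON object with exactly one string field
    "response", mirroring the schema sent in the request above
    (required: ["response"], additionalProperties: false)."""
    try:
        obj = json.loads(content)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    if set(obj.keys()) != {"response"}:
        return False
    return isinstance(obj["response"], str)

# The "content" field from the /completion response above:
content = "{\n  \"response\": \"Hello! How can I assist you today?\"\n}"
print(matches_response_schema(content))  # prints True
```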
First Bad Commit
No response