Eval bug: uncaught std::runtime_error thrown in llama-server during tool use #13812

Open
bjodah opened this issue May 26, 2025 · 3 comments


bjodah commented May 26, 2025

Name and Version

$ /build/llama.cpp-debug/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 3090)
register_backend: registered backend RPC (0 devices)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 9 7950X 16-Core Processor)
load_backend: failed to find ggml_backend_init in /build/llama.cpp-debug/bin/libggml-cuda.so
load_backend: failed to find ggml_backend_init in /build/llama.cpp-debug/bin/libggml-rpc.so
load_backend: failed to find ggml_backend_init in /build/llama.cpp-debug/bin/libggml-cpu.so
version: 5498 (6f180b9)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

Ryzen 7950X + 3090

Models

Qwen3-4B

Problem description & steps to reproduce

For certain requests with tool use on the /v1/chat/completions endpoint, I get an uncaught exception.

To reproduce this, run run.sh from this ephemeral repo:
https://github.com/bjodah/bug-reproducer-llamacpp-partial-parse/tree/main
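
For reference, the failing request is an ordinary OpenAI-style chat completion with tools. A request of roughly the following shape should exercise the same code path; the exact payload is in the linked repo, and the tool schema below is only an illustrative reconstruction based on the run_python_script call visible in the logs further down:

$ curl http://127.0.0.1:11034/v1/chat/completions -H 'Content-Type: application/json' -d '{
      "model": "Qwen3-4B",
      "messages": [
        {"role": "user", "content": "How many days passed between 1999-12-24 and 2025-05-19?"}
      ],
      "tools": [{
        "type": "function",
        "function": {
          "name": "run_python_script",
          "description": "Run a Python script and return its output",
          "parameters": {
            "type": "object",
            "properties": {
              "source": {"type": "string"},
              "args": {"type": "array", "items": {"type": "string"}}
            },
            "required": ["source"]
          }
        }
      }]
    }'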

First Bad Commit

I have not bisected this.

Relevant log output

Output from `llama-server --log-file /logs/llamacpp-Qwen3-4B.log --port 11034 --hf-repo unsloth/Qwen3-4B-GGUF:Q8_0 --n-gpu-layers 999 --jinja --cache-type-k q8_0 --ctx-size 32768 --samplers 'top_k;dry;min_p;temperature;top_p' --min-p 0.005 --top-p 0.97 --top-k 40 --temp 0.7 --dry-multiplier 0.7 --dry-allowed-length 4 --dry-penalty-last-n 2048 --presence-penalty 0.05 --frequency-penalty 0.005 --repeat-penalty 1.01 --repeat-last-n 16`
curl_perform_with_retry: HEAD https://huggingface.co/unsloth/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /home/bjorn/.cache/llama.cpp/unsloth_Qwen3-4B-GGUF_Qwen3-4B-Q8_0.gguf
build: 5498 (6f180b91) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu (debug)
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | FORCE_MMQ = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 11034, http threads: 31
main: loading model
srv    load_model: loading model '/home/bjorn/.cache/llama.cpp/unsloth_Qwen3-4B-GGUF_Qwen3-4B-Q8_0.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 16124 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /home/bjorn/.cache/llama.cpp/unsloth_Qwen3-4B-GGUF_Qwen3-4B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-4B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3-4B
llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   5:                         general.size_label str              = 4B
llama_model_loader: - kv   6:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   7:                          qwen3.block_count u32              = 36
llama_model_loader: - kv   8:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   9:                     qwen3.embedding_length u32              = 2560
llama_model_loader: - kv  10:                  qwen3.feed_forward_length u32              = 9728
llama_model_loader: - kv  11:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv  12:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  16:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - kv  27:                          general.file_type u32              = 7
llama_model_loader: - kv  28:                      quantize.imatrix.file str              = Qwen3-4B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv  29:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-4B.txt
llama_model_loader: - kv  30:             quantize.imatrix.entries_count i32              = 252
llama_model_loader: - kv  31:              quantize.imatrix.chunks_count i32              = 685
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q8_0:  253 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 3.98 GiB (8.50 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 2560
print_info: n_layer          = 36
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 9728
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 4B
print_info: model params     = 4.02 B
print_info: general.name     = Qwen3-4B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
load_tensors: layer  28 assigned to device CUDA0, is_swa = 0
load_tensors: layer  29 assigned to device CUDA0, is_swa = 0
load_tensors: layer  30 assigned to device CUDA0, is_swa = 0
load_tensors: layer  31 assigned to device CUDA0, is_swa = 0
load_tensors: layer  32 assigned to device CUDA0, is_swa = 0
load_tensors: layer  33 assigned to device CUDA0, is_swa = 0
load_tensors: layer  34 assigned to device CUDA0, is_swa = 0
load_tensors: layer  35 assigned to device CUDA0, is_swa = 0
load_tensors: layer  36 assigned to device CUDA0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   394.12 MiB
load_tensors:        CUDA0 model buffer size =  4076.43 MiB
.....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (32768) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
create_memory: n_ctx = 32768 (padded)
llama_kv_cache_unified: layer   0: dev = CUDA0
llama_kv_cache_unified: layer   1: dev = CUDA0
llama_kv_cache_unified: layer   2: dev = CUDA0
llama_kv_cache_unified: layer   3: dev = CUDA0
llama_kv_cache_unified: layer   4: dev = CUDA0
llama_kv_cache_unified: layer   5: dev = CUDA0
llama_kv_cache_unified: layer   6: dev = CUDA0
llama_kv_cache_unified: layer   7: dev = CUDA0
llama_kv_cache_unified: layer   8: dev = CUDA0
llama_kv_cache_unified: layer   9: dev = CUDA0
llama_kv_cache_unified: layer  10: dev = CUDA0
llama_kv_cache_unified: layer  11: dev = CUDA0
llama_kv_cache_unified: layer  12: dev = CUDA0
llama_kv_cache_unified: layer  13: dev = CUDA0
llama_kv_cache_unified: layer  14: dev = CUDA0
llama_kv_cache_unified: layer  15: dev = CUDA0
llama_kv_cache_unified: layer  16: dev = CUDA0
llama_kv_cache_unified: layer  17: dev = CUDA0
llama_kv_cache_unified: layer  18: dev = CUDA0
llama_kv_cache_unified: layer  19: dev = CUDA0
llama_kv_cache_unified: layer  20: dev = CUDA0
llama_kv_cache_unified: layer  21: dev = CUDA0
llama_kv_cache_unified: layer  22: dev = CUDA0
llama_kv_cache_unified: layer  23: dev = CUDA0
llama_kv_cache_unified: layer  24: dev = CUDA0
llama_kv_cache_unified: layer  25: dev = CUDA0
llama_kv_cache_unified: layer  26: dev = CUDA0
llama_kv_cache_unified: layer  27: dev = CUDA0
llama_kv_cache_unified: layer  28: dev = CUDA0
llama_kv_cache_unified: layer  29: dev = CUDA0
llama_kv_cache_unified: layer  30: dev = CUDA0
llama_kv_cache_unified: layer  31: dev = CUDA0
llama_kv_cache_unified: layer  32: dev = CUDA0
llama_kv_cache_unified: layer  33: dev = CUDA0
llama_kv_cache_unified: layer  34: dev = CUDA0
llama_kv_cache_unified: layer  35: dev = CUDA0
llama_kv_cache_unified:      CUDA0 KV buffer size =  3528.00 MiB
llama_kv_cache_unified: size = 3528.00 MiB ( 32768 cells,  36 layers,  1 seqs), K (q8_0): 1224.00 MiB, V (f16): 2304.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 2138.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 69.01 MiB
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      CUDA0 compute buffer size =  2138.00 MiB
llama_context:  CUDA_Host compute buffer size =    69.01 MiB
llama_context: graph nodes  = 1446
llama_context: graph splits = 2
clear_adapter_lora: call
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [2560 2 1 1]
set_warmup: value = 0
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, chat_template: {%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- messages[0].content + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for forward_message in messages %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- set message = messages[index] %}
    {%- set current_content = message.content if message.content is defined and message.content is not none else '' %}
    {%- set tool_start = '<tool_response>' %}
    {%- set tool_start_length = tool_start|length %}
    {%- set start_of_message = current_content[:tool_start_length] %}
    {%- set tool_end = '</tool_response>' %}
    {%- set tool_end_length = tool_end|length %}
    {%- set start_pos = (current_content|length) - tool_end_length %}
    {%- if start_pos < 0 %}
        {%- set start_pos = 0 %}
    {%- endif %}
    {%- set end_of_message = current_content[start_pos:] %}
    {%- if ns.multi_step_tool and message.role == "user" and not(start_of_message == tool_start and end_of_message == tool_end) %}
        {%- set ns.multi_step_tool = false %}
        {%- set ns.last_query_index = index %}
    {%- endif %}
{%- endfor %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set m_content = message.content if message.content is defined and message.content is not none else '' %}
        {%- set content = m_content %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is defined and message.reasoning_content is not none %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in m_content %}
                {%- set content = (m_content.split('</think>')|last).lstrip('\n') %}
                {%- set reasoning_content = (m_content.split('</think>')|first).rstrip('\n') %}
                {%- set reasoning_content = (reasoning_content.split('<think>')|last).lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- if loop.index0 > ns.last_query_index %}
            {%- if loop.last or (not loop.last and (not reasoning_content.strip() == '')) %}
                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
            {%- else %}
                {{- '<|im_start|>' + message.role + '\n' + content }}
            {%- endif %}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- endif %}
{%- endif %}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:11034 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 21
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 21, n_tokens = 21, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 21, n_tokens = 21
set_embeddings: value = 0
clear_adapter_lora: call
check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [2560 21 1 1]
set_embeddings: value = 0
clear_adapter_lora: call
<<...removed many repeating lines here...>>
set_embeddings: value = 0
clear_adapter_lora: call
set_embeddings: value = 0
clear_adapter_lora: call
slot      release: id  0 | task 0 | stop processing: n_past = 165, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =      53.41 ms /    21 tokens (    2.54 ms per token,   393.16 tokens per second)
       eval time =    4008.98 ms /   145 tokens (   27.65 ms per token,    36.17 tokens per second)
      total time =    4062.39 ms /   166 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 146 | processing task
slot update_slots: id  0 | task 146 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 783
slot update_slots: id  0 | task 146 | kv cache rm [1, end)
slot update_slots: id  0 | task 146 | prompt processing progress, n_past = 783, n_tokens = 782, progress = 0.998723
slot update_slots: id  0 | task 146 | prompt done, n_past = 783, n_tokens = 782
set_embeddings: value = 0
clear_adapter_lora: call
check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [2560 512 1 1]
check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [2560 270 1 1]
Grammar still awaiting trigger after token 151667 (`<think>`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 271 (`

`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 151668 (`</think>`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 271 (`

`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar triggered on regex: '<tool_call>'
set_embeddings: value = 0
clear_adapter_lora: call
set_embeddings: value = 0
clear_adapter_lora: call
<<...removed many repeating lines here...>>
set_embeddings: value = 0
clear_adapter_lora: call
slot      release: id  0 | task 146 | stop processing: n_past = 876, truncated = 0
slot print_timing: id  0 | task 146 | 
prompt eval time =     368.46 ms /   782 tokens (    0.47 ms per token,  2122.35 tokens per second)
       eval time =    2695.08 ms /    94 tokens (   28.67 ms per token,    34.88 tokens per second)
      total time =    3063.54 ms /   876 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 241 | processing task
slot update_slots: id  0 | task 241 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 888
slot update_slots: id  0 | task 241 | kv cache rm [783, end)
slot update_slots: id  0 | task 241 | prompt processing progress, n_past = 888, n_tokens = 105, progress = 0.118243
slot update_slots: id  0 | task 241 | prompt done, n_past = 888, n_tokens = 105
set_embeddings: value = 0
clear_adapter_lora: call
check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [2560 105 1 1]
Grammar still awaiting trigger after token 151667 (`<think>`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 271 (`

`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 151668 (`</think>`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 271 (`

`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 24 (`9`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 17 (`2`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 22 (`7`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 23 (`8`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 151645 (`<|im_end|>`)
slot      release: id  0 | task 241 | stop processing: n_past = 896, truncated = 0
slot print_timing: id  0 | task 241 | 
prompt eval time =      81.52 ms /   105 tokens (    0.78 ms per token,  1288.01 tokens per second)
       eval time =     225.42 ms /     9 tokens (   25.05 ms per token,    39.93 tokens per second)
      total time =     306.94 ms /   114 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 251 | processing task
slot update_slots: id  0 | task 251 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 783
slot update_slots: id  0 | task 251 | need to evaluate at least 1 token to generate logits, n_past = 783, n_prompt_tokens = 783
slot update_slots: id  0 | task 251 | kv cache rm [782, end)
slot update_slots: id  0 | task 251 | prompt processing progress, n_past = 783, n_tokens = 1, progress = 0.001277
slot update_slots: id  0 | task 251 | prompt done, n_past = 783, n_tokens = 1
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 151667 (`<think>`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 271 (`

`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 151657 (`<tool_call>`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 198 (`
`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 4913 (`{"`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 606 (`name`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 788 (`":`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 330 (` "`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 6108 (`run`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 55869 (`_python`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 14660 (`_script`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 497 (`",`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 330 (` "`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 16370 (`arguments`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 788 (`":`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 5212 (` {"`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 2427 (`source`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 788 (`":`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 5869 (` "#`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 87980 (`!/`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 7063 (`usr`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 8749 (`/bin`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 14358 (`/env`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 10135 (` python`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1699 (`\n`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1499 (`from`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 8874 (` datetime`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1159 (` import`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 8874 (` datetime`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1699 (`\n`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1699 (`\n`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 2 (`#`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 14775 (` Parse`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 279 (` the`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 12713 (` dates`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1699 (`\n`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1028 (`date`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 16 (`1`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 284 (` =`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 8874 (` datetime`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 47433 (`.strptime`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 492 (`('`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 16 (`1`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 24 (`9`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 24 (`9`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 24 (`9`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 12 (`-`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 16 (`1`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 17 (`2`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 12 (`-`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 17 (`2`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 19 (`4`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 516 (`',`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 7677 (` '%`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 56 (`Y`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 11069 (`-%`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 76 (`m`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 11069 (`-%`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 67 (`d`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 863 (`')`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 59 (`\`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 303 (`nd`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 349 (`ate`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 17 (`2`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 284 (` =`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 8874 (` datetime`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 47433 (`.strptime`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 492 (`('`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 17 (`2`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 15 (`0`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 17 (`2`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 20 (`5`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 12 (`-`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 15 (`0`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 20 (`5`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 12 (`-`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 16 (`1`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 24 (`9`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 516 (`',`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 7677 (` '%`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 56 (`Y`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 11069 (`-%`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 76 (`m`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 11069 (`-%`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 67 (`d`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 863 (`')`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 59 (`\`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 77 (`n`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1699 (`\n`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 2 (`#`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 20517 (` Calculate`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 279 (` the`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 6672 (` difference`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 304 (` in`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 2849 (` days`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1699 (`\n`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 20255 (`delta`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 284 (` =`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 2400 (` date`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 17 (`2`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 481 (` -`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 2400 (` date`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 16 (`1`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1699 (`\n`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1699 (`\n`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 2 (`#`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 3411 (` Return`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 279 (` the`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1372 (` number`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 315 (` of`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 2849 (` days`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1699 (`\n`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 1350 (`print`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 36073 (`(delta`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 54142 (`.days`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 10699 (`)\`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 77 (`n`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 497 (`",`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 330 (` "`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 2116 (`args`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 788 (`":`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 3056 (` []`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 11248 (`}}
`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 151658 (`</tool_call>`)
set_embeddings: value = 0
clear_adapter_lora: call
Grammar still awaiting trigger after token 151645 (`<|im_end|>`)
slot      release: id  0 | task 251 | stop processing: n_past = 907, truncated = 0
slot print_timing: id  0 | task 251 | 
prompt eval time =      33.27 ms /     1 tokens (   33.27 ms per token,    30.06 tokens per second)
       eval time =    3591.84 ms /   125 tokens (   28.73 ms per token,    34.80 tokens per second)
      total time =    3625.11 ms /   126 tokens

The server aborts with the message:
terminate called after throwing an instance of 'std::runtime_error'
  what():  </think>

GDB debugging session

Thread 1 "llama-server" received signal SIGABRT, Aborted.
0x00007ade9efc9eec in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ade9efc9eec in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ade9ef7afb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ade9ef65472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ade9f29d919 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ade9f2a8e1a in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ade9f2a8e85 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ade9f2a90d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x000059e128f21a51 in common_chat_parse (input=..., is_partial=false, syntax=...) at /home/bjorn/vc/llama.cpp/common/chat.cpp:1923
#8  0x000059e128d74671 in server_slot::update_chat_msg (this=0x59e14cdcd730, diffs=...) at /home/bjorn/vc/llama.cpp/tools/server/server.cpp:1413
#9  0x000059e128d7f3d4 in server_context::send_final_response (this=0x7fff49cf9730, slot=...) at /home/bjorn/vc/llama.cpp/tools/server/server.cpp:2520
#10 0x000059e128d84a59 in server_context::update_slots (this=0x7fff49cf9730) at /home/bjorn/vc/llama.cpp/tools/server/server.cpp:3497
#11 0x000059e128d30477 in operator() (__closure=0x7fff49cfad08) at /home/bjorn/vc/llama.cpp/tools/server/server.cpp:4928
#12 0x000059e128d3e22a in std::__invoke_impl<void, main(int, char**)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...)
    at /usr/include/c++/12/bits/invoke.h:61
#13 0x000059e128d3c176 in std::__invoke_r<void, main(int, char**)::<lambda()>&>(struct {...} &) (__fn=...) at /usr/include/c++/12/bits/invoke.h:111
#14 0x000059e128d38462 in std::_Function_handler<void(), main(int, char**)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/12/bits/std_function.h:290
#15 0x000059e128d89bda in std::function<void ()>::operator()() const (this=0x7fff49cfad08) at /usr/include/c++/12/bits/std_function.h:591
#16 0x000059e128d76dd0 in server_queue::start_loop (this=0x7fff49cfabe8) at /home/bjorn/vc/llama.cpp/tools/server/server.cpp:1684
#17 0x000059e128d32d53 in main (argc=38, argv=0x7fff49cfd0d8) at /home/bjorn/vc/llama.cpp/tools/server/server.cpp:4953
(gdb) f 8
#8  0x000059e128d74671 in server_slot::update_chat_msg (this=0x59e14cdcd730, diffs=...) at /home/bjorn/vc/llama.cpp/tools/server/server.cpp:1413
1413                params.oaicompat_chat_syntax);
(gdb) l
1408            auto previous_msg = chat_msg;
1409            SRV_DBG("Parsing chat message: %s\n", generated_text.c_str());
1410            auto new_msg = common_chat_parse(
1411                generated_text,
1412                /* is_partial= */ stop != STOP_TYPE_EOS,
1413                params.oaicompat_chat_syntax);
1414            if (!new_msg.empty()) {
1415                new_msg.ensure_tool_call_ids_set(generated_tool_call_ids, gen_tool_call_id);
1416                chat_msg = new_msg;
1417                diffs = common_chat_msg_diff::compute_diffs(previous_msg, new_msg.empty() ? previous_msg : new_msg);
(gdb) p generated_text
$4 = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<std::__new_allocator<char>> = {<No data fields>}, <No data fields>},
    _M_p = 0x59e1445c9050 "<think>\n\n<tool_call>\n{\"name\": \"run_python_script\", \"arguments\": {\"source\": \"#!/usr/bin/env python\\nfrom datetime import datetime\\n\\n# Parse the dates\\ndate1 = datetime.strptime('1999-12-24', '%Y-%m-%d')\\ndate2 = datetime.strptime('2025-05-19', '%Y-%m-%d')\\n\\n# Calculate the difference in days\\ndelta = date2 - date1\\n\\n# Return the number of days\\nprint(delta.days)\\n\", \"args\": []}}\n</tool_call>"},
  _M_string_length = 396, {_M_local_buf = "\300\003\000\000\000\000\000\000\n\000\000\000\000\000\000", _M_allocated_capacity = 960}}
(gdb) p params.oaicompat_chat_syntax
$6 = {format = COMMON_CHAT_FORMAT_HERMES_2_PRO, reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK, reasoning_in_content = false,
  thinking_forced_open = false, parse_tool_calls = true}

So the exception is thrown from common/chat.cpp:1923 (frame #7 in the backtrace above):

throw std::runtime_error(ex.what());

Since the exception propagates out of the request handling instead of being turned into e.g. an HTTP 500 response, I guess this constitutes a bug?
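
Something along these lines in server_slot::update_chat_msg would avoid taking the whole process down (a rough sketch only; the names are taken from the gdb listing above, but the try/catch itself is hypothetical and not existing llama.cpp code):

    try {
        auto new_msg = common_chat_parse(
            generated_text,
            /* is_partial= */ stop != STOP_TYPE_EOS,
            params.oaicompat_chat_syntax);
        if (!new_msg.empty()) {
            new_msg.ensure_tool_call_ids_set(generated_tool_call_ids, gen_tool_call_id);
            chat_msg = new_msg;
        }
    } catch (const std::exception & e) {
        // instead of letting the exception escape update_slots() and hit
        // std::terminate(), record the failure so the HTTP layer can
        // respond with a 500 for this request only
        SRV_DBG("failed to parse generated chat message: %s\n", e.what());
    }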

That `<think>\n\n` part before `<tool_call>`, with no closing `</think>` (which is exactly the what() of the exception), looks suspicious, no?


bjodah commented May 26, 2025

This might be due to a malformed jinja template in unsloth's quant, xref: https://huggingface.co/unsloth/Qwen3-4B-GGUF/discussions/4

In case they update their gguf, this is the one I'm using:
https://huggingface.co/unsloth/Qwen3-4B-GGUF/blob/110fd0a15a3a0f2461a25729a2a2c375caf975ca/Qwen3-4B-Q8_0.gguf

$ openssl sha256 ~/.cache/llama.cpp/unsloth_Qwen3-4B-GGUF_Qwen3-4B-Q8_0.gguf
SHA2-256(/home/bjorn/.cache/llama.cpp/unsloth_Qwen3-4B-GGUF_Qwen3-4B-Q8_0.gguf)= eed555233267a33c7e8ee31682762cc7751b3f6d224039086e0e846f05fffa5d


gramss commented May 27, 2025

Hi there,

I might have a similar problem with the unsloth models.
I am running on an M1 (macOS) machine.

Running the tutorial here works: https://docs.unsloth.ai/basics/devstral-how-to-run-and-fine-tune#possible-vision-support
But the experimental vision-model part does not.

In general, vision models work on my machine with the latest llama.cpp implementations: https://simonwillison.net/2025/May/10/llama-cpp-vision/

$> llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL

For the unsloth image section (which might involve tool usage? I am not 100% sure in this environment) I get an exception thrown when I ask it something about a screenshot. The same screenshot works with unsloth/gemma-3-4b-it-GGUF:Q4_K_XL without a problem.

Here is the crash report for unsloth Devstral when using the experimental vision part:

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_PROTECTION_FAILURE at 0x000000016cecffc0
Exception Codes:       0x0000000000000002, 0x000000016cecffc0

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [49086]

VM Region Info: 0x16cecffc0 is in 0x1696cc000-0x16ced0000;  bytes after start: 58736576  bytes before end: 63
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      MALLOC_MEDIUM               160000000-168000000    [128.0M] rw-/rwx SM=PRV  
      GAP OF 0x16cc000 BYTES
--->  STACK GUARD                 1696cc000-16ced0000    [ 56.0M] ---/rwx SM=NUL  stack guard for thread 0
      Stack                       16ced0000-16d6cc000    [ 8176K] rw-/rwx SM=SHM  thread 0

Also, during server startup I see this note:

Failed to infer a tool call example (possible template bug)

I am not seeing this issue in the unsloth/gemma-3-4b-it-GGUF:Q4_K_XL ..


Another issue might be that

$> llama-cli -hf unsloth/Devstral-Small-2505-GGUF:UD-Q4_K_XL --jinja

was not downloading the mmproj .gguf files. I added them manually, but there is no corresponding .json for that file. Maybe this is also a problem in my case.
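
In case it helps anyone trying the same: a manually downloaded projector file can be passed to the server explicitly via --mmproj; the path below is just a placeholder for wherever the file was saved:

$> llama-server -hf unsloth/Devstral-Small-2505-GGUF:UD-Q4_K_XL --mmproj /path/to/mmproj.gguf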

Thanks for the help! :)


gramss commented May 27, 2025

FYI, I could "fix" my issue by downloading the model again like this:

llama-server -hf unsloth/Devstral-Small-2505-GGUF:UD-Q4_K_XL

without the --jinja arg. Now it runs fine.

But I still have the

Failed to infer a tool call example (possible template bug)

issue with this model.
