
Eval bug: Segmentation fault when loading SmolVLM-500M-Instruct-Q8_0.gguf on Termux / Android ARM64 (only in Termux, not in proot environments; other gguf files work fine) #13708

Closed
@Manamama

Description


Name and Version

While toying with llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF, I discovered the issue below.

ChatGPT wrote most of it:


Segmentation fault when loading SmolVLM-500M-Instruct-Q8_0.gguf on Termux / Android ARM64

Environment:

Device:    MediaTek MT6785V/CD (8-core ARMv8)
OS:        Android 11 aarch64 (via Termux)
Kernel:    4.14.186+
Shell:     bash 5.2.37 (via Termux)
Compiler:  clang 20.1.5 (termux build)
llama.cpp: commit 6b56a646 (build 5453)

Command:

llama-server -m /data/data/com.termux/files/home/.cache/llama.cpp/ggml-org_SmolVLM-500M-Instruct-GGUF_SmolVLM-500M-Instruct-Q8_0.gguf

Result:

...
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

Thread 1 "llama-server" received signal SIGSEGV, Segmentation fault.
0x0000007fbded0d38 in VTT for std::__ndk1::basic_ostringstream<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > () from /data/data/com.termux/files/usr/lib/libc++_shared.so

Behavior:

  • Works with other .gguf files (e.g., TinyLlama, Mistral, Phi-2)
  • Only crashes on this model (SmolVLM-500M-Instruct-Q8_0.gguf)
  • Reproducible across llama-cli and llama-server, even with --no-warmup
  • Works on same hardware when using other quantized GGUF files
  • Crashes during warmup, likely during debug/diagnostic logging that internally uses std::ostringstream
  • Same binary works for other models, suggesting the issue is data-triggered, not toolchain-related

Hypothesis:

  • Likely a malformed or corrupted string in tokenizer.chat_template, tokenizer.ggml.tokens, or another metadata array causes undefined behavior when streamed via std::ostringstream, possibly due to invalid UTF-8 or a bug in the parsing logic.

gguf-dump output shows:

- kv 45: tokenizer.chat_template str = <|im_start|>{% for message in messages %}...
- kv 38: tokenizer.ggml.tokens arr[str,49280]
- kv 40: tokenizer.ggml.merges arr[str,48900]

These are candidates for malformed input that could cause ostringstream to crash at VTT dispatch.
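
One way to narrow this down would be to dump and byte-scan the suspect field directly with ggml's gguf C API (gguf_init_from_file, gguf_find_key, gguf_get_val_str, gguf_free, declared in gguf.h). The rough sketch below is untested against this model and only flags bytes that can never occur in well-formed UTF-8; it is a diagnostic aid, not a patch:

// gguf_check_template.cpp - rough diagnostic sketch using ggml's gguf C API.
#include <cstdio>
#include <cstring>
#include "gguf.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }
    // Read only the metadata, without allocating tensor data.
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        std::fprintf(stderr, "failed to read gguf metadata\n");
        return 1;
    }
    const auto key = gguf_find_key(ctx, "tokenizer.chat_template");
    if (key >= 0) {
        const char * tmpl = gguf_get_val_str(ctx, key);
        const size_t n = std::strlen(tmpl);
        std::printf("chat_template length: %zu bytes\n", n);
        // Crude scan: 0xC0, 0xC1 and 0xF5-0xFF are invalid in any UTF-8 position.
        for (size_t i = 0; i < n; ++i) {
            const unsigned char c = (unsigned char) tmpl[i];
            if (c == 0xC0 || c == 0xC1 || c >= 0xF5) {
                std::printf("suspicious byte 0x%02X at offset %zu\n", c, i);
            }
        }
    } else {
        std::printf("tokenizer.chat_template not found\n");
    }
    gguf_free(ctx);
    return 0;
}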

Reproduction confirmed on:

  • Fresh install of Termux with no NDK environment interference
  • Static vs dynamic libc++ shows no difference
  • Changing LD_LIBRARY_PATH, --no-warmup, or thread count has no effect
  • Custom C++ ostringstream stress tests work fine under the same environment (a minimal version is sketched after this list)
  • Backtrace isolates crash to the libc++ VTT dispatch, implying corrupt or invalid string metadata usage at runtime
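
For completeness, a minimal sketch of such a stress test (plain standard library, nothing llama.cpp-specific). std::ostringstream is byte-agnostic, so invalid UTF-8 alone should not crash it; a crash here would implicate the toolchain rather than the model data:

// ostringstream_stress.cpp - minimal sketch of the stress test mentioned above.
#include <iostream>
#include <sstream>
#include <string>

int main() {
    // Deliberately malformed UTF-8: invalid lead bytes and truncated sequences.
    const std::string bad_utf8 = "\xC3\x28\xA0\xE2\x82\xF0\x90";
    for (int i = 0; i < 100000; ++i) {
        std::ostringstream oss;
        oss << "token[" << i << "] = " << bad_utf8;
        if (oss.str().empty()) return 1;  // keep the optimizer from dropping the work
    }
    std::cout << "ostringstream stress test finished without crashing\n";
    return 0;
}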

Minimal fix suggestion:

  • Add additional sanitization of strings during GGUF metadata parsing, especially in tokenizer fields
  • Wrap std::ostringstream << calls with bounds/UTF-8 guards (see the sketch after this list)
  • Optionally skip/disable metadata printing in warmup when GGUF field count is high
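
To illustrate the second suggestion above, a guard along these lines could sanitize a metadata string before it reaches any ostringstream-based logging. This is only a sketch of the idea, not a patch against any specific llama.cpp function:

#include <string>

// Sketch of a pre-logging guard: replace bytes that can never appear in
// well-formed UTF-8 (0xC0, 0xC1, 0xF5-0xFF) with '?'. A real fix would do
// full sequence validation; this only removes the most obvious garbage.
static std::string sanitize_utf8_bytes(const std::string & in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        out.push_back((c == 0xC0 || c == 0xC1 || c >= 0xF5) ? '?' : (char) c);
    }
    return out;
}

Call sites would then stream sanitize_utf8_bytes(value) instead of the raw metadata value.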

Grok AI wrote the analysis below:

GGUF File Details:

SmolVLM-500M-Instruct-Q8_0.gguf:
  • Architecture: llama
  • Type: model
  • Metadata: 49 key-value pairs (e.g., llama.block_count=32, context_length=8192, embedding_length=960)
  • Tensors: 291 (65 F32, 226 Q8_0), 414.86 MiB, 8.50 BPW
  • Tokenizer: GPT-2 based (smollm preset), 49,280 tokens, 48,900 merges
  • Base Models: SmolLM2 360M Instruct (HuggingFaceTB), Siglip Base Patch16 512 (Google)
  • Datasets: The_Cauldron, Docmatix (HuggingFaceM4)

mmproj-SmolVLM-500M-Instruct-Q8_0.gguf (tested separately, also crashes):
  • Architecture: clip
  • Type: clip-vision
  • Metadata: 40 key-value pairs (e.g., clip.vision.image_size=512, patch_size=16, projector_type=idefics3)
  • Tensors: 198 (mostly Q8_0, some F32), ~98M parameters
  • Key Tensor: mm.model.fc.weight (12,288 x 960, Q8_0)

Absolute_Zero_Reasoner-Coder-14b.Q2_K.gguf (works without crash):
  • Architecture: qwen2
  • Type: model
  • Metadata: 31 key-value pairs
  • Tensors: 579 (Q2_K, Q3_K, Q4_K, Q6_K, F32), 5.37 GiB, 3.12 BPW

Additional Context:
The crash occurs only in Termux (Bionic libc), not in a proot Debian environment (glibc), suggesting that Bionic’s stricter memory alignment or the aarch64 optimizations (DOTPROD, FP16_VA, AARCH64_REPACK) expose the issue.

ldd ./llama-server shows dual libc++ linking:

/data/data/com.termux/files/usr/lib/libc++_shared.so
/system/lib64/libc++.so

This may cause ABI conflicts, especially during vision processing (std::ostringstream usage in ggml or metadata logging).

The mmproj file’s idefics3 projector type may have incomplete support in llama.cpp (build 5453), potentially causing memory corruption during CLIP vision initialization.

Testing without the mmproj file (as above) still results in a crash, indicating the issue lies in the main model’s vision-related initialization (likely due to its Siglip base model integration).

Other models (e.g., Absolute_Zero_Reasoner-Coder-14b.Q2_K.gguf) work because they are text-only, bypassing the CLIP vision code path.

Suspected Root Cause:

A bug in llama.cpp’s vision initialization (CLIP architecture, idefics3 projector) causes memory corruption during the warm-up phase, likely in ggml tensor operations or metadata logging using std::ostringstream. This is exacerbated by:

  • Dual libc++ linking (/data/data/com.termux/files/usr/lib/libc++_shared.so and /system/lib64/libc++.so), causing ABI mismatches.
  • Bionic libc’s strict memory management on Android (aarch64).
  • Possible incomplete idefics3 projector support or Q8_0 tensor handling in the CLIP vision path.

Suggested Fix:

  • Investigate llama.cpp’s CLIP vision initialization (llama_new_context_with_model, common_init_from_params) for buffer overflows or invalid pointer operations, especially in ggml tensor allocation or std::ostringstream usage.
  • Ensure consistent libc++ linking by rebuilding dependencies (libcurl.so, libomp.so) to use only Termux’s libc++_shared.so.
  • Verify idefics3 projector support and Q8_0 tensor handling on aarch64 with Bionic libc.
  • Add memory alignment checks for aarch64 optimizations (DOTPROD, FP16_VA).
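
For the last bullet, an alignment check might look like the sketch below; the 16-byte boundary is an assumption about what the NEON/DOTPROD kernels expect, not a value taken from llama.cpp:

#include <cassert>
#include <cstdint>

// Hypothetical guard for the aarch64 code paths: assert that a tensor data
// pointer meets the alignment the vectorized kernels are assumed to rely on.
static inline void assert_aligned(const void * p, std::uintptr_t alignment = 16) {
    (void) p; (void) alignment;  // silence unused warnings when NDEBUG is set
    assert(reinterpret_cast<std::uintptr_t>(p) % alignment == 0 && "unaligned tensor data");
}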

Workarounds Attempted:

  • Rebuilding llama.cpp with -DCMAKE_CXX_FLAGS="-nostdlib++ -L/data/data/com.termux/files/usr/lib -lc++_shared" to avoid /system/lib64/libc++.so (did not resolve the issue).
  • Testing without the mmproj file (still crashes, as shown above).
  • Testing other vision models (e.g., MobileVLM-3B-Q4_K_M.gguf) is pending.

Attachments:

  • Full llama-server log for SmolVLM-500M-Instruct-Q8_0.gguf (#) (see above)
  • gguf-dump for SmolVLM-500M-Instruct-Q8_0.gguf (#) (summarized above)
  • gguf-dump for mmproj-SmolVLM-500M-Instruct-Q8_0.gguf (#) (summarized above)
  • gguf-dump for Absolute_Zero_Reasoner-Coder-14b.Q2_K.gguf (#) (summarized above)
  • GDB backtrace (#) (see above)

  • My comments, followed by the build and install log:

-- The C compiler identification is Clang 20.1.5
-- The CXX compiler identification is Clang 20.1.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /data/data/com.termux/files/usr/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /data/data/com.termux/files/usr/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /data/data/com.termux/files/usr/bin/git (found version "2.49.0")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- Including CPU backend
-- Found OpenMP_C: -fopenmp=libomp (found version "5.1")
-- Found OpenMP_CXX: -fopenmp=libomp (found version "5.1")
-- Found OpenMP: TRUE (found version "5.1")
-- ARM detected
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
-- ARM -mcpu not found, -mcpu=native will be used
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_noi8mm
-- Performing Test GGML_MACHINE_SUPPORTS_noi8mm - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosve
-- Performing Test GGML_MACHINE_SUPPORTS_nosve - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosme
-- Performing Test GGML_MACHINE_SUPPORTS_nosme - Success
-- ARM feature DOTPROD enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+noi8mm+nosve+nosme 
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found CURL: /data/data/com.termux/files/usr/lib/libcurl.so (found version "8.13.0")
-- Configuring done (15.0s)
-- Generating done (0.8s)
-- Build files have been written to: /data/data/com.termux/files/home/downloads/llama.cpp/build
[  3%] Built target ggml-base
[  9%] Built target ggml-cpu
[ 10%] Built target ggml
[ 21%] Built target llama
[ 21%] Built target build_info
[ 27%] Built target common
...
-- Set non-toolchain portion of runtime path of "/data/data/com.termux/files/usr/bin/llama-export-lora" to ""
-- Installing: /data/data/com.termux/files/usr/lib/libllama.so
-- Set non-toolchain portion of runtime path of "/data/data/com.termux/files/usr/lib/libllama.so" to ""
-- Up-to-date: /data/data/com.termux/files/usr/include/llama.h
-- Up-to-date: /data/data/com.termux/files/usr/include/llama-cpp.h
-- Up-to-date: /data/data/com.termux/files/usr/lib/cmake/llama/llama-config.cmake
-- Up-to-date: /data/data/com.termux/files/usr/lib/cmake/llama/llama-version.cmake
-- Up-to-date: /data/data/com.termux/files/usr/bin/convert_hf_to_gguf.py
-- Up-to-date: /data/data/com.termux/files/usr/lib/pkgconfig/llama.pc
~/downloads/llama.cpp $

So:

~/downloads/llama.cpp $ which llama-server
/data/data/com.termux/files/usr/bin/llama-server
~/downloads/llama.cpp $ /data/data/com.termux/files/usr/bin/llama-server
WARNING: linker: Warning: unable to normalize "none" (ignoring)
build: 5453 (6b56a646) with clang version 20.1.5 for aarch64-unknown-linux-android24
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8

and:

~/downloads/llama.cpp $ llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF --no-warmup
WARNING: linker: Warning: unable to normalize "none" (ignoring)
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/SmolVLM-500M-Instruct-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /data/data/com.termux/files/home/.cache/llama.cpp/ggml-org_SmolVLM-500M-Instruct-GGUF_SmolVLM-500M-Instruct-Q8_0.gguf
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-500M-Instruct-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /data/data/com.termux/files/home/.cache/llama.cpp/ggml-org_SmolVLM-500M-Instruct-GGUF_mmproj-SmolVLM-500M-Instruct-Q8_0.gguf
build: 5453 (6b56a646) with clang version 20.1.5 for aarch64-unknown-linux-android24
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 7
main: loading model
srv    load_model: loading model '/data/data/com.termux/files/home/.cache/llama.cpp/ggml-org_SmolVLM-500M-Instruct-GGUF_SmolVLM-500M-Instruct-Q8_0.gguf'
llama_model_loader: loaded meta data with 49 key-value pairs and 291 tensors from /data/data/com.termux/files/home/.cache/llama.cpp/ggml-org_SmolVLM-500M-Instruct-GGUF_SmolVLM-500M-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = SmolVLM 500M Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = SmolVLM
llama_model_loader: - kv   5:                         general.size_label str              = 500M
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                   general.base_model.count u32              = 2
llama_model_loader: - kv   8:                  general.base_model.0.name str              = SmolLM2 360M Instruct
llama_model_loader: - kv   9:          general.base_model.0.organization str              = HuggingFaceTB
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/HuggingFaceTB/...
llama_model_loader: - kv  11:                  general.base_model.1.name str              = Siglip Base Patch16 512
llama_model_loader: - kv  12:               general.base_model.1.version str              = 512
llama_model_loader: - kv  13:          general.base_model.1.organization str              = Google
llama_model_loader: - kv  14:              general.base_model.1.repo_url str              = https://huggingface.co/google/siglip-...
llama_model_loader: - kv  15:                      general.dataset.count u32              = 2
llama_model_loader: - kv  16:                     general.dataset.0.name str              = The_Cauldron
llama_model_loader: - kv  17:             general.dataset.0.organization str              = HuggingFaceM4
llama_model_loader: - kv  18:                 general.dataset.0.repo_url str              = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv  19:                     general.dataset.1.name str              = Docmatix
llama_model_loader: - kv  20:             general.dataset.1.organization str              = HuggingFaceM4
llama_model_loader: - kv  21:                 general.dataset.1.repo_url str              = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv  22:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  23:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  24:                          llama.block_count u32              = 32
llama_model_loader: - kv  25:                       llama.context_length u32              = 8192
llama_model_loader: - kv  26:                     llama.embedding_length u32              = 960
llama_model_loader: - kv  27:                  llama.feed_forward_length u32              = 2560
llama_model_loader: - kv  28:                 llama.attention.head_count u32              = 15
llama_model_loader: - kv  29:              llama.attention.head_count_kv u32              = 5
llama_model_loader: - kv  30:                       llama.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  31:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  32:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  33:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  34:                           llama.vocab_size u32              = 49280
llama_model_loader: - kv  35:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  36:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  37:                         tokenizer.ggml.pre str              = smollm
llama_model_loader: - kv  38:                      tokenizer.ggml.tokens arr[str,49280]   = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv  39:                  tokenizer.ggml.token_type arr[i32,49280]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  40:                      tokenizer.ggml.merges arr[str,48900]   = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv  41:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  42:                tokenizer.ggml.eos_token_id u32              = 49279
llama_model_loader: - kv  43:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  44:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  45:                    tokenizer.chat_template str              = <|im_start|>{% for message in message...
llama_model_loader: - kv  46:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  47:               general.quantization_version u32              = 2
llama_model_loader: - kv  48:                          general.file_type u32              = 7
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 414.86 MiB (8.50 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 145
load: token to piece cache size = 0.3199 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 960
print_info: n_layer          = 32
print_info: n_head           = 15
print_info: n_head_kv        = 5
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 320
print_info: n_embd_v_gqa     = 320
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2560
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 100000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 409.25 M
print_info: general.name     = SmolVLM 500M Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 49280
print_info: n_merges         = 48900
print_info: BOS token        = 1 '<|im_start|>'
print_info: EOS token        = 49279 '<end_of_utterance>'
print_info: EOT token        = 2 '<|im_end|>'
print_info: UNK token        = 0 '<|endoftext|>'
print_info: PAD token        = 2 '<|im_end|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM REP token    = 4 '<reponame>'
print_info: EOG token        = 0 '<|endoftext|>'
print_info: EOG token        = 2 '<|im_end|>'
print_info: EOG token        = 4 '<reponame>'
print_info: EOG token        = 49279 '<end_of_utterance>'
print_info: max token length = 162
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =   414.86 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.19 MiB
llama_kv_cache_unified:        CPU KV buffer size =   160.00 MiB
llama_kv_cache_unified: size =  160.00 MiB (  4096 cells,  32 layers,  1 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_context:        CPU compute buffer size =   135.51 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
Segmentation fault
~/downloads/llama.cpp $ 

Operating systems

Other? (Please let us know in description)

GGML backends

BLAS

Hardware

See above

Models

See above

Problem description & steps to reproduce

See above

First Bad Commit

?

Relevant log output

See above
