Description

Name and Version

build: 5453 (6b56a646) with clang version 20.1.5 for aarch64-unknown-linux-android24 (Termux)

While experimenting with `llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF` I ran into the crash described below.
Most of the following write-up was generated with ChatGPT:
Segmentation fault when loading SmolVLM-500M-Instruct-Q8_0.gguf on Termux / Android ARM64
Environment:
Device: MediaTek MT6785V/CD (8-core ARMv8)
OS: Android 11 aarch64 (via Termux)
Kernel: 4.14.186+
Shell: bash 5.2.37 (via Termux)
Compiler: clang 20.1.5 (Termux build)
llama.cpp: commit 6b56a646 (build 5453)
Command:
llama-server -m /data/data/com.termux/files/home/.cache/llama.cpp/ggml-org_SmolVLM-500M-Instruct-GGUF_SmolVLM-500M-Instruct-Q8_0.gguf
Result:
...
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Thread 1 "llama-server" received signal SIGSEGV, Segmentation fault.
0x0000007fbded0d38 in VTT for std::__ndk1::basic_ostringstream<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > () from /data/data/com.termux/files/usr/lib/libc++_shared.so
Behavior:
- Works perfectly with other `.gguf` files (e.g., TinyLlama, Mistral, Phi-2)
- Only crashes on this model (`SmolVLM-500M-Instruct-Q8_0.gguf`)
- Reproducible across `llama-cli` and `llama-server`, even with `--no-warmup`
- Works on the same hardware when using other quantized GGUF files
- Crashes during warmup, likely during debug/diagnostic logging that internally uses `std::ostringstream`
- The same binary works for other models, suggesting the issue is data-triggered, not toolchain-related
Hypothesis:
- Likely a malformed or corrupted string in `tokenizer.chat_template`, `tokenizer.ggml.tokens`, or another metadata array, which causes undefined behavior when streamed via `std::ostringstream`, possibly due to invalid UTF-8 or a bug in the parsing logic.

gguf-dump output shows:
- kv 45: tokenizer.chat_template str = <|im_start|>{% for message in messages %}...
- kv 38: tokenizer.ggml.tokens arr[str,49280]
- kv 40: tokenizer.ggml.merges arr[str,48900]

These are candidates for malformed input that could crash `std::ostringstream` at the VTT dispatch.
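To test this hypothesis concretely, below is a minimal standalone scanner I sketched (not part of llama.cpp) that walks every string-typed GGUF metadata entry, including string arrays such as `tokenizer.ggml.tokens`, and reports anything that is not valid UTF-8. It assumes ggml's gguf C API (`gguf_init_from_file`, `gguf_get_n_kv`, `gguf_get_kv_type`, `gguf_get_val_str`, `gguf_get_arr_str`); the header name and link flags may differ between builds, so treat the build command as approximate.

```cpp
// utf8_scan.cpp -- sketch: scan GGUF string metadata for invalid UTF-8.
// Assumes ggml's gguf C API; build roughly with:
//   clang++ utf8_scan.cpp -I ggml/include -L build/bin -lggml-base -o utf8_scan
#include <cstdio>
#include <cstdint>
#include <cstring>
#include "gguf.h"

// Minimal UTF-8 validity check (no overlong/surrogate handling, enough for triage).
static bool valid_utf8(const char * s, size_t n) {
    size_t i = 0;
    while (i < n) {
        unsigned char c = (unsigned char) s[i];
        size_t len = c < 0x80 ? 1 : (c >> 5) == 0x6 ? 2 : (c >> 4) == 0xE ? 3 : (c >> 3) == 0x1E ? 4 : 0;
        if (len == 0 || i + len > n) return false;
        for (size_t k = 1; k < len; ++k)
            if (((unsigned char) s[i + k] & 0xC0) != 0x80) return false;
        i += len;
    }
    return true;
}

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) { fprintf(stderr, "failed to load %s\n", argv[1]); return 1; }

    for (int64_t i = 0; i < gguf_get_n_kv(ctx); ++i) {
        const char * key = gguf_get_key(ctx, i);
        if (gguf_get_kv_type(ctx, i) == GGUF_TYPE_STRING) {
            const char * v = gguf_get_val_str(ctx, i);
            if (!valid_utf8(v, strlen(v))) printf("invalid UTF-8 in kv %s\n", key);
        } else if (gguf_get_kv_type(ctx, i) == GGUF_TYPE_ARRAY &&
                   gguf_get_arr_type(ctx, i) == GGUF_TYPE_STRING) {
            const size_t n_arr = gguf_get_arr_n(ctx, i);
            for (size_t j = 0; j < n_arr; ++j) {
                const char * v = gguf_get_arr_str(ctx, i, j);
                if (!valid_utf8(v, strlen(v)))
                    printf("invalid UTF-8 in %s[%llu]\n", key, (unsigned long long) j);
            }
        }
    }
    gguf_free(ctx);
    return 0;
}
```

If this reports nothing for the SmolVLM file, the invalid-UTF-8 theory can probably be ruled out and attention should shift to the vision/mmproj path instead.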
Reproduction confirmed on:
- A fresh install of Termux with no NDK environment interference
- Static vs. dynamic libc++ makes no difference
- Changing `LD_LIBRARY_PATH`, `--no-warmup`, or the thread count has no effect
- Custom C++ `std::ostringstream` stress tests work fine under the same environment (a representative sketch follows this list)
- The backtrace isolates the crash to the libc++ VTT dispatch, implying corrupt or invalid string metadata usage at runtime
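For reference, the kind of stress test I mean looks roughly like this (my own sketch, not llama.cpp code). It streams long strings and deliberately invalid UTF-8 byte sequences through `std::ostringstream` and completes without crashing on this device, which is why I suspect the model data or the surrounding code path rather than libc++ itself:

```cpp
// oss_stress.cpp -- sketch of the std::ostringstream stress test mentioned above.
#include <iostream>
#include <sstream>
#include <string>

int main() {
    for (int iter = 0; iter < 10000; ++iter) {
        std::ostringstream oss;
        // Long ASCII payload.
        oss << std::string(4096, 'x') << ' ' << iter;
        // Deliberately invalid UTF-8 bytes: ostringstream is byte-oriented,
        // so streaming these should not crash by itself.
        oss << "\xC3\x28" << "\xF0\x28\x8C\x28";
        volatile size_t n = oss.str().size();  // keep the result alive
        (void) n;
    }
    std::cout << "ok\n";
    return 0;
}
```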
Minimal fix suggestion:
- Add additional sanitization of strings during GGUF metadata parsing, especially in tokenizer fields
- Wrap `std::ostringstream <<` calls with bounds/UTF-8 guards (a possible shape for such a guard is sketched below)
- Optionally skip or disable metadata printing during warmup when the GGUF field count is high
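To illustrate the second point, here is one possible shape for such a guard (a sketch only; `safe_meta_str` is a hypothetical helper I made up, not an existing llama.cpp function). It conservatively truncates long values and escapes every non-printable or non-ASCII byte before the string reaches the stream:

```cpp
// Sketch of a guard that could wrap metadata strings before they are
// streamed into std::ostringstream. safe_meta_str is hypothetical.
#include <cstdio>
#include <sstream>
#include <string>

static std::string safe_meta_str(const std::string & in, size_t max_len = 256) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (out.size() >= max_len) { out += "..."; break; }
        if (c >= 0x20 && c < 0x7F) {
            out += static_cast<char>(c);                        // printable ASCII: keep
        } else {
            char buf[8];
            std::snprintf(buf, sizeof(buf), "\\x%02X", (unsigned) c);  // escape the rest
            out += buf;
        }
    }
    return out;
}

int main() {
    std::ostringstream oss;
    oss << "chat_template = " << safe_meta_str("<|im_start|>\xFF\xFEbroken");
    std::printf("%s\n", oss.str().c_str());
}
```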
The following analysis was generated with Grok AI:
GGUF File Details:
- SmolVLM-500M-Instruct-Q8_0.gguf:
  - Architecture: llama
  - Type: model
  - Metadata: 49 key-value pairs (e.g., llama.block_count=32, context_length=8192, embedding_length=960)
  - Tensors: 291 (65 F32, 226 Q8_0), 414.86 MiB, 8.50 BPW
  - Tokenizer: GPT-2 based (smollm preset), 49,280 tokens, 48,900 merges
  - Base Models: SmolLM2 360M Instruct (HuggingFaceTB), Siglip Base Patch16 512 (Google)
  - Datasets: The_Cauldron, Docmatix (HuggingFaceM4)
- mmproj-SmolVLM-500M-Instruct-Q8_0.gguf (tested separately, also crashes):
  - Architecture: clip
  - Type: clip-vision
  - Metadata: 40 key-value pairs (e.g., clip.vision.image_size=512, patch_size=16, projector_type=idefics3)
  - Tensors: 198 (mostly Q8_0, some F32), ~98M parameters
  - Key Tensor: mm.model.fc.weight (12,288 x 960, Q8_0)
- Absolute_Zero_Reasoner-Coder-14b.Q2_K.gguf (works without crash):
  - Architecture: qwen2
  - Type: model
  - Metadata: 31 key-value pairs
  - Tensors: 579 (Q2_K, Q3_K, Q4_K, Q6_K, F32), 5.37 GiB, 3.12 BPW
Additional Context:
- The crash occurs only in Termux (Bionic libc), not in a PRoot-based Debian environment (glibc), suggesting Bionic's stricter memory alignment or the aarch64 optimizations (DOTPROD, FP16_VA, AARCH64_REPACK) expose the issue.
- `ldd ./llama-server` shows dual libc++ linking:
  - /data/data/com.termux/files/usr/lib/libc++_shared.so
  - /system/lib64/libc++.so
  This may cause ABI conflicts, especially during vision processing (`std::ostringstream` usage in ggml or metadata logging); a small runtime check for which libc++ images are actually mapped is sketched after this list.
- The mmproj file's idefics3 projector type may have incomplete support in llama.cpp (build 5453), potentially causing memory corruption during CLIP vision initialization.
- Testing without the mmproj file (as above) still results in a crash, indicating the issue lies in the main model's vision-related initialization (likely due to its Siglip base model integration).
- Other models (e.g., Absolute_Zero_Reasoner-Coder-14b.Q2_K.gguf) work because they are text-only, bypassing the CLIP vision code path.
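To confirm at runtime that both libc++ images really end up in the same process (ldd output alone can be misleading), a self-check like the following could be temporarily added to llama-server's main(); it is a diagnostic sketch of my own, not a fix, and it only reads /proc/self/maps, which Android provides:

```cpp
// libcxx_check.cpp -- sketch: list every libc++ image mapped into this process.
// If both libc++_shared.so (Termux) and /system/lib64/libc++.so show up,
// the dual-linking suspicion is confirmed at runtime.
#include <fstream>
#include <iostream>
#include <set>
#include <sstream>
#include <string>

int main() {
    std::ifstream maps("/proc/self/maps");
    std::set<std::string> seen;
    std::string line;
    while (std::getline(maps, line)) {
        if (line.find("libc++") == std::string::npos) continue;
        // The mapped file path is the last whitespace-separated field on the line.
        std::istringstream iss(line);
        std::string field, path;
        while (iss >> field) path = field;
        seen.insert(path);
    }
    for (const auto & p : seen) std::cout << p << '\n';
    std::cout << (seen.size() > 1 ? "WARNING: multiple libc++ images mapped\n"
                                  : "single libc++ image\n");
    return 0;
}
```

Run standalone it only reports its own (single) libc++, so the useful variant is the one embedded in the crashing binary.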
Suspected Root Cause:
A bug in llama.cpp's vision initialization (CLIP architecture, idefics3 projector) causes memory corruption during the warm-up phase, likely in ggml tensor operations or metadata logging using std::ostringstream. This is exacerbated by:
- Dual libc++ linking (/data/data/com.termux/files/usr/lib/libc++_shared.so and /system/lib64/libc++.so), causing ABI mismatches.
- Bionic libc's strict memory management on Android (aarch64).
- Possible incomplete idefics3 projector support or Q8_0 tensor handling in the CLIP vision path.
Suggested Fix:
- Investigate llama.cpp's CLIP vision initialization (llama_new_context_with_model, common_init_from_params) for buffer overflows or invalid pointer operations, especially in ggml tensor allocation or std::ostringstream usage.
- Ensure consistent libc++ linking by rebuilding dependencies (libcurl.so, libomp.so) to use only Termux's libc++_shared.so.
- Verify idefics3 projector support and Q8_0 tensor handling on aarch64 with Bionic libc.
- Add memory alignment checks for the aarch64 optimizations (DOTPROD, FP16_VA); a possible shape for such a check is sketched below.
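For the alignment-check suggestion, this is the kind of helper I have in mind (purely illustrative; ggml has its own alignment conventions and this macro is not part of it). AArch64 NEON tolerates unaligned access, but repacked or quantized code paths may still assume aligned buffers, and an explicit assertion ahead of the DOTPROD/FP16_VA kernels would make such a violation fail loudly instead of silently corrupting memory:

```cpp
// Sketch of an alignment assertion that could be placed ahead of aarch64
// SIMD kernels (DOTPROD / FP16_VA paths). Illustrative only.
#include <cstdint>
#include <cstdio>
#include <cstdlib>

static inline bool is_aligned(const void * p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) % alignment) == 0;
}

#define REQUIRE_ALIGNED(ptr, align)                                            \
    do {                                                                       \
        if (!is_aligned((ptr), (align))) {                                     \
            std::fprintf(stderr, "%s:%d: %s is not %zu-byte aligned (%p)\n",   \
                         __FILE__, __LINE__, #ptr, (std::size_t) (align),      \
                         (const void *) (ptr));                                \
            std::abort();                                                      \
        }                                                                      \
    } while (0)

int main() {
    // 16 bytes used as an example, matching 128-bit NEON register width.
    alignas(16) float data[8] = {0};
    REQUIRE_ALIGNED(data, 16);
    std::puts("alignment ok");
    return 0;
}
```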
Workarounds Attempted:
- Rebuilding llama.cpp with -DCMAKE_CXX_FLAGS="-nostdlib++ -L/data/data/com.termux/files/usr/lib -lc++_shared" to avoid /system/lib64/libc++.so (did not resolve the issue).
- Testing without the mmproj file (still crashes, as shown above).
- Testing other vision models (e.g., MobileVLM-3B-Q4_K_M.gguf) is still pending.
Attachments:
- Full `llama-server` log for SmolVLM-500M-Instruct-Q8_0.gguf (#) (see above)
- `gguf-dump` for SmolVLM-500M-Instruct-Q8_0.gguf (#) (summarized above)
- `gguf-dump` for mmproj-SmolVLM-500M-Instruct-Q8_0.gguf (#) (summarized above)
- `gguf-dump` for Absolute_Zero_Reasoner-Coder-14b.Q2_K.gguf (#) (summarized above)
- GDB backtrace (#) (see above)
That is the end of the AI-generated analysis. My own comments and logs follow. First, the CMake configure/build/install log:
-- The C compiler identification is Clang 20.1.5
-- The CXX compiler identification is Clang 20.1.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /data/data/com.termux/files/usr/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /data/data/com.termux/files/usr/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /data/data/com.termux/files/usr/bin/git (found version "2.49.0")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- Including CPU backend
-- Found OpenMP_C: -fopenmp=libomp (found version "5.1")
-- Found OpenMP_CXX: -fopenmp=libomp (found version "5.1")
-- Found OpenMP: TRUE (found version "5.1")
-- ARM detected
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
-- ARM -mcpu not found, -mcpu=native will be used
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_noi8mm
-- Performing Test GGML_MACHINE_SUPPORTS_noi8mm - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosve
-- Performing Test GGML_MACHINE_SUPPORTS_nosve - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosme
-- Performing Test GGML_MACHINE_SUPPORTS_nosme - Success
-- ARM feature DOTPROD enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+noi8mm+nosve+nosme
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found CURL: /data/data/com.termux/files/usr/lib/libcurl.so (found version "8.13.0")
-- Configuring done (15.0s)
-- Generating done (0.8s)
-- Build files have been written to: /data/data/com.termux/files/home/downloads/llama.cpp/build
[ 3%] Built target ggml-base
[ 9%] Built target ggml-cpu
[ 10%] Built target ggml
[ 21%] Built target llama
[ 21%] Built target build_info
[ 27%] Built target common
...
-- Set non-toolchain portion of runtime path of "/data/data/com.termux/files/usr/bin/llama-export-lora" to ""
-- Installing: /data/data/com.termux/files/usr/lib/libllama.so
-- Set non-toolchain portion of runtime path of "/data/data/com.termux/files/usr/lib/libllama.so" to ""
-- Up-to-date: /data/data/com.termux/files/usr/include/llama.h
-- Up-to-date: /data/data/com.termux/files/usr/include/llama-cpp.h
-- Up-to-date: /data/data/com.termux/files/usr/lib/cmake/llama/llama-config.cmake
-- Up-to-date: /data/data/com.termux/files/usr/lib/cmake/llama/llama-version.cmake
-- Up-to-date: /data/data/com.termux/files/usr/bin/convert_hf_to_gguf.py
-- Up-to-date: /data/data/com.termux/files/usr/lib/pkgconfig/llama.pc
~/downloads/llama.cpp $
Then, checking which binary is installed and confirming its version:
~/downloads/llama.cpp $ which llama-server
/data/data/com.termux/files/usr/bin/llama-server
~/downloads/llama.cpp $ /data/data/com.termux/files/usr/bin/llama-server
WARNING: linker: Warning: unable to normalize "none" (ignoring)
build: 5453 (6b56a646) with clang version 20.1.5 for aarch64-unknown-linux-android24
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8
And the run that crashes (with --no-warmup):
~/downloads/llama.cpp $ llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF --no-warmup
WARNING: linker: Warning: unable to normalize "none" (ignoring)
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/SmolVLM-500M-Instruct-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /data/data/com.termux/files/home/.cache/llama.cpp/ggml-org_SmolVLM-500M-Instruct-GGUF_SmolVLM-500M-Instruct-Q8_0.gguf
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-500M-Instruct-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /data/data/com.termux/files/home/.cache/llama.cpp/ggml-org_SmolVLM-500M-Instruct-GGUF_mmproj-SmolVLM-500M-Instruct-Q8_0.gguf
build: 5453 (6b56a646) with clang version 20.1.5 for aarch64-unknown-linux-android24
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 7
main: loading model
srv load_model: loading model '/data/data/com.termux/files/home/.cache/llama.cpp/ggml-org_SmolVLM-500M-Instruct-GGUF_SmolVLM-500M-Instruct-Q8_0.gguf'
llama_model_loader: loaded meta data with 49 key-value pairs and 291 tensors from /data/data/com.termux/files/home/.cache/llama.cpp/ggml-org_SmolVLM-500M-Instruct-GGUF_SmolVLM-500M-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = SmolVLM 500M Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = SmolVLM
llama_model_loader: - kv 5: general.size_label str = 500M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 2
llama_model_loader: - kv 8: general.base_model.0.name str = SmolLM2 360M Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = HuggingFaceTB
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/HuggingFaceTB/...
llama_model_loader: - kv 11: general.base_model.1.name str = Siglip Base Patch16 512
llama_model_loader: - kv 12: general.base_model.1.version str = 512
llama_model_loader: - kv 13: general.base_model.1.organization str = Google
llama_model_loader: - kv 14: general.base_model.1.repo_url str = https://huggingface.co/google/siglip-...
llama_model_loader: - kv 15: general.dataset.count u32 = 2
llama_model_loader: - kv 16: general.dataset.0.name str = The_Cauldron
llama_model_loader: - kv 17: general.dataset.0.organization str = HuggingFaceM4
llama_model_loader: - kv 18: general.dataset.0.repo_url str = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv 19: general.dataset.1.name str = Docmatix
llama_model_loader: - kv 20: general.dataset.1.organization str = HuggingFaceM4
llama_model_loader: - kv 21: general.dataset.1.repo_url str = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv 22: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 23: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 24: llama.block_count u32 = 32
llama_model_loader: - kv 25: llama.context_length u32 = 8192
llama_model_loader: - kv 26: llama.embedding_length u32 = 960
llama_model_loader: - kv 27: llama.feed_forward_length u32 = 2560
llama_model_loader: - kv 28: llama.attention.head_count u32 = 15
llama_model_loader: - kv 29: llama.attention.head_count_kv u32 = 5
llama_model_loader: - kv 30: llama.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 31: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 32: llama.attention.key_length u32 = 64
llama_model_loader: - kv 33: llama.attention.value_length u32 = 64
llama_model_loader: - kv 34: llama.vocab_size u32 = 49280
llama_model_loader: - kv 35: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 36: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 37: tokenizer.ggml.pre str = smollm
llama_model_loader: - kv 38: tokenizer.ggml.tokens arr[str,49280] = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv 39: tokenizer.ggml.token_type arr[i32,49280] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 40: tokenizer.ggml.merges arr[str,48900] = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv 41: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 42: tokenizer.ggml.eos_token_id u32 = 49279
llama_model_loader: - kv 43: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 44: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 45: tokenizer.chat_template str = <|im_start|>{% for message in message...
llama_model_loader: - kv 46: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 47: general.quantization_version u32 = 2
llama_model_loader: - kv 48: general.file_type u32 = 7
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 414.86 MiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 145
load: token to piece cache size = 0.3199 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 8192
print_info: n_embd = 960
print_info: n_layer = 32
print_info: n_head = 15
print_info: n_head_kv = 5
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 3
print_info: n_embd_k_gqa = 320
print_info: n_embd_v_gqa = 320
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2560
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 100000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 8192
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 8B
print_info: model params = 409.25 M
print_info: general.name = SmolVLM 500M Instruct
print_info: vocab type = BPE
print_info: n_vocab = 49280
print_info: n_merges = 48900
print_info: BOS token = 1 '<|im_start|>'
print_info: EOS token = 49279 '<end_of_utterance>'
print_info: EOT token = 2 '<|im_end|>'
print_info: UNK token = 0 '<|endoftext|>'
print_info: PAD token = 2 '<|im_end|>'
print_info: LF token = 198 'Ċ'
print_info: FIM REP token = 4 '<reponame>'
print_info: EOG token = 0 '<|endoftext|>'
print_info: EOG token = 2 '<|im_end|>'
print_info: EOG token = 4 '<reponame>'
print_info: EOG token = 49279 '<end_of_utterance>'
print_info: max token length = 162
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_Mapped model buffer size = 414.86 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 100000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.19 MiB
llama_kv_cache_unified: CPU KV buffer size = 160.00 MiB
llama_kv_cache_unified: size = 160.00 MiB ( 4096 cells, 32 layers, 1 seqs), K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_context: CPU compute buffer size = 135.51 MiB
llama_context: graph nodes = 1158
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
Segmentation fault
~/downloads/llama.cpp $
Operating systems
Other? (Please let us know in description)
GGML backends
CPU
Hardware
See above
Models
See above
Problem description & steps to reproduce
See above
First Bad Commit
?
Relevant log output
See above