### Name and Version
Affects all llama.cpp builds since e0dbec0, tested up to:

```
version: 4941 (ba932df)
built with cc (Ubuntu 13.3.0-6ubuntu2-24.04) 13.3.0 for x86_64-linux-gnu
```

The bug is not present in:

```
version: 4879 (f08f4b3)
built with cc (Ubuntu 13.3.0-6ubuntu2-24.04) 13.3.0 for x86_64-linux-gnu
```
### Operating systems
Linux
### Which llama.cpp modules do you know to be affected?
libllama (core library)
### Command line
```shell
# Can be replicated with any model; here using Llama-3.3
# (-b/-c to reduce memory usage - not relevant to the bug, the model's default ctx size works too)
llama-embedding -m Llama-3.3-70B-Instruct-Q6_K-00001-of-00002.gguf -ngl 90 -b 2048 -c 2048 -p 'hello, world' --pooling mean
```
### Problem description & steps to reproduce
Fails in `llm_graph_context::build_pooling` with:

```
llama.cpp/ggml/src/ggml.c:2738: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
```
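For context, the assert boils down to a shape check along these lines (paraphrased from `ggml.c`; the exact source may differ slightly between versions):

```cpp
// paraphrased from ggml.c: mul_mat(t0, t1) needs matching inner
// dimensions and t0's batch dims broadcastable to t1's
static inline bool ggml_can_mul_mat(const struct ggml_tensor * t0, const struct ggml_tensor * t1) {
    return (t0->ne[0] == t1->ne[0])     &&  // inner dimension must match
           (t1->ne[2] % t0->ne[2] == 0) &&  // t0 broadcastable along dim 2
           (t1->ne[3] % t0->ne[3] == 0);    // t0 broadcastable along dim 3
}
```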
Reproduce with any model using `llama-embedding --pooling mean`, for example:

```shell
llama-embedding -m Llama-3.3-70B-Instruct-Q6_K-00001-of-00002.gguf \
    -ngl 90 -b 2048 -c 2048 -p 'hello, world' --pooling mean
```
The error is due to a mismatch between the `inp` and `inp_mean` tensors in `llama-graph.cpp:1626`.
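The failing call is the mean-pooling matmul, which looks roughly like this (a sketch of the relevant case in `llm_graph_context::build_pooling`; the exact code may differ):

```cpp
// inp      is the token embeddings, shape [n_embd,   n_tokens]
// inp_mean holds per-sequence averaging weights, shape [n_tokens, n_tokens]
case LLAMA_POOLING_TYPE_MEAN:
    {
        ggml_tensor * inp_mean = build_inp_mean();
        // transpose(inp) is [n_tokens, n_embd]; the matmul requires its
        // first dimension to match inp_mean's first dimension (n_tokens)
        cur = ggml_mul_mat(ctx0, ggml_cont(ctx0, ggml_transpose(ctx0, inp)), inp_mean);
    } break;
```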
Running with additional output printing `nelements` and `nrows` of `inp` and `inp_mean` shows the mismatch.
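The extra lines below come from a debug print along these lines, a hypothetical sketch (`ggml_nelements` and `ggml_nrows` are existing ggml helpers; the `imp_mean` label matches the typo in the captured logs):

```cpp
// hypothetical debug print, inserted just before the ggml_mul_mat in
// build_pooling (needs <cstdio> and <cinttypes> at the top of the file)
fprintf(stderr, "inp nel = %" PRId64 ", nrow = %" PRId64 "\n",
        ggml_nelements(inp), ggml_nrows(inp));
fprintf(stderr, "imp_mean nel = %" PRId64 ", nrow = %" PRId64 "\n",
        ggml_nelements(inp_mean), ggml_nrows(inp_mean));
```

On 4941 (ba932df) this prints: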
```
llama_context: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
inp nel = 16777216, nrow = 2048
imp_mean nel = 4194304, nrow = 2048
inp nel = 16777216, nrow = 2048
imp_mean nel = 1, nrow = 1
llama.cpp/ggml/src/ggml.c:2738: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
```
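Reading off the numbers (assuming `n_embd` = 8192, which matches Llama-3.3-70B): on the first graph build `inp` is [8192, 2048] (nel = 8192 × 2048 = 16777216) and `inp_mean` is [2048, 2048] (nel = 4194304), which is compatible. On the second build `inp_mean` has collapsed to [1, 1] while `inp` is still [8192, 2048], so the matmul sees inner dimensions 2048 vs 1 and `ggml_can_mul_mat` fails.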
The same run with llama 4879 (f08f4b3), i.e., before e0dbec0 (#12181):
```
llama_init_from_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.00 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
inp nel = 16777216, nrow = 2048
imp_mean nel = 4194304, nrow = 2048
inp nel = 8192, nrow = 1
imp_mean nel = 1, nrow = 1
inp nel = 16777216, nrow = 2048
imp_mean nel = 4194304, nrow = 2048
llama_init_from_model: CUDA0 compute buffer size = 1600.03 MiB
llama_init_from_model: CUDA1 compute buffer size = 1664.06 MiB
llama_init_from_model: CUDA_Host compute buffer size = 192.09 MiB
llama_init_from_model: graph nodes = 2569
llama_init_from_model: graph splits = 3
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
inp nel = 16384, nrow = 2
imp_mean nel = 4, nrow = 2
[...]
batch_decode: n_tokens = 3, n_seq = 1
inp nel = 24576, nrow = 3
imp_mean nel = 9, nrow = 3
```
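In the working build the shapes stay consistent throughout: e.g. for the final decode with n_tokens = 3, `inp` has nel = 8192 × 3 = 24576 and `inp_mean` has nel = 3 × 3 = 9, matching the [n_embd, n_tokens] / [n_tokens, n_tokens] layout above. After e0dbec0, `inp_mean` no longer tracks `inp` on the second graph build.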