Regression: e0dbec0 (aka #12181) breaks pooled embeddings: mean #12517

Closed
@s-u

Description

Name and Version

Affects all llama.cpp builds since e0dbec0, tested up to:

```
version: 4941 (ba932df)
built with cc (Ubuntu 13.3.0-6ubuntu2-24.04) 13.3.0 for x86_64-linux-gnu
```

The bug is not present in:

```
version: 4879 (f08f4b3)
built with cc (Ubuntu 13.3.0-6ubuntu2-24.04) 13.3.0 for x86_64-linux-gnu
```

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

libllama (core library)

Command line

```
# Can be replicated with any model; here using Llama-3.3.
# (-b/-c reduce memory usage but are not relevant to the bug - the model's ctx size works too)
llama-embedding -m Llama-3.3-70B-Instruct-Q6_K-00001-of-00002.gguf -ngl 90 -b 2048 -c 2048 -p 'hello, world' --pooling mean
```

Problem description & steps to reproduce

Fails in `llm_graph_context::build_pooling` with:

```
llama.cpp/ggml/src/ggml.c:2738: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
```

This can be reproduced with any model using `llama-embedding --pooling mean`, for example:

```
llama-embedding -m Llama-3.3-70B-Instruct-Q6_K-00001-of-00002.gguf \
   -ngl 90 -b 2048 -c 2048 -p 'hello, world' --pooling mean
```

The error is due to a shape mismatch between the `inp` and `inp_mean` tensors in llama-graph.cpp:1626.
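For context, the failing assertion enforces ggml's mat-mul shape compatibility. The sketch below paraphrases `ggml_can_mul_mat` from ggml/src/ggml.c (the exact code may vary slightly between versions): the two operands must share their first dimension, and the second operand's higher dimensions must be broadcastable over the first's.

```c
// Paraphrased from ggml/src/ggml.c (may differ slightly by version).
// ggml_mul_mat(a, b) requires a and b to share their first dimension,
// with b's higher dimensions broadcastable over a's.
static inline bool ggml_can_mul_mat(const struct ggml_tensor * t0, const struct ggml_tensor * t1) {
    return (t0->ne[0]             == t1->ne[0]) &&
           (t1->ne[2] % t0->ne[2] == 0)         && // t0 broadcastable along dim 2
           (t1->ne[3] % t0->ne[3] == 0);           // t0 broadcastable along dim 3
}
```

When `inp_mean` is built with the wrong shape (see the traces below), this check fails and the assert aborts the process.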

Running with additional debug output that prints the number of elements (nel) and rows (nrow) of `inp` and `inp_mean`:

```
llama_context: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
inp nel = 16777216, nrow = 2048
inp_mean nel = 4194304, nrow = 2048
inp nel = 16777216, nrow = 2048
inp_mean nel = 1, nrow = 1
llama.cpp/ggml/src/ggml.c:2738: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
```
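For scale, these figures are consistent with n_embd = 8192 for the 70B model (an assumption based on the model's published architecture): in the first graph build, `inp` holds 8192 × 2048 = 16777216 elements (n_embd × n_tokens) and `inp_mean` holds 2048 × 2048 = 4194304 (n_tokens × n_tokens). In the second build, `inp` is still sized for the full 2048-token batch while `inp_mean` has collapsed to a single element, so the operands of the pooling mul_mat no longer line up.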

The same run with llama.cpp build 4879 (f08f4b3), i.e., before e0dbec0 (#12181):

```
llama_init_from_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.00 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
inp nel = 16777216, nrow = 2048
inp_mean nel = 4194304, nrow = 2048
inp nel = 8192, nrow = 1
inp_mean nel = 1, nrow = 1
inp nel = 16777216, nrow = 2048
inp_mean nel = 4194304, nrow = 2048
llama_init_from_model:      CUDA0 compute buffer size =  1600.03 MiB
llama_init_from_model:      CUDA1 compute buffer size =  1664.06 MiB
llama_init_from_model:  CUDA_Host compute buffer size =   192.09 MiB
llama_init_from_model: graph nodes  = 2569
llama_init_from_model: graph splits = 3
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
inp nel = 16384, nrow = 2
inp_mean nel = 4, nrow = 2
[...]
batch_decode: n_tokens = 3, n_seq = 1
inp nel = 24576, nrow = 3
inp_mean nel = 9, nrow = 3
```
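For reference, the nel/nrow lines in the traces above come from ad-hoc debug output; a minimal sketch of equivalent instrumentation follows (hypothetical placement inside `llm_graph_context::build_pooling` in src/llama-graph.cpp; `ggml_nelements()` and `ggml_nrows()` are existing ggml helpers):

```cpp
#include <cinttypes>
#include <cstdio>

// Hypothetical debug prints, inserted just before the pooling mul_mat:
fprintf(stderr, "inp nel = %" PRId64 ", nrow = %" PRId64 "\n",
        ggml_nelements(inp),      ggml_nrows(inp));
fprintf(stderr, "inp_mean nel = %" PRId64 ", nrow = %" PRId64 "\n",
        ggml_nelements(inp_mean), ggml_nrows(inp_mean));
```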

First Bad Commit

e0dbec0
