### Name and Version
Affects all llama.cpp builds since e0dbec0, tested up to:

```
version: 4941 (ba932df)
built with cc (Ubuntu 13.3.0-6ubuntu2-24.04) 13.3.0 for x86_64-linux-gnu
```

The bug is not present in:

```
version: 4879 (f08f4b3)
built with cc (Ubuntu 13.3.0-6ubuntu2-24.04) 13.3.0 for x86_64-linux-gnu
```
### Operating systems
Linux
### Which llama.cpp modules do you know to be affected?
libllama (core library)
### Command line
```shell
# Can be replicated with any model; here using Llama-3.3
# (-b/-c to reduce memory usage - not relevant to the bug, the model's default ctx size works too)
llama-embedding -m Llama-3.3-70B-Instruct-Q6_K-00001-of-00002.gguf -ngl 90 -b 2048 -c 2048 -p 'hello, world' --pooling mean
```
### Problem description & steps to reproduce
Fails in `llm_graph_context::build_pooling` with:

```
llama.cpp/ggml/src/ggml.c:2738: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
```
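For context, the assert boils down to a shape check along these lines (paraphrased from `ggml.c`; the exact source may differ slightly between versions):

```cpp
// paraphrased from ggml.c: mul_mat(t0, t1) needs matching inner
// dimensions and t0's batch dims broadcastable to t1's
static inline bool ggml_can_mul_mat(const struct ggml_tensor * t0, const struct ggml_tensor * t1) {
    return (t0->ne[0] == t1->ne[0])     &&  // inner dimension must match
           (t1->ne[2] % t0->ne[2] == 0) &&  // t0 broadcastable along dim 2
           (t1->ne[3] % t0->ne[3] == 0);    // t0 broadcastable along dim 3
}
```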
Reproduce with any model using `llama-embedding --pooling mean`, for example:

```shell
llama-embedding -m Llama-3.3-70B-Instruct-Q6_K-00001-of-00002.gguf \
    -ngl 90 -b 2048 -c 2048 -p 'hello, world' --pooling mean
```
The error is due to a mismatch between the `inp` and `inp_mean` tensors in `llama-graph.cpp:1626`.
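The failing call is the mean-pooling matmul, which looks roughly like this (a sketch of the relevant case in `llm_graph_context::build_pooling`; the exact code may differ):

```cpp
// inp      is the token embeddings, shape [n_embd,   n_tokens]
// inp_mean holds per-sequence averaging weights, shape [n_tokens, n_tokens]
case LLAMA_POOLING_TYPE_MEAN:
    {
        ggml_tensor * inp_mean = build_inp_mean();
        // transpose(inp) is [n_tokens, n_embd]; the matmul requires its
        // first dimension to match inp_mean's first dimension (n_tokens)
        cur = ggml_mul_mat(ctx0, ggml_cont(ctx0, ggml_transpose(ctx0, inp)), inp_mean);
    } break;
```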
Running with additional output printing `nelements` and `nrows` of `inp` and `inp_mean` shows the mismatch.
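The extra lines below come from a debug print along these lines, a hypothetical sketch (`ggml_nelements` and `ggml_nrows` are existing ggml helpers; the `imp_mean` label matches the typo in the captured logs):

```cpp
// hypothetical debug print, inserted just before the ggml_mul_mat in
// build_pooling (needs <cstdio> and <cinttypes> at the top of the file)
fprintf(stderr, "inp nel = %" PRId64 ", nrow = %" PRId64 "\n",
        ggml_nelements(inp), ggml_nrows(inp));
fprintf(stderr, "imp_mean nel = %" PRId64 ", nrow = %" PRId64 "\n",
        ggml_nelements(inp_mean), ggml_nrows(inp_mean));
```

On 4941 (ba932df) this prints: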
```
llama_context: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
inp nel = 16777216, nrow = 2048
imp_mean nel = 4194304, nrow = 2048
inp nel = 16777216, nrow = 2048
imp_mean nel = 1, nrow = 1
llama.cpp/ggml/src/ggml.c:2738: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
```
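Reading off the numbers (assuming `n_embd` = 8192, which matches Llama-3.3-70B): on the first graph build `inp` is [8192, 2048] (nel = 8192 × 2048 = 16777216) and `inp_mean` is [2048, 2048] (nel = 4194304), which is compatible. On the second build `inp_mean` has collapsed to [1, 1] while `inp` is still [8192, 2048], so the matmul sees inner dimensions 2048 vs 1 and `ggml_can_mul_mat` fails.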
The same run with llama 4879 (f08f4b3), i.e., before e0dbec0 (#12181):
```
llama_init_from_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.00 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
inp nel = 16777216, nrow = 2048
imp_mean nel = 4194304, nrow = 2048
inp nel = 8192, nrow = 1
imp_mean nel = 1, nrow = 1
inp nel = 16777216, nrow = 2048
imp_mean nel = 4194304, nrow = 2048
llama_init_from_model: CUDA0 compute buffer size = 1600.03 MiB
llama_init_from_model: CUDA1 compute buffer size = 1664.06 MiB
llama_init_from_model: CUDA_Host compute buffer size = 192.09 MiB
llama_init_from_model: graph nodes = 2569
llama_init_from_model: graph splits = 3
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
inp nel = 16384, nrow = 2
imp_mean nel = 4, nrow = 2
[...]
batch_decode: n_tokens = 3, n_seq = 1
inp nel = 24576, nrow = 3
imp_mean nel = 9, nrow = 3
```
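In the working build the shapes stay consistent throughout: e.g. for the final decode with n_tokens = 3, `inp` has nel = 8192 × 3 = 24576 and `inp_mean` has nel = 3 × 3 = 9, matching the [n_embd, n_tokens] / [n_tokens, n_tokens] layout above. After e0dbec0, `inp_mean` no longer tracks `inp` on the second graph build.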