Name and Version
$ ./build/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Apple M1 Pro (MoltenVK) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
version: 4489 (f11cfdfd)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
Operating systems
Mac
GGML backends
Vulkan
Hardware
Apple M1 Pro, 32 GB RAM
Models
Meta Llama 3.2 Instruct 1B Q4_K_M
Problem description & steps to reproduce
In a fresh git clone:
$ cmake -B build -DGGML_VULKAN=ON -DGGML_METAL=OFF -DCMAKE_BUILD_TYPE=Release -G Ninja
$ cmake --build build --config Release -j 8
$ ./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "The capital of France is " --device Vulkan0 -ngl 17 -no-cnv
Result: the prompt is echoed, but the generation that follows is obvious nonsense tokens.
If I omit --device Vulkan0 -ngl 17, I get reasonable output, but I see
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/17 layers to GPU
in the logs, suggesting that the GPU is not used. Omitting -ngl 17 while keeping --device Vulkan0 behaves the same as omitting both -ngl 17 and --device Vulkan0.
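For clarity, these are the three invocations compared above, side by side (same binary and model as in the repro steps):
# all layers on Vulkan: prompt is echoed, then nonsense tokens (full log below)
$ ./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "The capital of France is " --device Vulkan0 -ngl 17 -no-cnv
# neither flag: coherent output, but "offloaded 0/17 layers to GPU"
$ ./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "The capital of France is " -no-cnv
# --device Vulkan0 without -ngl 17: same behavior as with neither flag (0/17 layers offloaded)
$ ./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "The capital of France is " --device Vulkan0 -no-cnv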
First Bad Commit
EDIT: the bisect surprisingly finished; it seems to point to d79d8f3 (#10846).
45095a6 is bad
e9e661b is good
There are a lot of revs with broken builds in that range. I wrote a simple shell loop to auto-skip them, but it is skipping many revs that mention Vulkan changes, so I'm giving up on bisection being helpful.
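For anyone retrying the bisect, an auto-skip loop can be driven with git bisect run, roughly as sketched below. This is not my exact script; the grep for "paris" is only an illustrative good/bad heuristic. Exit code 125 tells git bisect to skip a rev whose build fails.
$ git bisect start 45095a6 e9e661b
$ git bisect run sh -c '
    rm -rf build
    cmake -B build -DGGML_VULKAN=ON -DGGML_METAL=OFF -DCMAKE_BUILD_TYPE=Release -G Ninja \
      && cmake --build build --config Release -j 8 \
      || exit 125   # broken build at this rev: tell bisect to skip it
    ./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
      -p "The capital of France is " --device Vulkan0 -ngl 17 -no-cnv -n 16 2>/dev/null \
      | grep -qi paris   # exit 0 = good (coherent output), nonzero = bad
  '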
Relevant log output
llama_model_loader: - type f32: 34 tensors
llama_model_loader: - type q4_K: 96 tensors
llama_model_loader: - type q6_K: 17 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 762.81 MiB (5.18 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2048
print_info: n_layer = 16
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 8192
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 1B
print_info: model params = 1.24 B
print_info: general.name = Llama 3.2 1B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 128 'Ä'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
ggml_vulkan: Compiling shaders.....................................Done!
load_tensors: offloading 16 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: CPU_Mapped model buffer size = 205.49 MiB
load_tensors: Vulkan0 model buffer size = 762.81 MiB
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 500000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 16, can_shift = 1
llama_kv_cache_init: Vulkan0 KV buffer size = 128.00 MiB
llama_init_from_model: KV self size = 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
llama_init_from_model: Vulkan_Host output buffer size = 0.49 MiB
llama_init_from_model: Vulkan0 compute buffer size = 280.00 MiB
llama_init_from_model: Vulkan_Host compute buffer size = 12.01 MiB
llama_init_from_model: graph nodes = 518
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 10 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
sampler seed: 1881698075
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
The capital of France is ansomightsightsightsightsightsightsightsightsightsightsunningunningunning draft fork fork Fork Fork Fenlockspspsightsunningunningunning fairly Fairy Fairy Fairy Fairy draftunning fork fork cer cer madness fairly fairly Fork Fairy Fairyfork Up Sent Sentunning fairly terms Sent Faith Fairy Fork fork Fork Bra Fairy fairlyunningunningunningunningunningunningights fairly Mad fork Forkunning draft fork Indian Indianightsightsightsunningunningunningunningunningunning sent Up Sentightsights Fork fork fairly Bra mise Upightsunningunning Faithunningunningunning Fairy sent fork sentunningunningightsightsightsunning Ambunningunningunningunning fairly fairly fairly fairly Indian madness204 up factunningunningunningunningunningunningunningunningunning Amb Forkambunning Fairy Fairy Fairy reached fairly Indian terms termsunningunningunning Fairy Fairy fork Bra Bra Bal forkunning Fork Amb204 draft Bor Fairy fairlyightsunningunning