Name and Version
$ ./build/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Apple M1 Pro (MoltenVK) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
version: 4489 (f11cfdfd)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
Operating systems
Mac
GGML backends
Vulkan
Hardware
Apple M1 Pro, 32 GB RAM
Models
Meta Llama 3.2 Instruct 1B Q4_K_M
Problem description & steps to reproduce
In a fresh git clone:
$ cmake -B build -DGGML_VULKAN=ON -DGGML_METAL=OFF -DCMAKE_BUILD_TYPE=Release -G Ninja
$ cmake --build build --config Release -j 8
$ ./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "The capital of France is " --device Vulkan0 -ngl 17 -no-cnv
Result: the prompt is echoed, but the generation that follows is obvious nonsense tokens.
If I omit --device Vulkan0 -ngl 17, I get reasonable output, but I see
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/17 layers to GPU
in the logs, suggesting that the GPU is not used. Omitting -ngl 17 while keeping --device Vulkan0 behaves the same as omitting both -ngl 17 and --device Vulkan0.
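For clarity, these are the three invocations compared above, side by side (same binary and model as in the repro steps):
# all layers on Vulkan: prompt is echoed, then nonsense tokens (full log below)
$ ./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "The capital of France is " --device Vulkan0 -ngl 17 -no-cnv
# neither flag: coherent output, but "offloaded 0/17 layers to GPU"
$ ./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "The capital of France is " -no-cnv
# --device Vulkan0 without -ngl 17: same behavior as with neither flag (0/17 layers offloaded)
$ ./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "The capital of France is " --device Vulkan0 -no-cnv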
First Bad Commit
EDIT: the bisect surprisingly finished; it seems to point to d79d8f3 (#10846).
45095a6 is bad
e9e661b is good
There are a lot of revs with broken builds in that range. I wrote a simple shell loop to auto-skip them, but it is skipping many revs that mention Vulkan changes, so I'm giving up on bisection being helpful.
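For anyone retrying the bisect, an auto-skip loop can be driven with git bisect run, roughly as sketched below. This is not my exact script; the grep for "paris" is only an illustrative good/bad heuristic. Exit code 125 tells git bisect to skip a rev whose build fails.
$ git bisect start 45095a6 e9e661b
$ git bisect run sh -c '
    rm -rf build
    cmake -B build -DGGML_VULKAN=ON -DGGML_METAL=OFF -DCMAKE_BUILD_TYPE=Release -G Ninja \
      && cmake --build build --config Release -j 8 \
      || exit 125   # broken build at this rev: tell bisect to skip it
    ./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
      -p "The capital of France is " --device Vulkan0 -ngl 17 -no-cnv -n 16 2>/dev/null \
      | grep -qi paris   # exit 0 = good (coherent output), nonzero = bad
  '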
Relevant log output
llama_model_loader: - type f32: 34 tensors
llama_model_loader: - type q4_K: 96 tensors
llama_model_loader: - type q6_K: 17 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 762.81 MiB (5.18 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2048
print_info: n_layer = 16
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 8192
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 1B
print_info: model params = 1.24 B
print_info: general.name = Llama 3.2 1B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 128 'Ä'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
ggml_vulkan: Compiling shaders.....................................Done!
load_tensors: offloading 16 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: CPU_Mapped model buffer size = 205.49 MiB
load_tensors: Vulkan0 model buffer size = 762.81 MiB
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 500000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 16, can_shift = 1
llama_kv_cache_init: Vulkan0 KV buffer size = 128.00 MiB
llama_init_from_model: KV self size = 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
llama_init_from_model: Vulkan_Host output buffer size = 0.49 MiB
llama_init_from_model: Vulkan0 compute buffer size = 280.00 MiB
llama_init_from_model: Vulkan_Host compute buffer size = 12.01 MiB
llama_init_from_model: graph nodes = 518
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 10 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
sampler seed: 1881698075
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
The capital of France is ansomightsightsightsightsightsightsightsightsightsightsunningunningunning draft fork fork Fork Fork Fenlockspspsightsunningunningunning fairly Fairy Fairy Fairy Fairy draftunning fork fork cer cer madness fairly fairly Fork Fairy Fairyfork Up Sent Sentunning fairly terms Sent Faith Fairy Fork fork Fork Bra Fairy fairlyunningunningunningunningunningunningights fairly Mad fork Forkunning draft fork Indian Indianightsightsightsunningunningunningunningunningunning sent Up Sentightsights Fork fork fairly Bra mise Upightsunningunning Faithunningunningunning Fairy sent fork sentunningunningightsightsightsunning Ambunningunningunningunning fairly fairly fairly fairly Indian madness204 up factunningunningunningunningunningunningunningunningunning Amb Forkambunning Fairy Fairy Fairy reached fairly Indian terms termsunningunningunning Fairy Fairy fork Bra Bra Bal forkunning Fork Amb204 draft Bor Fairy fairlyightsunningunning