Llama 2 70b Chat not working on M1 Macs when using Metal #2429

Closed
jd4ever1 opened this issue Jul 27, 2023 · 8 comments

Comments

@jd4ever1

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I am trying to run TheBloke's llama-2-70b-chat.ggmlv3.q2_K.bin on my M1 MacBook Pro. The model is expected to load and run.

Current Behavior

When running ./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]" I get the following error:

./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"
main: build = 918 (7c529ce)
main: seed  = 1690493628
llama.cpp: loading model from ./models/llama-2-70b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: mem required  = 27827.36 MB (+  160.00 MB per state)
llama_new_context_with_model: kv self size  =  160.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/john/pythonEnvironments/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x156308740
ggml_metal_init: loaded kernel_add_row                        0x156308de0
ggml_metal_init: loaded kernel_mul                            0x156309280
ggml_metal_init: loaded kernel_mul_row                        0x156309830
ggml_metal_init: loaded kernel_scale                          0x156309cd0
ggml_metal_init: loaded kernel_silu                           0x15630a170
ggml_metal_init: loaded kernel_relu                           0x15630a610
ggml_metal_init: loaded kernel_gelu                           0x15630aab0
ggml_metal_init: loaded kernel_soft_max                       0x15630b0e0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x15630b6c0
ggml_metal_init: loaded kernel_get_rows_f16                   0x15630bcc0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x15630c430
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x15630ca30
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x15630d030
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x15630d630
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x15630dc30
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x15630e230
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x15630e830
ggml_metal_init: loaded kernel_rms_norm                       0x15630ee70
ggml_metal_init: loaded kernel_norm                           0x15630f610
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x15630fdf0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x156310430
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x156107190
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x156310930
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x156311090
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x1563116d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x156311cf0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x156312510
ggml_metal_init: loaded kernel_rope                           0x1563129b0
ggml_metal_init: loaded kernel_alibi_f32                      0x156313340
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x156313b50
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x156314360
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x156314a50
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   205.08 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 27262.61 MB, (27263.06 / 49152.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    24.17 MB, (27287.23 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   162.00 MB, (27449.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   237.00 MB, (27686.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   304.00 MB, (27990.23 / 49152.00)

system_info: n_threads = 9 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


GGML_ASSERT: ggml-metal.m:721: ne02 == ne12
zsh: abort      ./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p

The same error happens when I try the other quantized models, such as llama-2-70b-chat.ggmlv3.q4_K_M.bin and llama-2-70b-chat.ggmlv3.q4_K_S.bin.

I get a similar error when using the server:

./server -ngl 1 -t 9 -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -c 4096
{"timestamp":1690494784,"level":"INFO","function":"main","line":1124,"message":"build info","build":918,"commit":"7c529ce"}
{"timestamp":1690494784,"level":"INFO","function":"main","line":1129,"message":"system info","n_threads":9,"total_threads":10,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | "}
llama.cpp: loading model from ./models/llama-2-70b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: mem required  = 28339.36 MB (+ 1280.00 MB per state)
llama_new_context_with_model: kv self size  = 1280.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/john/pythonEnvironments/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x152805ac0
ggml_metal_init: loaded kernel_add_row                        0x1356044b0
ggml_metal_init: loaded kernel_mul                            0x135604aa0
ggml_metal_init: loaded kernel_mul_row                        0x135605050
ggml_metal_init: loaded kernel_scale                          0x1356054f0
ggml_metal_init: loaded kernel_silu                           0x135605990
ggml_metal_init: loaded kernel_relu                           0x135605e30
ggml_metal_init: loaded kernel_gelu                           0x1356062d0
ggml_metal_init: loaded kernel_soft_max                       0x135606900
ggml_metal_init: loaded kernel_diag_mask_inf                  0x135606ee0
ggml_metal_init: loaded kernel_get_rows_f16                   0x1356074e0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x135607c50
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x135608250
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x135608850
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x135608e50
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x135609450
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x135609a50
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x13560a050
ggml_metal_init: loaded kernel_rms_norm                       0x13560a690
ggml_metal_init: loaded kernel_norm                           0x13560ae30
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x13560b610
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x13560bc50
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x13560c290
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x152a09450
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x152a09a90
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x152a0a0d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x152a0a6f0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x152a0af10
ggml_metal_init: loaded kernel_rope                           0x152a0b3b0
ggml_metal_init: loaded kernel_alibi_f32                      0x152a0bd40
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x152a0c550
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x152a0cd60
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x152806040
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   205.08 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 27262.61 MB, (27263.06 / 49152.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    24.17 MB, (27287.23 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1282.00 MB, (28569.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   749.00 MB, (29318.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   304.00 MB, (29622.23 / 49152.00)

llama server listening at http://127.0.0.1:8080

{"timestamp":1690494785,"level":"INFO","function":"main","line":1344,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
{"timestamp":1690494790,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1690494791,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1690494791,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60294,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1690494791,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
{"timestamp":1690494792,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
GGML_ASSERT: ggml-metal.m:721: ne02 == ne12
zsh: abort      ./server -ngl 1 -t 9 -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -c
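For reference, the assert appears to fire as soon as a generation request reaches the server (presumably when a prompt is submitted from the web UI; the static-file GETs above succeed). A minimal way to reproduce it without the browser, assuming the server's default /completion endpoint and request fields, would be something like:

curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]", "n_predict": 128}'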

The 70B models only work without Metal, i.e. when -ngl 1 is omitted.
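For example, the same invocation without the offload flag loads the model and generates text on the CPU:

./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"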

The error only happens with the 70B models. The smaller 13B Llama 2 chat models work as expected.

Environment and Context

Running on my M1 MacBook Pro.

Model Name: MacBook Pro
Model Identifier: MacBookPro18,2
Model Number: MK233LL/A
Chip: Apple M1 Max
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 64 GB
System Firmware Version: 8419.60.44
OS Loader Version: 8419.60.44
Serial Number (system):
Hardware UUID:
Provisioning UDID:
Activation Lock Status: Enabled

llama.cpp built with LLAMA_METAL=1 make

uname -a
Darwin Johns-MacBook-Pro-2.local 22.2.0 Darwin Kernel Version 22.2.0: Fri Nov 11 02:03:51 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T6000 arm64

python3 --version
Python 3.9.15

make --version
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

This program built for i386-apple-darwin11.3.0



g++ --version
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

git log | head -1
commit 7c529cede6e84054e77a3eceab31c53de7b2f55b
@izard

izard commented Jul 27, 2023

Getting the same error with Metal; CPU inference works.

@klosax
Contributor

klosax commented Jul 27, 2023

There is no GQA support in Metal yet. See #2276 (comment)

@appleguy

Thank you for the reply, klosax!

Naturally (given that issue) I also experience it. If there is any need to test later patches on a 192GB system (Mac Pro/M2 Ultra), let me know and I'm happy to try it out.

@jd4ever1
Author

There is no GQA support in Metal yet. See #2276 (comment)

@klosax Thanks for the info! Saved me hours of trying and failing.

@mbosc
Contributor

mbosc commented Jul 30, 2023

Hi all, I think I have worked out a very simple (and inelegant) fix that got llama-2-70b working on my M2 MacBook Pro with Metal. I've only tried it on a q5 model so far and I'm downloading another pair of quants, but if you want to try it out, you can check out my fork: https://github.com/mbosc/llama.cpp. If it seems to work consistently, I'll open a PR!
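In case anyone wants to try it, building the fork works the same way as upstream (default branch assumed here; the exact branch with the fix isn't named above, so check the fork first):

git clone https://github.com/mbosc/llama.cpp llama.cpp-mbosc
cd llama.cpp-mbosc
LLAMA_METAL=1 make
./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"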

@izard

izard commented Jul 30, 2023

Perfect, this fix brings llama-2-70b at Q4_0 up from 2.5 tokens per second to 5 tokens per second with significantly lower power utilization on my MBP. Thank you so much!

@jd4ever1
Author

jd4ever1 commented Aug 6, 2023

@mbosc Thanks for your hard work! It works now!

jd4ever1 closed this as completed Aug 6, 2023
@fizahkhalidQuids

Getting the same error with Metal; CPU inference works.

Can you tell me the hardware requirements for CPU inference? And how does Llama 2 70B Chat perform compared to GPT-4 and GPT-3.5, if you have tested it a bit?
