Llama 2 70b Chat not working on M1 Macs when using Metal #2429

Closed
jd4ever1 opened this issue Jul 27, 2023 · 8 comments

Comments

@jd4ever1

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I am trying to run TheBloke's llama-2-70b-chat.ggmlv3.q2_K.bin on my M1 MacBook Pro. The model is expected to load and run.

Current Behavior

When running ./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]" I get the following error:

./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"
main: build = 918 (7c529ce)
main: seed  = 1690493628
llama.cpp: loading model from ./models/llama-2-70b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: mem required  = 27827.36 MB (+  160.00 MB per state)
llama_new_context_with_model: kv self size  =  160.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/john/pythonEnvironments/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x156308740
ggml_metal_init: loaded kernel_add_row                        0x156308de0
ggml_metal_init: loaded kernel_mul                            0x156309280
ggml_metal_init: loaded kernel_mul_row                        0x156309830
ggml_metal_init: loaded kernel_scale                          0x156309cd0
ggml_metal_init: loaded kernel_silu                           0x15630a170
ggml_metal_init: loaded kernel_relu                           0x15630a610
ggml_metal_init: loaded kernel_gelu                           0x15630aab0
ggml_metal_init: loaded kernel_soft_max                       0x15630b0e0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x15630b6c0
ggml_metal_init: loaded kernel_get_rows_f16                   0x15630bcc0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x15630c430
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x15630ca30
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x15630d030
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x15630d630
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x15630dc30
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x15630e230
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x15630e830
ggml_metal_init: loaded kernel_rms_norm                       0x15630ee70
ggml_metal_init: loaded kernel_norm                           0x15630f610
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x15630fdf0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x156310430
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x156107190
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x156310930
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x156311090
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x1563116d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x156311cf0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x156312510
ggml_metal_init: loaded kernel_rope                           0x1563129b0
ggml_metal_init: loaded kernel_alibi_f32                      0x156313340
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x156313b50
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x156314360
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x156314a50
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   205.08 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 27262.61 MB, (27263.06 / 49152.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    24.17 MB, (27287.23 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   162.00 MB, (27449.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   237.00 MB, (27686.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   304.00 MB, (27990.23 / 49152.00)

system_info: n_threads = 9 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


GGML_ASSERT: ggml-metal.m:721: ne02 == ne12
zsh: abort      ./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p

The same error happens when I try the other quantized models, such as llama-2-70b-chat.ggmlv3.q4_K_M.bin and llama-2-70b-chat.ggmlv3.q4_K_S.bin.

I get a similar error when using the server:

./server -ngl 1 -t 9 -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -c 4096
{"timestamp":1690494784,"level":"INFO","function":"main","line":1124,"message":"build info","build":918,"commit":"7c529ce"}
{"timestamp":1690494784,"level":"INFO","function":"main","line":1129,"message":"system info","n_threads":9,"total_threads":10,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | "}
llama.cpp: loading model from ./models/llama-2-70b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: mem required  = 28339.36 MB (+ 1280.00 MB per state)
llama_new_context_with_model: kv self size  = 1280.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/john/pythonEnvironments/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x152805ac0
ggml_metal_init: loaded kernel_add_row                        0x1356044b0
ggml_metal_init: loaded kernel_mul                            0x135604aa0
ggml_metal_init: loaded kernel_mul_row                        0x135605050
ggml_metal_init: loaded kernel_scale                          0x1356054f0
ggml_metal_init: loaded kernel_silu                           0x135605990
ggml_metal_init: loaded kernel_relu                           0x135605e30
ggml_metal_init: loaded kernel_gelu                           0x1356062d0
ggml_metal_init: loaded kernel_soft_max                       0x135606900
ggml_metal_init: loaded kernel_diag_mask_inf                  0x135606ee0
ggml_metal_init: loaded kernel_get_rows_f16                   0x1356074e0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x135607c50
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x135608250
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x135608850
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x135608e50
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x135609450
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x135609a50
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x13560a050
ggml_metal_init: loaded kernel_rms_norm                       0x13560a690
ggml_metal_init: loaded kernel_norm                           0x13560ae30
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x13560b610
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x13560bc50
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x13560c290
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x152a09450
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x152a09a90
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x152a0a0d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x152a0a6f0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x152a0af10
ggml_metal_init: loaded kernel_rope                           0x152a0b3b0
ggml_metal_init: loaded kernel_alibi_f32                      0x152a0bd40
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x152a0c550
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x152a0cd60
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x152806040
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   205.08 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 27262.61 MB, (27263.06 / 49152.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    24.17 MB, (27287.23 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1282.00 MB, (28569.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   749.00 MB, (29318.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   304.00 MB, (29622.23 / 49152.00)

llama server listening at http://127.0.0.1:8080

{"timestamp":1690494785,"level":"INFO","function":"main","line":1344,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
{"timestamp":1690494790,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1690494791,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1690494791,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60294,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1690494791,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
{"timestamp":1690494792,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
GGML_ASSERT: ggml-metal.m:721: ne02 == ne12
zsh: abort      ./server -ngl 1 -t 9 -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -c
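For reference, the assert appears to fire as soon as a generation request reaches the server (presumably when a prompt is submitted from the web UI; the static-file GETs above succeed). A minimal way to reproduce it without the browser, assuming the server's default /completion endpoint and request fields, would be something like:

curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]", "n_predict": 128}'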

The 70B models only work without Metal, i.e. when -ngl 1 is omitted.
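For example, the same invocation without the offload flag loads the model and generates text on the CPU:

./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"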

The error only happens with the 70B models. The smaller 13B Llama 2 chat models work as expected.

Environment and Context

Running on my M1 MacBook Pro.

Model Name: MacBook Pro
Model Identifier: MacBookPro18,2
Model Number: MK233LL/A
Chip: Apple M1 Max
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 64 GB
System Firmware Version: 8419.60.44
OS Loader Version: 8419.60.44
Serial Number (system):
Hardware UUID:
Provisioning UDID:
Activation Lock Status: Enabled

llama.cpp built with LLAMA_METAL=1 make

uname -a
Darwin Johns-MacBook-Pro-2.local 22.2.0 Darwin Kernel Version 22.2.0: Fri Nov 11 02:03:51 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T6000 arm64

python3 --version
Python 3.9.15

make --version
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

This program built for i386-apple-darwin11.3.0



g++ --version
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

git log | head -1
commit 7c529cede6e84054e77a3eceab31c53de7b2f55b
@izard

izard commented Jul 27, 2023

Getting the same error with Metal; CPU inference works.

@klosax
Contributor

klosax commented Jul 27, 2023

There is no GQA support in Metal yet. See #2276 (comment)

@appleguy

Thank you for the reply, klosax!

Naturally (given that issue) I also experience it. If there is any need to test later patches on a 192GB system (Mac Pro/M2 Ultra), let me know and I'm happy to try it out.

@jd4ever1
Author

There is no GQA support in Metal yet. See #2276 (comment)

@klosax Thanks for the info! Saved me hours of trying and failing.

@mbosc
Contributor

mbosc commented Jul 30, 2023

Hi all, I think I have worked out a very simple (and inelegant) fix that got llama-2-70b working on my M2 MacBook Pro with Metal. I've only tried it on a q5 model so far and I'm downloading another pair of quants, but if you want to try it out, you can check out my fork: https://github.com/mbosc/llama.cpp. If it seems to work consistently, I'll open a PR!
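In case anyone wants to try it, building the fork works the same way as upstream (default branch assumed here; the exact branch with the fix isn't named above, so check the fork first):

git clone https://github.com/mbosc/llama.cpp llama.cpp-mbosc
cd llama.cpp-mbosc
LLAMA_METAL=1 make
./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"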

@izard

izard commented Jul 30, 2023

Perfect, this fix brings llama-2-70b at Q4_0 up from 2.5 tokens per second to 5 tokens per second with significantly lower power utilization on my MBP. Thank you so much!

@jd4ever1
Author

jd4ever1 commented Aug 6, 2023

@mbosc Thanks for your hard work! It works now!

jd4ever1 closed this as completed Aug 6, 2023
@fizahkhalidQuids

Getting the same error with Metal; CPU inference works.

Can you tell me the hardware requirements for CPU inference? And how does Llama 2 70B Chat perform compared to GPT-4 and GPT-3.5, if you have tested it a bit?
