Llama 2 70b Chat not working on M1 Macs when using Metal #2429
Comments
Getting the same error with Metal; CPU inference works.
There is no GQA support in Metal yet. See #2276 (comment)
Thank you for the reply, klosax! Naturally (given that issue) I also experience it. If there is any need to test later patches on a 192GB system (Mac Pro/M2 Ultra), let me know and I'm happy to try it out.
@klosax Thanks for the info! Saved me hours of trying and failing.
Hi all, I think I have worked out a very simple and inelegant fix that got llama-2-70b working on my M2 MacBook Pro with Metal. I've only tried it on a q5 model so far (I'm downloading another pair of quants), but if you want to try it out, you can check out my fork https://github.com/mbosc/llama.cpp. If it seems to work consistently, I'll open a PR!
Perfect, this fix brings llama-2-70b at Q4_0 up from 2.5 tokens per second to 5 tokens per second, with significantly lower power utilization on my MBP. Thank you so much!
@mbosc Thanks for your hard work! It works now!
Can you tell me the hardware requirements for CPU inference? And how does Llama 2 70B Chat perform compared to GPT-4 and GPT-3.5, if you have tested it a bit?
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
I am trying to run TheBloke's llama-2-70b-chat.ggmlv3.q2_K.bin on my M1 MacBook Pro. I expect it to load and run.
Current Behavior
When running
./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"
I get the following error. The same error happens when I try the other quantized models, such as llama-2-70b-chat.ggmlv3.q4_K_M.bin and llama-2-70b-chat.ggmlv3.q4_K_S.bin.
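For context, my understanding of the flags in that command (as documented in the main help text):
-gqa 8: grouped-query attention factor, required for the 70B models
-t 9: number of CPU threads
-ngl 1: number of layers to offload to the GPU via Metal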
I get a similar error when using the server.
The 70b models only work without Metal, i.e. when -ngl 1 is omitted (see the command below). The error only happens with the 70b models; the smaller 13b Llama 2 chat models work as expected.
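For reference, this is the same command as above with only the Metal offload flag removed, which runs on CPU for me (adjust the model path for your setup):
./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"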
Environment and Context
Running on my M1 MacBook Pro.
Model Name: MacBook Pro
Model Identifier: MacBookPro18,2
Model Number: MK233LL/A
Chip: Apple M1 Max
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 64 GB
System Firmware Version: 8419.60.44
OS Loader Version: 8419.60.44
Serial Number (system):
Hardware UUID:
Provisioning UDID:
Activation Lock Status: Enabled
llama.cpp built with
LLAMA_METAL=1 make
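For comparison, a CPU-only binary (without the Metal backend compiled in) should be produced by a plain make, assuming a clean tree:
make clean
make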