[Feature request?]: Running larger models without quantization. #118
Everything works when quantized.
Check whether you have enough memory. You can add swap on Linux, but even on an NVMe drive it will be very slow due to the random access patterns of the code.
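For reference, one way to sanity-check this on Linux is to compare `MemAvailable` plus `SwapFree` from `/proc/meminfo` against a rough estimate of the model's weight size. This is a minimal sketch, not part of the repo; the 13B parameter count and 2-bytes-per-weight figure are illustrative assumptions.

```python
# Minimal sketch: check available RAM and swap on Linux by parsing /proc/meminfo.
# The model-size estimate below is a rough illustrative assumption, not an official figure.

def read_meminfo():
    """Return /proc/meminfo fields in bytes."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0]) * 1024  # values are reported in kB
    return info

if __name__ == "__main__":
    mem = read_meminfo()
    available = mem.get("MemAvailable", 0)
    swap_free = mem.get("SwapFree", 0)

    # Assumed example: a 13B-parameter model in f16 needs roughly 2 bytes per weight.
    est_model_bytes = 13e9 * 2

    print(f"Available RAM : {available / 2**30:.1f} GiB")
    print(f"Free swap     : {swap_free / 2**30:.1f} GiB")
    print(f"Estimated f16 model size: {est_model_bytes / 2**30:.1f} GiB")
    if available + swap_free < est_model_bytes:
        print("Likely not enough memory even with swap; expect heavy thrashing.")
```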
Even my primary memory plus swap is not enough.
As per the comments from @MarkSchmidty in issue #53, use the 4-bit quantized models to reduce your memory requirements by approximately 4x. The loss in quality should be negligible.
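To illustrate the rough 4x saving described above, here is a small sketch comparing estimated weight storage at f16 (2 bytes per weight) with 4-bit (0.5 bytes per weight) for the commonly cited LLaMA parameter counts. Real model files carry extra overhead (quantization metadata, vocabulary, etc.), so treat the numbers as approximations.

```python
# Estimated weight-storage footprint at f16 vs 4-bit for the commonly cited LLaMA sizes.
# Parameter counts are approximate; actual files include additional overhead.

MODELS = {"7B": 7e9, "13B": 13e9, "30B": 32.5e9, "65B": 65.2e9}

def gib(n_bytes: float) -> float:
    return n_bytes / 2**30

for name, n_params in MODELS.items():
    f16 = n_params * 2    # 16 bits per weight
    q4 = n_params * 0.5   # 4 bits per weight, ignoring quantization metadata
    print(f"{name}: f16 ≈ {gib(f16):.0f} GiB, 4-bit ≈ {gib(q4):.0f} GiB "
          f"(~{f16 / q4:.0f}x smaller)")
```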
Yes, the quantized models work perfectly for me. I just wanted to test out the other versions to see how they perform.
Slowly 😄 If there's a specific model size and prompt you need, I can compare the 4-bit to the f16. I'm currently exploring different option permutations in issue #69.