Can we run LLAMA-2 70b with 4096 context length on 2x 3090? #191
I have tried LLaMA-2 70B GPTQ 4-bit on 2x 3090 with a 2048 context length and get decent performance (11 tok/sec), but it stops working when I increase the context length beyond 2048 tokens. The error I am getting is:

Comments
Are you using the 70B model? I would advise you to install the ExLlama repo itself (it seems like you are running oobabooga) and test again with test_benchmark_inference.py. Add
Here is a quick tutorial if you have not done that kind of thing before:
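A minimal sketch of that setup, assuming the turboderp/exllama repo and a working CUDA/PyTorch environment (exact install steps may differ on your machine):

```bash
# Rough setup sketch for a standalone ExLlama checkout (not oobabooga's bundled copy).
git clone https://github.com/turboderp/exllama
cd exllama
# Installs the Python dependencies; the CUDA extension is built on first use.
pip install -r requirements.txt
```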
I am; it's working for me like that.
The benchmark is working for me with -gs 18,24 -l 4096 on 2x 3090.
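Concretely, that corresponds to an invocation roughly like the one below; the model directory is a placeholder, and the -d/-p flag names are from memory of test_benchmark_inference.py, so check its --help output if they don't match:

```bash
# Benchmark a 4096-token context split across two 3090s (~18 GB on GPU 0, ~24 GB on GPU 1).
# <model_dir> is a placeholder for the folder holding the GPTQ weights and tokenizer.
python test_benchmark_inference.py -d <model_dir> -p -gs 18,24 -l 4096
```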
15.5,24 is what I use. The memory limits are still merely suggestions: I lower the first limit until the split looks good and the second GPU doesn't OOM during inference. Same as with other GPTQ stuff or anything that uses accelerate.
Works great with text-generation-webui after adding --max_seq_len 4096 to the command.
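For reference, the full launch command would look something like this; only --max_seq_len comes from the comment above, while --loader, --gpu-split, and the model folder name are assumptions about how text-generation-webui exposed the ExLlama backend at the time:

```bash
# Hypothetical text-generation-webui launch for a 4096-token context on 2x 3090.
python server.py --model Llama-2-70B-GPTQ --loader exllama \
    --max_seq_len 4096 --gpu-split 18,24
```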