Can we run LLAMA-2 70b with 4096 context length on 2x 3090? #191

Closed
razorback16 opened this issue Jul 25, 2023 · 6 comments

Comments

@razorback16

razorback16 commented Jul 25, 2023

I have tried LLAMA-2 70B GPTQ 4-bit on 2x 3090 with a 2048 context length and get decent performance (11 tok/sec), but it stops working when I increase the context length beyond 2048 tokens.

The error I am getting is:

    token = self.generator.gen_single_token()
  File "~/oobabooga_linux/text-generation-webui/venv/lib/python3.10/site-packages/exllama/generator.py", line 353, in gen_single_token
    self.apply_rep_penalty(logits)
  File "~/oobabooga_linux/text-generation-webui/venv/lib/python3.10/site-packages/exllama/generator.py", line 335, in apply_rep_penalty
    cuda_ext.ext_apply_rep_penalty_mask_cpu(self.sequence,
  File "~/oobabooga_linux/text-generation-webui/venv/lib/python3.10/site-packages/exllama/cuda_ext.py", line 110, in ext_apply_rep_penalty_mask_cpu
    apply_rep_penalty(sequence, penalty_max, sustain, decay, logits)
TypeError: apply_rep_penalty(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: float, arg2: int, arg3: int, arg4: torch.Tensor) -> None

Invoked with: tensor([], size=(1, 0), dtype=torch.int64), 1.17, -1, 128, None
Output generated in 0.01 seconds (0.00 tokens/s, 0 tokens, context 722, seed 1893750658)
@SinanAkkoyun
Contributor

SinanAkkoyun commented Jul 25, 2023

Are you using the 70B model?

I would advise you to install the ExLlama repo itself (it seems like you are running oobabooga) and test again with test_benchmark_inference.py.

Add -l 4096 as a CLI argument and try to run it.
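
For reference, a minimal sketch of that benchmark invocation, assuming the -d (model directory) and -p (speed benchmark) flags described in the ExLlama README; the model path is a placeholder:

    # run from inside the cloned exllama repo; model path is a placeholder
    python test_benchmark_inference.py -d /path/to/Llama-2-70B-GPTQ -p -l 4096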

@SinanAkkoyun
Contributor

Here is a quick tutorial if you have not done this kind of thing before:
#192 (comment)

@Ph0rk0z
Contributor

Ph0rk0z commented Jul 26, 2023

I am; it's working for me like that.

@bdambrosio

The benchmark is working for me with -gs 18,24 -l 4096 on 2x 3090; it fails if the first -gs value is above 18.
Very happy camper. Thank you!

@Ph0rk0z
Contributor

Ph0rk0z commented Jul 26, 2023

15.5,24 is what I use. The memory limits are still merely suggestions: I lower the first limit until the split looks good and the second GPU doesn't OOM during inference. It's the same as other GPTQ stuff or anything that uses accelerate.
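
To put numbers on that: the -gs / --gpu_split values are, per ExLlama's help text, roughly the VRAM in GB each GPU may use for model weights, so keeping the first value well under 24 leaves GPU 0 headroom for activations (and the desktop, if it drives a display). An illustrative tuned invocation, with a placeholder model path:

    # illustrative only: ~15.5 GB of weights on GPU 0, up to 24 GB on GPU 1
    python test_benchmark_inference.py -d /path/to/Llama-2-70B-GPTQ -p -l 4096 -gs 15.5,24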

@razorback16 razorback16 changed the title from "Can we run LLAMA-2 with 4096 context length on 2x 3090?" to "Can we run LLAMA-2 70b with 4096 context length on 2x 3090?" on Jul 26, 2023
@razorback16
Author

Works great with text-generation-webui after adding --max_seq_len 4096 to the command.
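
For anyone finding this later, a sketch of such a text-generation-webui launch, assuming the usual server.py flags (--model, --loader, --max_seq_len, --gpu-split); the model name and split values are placeholders:

    # model name is a placeholder; --gpu-split is optional and mirrors the
    # 15.5,24 split discussed above
    python server.py --model Llama-2-70B-GPTQ --loader exllama --max_seq_len 4096 --gpu-split 15.5,24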
