Can we run LLAMA-2 70b with 4096 context length on 2x 3090? #191

Closed
razorback16 opened this issue Jul 25, 2023 · 6 comments

Comments

@razorback16

razorback16 commented Jul 25, 2023

I have tried LLAMA-2 70B GPTQ 4-bit on 2x 3090 with a 2048 context length and get decent performance (11 tok/sec), but it stops working when I increase the context length beyond 2048 tokens.

The error I am getting is:

    token = self.generator.gen_single_token()
  File "~/oobabooga_linux/text-generation-webui/venv/lib/python3.10/site-packages/exllama/generator.py", line 353, in gen_single_token
    self.apply_rep_penalty(logits)
  File "~/oobabooga_linux/text-generation-webui/venv/lib/python3.10/site-packages/exllama/generator.py", line 335, in apply_rep_penalty
    cuda_ext.ext_apply_rep_penalty_mask_cpu(self.sequence,
  File "~/oobabooga_linux/text-generation-webui/venv/lib/python3.10/site-packages/exllama/cuda_ext.py", line 110, in ext_apply_rep_penalty_mask_cpu
    apply_rep_penalty(sequence, penalty_max, sustain, decay, logits)
TypeError: apply_rep_penalty(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: float, arg2: int, arg3: int, arg4: torch.Tensor) -> None

Invoked with: tensor([], size=(1, 0), dtype=torch.int64), 1.17, -1, 128, None
Output generated in 0.01 seconds (0.00 tokens/s, 0 tokens, context 722, seed 1893750658)
@SinanAkkoyun
Contributor

SinanAkkoyun commented Jul 25, 2023

Are you using the 70B model?

I would advise you to install the ExLlama repo itself (it seems like you are running oobabooga) and test again with test_benchmark_inference.py.

Add -l 4096 as a CLI argument and try to run it.
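
For reference, a minimal sketch of that benchmark invocation, assuming the -d (model directory) and -p (speed benchmark) flags described in the ExLlama README; the model path is a placeholder:

    # run from inside the cloned exllama repo; model path is a placeholder
    python test_benchmark_inference.py -d /path/to/Llama-2-70B-GPTQ -p -l 4096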

@SinanAkkoyun
Contributor

Here is a quick tutorial if you have not done this kind of thing before:
#192 (comment)

@Ph0rk0z
Contributor

Ph0rk0z commented Jul 26, 2023

I am; it's working for me like that.

@bdambrosio

The benchmark is working for me with -gs 18,24 -l 4096 on 2x 3090; it fails if the first -gs value is above 18.
Very happy camper. Thank you!

@Ph0rk0z
Contributor

Ph0rk0z commented Jul 26, 2023

15.5,24 is what I use. The memory limits are still merely suggestions: I lower the first limit until the split looks good and the second GPU doesn't OOM during inference. It's the same as other GPTQ stuff or anything that uses accelerate.
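
To put numbers on that: the -gs / --gpu_split values are, per ExLlama's help text, roughly the VRAM in GB each GPU may use for model weights, so keeping the first value well under 24 leaves GPU 0 headroom for activations (and the desktop, if it drives a display). An illustrative tuned invocation, with a placeholder model path:

    # illustrative only: ~15.5 GB of weights on GPU 0, up to 24 GB on GPU 1
    python test_benchmark_inference.py -d /path/to/Llama-2-70B-GPTQ -p -l 4096 -gs 15.5,24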

@razorback16 razorback16 changed the title from "Can we run LLAMA-2 with 4096 context length on 2x 3090?" to "Can we run LLAMA-2 70b with 4096 context length on 2x 3090?" on Jul 26, 2023
@razorback16
Author

Works great with text-generation-webui after adding --max_seq_len 4096 to the command.
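
For anyone finding this later, a sketch of such a text-generation-webui launch, assuming the usual server.py flags (--model, --loader, --max_seq_len, --gpu-split); the model name and split values are placeholders:

    # model name is a placeholder; --gpu-split is optional and mirrors the
    # 15.5,24 split discussed above
    python server.py --model Llama-2-70B-GPTQ --loader exllama --max_seq_len 4096 --gpu-split 15.5,24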
