@jploski After going through 50 loops I ended up where I started; it appears I tested it incorrectly the first time. Broadcasting breaks batched processing once n_batch reaches 32 (in most cases; the threshold is sometimes slightly above or below 32).

I just pushed a hotfix that disables broadcasting during batching (back to repeating); it must be some nasty problem in the tensor shapes. It would be great if it can be solved, though I think the repeat is less damaging in batched processing; it only makes full GPU integration a bit harder.

Here is a test case:

falcon_main -t 4 -m falcon-7b\q4_1 -p "Who are all the 5 people named? The first one is John and we have another one called Paul and then we have Nina and Alexa lastly there is the famous sportsman and pop cult singer Dudu! Answer:" -n 64 --temp 0 --override-max-gpu -b 31

-b 31 usually works; -b 32 and above fails.

What do you mean by "fails"?

I just tried the pre-hotfix commit d94c88d using a q5_1 version of 7B (and used --override-max-gpu 1; is that what you meant?). It produced roughly the same output with -b 31, -b 32, and -b 33 (I think the differences are because of cuBLAS).

Edit: never mind, I see the issue (garbage is generated with a large batch size).
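
For anyone who wants to narrow this down, the comparison discussed above (broadcasting versus repeating, at batch sizes around 32) can be expressed as a small consistency check. The following is only a minimal sketch in plain Python, not the falcon_main or ggml code: every function name here is hypothetical, and the "shared key" setup is an assumed stand-in for whichever tensor is being broadcast. It uses integer data so the two paths can be compared exactly; with real floating-point kernels (for example cuBLAS) a tolerance would probably be needed instead.

```python
# Hypothetical sketch: none of these names come from the falcon/ggml code base.

def scores_repeat(queries, shared_key, n_head):
    """Materialize one copy of the shared key per head, then take dot products."""
    keys = [list(shared_key) for _ in range(n_head)]  # explicit repeat
    return [
        [sum(q * k for q, k in zip(token[h], keys[h])) for h in range(n_head)]
        for token in queries
    ]


def scores_broadcast(queries, shared_key, n_head):
    """Reuse the single shared key for every head without copying it."""
    return [
        [sum(q * k for q, k in zip(token[h], shared_key)) for h in range(n_head)]
        for token in queries
    ]


def make_queries(n_batch, n_head, head_dim):
    """Deterministic integer test data laid out as queries[token][head][dim]."""
    return [
        [[(t * 31 + h * 7 + d) % 11 - 5 for d in range(head_dim)]
         for h in range(n_head)]
        for t in range(n_batch)
    ]


def paths_agree(n_batch, n_head=4, head_dim=8):
    """Return True if the repeat and broadcast paths give identical scores."""
    queries = make_queries(n_batch, n_head, head_dim)
    shared_key = [(d * 3) % 5 - 2 for d in range(head_dim)]
    repeated = scores_repeat(queries, shared_key, n_head)
    broadcasted = scores_broadcast(queries, shared_key, n_head)
    return repeated == broadcasted


if __name__ == "__main__":
    # Check the batch sizes on both sides of the reported threshold.
    for n_batch in (31, 32, 33):
        print(n_batch, paths_agree(n_batch))
```

In a correct implementation both paths agree for every batch size, so a check of this shape against the real kernels should point at the first batch size where the broadcast path diverges.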

Slowdown with tokens #6
