Working with long stories #307

Open
leszekhanusz opened this issue Jun 1, 2023 · 5 comments
Labels
bug (Something isn't working) · duplicate (This issue or pull request already exists) · oobabooga (https://github.com/oobabooga/text-generation-webui)

Comments

@leszekhanusz

leszekhanusz commented Jun 1, 2023

I'm trying to make long stories using a llama.cpp model (guanaco-33B.ggmlv3.q4_0.bin in my case) with oobabooga/text-generation-webui.

It works for short inputs, but it stops working once the number of input tokens approaches the context size (2048).

After playing with the webui a bit (you can count input tokens and modify max_new_tokens on the main page), I found that the behavior is as follows:

- If nb_input_tokens + max_new_tokens < context_size, then it works correctly.
- If nb_input_tokens < context_size but nb_input_tokens + max_new_tokens > context_size, then it fails silently, generating 0 tokens:

Output generated in 0.25 seconds (0.00 tokens/s, 0 tokens, ...

- If nb_input_tokens > context_size, then it fails with:

llama_tokenize: too many tokens
llama_tokenize: too many tokens
llama_tokenize: too many tokens
Output generated in 0.28 seconds (0.00 tokens/s, 0 tokens, ...
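
For reference, a minimal sketch of the check I mean, using the llama-cpp-python `Llama` API (the model path, prompt, and token budget below are just placeholders, not an exact repro):

```python
# Minimal sketch of the three cases above (placeholder values).
from llama_cpp import Llama

CONTEXT_SIZE = 2048
llm = Llama(model_path="guanaco-33B.ggmlv3.q4_0.bin", n_ctx=CONTEXT_SIZE)

prompt = "Once upon a time, " * 400  # long enough to approach the context size
max_new_tokens = 512

nb_input_tokens = len(llm.tokenize(prompt.encode("utf-8")))

if nb_input_tokens + max_new_tokens < CONTEXT_SIZE:
    print("fits: generation works correctly")
elif nb_input_tokens < CONTEXT_SIZE:
    print("prompt fits but prompt + max_new_tokens does not: fails silently, 0 tokens")
else:
    print("prompt alone exceeds the context: 'llama_tokenize: too many tokens'")
```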

I've seen llama-cpp-python issue #92, but it is closed and I'm on a recent version of llama-cpp-python (release 0.1.57).

llama-cpp-python should probably discard some input tokens at the beginning so that the prompt fits inside the context window and we can continue long stories.
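
In the meantime, something along these lines could work as a caller-side workaround; it only relies on `Llama.tokenize()` / `Llama.detokenize()`, everything else is illustrative and not what the library itself does:

```python
# Caller-side workaround sketch: drop the oldest tokens so that
# nb_input_tokens + max_new_tokens stays below the context size.
def truncate_prompt(llm, prompt: str, context_size: int, max_new_tokens: int) -> str:
    tokens = llm.tokenize(prompt.encode("utf-8"))
    budget = context_size - max_new_tokens
    if len(tokens) > budget:
        tokens = tokens[-budget:]  # discard tokens at the beginning of the story
    return llm.detokenize(tokens).decode("utf-8", errors="ignore")

# Usage (llm and CONTEXT_SIZE as in the sketch above):
# safe_prompt = truncate_prompt(llm, long_story, CONTEXT_SIZE, max_new_tokens=256)
# output = llm(safe_prompt, max_tokens=256)
```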

@gjmulder gjmulder added the quality (Quality of model output) label Jun 2, 2023
@agronholm

Just to add, this is not a problem with llama.cpp itself; I can have very long conversations with llama.cpp in interactive mode. Also, I ran into this in a situation where the context size wasn't anywhere near 2048; it simply refused to generate more tokens.

@gjmulder gjmulder added the duplicate (This issue or pull request already exists) label Jun 8, 2023
@gjmulder
Contributor

gjmulder commented Jun 8, 2023

So it seems other people are reporting the issue via Ooba in #331. I attempted to reproduce directly in llama-cpp-python, but couldn't.

@gjmulder gjmulder added the oobabooga (https://github.com/oobabooga/text-generation-webui) and bug (Something isn't working) labels and removed the quality (Quality of model output) label Jun 9, 2023
@dillfrescott

Having the same issue

@agronholm

> Having the same issue

Describe exactly how this happened to you.

@dillfrescott

I'm using a Matrix bot that's hooked up to the oobabooga text-generation-webui via llama-cpp-python. It seems to start throwing the error after only a few messages.
