
Modify generator.py > generate_simple to accept encode_special_characters? #243


Open

zmarty opened this issue Aug 10, 2023 · 1 comment


zmarty commented Aug 10, 2023

Hello. I noticed a couple of recent PRs added the encode_special_characters parameter to the tokenizer. This is great, because right now I don't think exllama encodes special tokens by default, so it doesn't call LLMs correctly.

However, for this change to actually be useful, I think functions such as generate_simple inside generator.py should also accept this parameter and pass it along to the tokenizer, like so:

def generate_simple(self, prompt, max_new_tokens = 128, encode_special_characters=True):

Is this a reasonable request? I can probably also submit a PR if it's something people think is useful, although I am not a Python programmer. But I did modify the function locally and it works fine.
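In case it helps, my local change looks roughly like this (a sketch only; it assumes the tokenizer's encode already accepts the flag from the PRs linked below, and the rest of the function body is unchanged):

def generate_simple(self, prompt, max_new_tokens = 128, encode_special_characters = True):
    # Pass the flag through so special tokens in the prompt (e.g. <s>, </s>)
    # are encoded as control symbols rather than as literal text.
    ids = self.tokenizer.encode(prompt, encode_special_characters = encode_special_characters)
    ...  # remainder of generate_simple unchanged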

#195
#197
#199

@turboderp
Owner

Well, there's a whole discussion about the "correct" way to tokenize inputs to language models. According to the SentencePiece (SP) authors, it's not correct to pass control symbols to the encoder, which is why there's all that extra code in ExLlamaTokenizer (and in HF's AutoTokenizer) to thwart their efforts. I've gone back and forth on the best way to deal with it, and currently I'm not sure. Even with that, ExLlama still won't tokenize added tokens (beyond the 32000 in the standard Llama vocabulary), and as far as I know even HF doesn't do it correctly, so it's not a simple matter at all.
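To make that concrete, here's a small illustration of the SentencePiece behavior in question (a sketch, assuming a standard Llama tokenizer.model is on hand; the path is just an example):

import sentencepiece as spm

# Load a standard Llama SentencePiece model (illustrative path).
sp = spm.SentencePieceProcessor(model_file = "tokenizer.model")

# Control symbols like <s> are not recognized when they appear in raw text:
# they get split into ordinary pieces instead of mapping to the BOS id.
print(sp.encode("<s> Hello"))    # pieces for "<", "s", ">", " Hello" - not [1, ...]
print(sp.bos_id(), sp.eos_id())  # the actual control ids (1 and 2 for Llama)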

At any rate, generate_simple was supposed to be just that, a simple way of getting some output out of the generator, not an omni-tool for handling more advanced cases like this. That's why it isn't used in the web UI and chatbot examples.

But I don't see any harm in adding an option to encode special characters in the prompt. I would let it default to False though, to minimize chances of regression. So I'll get on that later today.
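Usage would then look something like this, with the option defaulting to False so existing calls keep their current behavior (an illustrative call, not a final API):

output = generator.generate_simple(prompt, max_new_tokens = 200, encode_special_characters = True)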
