
EOS Token is not encoded/decoded correctly #199


Closed
nivibilla opened this issue Jul 26, 2023 · 4 comments

Comments


nivibilla commented Jul 26, 2023

Hi,

The encoder has the EOS token set, but it's not being encoded correctly.

[image]

nivibilla changed the title from "EOS Token is printed but not stopping llama2" to "EOS Token is not encoded/decoded correctly" on Jul 26, 2023
nivibilla reopened this on Jul 26, 2023
nivibilla (Author)

Also, when generating: even though the model was trained to output the single token '2', which is the end token, it splits it up.
[image]

turboderp (Owner)

This is due to SentencePiece not wanting to encode control symbols as part of the input. HF AutoTokenizer jumps through a lot of hoops to encode those symbols separately, transparently using SentencePiece in a way it wasn't "meant" to be used.
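
To illustrate the difference (a minimal sketch, not code from this repo; it assumes a local Llama 2 tokenizer.model and that the sentencepiece and transformers packages are installed):

```python
import sentencepiece as spm
from transformers import AutoTokenizer

# Paths / model names below are placeholders.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
hf = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "Hello</s>"

# Plain SentencePiece treats "</s>" as ordinary text and splits it into
# several regular pieces; it never emits the EOS control token id (2).
print(sp.encode(text))

# HF AutoTokenizer special-cases registered special tokens and maps the
# literal "</s>" in the string back to the single EOS id.
print(hf.encode(text, add_special_tokens=False))
```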

I'm currently in two minds about the right way to deal with this. Encoding control symbols from the input string leaves you without any good options if you want to encode </s> or whatever as literal text: there's no way to escape control symbols, so e.g. sanitizing user input in a chat client becomes really difficult. On the other hand, it's very cumbersome to have to define something like a prompt format as a mixture of control tokens and tokenized text. So yeah. Idk.

Regardless, I've just merged @SinanAkkoyun's PR, which should allow you to add BOS and/or EOS tokens either with flags to tokenizer.encode or by including them in the input string and setting encode_special_characters = True. So you could try that.
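
Rough usage of the two options (a sketch only — the constructor and flag names below are assumptions based on the description above, not verified against the merged PR):

```python
from tokenizer import ExLlamaTokenizer  # exllama's tokenizer module

tokenizer = ExLlamaTokenizer("tokenizer.model")  # placeholder path

# Option 1: ask encode() to add the control tokens via flags (names assumed).
ids = tokenizer.encode("Hello.", add_bos=True, add_eos=True)

# Option 2: put the control tokens in the prompt text and let the tokenizer
# map them back to single token ids.
ids = tokenizer.encode("<s>Hello.</s>", encode_special_characters=True)
```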

As for the model outputting a multiple-token representation of </s>... yes, that's very strange. If it's correctly tuned to output one token, it's statistically pretty much impossible for that to be split up into the multi-token representation of the exact same string instead. The model has no concept of those three tokens combining to form the EOS token, unless it's been tuned to equate those two (i.e. with incorrect tokenizer settings).

nivibilla (Author)

I see, that makes sense. Thanks so much. I will have a look at the PR. Also, yeah, I have no idea why the model is outputting them split up. Thank you!

SinanAkkoyun (Contributor) commented Jul 26, 2023

Thank you for merging! :)

As for the model outputting a multiple-token representation of </s>... yes, that's very strange.

In my testing it just predicts the next token! @nivibilla already provides the </s> as a multi-token string (by encoding without special chars enabled). The model then picks that up and reproduces it. At least, I think that's why.
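
One quick way to check what the model actually received (a rough sketch, assuming the prompt was built with plain SentencePiece, i.e. without special-token handling; the path is a placeholder):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
ids = sp.encode("### Response: done.</s>")

# If "</s>" shows up as several ordinary pieces instead of the single id 2,
# the model only ever saw the multi-token spelling and will reproduce it.
print(ids)
print([sp.id_to_piece(i) for i in ids])
```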
