EOS Token is not encoded/decoded correctly #199
Comments
This is due to SentencePiece not wanting to encode control symbols as part of the input. HF AutoTokenizer jumps through a lot of hoops to encode those symbols separately, transparently using SentencePiece in a way it wasn't "meant" to be used. I'm currently in two minds about the right way to deal with this, since it leaves you without any good options if you want to encode Regardless, I've just merged @SinanAkkoyun's PR that should allow you to either add BOS and/or EOS tokens with flags to As for the model outputting a multiple-token representation of
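To make the point above concrete, here is a minimal sketch of the two workarounds mentioned: appending BOS/EOS ids via flags, and splitting the text on control-symbol strings before encoding. Everything in it is an assumption for illustration: `encode_plain` is a toy stand-in for a SentencePiece encode call (not the real library), the ids 1 and 2 are LLaMA-style placeholders, and the names `add_bos`, `add_eos`, and `parse_special` are hypothetical, not the flags from the merged PR.

```python
import re

# Hypothetical control symbols and ids (LLaMA-style placeholders, not read from a real model).
BOS, EOS = "<s>", "</s>"
SPECIAL_IDS = {BOS: 1, EOS: 2}


def encode_plain(text):
    """Toy stand-in for a SentencePiece encode call.

    Every character becomes an ordinary piece id, so the literal string
    "</s>" is encoded as four normal pieces, never as the EOS id.
    """
    return [1000 + ord(ch) for ch in text]


def encode(text, add_bos=False, add_eos=False, parse_special=False):
    """Encode text, optionally handling control symbols outside the base encoder."""
    if parse_special:
        # Split on the control-symbol strings, keeping them as separate chunks.
        pattern = "(" + "|".join(re.escape(s) for s in SPECIAL_IDS) + ")"
        ids = []
        for chunk in re.split(pattern, text):
            if chunk in SPECIAL_IDS:
                ids.append(SPECIAL_IDS[chunk])    # insert the control id directly
            elif chunk:
                ids.extend(encode_plain(chunk))   # ordinary text goes to the base encoder
    else:
        ids = encode_plain(text)

    if add_bos:
        ids.insert(0, SPECIAL_IDS[BOS])
    if add_eos:
        ids.append(SPECIAL_IDS[EOS])
    return ids


naive = encode("Hi</s>")                          # "</s>" becomes ordinary pieces
flagged = encode("Hi", add_eos=True)              # EOS id appended explicitly
parsed = encode("Hi</s>", parse_special=True)     # EOS id recovered from the text
```

In this sketch `flagged` and `parsed` both end in the EOS id, while `naive` contains no EOS id at all, which mirrors the behaviour reported in this issue.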
I see, that makes sense. Thanks so much. I will have a look at the PR. Also, as for the model outputting them split up, I have no idea why that's happening. Thank you!
Thank you for merging! :)
In my testing it just predicts the next token! @nivibilla provides the
Hi,
The encoder has the EOS token set, but it's not being encoded correctly.