
EOS Token is not encoded/decoded correctly #199


Closed
nivibilla opened this issue Jul 26, 2023 · 4 comments

Comments


nivibilla commented Jul 26, 2023

Hi,

The encoder has the EOS token set, but it's not being encoded correctly.

[image]

nivibilla changed the title from "EOS Token is printed but not stopping llama2" to "EOS Token is not encoded/decoded correctly" on Jul 26, 2023
nivibilla reopened this on Jul 26, 2023
nivibilla (Author)

Also, when generating: even though the model was trained to output the single token '2', which is the end token, it splits it up.
[image]

turboderp (Owner)

This is due to SentencePiece not wanting to encode control symbols as part of the input. HF AutoTokenizer jumps through a lot of hoops to encode those symbols separately, transparently using SentencePiece in a way it wasn't "meant" to be used.
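
To illustrate the difference (a minimal sketch, not code from this repo; it assumes a local Llama 2 tokenizer.model and that the sentencepiece and transformers packages are installed):

```python
import sentencepiece as spm
from transformers import AutoTokenizer

# Paths / model names below are placeholders.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
hf = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "Hello</s>"

# Plain SentencePiece treats "</s>" as ordinary text and splits it into
# several regular pieces; it never emits the EOS control token id (2).
print(sp.encode(text))

# HF AutoTokenizer special-cases registered special tokens and maps the
# literal "</s>" in the string back to the single EOS id.
print(hf.encode(text, add_special_tokens=False))
```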

I'm currently in two minds about the right way to deal with this. Encoding control symbols from the input string leaves you without any good options if you want to encode </s> or whatever as literal text: there's no way to escape control symbols, so e.g. sanitizing user input in a chat client becomes really difficult. On the other hand, it's very cumbersome to have to define something like a prompt format as a mixture of control tokens and tokenized text. So yeah. Idk.

Regardless, I've just merged @SinanAkkoyun's PR, which should allow you to add BOS and/or EOS tokens either with flags to tokenizer.encode or by including them in the input string and setting encode_special_characters = True. So you could try that.
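
Rough usage of the two options (a sketch only — the constructor and flag names below are assumptions based on the description above, not verified against the merged PR):

```python
from tokenizer import ExLlamaTokenizer  # exllama's tokenizer module

tokenizer = ExLlamaTokenizer("tokenizer.model")  # placeholder path

# Option 1: ask encode() to add the control tokens via flags (names assumed).
ids = tokenizer.encode("Hello.", add_bos=True, add_eos=True)

# Option 2: put the control tokens in the prompt text and let the tokenizer
# map them back to single token ids.
ids = tokenizer.encode("<s>Hello.</s>", encode_special_characters=True)
```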

As for the model outputting a multiple-token representation of </s>... yes, that's very strange. If it's correctly tuned to output one token, it's statistically pretty much impossible for that to be split up into the multi-token representation of the exact same string instead. The model has no concept of those three tokens combining to form the EOS token, unless it's been tuned to equate those two (i.e. with incorrect tokenizer settings).

nivibilla (Author)

I see, that makes sense. Thanks so much. I will have a look at the PR. Also, yeah, I have no idea why the model is outputting them split up. Thank you!

SinanAkkoyun (Contributor) commented Jul 26, 2023

Thank you for merging! :)

As for the model outputting a multiple-token representation of </s>... yes, that's very strange.

In my testing it just predicts the next token! @nivibilla already provides the </s> as a multi-token string (by encoding without special chars enabled). The model then picks that up and reproduces it. At least, I think that's why.
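
One quick way to check what the model actually received (a rough sketch, assuming the prompt was built with plain SentencePiece, i.e. without special-token handling; the path is a placeholder):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
ids = sp.encode("### Response: done.</s>")

# If "</s>" shows up as several ordinary pieces instead of the single id 2,
# the model only ever saw the multi-token spelling and will reproduce it.
print(ids)
print([sp.id_to_piece(i) for i in ids])
```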
