🐛 Bug
After adding a new token that contains a space to the GPT2 tokenizer, the tokenizer raises an error at decoding time (see example code below). My current workaround is to preprocess the token to remove spaces before adding it, and to postprocess the decoded output to restore them. I thought I'd share this in case it's something the library can warn against (e.g. added tokens should not include spaces) or even support.
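The workaround described above can be sketched as follows. The helper names are hypothetical (not from the library), and the underscore placeholder is an assumption: any character already present in GPT-2's byte-level vocabulary would work, as long as it cannot collide with characters in the surrounding text.

```python
PLACEHOLDER = "_"  # assumption: a vocab-safe stand-in for spaces in added tokens

def make_space_free(token, placeholder=PLACEHOLDER):
    """Replace spaces so the token can be safely added to the tokenizer."""
    return token.replace(" ", placeholder)

def restore_spaces(text, placeholder=PLACEHOLDER):
    """Undo the replacement after decoding."""
    return text.replace(placeholder, " ")

# Usage sketch:
#   tokenizer.add_tokens([make_space_free("special token")])
#   decoded = restore_spaces(tokenizer.decode(encoded))
```

This keeps the added token out of GPT-2's byte decoder's failure path, at the cost of reserving the placeholder character.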
Model I am using (Bert, XLNet....): GPT2
Language I am using the model on (English, Chinese....): English
The problem arises when using:
- the official example scripts: (give details)
- my own modified scripts: (give details)
The tasks I am working on are:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details)
To Reproduce
Steps to reproduce the behavior:
- Run the following code:
from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_tokens(["special token"])
encoded = tokenizer.encode("special token")
tokenizer.decode(encoded)
- Currently, I get the error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-5-f47101f92e14> in <module>
----> 1 tokenizer.decode(encoded)
~/miniconda3/lib/python3.7/site-packages/pytorch_transformers/tokenization_utils.py in decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces)
665 token_ids, skip_special_tokens=skip_special_tokens
666 )
--> 667 text = self.convert_tokens_to_string(filtered_tokens)
668 if clean_up_tokenization_spaces:
669 text = self.clean_up_tokenization(text)
~/miniconda3/lib/python3.7/site-packages/pytorch_transformers/tokenization_gpt2.py in convert_tokens_to_string(self, tokens)
187 """ Converts a sequence of tokens (string) in a single string. """
188 text = ''.join(tokens)
--> 189 text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
190 return text
191
~/miniconda3/lib/python3.7/site-packages/pytorch_transformers/tokenization_gpt2.py in <listcomp>(.0)
187 """ Converts a sequence of tokens (string) in a single string. """
188 text = ''.join(tokens)
--> 189 text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
190 return text
191
KeyError: ' '
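For context on why the KeyError is specifically `' '` (my reading of the code, not part of the original report): GPT-2's byte-level BPE maps every byte to a printable stand-in character, and the space byte (0x20) is stored as `'Ġ'` in the vocabulary. An added token containing a raw space therefore has no entry in `byte_decoder`. A minimal reconstruction of that mapping, mirroring `bytes_to_unicode` from `tokenization_gpt2.py`:

```python
def bytes_to_unicode():
    # Printable bytes map to themselves; all others (including 0x20, the
    # space) are shifted into a range of printable stand-in characters.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}
# The space byte is encoded as 'Ġ', so a raw ' ' is never a key
# in byte_decoder — hence KeyError: ' ' when decoding an added
# token that contains a literal space.
```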
Expected behavior
I expect the decoder to return the string "special token".
Environment
- OS: OSX
- Python version: 3.7.3
- PyTorch version: 1.1.0
- PyTorch Transformers version (or branch): master (d06c5a2)
- Using GPU? No
- Distributed or parallel setup? No
- Any other relevant information: