GPT2 Tokenizer decoding fails when the added tokens include a space #1133

@harkous

Description

🐛 Bug

After adding a new token that contains a space to the GPT2 tokenizer, decoding fails with an error (see the example code below). My current workaround is to strip the space from the token before adding it and to restore it after decoding; a sketch of this follows. I thought I'd share this in case it is something the library can warn against (e.g. added tokens should not include spaces) or even support.
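For reference, here is a minimal sketch of that workaround. The space-free placeholder "special_token" and the preprocess/postprocess helpers are illustrative choices, not library API; any space-free form works as long as it cannot collide with naturally occurring text:

from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Register a space-free variant instead of the original spaced token.
tokenizer.add_tokens(["special_token"])

def preprocess(text):
    # Rewrite the spaced form to the space-free form before encoding.
    return text.replace("special token", "special_token")

def postprocess(text):
    # Restore the spaced form after decoding.
    return text.replace("special_token", "special token")

encoded = tokenizer.encode(preprocess("a special token appears"))
print(postprocess(tokenizer.decode(encoded)))  # -> a special token appears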

Model I am using (Bert, XLNet, ...): GPT2

Language I am using the model on (English, Chinese, ...): English

The problem arises when using:

  • the official example scripts: (give details)
  • my own modified scripts: (give details)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details)

To Reproduce

Steps to reproduce the behavior:

  1. Run the following code:

from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_tokens(["special token"])      # the added token contains a space
encoded = tokenizer.encode("special token")  # encoding succeeds
tokenizer.decode(encoded)                    # raises KeyError: ' '
  2. Currently, I get the error:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-5-f47101f92e14> in <module>
----> 1 tokenizer.decode(encoded)

~/miniconda3/lib/python3.7/site-packages/pytorch_transformers/tokenization_utils.py in decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces)
    665             token_ids, skip_special_tokens=skip_special_tokens
    666         )
--> 667         text = self.convert_tokens_to_string(filtered_tokens)
    668         if clean_up_tokenization_spaces:
    669             text = self.clean_up_tokenization(text)

~/miniconda3/lib/python3.7/site-packages/pytorch_transformers/tokenization_gpt2.py in convert_tokens_to_string(self, tokens)
    187         """ Converts a sequence of tokens (string) in a single string. """
    188         text = ''.join(tokens)
--> 189         text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
    190         return text
    191 

~/miniconda3/lib/python3.7/site-packages/pytorch_transformers/tokenization_gpt2.py in <listcomp>(.0)
    187         """ Converts a sequence of tokens (string) in a single string. """
    188         text = ''.join(tokens)
--> 189         text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
    190         return text
    191 

KeyError: ' '
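
For context, the KeyError comes from GPT-2's byte-level BPE: the tokenizer remaps every input byte through byte_encoder before BPE runs, so a literal space never occurs inside a vocabulary token (a space is stored as 'Ġ', U+0120). convert_tokens_to_string inverts that mapping character by character via byte_decoder, and an added token containing a raw space has no entry there. A quick check against the same tokenizer (byte_encoder and byte_decoder are attributes of GPT2Tokenizer):

from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# A space is represented as 'Ġ' (U+0120) inside GPT-2 tokens...
print(tokenizer.byte_encoder[ord(' ')])  # Ġ
# ...so byte_decoder has no key for a raw space, hence KeyError: ' '.
print(' ' in tokenizer.byte_decoder)     # False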

Expected behavior

I expect the decoder to return the string "special token".
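
As for the suggestion above that the library could warn against such tokens, a registration-time guard might look like the following (a hypothetical sketch; add_tokens performs no such check):

def check_added_tokens(tokens):
    # Hypothetical validation, not part of pytorch_transformers.
    for tok in tokens:
        if " " in tok:
            raise ValueError(
                "Added token %r contains a space, which the GPT-2 "
                "byte-level decoder cannot handle." % tok)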

Environment

  • OS: OSX
  • Python version: 3.7.3
  • PyTorch version: 1.1.0
  • PyTorch Transformers version (or branch): master (d06c5a2)
  • Using GPU? No
  • Distributed or parallel setup? No
  • Any other relevant information:
