GPT2 Tokenizer decoding fails when the added tokens include a space #1133

@harkous

Description

🐛 Bug

After adding a new token that contains a space to the GPT2 tokenizer, decoding fails with an error (see the example code below). My current workaround is to strip the space from the token before adding it and to restore it after decoding; a sketch of this follows. I thought I'd share this in case it is something the library can warn against (e.g. added tokens should not include spaces) or even support.
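For reference, here is a minimal sketch of that workaround. The space-free placeholder "special_token" and the preprocess/postprocess helpers are illustrative choices, not library API; any space-free form works as long as it cannot collide with naturally occurring text:

from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Register a space-free variant instead of the original spaced token.
tokenizer.add_tokens(["special_token"])

def preprocess(text):
    # Rewrite the spaced form to the space-free form before encoding.
    return text.replace("special token", "special_token")

def postprocess(text):
    # Restore the spaced form after decoding.
    return text.replace("special_token", "special token")

encoded = tokenizer.encode(preprocess("a special token appears"))
print(postprocess(tokenizer.decode(encoded)))  # -> a special token appears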

Model I am using (Bert, XLNet, ...): GPT2

Language I am using the model on (English, Chinese, ...): English

The problem arises when using:

  • the official example scripts: (give details)
  • my own modified scripts: (give details)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details)

To Reproduce

Steps to reproduce the behavior:

  1. Run the following code:

from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_tokens(["special token"])      # the added token contains a space
encoded = tokenizer.encode("special token")  # encoding succeeds
tokenizer.decode(encoded)                    # raises KeyError: ' '
  2. Currently, I get the error:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-5-f47101f92e14> in <module>
----> 1 tokenizer.decode(encoded)

~/miniconda3/lib/python3.7/site-packages/pytorch_transformers/tokenization_utils.py in decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces)
    665             token_ids, skip_special_tokens=skip_special_tokens
    666         )
--> 667         text = self.convert_tokens_to_string(filtered_tokens)
    668         if clean_up_tokenization_spaces:
    669             text = self.clean_up_tokenization(text)

~/miniconda3/lib/python3.7/site-packages/pytorch_transformers/tokenization_gpt2.py in convert_tokens_to_string(self, tokens)
    187         """ Converts a sequence of tokens (string) in a single string. """
    188         text = ''.join(tokens)
--> 189         text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
    190         return text
    191 

~/miniconda3/lib/python3.7/site-packages/pytorch_transformers/tokenization_gpt2.py in <listcomp>(.0)
    187         """ Converts a sequence of tokens (string) in a single string. """
    188         text = ''.join(tokens)
--> 189         text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
    190         return text
    191 

KeyError: ' '
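
For context, the KeyError comes from GPT-2's byte-level BPE: the tokenizer remaps every input byte through byte_encoder before BPE runs, so a literal space never occurs inside a vocabulary token (a space is stored as 'Ġ', U+0120). convert_tokens_to_string inverts that mapping character by character via byte_decoder, and an added token containing a raw space has no entry there. A quick check against the same tokenizer (byte_encoder and byte_decoder are attributes of GPT2Tokenizer):

from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# A space is represented as 'Ġ' (U+0120) inside GPT-2 tokens...
print(tokenizer.byte_encoder[ord(' ')])  # Ġ
# ...so byte_decoder has no key for a raw space, hence KeyError: ' '.
print(' ' in tokenizer.byte_decoder)     # False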

Expected behavior

I expect the decoder to return the string "special token".
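
As for the suggestion above that the library could warn against such tokens, a registration-time guard might look like the following (a hypothetical sketch; add_tokens performs no such check):

def check_added_tokens(tokens):
    # Hypothetical validation, not part of pytorch_transformers.
    for tok in tokens:
        if " " in tok:
            raise ValueError(
                "Added token %r contains a space, which the GPT-2 "
                "byte-level decoder cannot handle." % tok)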

Environment

  • OS: OSX
  • Python version: 3.7.3
  • PyTorch version: 1.1.0
  • PyTorch Transformers version (or branch): master (d06c5a2)
  • Using GPU? No
  • Distributed or parallel setup? No
  • Any other relevant information:
