Commit d12a63c

convert : fix incorrect added token dedup in BpeVocab
1 parent b2b63d1 commit d12a63c

File tree

1 file changed: +1 −1 lines changed


convert.py

Lines changed: 1 addition & 1 deletion
@@ -387,7 +387,7 @@ def __init__(self, fname_tokenizer: Path, fname_added_tokens: Path | None):
             (item['content'], item['id'])
             for item in tokenizer_json.get('added_tokens', [])
             # Added tokens here can be duplicates of the main vocabulary.
-            if item['content'] not in bpe_tokenizer)
+            if item['content'] not in self.vocab)

         vocab_size = len(self.vocab)
         expected_ids = list(range(vocab_size, vocab_size + len(added_tokens)))
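The fixed condition filters `added_tokens` entries against the main vocabulary (`self.vocab`) rather than against `bpe_tokenizer`, so tokens already present in the main vocab are dropped before the id-range check. Below is a minimal, self-contained sketch of that dedup step; the names `main_vocab` and `added_tokens_json` are stand-ins for the real `convert.py` structures, not the actual code.

```python
# Main vocabulary: token content -> id (stands in for self.vocab).
main_vocab = {'<s>': 0, 'hello': 1, 'world': 2}

# tokenizer.json 'added_tokens' entries can duplicate the main vocabulary.
added_tokens_json = [
    {'content': '<s>', 'id': 0},       # duplicate: already in main_vocab
    {'content': '<custom>', 'id': 3},  # genuinely new token
]

# Keep only added tokens whose content is NOT already in the main vocab,
# mirroring the fixed condition `if item['content'] not in self.vocab`.
added_tokens = dict(
    (item['content'], item['id'])
    for item in added_tokens_json
    if item['content'] not in main_vocab
)
print(added_tokens)  # {'<custom>': 3}

# The sanity check from the diff: surviving added-token ids must
# continue directly from the end of the main vocabulary's id range.
vocab_size = len(main_vocab)
expected_ids = list(range(vocab_size, vocab_size + len(added_tokens)))
print(expected_ids)  # [3]
```

Checking membership against the wrong mapping would let duplicate entries survive into `added_tokens`, making its ids overlap the main vocab and fail the `expected_ids` comparison.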
