
Prompt tokenization does not match openai/whisper #1098

@iceychris

Description

Hey there!

When passing a prompt via --prompt, the tokenized word-piece ids do not match those produced by openai/whisper.
This leads to the decoder producing garbage output, probably because it receives combinations of token ids it has never seen during training.

I think one way to resolve this would be to port the tokenizer encode implementation from openai/tiktoken to whisper.cpp.
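For reference, tiktoken-style BPE encoding does not greedily take the longest vocabulary match: it starts from individual bytes and repeatedly merges the adjacent pair with the best (lowest) learned merge rank, which is how " hallo" ends up as " ha" + "llo" rather than " hall" + "o". A minimal sketch of that loop, where ranks stands in for the learned merge table (names and the toy ranks are illustrative, not the actual tiktoken or whisper.cpp API):

def bpe_encode(word: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    # Start with one part per byte, then repeatedly apply the
    # highest-priority (lowest-rank) merge among adjacent pairs.
    parts = [word[i:i + 1] for i in range(len(word))]
    while len(parts) > 1:
        best_i, best_rank = -1, float("inf")
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1], float("inf"))
            if rank < best_rank:
                best_i, best_rank = i, rank
        if best_i < 0:
            break  # no mergeable adjacent pair left
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return parts

# Toy merge table (made-up ranks, for illustration only):
toy_ranks = {b" h": 0, b" ha": 1, b"ll": 2, b"llo": 3, b" hall": 10}
print(bpe_encode(b" hallo", toy_ranks))  # [b' ha', b'llo']

A greedy longest-match scan over the same vocabulary would instead pick " hall" first and be left with "o", which is consistent with the split in the whisper.cpp log below.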

whisper.cpp

$ make && ./main -nt -nf -bs 1 --prompt " hallo" -l de -m models/ggml-tiny.bin samples/jfk.wav
...
whisper_full_with_state: prompt[0] = 50361 | [_PREV_]
whisper_full_with_state: prompt[1] = 6500 |  hall
whisper_full_with_state: prompt[2] = 78 | o
whisper_full_with_state: prompt[3] = 50258 | [_SOT_]
...

openai/whisper

from whisper.tokenizer import get_tokenizer

prompt = " hallo"
tokenizer = get_tokenizer(multilingual=True, language="de", task="transcribe")
ids = tokenizer.encode(prompt)
tokens = [tokenizer.decode([i]) for i in ids]
print(list(zip(ids, tokens)))
# [(324, ' ha'), (1913, 'llo')]
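As a quick cross-check, the ids whisper.cpp emitted decode to the same text with this tokenizer, so (assuming the ggml model ships the same vocabulary, as the log above suggests) only the encode step disagrees:

print(repr(tokenizer.decode([6500, 78])))
# ' hallo' (same string, but a non-canonical split)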
