Labels: bug (Something isn't working)
Description
Hey there!
When passing in a prompt via --prompt, the tokenized word-piece ids do not seem to match openai/whisper. This leads to the decoder producing garbage output (probably because it receives combinations of token ids it has never seen before).

I think one way to resolve this would be to port the openai/tiktoken tokenizer encode implementation to whisper.cpp.
whisper.cpp tokenizer encode implementation: https://github.com/ggerganov/whisper.cpp/blob/4774d2feb01a772a15de81ffc34b34a1f294f020/whisper.cpp#L2597-L2623
$ make && ./main -nt -nf -bs 1 --prompt " hallo" -l de -m models/ggml-tiny.bin samples/jfk.wav
...
whisper_full_with_state: prompt[0] = 50361 | [_PREV_]
whisper_full_with_state: prompt[1] = 6500 | hall
whisper_full_with_state: prompt[2] = 78 | o
whisper_full_with_state: prompt[3] = 50258 | [_SOT_]
...
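For comparison, here is a rough Python sketch of greedy longest-prefix matching, which is approximately what the whisper.cpp encode loop linked above does. The vocab below is illustrative: only 6500 ("hall") and 78 ("o") are taken from the log above, 324 and 1913 are the ids openai/whisper produces, and the leading-space handling is simplified away.

```python
def greedy_encode(text: str, vocab: dict[str, int]) -> list[int]:
    # Repeatedly take the longest vocabulary entry that matches at the
    # current position (roughly what the linked whisper.cpp loop does).
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            i += 1  # no match: skip one character (simplification)
    return ids

# Illustrative vocab, not the real Whisper vocabulary.
vocab = {"hall": 6500, "o": 78, " ha": 324, "llo": 1913}
print(greedy_encode("hallo", vocab))  # -> [6500, 78], matching the log
```

The point is that greedy longest-prefix matching can pick a different split than BPE even over the same vocabulary, because BPE chooses merges by rank rather than by length.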
openai/whisper (tiktoken) tokenizer encode implementation: https://github.com/openai/tiktoken/blob/5d970c1100d3210b42497203d6b5c1e30cfda6cb/src/lib.rs#L14-L98
from whisper.tokenizer import get_tokenizer
prompt = " hallo"
tokenizer = get_tokenizer(multilingual=True, language="de", task="transcribe")
ids = tokenizer.encode(prompt)
tokens = [tokenizer.decode([i]) for i in ids]
print(list(zip(ids, tokens)))
[(324, ' ha'), (1913, 'llo')]
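By contrast, tiktoken encodes by repeatedly applying the lowest-ranked byte-pair merge. A minimal sketch in that spirit, using an assumed illustrative rank table rather than Whisper's real merge ranks:

```python
def bpe_encode(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
    # Start from individual bytes and repeatedly merge the adjacent pair
    # with the lowest (best) rank until no mergeable pair remains.
    parts = [piece[i:i + 1] for i in range(len(piece))]
    while len(parts) > 1:
        best = None  # (rank, index) of the best adjacent pair
        for i in range(len(parts) - 1):
            r = ranks.get(parts[i] + parts[i + 1])
            if r is not None and (best is None or r < best[0]):
                best = (r, i)
        if best is None:
            break
        i = best[1]
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return [ranks[p] for p in parts]

# Illustrative ranks (lower rank = earlier merge); not Whisper's table.
ranks = {b" ": 0, b"h": 1, b"a": 2, b"l": 3, b"o": 4,
         b" h": 5, b" ha": 6, b"ll": 7, b"llo": 8}
print(bpe_encode(b" hallo", ranks))  # -> [6, 8], i.e. " ha" + "llo"
```

Under these ranks the merges land on the same " ha" + "llo" split that openai/whisper produces above, which is exactly the split the greedy whisper.cpp loop misses.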