
Prompt tokenization does not match openai/whisper #1098

@iceychris

Description

Hey there!

When passing a prompt via --prompt, the tokenized word-piece ids do not match those produced by openai/whisper.
This leads to the decoder producing garbage output, probably because it receives combinations of token ids it has never seen during training.

I think one way to resolve this would be to port the tokenizer encode implementation from openai/tiktoken to whisper.cpp.
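For reference, tiktoken-style BPE encoding does not greedily take the longest vocabulary match: it starts from individual bytes and repeatedly merges the adjacent pair with the best (lowest) learned merge rank, which is how " hallo" ends up as " ha" + "llo" rather than " hall" + "o". A minimal sketch of that loop, where ranks stands in for the learned merge table (names and the toy ranks are illustrative, not the actual tiktoken or whisper.cpp API):

def bpe_encode(word: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    # Start with one part per byte, then repeatedly apply the
    # highest-priority (lowest-rank) merge among adjacent pairs.
    parts = [word[i:i + 1] for i in range(len(word))]
    while len(parts) > 1:
        best_i, best_rank = -1, float("inf")
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1], float("inf"))
            if rank < best_rank:
                best_i, best_rank = i, rank
        if best_i < 0:
            break  # no mergeable adjacent pair left
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return parts

# Toy merge table (made-up ranks, for illustration only):
toy_ranks = {b" h": 0, b" ha": 1, b"ll": 2, b"llo": 3, b" hall": 10}
print(bpe_encode(b" hallo", toy_ranks))  # [b' ha', b'llo']

A greedy longest-match scan over the same vocabulary would instead pick " hall" first and be left with "o", which is consistent with the split in the whisper.cpp log below.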

whisper.cpp

$ make && ./main -nt -nf -bs 1 --prompt " hallo" -l de -m models/ggml-tiny.bin samples/jfk.wav
...
whisper_full_with_state: prompt[0] = 50361 | [_PREV_]
whisper_full_with_state: prompt[1] = 6500 |  hall
whisper_full_with_state: prompt[2] = 78 | o
whisper_full_with_state: prompt[3] = 50258 | [_SOT_]
...

openai/whisper

from whisper.tokenizer import get_tokenizer

prompt = " hallo"
tokenizer = get_tokenizer(multilingual=True, language="de", task="transcribe")
ids = tokenizer.encode(prompt)
tokens = [tokenizer.decode([i]) for i in ids]
print(list(zip(ids, tokens)))
# [(324, ' ha'), (1913, 'llo')]
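As a quick cross-check, the ids whisper.cpp emitted decode to the same text with this tokenizer, so (assuming the ggml model ships the same vocabulary, as the log above suggests) only the encode step disagrees:

print(repr(tokenizer.decode([6500, 78])))
# ' hallo' (same string, but a non-canonical split)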
