
Unicode Error for Hindi transcription  #1700

@rahulshivajipawar


When transcribing a Hindi audio file, I get invalid Unicode characters in the output.

Screenshot 2023-12-29 at 8 29 09 PM

I have noticed this with many Hindi files.

I used the whisper-large-v2 model for inference on CPU, and I have seen the same issue when running inference on GPU.

My guess is that the model's token output (BPE encoded) is not being mapped correctly to Unicode characters.
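
To illustrate what I suspect is happening, here is a minimal sketch that does not use Whisper at all. It assumes the tokenizer is byte-level, so that a single Devanagari character (3 bytes in UTF-8) can end up split across two tokens. The split position and variable names are made up for the example; plain byte slices stand in for tokens.

```python
# Illustration only: plain UTF-8 byte slices stand in for byte-level BPE tokens.
text = "नमस्ते"
raw = text.encode("utf-8")  # each Devanagari code point here takes 3 bytes

# Hypothetical token boundary that falls inside a 3-byte character.
split_at = 4
chunk_a, chunk_b = raw[:split_at], raw[split_at:]

# Decoding each chunk on its own leaves incomplete byte sequences,
# which errors="replace" turns into U+FFFD replacement characters.
piecewise = (
    chunk_a.decode("utf-8", errors="replace")
    + chunk_b.decode("utf-8", errors="replace")
)

# Decoding the concatenated bytes recovers the original text.
joined = (chunk_a + chunk_b).decode("utf-8")

print(repr(piecewise))
print(joined == text)
```

If this is the cause, the invalid characters would show up wherever tokens are decoded separately (for example per segment or per word) instead of joining the bytes first and decoding once.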

Labels: bug (Something isn't working)
