The initial token is always empty. #367

Closed
BadisG

Description

Hello,

I noticed something when trying the chat with Bob: the first token is always empty.

1 -> ''
4103 -> ' Trans'
924 -> 'cript'
310 -> ' of'
263 -> ' a'
7928 -> ' dialog'

So the result is this:

[screenshot: the output begins with a leading space before "Transcript of a dialog, where the User..."]

There's this little space at the beginning of the text. Maybe this alone can significantly impact the quality of the output, which is why I decided to post this issue.

I'm on Windows 10 using WSL to emulate the Linux environment (the main.exe build is not as good as the Linux main at the moment).

I'm using a model file that is the result of the following steps:

  1. I started with a llama-7b-4bit.pt file
  2. I converted it with the GPTQ-to-ggml converter (convert-gptq-to-ggml.py)
  3. I converted it again to the new ggml format with the script from "Breaking change of models since PR #252" (#324 (comment))

Here's the .sh command (7B_CHAT_Bob.sh):

#!/bin/bash
dos2unix 7B_CHAT_Bob.sh

./main -m ./models/llama7b-4bit-GPTQ.bin -t 14 -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

Everything in this repository is up to date, as I do a git pull every time I launch PowerShell.

Activity

Labels added on Mar 21, 2023: question (further information is requested), need more info (the OP should provide more details about the issue)
gjmulder (Collaborator) commented on Mar 21, 2023

Please review the issue reporting guidelines in #239 and provide a better description of the issue you are observing.

BadisG (Author) commented on Mar 21, 2023

> Please review the issue reporting guidelines in #239 and provide a better description of the issue you are observing.

I added more details based on your guidelines; I hope that helps.

PriNova commented on Mar 21, 2023

> I noticed something when trying the chat with Bob: the first token is always empty. [...]
> There's this little space at the beginning of the text. [...]

The token with ID 1 is a special control token, BOS (beginning of sequence), and is one of the two tokens required in the token vocabulary. The second is EOS (end of sequence), with ID 2.

That is to say, this is normal behaviour.
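
To make this concrete, here is a minimal, self-contained sketch (hypothetical names and a hand-copied vocabulary, not the llama.cpp API) of why the first printed token appears empty: the tokenizer prepends the BOS id, and the vocabulary maps that id to an empty string.

    // Minimal sketch, not the llama.cpp API: BOS_ID, EOS_ID, and the token
    // list are illustrative, copied from the dump earlier in this issue.
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    static const int BOS_ID = 1; // beginning-of-sequence control token
    static const int EOS_ID = 2; // end-of-sequence control token (unused here)

    int main() {
        // Token ids and their vocabulary strings, as reported above.
        std::vector<std::pair<int, std::string>> toks = {
            {BOS_ID, ""}, // the control token decodes to an empty string
            {4103, " Trans"}, {924, "cript"}, {310, " of"},
            {263, " a"}, {7928, " dialog"},
        };
        for (const auto & t : toks) {
            std::fputs(t.second.c_str(), stdout); // BOS contributes nothing visible
        }
        std::putchar('\n'); // prints " Transcript of a dialog" (leading space kept)
        return 0;
    }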

BadisG (Author) commented on Mar 21, 2023

@PriNova I see, thanks for your answer, I learned something today!
But I can still see a space at the beginning of the text; I don't think I had that before, and it's a bit ugly to look at... but if it doesn't change the output I'm OK with that.

mattsta commented on Mar 22, 2023

You can make token 1 go away by commenting out this line in llama_tokenize() in utils.cpp:

    if (bos) {
        // output.push_back(1); // don't prepend the BOS token (id 1)
    }

It's probably more correct with it there, but removing it doesn't seem to break anything (at least if you're only submitting one whole document per session).

As for the leading space, look at your initial tokens above:

4103 -> ' Trans'
924 -> 'cript'

The space is inside the first token, so it gets printed. Technically, if the first token starts with a space, the output could skip over it when printing; a sketch of that idea follows below.
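
A hedged sketch of that idea (illustrative only, not the actual llama.cpp printing code): drop a single leading space from the very first text token before echoing it.

    // Illustrative sketch, not the actual llama.cpp printing code.
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> pieces = {" Trans", "cript", " of", " a", " dialog"};
        bool first = true;
        for (const auto & piece : pieces) {
            const char * s = piece.c_str();
            if (first && *s == ' ') {
                ++s; // skip the tokenizer-injected space, only on the first token
            }
            first = false;
            std::fputs(s, stdout);
        }
        std::putchar('\n'); // prints "Transcript of a dialog", no leading space
        return 0;
    }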

Green-Sky (Collaborator) commented on Mar 22, 2023

The leading space is intentional and a result of
https://github.com/ggerganov/llama.cpp/blob/d5850c53ca179b9674b98f35d359763416a3cc11/main.cpp#L232-L233

Not sure if we should just skip printing the first character (the space) or not.
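
For context, a minimal sketch of what the referenced lines do (a paraphrase; see the permalink above for the exact source): a space is prepended to the prompt before tokenization to match the original LLaMA tokenizer, which expects words to be space-prefixed.

    // Paraphrased sketch of the behaviour at the linked main.cpp lines.
    #include <cstdio>
    #include <string>

    int main() {
        std::string prompt = "Transcript of a dialog, where the User...";
        prompt.insert(0, 1, ' '); // prepend a space to match the OG LLaMA tokenizer
        std::printf("[%s]\n", prompt.c_str()); // "[ Transcript of a dialog, ...]"
        return 0;
    }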

added a commit that references this issue on Dec 19, 2023:

abf6d4a Merge pull request ggml-org#367 from ianscrivener/ianscrivener-macos-…
github-actions (Contributor) commented on Apr 10, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
