
server : display token probabilities in the UI #2489


Merged: 41 commits merged into ggml-org:master on Aug 25, 2023

Conversation

@jhen0409 (Collaborator) commented Aug 2, 2023

#2423

This is a simple implementation for displaying the probabilities of the llama response.

It renders a popover for each token. The popover is based on preact-portal; it's short, so I made some modifications and copied it into index.html.

Dark mode: [screenshot]

Light mode: [screenshot]

For bytes, I just add a bottom border line to split them: (https://github.com/ggerganov/llama.cpp/assets/3001525/ad92444e-58cc-445a-b8a9-44704236e285)

(Screenshots updated after 04b6f2c)

You can enable More options -> Show Probabilities to use the n_probs param.
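For reference, here is a rough sketch of the request and response shapes involved, written with nlohmann::json (the json type used in the server code excerpts below). Apart from n_probs and completion_probabilities, which are named in this PR, the field names are illustrative assumptions rather than a spec:

    #include <iostream>
    #include <nlohmann/json.hpp>

    using json = nlohmann::json;

    int main() {
        // Hedged sketch, not the actual server API definition: a completion
        // request with token probabilities enabled.
        const json request = {
            {"prompt",    "Building a website can be done in"},
            {"n_predict", 64},
            {"stream",    true},
            {"n_probs",   10},  // ask for the top candidates of each generated token
        };

        // Roughly the shape the UI reads back for each streamed chunk
        // (field names other than completion_probabilities are assumptions).
        const json chunk = {
            {"content", " a"},
            {"completion_probabilities", json::array({
                {
                    {"content", " a"},
                    {"probs", json::array({
                        {{"tok_str", " a"},   {"prob", 0.42}},
                        {{"tok_str", " the"}, {"prob", 0.31}}
                    })}
                }
            })}
        };

        std::cout << request.dump(2) << "\n" << chunk.dump(2) << "\n";
        return 0;
    }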

@ghost commented Aug 2, 2023

Ah, this is fun. It works as expected on Android too.

@jubruckne

Very nice! But maybe you could round the probabilities to 2 decimals or so?

@jhen0409 (Collaborator, Author) commented Aug 4, 2023

Very nice! But maybe you could round the probabilities to 2 decimals or so?

Sure, I'll do it later, thanks! Maybe I'll just change the numbers to percentages.
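A tiny illustration of the rounding being discussed (the real formatting happens in the web UI's JavaScript; this C++ sketch just shows the arithmetic, with the raw value kept around for a tooltip):

    #include <cstdio>

    // Illustrative only: render a probability as a percentage with two decimals,
    // keeping the raw value available (e.g. for a title attribute).
    static void print_prob(float prob) {
        std::printf("%.2f%% (raw: %f)\n", 100.0f * prob, prob);
    }

    int main() {
        print_prob(0.421875f); // prints "42.19% (raw: 0.421875)"
        return 0;
    }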

@staviq (Contributor) commented Aug 5, 2023

I just had the craziest idea.

I'm not requesting anything, just wondering about the viability of this idea.

Do you think this could be used to create spelling suggestions while typing, like the Android keyboard does?

@SlyEcho (Collaborator) commented Aug 7, 2023

If it was percentages, it would be cool.

The byte thing should be fixed at the API level, but at the time we added the probabilities, we couldn't figure it out.

@jhen0409 (Collaborator, Author) commented Aug 7, 2023

The byte thing should be fixed at the API level, but at the time we added the probabilities, we couldn't figure it out.

I have read PR #1962, and I'm a bit confused about this: shouldn't we improve it by converting the bytes on the UI side?

I'm thinking maybe we can merge the bytes to get a readable result (also helpful for Chinese and other-language users), but I'm not sure whether it would cause other problems.

@jhen0409 (Collaborator, Author)

I'm thinking maybe we can merge the bytes to get a readable result (also helpful for Chinese and other-language users), but I'm not sure whether it would cause other problems.

I've confirmed that the other byte pairs can't be decoded successfully, so I just hide them. I see that the OpenAI playground does the same thing.

@SlyEcho (Collaborator) commented Aug 12, 2023

There is some strange thing that happens sometimes: the probabilities disappear from the result.

[image]

@jhen0409 (Collaborator, Author)

There is some strange thing that happens sometimes: the probabilities disappear from the result.

[image]

I'm not very sure why that happened; maybe the completion_probabilities of some partial responses is not an array, but as far as I know server.cpp should ensure it is an array.

I just removed the array check on completion_probabilities for messages and now only check params.n_probs > 0 for that; it should avoid this problem.

@ggerganov added the high priority (Very important issue) and 🦙. llama labels on Aug 17, 2023
@ghost commented Aug 20, 2023

[screenshot]

Still working as expected on Android! ♥️

Edit: Using Iceraven (Firefox)

Code under review (server.cpp):

    {
        // Always send partial response
        // so we can get the correct partial response of the last to_send in the client
        const json data = format_partial_response(llama, to_send, probs_output);
@jhen0409 (Collaborator, Author) commented on the code above, Aug 21, 2023:
I also made the last to_send go out as a partial response, so we can correctly get the probabilities of the last message (the final response includes all probabilities).

@ghost commented Aug 21, 2023

The problem should be fixed now. I made a mistake with the completion_probabilities usage, so previously it couldn't show probabilities for this partial response:

Thank you for the fixes, my testing shows it's working.
For my own understanding, when you refer to a "partial response", do you mean how the tokenizer pieces words together? I.e. with llama!, the tokenizer chops it into parts; is that what's meant by partial response?

Thank you.

[screenshot]

@slaren (Member) commented Aug 21, 2023

I tested the same generation again, but now there is a missing token (ll):
[image]
[image]

The probs are still there, but hidden in the space:
[image]

@jhen0409 (Collaborator, Author)

The problem should be fixed now. I made a mistake with the completion_probabilities usage, so previously it couldn't show probabilities for this partial response:

Thank you for the fixes, my testing shows it's working. For my own understanding, when you refer to a "partial response", do you mean how the tokenizer pieces words together? I.e. with llama!, the tokenizer chops it into parts; is that what's meant by partial response?

Thank you.

[screenshot]

Generally, a partial response contains the result of a single llama_eval call and is sent via the event stream, so it is usually one token. Multi-byte characters (like emoji) are also handled in server.cpp, so a partial response may include 2~4 tokens.

But something like llama probably shouldn't arrive as a single partial response; I think I found the reason, see my next comment.
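A hedged sketch of the multi-byte handling described above (illustrative names, not the actual server.cpp code): the UTF-8 lead byte says how many bytes have to be collected before the text can be decoded, which is why a single partial response may bundle several one-byte tokens:

    #include <cstdio>

    // Illustrative sketch: number of bytes a UTF-8 sequence starting with `lead`
    // needs before it can be decoded and shown.
    static int utf8_sequence_length(unsigned char lead) {
        if ((lead & 0x80) == 0x00) return 1; // plain ASCII
        if ((lead & 0xE0) == 0xC0) return 2; // 2-byte sequence
        if ((lead & 0xF0) == 0xE0) return 3; // 3-byte sequence (most CJK)
        if ((lead & 0xF8) == 0xF0) return 4; // 4-byte sequence (most emoji)
        return 1;                            // lone continuation/invalid byte
    }

    int main() {
        // 0xF0 starts a 4-byte sequence, so up to three more byte tokens would be
        // collected before the chunk goes out as one partial response.
        std::printf("%d\n", utf8_sequence_length(0xF0)); // prints 4
        return 0;
    }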

@jhen0409 (Collaborator, Author) commented Aug 22, 2023

I tested the same generation again, but now there is a missing token (ll): [image]

The probs are still there, but hidden in the space: [image]

Confirmed it was a problem with find_partial_stop_string in server.cpp: tokens like L or l are incorrectly considered a partial stop word.

Log:

    token_text:  L
    pos: 11
    stop_pos: 18446744073709551615
    final stop_pos: 1 (wrong)
    to_send: (empty)

    token_text: l
    pos: 12
    stop_pos: 18446744073709551615
    final stop_pos: 0 (wrong)
    to_send: (empty)

    token_text: ama
    pos: 12
    stop_pos: 18446744073709551615
    final stop_pos: 0 (wrong)
    to_send: (empty)

I'll fix this later.

UPDATE: The fix is here, but it was a problem with sent_token_probs_index; the above log is actually expected, as we need to wait for possible stop words.
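For intuition, a minimal sketch of a partial-stop check along these lines (this is not the actual find_partial_stop_string implementation, and the "Llama:" stop word is an assumption for the example):

    #include <algorithm>
    #include <cstdio>
    #include <string>

    // Hedged sketch: return the position in `text` where a prefix of `stop`
    // begins at the very end of `text`, or npos if the tail cannot start a stop
    // word. A match means the text has to be held back until more tokens arrive.
    static size_t find_partial_stop(const std::string & stop, const std::string & text) {
        const size_t max_len = std::min(stop.size(), text.size());
        for (size_t len = max_len; len > 0; --len) {
            if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
                return text.size() - len;
            }
        }
        return std::string::npos;
    }

    int main() {
        // With a stop word like "Llama:", a trailing " L" could be the start of
        // the stop word, so to_send stays empty and the server waits.
        std::printf("%zu\n", find_partial_stop("Llama:", " L")); // prints 1
        return 0;
    }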

Comment on lines 1289 to 1293 (server.cpp):

    -    const std::string to_send = llama.generated_text.substr(pos, stop_pos);
    +    const std::string to_send = stop_pos == std::string::npos
    +        ? llama.generated_text.substr(pos, std::string::npos)
    +        : ""; // just don't send anything if we're not done

@jhen0409 (Collaborator, Author) commented:

Before this fix, to_send was a whitespace string when it got stop_pos 1 from L, and then sent_token_probs_index would be incorrect.

@jhen0409 (Collaborator, Author)

Also, I merged the master branch, so you need to use GGUF models for testing here.

@slaren (Member) commented Aug 24, 2023

I noticed that each interaction has increasingly more vertical space. Also happens without token probabilities.

[image]

@jhen0409 (Collaborator, Author) commented Aug 24, 2023

I noticed that each interaction has increasingly more vertical space. Also happens without token probabilities.

[image]

I get the same thing on master. In this case the model responds with content like \n\nUser: and we cut the antiprompt User:, which leaves the extra newlines behind. It should be easy to improve.
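A small sketch of the kind of cleanup meant here (illustrative only, not the actual change): once the trailing antiprompt has been cut, the leftover newlines can be trimmed so they don't accumulate in the chat view:

    #include <cstdio>
    #include <string>

    // Illustrative only: drop trailing newlines left over after the antiprompt
    // has been removed from the end of the model output.
    static std::string rtrim_newlines(std::string text) {
        while (!text.empty() && (text.back() == '\n' || text.back() == '\r')) {
            text.pop_back();
        }
        return text;
    }

    int main() {
        std::printf("[%s]\n", rtrim_newlines("Sure, here you go.\n\n").c_str());
        // prints "[Sure, here you go.]"
        return 0;
    }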

@jhen0409 (Collaborator, Author)

After fixing the newline issue, I think this can be merged. Thank you guys!

@jhen0409 jhen0409 merged commit 29674ab into ggml-org:master Aug 25, 2023
@jhen0409 jhen0409 deleted the server-probs branch August 25, 2023 10:32
@IgnacioFDM (Contributor)

Nice one. What I'd like to have in the future is a notebook mode (so basic completion instead of chat). Do you have any plans for that? I could maybe hack it together a few weeks from now when I'm a bit less busy.

@jhen0409 (Collaborator, Author)

Nice one. What I'd like to have in the future is a notebook mode (so basic completion instead of chat). Do you have any plans for that? I could maybe hack it together a few weeks from now when I'm a bit less busy.

I'm also thinking about having a pure text-completion mode in the web UI, but the plan is not very clear yet. Currently I'm using the vim plugin for that, but the web UI could maybe provide more visual capabilities. It's a low priority for me, but interesting.

akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* server : add n_probs param in chat UI

* server : keep message data array & show in probabilites component

* server : add simple popover component

* server : fix completion_probabilities undefined if not set n_probs

* server : implement Probabilites

* server : handle bytes

* server : make n_probs max to 10 for easy scroll

* server : adjust for dark/light mode

* server : Fix regenerated prompt

* server : update index.html.hpp

* server : convert prob to percentage + show original value as div title

* server : fix Probabilites not used if included empty str

* server : skip byte pair in display probabilites

* server : remove array check of completion_probabilities in messages

* skip empty array or byte pair (> 1) in Probabilites

* generate index.html.hpp

* fix incorrect prob convert if the str is already a known token

* use final response to show probabilities on stop

* revert unnecessary change

* correct probabilites usage

* remove unused function

* always send partial response for get correct probs of last to_send

* fix typo

* fix content of format_final_response

* refactor probs render & make pColor transparent if not found

* send empty string when got stop_pos in partial

* avoid unnecessary empty data event & send rest of partial tokens on stop

* use <br /> for new line

* skip -1 tok in loop to avoid send '' on end

* trim last new lines on stop

* revert unnecessary change
Labels: high priority (Very important issue), 🦙. llama
7 participants