llama : add support for Classifier-Free Guidance (CFG) sampling to stay on topic better #2083
Comments
hi, I'm here to help if needed |
How would this actually work? Would the entire prompt be scaled, or just the initial instruction as a separate block of text? Would negative prompts (as described in the paper) be yet another input? Multiple inputs seem like a fundamental change for llama.cpp (though not one I am opposed to). |
If I understand this correctly, this looks deceptively simple.
|
@AlphaAtlas I'm not sure what distinction you're drawing between the "entire prompt" and the "initial instruction", but it really depends on the use case. For base models, the results laid out in the paper contrast prompted and promptless completion. That is, we compute the conditioned completion with the whole text, and the unconditioned one starting only from the last token before the net has to continue the text. The pseudocode is roughly the following (a minimal sketch with illustrative names, not the paper's exact code):
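```python
import numpy as np

def cfg_logits(cond_logits: np.ndarray, uncond_logits: np.ndarray, cfg_scale: float) -> np.ndarray:
    # Classifier-Free Guidance in logit space: push the distribution toward the
    # conditioned prediction. cfg_scale = 1.0 recovers plain sampling; > 1.0
    # strengthens the effect of the prompt.
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)

# Per generation step (conceptually):
#   cond_logits   = model(cond_prompt   + generated_so_far)  # conditioned on the full prompt
#   uncond_logits = model(uncond_prompt + generated_so_far)  # prompt (mostly) stripped away
#   next_token    = sample(softmax(cfg_logits(cond_logits, uncond_logits, cfg_scale=1.5)))
```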
In case of multi-round chat, I guess we would empty uncond_prompt at each round.

For assistants, the story is a little bit different. Let assistant_prompt = system_prompt (e.g. "answer the following question") + user_prompt (e.g. "tell me about alpacas"). From my preliminary experiments, promptless continuation with assistants is terrible (feel free to experiment and challenge this; we were heavily time-constrained and did not try it thoroughly), so the unconditional prompt as outlined above doesn't work. Instead, we set uncond_prompt (which really becomes a negative prompt) to the assistant prompt built from the default system prompt and the user prompt, and we build cond_prompt from a different system_prompt (most likely set by the app designer who wants to deviate from the default prompt for a specific tone or persona) plus the same user_prompt. This setting emphasizes the change in system_prompt; see the sketch below. If we actually want to use the default system prompt and emphasize the user prompt instead, we have several options.
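For concreteness, the negative-prompt setup described above looks roughly like this (the prompt strings and chat template are made up for illustration, not taken from any specific model):

```python
default_system = "You are a helpful assistant. Answer the following question."
custom_system  = "You are a rude assistant. Answer the following question."
user_prompt    = "Tell me about alpacas."

def build_prompt(system_prompt: str, user_prompt: str) -> str:
    # Hypothetical chat template; use whatever template the model was trained with.
    return f"{system_prompt}\nUSER: {user_prompt}\nASSISTANT:"

cond_prompt   = build_prompt(custom_system,  user_prompt)  # generation context
uncond_prompt = build_prompt(default_system, user_prompt)  # negative / guidance context
# The two logit streams are then combined with cfg_logits() as in the sketch above.
```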
Hope that helps! EDIT: @bullno1 nailed the explanation. |
@Vermeille So if I understand correctly, given a model like Vicuna, which was trained with a fixed default system prompt in front of the user's message: to apply CFG and make the response rude, we would have the generation context use a modified (rude) system prompt, and have the guidance context use the original system prompt. After applying the logits merging as above, the result will stay close to the persona set out in the generation prompt and stay far away from the guidance/negative prompt. |
@bullno1 100% correct. We had extremely """great""" results when asking for inappropriate or angry responses. |
Fascinating, I think I get it now 🤔 Many (most?) users are running instruction-tuned assistants out in the wild, so this is an interesting issue for the downstream UI devs. |
@Vermeille Should this subscript in Equation (7) be $j < i$ rather than $i < j$? Trying to understand the paper, and want to make sure it's not "peeking" into future tokens. |
Yes. Totally a typo.
|
We should add an example of this technique, or try to straight up add it to the main example. If I'm understanding correctly, there might be a more efficient way to evaluate both contexts with a single batched pass, but to do that we'll need some extra changes to handle multiple KV caches. We can think about this later though. |
@ggerganov If you give me some guidance so that it doesn't take me too much time to implement the feature ("it goes in this file, follow this example, the softmax function is this one, and the caveats are this and that"), I volunteer. |
@Vermeille going through the backlog of issues, just reached this one, so sorry for the late reply. I see that @bullno1 already started an implementation and from a quick look it is pretty much what is needed. |
The proposed implementation looks pretty slick indeed! |
Should I be able to use CFG without a negative prompt, but still use the cfg-scale to enforce the prompt (or parts of the prompt)? Edit: or will it work if I have a negative prompt that is the same as the main prompt, but has "don't" instead of "do" (where "do" is in the main prompt)? |
@Mihaiii Yes. That's the primary usage advocated in the paper.

The negative prompt with "don't" from your edit should do the trick as well.
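For illustration, the two patterns could look like this (prompt strings are made up; either pair is then fed through the same logit combination sketched earlier):

```python
# Pattern 1: no real negative prompt; guide against a (nearly) empty context so
# that cfg_scale simply strengthens adherence to the main prompt.
cond_prompt_1     = "Write a short, formal apology email to a customer."
negative_prompt_1 = ""  # or only the bare continuation tokens

# Pattern 2: a negative prompt that negates part of the main prompt.
cond_prompt_2     = "Write a short apology email. Do use a formal tone."
negative_prompt_2 = "Write a short apology email. Don't use a formal tone."
```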
|
It runs inference twice, once for the normal prompt and once for the negative prompt. |
Yes, with batched decoding you can run F16 + CFG at the same speed as regular F16 decoding. |
It was expected to run slower, but as far as I can see it also uses only about half of the GPU capacity, judging from the GPU history. I'm using the command below:
|
@ggerganov retweeted the "Stay on topic with Classifier-Free Guidance" paper that just came out, showing that Classifier-Free Guidance (CFG) "can be used broadly as an inference-time technique in pure language modeling" and "brings improvements equivalent to a model with twice the parameter-count" (with no retraining needed): https://arxiv.org/abs/2306.17806
I saw that the Transformers library has one of the paper's authors working on an implementation.
I didn't see an issue for it yet here so I figured pointing to it is the least I could do for this awesome library!