
llama : add support for Classifier-Free Guidance (CFG) sampling to stay on topic better #2083


Closed
lukestanley opened this issue Jul 3, 2023 · 19 comments · Fixed by #2135
Labels
enhancement (New feature or request) · generation quality (Quality of model output) · good first issue (Good for newcomers) · research 🔬

Comments

@lukestanley

lukestanley commented Jul 3, 2023

@ggerganov retweeted the "Stay on topic with Classifier-Free Guidance" paper, which shows that Classifier-Free Guidance (CFG) "can be used broadly as an inference-time technique in pure language modeling" and "brings improvements equivalent to a model with twice the parameter-count", with no retraining needed: https://arxiv.org/abs/2306.17806

I saw that the Transformers library has one of the paper's authors working on an implementation.

I didn't see an issue for it yet here so I figured pointing to it is the least I could do for this awesome library!

@Vermeille
Contributor

Hi, I'm here to help if needed.

@AlphaAtlas

AlphaAtlas commented Jul 3, 2023

How would this actually work? Would the entire prompt be scaled, or just the initial instruction as a separate block of text?

Would negative prompts (as described in the paper) be yet another input?

Multiple inputs seem like a fundamental change for llama.cpp (though not one I am opposed to).

@bullno1
Contributor

bullno1 commented Jul 3, 2023

If I understand this correctly, this looks deceptively simple.
The recent multi-context support makes this possible:

  • Have a context for generation
  • Have another context for guidance, seeded with the last token of the prompt.
    This is an "unconditional" context to contrast against a conditional context with a prompt.
    I believe this can also be seeded with a negative prompt.
  • Eval both contexts
  • Use the logits of the guidance context to modify the logits of the generation context, using the formula from huggingface/transformers#24536 (Add Classifier-Free Guidance sampling).
    This looks like it is just a log-softmax and a weighted sum (see the sketch after this list).
  • Do sampling as usual from these new logits
  • Append the next token to both contexts
  • Repeat
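
To make the merge step above concrete, here is a minimal NumPy sketch of the formula from that Transformers PR. It is only an illustration, not the code that later landed in llama.cpp; cond_logits, uncond_logits and cfg_scale are placeholder names.

import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def cfg_logits(cond_logits, uncond_logits, cfg_scale):
    # Interpolate/extrapolate in log-probability space:
    # cfg_scale = 1.0 reproduces plain sampling from the generation context;
    # larger values push toward the prompt and away from the guidance/negative prompt.
    cond = log_softmax(np.asarray(cond_logits, dtype=np.float64))
    uncond = log_softmax(np.asarray(uncond_logits, dtype=np.float64))
    return uncond + cfg_scale * (cond - uncond)

# Toy example over a 4-token vocabulary:
merged = cfg_logits([2.0, 0.5, 0.1, -1.0], [1.0, 1.0, 0.2, -0.5], cfg_scale=1.5)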

@Vermeille
Contributor

Vermeille commented Jul 3, 2023

@AlphaAtlas I'm not clear what distinction you're drawing between the "entire prompt" and the "initial instruction", but really it depends on the use case.

For base models, the results laid out in the paper contrast prompted and promptless completion. That is, we compute the conditional completion with the whole text, and the unconditional one starting only from the last token before the net has to continue the text.

In pseudocode:

prompt = tokenize(user_prompt)
uncond_prompt = prompt[:, -1:]  # keep only the last token

while we want to sample continuation:
    cfg_logits = cfg(cond=model(prompt), uncond=model(uncond_prompt))
    next_token = sample(cfg_logits)

    prompt.append(next_token)
    uncond_prompt.append(next_token)

In the case of multi-round chat, I guess we would empty uncond_prompt at each round.


For assistants, the story is a little bit different.

Let assistant_prompt = system_prompt (e.g. "answer the following question") + user_prompt (e.g. "tell me about alpacas").

From my preliminary experiments, promptless continuation with assistants is terrible (feel free to experiment and challenge this; we were heavily time-constrained and did not try it thoroughly), so the unconditional prompt as outlined above doesn't work. Instead, we set uncond_prompt (which really becomes a negative prompt) to the assistant prompt built from the default system_prompt and the user_prompt, and we build cond_prompt from a different system_prompt (most likely set by the app designer who wants to deviate from the default prompt for a specific tone or persona) and the same user_prompt. This setting emphasizes the change in system_prompt.

If we actually want to use the default system prompt and emphasize the user prompt, we have several options:

  1. We could try to search for a neutral, generic negative user_prompt ("tell me something"? We need something neutral, but we could not find a satisfying wording).
  2. Introduce a special syntax. For instance, "Tell me about alpacas {as rap lyrics}" would get split into the positive prompt "Tell me about alpacas as rap lyrics" and the negative prompt "Tell me about alpacas", thus emphasizing "as rap lyrics" (a toy sketch of this split follows the list).
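
A toy sketch of option 2, assuming the "{...}" marker syntax above; split_cfg_prompt is a hypothetical helper for illustration, not part of llama.cpp.

def split_cfg_prompt(user_prompt):
    # "Tell me about alpacas {as rap lyrics}" ->
    #   positive: "Tell me about alpacas as rap lyrics"
    #   negative: "Tell me about alpacas"
    if "{" not in user_prompt or "}" not in user_prompt:
        return user_prompt, user_prompt  # no marker: nothing to emphasize
    before, rest = user_prompt.split("{", 1)
    emphasized, after = rest.split("}", 1)
    positive = (before + emphasized + after).strip()
    negative = (before + after).strip()
    return positive, negative

print(split_cfg_prompt("Tell me about alpacas {as rap lyrics}"))
# -> ('Tell me about alpacas as rap lyrics', 'Tell me about alpacas')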

Hope that helps!

EDIT: @bullno1 nailed the explanation.

@bullno1
Contributor

bullno1 commented Jul 3, 2023

@Vermeille So if I understand correctly, given a model like Vicuna which was trained on:

A chat between a user and a helpful, polite assistant ...
USER: [User input]
ASSISTANT: [Use model to generate this]

To apply CFG and make the response rude, we would give the generation context the modified system prompt:

A chat between a user and a rude and obnoxious assistant ...
USER: Tell me about LLM.
ASSISTANT:

Then give the guidance context the original system prompt:

A chat between a user and a polite assistant ...
USER: Tell me about LLM.
ASSISTANT:

After applying the logit merging as above, the result will stay close to the persona set out in the generation prompt and far away from the guidance/negative prompt.

@Vermeille
Contributor

@bullno1 100% correct. We had extremely """great""" results when asking for inappropriate or angry responses.
100% hilariously unhinged and over the top.

@AlphaAtlas

Fascinating, I think I get it now 🤔

Many (most?) users are running instruction-tuned assistants out in the wild, so this is an interesting issue for the downstream UI devs.

@evanmiller
Contributor

@Vermeille Should this subscript in Equation (7) be $j < i$ rather than $i < j$? I'm trying to understand the paper and want to make sure it's not "peeking" into future tokens.

[screenshot of Equation (7) from the paper]
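
For reference, the logit merge discussed in this thread corresponds to the CFG rule below, where the conditioning runs only over already-generated tokens $w_{j<i}$ (hence $j < i$); whether this matches the exact notation of Equation (7) in the paper is the question above.

$$\log \tilde{P}(w_i \mid w_{j<i}, c) \;=\; \log P(w_i \mid w_{j<i}) \;+\; \gamma \,\big(\log P(w_i \mid w_{j<i}, c) - \log P(w_i \mid w_{j<i})\big)$$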

@Vermeille
Contributor

Vermeille commented Jul 5, 2023 via email

@ggerganov changed the title from "Feature request: Classifier-Free Guidance sampling to stay on topic better" to "llama : add support for Classifier-Free Guidance (CFG) sampling to stay on topic better" on Jul 5, 2023
@ggerganov added the enhancement (New feature or request), good first issue (Good for newcomers), generation quality (Quality of model output) and research 🔬 labels on Jul 5, 2023
@ggerganov
Member

We should add an example of this technique or try to straight up add it to the main / server examples if it does not complicate the logic too much. As noted by @bullno1, the multi-context support should make this relatively easy.

If I'm understanding correctly, there might be a more efficient way to evaluate both contexts with a single batched pass, but to do that we'll need some extra changes to handle multiple KV caches. We can think about this later, though.

@Vermeille
Contributor

@ggerganov If you give me some guidance so that it does not take me too much time to implement the feature ("it goes in this file, follow this example, the softmax function is this one, and the caveats are this and that"), I volunteer.

@ggerganov
Member

@Vermeille Going through the backlog of issues, I just reached this one, so sorry for the late reply.

I see that @bullno1 already started an implementation, and from a quick look it is pretty much what is needed.
Let's work on #2135 and merge it.

@Vermeille
Contributor

The proposed implementation looks pretty slick indeed!

@Mihaiii
Contributor

Mihaiii commented Aug 31, 2023

Should I be able to use CFG without a negative prompt, but still have cfg-scale enforce the prompt (or parts of the prompt)?

Edit: or will it work if I have a negative prompt that is the same as the main prompt, but has "don't" instead of "do" (where "do" is in the main prompt)?

@Vermeille
Contributor

Should I be able to use CFG without a negative prompt, but still have cfg-scale enforce the prompt (or parts of the prompt)?

@Mihaiii Yes. That's the primary usage advocated in the paper.

Edit: or will it work if I have a negative prompt that is the same as the main prompt, but has "don't" instead of "do" (where "do" is in the main prompt)?

Should do the trick as well.

@DenisSergeevitch

--cfg-negative-prompt usage heavily affects performance on the M2 Metal hardware. Is it the same with CUDA?

@bullno1
Contributor

bullno1 commented Nov 24, 2023

--cfg-negative-prompt usage heavily affects performance on the M2 Metal hardware. Is it the same with CUDA?

It runs inference twice, once for the normal prompt and once for the negative prompt.
Maybe the batch API could help.

@ggerganov
Member

Yes, with batched decoding you can run F16 + CFG at the same speed as regular F16 decoding.
For quantum models, there are some ifs and buts, but it should also be possible

@RafaAguilar

RafaAguilar commented Dec 2, 2023

Yes, with batched decoding you can run F16 + CFG at the same speed as regular F16 decoding.

For quantum models, there are some ifs and buts, but it should also be possible

It was expected to run slower, but from what I can see in the GPU history it also uses only half of the GPU capacity.

I'm using the command below:

./main --cfg-scale 4 --mirostat 2 \
 -n 500 -c 2048 -b 8 --temp 0.7 \
 --top-p 0.2 -ngl 99 -t 10 \
 -m "$LLAMACPP_MODEL" \
 --prompt "$LLAMACPP_PROMPT" \
 --cfg-negative-prompt "$LLAMACPP_NEGATIVE_PROMPT"

[screenshot of GPU utilization history]


9 participants