Classifier-Free Guidance #129
Hi! Author here. There are already open issues and PRs tackling this, and some other people want it implemented too. If you have implementation questions, I can probably answer those, or even implement it myself if given directions. I'm not familiar with exllama's codebase at all, so I would just need a few hints (basically, a few design hints and the right place in the code). Although @ortegaalfredo seems to be willing to do it while I'm landing the huggingface version :)
@Vermeille You're an absolute hero 🥲 I originally tried with one CFGLogits warper using just a single word or phrase as the CFG input, but the output was pretty incoherent. I tried again, this time using the instruction text modified by removing/adding a word or phrase. That had a better effect, but the relationship between the CFG value and the effect seemed inversely correlated (a lower value was adhering more). Basically, I'm really confused about how to properly achieve the positive/negative text behavior given the code in the PR; I feel like I did something wrong.
This is the exact code used for the paper:
For negative prompting, we set … There are some other intricacies for assistants, which are outlined here.
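For reference, the core CFG mixing step described in the paper can be sketched in plain Python. This is an illustrative reimplementation, not the paper's actual code; the function names are my own:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a flat list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cfg_mix(cond_logits, uncond_logits, gamma):
    # Classifier-Free Guidance in log space: start from the
    # unconditional (or negative-prompt) distribution and move
    # gamma times along the direction of the conditional one.
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    return [u + gamma * (c - u) for c, u in zip(cond, uncond)]
```

With gamma = 1 this recovers plain conditional sampling, and gamma > 1 strengthens the prompt; for negative prompting, the negative prompt takes the place of the unconditional branch.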
Thanks a lot! That's really helpful. I think this comment in that thread also clarifies a lot of things:
I think what I was doing wrong is the other way around: I set the desired prompt as the CFG guidance and used the baseline as the normal generation, but based on this it should be the reverse, and then the effect is no longer inverted. It's a little unintuitive, but I think I got it. Let's see: I have an instruction and I want to emphasize that it needs to be written in Spanish, so in this case I will set my generation tokens to be:
Correct Sir!
Please note that setting the negative prompt to a simpler/opposite prompt is what we did in the paper for assistants, but not for untuned LMs, where we just remove the prompt, as outlined in my previous answer.
Much less than you! You 4x'd the context size of LLaMA for free. That's quite the achievement!
Is this to say that a way to leverage CFG for assistant-tuned models would be a relative text weight syntax à la Midjourney? /imagine football game::1 old coach::3 This would give 'old coach' a weight 3x as high as 'football game'.
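One speculative way such a weight syntax could map onto CFG is to give each prompt its own guidance strength relative to the unconditional baseline. This is purely illustrative; neither the paper nor exllama defines anything like this, and all names here are hypothetical:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a flat list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def multi_prompt_cfg(prompt_logits, weights, uncond_logits, gamma=1.5):
    # Hypothetical generalization: each prompt pushes the distribution
    # away from the unconditional baseline in proportion to its relative
    # weight, e.g. "football game::1 old coach::3" -> weights [1, 3].
    uncond = log_softmax(uncond_logits)
    total = sum(weights)
    out = list(uncond)
    for logits, w in zip(prompt_logits, weights):
        cond = log_softmax(logits)
        for j, (c, u) in enumerate(zip(cond, uncond)):
            out[j] += gamma * (w / total) * (c - u)
    return out
```

With a single prompt of weight 1, this reduces to the standard two-branch CFG mix.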
Ported over the code from Transformers, but I'm having trouble getting it working. The output turns quite wonky. I probably did something completely wrong.
Note: I have also modified … However, the results are just quite bad. I'm not sure why, but I think it may have to do with the cache. The Transformers code allows you to specify cached key/value pairs in the logits processor, but I think a further rewrite of the exllama kernels would be required to accomplish that.
I quickly glanced over the code on the bus, but it looks like your negative context always uses the last token only. Also, you should remove that last softmax and linear interpolation.
I think that the dimensionality of tensors in exllama is also different from Transformers.
Does this look right, then? Or am I misunderstanding you?
Still doesn't work. It seems that no matter what I try, exllama falls into looping after a few tokens. I'm still not sure why, but if I comment out these two lines, like so, then exllama functions normally. So I'm leaning towards this being a cache issue, and I think self.out probably needs its own cache. This is only a guess, though.
So, I haven't read up on CFG yet, but it looks like you're essentially doing two generations in parallel and mixing the logits? If that's the case, you would need a cache per sequence, since they're, well, different sequences, with different keys/values to cache. If you run the forward pass with a cache, keys and values from that forward pass will be added to that cache. In any case, the sampler is supposed to be called on a set of logits, and generating new logits within it seems wrong. Especially since …
Yes. You CFG-mix the logits, sample a new token, append it to both branches, and start over.
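That loop can be sketched end-to-end in plain Python. Here `forward(ids)` is a stand-in for the model call; in a real exllama port each branch would carry its own KV cache, per the cache-per-sequence point above, and sampling would not be greedy:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a flat list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cfg_generate(forward, cond_ids, uncond_ids, gamma, steps):
    # CFG sampling loop: run two sequences in parallel, mix their
    # logits each step, and append the chosen token to BOTH branches.
    cond = list(cond_ids)      # positive/conditional branch
    uncond = list(uncond_ids)  # unconditional/negative branch
    out = []
    for _ in range(steps):
        lc = log_softmax(forward(cond))
        lu = log_softmax(forward(uncond))
        mixed = [u + gamma * (c - u) for c, u in zip(lc, lu)]
        tok = max(range(len(mixed)), key=mixed.__getitem__)  # greedy pick
        out.append(tok)
        cond.append(tok)       # same token continues both branches
        uncond.append(tok)
    return out
```

The key point matching the discussion above is that only one token is sampled per step, from the mixed logits, and both branches advance with it.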
Okay, I wrote up an example in … The output does indeed seem to be a smooth gradient between "helpful" and rude, as per the two prompts:
The code looks correct!
Just a heads-up on CFG, a technique with which "models can perform as well as a model 2x as large" at the cost of 2x the computation, which is negligible if it turns a 65B model into a 130B-class LLM.
This technique already works with computer vision NNs.
https://arxiv.org/abs/2306.17806
https://twitter.com/Vermeille_/status/1675664118500454400
Any tips on how to implement this in exllama? I'm a developer, so perhaps I can try to implement it myself.