Description
Speculative sampling is explained here: https://arxiv.org/abs/2302.01318
In simpler terms here:
- Combine large LLM with small LLM for faster inference #630 (comment)
- Combine large LLM with small LLM for faster inference #630 (comment)
To start, the "draft" model can be generated using the train-text-from-scratch example with the same vocab as LLaMA. Later, we can try to utilize better models.
We also assume that batching multiple tokens with the "main" model is significantly faster than processing the tokens one by one. This may not yet be the case, but it will be when we close ggml-org/ggml#293
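For concreteness, below is a minimal, framework-agnostic sketch of the draft-then-verify loop from the paper (not the llama.cpp API). `draft_model` and `target_model` are hypothetical callables that map a token sequence to a next-token probability distribution (a numpy array over the vocab).

```python
# Sketch of speculative sampling (arXiv:2302.01318): draft k tokens with the
# small model, verify them with one batched pass of the large model, and keep
# the accepted prefix. The model callables are stand-ins for illustration.
import numpy as np

def speculative_step(target_model, draft_model, prefix, k=4, rng=None):
    rng = rng or np.random.default_rng()

    # 1) Draft k tokens autoregressively with the cheap model.
    drafted, draft_probs = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft_model(ctx)                      # distribution over the vocab
        tok = int(rng.choice(len(q), p=q))
        drafted.append(tok)
        draft_probs.append(q)
        ctx.append(tok)

    # 2) Evaluate the large model at every drafted position. Written as a loop
    #    here, but in practice this is a single batch of k+1 positions -- which
    #    is why batched decoding (ggml-org/ggml#293) matters.
    target_probs = [target_model(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3) Modified rejection sampling: accept drafted token i with prob min(1, p/q);
    #    on rejection, resample from the renormalized residual max(0, p - q).
    out = []
    for i, tok in enumerate(drafted):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            out.append(tok)
        else:
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            return out                            # stop at the first rejection

    # 4) All k drafted tokens accepted: sample one bonus token from the target.
    out.append(int(rng.choice(len(target_probs[k]), p=target_probs[k])))
    return out
```

The accepted tokens are distributed exactly as if they had been sampled from the target model alone; the draft model only affects how many tokens each target pass yields.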
Activity
SlyEcho commented on Jun 29, 2023
Would it make sense to do something like a beam search with the fast model and then evaluate the result with the larger model?
ggerganov commented on Jul 1, 2023
Yes, this might be even more efficient, as it could increase the "success" rate of the drafted sequence.
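A rough sketch of that idea, under the simplifying assumption of greedy verification: the draft model proposes several candidate continuations (e.g. the top beams of a beam search) and the target model keeps the candidate whose longest prefix matches its own greedy choices. `draft_beams` and `target_model` are hypothetical callables, not llama.cpp functions.

```python
# Hypothetical sketch: pick the drafted beam that survives longest under the
# target model's greedy choices.
import numpy as np

def best_drafted_prefix(target_model, draft_beams, prefix, k=4, n_beams=3):
    candidates = draft_beams(prefix, k, n_beams)       # n_beams sequences of k tokens
    best = []
    for cand in candidates:
        accepted = []
        for i, tok in enumerate(cand):
            p = target_model(list(prefix) + cand[:i])  # one batch per beam in practice
            if int(np.argmax(p)) != tok:               # greedy agreement check
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best
```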
evanmiller commented on Jul 5, 2023
Note that speculative sampling increases overall compute. The algorithm in the linked paper executes the "main" model in parallel over the speculative sequence.
If local compute resources are already saturated, then speculative sampling won't decrease prediction latency; the algorithm requires pre-existing parallelism of some kind (either farming out the parallel evaluation or perhaps a multi-node pipeline architecture). Based on my understanding of llama.cpp's architecture, it doesn't seem like a great fit, but maybe there's a modification that could make it work?
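As a back-of-the-envelope illustration of that tradeoff (illustrative numbers, assuming each drafted token is accepted independently with rate `alpha` and `k` drafted tokens per loop):

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per (batched) target-model evaluation, assuming
    each drafted token is accepted independently with probability alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# e.g. alpha = 0.8, k = 4 -> ~3.36 tokens per target pass, paid for with k extra
# draft-model evaluations and a (k+1)-token batch for the target model.
print(expected_tokens_per_target_pass(0.8, 4))
```

The latency win only materializes if evaluating that (k+1)-token batch costs close to a single-token pass, which is exactly the hardware-parallelism question raised above.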
DKormann commented on Jul 28, 2023
It increases overall computation, but it also increases the parallelisation of inference on the main model, so it can still be faster.
ggerganov commented on Aug 10, 2023
Staged speculative decoding
https://arxiv.org/abs/2308.04623
charliexchen commented on Aug 27, 2023
Hey hey -- I'm one of the authors of https://arxiv.org/abs/2302.01318. It's good to see the open source community pick up on this! I'm sadly not in a position to contribute directly, but since this is already on your radar:
ggerganov commented on Aug 27, 2023
@charliexchen Thanks for stopping by. We are getting close to having everything needed to implement this. Hopefully we will have a prototype soon.
Yes, this matches my understanding.
If we can replicate the speed improvement factor on Apple Silicon + Metal, it would be a game changer.
In your experience, if you are generating highly structured text (e.g. source code in some programming language), does it allow you to increase the size difference with the drafter significantly without losing the speed effect? I imagine this would be the case to some extent, since there would be many "easy-to-predict" tokens in such cases.
charliexchen commented on Aug 27, 2023
In our paper we got much higher speedups for the HumanEval code generation task compared to XSUM using the same model pairing, so acceptance rate is indeed rather task specific. If you have an "easier" task in some sense, then shrinking the drafter is absolutely on the table.
evanmiller commented on Aug 27, 2023
@charliexchen Did you consider using the same model as a draft model? I mean, after layer K < N, immediately sample the output to form a draft token.
charliexchen commented on Aug 27, 2023
This seems related to CALM (which is mentioned in one of the other threads). It should work, but you need to explicitly train/finetune the model to handle that.
The nice thing about spec sampling is that you don't have to touch the target model at all.
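A hypothetical sketch of that early-exit drafting idea, reusing the target model's first few layers as the drafter; as noted above, in practice this needs CALM-style training or fine-tuning so the intermediate hidden state is compatible with the LM head. `embed`, `layers`, and `lm_head` are illustrative callables, not part of any real API.

```python
import numpy as np

def early_exit_draft(token_ids, embed, layers, k_layers, lm_head):
    """Run only the first k_layers (< len(layers)) of the target model and
    greedily pick a draft token from the intermediate hidden state."""
    h = embed(token_ids)                  # (seq_len, d_model) hidden states
    for layer in layers[:k_layers]:       # early exit after k_layers blocks
        h = layer(h)
    logits = lm_head(h[-1])               # project the last position to the vocab
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax
    return int(np.argmax(probs)), probs   # greedy draft token + its distribution
```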
ggerganov commented on Aug 31, 2023
I'll try to do a PoC of speculative sampling today - will post a branch when I get something running
ggerganov commented on Sep 3, 2023
Closed via #2926