Speculative Decoding Appendix Expansion #69
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
inference.md (Outdated)

Every accepted token becomes more expensive in terms of FLOPs on average (since some will be rejected, and we have to call a draft model), but we wring more FLOPs out of the hardware, and the small model is cheap, so we win overall. Since everything has been checked by the big model, we don't change the sampling distribution at all (though the exact trajectory will differ for non-greedy).

Traditionally, speculative decoding was easiest to deploy when a smaller version of an existing model already exists for a similar sampling distribution, e.g. LLaMA-2 2B for LLaMA-2 70B. Even when this is available, the smaller drafter can still be too expensive. Instead, it can be helpful to embed a drafter within the main model, for instance by adding a new head to the penultimate layer <d-cite key="eagle"></d-cite> or multiple parallel heads <d-cite key="medusa"></d-cite> or a multi-token prediction (MTP) head <d-cite key="DeepSeek3"></d-cite>.
A few small nits:
- Present tense is probably better. "Traditionally, speculative decoding relies on the existence of a smaller model with a similar sampling distribution to the target model".
- "..., the smaller drafter can still be too expensive if the acceptance rate is low."
- "Instead, it has become common to embed a drafter within the main model, for instance..."
- It's not clear how EAGLE and Medusa are different. What does multiple parallel heads mean? Maybe don't repeat the word head twice.
I like the new wording, updated.
Also updated the EAGLE description to say "new network": EAGLE is nominally a head, but its implementation is better characterized as a small network.
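For context on the first quoted paragraph, here is a minimal sketch of the accept/reject rule it describes (standard speculative sampling in plain NumPy; the function and variable names are hypothetical and not taken from inference.md):

```python
import numpy as np

def speculative_step(draft_tokens, draft_probs, target_probs, rng):
    """One verification step.

    draft_tokens: [k] token ids proposed by the cheap draft model.
    draft_probs, target_probs: [k, vocab] distributions at each drafted position.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if rng.uniform() < min(1.0, p / q):
            out.append(int(tok))  # accepted: keep the draft token as-is
        else:
            # Rejected: resample from the residual max(0, p_target - p_draft),
            # which keeps the overall output distribution equal to the target's.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            break  # stop at the first rejection
    # (The "bonus" token sampled from the target model when all k drafts are
    # accepted is omitted here for brevity.)
    return out

# Toy usage with random distributions, just to show the shapes involved.
rng = np.random.default_rng(0)
k, vocab = 4, 8
draft_probs = rng.dirichlet(np.ones(vocab), size=k)
target_probs = rng.dirichlet(np.ones(vocab), size=k)
draft_tokens = np.array([rng.choice(vocab, p=draft_probs[i]) for i in range(k)])
print(speculative_step(draft_tokens, draft_probs, target_probs, rng))
```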
inference.md (Outdated)
Traditionally, speculative decoding relies on the existence of a smaller model with a similar sampling distribution to the target model, e.g. LLaMA-2 2B for LLaMA-2 70B. Even when this is available, the smaller drafter can still be too expensive if the acceptance rate is low. Instead, it can be helpful to embed a drafter within the main model, for instance by adding a new network to the penultimate layer <d-cite key="eagle"></d-cite> or multiple parallel heads <d-cite key="medusa"></d-cite> or a multi-token prediction (MTP) head <d-cite key="DeepSeek3"></d-cite>.
a few more nits:
- "relies on the existence... for LLaMA-2 70B, which often doesn't exist. Even when one is available..."
- "...for instance by adding a small drafter head near the final layers of the main model that's both faster to run and higher quality because it shares parameters with the target model (cite ...)."
Updated!
inference.md (Outdated)
Traditionally, speculative decoding relies on the existence of a smaller model with a similar sampling distribution to the target model, e.g. LLaMA-2 2B for LLaMA-2 70B, which often doesn't exist. Even when this is available, the smaller drafter can still be too expensive if the acceptance rate is low. Instead, it can be helpful to embed a drafter within the main model, for instance by adding a small drafter head at the penultimate layer of the main model that's both faster to run and higher quality because it shares parameters with the target model <d-cite key="eagle"></d-cite> or has multiple parallel heads <d-cite key="medusa"></d-cite> or is a multi-token prediction (MTP) head <d-cite key="DeepSeek3"></d-cite>.
I still don't think we need all these details. Just "Instead, it can be helpful to embed a drafter within the main model, for instance by adding a dedicated drafter head to one of the later layers of the base model ([1], [2], [3]). Because this head shares most of its parameters with the main model, it's faster to run and matches the sampling distribution more closely."
I don't think we need to really get into the details of how it's implemented beyond that.
LGTM! Thank you.
Added a section on speculative decoding limitations and techniques to overcome them (e.g. EAGLE, Medusa).
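As a rough illustration of the drafter-head idea in the final wording, here is a hedged sketch in plain NumPy (shapes and names are hypothetical, and this is not the exact construction used by any of the cited papers):

```python
import numpy as np

class DrafterHead:
    """Small head that reads a late hidden state of the main model and proposes
    a draft token, reusing the main model's activations instead of running a
    separate draft model."""

    def __init__(self, d_model, vocab, rng):
        # The only new parameters; everything upstream is shared with the target model.
        self.proj = rng.normal(size=(d_model, d_model)) * 0.02
        self.unembed = rng.normal(size=(d_model, vocab)) * 0.02

    def draft_logits(self, hidden):
        # hidden: [d_model] activation taken from one of the main model's later layers.
        return np.maximum(hidden @ self.proj, 0.0) @ self.unembed  # [vocab]

# Toy usage: propose one draft token from a random "hidden state".
rng = np.random.default_rng(0)
head = DrafterHead(d_model=16, vocab=32, rng=rng)
draft_token = int(np.argmax(head.draft_logits(rng.normal(size=16))))
```

Roughly speaking, Medusa attaches several such heads in parallel (one per future position), while EAGLE runs a small autoregressive network over the target model's features, and DeepSeek-V3's MTP module predicts additional tokens sequentially.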