Conversation

stefanZorcic
Contributor

Added a section on speculative decoding limitations and techniques to overcome them (e.g. EAGLE, Medusa).

google-cla bot commented Aug 20, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

inference.md Outdated

Every accepted token becomes more expensive in terms of FLOPs on average (since some will be rejected, and we have to call a draft model), but we wring more FLOPs out of the hardware, and the small model is cheap, so we win overall. Since everything has been checked by the big model, we don't change the sampling distribution at all (though the exact trajectory will differ for non-greedy).

Traditionally, speculative decoding was easiest to deploy when a smaller version of an existing model already exists for a similar sampling distribution, e.g. LLaMA-2 2B for LLaMA-2 70B. Even when this is available, the smaller drafter can still be too expensive. Instead, it can be helpful to embed a drafter within the main model, for instance by adding a new head to the penultimate layer <d-cite key="eagle"></d-cite> or multiple parallel heads <d-cite key="medusa"></d-cite> or a multi-token prediction (MTP) head <d-cite key="DeepSeek3"></d-cite>.
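To make the verification step concrete, here is a minimal NumPy sketch of the standard speculative-sampling accept/reject rule the quoted paragraph describes. The names `verify_draft`, `q_probs`, and `p_probs` are illustrative, not from inference.md:

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(draft_tokens, q_probs, p_probs):
    """Accept/reject verification from standard speculative sampling.

    draft_tokens: k tokens proposed by the cheap draft model.
    q_probs[i]:   draft-model distribution at draft position i (vocab-sized).
    p_probs[i]:   target-model distribution at position i; all k+1 of these
                  come from a single batched forward pass of the big model.
    Returns the accepted tokens plus one corrective/bonus token, so the
    output is distributed exactly as if sampled from the target model alone.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p_probs[i][tok] / q_probs[i][tok]):
            out.append(tok)
        else:
            # On the first rejection, resample from the residual
            # max(0, p - q); this correction preserves the target
            # distribution overall.
            residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
            out.append(rng.choice(len(residual), p=residual / residual.sum()))
            return out
    # Every draft accepted: take a free bonus token from the target's
    # next-position distribution, which the same forward pass computed.
    out.append(rng.choice(len(p_probs[-1]), p=p_probs[-1]))
    return out
```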
Collaborator

A few small nits:

  • Present tense is probably better. "Traditionally, speculative decoding relies on the existence of a smaller model with a similar sampling distribution to the target model".
  • "..., the smaller drafter can still be too expensive if the acceptance rate is low."
  • "Instead, it has become common to embed a drafter within the main model, for instance..."
  • It's not clear how EAGLE and Medusa are different. What does multiple parallel heads mean? Maybe don't repeat the word head twice.

Contributor Author

I like the new wording, updated.

Also updated the EAGLE description to "new network": EAGLE is nominally a head, but it is better characterized as a small network in its implementation.

inference.md Outdated

Every accepted token becomes more expensive in terms of FLOPs on average (since some will be rejected, and we have to call a draft model), but we wring more FLOPs out of the hardware, and the small model is cheap, so we win overall. Since everything has been checked by the big model, we don't change the sampling distribution at all (though the exact trajectory will differ for non-greedy).

Traditionally, speculative decoding relies on the existence of a smaller model with a similar sampling distribution to the target model, e.g. LLaMA-2 2B for LLaMA-2 70B. Even when this is available, the smaller drafter can still be too expensive if the acceptance rate is low. Instead, it can be helpful to embed a drafter within the main model, for instance by adding a new network to the penultimate layer <d-cite key="eagle"></d-cite> or multiple parallel heads <d-cite key="medusa"></d-cite> or a multi-token prediction (MTP) head <d-cite key="DeepSeek3"></d-cite>.
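As a back-of-the-envelope check on the "we win overall" claim, a common simplifying assumption is an i.i.d. per-token acceptance rate $\alpha$ with $k$ drafted tokens per step, under which the expected number of tokens produced per target-model forward pass is

$$\mathbb{E}[\text{tokens per verify step}] = \frac{1 - \alpha^{k+1}}{1 - \alpha},$$

e.g. roughly 3.4 tokens per big-model call for $\alpha = 0.8$ and $k = 4$, versus 1 for plain autoregressive decoding.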
Collaborator

a few more nits:

  • "relies on the existence... for LLaMA-2 70B, which often doesn't exist. Even when one is available..."
  • "...for instance by adding a small drafter head near the final layers of the main model that's both faster to run and higher quality because it shares parameters with the target model (cite ...)."

Contributor Author

Updated!

inference.md Outdated

Every accepted token becomes more expensive in terms of FLOPs on average (since some will be rejected, and we have to call a draft model), but we wring more FLOPs out of the hardware, and the small model is cheap, so we win overall. Since everything has been checked by the big model, we don't change the sampling distribution at all (though the exact trajectory will differ for non-greedy).

Traditionally, speculative decoding relies on the existence of a smaller model with a similar sampling distribution to the target model, e.g. LLaMA-2 2B for LLaMA-2 70B, which often doesn't exist. Even when this is available, the smaller drafter can still be too expensive if the acceptance rate is low. Instead, it can be helpful to embed a drafter within the main model, for instance by adding a small drafter head at the penultimate layer of the main model that's both faster to run and higher quality because it shares parameters with the target model <d-cite key="eagle"></d-cite> or has multiple parallel heads <d-cite key="medusa"></d-cite> or is a multi-token prediction (MTP) head <d-cite key="DeepSeek3"></d-cite>.
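As a rough sketch of the shared-parameter idea, a Medusa-style drafter can be as simple as a few extra heads over the trunk's final hidden state. The residual form and all names here are assumptions for illustration; EAGLE differs in that it runs a small autoregressive network over hidden states:

```python
import numpy as np

def parallel_draft_heads(hidden, unembed, head_weights):
    """Medusa-style drafting: k small heads share the main model's trunk.

    hidden:       final-layer hidden state at the current position, shape (d,).
    unembed:      the main model's unembedding matrix, shape (d, vocab).
    head_weights: list of k (d, d) matrices; head i guesses the token
                  i + 1 positions ahead from the same hidden state.
    Because the heads reuse the trunk's activations and unembedding,
    drafting adds only a few matmuls, and the shared features keep the
    draft distribution close to the target's.
    """
    drafts = []
    for w in head_weights:
        h = hidden + np.maximum(hidden @ w, 0.0)    # tiny residual head
        drafts.append(int(np.argmax(h @ unembed)))  # greedy draft token
    return drafts
```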
Collaborator

I still don't think we need all these details. Just "Instead, it can be helpful to embed a drafter within the main model, for instance by adding a dedicated drafter head to one of the later layers of the base model ([1], [2], [3]). Because this head shares most of its parameters with the main model, it's faster to run and matches the sampling distribution more closely."

I don't think we need to really get into the details of how it's implemented beyond that.

Collaborator

@jacobaustin123 left a comment

LGTM! Thank you.

@jacobaustin123 merged commit ea04302 into jax-ml:main Aug 20, 2025 (1 check failed)