Conversation

stefanZorcic
Contributor

Added a section on speculative decoding limitations and techniques to overcome them (e.g. EAGLE, Medusa).

google-cla bot commented Aug 20, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

inference.md Outdated

Every accepted token becomes more expensive in terms of FLOPs on average (since some will be rejected, and we have to call a draft model), but we wring more FLOPs out of the hardware, and the small model is cheap, so we win overall. Since everything has been checked by the big model, we don't change the sampling distribution at all (though the exact trajectory will differ for non-greedy).

Traditionally, speculative decoding was easiest to deploy when a smaller version of an existing model already exists for a similar sampling distribution, e.g. LLaMA-2 2B for LLaMA-2 70B. Even when this is available, the smaller drafter can still be too expensive. Instead, it can be helpful to embed a drafter within the main model, for instance by adding a new head to the penultimate layer <d-cite key="eagle"></d-cite> or multiple parallel heads <d-cite key="medusa"></d-cite> or a multi-token prediction (MTP) head <d-cite key="DeepSeek3"></d-cite>.
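To make the verification step concrete, here is a minimal NumPy sketch of the standard speculative-sampling accept/reject rule the quoted paragraph describes. The names `verify_draft`, `q_probs`, and `p_probs` are illustrative, not from inference.md:

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(draft_tokens, q_probs, p_probs):
    """Accept/reject verification from standard speculative sampling.

    draft_tokens: k tokens proposed by the cheap draft model.
    q_probs[i]:   draft-model distribution at draft position i (vocab-sized).
    p_probs[i]:   target-model distribution at position i; all k+1 of these
                  come from a single batched forward pass of the big model.
    Returns the accepted tokens plus one corrective/bonus token, so the
    output is distributed exactly as if sampled from the target model alone.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p_probs[i][tok] / q_probs[i][tok]):
            out.append(tok)
        else:
            # On the first rejection, resample from the residual
            # max(0, p - q); this correction preserves the target
            # distribution overall.
            residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
            out.append(rng.choice(len(residual), p=residual / residual.sum()))
            return out
    # Every draft accepted: take a free bonus token from the target's
    # next-position distribution, which the same forward pass computed.
    out.append(rng.choice(len(p_probs[-1]), p=p_probs[-1]))
    return out
```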
Collaborator

A few small nits:

  • Present tense is probably better. "Traditionally, speculative decoding relies on the existence of a smaller model with a similar sampling distribution to the target model".
  • "..., the smaller drafter can still be too expensive if the acceptance rate is low."
  • "Instead, it has become common to embed a drafter within the main model, for instance..."
  • It's not clear how EAGLE and Medusa are different. What does multiple parallel heads mean? Maybe don't repeat the word head twice.

Contributor Author

I like the new wording, updated.

Also updated the EAGLE description to "new network": EAGLE is nominally a head, but it is better characterized as a small network in its implementation.

inference.md Outdated

Every accepted token becomes more expensive in terms of FLOPs on average (since some will be rejected, and we have to call a draft model), but we wring more FLOPs out of the hardware, and the small model is cheap, so we win overall. Since everything has been checked by the big model, we don't change the sampling distribution at all (though the exact trajectory will differ for non-greedy).

Traditionally, speculative decoding relies on the existence of a smaller model with a similar sampling distribution to the target model, e.g. LLaMA-2 2B for LLaMA-2 70B. Even when this is available, the smaller drafter can still be too expensive if the acceptance rate is low. Instead, it can be helpful to embed a drafter within the main model, for instance by adding a new network to the penultimate layer <d-cite key="eagle"></d-cite> or multiple parallel heads <d-cite key="medusa"></d-cite> or a multi-token prediction (MTP) head <d-cite key="DeepSeek3"></d-cite>.
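As a back-of-the-envelope check on the "we win overall" claim, a common simplifying assumption is an i.i.d. per-token acceptance rate $\alpha$ with $k$ drafted tokens per step, under which the expected number of tokens produced per target-model forward pass is

$$\mathbb{E}[\text{tokens per verify step}] = \frac{1 - \alpha^{k+1}}{1 - \alpha},$$

e.g. roughly 3.4 tokens per big-model call for $\alpha = 0.8$ and $k = 4$, versus 1 for plain autoregressive decoding.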
Collaborator

a few more nits:

  • "relies on the existence... for LLaMA-2 70B, which often doesn't exist. Even when one is available..."
  • "...for instance by adding a small drafter head near the final layers of the main model that's both faster to run and higher quality because it shares parameters with the target model (cite ...)."

Contributor Author

Updated!

inference.md Outdated

Every accepted token becomes more expensive in terms of FLOPs on average (since some will be rejected, and we have to call a draft model), but we wring more FLOPs out of the hardware, and the small model is cheap, so we win overall. Since everything has been checked by the big model, we don't change the sampling distribution at all (though the exact trajectory will differ for non-greedy).

Traditionally, speculative decoding relies on the existence of a smaller model with a similar sampling distribution to the target model, e.g. LLaMA-2 2B for LLaMA-2 70B, which often doesn't exist. Even when this is available, the smaller drafter can still be too expensive if the acceptance rate is low. Instead, it can be helpful to embed a drafter within the main model, for instance by adding a small drafter head at the penultimate layer of the main model that's both faster to run and higher quality because it shares parameters with the target model <d-cite key="eagle"></d-cite> or has multiple parallel heads <d-cite key="medusa"></d-cite> or is a multi-token prediction (MTP) head <d-cite key="DeepSeek3"></d-cite>.
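As a rough sketch of the shared-parameter idea, a Medusa-style drafter can be as simple as a few extra heads over the trunk's final hidden state. The residual form and all names here are assumptions for illustration; EAGLE differs in that it runs a small autoregressive network over hidden states:

```python
import numpy as np

def parallel_draft_heads(hidden, unembed, head_weights):
    """Medusa-style drafting: k small heads share the main model's trunk.

    hidden:       final-layer hidden state at the current position, shape (d,).
    unembed:      the main model's unembedding matrix, shape (d, vocab).
    head_weights: list of k (d, d) matrices; head i guesses the token
                  i + 1 positions ahead from the same hidden state.
    Because the heads reuse the trunk's activations and unembedding,
    drafting adds only a few matmuls, and the shared features keep the
    draft distribution close to the target's.
    """
    drafts = []
    for w in head_weights:
        h = hidden + np.maximum(hidden @ w, 0.0)    # tiny residual head
        drafts.append(int(np.argmax(h @ unembed)))  # greedy draft token
    return drafts
```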
Collaborator

I still don't think we need all these details. Just "Instead, it can be helpful to embed a drafter within the main model, for instance by adding a dedicated drafter head to one of the later layers of the base model ([1], [2], [3]). Because this head shares most of its parameters with the main model, it's faster to run and matches the sampling distribution more closely."

I don't think we need to really get into the details of how it's implemented beyond that.

Collaborator

@jacobaustin123 left a comment

LGTM! Thank you.

@jacobaustin123 merged commit ea04302 into jax-ml:main Aug 20, 2025 (1 check failed)