
llama : add example for speculative sampling #2030

Closed

Description

@ggerganov

Speculative sampling is explained here: https://arxiv.org/abs/2302.01318

In simpler terms here:

For a start, the "draft" model can be generated using the train-text-from-scratch example with the same vocab as LLaMA. Later, we can try to utilize better models.

We also assume that batching multiple tokens with the "main" model is significantly faster than processing the tokens one by one. This may not yet be the case, but it will be when we close ggml-org/ggml#293.
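
As a rough reference for what such an example would need to do, here is a minimal sketch of the accept/reject loop described in the paper. The `draft_probs` / `target_probs` functions are dummy stand-ins over a tiny vocabulary (not the llama.cpp API); only the drafting, batched verification, and rejection-resampling control flow is meant to be illustrative:

```cpp
// Minimal sketch of the speculative sampling accept/reject loop from the paper.
// NOTE: draft_probs() and target_probs() are dummy stand-ins over a tiny vocab,
// not the llama.cpp API - only the control flow is the point here.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

using token = int;
static const int VOCAB = 8;
static std::mt19937 rng{42};

// hypothetical draft model: distribution over the next token given the context
static std::vector<float> draft_probs(const std::vector<token> & ctx) {
    std::vector<float> p(VOCAB, 1.0f/VOCAB);
    p[ctx.size() % VOCAB] += 0.5f;          // slightly non-uniform, just for variety
    float s = 0; for (float x : p) s += x;
    for (float & x : p) x /= s;
    return p;
}

// hypothetical target model: scores the context and every drafted prefix in one "batch",
// returning K+1 distributions (the last one yields a bonus token if all drafts pass)
static std::vector<std::vector<float>> target_probs(std::vector<token> ctx, const std::vector<token> & draft) {
    std::vector<std::vector<float>> out;
    for (size_t i = 0; i <= draft.size(); ++i) {
        out.push_back(draft_probs(ctx));    // dummy: reuse the draft distribution
        if (i < draft.size()) ctx.push_back(draft[i]);
    }
    return out;
}

static token sample(const std::vector<float> & p) {
    std::discrete_distribution<int> d(p.begin(), p.end());
    return d(rng);
}

// one round: draft K tokens with the cheap model, then verify them with the target model
static void speculative_step(std::vector<token> & ctx, int K) {
    std::vector<token>              drafted;
    std::vector<std::vector<float>> q;
    std::vector<token> cur = ctx;
    for (int i = 0; i < K; ++i) {
        q.push_back(draft_probs(cur));
        drafted.push_back(sample(q.back()));
        cur.push_back(drafted.back());
    }

    const auto p = target_probs(ctx, drafted);   // single batched target evaluation

    std::uniform_real_distribution<float> unif(0.0f, 1.0f);
    for (int i = 0; i < K; ++i) {
        const token t = drafted[i];
        if (unif(rng) < std::min(1.0f, p[i][t] / q[i][t])) {
            ctx.push_back(t);                    // accept the drafted token
        } else {
            // reject: resample from max(0, p - q), normalized, and stop this round
            std::vector<float> r(VOCAB);
            float s = 0;
            for (int v = 0; v < VOCAB; ++v) { r[v] = std::max(0.0f, p[i][v] - q[i][v]); s += r[v]; }
            ctx.push_back(s > 0 ? sample(r) : sample(p[i]));
            return;
        }
    }
    ctx.push_back(sample(p[K]));                 // all accepted: one bonus token for free
}

int main() {
    std::vector<token> ctx = {0};
    for (int i = 0; i < 4; ++i) speculative_step(ctx, /*K =*/ 4);
    for (token t : ctx) printf("%d ", t);
    printf("\n");
}
```

The key property is that the target model is evaluated once per round over all K drafted positions, which is exactly where the batching assumption above comes in.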

Activity

SlyEcho (Collaborator) commented on Jun 29, 2023

Would it make sense to do something like a beam search with the fast model and then evaluate the result with the larger model?

ggerganov (Member, Author) commented on Jul 1, 2023

Yes, this might be even more efficient, as it could increase the "success" rate of the drafted sequence.

evanmiller (Contributor) commented on Jul 5, 2023

Note that speculative sampling increases overall compute. The algorithm in the linked paper executes the "main" model in parallel for the speculative sequence:

[figure: algorithm listing from the linked paper]

If local compute resources are saturated, then speculative sampling won't decrease prediction latency; the algorithm requires pre-existing parallelism of some kind (either farming out the parallel evaluation or perhaps a multi-node pipeline architecture). Based on my understanding of llama.cpp's architecture, it doesn't seem like a great fit, but maybe there's a modification that could make it work?

DKormann commented on Jul 28, 2023

It increases overall computation, but it also increases parallelisation of inference on the main model, so it can still be faster.
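
A back-of-the-envelope sketch of why the extra compute can still pay off when single-token inference is memory-bandwidth bound: if one batched target pass over K drafted tokens costs barely more than a single-token pass, the expected latency per generated token drops whenever the acceptance rate is reasonable. All numbers below are assumptions for illustration, not measurements:

```cpp
// Back-of-the-envelope latency model for speculative sampling.
// All timings and the acceptance rate are assumed, illustrative numbers.
#include <cstdio>

int main() {
    const double t_target_1 = 100.0; // ms for a single-token target pass (assumed)
    const double t_target_K = 110.0; // ms for one batched pass over K tokens (assumed, ~memory-bound)
    const double t_draft_1  =  10.0; // ms per draft-model token (assumed)
    const int    K          =   4;   // drafted tokens per round
    const double accept     =   0.8; // average per-token acceptance rate (assumed)

    // expected tokens produced per round: accepted prefix + 1 resampled/bonus token
    double expected_tokens = 1.0, p = 1.0;
    for (int i = 0; i < K; ++i) { p *= accept; expected_tokens += p; }

    const double baseline   = t_target_1;                 // ms/token for plain decoding
    const double round_cost = K*t_draft_1 + t_target_K;   // ms per speculative round
    const double spec       = round_cost / expected_tokens;

    printf("baseline: %.1f ms/token, speculative: %.1f ms/token (%.2fx)\n",
           baseline, spec, baseline/spec);
}
```

With these made-up numbers the round burns more total compute than plain decoding but still roughly halves latency; if the hardware is already compute-saturated, t_target_K grows toward K * t_target_1 and the advantage disappears, which is the concern raised above.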

ggerganov (Member, Author) commented on Aug 10, 2023

Staged speculative decoding

https://arxiv.org/abs/2308.04623

charliexchen commented on Aug 27, 2023

Hey hey -- I'm one of the authors of https://arxiv.org/abs/2302.01318. It's good to see the open source community pick up on this! I'm sadly not in a position to contribute directly, but since this is already on your radar:

  1. You have way fewer FLOPs on a CPU, but at the same time DDR4/DDR5 RAM is also much slower, so it balances out to an extent. Compute resources will get saturated more quickly compared to most accelerators, but there's enough headroom on higher-end CPUs for this to still work. To figure out exactly when this happens, you can just use llama.cpp's batching functionality and time how much you can push things before they start slowing down (see the timing sketch after this list).
  2. The smallest llamas still give some decent speedups, but you want to maximise the size difference between the models (without making the drafter too terrible) to get the most out of this. You can see that in the Comparison SSp / RSp chart in https://github.com/dust-tt/llama-ssp (this is running on a GPU, but assuming model timings scale proportionally, it's still instructive).
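
A sketch of the timing experiment from point 1, with a hypothetical eval_batch() stand-in (simulated here with a sleep) in place of a real llama.cpp batched forward pass: time one evaluation per batch size and look for the point where the per-pass cost stops being roughly flat.

```cpp
// Sketch of the suggested timing experiment: measure the cost of one batched forward
// pass as the batch size grows. eval_batch() is a hypothetical placeholder for
// "evaluate n tokens in one batch" (in practice: llama.cpp's batched evaluation),
// simulated here with a sleep so the program is self-contained.
#include <chrono>
#include <cstdio>
#include <thread>

static void eval_batch(int n_tokens) {
    // stand-in cost model: flat while memory-bound, then roughly linear once compute-bound
    const int cost_ms = 50 + (n_tokens > 8 ? (n_tokens - 8) * 5 : 0);
    std::this_thread::sleep_for(std::chrono::milliseconds(cost_ms));
}

int main() {
    for (int n = 1; n <= 32; n *= 2) {
        const auto t0 = std::chrono::steady_clock::now();
        eval_batch(n);
        const auto t1 = std::chrono::steady_clock::now();
        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("batch %2d : %6.1f ms total, %5.1f ms/token\n", n, ms, ms / n);
    }
}
```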

ggerganov (Member, Author) commented on Aug 27, 2023

@charliexchen Thanks for stopping by. We are getting close to having everything needed to implement this. Hopefully we will have a prototype soon.

  1. Yes, this matches my understanding:

| Model type | ms/token | Speed improvement |
| --- | --- | --- |
| SSp 30B/7B | 180 ms | 1.8x |

If we can replicate the speed improvement factor on Apple Silicon + Metal, it would be a game changer.

In your experience, if you are generating highly structured text (e.g. source code in some programming language), does it allow you to increase the size difference with the drafter significantly without losing the speed effect? I imagine this would be the case to some extent, since there would be many "easy-to-predict" tokens in such cases.

charliexchen commented on Aug 27, 2023

In our paper we got much higher speedups for the HumanEval code generation task compared to XSUM using the same model pairing, so acceptance rate is indeed rather task specific. If you have an "easier" task in some sense, then shrinking the drafter is absolutely on the table.

evanmiller (Contributor) commented on Aug 27, 2023

@charliexchen Did you consider using the same model as a draft model? I mean after layer K < N, immediately sample the output to form a draft token.

charliexchen commented on Aug 27, 2023

This seems related to CALM (which is mentioned in one of the other threads). It should work, but you need to explicitly train/finetune the model to handle that.

The nice thing about spec sampling is that you don't have to touch the target model at all.

ggerganov (Member, Author) commented on Aug 31, 2023

I'll try to do a PoC of speculative sampling today - will post a branch when I get something running

self-assigned this on Aug 31, 2023
Moved from Todo to In Progress in ggml : roadmap on Aug 31, 2023
ggerganov (Member, Author) commented on Sep 3, 2023

Closed via #2926

Moved from In Progress to Done in ggml : roadmap on Sep 3, 2023