feat: support speculative decoding for llamacpp #402

Merged: 3 commits into InftyAI:main on May 9, 2025

Conversation

@cr7258 (Contributor) commented on May 6, 2025:

What this PR does / why we need it

Support speculative decoding for llama.cpp, which can significantly improve response latency by letting a smaller draft model propose tokens that the main model then verifies. From the logs below, we can see that both the main model and the draft model are loaded successfully.
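Under the hood, llama.cpp's server accepts the draft model through a dedicated flag (-md / --model-draft) alongside the main model, so the runtime essentially launches a single llama-server process with both GGUF files. A minimal Go sketch of that kind of invocation follows; it is illustrative only, not the actual llmaz backend code, and the model paths simply mirror the ones in the logs below:

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// Illustrative only: start llama-server with a main model plus a smaller
	// draft model so the server can run speculative decoding.
	cmd := exec.Command("llama-server",
		"--host", "0.0.0.0",
		"--port", "8080",
		"-m", "/workspace/models/llama-2-7b.Q8_0.gguf", // main model
		"-md", "/workspace/models/llama-2-7b.Q2_K.gguf", // draft model
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```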

kubectl logs llamacpp-speculator-0 

Defaulted container "model-runner" out of: model-runner, model-loader (init), model-loader-1 (init)
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
build: 5280 (27aa2595) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 4, n_threads_batch = 4, total_threads = 8

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 7
main: loading model

# load main model
srv    load_model: loading model '/workspace/models/llama-2-7b.Q8_0.gguf'
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /workspace/models/llama-2-7b.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
print_info: file format = GGUF V2
print_info: file type   = Q8_0
print_info: file size   = 6.67 GiB (8.50 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 4096
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 32
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 4096
print_info: n_embd_v_gqa     = 4096
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 11008
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 6.74 B
print_info: general.name     = LLaMA v2
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  6828.64 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context:        CPU  output buffer size =     0.12 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 256
llama_kv_cache_unified:        CPU KV buffer size =  2048.00 MiB
llama_kv_cache_unified: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context:        CPU compute buffer size =    71.01 MiB
llama_context: graph nodes  = 967
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

# load draft model
srv    load_model: loading draft model '/workspace/models/llama-2-7b.Q2_K.gguf'
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /workspace/models/llama-2-7b.Q2_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 10
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q2_K:   65 tensors
llama_model_loader: - type q3_K:  160 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V2
print_info: file type   = Q2_K - Medium
print_info: file size   = 2.63 GiB (3.35 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 4096
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 32
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 4096
print_info: n_embd_v_gqa     = 4096
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 11008
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 6.74 B
print_info: general.name     = LLaMA v2
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  2694.32 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context:        CPU  output buffer size =     0.12 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 256
llama_kv_cache_unified:        CPU KV buffer size =  2048.00 MiB
llama_kv_cache_unified: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context:        CPU compute buffer size =    71.01 MiB
llama_context: graph nodes  = 967
llama_context: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context:        CPU  output buffer size =     0.12 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 256
llama_kv_cache_unified:        CPU KV buffer size =  2048.00 MiB
llama_kv_cache_unified: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context:        CPU compute buffer size =    71.01 MiB
llama_context: graph nodes  = 967
llama_context: graph splits = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: {%- for message in messages -%}
  {{- '<|im_start|>' + message.role + '
' + message.content + '<|im_end|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
  {{- '<|im_start|>assistant
' -}}
{%- endif -%}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  log_server_r: request: GET /health 240.243.170.78 200

Send an inference request:

curl --request POST \
    --url http://localhost:8080/v1/completions \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
{"choices":[{"text":"\n1. Register a domain name (if you don’t have one yet)\n2. Choose a web host\n3. Choose a WordPress theme\n4. Create your website (using WordPress)\n5. Create and upload your website\n6. Optimize your website for search engines\n8. Test your website\n9. Monitor your website\n10. Update your website\nBuilding a website is not difficult if you have the right tools and resources.\nIn this article, we’ll cover everything you need to know to get started, including choosing a domain name, finding a web","index":0,"logprobs":null,"finish_reason":"length"}],"created":1746539608,"model":"gpt-3.5-turbo","system_fingerprint":"b5280-27aa2595","object":"text_completion","usage":{"completion_tokens":128,"prompt_tokens":14,"total_tokens":142},"id":"chatcmpl-mNfNdrU0Kivf873PYZyw0R5Ahhi1KNnX","timings":{"prompt_n":1,"prompt_ms":194.28,"prompt_per_token_ms":194.28,"prompt_per_second":5.147210212065061,"predicted_n":128,"predicted_ms":35572.597,"predicted_per_token_ms":277.9109140625,"predicted_per_second":3.598275380344033,"draft_n":155,"draft_n_accepted":15}}

Which issue(s) this PR fixes

Fixes #240

Special notes for your reviewer

Does this PR introduce a user-facing change?

support speculative decoding for llamacpp

@InftyAI-Agent added the needs-triage, needs-priority, and do-not-merge/needs-kind labels on May 6, 2025.
@InftyAI-Agent requested a review from kerthcet on May 6, 2025.
@@ -42,7 +42,6 @@ type BackendRuntimeConfig struct {
// ConfigName represents the recommended configuration name for the backend,
// It will be inferred from the models in the runtime if not specified, e.g. default,
// speculative-decoding.
// +kubebuilder:default=default
@cr7258 (Contributor, Author):

Why remove the default value of the ConfigName field?

llmaz infers the recommended configuration name for the backend when it is not specified. However, the kubebuilder:default=default annotation prevents this inference: it always sets ConfigName to "default" instead of leaving it nil, and so unintentionally bypasses the role-based detection logic.
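A rough sketch of the detection being described; the helper below is hypothetical and only illustrates the intended behaviour, not the actual llmaz implementation:

```go
package backendruntime

// inferConfigName sketches the role-based detection that the
// +kubebuilder:default=default marker was short-circuiting: the inference
// can only run while ConfigName is still nil.
func inferConfigName(configName *string, modelRoles []string) string {
	if configName != nil {
		return *configName // an explicit value always wins
	}
	for _, role := range modelRoles {
		if role == "draft" {
			// A draft model next to the main model implies speculative decoding.
			return "speculative-decoding"
		}
	}
	return "default"
}
```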

@kerthcet (Member):

Setting the default value here doesn't make any difference, right? The inference is just a guardrail, I believe.

@cr7258 (Contributor, Author):

If we set the default value here, then when configName is not defined in the Playground, configName will always be set to default instead of speculative-decoding, even though main and draft models are defined.

@kerthcet (Member):

Recalled why we removed this before. Makes sense to me.

@cr7258 (Contributor, Author) commented on May 6, 2025:

/kind feature

@InftyAI-Agent added the feature label and removed the do-not-merge/needs-kind label on May 6, 2025.
@kerthcet (Member) commented on May 7, 2025:

I will take a look tonight or tomorrow at the latest.

@kerthcet (Member) left a review:

Only one comment.

@kerthcet (Member) commented on May 8, 2025:

/lgtm
/approve

Thanks @cr7258

@InftyAI-Agent added the lgtm and approved labels on May 8, 2025.
@kerthcet (Member) commented on May 8, 2025:

/triage accepted

@InftyAI-Agent added the triage/accepted label and removed the needs-triage label on May 8, 2025.
@kerthcet merged commit f1fa51f into InftyAI:main on May 9, 2025.
41 of 49 checks passed
Labels: approved, feature, lgtm, needs-priority, triage/accepted
Development

Successfully merging this pull request may close these issues.

Support speculative decoding with llama.cpp
3 participants