
feat: support speculative decoding for llamacpp #402


Merged
merged 3 commits into from
May 9, 2025
1 change: 0 additions & 1 deletion api/inference/v1alpha1/config_types.go
@@ -42,7 +42,6 @@ type BackendRuntimeConfig struct {
// ConfigName represents the recommended configuration name for the backend,
// It will be inferred from the models in the runtime if not specified, e.g. default,
// speculative-decoding.
// +kubebuilder:default=default
Contributor Author

Why remove the default value of the ConfigName field?

llmaz infers the recommended configuration name for the backend if it is not specified. However, the kubebuilder:default=default annotation prevents this inference by always setting ConfigName to "default" instead of leaving it nil, unintentionally bypassing the role-based detection logic.

Member

Setting the default value here doesn't make any difference, right? The inference is just a guardrail, I believe.

Contributor Author

If we set the default value here, then even if we define both main and draft models but leave configName unset in the Playground, configName will always be set to default instead of speculative-decoding.
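
For illustration, a minimal Playground sketch of the scenario described here, assuming the usual llmaz modelClaims layout (the model names and the exact nesting of names/roles under modelClaims are illustrative, not taken from this PR): with configName omitted and both a main and a draft role declared, the role-based detection should land on speculative-decoding rather than default.

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: llamacpp-speculator
spec:
  replicas: 1
  modelClaims:
    models:
    - name: llama2-7b-q8-gguf   # main (target) model
      role: main
    - name: llama2-7b-q2-k-gguf # draft model
      role: draft
  backendRuntimeConfig:
    backendName: llamacpp
    # configName is intentionally omitted: with the kubebuilder default removed,
    # it stays nil and the role-based detection can infer "speculative-decoding".
```

This mirrors what the e2e test below builds with its ModelClaims and BackendRuntime("llamacpp") wrappers.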

Member

Now I recall why we removed this before. Makes sense to me.

ConfigName *string `json:"configName,omitempty"`
// Args defined here will "append" the args defined in the recommendedConfig,
// either explicitly configured in configName or inferred in the runtime.
14 changes: 14 additions & 0 deletions chart/templates/backends/llamacpp.yaml
@@ -49,6 +49,20 @@ spec:
limits:
cpu: 2
memory: 4Gi
- name: speculative-decoding
args:
- -m
- "{{`{{ .ModelPath }}`}}"
- -md
- "{{`{{ .DraftModelPath }}`}}"
- --host
- "0.0.0.0"
- --port
- "8080"
- --draft-max
- "16"
- --draft-min
- "5"
startupProbe:
periodSeconds: 10
failureThreshold: 30
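A note on the template syntax above: the {{`{{ .ModelPath }}`}} form is the standard Helm escape for emitting a literal {{ .ModelPath }} into the rendered BackendRuntime, so the placeholder survives Helm rendering and is substituted by llmaz later with the resolved model paths. Purely as an illustration (the GGUF paths below are hypothetical and not defined by this PR), the args handed to the llama.cpp server end up looking roughly like:

```yaml
# Illustrative rendering only; the actual mount paths depend on how llmaz loads the models.
args:
- -m
- /workspace/models/llama-2-7b.Q8_0.gguf   # filled in from {{ .ModelPath }} (main model)
- -md
- /workspace/models/llama-2-7b.Q2_K.gguf   # filled in from {{ .DraftModelPath }} (draft model)
- --host
- "0.0.0.0"
- --port
- "8080"
- --draft-max
- "16"   # upper bound on draft tokens proposed per step
- --draft-min
- "5"    # lower bound on draft tokens proposed per step
```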
1 change: 0 additions & 1 deletion config/crd/bases/inference.llmaz.io_playgrounds.yaml
@@ -59,7 +59,6 @@ spec:
the hood, e.g. vLLM.
type: string
configName:
default: default
description: |-
ConfigName represents the recommended configuration name for the backend,
It will be inferred from the models in the runtime if not specified, e.g. default,
5 changes: 5 additions & 0 deletions docs/examples/README.md
@@ -12,6 +12,7 @@ We provide a set of examples to help you serve large language models, by default
- [Deploy models via TensorRT-LLM](#deploy-models-via-tensorrt-llm)
- [Deploy models via text-generation-inference](#deploy-models-via-text-generation-inference)
- [Deploy models via ollama](#deploy-models-via-ollama)
- [Speculative Decoding with llama.cpp](#speculative-decoding-with-llamacpp)
- [Speculative Decoding with vLLM](#speculative-decoding-with-vllm)
- [Multi-Host Inference](#multi-host-inference)
- [Deploy Host Models](#deploy-host-models)
@@ -59,6 +60,10 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference

[ollama](https://github.com/ollama/ollama) based on llama.cpp, aims for local deploy. see [example](./ollama/) here.

### Speculative Decoding with llama.cpp

llama.cpp supports speculative decoding to significantly improve inference performance, see [example](./speculative-decoding/llamacpp/) here.

### Speculative Decoding with vLLM

[Speculative Decoding](https://arxiv.org/abs/2211.17192) can improve inference performance efficiently, see [example](./speculative-decoding/vllm/) here.
9 changes: 1 addition & 8 deletions docs/examples/speculative-decoding/llamacpp/playground.yaml
@@ -1,5 +1,5 @@
# This is just an toy example, because it doesn't make any sense
# in real world, drafting tokens for the model with similar size.
# in real world, drafting tokens for the model with smaller size.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
@@ -38,10 +38,3 @@ spec:
backendName: llamacpp
args:
- -fa # use flash attention
resources:
requests:
cpu: 4
memory: "8Gi"
limits:
cpu: 4
memory: "8Gi"
29 changes: 14 additions & 15 deletions test/config/backends/llamacpp.yaml
@@ -29,21 +29,20 @@ spec:
limits:
cpu: 2
memory: 4Gi
# TODO: not supported yet, see https://github.com/InftyAI/llmaz/issues/240.
# - name: speculative-decoding
# args:
# - -m
# - "{{ .ModelPath }}"
# - -md
# - "{{ .DraftModelPath }}"
# - --host
# - "0.0.0.0"
# - --port
# - "8080"
# - --draft-max
# - "16"
# - --draft-min
# - "5"
- name: speculative-decoding
args:
- -m
- "{{ .ModelPath }}"
- -md
- "{{ .DraftModelPath }}"
- --host
- "0.0.0.0"
- --port
- "8080"
- --draft-max
- "16"
- --draft-min
- "5"
startupProbe:
periodSeconds: 10
failureThreshold: 30
53 changes: 25 additions & 28 deletions test/e2e/playground_test.go
@@ -142,32 +142,29 @@ var _ = ginkgo.Describe("playground e2e tests", func() {
hpa := &autoscalingv2.HorizontalPodAutoscaler{}
gomega.Expect(k8sClient.Get(ctx, types.NamespacedName{Name: playground.Name, Namespace: playground.Namespace}, hpa)).To(gomega.Succeed())
})
// TODO: add e2e tests.
// ginkgo.It("SpeculativeDecoding with llama.cpp", func() {
// targetModel := wrapper.MakeModel("llama2-7b-q8-gguf").FamilyName("llama2").ModelSourceWithModelHub("Huggingface").ModelSourceWithModelID("TheBloke/Llama-2-7B-GGUF", "llama-2-7b.Q8_0.gguf", "", nil, nil).Obj()
// gomega.Expect(k8sClient.Create(ctx, targetModel)).To(gomega.Succeed())
// defer func() {
// gomega.Expect(k8sClient.Delete(ctx, targetModel)).To(gomega.Succeed())
// }()
// draftModel := wrapper.MakeModel("llama2-7b-q2-k-gguf").FamilyName("llama2").ModelSourceWithModelHub("Huggingface").ModelSourceWithModelID("TheBloke/Llama-2-7B-GGUF", "llama-2-7b.Q2_K.gguf", "", nil, nil).Obj()
// gomega.Expect(k8sClient.Create(ctx, draftModel)).To(gomega.Succeed())
// defer func() {
// gomega.Expect(k8sClient.Delete(ctx, draftModel)).To(gomega.Succeed())
// }()

// playground := wrapper.MakePlayground("llamacpp-speculator", ns.Name).
// MultiModelsClaim([]string{"llama2-7b-q8-gguf", "llama2-7b-q2-k-gguf"}, coreapi.SpeculativeDecoding).
// BackendRuntime("llamacpp").BackendLimit("cpu", "4").BackendRequest("memory", "8Gi").
// Replicas(1).
// Obj()
// gomega.Expect(k8sClient.Create(ctx, playground)).To(gomega.Succeed())
// validation.ValidatePlayground(ctx, k8sClient, playground)
// validation.ValidatePlaygroundStatusEqualTo(ctx, k8sClient, playground, inferenceapi.PlaygroundAvailable, "PlaygroundReady", metav1.ConditionTrue)

// service := &inferenceapi.Service{}
// gomega.Expect(k8sClient.Get(ctx, types.NamespacedName{Name: playground.Name, Namespace: playground.Namespace}, service)).To(gomega.Succeed())
// validation.ValidateService(ctx, k8sClient, service)
// validation.ValidateServiceStatusEqualTo(ctx, k8sClient, service, inferenceapi.ServiceAvailable, "ServiceReady", metav1.ConditionTrue)
// validation.ValidateServicePods(ctx, k8sClient, service)
// })
ginkgo.It("SpeculativeDecoding with llama.cpp", func() {
targetModel := wrapper.MakeModel("llama2-7b-q8-gguf").FamilyName("llama2").ModelSourceWithModelHub("Huggingface").ModelSourceWithModelID("TheBloke/Llama-2-7B-GGUF", "llama-2-7b.Q8_0.gguf", "", nil, nil).Obj()
gomega.Expect(k8sClient.Create(ctx, targetModel)).To(gomega.Succeed())
defer func() {
gomega.Expect(k8sClient.Delete(ctx, targetModel)).To(gomega.Succeed())
}()
draftModel := wrapper.MakeModel("llama2-7b-q2-k-gguf").FamilyName("llama2").ModelSourceWithModelHub("Huggingface").ModelSourceWithModelID("TheBloke/Llama-2-7B-GGUF", "llama-2-7b.Q2_K.gguf", "", nil, nil).Obj()
gomega.Expect(k8sClient.Create(ctx, draftModel)).To(gomega.Succeed())
defer func() {
gomega.Expect(k8sClient.Delete(ctx, draftModel)).To(gomega.Succeed())
}()

playground := wrapper.MakePlayground("llamacpp-speculator", ns.Name).
ModelClaims([]string{"llama2-7b-q8-gguf", "llama2-7b-q2-k-gguf"}, []string{"main", "draft"}).
BackendRuntime("llamacpp").Replicas(1).Obj()
gomega.Expect(k8sClient.Create(ctx, playground)).To(gomega.Succeed())
validation.ValidatePlayground(ctx, k8sClient, playground)
validation.ValidatePlaygroundStatusEqualTo(ctx, k8sClient, playground, inferenceapi.PlaygroundAvailable, "PlaygroundReady", metav1.ConditionTrue)

service := &inferenceapi.Service{}
gomega.Expect(k8sClient.Get(ctx, types.NamespacedName{Name: playground.Name, Namespace: playground.Namespace}, service)).To(gomega.Succeed())
validation.ValidateService(ctx, k8sClient, service)
validation.ValidateServiceStatusEqualTo(ctx, k8sClient, service, inferenceapi.ServiceAvailable, "ServiceReady", metav1.ConditionTrue)
validation.ValidateServicePods(ctx, k8sClient, service)
})
})