-
-
Notifications
You must be signed in to change notification settings - Fork 29
feat: support speculative decoding for llamacpp #402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -42,7 +42,6 @@ type BackendRuntimeConfig struct { | |||
// ConfigName represents the recommended configuration name for the backend, | |||
// It will be inferred from the models in the runtime if not specified, e.g. default, | |||
// speculative-decoding. | |||
// +kubebuilder:default=default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove the default value of the ConfigName field?
llmaz infers the recommended configuration name for the backend if not specified. However, the kubebuilder:default=default
annotation prevents this inference by always setting ConfigName to "default" instead of leaving it as nil, bypassing the role-based detection logic by mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set the default value here doesn't make any difference, right? The inference is just a guardrail I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we set the default value here, when we don't define configName
in the Playground, even though we define main and draft models, the configName will always be set as default
instead of speculative-decoding
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recalled why we remove this before. Make sense to me.
/kind feature |
I will take a look tonight or tomorrow at the latest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one comment.
@@ -42,7 +42,6 @@ type BackendRuntimeConfig struct { | |||
// ConfigName represents the recommended configuration name for the backend, | |||
// It will be inferred from the models in the runtime if not specified, e.g. default, | |||
// speculative-decoding. | |||
// +kubebuilder:default=default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set the default value here doesn't make any difference, right? The inference is just a guardrail I believe.
/lgtm Thanks @cr7258 |
/triage accepted |
What this PR does / why we need it
Support speculative decoding for llama.cpp, which significantly improves response latency by leveraging draft model predictions. From the logs, we can see the main model and draft model are loaded successfully.
Send a inference request:
Which issue(s) this PR fixes
Fixes #240
Special notes for your reviewer
Does this PR introduce a user-facing change?