Cost based scaling down of pods #1888
/sig apps
I think this is a great start @ingvagabund, smarter downscaling is something that has been requested for a long time but isn't quite covered by the descheduler because, like you mention, it's not edge-triggered.
While we consider options for implementation, maybe triggering the descheduler on a scale-down event could be possible too. It could simulate a descheduling run and mark the affected nodes, then the controllers could wait/react. I don't know if that's feasible, but it's just an idea.
I think scoring the pods and annotating them (something like `alpha.k8s.io/eviction-cost`) makes sense, but it will be interesting because in the case of scalable controllers you want to calculate that cost relative to their other replicas (rather than considering the cluster as a whole, like the scheduler/descheduler do now).
I think this also makes a good case for a "descheduler framework" that mirrors the basic setup of the scheduler's framework. The desire to run custom deschedulers like this, as well as the growing list of descheduler strategies, shows that there is a lot of customization users may find useful.
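As an illustration of the annotation idea mentioned above, here is a minimal Go sketch, assuming a hypothetical `alpha.k8s.io/eviction-cost` key and a ranking of a controller's replicas that has already been computed relative to their siblings (the key name and the helper are illustrative only, not part of the proposal):

```go
package main

import (
	"fmt"
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// evictionCostAnnotation is a hypothetical key, used here only to illustrate
// the idea discussed above; the real key (if any) would be decided in the KEP.
const evictionCostAnnotation = "alpha.k8s.io/eviction-cost"

// annotateRelativeCosts stamps each replica with its position in a ranking
// computed relative to its sibling replicas (0 = cheapest to evict).
// How that ranking is produced (scheduling constraints, plugins, ...) is the
// interesting part and is out of scope for this sketch.
func annotateRelativeCosts(rankedReplicas []*corev1.Pod) {
	for rank, pod := range rankedReplicas {
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations[evictionCostAnnotation] = strconv.Itoa(rank)
	}
}

func main() {
	replicas := []*corev1.Pod{{}, {}, {}}
	annotateRelativeCosts(replicas)
	fmt.Println(replicas[0].Annotations[evictionCostAnnotation]) // "0", evicted first
}
```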
Thanks @ingvagabund! Some initial comments below.
I think I understand most of the proposal, but it'd be helpful if you add an end-to-end example to explain how the whole flow works.
- choose youngest/oldest pods first
- choose pods minimizing topology skew among failure domains (e.g. availability zones)

### User Stories [optional]
Thanks!!!
This one is a use case for the descheduler directly.
How are we going to serve the different stories? Do different people want different strategies?
Is it going to be a configuration for the controller or something that Pods can define?
If that's something we don't want to discuss in this KEP, it should be added to Non-goals.
The constraints can follow the same rules as the scheduler (through importing scheduling plugins)
or be custom made (e.g. wrt. to application or proprietary requirements).
The component will implement a mechanism for ranking pods.
Either by annotating a pod, updating its status or setting a new field in pod's spec.
It's pretty important for this KEP to discuss which solution to go with. I'd suggest putting up a paragraph comparing the pros/cons specifically.
In addition to updating the pod's spec, another option is to use a CRD to represent each Pod's "cost". In this way, the controller manager just holds a CRD lister cache and fetches the "cost" when needed.
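For comparison with the spec/status options, a rough sketch of what the Go types for such a per-pod-group CRD could look like; `PodGroupCost` and its fields are hypothetical names invented for illustration, not an agreed API:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodGroupCost is a hypothetical CRD holding the eviction costs of all pods
// in one pod group (for example, all pods sharing the same owner reference).
type PodGroupCost struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec PodGroupCostSpec `json:"spec"`
}

// PodGroupCostSpec keeps every cost of the group in a single object, which
// makes re-computation of all ranks atomic from a consumer's point of view,
// at the price of extra API machinery compared to a field on the pod itself.
type PodGroupCostSpec struct {
	// OwnerRef identifies the pod group (e.g. a ReplicaSet).
	OwnerRef metav1.OwnerReference `json:"ownerRef"`
	// Costs maps pod names to their current eviction cost.
	Costs map[string]int64 `json:"costs"`
}
```

A consuming controller would then list/watch these objects (the lister cache mentioned above) instead of reading a field or annotation on each pod.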
Examples of scheduling constraints:
- choose pods running on a node which have a `PreferNoSchedule` taint first
- choose youngest/oldest pods first
- choose pods minimizing topology skew among failure domains (e.g. availability zones)
Does that mean the component will provide different sets of "cost" according to different strategies?
Yes, each strategy will provide a different set of costs for a pod group. If configured and implemented properly, a suitable weighted sum of ranks can give a final rank, as is done in the scheduler today when scoring multiple nodes.
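A small sketch of that weighted-sum idea, assuming each strategy produces a raw per-pod score that is normalized to [0, 100] and then combined using per-strategy weights, loosely mirroring how the scheduler normalizes and weights node scores (the strategy inputs and weights below are made up):

```go
package main

import "fmt"

// strategyScores holds raw per-pod scores produced by one strategy,
// plus the weight of that strategy in the final rank.
type strategyScores struct {
	weight int
	scores map[string]float64 // pod name -> raw score
}

// finalRanks normalizes each strategy's scores to [0, 100] and returns
// the weighted sum per pod.
func finalRanks(strategies []strategyScores) map[string]float64 {
	ranks := map[string]float64{}
	for _, s := range strategies {
		max := 0.0
		for _, v := range s.scores {
			if v > max {
				max = v
			}
		}
		for pod, v := range s.scores {
			normalized := 0.0
			if max > 0 {
				normalized = v / max * 100
			}
			ranks[pod] += float64(s.weight) * normalized
		}
	}
	return ranks
}

func main() {
	ranks := finalRanks([]strategyScores{
		{weight: 2, scores: map[string]float64{"p1": 3, "p2": 1}},   // e.g. topology-skew contribution
		{weight: 1, scores: map[string]float64{"p1": 10, "p2": 40}}, // e.g. pod-age contribution
	})
	fmt.Println(ranks) // p1 ends up with the higher final rank (≈225 vs ≈167)
}
```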
## Proposal

Proposed solution is to implement an optional cost-based component that will be watching all
pods (or its subset) present in a cluster and assigning each one a cost based on a set of scheduling constraints.
Nodes as well as other APIs are also needed.
There are going to be more objects to watch. Pods and nodes are primary. Depending on other strategies, services might be required as well. I did not want to go into too much detail. Altered the sentence a bit.
scheduling plugins can be vendored as well to provide some of the core scheduling logic.

Once a pod is removed, ranks of all relevant pods are re-computed.
The pod ranking is best-effort so in case a controller is to delete more than one pod
IMO "Delete more than one pod" would be a common req, "selects all the pods with the smallest cost and remove" may flip and imbalance situation easily. So is it possible to ensure the "cost" design to support "Delete X pods in a batch"? No need to be perfect in the beginning, but keep the batch-deletion req in mind from day 1 is necessary.
I have added a few words on this topic. There are two categories of strategies defined: one where there's no need to recompute costs, so it's safe to scale down by 2 or more replicas at the same time, and another where all the costs need to be re-computed each time a pod is added/removed. I am mainly focusing on the first category to provide basic functionality, with space to discuss the second category.
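A sketch of what batch scale-down could look like for that first category, where costs are stable and do not need re-computation between victims; the `candidate` type is an illustrative stand-in for a pod plus its pre-computed cost, not a proposed API:

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is an illustrative stand-in for a pod plus its pre-computed cost.
type candidate struct {
	name string
	cost int // lower cost = cheaper to evict
}

// pickVictims selects n pods to delete in one batch. Because first-category
// costs do not change when pods are removed, the cheapest n pods can be
// picked in a single pass without re-computing anything in between.
func pickVictims(pods []candidate, n int) []candidate {
	sorted := append([]candidate(nil), pods...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].cost < sorted[j].cost })
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	pods := []candidate{{"p1", 30}, {"p2", 10}, {"p3", 20}, {"p4", 10}}
	fmt.Println(pickVictims(pods, 2)) // [{p2 10} {p4 10}]
}
```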
This idea reminds me of https://kubernetes.io/docs/reference/scheduling/profiles/#scheduling-plugins. I wonder if there is an overlap? Maybe the descheduler or something like it could use the same framework but different scoring.
@sftim yeah, being able to call out to scheduling plugins in order to run the same algorithm is one part of this (and also a goal for the descheduler, see kubernetes-sigs/descheduler#238). We're working to clean up internal dependencies in the framework in order to make that easier.
There is a related proposal for delete-priority #1828
- Allow to employ strategies that require cost re-computation after scaling up/down (with a support from controllers, e.g. backing-off)

## Proposal
At the high level, I think this is a useful feature for various types of workloads (batch or serving) and placement strategies (spreading, binpacking etc.).
In the provided examples the first two strategies do not require rank re-computation.

### Rank normalization and weighted sum
Are you suggesting that multiple controllers could influence the cost of evicting a pod?
I think we can make the assumption that the cost of a pod is updated by a single controller; the implementation details of that controller are not relevant in this KEP.
> make the assumption that the cost of a pod is updated by a single controller

Checking something. In this proposal the cost of Pod eviction is part of its actual state (and not the `.spec` for the Pod). Is that right?
I'd expect eviction cost to be calculated by one thing using lots of plugins (much like how scoring works for scheduling), and that one thing would form part of the control plane.
> Checking something. In this proposal the cost of Pod eviction is part of its actual state (and not the .spec for the Pod). Is that right?

Yes, it would be part of `pod.Status`.

> I'd expect eviction cost to be calculated by one thing using lots of plugins (much like how scoring works for scheduling), and that one thing would form part of the control plane.
I feel that the implementation detail of the controller updating the cost is not important here. I don't think this KEP needs to take a stand or promote implementing a controller in k/k. Such a controller warrants its own KEP if we feel we need one in-tree.
A deployment with 3 replicas with anti-affinity rule to spread across 2 AZs scaled down to 2 replicas in only 1 AZ.

### Implementation Details/Notes/Constraints
I don't think we should center the discussion here around the descheduler. What we need to propose is an abstract concept of eviction cost, with the cost being comparable among the pods that belong to the same group/collection (having the same owner ref, for example).
If that's the intent of the KEP, it should be specified in Goals/Non-Goals
- use a CRD to hold costs of all pods in a pod group (to synchronize re-computation of ranks)
- add support for strategies which require rank re-computation

### Option A (field in a pod status)
+1 to this option. I don't think we need more than that; this is a simple and abstract concept that will be easy to reason about.
...
// More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-cost
// +optional
Cost int `json:"cost,omitempty" protobuf:"bytes,...,opt,name=cost"`
I imagine by default the controller (e.g., the ReplicaSet controller) that creates the pods would set this to the highest conceptual value for new pods because we assume the scheduler will make the best placement possible. Another controller would be responsible for updating the cost to something lower at runtime based on cluster state.
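A sketch of that split between the creating controller and a separate re-scoring component, with an invented maximum cost and invented function names, purely to illustrate the flow described above:

```go
package main

import "fmt"

// defaultPodCost is a hypothetical "highest" cost assigned at creation time,
// reflecting the assumption that the scheduler just made the best placement.
const defaultPodCost = 1000

// podCosts is an illustrative stand-in for wherever the cost ends up living
// (status field, annotation, or CRD).
type podCosts map[string]int

// onPodCreated would be called by the creating controller (e.g. the
// ReplicaSet controller): new pods start at the maximum cost.
func onPodCreated(costs podCosts, podName string) {
	costs[podName] = defaultPodCost
}

// reconcileCosts would be run by a separate component that re-scores pods
// against the current cluster state and only ever lowers the cost.
func reconcileCosts(costs podCosts, observed map[string]int) {
	for pod, score := range observed {
		if score < costs[pod] {
			costs[pod] = score
		}
	}
}

func main() {
	costs := podCosts{}
	onPodCreated(costs, "web-7d4b9-abcde")
	reconcileCosts(costs, map[string]int{"web-7d4b9-abcde": 250})
	fmt.Println(costs) // map[web-7d4b9-abcde:250]
}
```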
Is it feasible to have the scheduler record its actual opinion of placement, or to leave room for adding that?
For example: a three-zone cluster, even pod spreading, scaling from 2 to 3. The scheduler places the Pod in the zone with no replica and scores this as “I could not have done better”.
Now schedule 3 more Pods. The first two go, leaving this:
- node `a` (zone 1): 1 DaemonSet Pod, 2 app Pods
- node `b` (zone 2): 1 DaemonSet Pod, 1 app Pod, 1 large cluster critical Pod
- node `c` (zone 3): 1 DaemonSet Pod, 2 app Pods

The scheduler considers the 3rd Pod and sees it doesn't fit on node `b`. The scheduler picks node `c`. The scheduler also records that the scheduling involved a weighted trade-off that is scored at (say) 0.72, because the Pods for the app are not evenly spread.
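One way such a recorded placement score could be derived for the example above is from the topology skew left after the placement, normalized to [0, 1]. This is only a sketch of the idea, not something the scheduler does today, and the 0.72 figure in the example is arbitrary rather than the output of this particular formula:

```go
package main

import "fmt"

// placementScore sketches a score in [0, 1] for a placement decision, based on
// how evenly the app's Pods are spread across zones after the placement:
// 1 means perfectly even, lower values mean a larger skew was accepted.
func placementScore(podsPerZone map[string]int) float64 {
	min, max := -1, 0
	for _, n := range podsPerZone {
		if min == -1 || n < min {
			min = n
		}
		if n > max {
			max = n
		}
	}
	if max == 0 {
		return 1
	}
	return 1 - float64(max-min)/float64(max)
}

func main() {
	// Zone counts of app Pods after the 3rd extra Pod lands on node c.
	fmt.Println(placementScore(map[string]int{"zone1": 2, "zone2": 1, "zone3": 3})) // ≈ 0.33
}
```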
I find the word "cost" very confusing for the intent of this proposal. I think "Penalty" would be more appropriate. That makes it clearer that a 0 penalty means less chance of being evicted.
Very simple, reading the field directly from a pod status.
No additional apimachinery logic.

### Option B (CRD for a pod group)
I don't see this CRD adding value compared to option 1.
/cc @alculquicondor
@@ -0,0 +1,382 @@
--- |
It looks like you are using the old template. Please update
## Summary

Cost ranking pods through an external component and scaling down pods based
on the cost allows to employ various scheduling strategies to keep a cluster
descheduling/eviction strategies?
## Summary

Cost ranking pods through an external component and scaling down pods based
I would try to be more explicit on what the current state is vs what is being proposed.
### Goals

- Controllers with scale down operation are allowed to select a victim while still respecting a scheduling plan
- External component is available that can rank pods based on how much they diverge from a scheduling plan when deleted
These should be goals for the KEP. Is the plan to just provide the base binary and the framework, or is it also to provide strategies A, B and C?
1. **A controller**: Scaling down one more time selects one of {P4, P12}
1. Topology skew is still `1`

### Risks and Mitigations
Can you separate the risks with a numbered list?
We put this on the agenda for the next SIG Apps meeting, along with #1828
@ingvagabund do you think there is anything here that is not covered in #1828?
Let's converge on one KEP: let's close this one and perhaps focus our energy on #1828.
@alculquicondor This proposal extends the one in #1828, going into more detail about how the annotation/status field is going to be consumed. I am fine closing this proposal and building on top of #1828, as long as either an annotation or a pod field is used to set the pod's priority/cost.
@ingvagabund the other PR already has some reviewers from sig-apps. Would you mind leaving your suggestions as a review?
/close ref #2255
@ahg-g: Closed this PR.
The KEP touches both SIG Scheduling (descheduler part) and SIG Apps (controller part).
In short: use a variation of the descheduler (to be implemented) that will watch pods in a cluster and rank each pod with a cost. The cost will reflect the impact of diverging from the scheduling plan when the pod gets deleted, based on a set of constraints. Each controller respecting the cost will use it to select a pod to be deleted when scaling down, minimizing the negative impact on the scheduling plan.