
Conversation

@ingvagabund (Contributor) commented Jul 8, 2020

The KEP touches both SIG Scheduling (the descheduler part) and SIG Apps (the controller part).

In short: use a variation of the descheduler (to be implemented) that watches pods in a cluster and ranks each pod with a cost. The cost reflects, based on a set of constraints, how much deleting the pod would make the cluster diverge from the scheduling plan. Each controller that respects the cost uses it to select a pod to delete when scaling down, minimizing the negative impact on the scheduling plan.
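To make the consuming side concrete, here is a minimal sketch of how a controller could pick a scale-down victim from the ranked pods. The annotation key, the default for unranked pods, and the function names are illustrative assumptions; the KEP deliberately leaves the exact mechanism (annotation, status field, or spec field) open.

```go
package scaledown

import (
	"math"
	"sort"
	"strconv"

	v1 "k8s.io/api/core/v1"
)

// evictionCostKey is a hypothetical annotation key used only for this sketch.
const evictionCostKey = "alpha.k8s.io/eviction-cost"

// podCost reads the externally assigned cost; pods that have not been
// ranked yet are treated as maximally expensive to delete (an assumption,
// not something the KEP mandates).
func podCost(p *v1.Pod) int {
	if v, ok := p.Annotations[evictionCostKey]; ok {
		if c, err := strconv.Atoi(v); err == nil {
			return c
		}
	}
	return math.MaxInt32
}

// selectVictim returns the replica whose deletion diverges least from the
// scheduling plan, i.e. the replica with the lowest cost.
func selectVictim(replicas []*v1.Pod) *v1.Pod {
	if len(replicas) == 0 {
		return nil
	}
	sorted := append([]*v1.Pod(nil), replicas...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return podCost(sorted[i]) < podCost(sorted[j])
	})
	return sorted[0]
}
```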

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 8, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ingvagabund
To complete the pull request process, please assign huang-wei
You can assign the PR to them by writing /assign @huang-wei in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from ahg-g and Huang-Wei July 8, 2020 10:42
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jul 8, 2020
@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch 5 times, most recently from 616b12a to 72397be Compare July 8, 2020 11:28
@ingvagabund (Contributor, Author)

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Jul 8, 2020
@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch from 72397be to 65a80f6 Compare July 8, 2020 11:37
@damemi (Member) left a comment


I think this is a great start @ingvagabund, smarter downscaling is something that has been requested for a long time but isn't quite covered by the descheduler because, like you mention, it's not edge-triggered.

While we consider options for implementation, maybe triggering the descheduler on a scale-down event could be possible too. It could simulate a descheduling run and mark the affected nodes, then the controllers could wait/react. I don't know if that's feasible, but it's just an idea.

I think scoring the pods and annotating them (something like alpha.k8s.io/eviction-cost) makes sense, but it will be interesting because in the case of scalable controllers you want to calculate that cost relative to the other replicas (rather than considering the cluster as a whole, like the scheduler/descheduler do now).

I think this also makes a good case for a "descheduler framework" that mirrors the basic setup of the scheduler's framework. The desire to run custom deschedulers like this, as well as the growing list of descheduler strategies, shows that there is a lot of customization users may find useful.
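A small sketch of the "relative to their other replicas" point above: before comparing costs, a consumer would bucket pods by their controlling owner so that a cost is only ever compared within one workload. This is an illustration under that assumption, not part of the proposal's text.

```go
package scaledown

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// groupByController buckets pods by their controlling owner (e.g. a
// ReplicaSet) so that eviction costs are only compared among replicas of
// the same workload rather than across the whole cluster. Pods without a
// controller form single-member groups keyed by their own UID.
func groupByController(pods []*v1.Pod) map[types.UID][]*v1.Pod {
	groups := map[types.UID][]*v1.Pod{}
	for _, p := range pods {
		key := p.UID
		if owner := metav1.GetControllerOf(p); owner != nil {
			key = owner.UID
		}
		groups[key] = append(groups[key], p)
	}
	return groups
}
```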

@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch 4 times, most recently from f5a9c34 to 9b036a7 Compare July 9, 2020 16:45
@Huang-Wei (Member) left a comment


Thanks @ingvagabund! Some initial comments below.

I think I understand most of the proposal, but it'd be helpful if you add an end-to-end example to explain how the whole flow works.

- choose youngest/oldest pods first
- choose pods minimizing topology skew among failure domains (e.g. availability zones)

### User Stories [optional]
@ingvagabund (Contributor, Author) commented Jul 17, 2020


kubernetes/kubernetes#89922

Thanks!!!

kubernetes/kubernetes#88809

This one is a use case for the descheduler directly.

Member:

How are we going to serve the different stories? Different people want different strategies?

Is it going to be a configuration for the controller or something that Pods can define?

If that's something we don't want to discuss in this KEP, it should be added to Non-Goals.

The constraints can follow the same rules as the scheduler (through importing scheduling plugins)
or be custom-made (e.g. with respect to application or proprietary requirements).
The component will implement a mechanism for ranking pods,
either by annotating a pod, updating its status, or setting a new field in the pod's spec.
Member:

It is pretty important for this KEP to discuss which solution to go with. I'd suggest putting up a paragraph comparing the pros/cons specifically.

In addition to updating the pod's spec, another option is to use a CRD to represent each Pod's "cost". In this way, the controller manager just holds a CRDLister cache and fetches the "cost" when needed.
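A rough sketch of what such a CRD could look like, purely to illustrate the suggestion; the type and field names below are made up for this illustration and are not part of the KEP:

```go
package scaledown

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodGroupCost is a hypothetical CRD that holds the costs of all pods in
// one group (e.g. one ReplicaSet). A controller would then only need a
// lister/cache for this resource and could fetch the costs when needed,
// instead of reading or patching every pod individually.
type PodGroupCost struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PodGroupCostSpec   `json:"spec,omitempty"`
	Status PodGroupCostStatus `json:"status,omitempty"`
}

// PodGroupCostSpec identifies the pod group the costs apply to.
type PodGroupCostSpec struct {
	// TargetRef points at the owning workload, e.g. a ReplicaSet.
	TargetRef metav1.OwnerReference `json:"targetRef"`
}

// PodGroupCostStatus is written by the ranking component.
type PodGroupCostStatus struct {
	// Costs maps pod name to its current cost; lower means cheaper to delete.
	Costs map[string]int64 `json:"costs,omitempty"`
	// RankingGeneration is bumped on every re-computation pass so a
	// consumer can tell that all costs in the group come from the same pass.
	RankingGeneration int64 `json:"rankingGeneration,omitempty"`
}
```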

Comment on lines +113 to +128
Examples of scheduling constraints:
- choose pods running on a node which have a `PreferNoSchedule` taint first
- choose youngest/oldest pods first
- choose pods minimizing topology skew among failure domains (e.g. availability zones)
Member:

Does that mean the component will provide different sets of "cost" according to different strategies?

@ingvagabund (Author):

Yes, each strategy will provide a different set of costs for a pod group. If configured and implemented properly, a suitable weighted sum of ranks can give a final rank, as is done in the scheduler when scoring multiple nodes today.
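A sketch of the normalization plus weighted-sum idea mentioned above, analogous in spirit to how the scheduler combines plugin scores; the function shapes and the [0, 1] normalization are assumptions made only for illustration:

```go
package scaledown

import "math"

// normalize rescales one strategy's raw per-pod ranks into [0, 1] so
// that ranks coming from different strategies become comparable.
func normalize(ranks map[string]float64) map[string]float64 {
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, r := range ranks {
		lo, hi = math.Min(lo, r), math.Max(hi, r)
	}
	out := make(map[string]float64, len(ranks))
	for pod, r := range ranks {
		if hi == lo {
			out[pod] = 0
			continue
		}
		out[pod] = (r - lo) / (hi - lo)
	}
	return out
}

// finalCost combines normalized ranks from several strategies into a
// single cost per pod using per-strategy weights (assumed configuration).
func finalCost(perStrategy map[string]map[string]float64, weights map[string]float64) map[string]float64 {
	out := map[string]float64{}
	for strategy, ranks := range perStrategy {
		for pod, r := range normalize(ranks) {
			out[pod] += weights[strategy] * r
		}
	}
	return out
}
```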

## Proposal

Proposed solution is to implement an optional cost-based component that will be watching all
pods (or its subset) present in a cluster and assigning each one a cost based on a set of scheduling constraints.
Member:

Nodes as well as other APIs are also needed.

@ingvagabund (Author):

There are going to be more objects to watch. Pods and nodes are primary. Depending on other strategies, services might be required as well. I did not want to go into too much detail. I altered the sentence a bit.

scheduling plugins can be vendored as well to provide some of the core scheduling logic.

Once a pod is removed, ranks of all relevant pods are re-computed.
The pod ranking is best-effort so in case a controller is to delete more than one pod
Member:

IMO "delete more than one pod" would be a common requirement, and "select all the pods with the smallest cost and remove them" may easily flip and imbalance the situation. So is it possible to ensure the "cost" design supports "delete X pods in a batch"? It doesn't need to be perfect in the beginning, but keeping the batch-deletion requirement in mind from day 1 is necessary.

@ingvagabund (Author):

I have added a few words on this topic. There are two categories of strategies defined: one where there is no need to recompute costs, so it is safe to scale down by 2 or more replicas at the same time, and another where, each time a pod is added or removed, all the costs need to be re-computed. I am mainly focusing on the first category to provide the basic functionality, with space to discuss the second category.
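A sketch of how a consumer could handle the two categories: strategies that don't need re-computation allow taking the X cheapest pods in one batch, while the others force a re-rank after every pick. The interface and function names are hypothetical:

```go
package scaledown

import (
	"sort"

	v1 "k8s.io/api/core/v1"
)

// Strategy ranks the pods of one group; lower cost means cheaper to delete.
type Strategy interface {
	Rank(pods []*v1.Pod) map[string]int
	// RequiresRecompute reports whether costs become stale as soon as a
	// pod is added or removed (the second category described above).
	RequiresRecompute() bool
}

// selectBatch picks up to n victims. For the first category the n
// cheapest pods are taken in a single batch; for the second category the
// costs are recomputed after every pick.
func selectBatch(s Strategy, pods []*v1.Pod, n int) []*v1.Pod {
	victims := []*v1.Pod{}
	remaining := append([]*v1.Pod(nil), pods...)
	for len(victims) < n && len(remaining) > 0 {
		costs := s.Rank(remaining)
		sort.SliceStable(remaining, func(i, j int) bool {
			return costs[remaining[i].Name] < costs[remaining[j].Name]
		})
		if !s.RequiresRecompute() {
			k := n - len(victims)
			if k > len(remaining) {
				k = len(remaining)
			}
			return append(victims, remaining[:k]...)
		}
		victims = append(victims, remaining[0])
		remaining = remaining[1:]
	}
	return victims
}
```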

@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch 6 times, most recently from 68e607a to 405cb61 Compare July 20, 2020 10:28
@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch 2 times, most recently from f9c82ba to 58fe5fa Compare July 20, 2020 10:39
@ingvagabund ingvagabund changed the title WIP: cost based scaling down of pods Cost based scaling down of pods Jul 20, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 20, 2020
@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch from 58fe5fa to 966f39a Compare July 20, 2020 10:42
@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch from 966f39a to 1b5e812 Compare July 20, 2020 10:51
@sftim commented Jul 23, 2020

This idea reminds me of https://kubernetes.io/docs/reference/scheduling/profiles/#scheduling-plugins

I wonder if there is an overlap? Maybe the descheduler or something like it could use the same framework but different scoring.

@damemi (Member) commented Jul 23, 2020

@sftim yeah, being able to call out to scheduling plugins in order to run the same algorithm is one part of this (and also a goal for the descheduler, see kubernetes-sigs/descheduler#238). We're working to clean up internal dependencies in the framework in order to make that easier.

@tnozicka commented Aug 24, 2020

There is a related proposal for delete-priority #1828
(If we were to implement it I wonder if we could make it generic to support multiple writers.)


- Allow employing strategies that require cost re-computation after scaling up/down (with support from controllers, e.g. backing off)

## Proposal
Member:

At the high level, I think this is a useful feature for various types of workloads (batch or serving) and placement strategies (spreading, binpacking etc.).


In the provided examples the first two strategies do not require rank re-computation.

### Rank normalization and weighted sum
Member:

Are you suggesting that multiple controllers could influence the cost of evicting a pod?

Member:

I think we can make the assumption that the cost of a pod is updated by a single controller; the implementation details of that controller are not relevant to this KEP.

Reply:

> make the assumption that the cost of a pod is updated by a single controller

Checking something. In this proposal the cost of Pod eviction is part of its actual state (and not the .spec for the Pod). Is that right?

I'd expect eviction cost to be calculated by one thing using lots of plugins (much like how scoring works for scheduling), and that one thing would form part of the control plane.

Member:

> Checking something. In this proposal the cost of Pod eviction is part of its actual state (and not the .spec for the Pod). Is that right?

Yes, it would be part of pod.Status.

> I'd expect eviction cost to be calculated by one thing using lots of plugins (much like how scoring works for scheduling), and that one thing would form part of the control plane.

I feel that the implementation detail of the controller updating the cost is not important here. I don't think this KEP needs to take a stand or promote implementing a controller in k/k. Such a controller warrants its own KEP if we feel we need one in-tree.

A deployment with 3 replicas with anti-affinity rule to spread across 2 AZs scaled down to 2 replicas in only 1 AZ.

### Implementation Details/Notes/Constraints
Member:

I don't think we should center the discussion here around the descheduler. What we need to propose is an abstract concept of eviction cost, with the cost being comparable among the pods that belong to the same group/collection (have the same owner ref, for example).

Member:

If that's the intent of the KEP, it should be specified in Goals/Non-Goals

- use a CRD to hold costs of all pods in a pod group (to synchronize re-computation of ranks)
- add support for strategies which require rank re-computation

### Option A (field in a pod status)
Member:

+1 to this option. I don't think we need more than that; this is a simple and abstract concept that will be easy to reason about.

...
// More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-cost
// +optional
Cost int `json:"cost,omitempty" protobuf:"bytes,...,opt,name=cost"`
Member:

I imagine by default the controller (e.g., the ReplicaSet controller) that creates the pods would set this to the highest conceptual value for new pods because we assume the scheduler will make the best placement possible. Another controller would be responsible for updating the cost to something lower at runtime based on cluster state.
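A tiny sketch of the defaulting described above, using a hypothetical annotation to stand in for the proposed status field: the workload controller marks fresh pods as maximally expensive to delete, and a separate ranking component is expected to lower the value later as cluster state drifts. All names and values are assumptions for illustration.

```go
package scaledown

import (
	"strconv"

	v1 "k8s.io/api/core/v1"
)

// maxCost is an assumed ceiling; a freshly scheduled pod is presumed to
// be well placed, so deleting it would diverge the most from the plan.
const maxCost = 1000

// setDefaultCost would be called by the workload controller (e.g. the
// ReplicaSet controller) when it creates a pod. The annotation key is
// hypothetical; Option A of the KEP would use a pod.Status field instead.
func setDefaultCost(pod *v1.Pod) {
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	if _, ok := pod.Annotations["alpha.k8s.io/eviction-cost"]; !ok {
		pod.Annotations["alpha.k8s.io/eviction-cost"] = strconv.Itoa(maxCost)
	}
}
```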

Reply:

Is it feasible to have the scheduler record its actual opinion of placement, or to leave room for adding that?

For example: a three-zone cluster, even pod spreading, scaling from 2 to 3. The scheduler places the Pod in the zone with no replica and scores this as “I could not have done better”.

Now schedule 3 more Pods. The first two go in, leaving this:

- node a (zone 1): 1 DaemonSet Pod, 2 app Pods
- node b (zone 2): 1 DaemonSet Pod, 1 app Pod, 1 large cluster-critical Pod
- node c (zone 3): 1 DaemonSet Pod, 2 app Pods

The scheduler considers the 3rd Pod and sees it doesn't fit on Node b. The scheduler picks node c. The scheduler also records that the scheduling involved a weighted trade-off that is scored at (say) 0.72, because the Pods for the app are not evenly spread.
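To pin the trade-off in this example down with some arithmetic, a toy skew computation over the app's pods per zone is shown below. The 0.72 above is illustrative, and this is not the scheduler's actual scoring formula, only a sketch of the kind of signal that could be recorded.

```go
package scaledown

// zoneSkew returns the difference between the most and the least
// populated failure domains for one app's pods.
func zoneSkew(appPodsPerZone map[string]int) int {
	first := true
	min, max := 0, 0
	for _, n := range appPodsPerZone {
		if first {
			min, max, first = n, n, false
			continue
		}
		if n < min {
			min = n
		}
		if n > max {
			max = n
		}
	}
	return max - min
}

// In the scenario above, after the third new pod lands in zone 3 the app
// has {zone 1: 2, zone 2: 1, zone 3: 3} replicas, i.e. a skew of 2. A
// skew-aware ranking would then make one of the zone 3 pods the cheapest
// to delete on the next scale-down.
```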

Member:

I find the word "cost" very confusing for the intent of this proposal. I think "penalty" would be more appropriate; that makes it clearer that a 0 penalty means less chance of being evicted.

Very simple, reading the field directly from a pod status.
No additional apimachinery logic.

### Option B (CRD for a pod group)
Member:

I don't see this CRD adding value compared to option 1.

@ahg-g (Member) commented Sep 25, 2020

/cc @alculquicondor

@@ -0,0 +1,382 @@
---
Member:

It looks like you are using the old template. Please update

## Summary

Cost ranking pods through an external component and scaling down pods based
on the cost allows to employ various scheduling strategies to keep a cluster
Member:

descheduling/eviction strategies?


## Summary

Cost ranking pods through an external component and scaling down pods based
Member:

I would try to be more explicit on what the current state is vs what is being proposed.

### Goals

- Controllers with scale down operation are allowed to select a victim while still respecting a scheduling plan
- External component is available that can rank pods based on how much they diverge from a scheduling plan when deleted
Member:

These should be goals for the KEP. Is the plan to just provide the base binary and the framework, or is it also to provide strategies A, B, and C?

1. **A controller**: Scaling down one more time selects one of {P4, P12}
1. Topology skew is still `1`

### Risks and Mitigations
Member:

Can you separate the risks with a numbered list?

@alculquicondor (Member) commented Nov 16, 2020

We put this on the agenda for next SIG Apps meeting, along with #1828

@alculquicondor (Member)

@ingvagabund do you think there is anything here that is not covered in #1828?

@ahg-g (Member) commented Dec 14, 2020

Let's converge on one KEP: let's close this one and perhaps focus our energy on #1828.

@ingvagabund (Contributor, Author) commented Jan 4, 2021

@alculquicondor This proposal extends the one in #1828, going more into detail about how the annotation/status field is going to be consumed. I am fine with closing this proposal and building on top of #1828, as long as either an annotation or a pod field is used to set the pod's priority/cost.

@alculquicondor (Member)

@ingvagabund the other PR already has some reviewers from SIG Apps.

Would you mind leaving your suggestions as a review?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2021
@ahg-g (Member) commented Apr 6, 2021

/close

ref #2255

@k8s-ci-robot (Contributor)

@ahg-g: Closed this PR.

In response to this:

/close

ref #2255

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
