
Conversation

@ingvagabund (Contributor) commented Jul 8, 2020

The KEP touches both SIG Scheduling (the descheduler part) and SIG Apps (the controller part).

In short: use a variation of the descheduler (to be implemented) that watches pods in a cluster and ranks each pod with a cost. The cost reflects, based on a set of constraints, how much deleting the pod would make the cluster diverge from the scheduling plan. Each controller that respects the cost uses it to select a pod to delete when scaling down, minimizing the negative impact on the scheduling plan.
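To make the consuming side concrete, here is a minimal sketch of how a controller could pick a scale-down victim from the ranked pods. The annotation key, the default for unranked pods, and the function names are illustrative assumptions; the KEP deliberately leaves the exact mechanism (annotation, status field, or spec field) open.

```go
package scaledown

import (
	"math"
	"sort"
	"strconv"

	v1 "k8s.io/api/core/v1"
)

// evictionCostKey is a hypothetical annotation key used only for this sketch.
const evictionCostKey = "alpha.k8s.io/eviction-cost"

// podCost reads the externally assigned cost; pods that have not been
// ranked yet are treated as maximally expensive to delete (an assumption,
// not something the KEP mandates).
func podCost(p *v1.Pod) int {
	if v, ok := p.Annotations[evictionCostKey]; ok {
		if c, err := strconv.Atoi(v); err == nil {
			return c
		}
	}
	return math.MaxInt32
}

// selectVictim returns the replica whose deletion diverges least from the
// scheduling plan, i.e. the replica with the lowest cost.
func selectVictim(replicas []*v1.Pod) *v1.Pod {
	if len(replicas) == 0 {
		return nil
	}
	sorted := append([]*v1.Pod(nil), replicas...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return podCost(sorted[i]) < podCost(sorted[j])
	})
	return sorted[0]
}
```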

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 8, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ingvagabund
To complete the pull request process, please assign huang-wei
You can assign the PR to them by writing /assign @huang-wei in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from ahg-g and Huang-Wei July 8, 2020 10:42
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jul 8, 2020
@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch 5 times, most recently from 616b12a to 72397be Compare July 8, 2020 11:28
@ingvagabund (Contributor, Author)

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Jul 8, 2020
@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch from 72397be to 65a80f6 Compare July 8, 2020 11:37
@damemi (Member) left a comment


I think this is a great start @ingvagabund, smarter downscaling is something that has been requested for a long time but isn't quite covered by the descheduler because, like you mention, it's not edge-triggered.

While we consider options for implementation, maybe triggering the descheduler on a scale-down event could be possible too. It could simulate a descheduling run and mark the affected nodes, then the controllers could wait/react. I don't know if that's feasible, but it's just an idea.

I think scoring the pods and annotating them (something like alpha.k8s.io/eviction-cost) makes sense, but it will be interesting because in the case of scalable controllers you want to calculate that cost relative to the other replicas (rather than considering the cluster as a whole, like the scheduler/descheduler do now).

I think this also makes a good case for a "descheduler framework" that mirrors the basic setup of the scheduler's framework. The desire to run custom deschedulers like this, as well as the growing list of descheduler strategies, shows that there is a lot of customization users may find useful.
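A small sketch of the "relative to their other replicas" point above: before comparing costs, a consumer would bucket pods by their controlling owner so that a cost is only ever compared within one workload. This is an illustration under that assumption, not part of the proposal's text.

```go
package scaledown

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// groupByController buckets pods by their controlling owner (e.g. a
// ReplicaSet) so that eviction costs are only compared among replicas of
// the same workload rather than across the whole cluster. Pods without a
// controller form single-member groups keyed by their own UID.
func groupByController(pods []*v1.Pod) map[types.UID][]*v1.Pod {
	groups := map[types.UID][]*v1.Pod{}
	for _, p := range pods {
		key := p.UID
		if owner := metav1.GetControllerOf(p); owner != nil {
			key = owner.UID
		}
		groups[key] = append(groups[key], p)
	}
	return groups
}
```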

@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch 4 times, most recently from f5a9c34 to 9b036a7 Compare July 9, 2020 16:45
@Huang-Wei (Member) left a comment


Thanks @ingvagabund! Some initial comments below.

I think I understand most of the proposal, but it'd be helpful if you add an end-to-end example to explain how the whole flow works.

- choose youngest/oldest pods first
- choose pods minimizing topology skew among failure domains (e.g. availability zones)

### User Stories [optional]
@ingvagabund (Contributor, Author) commented Jul 17, 2020


kubernetes/kubernetes#89922

Thanks!!!

kubernetes/kubernetes#88809

This one is a use case for the descheduler directly.

Member:

How are we going to serve the different stories? Different people want different strategies?

Is it going to be a configuration for the controller or something that Pods can define?

If that's something we don't want to discuss in this KEP, it should be added to Non-Goals.

The constraints can follow the same rules as the scheduler (through importing scheduling plugins)
or be custom-made (e.g. with respect to application or proprietary requirements).
The component will implement a mechanism for ranking pods,
either by annotating a pod, updating its status, or setting a new field in the pod's spec.
Member:

It is pretty important for this KEP to discuss which solution to go with. I'd suggest putting up a paragraph comparing the pros/cons specifically.

In addition to updating the pod's spec, another option is to use a CRD to represent each Pod's "cost". In this way, the controller manager just holds a CRDLister cache and fetches the "cost" when needed.
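A rough sketch of what such a CRD could look like, purely to illustrate the suggestion; the type and field names below are made up for this illustration and are not part of the KEP:

```go
package scaledown

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodGroupCost is a hypothetical CRD that holds the costs of all pods in
// one group (e.g. one ReplicaSet). A controller would then only need a
// lister/cache for this resource and could fetch the costs when needed,
// instead of reading or patching every pod individually.
type PodGroupCost struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PodGroupCostSpec   `json:"spec,omitempty"`
	Status PodGroupCostStatus `json:"status,omitempty"`
}

// PodGroupCostSpec identifies the pod group the costs apply to.
type PodGroupCostSpec struct {
	// TargetRef points at the owning workload, e.g. a ReplicaSet.
	TargetRef metav1.OwnerReference `json:"targetRef"`
}

// PodGroupCostStatus is written by the ranking component.
type PodGroupCostStatus struct {
	// Costs maps pod name to its current cost; lower means cheaper to delete.
	Costs map[string]int64 `json:"costs,omitempty"`
	// RankingGeneration is bumped on every re-computation pass so a
	// consumer can tell that all costs in the group come from the same pass.
	RankingGeneration int64 `json:"rankingGeneration,omitempty"`
}
```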

Comment on lines +113 to +128
Examples of scheduling constraints:
- choose pods running on a node which have a `PreferNoSchedule` taint first
- choose youngest/oldest pods first
- choose pods minimizing topology skew among failure domains (e.g. availability zones)
Member:

Does that mean the component will provide different sets of "cost" according to different strategies?

@ingvagabund (Author):

Yes, each strategy will provide a different set of costs for a pod group. If configured and implemented properly, a suitable weighted sum of ranks can give a final rank, as is done in the scheduler when scoring multiple nodes today.
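A sketch of the normalization plus weighted-sum idea mentioned above, analogous in spirit to how the scheduler combines plugin scores; the function shapes and the [0, 1] normalization are assumptions made only for illustration:

```go
package scaledown

import "math"

// normalize rescales one strategy's raw per-pod ranks into [0, 1] so
// that ranks coming from different strategies become comparable.
func normalize(ranks map[string]float64) map[string]float64 {
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, r := range ranks {
		lo, hi = math.Min(lo, r), math.Max(hi, r)
	}
	out := make(map[string]float64, len(ranks))
	for pod, r := range ranks {
		if hi == lo {
			out[pod] = 0
			continue
		}
		out[pod] = (r - lo) / (hi - lo)
	}
	return out
}

// finalCost combines normalized ranks from several strategies into a
// single cost per pod using per-strategy weights (assumed configuration).
func finalCost(perStrategy map[string]map[string]float64, weights map[string]float64) map[string]float64 {
	out := map[string]float64{}
	for strategy, ranks := range perStrategy {
		for pod, r := range normalize(ranks) {
			out[pod] += weights[strategy] * r
		}
	}
	return out
}
```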

## Proposal

Proposed solution is to implement an optional cost-based component that will be watching all
pods (or its subset) present in a cluster and assigning each one a cost based on a set of scheduling constraints.
Member:

Nodes as well as other APIs are also needed.

@ingvagabund (Author):

There are going to be more objects to watch. Pods and nodes are primary. Depending on other strategies, services might be required as well. I did not want to go into too much detail. I altered the sentence a bit.

scheduling plugins can be vendored as well to provide some of the core scheduling logic.

Once a pod is removed, ranks of all relevant pods are re-computed.
The pod ranking is best-effort so in case a controller is to delete more than one pod
Member:

IMO "delete more than one pod" would be a common requirement, and "select all the pods with the smallest cost and remove them" may easily flip and imbalance the situation. So is it possible to ensure the "cost" design supports "delete X pods in a batch"? It doesn't need to be perfect in the beginning, but keeping the batch-deletion requirement in mind from day 1 is necessary.

@ingvagabund (Author):

I have added a few words on this topic. There are two categories of strategies defined: one where there is no need to recompute costs, so it is safe to scale down by 2 or more replicas at the same time, and another where, each time a pod is added or removed, all the costs need to be re-computed. I am mainly focusing on the first category to provide the basic functionality, with space to discuss the second category.
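A sketch of how a consumer could handle the two categories: strategies that don't need re-computation allow taking the X cheapest pods in one batch, while the others force a re-rank after every pick. The interface and function names are hypothetical:

```go
package scaledown

import (
	"sort"

	v1 "k8s.io/api/core/v1"
)

// Strategy ranks the pods of one group; lower cost means cheaper to delete.
type Strategy interface {
	Rank(pods []*v1.Pod) map[string]int
	// RequiresRecompute reports whether costs become stale as soon as a
	// pod is added or removed (the second category described above).
	RequiresRecompute() bool
}

// selectBatch picks up to n victims. For the first category the n
// cheapest pods are taken in a single batch; for the second category the
// costs are recomputed after every pick.
func selectBatch(s Strategy, pods []*v1.Pod, n int) []*v1.Pod {
	victims := []*v1.Pod{}
	remaining := append([]*v1.Pod(nil), pods...)
	for len(victims) < n && len(remaining) > 0 {
		costs := s.Rank(remaining)
		sort.SliceStable(remaining, func(i, j int) bool {
			return costs[remaining[i].Name] < costs[remaining[j].Name]
		})
		if !s.RequiresRecompute() {
			k := n - len(victims)
			if k > len(remaining) {
				k = len(remaining)
			}
			return append(victims, remaining[:k]...)
		}
		victims = append(victims, remaining[0])
		remaining = remaining[1:]
	}
	return victims
}
```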

@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch 6 times, most recently from 68e607a to 405cb61 Compare July 20, 2020 10:28
@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch 2 times, most recently from f9c82ba to 58fe5fa Compare July 20, 2020 10:39
@ingvagabund ingvagabund changed the title WIP: cost based scaling down of pods Cost based scaling down of pods Jul 20, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 20, 2020
@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch from 58fe5fa to 966f39a Compare July 20, 2020 10:42
@ingvagabund ingvagabund force-pushed the cost-based-scaling-down branch from 966f39a to 1b5e812 Compare July 20, 2020 10:51
@sftim commented Jul 23, 2020

This idea reminds me of https://kubernetes.io/docs/reference/scheduling/profiles/#scheduling-plugins

I wonder if there is an overlap? Maybe the descheduler or something like it could use the same framework but different scoring.

@damemi (Member) commented Jul 23, 2020

@sftim yeah, being able to call out to scheduling plugins in order to run the same algorithm is one part of this (and also a goal for the descheduler, see kubernetes-sigs/descheduler#238). We're working to clean up internal dependencies in the framework in order to make that easier.

@tnozicka commented Aug 24, 2020

There is a related proposal for delete-priority #1828
(If we were to implement it I wonder if we could make it generic to support multiple writers.)


- Allow employing strategies that require cost re-computation after scaling up/down (with support from controllers, e.g. backing off)

## Proposal
Member:

At the high level, I think this is a useful feature for various types of workloads (batch or serving) and placement strategies (spreading, binpacking etc.).


In the provided examples the first two strategies do not require rank re-computation.

### Rank normalization and weighted sum
Member:

Are you suggesting that multiple controllers could influence the cost of evicting a pod?

Member:

I think we can make the assumption that the cost of a pod is updated by a single controller; the implementation details of that controller are not relevant to this KEP.

Reply:

> make the assumption that the cost of a pod is updated by a single controller

Checking something. In this proposal the cost of Pod eviction is part of its actual state (and not the .spec for the Pod). Is that right?

I'd expect eviction cost to be calculated by one thing using lots of plugins (much like how scoring works for scheduling), and that one thing would form part of the control plane.

Member:

> Checking something. In this proposal the cost of Pod eviction is part of its actual state (and not the .spec for the Pod). Is that right?

Yes, it would be part of pod.Status.

> I'd expect eviction cost to be calculated by one thing using lots of plugins (much like how scoring works for scheduling), and that one thing would form part of the control plane.

I feel that the implementation detail of the controller updating the cost is not important here. I don't think this KEP needs to take a stand or promote implementing a controller in k/k. Such a controller warrants its own KEP if we feel we need one in-tree.

A deployment with 3 replicas with anti-affinity rule to spread across 2 AZs scaled down to 2 replicas in only 1 AZ.

### Implementation Details/Notes/Constraints
Member:

I don't think we should center the discussion here around the descheduler. What we need to propose is an abstract concept of eviction cost, with the cost being comparable among the pods that belong to the same group/collection (have the same owner ref, for example).

Member:

If that's the intent of the KEP, it should be specified in Goals/Non-Goals

- use a CRD to hold costs of all pods in a pod group (to synchronize re-computation of ranks)
- add support for strategies which require rank re-computation

### Option A (field in a pod status)
Member:

+1 to this option. I don't think we need more than that; this is a simple and abstract concept that will be easy to reason about.

...
// More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-cost
// +optional
Cost int `json:"cost,omitempty" protobuf:"bytes,...,opt,name=cost"`
Member:

I imagine by default the controller (e.g., the ReplicaSet controller) that creates the pods would set this to the highest conceptual value for new pods because we assume the scheduler will make the best placement possible. Another controller would be responsible for updating the cost to something lower at runtime based on cluster state.
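A tiny sketch of the defaulting described above, using a hypothetical annotation to stand in for the proposed status field: the workload controller marks fresh pods as maximally expensive to delete, and a separate ranking component is expected to lower the value later as cluster state drifts. All names and values are assumptions for illustration.

```go
package scaledown

import (
	"strconv"

	v1 "k8s.io/api/core/v1"
)

// maxCost is an assumed ceiling; a freshly scheduled pod is presumed to
// be well placed, so deleting it would diverge the most from the plan.
const maxCost = 1000

// setDefaultCost would be called by the workload controller (e.g. the
// ReplicaSet controller) when it creates a pod. The annotation key is
// hypothetical; Option A of the KEP would use a pod.Status field instead.
func setDefaultCost(pod *v1.Pod) {
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	if _, ok := pod.Annotations["alpha.k8s.io/eviction-cost"]; !ok {
		pod.Annotations["alpha.k8s.io/eviction-cost"] = strconv.Itoa(maxCost)
	}
}
```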

Reply:

Is it feasible to have the scheduler record its actual opinion of placement, or to leave room for adding that?

For example: a three-zone cluster, even pod spreading, scaling from 2 to 3. The scheduler places the Pod in the zone with no replica and scores this as “I could not have done better”.

Now schedule 3 more Pods. The first two go in, leaving this:

- node a (zone 1): 1 DaemonSet Pod, 2 app Pods
- node b (zone 2): 1 DaemonSet Pod, 1 app Pod, 1 large cluster-critical Pod
- node c (zone 3): 1 DaemonSet Pod, 2 app Pods

The scheduler considers the 3rd Pod and sees it doesn't fit on Node b. The scheduler picks node c. The scheduler also records that the scheduling involved a weighted trade-off that is scored at (say) 0.72, because the Pods for the app are not evenly spread.
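To pin the trade-off in this example down with some arithmetic, a toy skew computation over the app's pods per zone is shown below. The 0.72 above is illustrative, and this is not the scheduler's actual scoring formula, only a sketch of the kind of signal that could be recorded.

```go
package scaledown

// zoneSkew returns the difference between the most and the least
// populated failure domains for one app's pods.
func zoneSkew(appPodsPerZone map[string]int) int {
	first := true
	min, max := 0, 0
	for _, n := range appPodsPerZone {
		if first {
			min, max, first = n, n, false
			continue
		}
		if n < min {
			min = n
		}
		if n > max {
			max = n
		}
	}
	return max - min
}

// In the scenario above, after the third new pod lands in zone 3 the app
// has {zone 1: 2, zone 2: 1, zone 3: 3} replicas, i.e. a skew of 2. A
// skew-aware ranking would then make one of the zone 3 pods the cheapest
// to delete on the next scale-down.
```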

Member:

I find the word "cost" very confusing for the intent of this proposal. I think "penalty" would be more appropriate; that makes it clearer that a 0 penalty means less chance of being evicted.

Very simple, reading the field directly from a pod status.
No additional apimachinery logic.

### Option B (CRD for a pod group)
Member:

I don't see this CRD adding value compared to option 1.

@ahg-g (Member) commented Sep 25, 2020

/cc @alculquicondor

@@ -0,0 +1,382 @@
---
Member:

It looks like you are using the old template. Please update

## Summary

Cost ranking pods through an external component and scaling down pods based
on the cost allows to employ various scheduling strategies to keep a cluster
Member:

descheduling/eviction strategies?


## Summary

Cost ranking pods through an external component and scaling down pods based
Member:

I would try to be more explicit on what the current state is vs what is being proposed.

### Goals

- Controllers with scale down operation are allowed to select a victim while still respecting a scheduling plan
- External component is available that can rank pods based on how much they diverge from a scheduling plan when deleted
Member:

These should be goals for the KEP. Is the plan to just provide the base binary and the framework, or is it also to provide strategies A, B, and C?

1. **A controller**: Scaling down one more time selects one of {P4, P12}
1. Topology skew is still `1`

### Risks and Mitigations
Member:

Can you separate the risks with a numbered list?

@alculquicondor (Member) commented Nov 16, 2020

We put this on the agenda for next SIG Apps meeting, along with #1828

@alculquicondor (Member)

@ingvagabund do you think there is anything here that is not covered in #1828?

@ahg-g (Member) commented Dec 14, 2020

Let's converge on one KEP: let's close this one and perhaps focus our energy on #1828.

@ingvagabund (Contributor, Author) commented Jan 4, 2021

@alculquicondor This proposal extends the one in #1828, going more into detail about how the annotation/status field is going to be consumed. I am fine with closing this proposal and building on top of #1828, as long as either an annotation or a pod field is used to set the pod's priority/cost.

@alculquicondor (Member)

@ingvagabund the other PR already has some reviewers from SIG Apps.

Would you mind leaving your suggestions as a review?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2021
@ahg-g (Member) commented Apr 6, 2021

/close

ref #2255

@k8s-ci-robot (Contributor)

@ahg-g: Closed this PR.

In response to this:

/close

ref #2255

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
