---
title: Cost based scaling down of pods
authors:
  - "@ingvagabund"
owning-sig: sig-scheduling
participating-sigs:
  - sig-apps
reviewers:
  - TBD
approvers:
  - TBD
editor: TBD
creation-date: 2020-06-30
last-updated: yyyy-mm-dd
status: provisional
---

# Cost based scaling down of pods

## Table of Contents

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Examples of strategies](#examples-of-strategies)
    - [Balancing duplicates among topological domains](#balancing-duplicates-among-topological-domains)
    - [Pods not tolerating taints first](#pods-not-tolerating-taints-first)
    - [Minimizing pod anti-affinity](#minimizing-pod-anti-affinity)
  - [Rank normalization and weighted sum](#rank-normalization-and-weighted-sum)
  - [User Stories [optional]](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
    - [Phases](#phases)
  - [Option A (field in a pod status)](#option-a-field-in-a-pod-status)
  - [Option B (CRD for a pod group)](#option-b-crd-for-a-pod-group)
  - [Workflow example](#workflow-example)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Implementation History](#implementation-history)
- [Alternatives [optional]](#alternatives-optional)
<!-- /toc -->

## Summary

Cost-ranking pods through an external component and scaling pods down based
on their cost makes it possible to employ various scheduling strategies that keep a cluster
from diverging from an optimal distribution of resources.
Providing an external solution for selecting the right victim improves the ability
to preserve various conditions such as balancing pods among failure domains, staying
aligned with security requirements, or respecting application policies.
Keeping controllers free of any scheduling strategy, yet aware
of the impact of removing pods on the overall cluster scheduling plan, helps reduce
the cost of re-scheduling resources.

## Motivation

Scaling down a set of pods does not always result in an optimal selection of victims.
The scheduler relies on filters and scores which may distribute the pods with respect to topology
spreading and/or load balancing constraints (e.g. pods uniformly balanced among zones).
Application specific workloads may prefer to scale down short-running pods and favor long-running pods.
Selecting a victim with trivial logic can unbalance the topology spreading
or cause jobs to lose work they have already accumulated.
Given that it is a natural property of a cluster to shift workloads over time,
a decision made by a scheduler at some point is only as good as its ability to predict future demands.
The default Kubernetes scheduler was constructed with the goal of providing high throughput
at the cost of being simple. Thus, it is quite easy to diverge from the scheduling plan.
In contrast, the descheduler helps re-balance the cluster and get it closer to
the scheduler's constraints. Yet, it is designed to run and adjust the cluster periodically (e.g. every hour),
which makes it unusable for scale down purposes (which require immediate action).

On the other hand, each controller with a scale down operation has its own
implementation of the victim selection logic.
That decision making logic does not take the scheduling plan into account.
Extending each such controller with additional logic to support various scheduling
constraints is impractical, and impossible in cases where a proprietary solution
for scaling down is required. Also, controllers do not necessarily have a whole-cluster
overview, so their decisions are not necessarily optimal.
Therefore, it is more feasible to locate the logic outside of a controller.

In order to support a more informed scale down operation while keeping the scheduling plan in mind,
additional decision logic that can be extended based on application requirements is needed.

### Goals

- Controllers with a scale down operation can select a victim while still respecting the scheduling plan
- An external component is available that can rank pods based on how much the cluster would diverge from the scheduling plan if they were deleted

### Non-Goals

- Employing strategies that require cost re-computation after scaling up/down (with support from controllers, e.g. backing off)

## Proposal

The proposed solution is to implement an optional cost-based component that watches all
pods (or a subset of them) and nodes (and potentially other objects) present in a cluster,
assigning each pod a cost based on a set of scheduling constraints.
At the same time, controller logic is extended to utilize the pod cost when selecting a victim during a scale down operation.

The component will allow selecting a different list of scheduling constraints for each targeted
set of pods. Each pod in a set will be given a cost based on how important it is within the set.
The constraints can follow the same rules as the scheduler (through importing scheduling plugins)
or be custom made (e.g. with respect to application or proprietary requirements).
The component will implement a mechanism for ranking pods.
<!-- Either by annotating a pod, updating its status, setting a new field in pod's spec
or creating a new CRD which will carry a cost. -->
Each controller will have a choice to either ignore the cost or take it into account
when scaling down.

This way, the logic for selecting a victim for the scale down operation is
separated from each controller, allowing each consumer to provide its own
logic for assigning costs while all controllers consume the cost uniformly.

Given that the default scheduler is not a source of truth about how a pod should be distributed
after it has been scheduled, scale down strategies can exercise completely different approaches.

Examples of scheduling constraints:
- choose pods running on a node which has a `PreferNoSchedule` taint first
- choose youngest/oldest pods first
- choose pods minimizing topology skew among failure domains (e.g. availability zones)

The goal of the proposal is not to provide specific strategies for a more informed scale down operation.
The primary goal is to provide a mechanism and have controllers implement it,
allowing consumers of the new component to define their own strategies.

### Examples of strategies

Strategies can be divided into two categories:
- scaling down/up a pod group does not require rank re-computation
- scaling down/up a pod group requires rank re-computation

#### Balancing duplicates among topological domains

- Evict pods while minimizing skew between topological domains
- Each pod can be given a cost based on how old/young it is within its domain (see the sketch after this list):
  - if a pod is the first one in the domain, rank the pod with cost `1`
  - if a pod was created second in the domain, rank the pod with cost `2`
  - continue this way until all pods in all domains are ranked
  - the higher the rank of a pod, the sooner it gets removed
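
A minimal sketch of this ranking strategy, using simplified placeholder types (pod name, topology domain, creation timestamp) rather than the real pod and node API:

```go
package ranking

import "sort"

// podInfo is a simplified placeholder for the information the ranking
// component would extract from a pod and its node.
type podInfo struct {
	Name    string
	Domain  string // e.g. the value of the node's topology.kubernetes.io/zone label
	Created int64  // creation timestamp (Unix seconds)
}

// rankByDomainOrdinal gives each pod a cost equal to its creation ordinal
// within its topology domain: the oldest pod in a domain gets cost 1, the
// second oldest cost 2, and so on. The higher the cost, the sooner the pod
// is selected as a scale down victim.
func rankByDomainOrdinal(pods []podInfo) map[string]int {
	byDomain := map[string][]podInfo{}
	for _, p := range pods {
		byDomain[p.Domain] = append(byDomain[p.Domain], p)
	}
	costs := map[string]int{}
	for _, group := range byDomain {
		sort.Slice(group, func(i, j int) bool { return group[i].Created < group[j].Created })
		for i, p := range group {
			costs[p.Name] = i + 1
		}
	}
	return costs
}
```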

#### Pods not tolerating taints first

- Evict pods that do not tolerate taints before pods that tolerate taints.
- Each pod can be given a cost based on how many taints it does not tolerate (see the sketch after this list)
  - the higher the rank of a pod, the sooner it gets removed
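
A minimal sketch of this strategy, assuming the `ToleratesTaint` helper on `Toleration` in `k8s.io/api/core/v1` is used to match tolerations against taints:

```go
package ranking

import corev1 "k8s.io/api/core/v1"

// costByUntoleratedTaints ranks a pod by the number of taints on its node
// that the pod does not tolerate. The more untolerated taints, the higher
// the cost and the sooner the pod is scaled down.
func costByUntoleratedTaints(pod *corev1.Pod, node *corev1.Node) int {
	cost := 0
	for i := range node.Spec.Taints {
		taint := &node.Spec.Taints[i]
		tolerated := false
		for j := range pod.Spec.Tolerations {
			if pod.Spec.Tolerations[j].ToleratesTaint(taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			cost++
		}
	}
	return cost
}
```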

#### Minimizing pod anti-affinity

- Evict pods maximizing anti-affinity first
- A pod that improves anti-affinity on a node gets a higher rank
- Given that multiple pod groups can be part of an anti-affinity group, scaling down
  a single pod in a group requires re-computation of the ranks of all pods
  taking part. Also, only a single pod can be scaled down at a time.
  Otherwise, the ranks may no longer provide optimal victim selection.

In the provided examples the first two strategies do not require rank re-computation.

### Rank normalization and weighted sum

In order to allow pod ranking by multiple strategies/constraints, it is important
to normalize ranks. On the other hand, rank normalization requires all strategies
to re-compute all ranks every time a pod is created/deleted. To eliminate the need
to re-compute, each strategy can introduce a threshold where every pod rank
exceeding the threshold gets rounded down to the threshold.
E.g. if a topology domain has at least 10 pods, the 11th and subsequent pods get the same
rank as the 10th pod.
With threshold based normalization, multiple strategies can rank a pod group,
and the results can be combined into a weighted rank across all relevant strategies (see the sketch below).
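
A minimal sketch of threshold based normalization combined with a weighted sum; the strategy names and weights are purely illustrative:

```go
package ranking

// normalizeWithThreshold caps a raw rank at the threshold, so ranks stay
// stable once a strategy has ranked more pods than the threshold, and maps
// the result into the [0, 1] interval.
func normalizeWithThreshold(rank, threshold int) float64 {
	if rank > threshold {
		rank = threshold
	}
	return float64(rank) / float64(threshold)
}

// weightedCost combines normalized ranks produced by multiple strategies
// into a single pod cost.
func weightedCost(normalizedRanks, weights map[string]float64) float64 {
	cost := 0.0
	for strategy, rank := range normalizedRanks {
		cost += weights[strategy] * rank
	}
	return cost
}
```

For example, a pod ranked 7 with a threshold of 10 by a topology strategy (weight 2) and ranked 3 with a threshold of 5 by a taint strategy (weight 1) would get a cost of 2\*0.7 + 1\*0.6 = 2.0.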

### User Stories [optional]

#### Story 1

From [@pnovotnak](https://github.com/kubernetes/kubernetes/issues/4301#issuecomment-328685358):

```
I have a number of scientific programs that I've wrapped with code to talk
to a message broker that do not checkpoint state. The cost of deleting the resource
increases over time (some of these tasks take hours), until it completes the current unit of work.

Choosing a pod by most idle resources would also work in my case.
```

#### Story 2

From [@cpwood](https://github.com/kubernetes/kubernetes/issues/4301#issuecomment-436587548):

```
For my use case, I'd prefer Kubernetes to choose its victims from pods that are running on nodes which have a PreferNoSchedule taint.
```

#### Story 3

From [@barucoh](https://github.com/kubernetes/kubernetes/issues/89922):

```
A deployment with 3 replicas with anti-affinity rule to spread across 2 AZs scaled down to 2 replicas in only 1 AZ.
```

### Implementation Details/Notes/Constraints

Currently, the descheduler does not allow reacting immediately to changes in a cluster.
Yet, with some modification, another instance of the descheduler (with a different set of strategies)
might be run in watch mode and rank each pod as it comes.
Also, once the scheduling framework gets migrated into its own repository,
scheduling plugins can be vendored as well to provide some of the core scheduling logic.

The pod ranking is best-effort, so in case a controller is to delete more than one pod,
it selects all the pods with the highest cost and removes those.
If a pod fails to be deleted during the scale down operation and the operation resumes in the next cycle,
pods may get ranked differently and a different set of victim pods may get selected.

Once a pod is removed, the ranks of other pods might need to be re-computed,
unless only strategies that do not require re-computation are deployed.
By default, all pods owned by a controller have to be ranked.
Otherwise, a controller falls back to its original victim selection logic;
alternatively, it can be configured to wait or back off.
Also, the ranking strategies can be configured to target only selected sets of pods,
thus allowing a controller to employ cost based selection only when more sophisticated
logic is required and available.

During the alpha phase, each controller utilizing the pod ranking will put the new logic behind a feature gate,
starting by utilizing a pod annotation (e.g. `scheduling.alpha.kubernetes.io/cost`)
which can eventually be promoted to either a field in the pod's status or moved under a CRD (see below).
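
A minimal sketch of how a controller could consume the annotation when picking victims; the helper below is hypothetical and only illustrates the intended fallback behavior (any unranked pod causes the controller to fall back to its original selection logic):

```go
package scaledown

import (
	"sort"
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// costAnnotation is the annotation proposed above; the exact key is still tentative.
const costAnnotation = "scheduling.alpha.kubernetes.io/cost"

// pickVictims returns the n pods with the highest cost. If any pod lacks a
// parseable cost, nil is returned and the controller is expected to fall
// back to its original victim selection logic (or wait/back off).
func pickVictims(pods []*corev1.Pod, n int) []*corev1.Pod {
	type ranked struct {
		pod  *corev1.Pod
		cost int
	}
	rankedPods := make([]ranked, 0, len(pods))
	for _, p := range pods {
		value, ok := p.Annotations[costAnnotation]
		if !ok {
			return nil // unranked pod: fall back to the default logic
		}
		cost, err := strconv.Atoi(value)
		if err != nil {
			return nil
		}
		rankedPods = append(rankedPods, ranked{pod: p, cost: cost})
	}
	// Pods with the highest cost are removed first.
	sort.Slice(rankedPods, func(i, j int) bool { return rankedPods[i].cost > rankedPods[j].cost })
	if n > len(rankedPods) {
		n = len(rankedPods)
	}
	victims := make([]*corev1.Pod, 0, n)
	for _, r := range rankedPods[:n] {
		victims = append(victims, r.pod)
	}
	return victims
}
```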

If strategies requiring rank re-computation are employed, it is more practical to define
a CRD for a pod group and have all the costs in a single place to avoid desynchronization
of ranks among pods.

#### Phases

Phase 1:
- add support for strategies which do not need rank re-computation of a pod group
- only a single strategy can be run to rank pods (unless threshold based normalization is applied)
- use annotations to hold a single pod cost

Phase 2A:
- promote the pod cost annotation to a pod status field
- no synchronization of pods in a pod group, harder to support strategies which require rank re-computation

Phase 2B:
- use a CRD to hold costs of all pods in a pod group (to synchronize re-computation of ranks)
- add support for strategies which require rank re-computation

### Option A (field in a pod status)

Store a pod cost/rank under the pod's status so it can be updated only by a component
that has permission to update the pod status.

```go
// PodStatus represents information about the status of a pod. Status may trail the actual
// state of a system, especially if the node that hosts the pod cannot contact the control
// plane.
type PodStatus struct {
...
	// Cost of the pod within its pod group; the higher the cost, the sooner
	// the pod is selected as a scale down victim.
	// More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-cost
	// +optional
	Cost int `json:"cost,omitempty" protobuf:"bytes,...,opt,name=cost"`
...
}
```

Very simple: the field is read directly from the pod status.
No additional apimachinery logic is needed.

### Option B (CRD for a pod group)

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: PodGroupCost
metadata:
  name: rc-guestbook-frontend
  namespace: rc-guestbook-frontend-namespace
spec:
  owner:
    kind: ReplicationController
    name: rc-guestbook-frontend # may be redundant
  costs:
    "rc-guestbook-frontend-pod1": 4
    "rc-guestbook-frontend-pod2": 8
    ...
    "rc-guestbook-frontend-podn": 2
```

More suitable for keeping all pod costs from a pod group in sync.
Controllers will need to take into account the new CRD (adding informers).
A CR will live in the same namespace as the underlying pod group (RC, RS, etc.).
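
A rough sketch of what the `PodGroupCost` type could look like in Go; the group/version and field names mirror the YAML example above and are not final:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PodGroupCost holds the costs of all pods belonging to a single pod group
// (e.g. all pods owned by one ReplicationController).
type PodGroupCost struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec PodGroupCostSpec `json:"spec,omitempty"`
}

// PodGroupCostSpec maps pod names to their current cost.
type PodGroupCostSpec struct {
	// Owner identifies the controller owning the pod group; it may be
	// redundant with the owner references on the pods themselves.
	Owner PodGroupOwner `json:"owner,omitempty"`
	// Costs maps a pod name to its cost; the higher the cost, the sooner
	// the pod is scaled down.
	Costs map[string]int `json:"costs,omitempty"`
}

// PodGroupOwner is a minimal reference to the owning controller.
type PodGroupOwner struct {
	Kind string `json:"kind"`
	Name string `json:"name"`
}
```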

### Workflow example

**Scenario**: a pod group of 12 pods, 3 AZs (2 nodes per AZ), pods evenly spread among all zones

1. Assume the pod group is supposed to respect topology spreading and the scale down
   operation is to minimize the topology skew between domains.
1. The ranking component is configured to rank pods based on their presence in a topology domain
1. The ranking component notices the pods, analyzes the pod group and ranks the pods in the following manner (`{PodName: Rank}`):
   - AZ1: {P1: 1, P2: 2, P3: 3, P4: 4} (P1 getting 1 as it was created first in the domain, P2 getting 2, etc.)
   - AZ2: {P5: 1, P6: 2, P7: 3, P8: 4}
   - AZ3: {P9: 1, P10: 2, P11: 3, P12: 4}
1. A scale down operation of the pod group is requested
1. The scale down logic selects one of P4, P8 or P12 as a victim (e.g. P8)
1. The topology skew is now `1`
1. There is no need to re-compute ranks since the ranking does not depend on the pod group size
1. Scaling down one more time selects one of {P4, P12}
1. The topology skew is still `1`

### Risks and Mitigations

It may happen that the ranking component does not rank all relevant pods in time.
In that case a controller can either choose to ignore the cost, or it can back off
with a configurable timeout and retry the scale down operation once all pods in
a given set are ranked.

From the security perspective, malicious code might assign a pod a different cost
with the goal of removing more vital pods to harm a running application.
How safe is using an annotation? It might be better to use the pod status
so only clients with pod/status update RBAC are allowed to change the cost.

In case a strategy needs to re-compute costs after a scale down operation and
the component stops working (for any reason), a controller might scale down
the incorrect pod(s) in the next request. This is one more reason to constrain strategies
to those that do not need to re-compute pod costs.

In case the scale down process is too quick, the component may be too slow to
recompute all scores and may provide suboptimal/incorrect costs.

In case two or more controllers own a pod group (through labels), scaling down the group by one
controller can result in the same group being scaled up by another controller,
entering an endless loop of scaling up and down. This may result in unexpected
behavior and leave a subgroup of pods unranked.

Deployment upgrades might have different expectations when exercising a rolling update.
They could just completely ignore the costs, unless it is acceptable to scale down by one
and wait until the costs are recomputed when needed.

## Design Details

### Test Plan

TBD

### Graduation Criteria

- Alpha: Initial support for taking the pod cost into account when scaling down in controllers. Disabled by default.
- Beta: Enabled by default

### Upgrade / Downgrade Strategy

Scaling down based on a pod cost is optional. If no cost is present, scaling down falls back to the original behavior.

### Version Skew Strategy

A controller either recognizes the pod's cost or it does not.

## Implementation History

- KEP started on 06/30/2020

## Alternatives [optional]

- Controllers might use a webhook and talk to the component directly to select a victim
- Some controllers might improve their decision logic to cover specific use cases (e.g. introduce a new policy for sorting pods based on information located in pod objects)