---
title: Cost based scaling down of pods
authors:
- "@ingvagabund"
owning-sig: sig-scheduling
participating-sigs:
- sig-apps
reviewers:
- TBD
approvers:
- TBD
editor: TBD
creation-date: 2020-06-30
last-updated: yyyy-mm-dd
status: provisional
---

# Cost based scaling down of pods

## Table of Contents

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Examples of strategies](#examples-of-strategies)
    - [Balancing duplicates among topological domains](#balancing-duplicates-among-topological-domains)
    - [Pods not tolerating taints first](#pods-not-tolerating-taints-first)
    - [Minimizing pod anti-affinity](#minimizing-pod-anti-affinity)
  - [Rank normalization and weighted sum](#rank-normalization-and-weighted-sum)
  - [User Stories [optional]](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
    - [Phases](#phases)
  - [Option A (field in a pod status)](#option-a-field-in-a-pod-status)
  - [Option B (CRD for a pod group)](#option-b-crd-for-a-pod-group)
  - [Workflow example](#workflow-example)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Implementation History](#implementation-history)
- [Alternatives [optional]](#alternatives-optional)
<!-- /toc -->

## Summary

Cost-ranking pods through an external component and scaling down pods based
on the cost makes it possible to employ various scheduling strategies that keep a cluster
from diverging from an optimal distribution of resources.
Providing an external solution for selecting the right victim improves the ability
to preserve various conditions such as balancing pods among failure domains, staying
aligned with security requirements, or respecting application policies.
Keeping controllers free of any scheduling strategy, while still making them aware
of the impact of removing pods on the overall cluster scheduling plan, helps to reduce
the cost of re-scheduling resources.

## Motivation

Scaling down a set of pods does not always result in an optimal selection of victims.
The scheduler relies on filters and scores which may distribute the pods with respect to topology
spreading and/or load balancing constraints (e.g. pods uniformly balanced among zones).
Application specific workloads may prefer to scale down short-running pods and favor long-running pods.
Selecting a victim with trivial logic can unbalance the topology spreading
or cause jobs to lose work they have already accumulated.
Given that it is a natural property of a cluster for workloads to shift over time,
a decision made by the scheduler at any given moment is only as good as its ability to predict future demands.
The default Kubernetes scheduler was constructed with the goal of providing high throughput
at the cost of being simple. Thus, it is quite easy to diverge from the scheduling plan.
In contrast, the descheduler helps to re-balance the plan and get closer to
the scheduler's constraints. Yet, it is designed to run and adjust the cluster periodically (e.g. each hour).
It is therefore unusable for scale down purposes, which require immediate action.

On the other hand, each controller with a scale down operation has its own
implementation of the victim selection logic.
This decision making logic does not take the scheduling plan into account.
Extending each such controller with additional logic to support various scheduling
constraints is impractical. In cases where a proprietary solution for scaling down is required,
it is impossible. Also, controllers do not necessarily have a whole-cluster overview,
so their decisions are not necessarily optimal.
Therefore, it is more feasible to locate the logic outside of a controller.

In order to support a more informed scale down operation while keeping the scheduling plan in mind,
additional decision logic that can be extended based on application requirements is needed.

### Goals

- Controllers with a scale down operation can select a victim while still respecting the scheduling plan
- An external component is available that can rank pods based on how much the cluster would diverge from the scheduling plan if they were deleted

### Non-Goals

- Employing strategies that require cost re-computation after scaling up/down (with support from controllers, e.g. backing off)

## Proposal

The proposed solution is to implement an optional cost-based component that watches all
pods (or a subset of them) and nodes (and potentially other objects) present in a cluster,
assigning each pod a cost based on a set of scheduling constraints.
At the same time, controller logic is extended to utilize the pod cost when selecting a victim during a scale down operation.

The component will allow a different list of scheduling constraints to be selected for each targeted
set of pods. Each pod in a set will be given a cost based on how important it is within the set.
The constraints can follow the same rules as the scheduler (through importing scheduling plugins)
or be custom made (e.g. with respect to application or proprietary requirements).
The component will implement a mechanism for ranking pods.
<!-- Either by annotating a pod, updating its status, setting a new field in pod's spec
or creating a new CRD which will carry a cost. -->
Each controller will have a choice to either ignore the cost or take it into account
when scaling down.

This way, the logic for selecting a victim for the scale down operation is
separated from each controller, allowing each consumer to provide its own
logic for assigning costs, yet having all controllers consume the cost uniformly.

Given that the default scheduler is not a source of truth about how a pod should be distributed
after it was scheduled, scale down strategies can exercise completely different approaches.

Examples of scheduling constraints:
- choose pods running on a node which has a `PreferNoSchedule` taint first
- choose the youngest/oldest pods first
- choose pods minimizing topology skew among failure domains (e.g. availability zones)

The goal of the proposal is not to provide specific strategies for a more informed scale down operation.
The primary goal is to provide a mechanism and have controllers implement it,
allowing consumers of the new component to define their own strategies.

### Examples of strategies

Strategies can be divided into two categories:
- scaling down/up a pod group does not require rank re-computation
- scaling down/up a pod group requires rank re-computation

#### Balancing duplicates among topological domains

- Evict pods while minimizing the skew between topological domains.
- Each pod can be given a cost based on how old/young it is within its domain:
  - if a pod is the first one in the domain, rank the pod with cost `1`
  - if a pod was created second in the domain, rank the pod with cost `2`
  - continue this way until all pods in all domains are ranked
- The higher the rank of a pod, the sooner the pod gets removed (a sketch follows this list).

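A minimal sketch of this ranking, assuming the topology domain of a pod can be derived by a caller-supplied `domainKey` function (for example from its node's zone label); the package, function, and parameter names are illustrative, not part of any existing API:

```go
package ranking

import (
	"sort"

	v1 "k8s.io/api/core/v1"
)

// rankByDomainAge assigns each pod a cost based on its creation order within
// its topology domain: the first pod created in a domain gets cost 1, the
// second cost 2, and so on. The pod with the highest cost is removed first,
// which keeps the skew between domains minimal.
func rankByDomainAge(pods []*v1.Pod, domainKey func(*v1.Pod) string) map[string]int {
	byDomain := map[string][]*v1.Pod{}
	for _, pod := range pods {
		d := domainKey(pod)
		byDomain[d] = append(byDomain[d], pod)
	}
	costs := map[string]int{}
	for _, group := range byDomain {
		sort.Slice(group, func(i, j int) bool {
			return group[i].CreationTimestamp.Before(&group[j].CreationTimestamp)
		})
		for i, pod := range group {
			costs[pod.Name] = i + 1
		}
	}
	return costs
}
```
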
#### Pods not tolerating taints first

- Evict pods that do not tolerate taints before pods that tolerate them.
- Each pod can be given a cost based on how many taints it does not tolerate.
- The higher the rank of a pod, the sooner the pod gets removed (a sketch follows this list).

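A rough sketch of how such a cost could be computed, counting the taints on the pod's node that the pod does not tolerate; the helper name is illustrative:

```go
package ranking

import v1 "k8s.io/api/core/v1"

// taintCost returns the number of taints on the pod's node that the pod does
// not tolerate. Pods with more untolerated taints get a higher cost and are
// therefore removed sooner.
func taintCost(pod *v1.Pod, node *v1.Node) int {
	cost := 0
	for i := range node.Spec.Taints {
		tolerated := false
		for j := range pod.Spec.Tolerations {
			if pod.Spec.Tolerations[j].ToleratesTaint(&node.Spec.Taints[i]) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			cost++
		}
	}
	return cost
}
```
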
#### Minimizing pod anti-affinity

- Evict pods maximizing anti-affinity first.
- A pod whose removal improves anti-affinity on a node gets a higher rank.
- Given that multiple pod groups can be part of an anti-affinity group, scaling down
  a single pod in a group requires re-computation of the ranks of all pods
  taking part. Also, only a single pod can be scaled down at a time.
  Otherwise, the ranks may no longer provide an optimal victim selection
  (a rough sketch follows this list).

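A very rough sketch of one possible cost for this strategy, assuming only required anti-affinity terms and a hostname topology key; namespace selectors and preferred terms are ignored, and all names are illustrative:

```go
package ranking

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// antiAffinityCost counts how many pods co-located on the same node are matched
// by the pod's required anti-affinity terms. Removing the pod with the highest
// count improves anti-affinity the most, so it gets the highest rank.
func antiAffinityCost(pod *v1.Pod, podsOnSameNode []*v1.Pod) int {
	if pod.Spec.Affinity == nil || pod.Spec.Affinity.PodAntiAffinity == nil {
		return 0
	}
	cost := 0
	for _, term := range pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution {
		sel, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
		if err != nil {
			continue
		}
		for _, other := range podsOnSameNode {
			if other.UID != pod.UID && sel.Matches(labels.Set(other.Labels)) {
				cost++
			}
		}
	}
	return cost
}
```
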
In the provided examples, the first two strategies do not require rank re-computation.

### Rank normalization and weighted sum

In order to allow pod ranking by multiple strategies/constraints, it is important
to normalize ranks. On the other hand, rank normalization requires all strategies
to re-compute all ranks every time a pod is created/deleted. To eliminate the need
to re-compute, each strategy can introduce a threshold where every pod rank
exceeding the threshold gets rounded down to the threshold.
E.g. if a topology domain has at least 10 pods, the 11th and subsequent pods get the same
rank as the 10th pod.
With threshold-based normalization, multiple strategies can rank a pod group,
and the results can be combined into a weighted rank across all relevant strategies (see the sketch below).

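A small sketch of the threshold-based normalization and weighted sum described above; the strategy weights, thresholds, and type names are illustrative configuration, not an agreed-upon format:

```go
package ranking

// strategyRanks holds the output of a single strategy together with its weight
// and the threshold at which ranks are capped.
type strategyRanks struct {
	weight    float64
	threshold int            // ranks above this value are rounded down to it
	ranks     map[string]int // pod name -> rank assigned by the strategy
}

// weightedCost normalizes each strategy's rank into [0, 1] using its threshold
// and combines the strategies into a single weighted cost for the pod.
func weightedCost(podName string, strategies []strategyRanks) float64 {
	total := 0.0
	for _, s := range strategies {
		if s.threshold <= 0 {
			continue
		}
		rank := s.ranks[podName]
		if rank > s.threshold {
			rank = s.threshold
		}
		total += s.weight * float64(rank) / float64(s.threshold)
	}
	return total
}
```
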
### User Stories [optional]

#### Story 1

From [@pnovotnak](https://github.com/kubernetes/kubernetes/issues/4301#issuecomment-328685358):

```
I have a number of scientific programs that I've wrapped with code to talk
to a message broker that do not checkpoint state. The cost of deleting the resource
increases over time (some of these tasks take hours), until it completes the current unit of work.

Choosing a pod by most idle resources would also work in my case.
```

#### Story 2

From [@cpwood](https://github.com/kubernetes/kubernetes/issues/4301#issuecomment-436587548):

```
For my use case, I'd prefer Kubernetes to choose its victims from pods that are running on nodes which have a PreferNoSchedule taint.
```

#### Story 3

From [@barucoh](https://github.com/kubernetes/kubernetes/issues/89922):

```
A deployment with 3 replicas with anti-affinity rule to spread across 2 AZs scaled down to 2 replicas in only 1 AZ.
```

### Implementation Details/Notes/Constraints

Currently, the descheduler does not allow reacting immediately to changes in a cluster.
Yet, with some modification, another instance of the descheduler (with a different set of strategies)
might be run in watch mode and rank each pod as it comes.
Also, once the scheduling framework gets migrated into its own repository,
scheduling plugins can be vendored as well to provide some of the core scheduling logic.

The pod ranking is best-effort, so in case a controller is to delete more than one pod,
it selects all the pods with the highest cost and removes those.
In case a pod fails to be deleted during the scale down operation and the operation is resumed in the next cycle,
it may happen that pods get ranked differently and a different set of victim pods gets selected.

Once a pod is removed, the ranks of other pods might need to be re-computed,
unless only strategies that do not require re-computation are deployed.
By default, all pods owned by a controller template have to be ranked.
Otherwise, a controller falls back to its original victim selection logic.
Alternatively, it can be configured to wait or back off.
Also, the ranking strategies can be configured to target only selected sets of pods,
allowing a controller to employ cost based selection only when more sophisticated
logic is required and available.

During the alpha phase, each controller utilizing the pod ranking will feature gate the new logic,
starting by utilizing a pod annotation (e.g. `scheduling.alpha.kubernetes.io/cost`)
which can eventually be promoted to a field in the pod's status or moved under a CRD (see below).

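A sketch of the controller side under the alpha annotation mechanism; only the annotation key comes from the proposal above, while the helper and its fallback behaviour are illustrative assumptions:

```go
package controller

import (
	"strconv"

	v1 "k8s.io/api/core/v1"
)

const costAnnotation = "scheduling.alpha.kubernetes.io/cost"

// pickVictim returns the pod with the highest cost annotation. If any pod in
// the group is not ranked yet (or carries an unparsable cost), it returns false
// and the controller falls back to its original victim selection logic or backs off.
func pickVictim(pods []*v1.Pod) (*v1.Pod, bool) {
	var victim *v1.Pod
	highest := 0
	for _, pod := range pods {
		raw, ok := pod.Annotations[costAnnotation]
		if !ok {
			return nil, false
		}
		cost, err := strconv.Atoi(raw)
		if err != nil {
			return nil, false
		}
		if victim == nil || cost > highest {
			victim, highest = pod, cost
		}
	}
	return victim, victim != nil
}
```
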
If strategies requiring rank re-computation are employed, it's more practical to define
a CRD for a pod group and have all the costs in a single place to avoid desynchronization
of ranks among pods.

#### Phases

Phase 1:
- add support for strategies which do not need rank re-computation of a pod group
- only a single strategy can be run to rank pods (unless threshold based normalization is applied)
- use annotations to hold a single pod cost

Phase 2A:
- promote the pod cost annotation to a pod status field
- no synchronization of pods within a pod group, which makes strategies that require rank re-computation harder to support

Phase 2B:
- use a CRD to hold the costs of all pods in a pod group (to synchronize re-computation of ranks)
- add support for strategies which require rank re-computation

### Option A (field in a pod status)

Store a pod cost/rank under the pod's status so it can be updated only by a component
that has permission to update the pod status.

```go
// PodStatus represents information about the status of a pod. Status may trail the actual
// state of a system, especially if the node that hosts the pod cannot contact the control
// plane.
type PodStatus struct {
	...
	// More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-cost
	// +optional
	Cost int `json:"cost,omitempty" protobuf:"bytes,...,opt,name=cost"`
	...
}
```

This option is very simple: the cost is read directly from the pod status,
with no additional apimachinery logic.

### Option B (CRD for a pod group)

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: PodGroupCost
metadata:
  name: rc-guestbook-fronted
  namespace: rc-guestbook-fronted-namespace
spec:
  owner:
    kind: ReplicationController
    name: rc-guestbook-fronted # may be redundant
  costs:
    "rc-guestbook-fronted-pod1": 4
    "rc-guestbook-fronted-pod2": 8
    ...
    "rc-guestbook-fronted-podn": 2
```

This option is more suitable for keeping all pod costs from a pod group in sync.
Controllers will need to take the new CRD into account (adding informers).
A CR will live in the same namespace as the underlying pod group (RC, RS, etc.).

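For illustration, the Go types backing such a CRD could look roughly as follows; the field names simply mirror the YAML example above and are not an agreed-upon API:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PodGroupCost holds the costs of all pods belonging to a single pod group,
// keeping them in one place so re-computed ranks stay in sync.
type PodGroupCost struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec PodGroupCostSpec `json:"spec"`
}

type PodGroupCostSpec struct {
	// Owner identifies the controller owning the pod group; it may be redundant
	// with owner references on the pods themselves.
	Owner GroupOwner `json:"owner,omitempty"`
	// Costs maps pod names to their current cost.
	Costs map[string]int64 `json:"costs"`
}

type GroupOwner struct {
	Kind string `json:"kind"`
	Name string `json:"name"`
}
```
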
### Workflow example

**Scenario**: a pod group of 12 pods, 3 AZs (2 nodes per AZ), pods evenly spread among all zones

1. Assume the pod group is supposed to respect topology spreading and the scale down
   operation is to minimize the topology skew between domains.
1. The ranking component is configured to rank pods based on their presence in a topology domain.
1. The ranking component notices the pods, analyzes the pod group and ranks the pods in the following manner (`{PodName: Rank}`):
   - AZ1: {P1: 1, P2: 2, P3: 3, P4: 4} (P1 getting 1 as it was created first in the domain, P2 getting 2, etc.)
   - AZ2: {P5: 1, P6: 2, P7: 3, P8: 4}
   - AZ3: {P9: 1, P10: 2, P11: 3, P12: 4}
1. A scale down operation of the pod group is requested.
1. The scale down logic selects one of P4, P8 or P12 as a victim (e.g. P8).
1. The topology skew is now `1` (see the sketch below for how the skew is computed).
1. There is no need to re-compute ranks since the ranking does not depend on the pod group size.
1. Scaling down one more time selects one of {P4, P12}.
1. The topology skew is still `1`.

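For reference, the topology skew used in the steps above can be read as the difference between the most and least populated domains, as in this small sketch (the helper name is illustrative):

```go
package ranking

// topologySkew returns the difference between the largest and smallest number
// of pods per topology domain; 12 pods spread 4/4/4 give a skew of 0, and
// removing one pod (4/3/4) gives a skew of 1.
func topologySkew(podsPerDomain map[string]int) int {
	first := true
	lo, hi := 0, 0
	for _, n := range podsPerDomain {
		if first {
			lo, hi = n, n
			first = false
			continue
		}
		if n < lo {
			lo = n
		}
		if n > hi {
			hi = n
		}
	}
	return hi - lo
}
```
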
### Risks and Mitigations

It may happen that the ranking component does not rank all relevant pods in time.
In that case a controller can either choose to ignore the cost, or it can back off
with a configurable timeout and retry the scale down operation once all pods in
a given set are ranked.

From the security perspective, malicious code might assign a pod a different cost
with the goal of removing more vital pods to harm a running application.
How safe is using an annotation? It might be better to use the pod status
so only clients with pod/status update RBAC are allowed to change the cost.

In case a strategy needs to re-compute costs after a scale down operation and
the component stops working (for any reason), a controller might scale down
the incorrect pod(s) in the next request. This is one more reason to constrain strategies
to those that do not need to re-compute pod costs.

In case the scale down process is too quick, the component may be too slow to
recompute all scores and may provide suboptimal/incorrect costs.

In case two or more controllers own a pod group (through labels), scaling down the group by one
controller can result in scaling up the same group by another controller,
entering an endless loop of scaling up and down. This may result in unexpected
behavior and leave a subgroup of pods unranked.

Deployment upgrades might have different expectations when exercising a rolling update.
They could just completely ignore the costs, unless it is acceptable to scale down by one
and wait until the costs are recomputed when needed.

## Design Details

### Test Plan

TBD

### Graduation Criteria

- Alpha: Initial support for taking the pod cost into account when scaling down in controllers. Disabled by default.
- Beta: Enabled by default.

### Upgrade / Downgrade Strategy

Scaling down based on a pod cost is optional. If no cost is present, scaling down falls back to the original behavior.

### Version Skew Strategy

A controller either recognizes the pod's cost or it does not.

## Implementation History

- KEP started on 06/30/2020

## Alternatives [optional]

- Controllers might use a webhook and talk to the component directly to select a victim
- Some controllers might improve their decision logic to cover specific use cases (e.g. introduce a new policy for sorting pods based on information located in pod objects)
