
Conversation

jbartosik
Collaborator

@jbartosik jbartosik commented May 11, 2023

What type of PR is this?

/kind documentation
/kind feature
/kind api-change

What this PR does / why we need it:

Adjust Autoscaling Enhancement Proposal 4831 to work better with in-place update support. See #5755 for details of in-place support.

I divided the PR into a few commits to make the review easier (separating the main content change, formatting changes, the file rename, and minor edits).

@voelzmo @kgolab @wangchen615 please take a look

#4831

jbartosik added 3 commits May 11, 2023 14:20
- Add history
- Change name from KEP (Kubernetes Enhancement Proposal) to AEP (Autoscaling Enhancement Proposal)
@jbartosik jbartosik added area/vertical-pod-autoscaler api-review Categorizes an issue or PR as actively needing an API review. labels May 11, 2023
@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels May 11, 2023
@jbartosik jbartosik changed the title Eviction control aep mod Adjust Eviction Control AEP for in-place updates May 11, 2023
@jbartosik
Collaborator Author

@pbetkier please take a look

@liggitt
Member

liggitt commented May 11, 2023

if this is ready for API review,
/assign @thockin
for in-place context

@jbartosik
Collaborator Author

/retest

@jbartosik jbartosik removed the api-review Categorizes an issue or PR as actively needing an API review. label May 12, 2023
@jbartosik jbartosik force-pushed the eviction-control-aep-mod branch from 2eedcfd to 0c57898 Compare May 12, 2023 12:06
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jbartosik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 12, 2023
Contributor

@voelzmo voelzmo left a comment


Thanks for splitting this PR up in nice, digestible chunks – that made looking at the individual parts very easy!

My understanding of what this change is trying to achieve is: Make this proposal useful after in-place resource updates have become the norm, such that we don't add features to the eviction workflow in VPA which will (potentially?) go away in a while.

I appreciate the idea behind this! At this point, I'm not sure how useful restricting actuation based on scaling direction is in the case of in-place updates. Our main motivation when proposing controlling eviction behavior based on scaling direction was to avoid disruptions for scenarios in which we're fine trading inefficiency (resources being higher than necessary) for freedom from evictions.

So, when in-place resource updates make it to the VPA, I don't think I would want any restrictions being applied to disruption-free resource updates and still want to have restrictions for eviction-based resource updates – be it because VPA is falling back to eviction or because my workload is configured with an update mode which doesn't allow for in-place updates.

Does this make sense?

The following analysis focuses on the `InPlaceOrRecreate` mode. In other modes, having separate controls for in-place and
eviction actuation doesn't offer any more control than a single setting, because VPA won't actively actuate (`Off` and
`Initial`) or because it will use only one mode of actuation (`Recreate` and `InPlaceOnly`). It also assumes that scale
downs always succeed.
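For readers who don't have the modes memorized: a minimal sketch of the update modes the excerpt refers to, written in the Go style VPA's API uses. `InPlaceOnly` and `InPlaceOrRecreate` are the modes proposed in the in-place AEP and are not part of the released API, so the exact names and semantics here are assumptions.

```go
package vpa

// UpdateMode mirrors the shape of VPA's updatePolicy.updateMode field.
type UpdateMode string

const (
	// Off: VPA only records recommendations and never actuates them.
	UpdateModeOff UpdateMode = "Off"
	// Initial: requests are set only when a Pod is created.
	UpdateModeInitial UpdateMode = "Initial"
	// Recreate: VPA actuates exclusively by evicting Pods.
	UpdateModeRecreate UpdateMode = "Recreate"
	// InPlaceOnly (proposed): actuate only via in-place resizes.
	UpdateModeInPlaceOnly UpdateMode = "InPlaceOnly"
	// InPlaceOrRecreate (proposed): try an in-place resize first and
	// fall back to eviction when the resize cannot be applied.
	UpdateModeInPlaceOrRecreate UpdateMode = "InPlaceOrRecreate"
)
```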
Contributor


It also assumes that scale downs always succeed.

Can you elaborate a bit on this assumption? How/why is this safe to assume?
From looking at the KEP, I see that the process of scaling down limits may be (temporarily) blocked by reclaiming some memory – are we sure this always (eventually) succeeds?

The KEP itself isn't very detailed about failure scenarios, though, so I'm not sure if this is because everything eventually turns out fine or if at this point it is hard to tell what happens ;)

Collaborator Author


I noticed I was making that assumption; it seemed reasonable, so I wanted to make it explicit. I didn't check it very carefully.

I think even without it, it still makes sense to have one set of controls for actuation.

Without this assumption, having separate controls makes some sense. One might want to scale down only in place (don't evict to save resources) but allow eviction to scale up (evict to allow continued functioning). I'm not sure how useful that is, though.

VPA will attempt to scale in place before attempting to evict. Since scale downs succeed, VPA will never consider evicting
pods to scale them down, so configuring scale down in place separately from scale down by eviction is unnecessary.

If a user wants to scale up in place only, not evict, they should use `InPlaceOnly` mode (since we assume scale downs in
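As an aside, a minimal sketch of the actuation order this excerpt describes for `InPlaceOrRecreate`: in-place first, eviction only as a fallback. The types and helpers (`resizeInPlace`, `evict`) are hypothetical stand-ins, not the actual updater code.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins; the real updater works with core/v1 Pods and
// VPA recommendations.
type pod struct{ name string }
type recommendation struct{ cpuMilli, memoryBytes int64 }

// resizeInPlace stands in for the in-place resize path; it can fail,
// e.g. when the node cannot accommodate the new requests.
func resizeInPlace(p *pod, r recommendation) error {
	return errors.New("resize infeasible")
}

// evict stands in for the eviction path, after which the recommendation
// is applied when the Pod is recreated.
func evict(p *pod) error {
	fmt.Printf("evicting %s\n", p.name)
	return nil
}

// actuate applies a recommendation with InPlaceOrRecreate semantics.
func actuate(p *pod, r recommendation) error {
	if err := resizeInPlace(p, r); err == nil {
		return nil // applied without disruption
	}
	return evict(p)
}

func main() {
	_ = actuate(&pod{name: "example"}, recommendation{cpuMilli: 500})
}
```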
Collaborator


The only case I could think of to act differently here is if my app is not that good at reducing its resource (mostly memory) usage w/o a restart. IIUC this could be addressed within PodSpec, not via VPA settings, right?

Collaborator Author


I had some problems with updating requests of containers that require a restart (the restart didn't happen, and the resize got stuck in InProgress for >1h before I stopped looking).

I hope this will be fixed before beta so we can rely on containers being restarted and properly configured.

@kgolab
Collaborator

kgolab commented May 16, 2023

So, when in-place resource updates make it to the VPA, I don't think I would want any restrictions being applied to disruption-free resource updates and still want to have restrictions for eviction-based resource updates – be it because VPA is falling back to eviction or because my workload is configured with an update mode which doesn't allow for in-place updates.

@voelzmo, I'm not sure I fully understood here: are you saying you'd always want in-place updates to happen and only care about evictions?

If yes: unlimited in-place scale down is likely to work but might end up in an eviction later, as it might not be reversible (e.g. more Pods getting scheduled on a Node or others growing). Would that be a good reason to limit in-place scale down as well?

@jbartosik
Collaborator Author

So, when in-place resource updates make it to the VPA, I don't think I would want any restrictions being applied to disruption-free resource updates and still want to have restrictions for eviction-based resource updates – be it because VPA is falling back to eviction or because my workload is configured with an update mode which doesn't allow for in-place updates.

Marco Voelz, I'm not sure I fully understood here: are you saying you'd always want in-place updates to happen and only care about evictions?

If yes: unlimited in-place scale down is likely to work but might end up in an eviction later, as it might not be reversible (e.g. more Pods getting scheduled on a Node or others growing). Would that be a good reason to limit in-place scale down as well?

Marco Voelz, Karol Gołąb:

I don't think eviction will go away completely in the foreseeable future. I don't think we'll have a guarantee that in-place updates will succeed, so we'll want eviction as a fallback.

This is more about the case Karol described: a user might want to avoid scale downs (because there is no guarantee that they'll be able to scale up without disruption) but allow scale-ups (preferring in-place, but with eviction as a fallback).

@liggitt
Member

liggitt commented May 25, 2023

Based on #5754 (comment), it looks like this wasn't asking for API review

/remove-label api-review

@k8s-ci-robot
Contributor

@liggitt: Those labels are not set on the issue: api-review

In response to this:

Based on #5754 (comment), it looks like this wasn't asking for API review

/remove-label api-review

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

of type `[]*ActuationRequirement`. A single `ActuationRequirement` defines a condition which must be `true` to allow
actuation for the corresponding `Container`. When multiple `ActuationRequirements` are specified for a `Pod`, all of
them must evaluate to `true` to allow eviction. VPA may partially actuate its recommendation in place, obeying all the
`ActuationRequirements` while not applying parts of the recommendation that would violate any `ActuationRequirement`.
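A rough Go sketch of the field as described above; the names follow the excerpt (reviewers below suggest renaming to `EvictionRequirements`), and the evaluator is a hypothetical illustration of the all-must-be-true semantics, not the actual implementation.

```go
package main

import "fmt"

// ActuationRequirement is a sketch of the proposed API type: which
// resources it applies to and how Target must compare to Requests.
type ActuationRequirement struct {
	// Resources, e.g. ["CPU"], ["Memory"], or ["CPU", "Memory"].
	Resources []string
	// ChangeRequirement, e.g. "TargetHigherThanRequests".
	ChangeRequirement string
}

// allowActuation applies the AND semantics from the excerpt: every
// requirement must evaluate to true for actuation to be allowed.
func allowActuation(reqs []*ActuationRequirement, holds func(*ActuationRequirement) bool) bool {
	for _, r := range reqs {
		if !holds(r) {
			return false
		}
	}
	return true
}

func main() {
	reqs := []*ActuationRequirement{
		{Resources: []string{"CPU"}, ChangeRequirement: "TargetHigherThanRequests"},
	}
	// A trivial evaluator for demonstration; a real one would compare a
	// recommendation against the Pod's current requests.
	fmt.Println(allowActuation(reqs, func(*ActuationRequirement) bool { return true }))
}
```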
Member


If actuation still happens (using in-place), I think naming the field `EvictionRequirements` or `RestartRequirements` would be more transparent.

If we later want to add a similar mechanism to prevent even in-place changes in certain scenarios that would still be possible.

Member


I see you mention that below in part; I agree with that.
For easier understanding I'd still suggest using a different name, though.


### Goals
* For a specific VPA object, allow turning off actuation for downscaling entirely while still allowing upscaling
* Allow for resource-specific decisions: The desired policy may be different for CPU and Memory
Contributor

@dbenque dbenque Jul 3, 2023


This might be opened up to resources other than CPU and Memory in the future, so let's pay attention to keeping the API open to handling other types of resources. I am thinking of network bandwidth reservation, for example.

Collaborator Author


I think the proposal is easy to extend to other resources. Do you expect any problems?

(`Target`) with the existing requests on a Pod (`Requests`). Possible values for `Resources` are `[CPU]` and `[Memory]`
or both `[CPU,Memory]`. If `Resources: [CPU, Memory]`, the condition must be true for either of the two resources to
allow for actuation. Possible values for `ChangeRequirement` are `TargetHigherThanRequests`,
`TargetHigherThanOrEqualToRequests`, `TargetLowerThanRequests` and `TargetLowerThanOrEqualToRequests`.
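Continuing in the same sketch style: a hypothetical evaluation of a single requirement per the excerpt's semantics. With `Resources: [CPU, Memory]`, the comparison only needs to hold for one of the listed resources; only the strict variants are shown, since the `OrEqual` ones are dropped later in this thread. The string keys and `int64` quantities are simplifications, not the real API types.

```go
package main

import "fmt"

// requirementHolds returns true when the comparison named by
// changeRequirement holds for at least one of the listed resources.
func requirementHolds(changeRequirement string, resources []string, target, requests map[string]int64) bool {
	for _, res := range resources {
		switch changeRequirement {
		case "TargetHigherThanRequests":
			if target[res] > requests[res] {
				return true
			}
		case "TargetLowerThanRequests":
			if target[res] < requests[res] {
				return true
			}
		}
	}
	return false
}

func main() {
	target := map[string]int64{"CPU": 200, "Memory": 512}
	requests := map[string]int64{"CPU": 250, "Memory": 256}
	// Holds: the Memory target is above requests even though CPU is below.
	fmt.Println(requirementHolds("TargetHigherThanRequests", []string{"CPU", "Memory"}, target, requests))
}
```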
Contributor

@dbenque dbenque Jul 3, 2023


Why do we need the variations with `OrEqual`? Couldn't it be `OrEqual` by default? Does a diff of 0 vs. 0.0001 make a difference in that use case?

Contributor


Given that VPA has other mechanisms in place to ensure there is a large enough diff between the new recommendation and the current requests, I guess it should be fine to remove the `OrEqual` variations entirely. WDYT, @jbartosik?

Collaborator Author


Yes, let's do that

@voelzmo
Contributor

voelzmo commented Jul 5, 2023

So, when in-place resource updates make it to the VPA, I don't think I would want any restrictions being applied to disruption-free resource updates and still want to have restrictions for eviction-based resource updates – be it because VPA is falling back to eviction or because my workload is configured with an update mode which doesn't allow for in-place updates.

@voelzmo, I'm not sure I fully understood here: are you saying you'd always want in-place updates to happen and only care about evictions?

Sorry for dropping the ball on this one, I completely got sidetracked by other things.
You understood correctly: I'm voting for not extending the original proposal to also apply to in-place updates, but rather only deal with evictions, as it was intended.
This means it should still apply whenever the Updater falls back to eviction, but in-place resource updates (regardless of whether the containers specified they need a restart) would not be subject to the `EvictionRequirements`.

If yes: unlimited in-place scale down is likely to work but might end up in an eviction later as it might not be reversible (e.g. more Pods getting scheduled on a Node or others growing). Would that be a good reason do limit in-place scale down as well?

TBH, I'm currently not super concerned about that scenario. If you're running in a setting where Nodes are 99% allocated, even scale-ups without previous scale-downs would need to result in evictions. If that's the case, and you need to avoid disruptions at all costs, it seems like VPA is probably not what you want to use at all?

@thockin
Member

thockin commented Jul 17, 2023

I ACK the nomination to do API review. I am a little out of touch with the VPA APIs, so I will have to come up to speed. Maybe we can have a Slack or Zoom discussion after code freeze?

@voelzmo
Contributor

voelzmo commented Jul 18, 2023

@thockin I think we tried to remove the request for api-review once or twice, but this didn't work as expected. I think this should be fine without a review from your side. Sorry for the confusion ;(

We are changing parts of the VPA API by adding an option for the user to control whether a Pod should be evicted based on whether the VPA is scaling up or down, but I think we're OK to discuss these changes inside SIG Autoscaling? Don't want to waste your time here :)

@jbartosik: please correct me if I misunderstood the intent and you wanted @thockin to review this from the API side.

/remove-kind api-change

@k8s-ci-robot k8s-ci-robot removed the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Jul 18, 2023
@jbartosik
Collaborator Author

I think it makes sense to have controls for in-place scaling. An in-place resize can still be a disruption (in-place updates might cause a container restart), so one might still want to allow scale-ups (to avoid the even bigger disruption when containers start crashing).

@jbartosik
Collaborator Author

I think we will need to allow control over the direction of in-place resizes. From what @kwiesmueller and @voelzmo write, it sounds like you'd prefer separate controls for in-place resizes and eviction. I preferred a single control for both because having fewer config options makes things easier to use (harder to create bad configs, fewer things to understand).

So I'd:

  1. Close this PR
  2. Add a note to the 4016-in-place-updates-support AEP that we should have separate controls for in-place resizes, with a short explanation of why we prefer that over having one set of controls for in-place and evict actuation (in-place is in alpha, so I don't think we need to do a full AEP yet)
  3. Adjust 4831-control-eviction-behavior to drop TargetHigherThanOrEqualToRequests and TargetLowerThanOrEqualToRequests

@kwiesmueller @voelzmo please:

  • confirm that I understood you correctly
  • confirm that actions I wrote down seem like a good plan to you
  • let me know if you want to do any of the changes to existing AEPs I mention (or if I should do them)

@kwiesmueller
Member

Sounds good to me. I don't think I'll get to making any changes this week :-/
If it can wait, I can work on a PR to update the in-place AEP next week.

@voelzmo
Contributor

voelzmo commented Jul 25, 2023

I'm happy with closing this PR and having EvictionRequirements apply only to evictions. I'm not sure we have a clear indication at this point that we will need the same configuration options for in-place resource updates as well, but I think it is a good idea to move this into a separate PR/discussion.

I can take over point 3

Adjust 4831-control-eviction-behavior to drop TargetHigherThanOrEqualToRequests and TargetLowerThanOrEqualToRequests

And remove the OrEqual options from the proposal and the API/implementation PRs.

Thanks for moving this forward, Joachim!

@jbartosik
Collaborator Author

Thank you @kwiesmueller and @voelzmo. I opened #5990 and #5991 and assigned them to you to make it easier for us not to forget about this.

@jbartosik jbartosik closed this Jul 27, 2023