Adjust Eviction Control AEP for in-place updates #5754
Conversation
- Add history
- Change name from KEP (Kubernetes Enhancement Proposal) to AEP (Autoscaling Enhancement Proposal)
@pbetkier please take a look
if this is ready for API review,
/retest
Force-pushed from 2eedcfd to 0c57898 (compare)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jbartosik. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Thanks for splitting this PR up in nice, digestible chunks – that made looking at the individual parts very easy!
My understanding of what this change is trying to achieve is: Make this proposal useful after in-place resource updates have become the norm, such that we don't add features to the eviction workflow in VPA which will (potentially?) go away in a while.
I appreciate the idea behind this! At this point, I'm not sure how useful restricting actuation based on scaling direction is in the case of in-place updates. Our main motivation when proposing controlling eviction behavior based on scaling direction was to avoid disruptions for scenarios in which we're fine trading inefficiency (resources being higher than necessary) for freedom from evictions.
So, when in-place resource updates make it to the VPA, I don't think I would want any restrictions being applied to disruption-free resource updates and still want to have restrictions for eviction-based resource updates – be it because VPA is falling back to eviction or because my workload is configured with an update mode which doesn't allow for in-place updates.
Does this make sense?
vertical-pod-autoscaler/enhancements/4831-control-actuation-behavior/README.md
> The following analysis focuses on the `InPlaceOrRecreate` mode. In other modes, having separate controls for in-place and eviction actuation doesn't offer any more control than a single setting, because VPA won't actively actuate (`Off` and `Initial`) or because it will use only one mode of actuation (`Recreate` and `InPlaceOnly`). It also assumes that scale-downs always succeed.
> It also assumes that scale-downs always succeed.
Can you elaborate a bit on this assumption? How/why is this safe to assume?
From looking at the KEP, I see that the process of scaling down limits may be (temporarily) blocked by reclaiming some memory – are we sure this always (eventually) succeeds?
The KEP itself isn't very detailed about failure scenarios, though, so I'm not sure if this is because everything eventually turns out fine or if at this point it is hard to tell what happens ;)
I noticed I was making that assumption; it seemed reasonable, so I wanted to make it explicit. I didn't check it very carefully.
I think that even without it, it still makes sense to have one set of controls for actuation.
Without this assumption, having separate controls makes some sense: one might want to scale down only in place (don't evict just to save resources) but allow eviction to scale up (evict to allow continued functioning). I'm not sure how useful that is, though.
> VPA will attempt to scale in place before attempting to evict. Since scale-downs succeed, VPA will never consider evicting pods to scale them down, so configuring scale-down in place separately from scale-down by eviction is unnecessary.
>
> If a user wants to scale up in place only, not evict, they should use `InPlaceOnly` mode (since we assume scale-downs in …
The only case I could think of to act differently here is if my app is not that good at reducing its resource (mostly memory) usage w/o a restart. IIUC this could be addressed within PodSpec, not via VPA settings, right?
I had some problems with updating requests of containers that require a restart: the restart didn't happen, and the resize got stuck in InProgress for more than an hour, at which point I stopped looking.
I hope this will be fixed before beta, so we can rely on containers being restarted and properly configured.
@voelzmo, I'm not sure I fully understood here: are you saying you'd always want in-place updates to happen and only care about evictions? If yes: unlimited in-place scale-down is likely to work but might end up in an eviction later, as it might not be reversible (e.g. more Pods getting scheduled on a Node, or others growing). Would that be a good reason to limit in-place scale-down as well?
> So, when in-place resource updates make it to the VPA, I don't think I would want any restrictions being applied to disruption-free resource updates and still want to have restrictions for eviction-based resource updates – be it because VPA is falling back to eviction or because my workload is configured with an update mode which doesn't allow for in-place updates.
>
> @voelzmo, I'm not sure I fully understood here: are you saying you'd always want in-place updates to happen and only care about evictions? If yes: unlimited in-place scale-down is likely to work but might end up in an eviction later, as it might not be reversible (e.g. more Pods getting scheduled on a Node, or others growing). Would that be a good reason to limit in-place scale-down as well?

I don't think eviction will go away completely in the foreseeable future. I don't think we'll have a guarantee that an in-place update will succeed, so we'll want eviction as a fallback. This is more about the case Karol described: a user might want to avoid scale-downs (because there is no guarantee that they'll be able to scale up again without disruption) but allow scale-ups (preferring in-place but with eviction as fallback).
Based on #5754 (comment), it looks like this wasn't asking for API review.
/remove-label api-review
@liggitt: Those labels are not set on the issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
> … of type `[]*ActuationRequirement`. A single `ActuationRequirement` defines a condition which must be `true` to allow actuation for the corresponding `Container`. When multiple `ActuationRequirements` are specified for a `Pod`, all of them must evaluate to `true` to allow eviction. VPA may partially actuate its recommendation in place, obeying all the `ActuationRequirements` while not applying parts of the recommendation that would violate any `ActuationRequirement`.
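For illustration, here is a minimal Go sketch of how the proposed types could be declared. The type and field names follow the proposal text above, but the exact API shape (package, JSON tags, where the field hangs off the update policy) is an assumption, not the final API:

```go
// Illustrative sketch only; not the final VPA API.
package vpatypes

import corev1 "k8s.io/api/core/v1"

// ChangeRequirement compares the recommendation (Target) with the existing
// requests on a Pod (Requests).
type ChangeRequirement string

const (
	TargetHigherThanRequests          ChangeRequirement = "TargetHigherThanRequests"
	TargetHigherThanOrEqualToRequests ChangeRequirement = "TargetHigherThanOrEqualToRequests"
	TargetLowerThanRequests           ChangeRequirement = "TargetLowerThanRequests"
	TargetLowerThanOrEqualToRequests  ChangeRequirement = "TargetLowerThanOrEqualToRequests"
)

// ActuationRequirement is a single condition which must be true to allow
// actuation for the corresponding Container.
type ActuationRequirement struct {
	// Resources the condition applies to: [CPU], [Memory], or [CPU, Memory].
	Resources []corev1.ResourceName `json:"resources"`
	// ChangeRequirement is the comparison that must hold between Target and Requests.
	ChangeRequirement ChangeRequirement `json:"changeRequirement"`
}

// PodUpdatePolicy would carry the new field; when multiple requirements are
// specified for a Pod, all of them must evaluate to true to allow eviction.
type PodUpdatePolicy struct {
	ActuationRequirements []*ActuationRequirement `json:"actuationRequirements,omitempty"`
}
```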
If actuation still happens (using in-place), I think naming the field `EvictionRequirements` or `RestartRequirements` would be more transparent.
If we later want to add a similar mechanism to prevent even in-place changes in certain scenarios, that would still be possible.
I see you partly mention that below, and I agree with it.
For easier understanding I'd still suggest using a different name, though.
> ### Goals
> * For a specific VPA object, allow turning off actuation for downscaling entirely while still allowing for upscaling
> * Allow for resource-specific decisions: the desired policy may be different for CPU and Memory
This might be open to resources other than CPU and Memory in the future, so let's pay attention to having an API open to handling other types of resources. I am thinking here of network bandwidth reservation, for example.
I think the proposal is easy to extend to other resources. Do you expect any problems?
> … (`Target`) with the existing requests on a Pod (`Requests`). Possible values for `Resources` are `[CPU]` and `[Memory]`, or both `[CPU,Memory]`. If `Resources: [CPU, Memory]`, the condition must be true for either of the two resources to allow for actuation. Possible values for `ChangeRequirement` are `TargetHigherThanRequests`, `TargetHigherThanOrEqualToRequests`, `TargetLowerThanRequests` and `TargetLowerThanOrEqualToRequests`.
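To make the "either of the two resources" rule concrete, here is a hedged Go sketch of how a single requirement could be evaluated against the types sketched earlier. The helper name, signature, and map-based lookups are assumptions for illustration only:

```go
// Hypothetical helper showing the OR semantics: with Resources: [CPU, Memory],
// the requirement is satisfied as soon as the comparison holds for either
// resource. Names and signature are illustrative, not the VPA implementation.
package vpatypes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func requirementSatisfied(req ActuationRequirement,
	target, requests map[corev1.ResourceName]resource.Quantity) bool {
	for _, res := range req.Resources {
		t, r := target[res], requests[res]
		var ok bool
		switch req.ChangeRequirement {
		case TargetHigherThanRequests:
			ok = t.Cmp(r) > 0
		case TargetHigherThanOrEqualToRequests:
			ok = t.Cmp(r) >= 0
		case TargetLowerThanRequests:
			ok = t.Cmp(r) < 0
		case TargetLowerThanOrEqualToRequests:
			ok = t.Cmp(r) <= 0
		}
		if ok {
			return true // one resource satisfying the condition is enough
		}
	}
	return false
}
```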
Why do we need the variations with `OrEqual`? Couldn't it be `OrEqual` by default? Does a 0 vs. 0.0001 diff make a difference in that use case?
Given that VPA has other mechanisms in place to ensure there is a large enough diff between the new recommendation and the current requests, I guess it should be fine to remove the `OrEqual` variations entirely. WDYT, @jbartosik?
Yes, let's do that
Sorry for dropping the ball on this one, I completely got sidetracked by other things.
TBH, I'm currently not super concerned about that scenario. If you're running in a setting where Nodes are 99% allocated, even scale-ups without previous scale-downs would need to result in evictions. If that's the case, and you need to avoid disruptions at all cost, it seems like VPA is probably not what you want to use at all?
I ACK the nomination to do API review. I am a little out of touch with the VPA APIs, so I will have to come up to speed. Maybe we can have a Slack or Zoom discussion after code freeze?
@thockin I think we tried to remove the request for api-review once or twice, but this didn't work as expected. I think this should be fine without a review from your side. Sorry for the confusion ;( We are changing parts of the VPA API by adding an option for the user to control whether a Pod should be evicted based on whether the VPA is scaling up or down, but I think we're OK to discuss these changes inside SIG Autoscaling? Don't want to waste your time here :) @jbartosik: please correct me if I misunderstood the intent and you wanted @thockin to review this from the API side.
/remove-kind api-change
I think it makes sense to have controls for in-place scaling. An in-place resize can still be a disruption (in-place updates might cause a container restart). So one might want to allow only scale-ups (to avoid the even bigger disruption of containers starting to crash).
I think that we will need to allow control over the direction of in-place resizes. From what @kwiesmueller and @voelzmo write, it sounds like you'd prefer separate controls for in-place resizes and eviction. I preferred a single control for both, because having fewer config options makes things easier to use (it's harder to create bad configs, and there's less to understand). So I'd …
@kwiesmueller @voelzmo please:
Sounds good to me, I don't think I'll get to making any changes this week :-/
I'm happy with closing this PR and having … I can take over point 3.
And remove the …
Thanks for moving this forward, Joachim!
Thank you @kwiesmueller and @voelzmo. I opened #5990 and #5991 and assigned them to you to make it easier for us not to forget about this.
What type of PR is this?
/kind documentation
/kind feature
/kind api-change
What this PR does / why we need it:
Adjust Autoscaling Enhancement Proposal 4831 to work better with in-place update support. See #5755 for details of in-place support.
I divided the PR into a few changes to make the review easier (separating the main content change, formatting changes, the file rename, and minor edits).
@voelzmo @kgolab @wangchen615 please take a look
#4831