KEP-5471 Extended Toleration Operators for Threshold-Based Placement #5473
Conversation
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary
Can we include the scenario of adding tolerations based on semantic version comparison? It will likely require either a new operator or some way to express that a taint's string needs to be parsed as a semver.
The current scope for this KEP is integer-only support for SLA/failure probability because it's the simplest and safest to implement (it avoids floating-point parsing and complex type semantics).
I agree that semver comparisons are a valid and important future use case (e.g., kubelet version taints, device firmware versions). To keep this KEP narrowly scoped and implementable, I propose documenting semantic version comparison in a new Future Work section, and also considering whether we would add this to node affinity as well. wdyt?
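To illustrate what integer-only means in practice, here's a minimal sketch (`parseThreshold` is a hypothetical helper name, not code from this KEP):

```go
package main

import (
	"fmt"
	"strconv"
)

// parseThreshold parses a taint/toleration value strictly as a base-10
// integer; floats such as "95.5" and semver strings fail, which keeps the
// comparison semantics unambiguous.
func parseThreshold(s string) (int64, bool) {
	v, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	for _, s := range []string{"950", "95.5", "v1.20.0"} {
		v, ok := parseThreshold(s)
		fmt.Printf("%-8s -> value=%d ok=%v\n", s, v, ok)
	}
}
```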
One consideration here is the size of the delta. It is a lot of red tape to run through scenarios and API reviews; making both changes together may be much easier than splitting them into separate KEPs.
I am not objecting to this scope, though. This is not blocking the proposal, but I believe it would be very useful.
BTW, we may also want to support semver in CEL as well =). So potentially another scope increase.
Updated the scope to include both semver and CEL.
@sanposhiho fyi for semver addition.
fyi, after discussing it with @sanposhiho and @macsko, we will have a separate KEP covering how semver comparison can be implemented for node affinity, taints/tolerations, and CEL, along with an evaluation of the use case for each.
Created an issue to track this feature: #5500
Signed-off-by: Heba Elayoty <[email protected]>
Force-pushed from 2a36559 to c9e75ba
Signed-off-by: Heba Elayoty <[email protected]>
- Enhance CEL semver library with additional comparison functions for consistent version handling.
- Keep behavior consistent with existing effects (`NoSchedule`, `PreferNoSchedule`, `NoExecute`).
- Provide unified semantic version handling across scheduling and admission control.
- Backward compatible and opt-in via feature gates.
Is it fair to add a goal ensuring that the addition of new operators has zero operational effect on existing pod scheduling outcomes that use the `Equal` or `Exists` operators? I'm thinking of the various `switch toleration.Operator` bits in the source, and assuming that adding more operators should have zero cost for existing scenarios.
Yes, totally fair. Added a goal in addition to performance tests.
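For illustration, a hedged sketch of how new operator cases could be appended without touching the existing paths; the `Gt`/`Lt` arms and the gate check are assumptions, while the `Equal`/`Exists` arms mirror today's semantics:

```go
import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// tolerationMatches sketches how new operator cases could be added to the
// existing switch without changing the Equal/Exists code paths. "Gt"/"Lt"
// are the hypothetical new operators; gateEnabled stands in for a
// feature-gate check.
func tolerationMatches(op corev1.TolerationOperator, tolVal, taintVal string, gateEnabled bool) bool {
	switch op {
	case corev1.TolerationOpExists:
		return true // unchanged
	case corev1.TolerationOpEqual, "":
		return tolVal == taintVal // unchanged
	case "Gt", "Lt":
		if !gateEnabled {
			return false
		}
		t, tErr := strconv.ParseInt(tolVal, 10, 64)
		n, nErr := strconv.ParseInt(taintVal, 10, 64)
		if tErr != nil || nErr != nil {
			return false
		}
		if op == "Gt" {
			return n > t // node's taint value must exceed the threshold
		}
		return n < t
	default:
		return false
	}
}
```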
As a cluster operator, I want a default repel from spot (low-SLA) nodes so that only workloads that explicitly tolerate them can land there.

I also want to set numeric SLA thresholds in tolerations (e.g., `Gt 950`) so pods can opt in to reliable nodes or specific SLA bands without having to hardcode every SLA class in NodeAffinity rules.
Something to consider: for every scenario that is (strictly speaking) possible using existing semantics (e.g., NodeAffinity), it might be helpful to include that scenario here, a before/after type picture. Not wanting to add too much information to the KEP, but that would help folks assess the ergonomic improvements.
Deferred to the separate KEP we will create for semver/CEL.
```yaml
- key: node.kubernetes.io/sla
  operator: Gt
  value: "750"
  effect: NoSchedule
```
Maybe we can also include an example pod spec that will explicitly not get scheduled onto a spot node, e.g.,

```yaml
---
# Critical workload will not be scheduled until a suitable high-reliability node has capacity
apiVersion: v1
kind: Pod
metadata:
  name: critical-workload
spec:
  tolerations:
  - key: node.kubernetes.io/sla
    operator: Gt
    value: "950"
    effect: NoSchedule
```
Added the example.
value: "24" | ||
effect: NoSchedule | ||
--- | ||
# Batch training workload tolerates degraded devices |
Nit: Let's call this a "Short-lived batch training workload" to distinguish from other (very normal) batch workloads that would not want to be interrupted in an hour.
Addressed.
#### Story 6 — Kubernetes version compatibility for critical workloads

As a cluster operator managing a mixed-version Kubernetes cluster during rolling upgrades, I want to ensure critical workloads only run on nodes with Kubernetes version >= 1.20.0 due to specific API features they require, while allowing development workloads to tolerate older versions.
Nit: to make this more current, let's add 1.33/1.34 as our two example versions here
Deferred to the future KEP for semver/CEL.
```yaml
spec:
  tolerations:
  - key: node.kubernetes.io/version
    operator: SemverGt
```
SemverGe for this and below in this example?
Deferred to the future KEP for semver/CEL.
- Values must conform to the [Semantic Versioning 2.0.0](https://semver.org/) specification, requiring exactly 3 components (major.minor.patch)
- Supports both prefixed (`"v1.20.1"`) and non-prefixed (`"1.20.1"`) formats, following Kubernetes conventions
- Invalid semver strings (e.g., `"1.20"`, `"1.20.1.1"`, `"1"`) cause validation errors
- Pre-release versions (e.g., `1.20.0-alpha.1`) have lower precedence than release versions (`1.20.0`)
Nit: the pre-release and build metadata points are part of semver, so we could drop them. Or if being super explicit is helpful, let's add the "per semver specification" suffix to the Pre-release versions line item as well.
Deferred to the future KEP for semver/CEL.
- Pre-release versions (e.g., `1.20.0-alpha.1`) have lower precedence than release versions (`1.20.0`)
- Build metadata (`+build.123`) is ignored in comparisons per the semver specification

- **Type Mismatch Handling**: If toleration and taint values cannot be parsed as the same type (integer vs semver), the toleration does not match. This prevents unexpected behavior and ensures type safety.
This implies I could have two nodes on my cluster like this:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-1
spec:
  taints:
  - key: node.kubernetes.io/foo
    value: "100"
    effect: NoExecute
---
apiVersion: v1
kind: Node
metadata:
  name: node-2
spec:
  taints:
  - key: node.kubernetes.io/foo
    value: "v1.34.0"
    effect: NoExecute
```

And that the scheduler would simply schedule workloads with proper tolerations using either of the above semantics (int comparison or semver comparison) to the right node.
Also, the apiserver has no preference between either of the below:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/sla
        operator: Gt
        value: "950"
        effect: NoExecute
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/sla
        operator: Gt
        value: "v1.33.0"
        effect: NoExecute
```

During scheduling, if we encounter a matching node taint whose value is of a different type than the workload toleration (int vs semver), then we simply continue and rule out that node as a candidate for scheduling.
Just thinking out loud here, did I get that right?
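A minimal sketch of that reading, just to be concrete (the operator names and the use of `golang.org/x/mod/semver` are assumptions, especially since the semver half is being deferred):

```go
import (
	"strconv"
	"strings"

	"golang.org/x/mod/semver"
)

// sameTypeMatch sketches the "same type or no match" rule: a numeric
// operator matches only when both values parse as integers, a semver
// operator only when both parse as semver; mixed types rule the node out.
func sameTypeMatch(op, tolVal, taintVal string) bool {
	switch op {
	case "Gt": // hypothetical numeric operator
		t, tErr := strconv.ParseInt(tolVal, 10, 64)
		n, nErr := strconv.ParseInt(taintVal, 10, 64)
		if tErr != nil || nErr != nil {
			return false // e.g. int toleration vs "v1.34.0" taint
		}
		return n > t
	case "SemverGt": // hypothetical semver operator
		tv, nv := ensureV(tolVal), ensureV(taintVal)
		if !isStrictSemver(tv) || !isStrictSemver(nv) {
			return false // e.g. semver toleration vs "100" taint
		}
		return semver.Compare(nv, tv) > 0
	}
	return false
}

// ensureV adds the "v" prefix golang.org/x/mod/semver requires.
func ensureV(s string) string {
	if !strings.HasPrefix(s, "v") {
		return "v" + s
	}
	return s
}

// isStrictSemver approximates the KEP's "exactly major.minor.patch" rule;
// x/mod/semver alone also accepts "v1" and "v1.2", so we compare against
// the canonical form (note: this also rejects build metadata).
func isStrictSemver(v string) bool {
	return semver.IsValid(v) && semver.Canonical(v) == v
}
```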
Deferred to the future KEP for semver/CEL.
Force-pushed from 5df1161 to cd9d4a3
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: helayoty. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Signed-off-by: Heba Elayoty <[email protected]>
Force-pushed from e5b0b7b to 4e1a1cb
```go
nVal, nErr := strconv.ParseInt(taintVal, 10, 64)
if nErr != nil {
	return false // Invalid taint value
}
```
Isn't returning just false here too simple? How would the user know that the node was rejected because of a non-numeric taint?
We shouldn't add logging or return errors here. I'd keep the simple `return false` for performance reasons (this code path is hot and will be called for every pod-node combination during scheduling) and to be consistent with the current `ToleratesTaint` method (which returns false silently).
Invalid numeric values should be caught by API validation when pods/nodes are created, not during runtime scheduling. Users will still get feedback through API validation errors and scheduler events.
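A rough sketch of what that admission-time check could look like (`validateNumericToleration` and the `Gt`/`Lt` names are hypothetical, and this only covers the toleration side):

```go
import (
	"fmt"
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// validateNumericToleration rejects a pod toleration whose value is not an
// integer when a numeric operator is used. The taint side cannot be
// validated the same way, since e.g. "95.5" is a legal value for Equal.
func validateNumericToleration(t corev1.Toleration) error {
	switch t.Operator {
	case "Gt", "Lt": // hypothetical new operators
		if _, err := strconv.ParseInt(t.Value, 10, 64); err != nil {
			return fmt.Errorf("toleration value %q must be an integer when operator is %s", t.Value, t.Operator)
		}
	}
	return nil
}
```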
But we can't catch the invalid numeric values for the taints earlier, because `node.kubernetes.io/sla=95.5` is a correct taint: it can be matched by the `Equal` operator. If the parsing during scheduling fails, we have 2 options:
- The node taint is misconfigured, because it uses a non-numeric taint value when it is expected to use a numeric one (e.g., the value 95.5 instead of 955).
- The pod toleration is misconfigured, because it wants to use a numeric operator on a non-numeric taint.

I think for both cases we would like to inform the user what's wrong, not only that the toleration is not matching for any node.
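One hedged option that keeps the hot path cheap while still surfacing the reason (the message and plumbing below are assumptions, not KEP text): let the TaintToleration Filter plugin return a descriptive unschedulable status when the non-match comes from a parse failure.

```go
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// mismatchStatus sketches a descriptive Filter result for the case where a
// numeric toleration met a non-numeric taint (or vice versa), so the reason
// shows up in the pod's scheduling events instead of a silent non-match.
func mismatchStatus(taint corev1.Taint) *framework.Status {
	return framework.NewStatus(framework.UnschedulableAndUnresolvable,
		fmt.Sprintf("node taint %s=%s could not be parsed as the type required by the pod's toleration operator",
			taint.Key, taint.Value))
}
```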
- `scheduler_numeric_tolerations_total`: To measure the number of pods scheduled using numeric toleration operators.
- `scheduler_numeric_taint_mismatches_total`: To measure the scheduling failures due to numeric taint/toleration mismatches.
How would you implement those metrics? A pod could fail due to multiple reasons, depending on the node.
Thanks for pointing out this issue. After some thought, the `scheduler_numeric_taint_mismatches_total` metric is problematic because pods can fail scheduling due to multiple plugins simultaneously. Also, the same pod might be rejected by different nodes for different reasons. Metrics updated.
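For reference, a minimal sketch of how the surviving counter might be declared with `k8s.io/component-base/metrics` (the exact name, labels, and registration site are assumptions):

```go
import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// numericTolerations counts pods scheduled via the new numeric operators;
// it is incremented once per successfully scheduled pod, which avoids the
// per-node ambiguity discussed above.
var numericTolerations = metrics.NewCounter(&metrics.CounterOpts{
	Subsystem:      "scheduler",
	Name:           "numeric_tolerations_total",
	Help:           "Number of pods scheduled using numeric toleration operators.",
	StabilityLevel: metrics.ALPHA,
})

func init() {
	legacyregistry.MustRegister(numericTolerations)
}
```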
### Semantics

- To honor the Kubernetes API convention of avoiding floating-point numbers where possible due to precision and parsing issues, the new toleration operators will interpret values as integers (i.e., 950 = 95.0%, 999 = 99.9%, 800 = 80.0%).
This would need to be well documented in the API to make sure users won't misunderstand it.
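As one concrete illustration of the encoding (a hypothetical helper a provisioning tool might use when writing the taint):

```go
import (
	"math"
	"strconv"
)

// slaTaintValue encodes an SLA percentage as the integer taint value the
// KEP proposes: one implied decimal place, so 95.0% -> "950", 99.9% -> "999".
func slaTaintValue(slaPercent float64) string {
	return strconv.FormatInt(int64(math.Round(slaPercent*10)), 10)
}
```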
Signed-off-by: Heba Elayoty <[email protected]>
Force-pushed from a65244f to 2bdd819