KEP-5471 Extended Toleration Operators for Threshold-Based Placement #5473
Conversation
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary
Can we include the scenario of adding tolerations based on semantic version comparison? It will likely require either a new operator or some way to express that a taint's string needs to be parsed as a semver.
The current scope for this KEP is integer-only support for SLA/failure probability because it's the simplest and safest to implement (it avoids floating-point parsing and complex type semantics).
I agree that semver comparisons are a valid and important future use case (e.g., kubelet version taints, device firmware versions). To keep this KEP narrowly scoped and implementable, I propose documenting semantic version comparison in a new Future Work section, and also considering whether we would add this to node affinity as well. wdyt?
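To illustrate what integer-only means in practice, here's a minimal sketch (`parseThreshold` is a hypothetical helper name, not code from this KEP):

```go
package main

import (
	"fmt"
	"strconv"
)

// parseThreshold parses a taint/toleration value strictly as a base-10
// integer; floats such as "95.5" and semver strings fail, which keeps the
// comparison semantics unambiguous.
func parseThreshold(s string) (int64, bool) {
	v, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	for _, s := range []string{"950", "95.5", "v1.20.0"} {
		v, ok := parseThreshold(s)
		fmt.Printf("%-8s -> value=%d ok=%v\n", s, v, ok)
	}
}
```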
One consideration here is the size of the delta. It is a lot of red tape to run through scenarios and API reviews; making both changes together may be much easier than splitting them into separate KEPs.
I am not objecting to this scope, though. This is not blocking the proposal, but I believe it would be very useful.
BTW, we may also want to support semver in CEL as well =). So potentially another scope increase.
Updated the scope to include both semver and CEL.
@sanposhiho fyi for semver addition.
fyi, after discussing it with @sanposhiho and @macsko, we will have a separate KEP covering how semver comparison can be implemented for node affinity, taints/tolerations, and CEL, along with an evaluation of the use case for each.
Created an issue to track this feature: #5500
Signed-off-by: Heba Elayoty <[email protected]>
Force-pushed from 2a36559 to c9e75ba
Signed-off-by: Heba Elayoty <[email protected]>
- Enhance CEL semver library with additional comparison functions for consistent version handling.
- Keep behavior consistent with existing effects (`NoSchedule`, `PreferNoSchedule`, `NoExecute`).
- Provide unified semantic version handling across scheduling and admission control.
- Backward compatible and opt-in via feature gates.
Is it fair to add a goal ensuring that the addition of new operators has zero operational effect on existing pod scheduling outcomes that use the `Equal` or `Exists` operators? I'm thinking of the various `switch toleration.Operator` bits in the source, and assuming that adding more operators should have zero cost for existing scenarios.
Yes, totally fair. Added a goal in addition to performance tests.
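For illustration, a hedged sketch of how new operator cases could be appended without touching the existing paths; the `Gt`/`Lt` arms and the gate check are assumptions, while the `Equal`/`Exists` arms mirror today's semantics:

```go
import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// tolerationMatches sketches how new operator cases could be added to the
// existing switch without changing the Equal/Exists code paths. "Gt"/"Lt"
// are the hypothetical new operators; gateEnabled stands in for a
// feature-gate check.
func tolerationMatches(op corev1.TolerationOperator, tolVal, taintVal string, gateEnabled bool) bool {
	switch op {
	case corev1.TolerationOpExists:
		return true // unchanged
	case corev1.TolerationOpEqual, "":
		return tolVal == taintVal // unchanged
	case "Gt", "Lt":
		if !gateEnabled {
			return false
		}
		t, tErr := strconv.ParseInt(tolVal, 10, 64)
		n, nErr := strconv.ParseInt(taintVal, 10, 64)
		if tErr != nil || nErr != nil {
			return false
		}
		if op == "Gt" {
			return n > t // node's taint value must exceed the threshold
		}
		return n < t
	default:
		return false
	}
}
```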
As a cluster operator, I want a default repel from spot (low-SLA) nodes so that only workloads that explicitly tolerate them can land there.

I also want to set numeric SLA thresholds in tolerations (e.g., `Gt 950`) so pods can opt in to reliable nodes or specific SLA bands without having to hardcode every SLA class in NodeAffinity rules.
Something to consider: for every scenario that is (strictly speaking) possible using existing semantics (e.g., NodeAffinity), it might be helpful to include that scenario here, a before/after type picture. Not wanting to add too much information to the KEP, but that would help folks assess the ergonomic improvements.
Deferred to the separate KEP we will create for semver/CEL.
```yaml
- key: node.kubernetes.io/sla
  operator: Gt
  value: "750"
  effect: NoSchedule
```
Maybe we can also include an example pod spec that will explicitly not get scheduled onto a spot node, e.g.,

```yaml
---
# Critical workload will not be scheduled until a suitable high-reliability node has capacity
apiVersion: v1
kind: Pod
metadata:
  name: critical-workload
spec:
  tolerations:
  - key: node.kubernetes.io/sla
    operator: Gt
    value: "950"
    effect: NoSchedule
```
Added the example.
value: "24" | ||
effect: NoSchedule | ||
--- | ||
# Batch training workload tolerates degraded devices |
Nit: Let's call this a "Short-lived batch training workload" to distinguish from other (very normal) batch workloads that would not want to be interrupted in an hour.
Addressed.
#### Story 6 — Kubernetes version compatibility for critical workloads

As a cluster operator managing a mixed-version Kubernetes cluster during rolling upgrades, I want to ensure critical workloads only run on nodes with Kubernetes version >= 1.20.0 due to specific API features they require, while allowing development workloads to tolerate older versions.
Nit: to make this more current, let's add 1.33/1.34 as our two example versions here
Deferred to the future KEP for semver/CEL.
```yaml
spec:
  tolerations:
  - key: node.kubernetes.io/version
    operator: SemverGt
```
SemverGe for this and below in this example?
Deferred to the future KEP for semver/CEL.
- Values must conform to the [Semantic Versioning 2.0.0](https://semver.org/) specification, requiring exactly 3 components (major.minor.patch)
- Supports both prefixed (`"v1.20.1"`) and non-prefixed (`"1.20.1"`) formats, following Kubernetes conventions
- Invalid semver strings (e.g., `"1.20"`, `"1.20.1.1"`, `"1"`) cause validation errors
- Pre-release versions (e.g., `1.20.0-alpha.1`) have lower precedence than release versions (`1.20.0`)
Nit: the pre-release and build metadata points are part of semver, so we could drop them. Or if being super explicit is helpful, let's add the "per semver specification" suffix to the Pre-release versions line item as well.
Deferred to the future KEP for semver/CEL.
- Pre-release versions (e.g., `1.20.0-alpha.1`) have lower precedence than release versions (`1.20.0`)
- Build metadata (`+build.123`) is ignored in comparisons per the semver specification

- **Type Mismatch Handling**: If toleration and taint values cannot be parsed as the same type (integer vs semver), the toleration does not match. This prevents unexpected behavior and ensures type safety.
This implies I could have two nodes on my cluster like this:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-1
spec:
  taints:
  - key: node.kubernetes.io/foo
    value: "100"
    effect: NoExecute
---
apiVersion: v1
kind: Node
metadata:
  name: node-2
spec:
  taints:
  - key: node.kubernetes.io/foo
    value: "v1.34.0"
    effect: NoExecute
```

And that the scheduler would simply schedule workloads with proper tolerations using either of the above semantics (int comparison or semver comparison) to the right node.
Also, the apiserver has no preference between either of the below:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/sla
        operator: Gt
        value: "950"
        effect: NoExecute
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/sla
        operator: Gt
        value: "v1.33.0"
        effect: NoExecute
```

During scheduling, if we encounter a matching node taint whose value is of a different type than the workload toleration (int vs semver), then we simply continue and rule out that node as a candidate for scheduling.
Just thinking out loud here, did I get that right?
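A minimal sketch of that reading, just to be concrete (the operator names and the use of `golang.org/x/mod/semver` are assumptions, especially since the semver half is being deferred):

```go
import (
	"strconv"
	"strings"

	"golang.org/x/mod/semver"
)

// sameTypeMatch sketches the "same type or no match" rule: a numeric
// operator matches only when both values parse as integers, a semver
// operator only when both parse as semver; mixed types rule the node out.
func sameTypeMatch(op, tolVal, taintVal string) bool {
	switch op {
	case "Gt": // hypothetical numeric operator
		t, tErr := strconv.ParseInt(tolVal, 10, 64)
		n, nErr := strconv.ParseInt(taintVal, 10, 64)
		if tErr != nil || nErr != nil {
			return false // e.g. int toleration vs "v1.34.0" taint
		}
		return n > t
	case "SemverGt": // hypothetical semver operator
		tv, nv := ensureV(tolVal), ensureV(taintVal)
		if !isStrictSemver(tv) || !isStrictSemver(nv) {
			return false // e.g. semver toleration vs "100" taint
		}
		return semver.Compare(nv, tv) > 0
	}
	return false
}

// ensureV adds the "v" prefix golang.org/x/mod/semver requires.
func ensureV(s string) string {
	if !strings.HasPrefix(s, "v") {
		return "v" + s
	}
	return s
}

// isStrictSemver approximates the KEP's "exactly major.minor.patch" rule;
// x/mod/semver alone also accepts "v1" and "v1.2", so we compare against
// the canonical form (note: this also rejects build metadata).
func isStrictSemver(v string) bool {
	return semver.IsValid(v) && semver.Canonical(v) == v
}
```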
Deferred to the future KEP for semver/CEL.
Force-pushed from 5df1161 to cd9d4a3
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: helayoty. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Signed-off-by: Heba Elayoty <[email protected]>
Force-pushed from e5b0b7b to 4e1a1cb
```go
nVal, nErr := strconv.ParseInt(taintVal, 10, 64)
if nErr != nil {
	return false // Invalid taint value
}
```
Isn't returning just false here too simple? How would the user know that the node was rejected because of a non-numeric taint?
We shouldn't add logging or return errors here. I'd keep the simple `return false` for performance reasons (this code path is hot and will be called for every pod-node combination during scheduling) and to be consistent with the current `ToleratesTaint` method (which returns false silently).
Invalid numeric values should be caught by API validation when pods/nodes are created, not during runtime scheduling. Users will still get feedback through API validation errors and scheduler events.
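A rough sketch of what that admission-time check could look like (`validateNumericToleration` and the `Gt`/`Lt` names are hypothetical, and this only covers the toleration side):

```go
import (
	"fmt"
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// validateNumericToleration rejects a pod toleration whose value is not an
// integer when a numeric operator is used. The taint side cannot be
// validated the same way, since e.g. "95.5" is a legal value for Equal.
func validateNumericToleration(t corev1.Toleration) error {
	switch t.Operator {
	case "Gt", "Lt": // hypothetical new operators
		if _, err := strconv.ParseInt(t.Value, 10, 64); err != nil {
			return fmt.Errorf("toleration value %q must be an integer when operator is %s", t.Value, t.Operator)
		}
	}
	return nil
}
```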
But we can't catch the invalid numeric values for the taints earlier, because `node.kubernetes.io/sla=95.5` is a correct taint: it can be matched by the `Equal` operator. If the parsing during scheduling fails, we have 2 options:
- The node taint is misconfigured, because it uses a non-numeric taint value when it is expected to use a numeric one (e.g., the value 95.5 instead of 955).
- The pod toleration is misconfigured, because it wants to use a numeric operator on a non-numeric taint.

I think for both cases we would like to inform the user what's wrong, not only that the toleration is not matching for any node.
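One hedged option that keeps the hot path cheap while still surfacing the reason (the message and plumbing below are assumptions, not KEP text): let the TaintToleration Filter plugin return a descriptive unschedulable status when the non-match comes from a parse failure.

```go
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// mismatchStatus sketches a descriptive Filter result for the case where a
// numeric toleration met a non-numeric taint (or vice versa), so the reason
// shows up in the pod's scheduling events instead of a silent non-match.
func mismatchStatus(taint corev1.Taint) *framework.Status {
	return framework.NewStatus(framework.UnschedulableAndUnresolvable,
		fmt.Sprintf("node taint %s=%s could not be parsed as the type required by the pod's toleration operator",
			taint.Key, taint.Value))
}
```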
- `scheduler_numeric_tolerations_total`: To measure the number of pods scheduled using numeric toleration operators.
- `scheduler_numeric_taint_mismatches_total`: To measure the scheduling failures due to numeric taint/toleration mismatches.
How would you implement those metrics? A pod could fail due to multiple reasons, depending on the node.
Thanks for pointing out this issue. After some thought, the `scheduler_numeric_taint_mismatches_total` metric is problematic because pods can fail scheduling due to multiple plugins simultaneously. Also, the same pod might be rejected by different nodes for different reasons. Metrics updated.
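For reference, a minimal sketch of how the surviving counter might be declared with `k8s.io/component-base/metrics` (the exact name, labels, and registration site are assumptions):

```go
import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// numericTolerations counts pods scheduled via the new numeric operators;
// it is incremented once per successfully scheduled pod, which avoids the
// per-node ambiguity discussed above.
var numericTolerations = metrics.NewCounter(&metrics.CounterOpts{
	Subsystem:      "scheduler",
	Name:           "numeric_tolerations_total",
	Help:           "Number of pods scheduled using numeric toleration operators.",
	StabilityLevel: metrics.ALPHA,
})

func init() {
	legacyregistry.MustRegister(numericTolerations)
}
```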
### Semantics

- To honor the Kubernetes API convention of avoiding floating-point numbers where possible due to precision and parsing issues, the new toleration operators will interpret values as integers (i.e., 950 = 95.0%, 999 = 99.9%, 800 = 80.0%).
This would need to be well documented in the API to make sure users won't misunderstand it.
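As one concrete illustration of the encoding (a hypothetical helper a provisioning tool might use when writing the taint):

```go
import (
	"math"
	"strconv"
)

// slaTaintValue encodes an SLA percentage as the integer taint value the
// KEP proposes: one implied decimal place, so 95.0% -> "950", 99.9% -> "999".
func slaTaintValue(slaPercent float64) string {
	return strconv.FormatInt(int64(math.Round(slaPercent*10)), 10)
}
```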
Signed-off-by: Heba Elayoty <[email protected]>
Force-pushed from a65244f to 2bdd819