LendingLimit bias when removing nodes from the cluster #252

tardieu · 2024-10-08T20:38:37Z

If a node that was previously flagged as unhealthy is removed from the cluster, we keep subtracting that node's resources from the lending limit(s) on the slack quota queue forever.

We need to account for deleted nodes and properly prune the cached node information in the node health monitor.

For now, we can work around the issue by restarting the controller after removing an unhealthy node.

1. Split node monitoring into two reconcilers: one to monitor Nodes and one to monitor and update the designated slack ClusterQueue.
2. Remove entries from the in-memory caches when a Node is deleted.
3. Watch the slack ClusterQueue to be able to react to changes in nominalQuotas and adjust lendingLimits accordingly.

Fixes project-codeflare#252.
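To make the bias and the fix concrete, here is a minimal Go sketch of the pruning idea. It is not the controller's actual code: `Resources`, `NodeHealthCache`, `NodeDeleted`, and `LendingLimits` are hypothetical names, plain integers stand in for Kubernetes resource quantities, and it assumes the lending limit is the nominal quota minus the resources of unhealthy nodes, clamped at zero.

```go
package main

import "fmt"

// Resources maps a resource name to a quantity. Plain integers stand in
// for Kubernetes resource quantities to keep the sketch dependency-free.
type Resources map[string]int64

// NodeHealthCache is a hypothetical in-memory cache of the resources held
// by nodes currently flagged as unhealthy, keyed by node name.
type NodeHealthCache struct {
	unhealthy map[string]Resources
}

// NewNodeHealthCache returns an empty cache.
func NewNodeHealthCache() *NodeHealthCache {
	return &NodeHealthCache{unhealthy: make(map[string]Resources)}
}

// MarkUnhealthy records (or overwrites) the resources of an unhealthy node.
func (c *NodeHealthCache) MarkUnhealthy(node string, r Resources) {
	c.unhealthy[node] = r
}

// MarkHealthy drops a node that has recovered.
func (c *NodeHealthCache) MarkHealthy(node string) {
	delete(c.unhealthy, node)
}

// NodeDeleted prunes a node that no longer exists. Without this step a
// deleted unhealthy node would stay in the cache and keep lowering the
// lending limit.
func (c *NodeHealthCache) NodeDeleted(node string) {
	delete(c.unhealthy, node)
}

// UnhealthyTotal sums the cached resources across all unhealthy nodes.
func (c *NodeHealthCache) UnhealthyTotal() Resources {
	total := Resources{}
	for _, r := range c.unhealthy {
		for name, qty := range r {
			total[name] += qty
		}
	}
	return total
}

// LendingLimits returns nominal minus unhealthy for each resource in
// nominal, clamped at zero (an assumed formula for this sketch).
func LendingLimits(nominal, unhealthy Resources) Resources {
	limits := Resources{}
	for name, quota := range nominal {
		limit := quota - unhealthy[name]
		if limit < 0 {
			limit = 0
		}
		limits[name] = limit
	}
	return limits
}

func main() {
	cache := NewNodeHealthCache()
	nominal := Resources{"nvidia.com/gpu": 16}

	// node-a (8 GPUs) is flagged as unhealthy.
	cache.MarkUnhealthy("node-a", Resources{"nvidia.com/gpu": 8})
	fmt.Println("node-a unhealthy:", LendingLimits(nominal, cache.UnhealthyTotal())["nvidia.com/gpu"])

	// node-a is removed and the nominal quota is lowered to match.
	cache.NodeDeleted("node-a")
	nominal = Resources{"nvidia.com/gpu": 8}
	fmt.Println("node-a deleted:", LendingLimits(nominal, cache.UnhealthyTotal())["nvidia.com/gpu"])
}
```

In this sketch, skipping the `NodeDeleted` call would leave node-a's 8 GPUs in the cache, so the second computation would yield a lending limit of 0 instead of 8. Recomputing from the current nominal quota on every change, rather than adjusting a stored limit, is what lets a watcher on the ClusterQueue react to quota edits.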

dgrove-oss self-assigned this Oct 14, 2024

dgrove-oss mentioned this issue Oct 15, 2024

Redesign node monitoring to account for Node deletion #255

Merged

dgrove-oss closed this as completed in #255 Oct 16, 2024

dgrove-oss closed this as completed in cacf2c7 Oct 16, 2024
