Dynamically adjust slack quota #212

tardieu · 2024-07-22T21:26:19Z

This PR makes it possible to designate a cluster queue as slack, for example:

    appwrapper:
      enableKueueIntegrations: true
      defaultQueueName: default-queue
      slackQueueName: slack-cluster-queue
      autopilot:
        injectAntiAffinities: true
        migrateImpactedWorkloads: true
        resourceUnhealthyConfig:
          nvidia.com/gpu:
            autopilot.ibm.com/gpuhealth: ERR

The AppWrapper controller will dynamically inject and adjust lending limits on the resources in this cluster queue. Each limit is computed by subtracting the unhealthy resource count (as reported via Autopilot labels) from the nominal quota for that resource.
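
As a minimal sketch of that subtraction (not the controller's actual code): the function name `lending_limits` and the plain integer quantities are hypothetical, and clamping at zero and skipping resources with no unhealthy units are assumptions made for illustration.

```python
def lending_limits(nominal, unhealthy):
    """Return a lending limit for each resource that has unhealthy units.

    nominal:   resource name -> nominal quota (integer units)
    unhealthy: resource name -> unhealthy unit count (integer units)
    """
    limits = {}
    for name, count in unhealthy.items():
        # Assumption: only resources present in the quota and with at
        # least one unhealthy unit receive a lending limit.
        if name in nominal and count > 0:
            # Assumption: the limit never goes below zero.
            limits[name] = max(nominal[name] - count, 0)
    return limits


print(lending_limits({"cpu": 8, "nvidia.com/gpu": 24}, {"nvidia.com/gpu": 8}))
```

With a nominal quota of 24 GPUs and 8 unhealthy GPUs, this yields a limit of 16 for `nvidia.com/gpu` and no limit for `cpu`.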

For instance, if an eight-GPU node reports an error (via the label autopilot.ibm.com/gpuhealth: ERR), the following lending limit is injected in the slack cluster queue:

    resources:
      - name: cpu
        nominalQuota: "8"
      - name: memory
        nominalQuota: 128Gi
      - lendingLimit: "16"
        name: nvidia.com/gpu
        nominalQuota: "24"


Dynamically adjust slack quota

9f6a35b

dgrove-oss approved these changes Jul 23, 2024

config/rbac/role.yaml (outdated, resolved)

Drop clusterqueues create and delete permissions

aed744e

tardieu force-pushed the slack branch from d807895 to aed744e Compare July 23, 2024 12:51

tardieu mentioned this pull request Jul 23, 2024

Monitor unhealthy resource quantities #211

Closed

tardieu merged commit 22e852a into main Jul 23, 2024
2 checks passed

dgrove-oss added the enhancement label Jul 23, 2024

tardieu deleted the slack branch July 24, 2024 01:21
