MCAD Quota Management Image issue on OC 4.13 #478


Closed
jbusche opened this issue Jul 13, 2023 · 9 comments

jbusche commented Jul 13, 2023

@asm582 has his test image quay.io/asmalvan/quota-mgmt-0712 working on OC 4.11, but he noticed issues on OC 4.13. I tried it myself on OC 4.13 and am seeing that AppWrappers aren't getting scheduled.

oc logs -f mcad-controller-68c575f55d-6db7h -n kube-system | grep ScheduleNext

E0713 21:13:19.865360       1 queuejob_controller_ex.go:1210] [ScheduleNext] Failed to updated status in etcd for app wrapper 'default/defaultaw-schd-spec-with-timeout-11', status = {Pending:0 Running:0 Succeeded:0 Failed:0 MinAvailable:0 CanRun:false IsDispatched:false State:Pending Message: SystemPriority:9 QueueJobState:HeadOfLine ControllerFirstTimestamp:2023-07-13 21:11:20.054349544 +0000 UTC m=+61.083581462 ControllerFirstDispatchTimestamp:0001-01-01 00:00:00 +0000 UTC FilterIgnore:true Sender:before ScheduleNext - setHOL Local:false Conditions:[{Type:Init Status:True LastUpdateMicroTime:2023-07-13 21:11:20.05458499 +0000 UTC m=+61.083816907 LastTransitionMicroTime:2023-07-13 21:11:20.05458508 +0000 UTC m=+61.083816977 Reason: Message:} {Type:Queueing Status:True LastUpdateMicroTime:2023-07-13 21:11:20.054809526 +0000 UTC m=+61.084041423 LastTransitionMicroTime:2023-07-13 21:11:20.054809606 +0000 UTC m=+61.084041493 Reason:AwaitingHeadOfLine Message:} {Type:HeadOfLine Status:True LastUpdateMicroTime:2023-07-13 21:11:25.642716112 +0000 UTC m=+66.671948019 LastTransitionMicroTime:2023-07-13 21:11:25.642716212 +0000 UTC m=+66.671948119 Reason:FrontOfQueue. Message:}] PendingPodConditions:[] TotalCPU:0 TotalMemory:0 TotalGPU:0 RequeueingTimeInSeconds:0 NumberOfRequeueings:0}, err=appwrappers.mcad.ibm.com "defaultaw-schd-spec-with-timeout-11" not found
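These status dumps are dense. As a purely illustrative triage aid (not part of MCAD), a few lines of shell can pull the headline fields out of a saved log line; the sample string below is abbreviated from the dump above:

```shell
# Hypothetical triage helper: extract headline fields from an MCAD status dump.
# The sample line is abbreviated from the ScheduleNext error above.
line='CanRun:false IsDispatched:false State:Pending SystemPriority:9 QueueJobState:HeadOfLine'

result=$(for f in State QueueJobState CanRun; do
  # (^| ) anchors the field name so QueueJobState does not also match State
  printf '%s\n' "$line" | grep -oE "(^| )$f:[^ ]*" | tr -d ' '
done)
printf '%s\n' "$result"
```

On the sample line this prints `State:Pending`, `QueueJobState:HeadOfLine`, and `CanRun:false`, one per line.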

However, on this cluster I did have CodeFlare 0.0.6 and ODH 1.7.0 installed prior to installing MCAD.

Will try again on a fresh cluster.

jbusche commented Jul 13, 2023

Installation steps:

  1. Start with a fresh OC 4.13 cluster.
  2. Clone the repo:
git clone https://github.com/project-codeflare/multi-cluster-app-dispatcher.git
cd multi-cluster-app-dispatcher/deployment/mcad-controller
  3. Install the chart with Helm:
helm upgrade --install mcad-controller . --namespace kube-system --wait \
  --set image.repository=quay.io/asmalvan/quota-mgmt-0712 --set image.tag=latest \
  --set configMap.name=mcad-controller-configmap --set configMap.podCreationTimeout='"120000"' \
  --set coscheduler.rbac.apiGroup=scheduling.sigs.k8s.io --set coscheduler.rbac.resource=podgroups \
  --set loglevel=10 --set resources.limits.cpu=1500m --set resources.requests.cpu=1160m \
  --set configMap.quotaEnabled='"true"' --set configMap.preemptionEnabled='"true"'
  4. Start a few MCAD AppWrappers:
../../test/perf-test/perf.sh
  5. Watch to see that they get scheduled:
oc logs -f mcad-controller-68c575f55d-6db7h -n kube-system | grep ScheduleNext

jbusche commented Jul 13, 2023

While trying the install on a fresh OC 4.13 system, I'm getting a timeout on the helm install:

helm ls

NAME           	NAMESPACE  	REVISION	UPDATED                                	STATUS	CHART                	APP VERSION
mcad-controller	kube-system	1       	2023-07-13 15:54:11.184788043 -0700 PDT	failed	mcad-controller-0.1.0	           

and

oc logs -f mcad-controller-c6bc4cd4b-dffsx

E0713 23:06:13.528330       1 reflector.go:138] go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.QuotaSubtree: failed to list *v1.QuotaSubtree: quotasubtrees.ibm.com is forbidden: User "system:serviceaccount:kube-system:mcad-controller" cannot list resource "quotasubtrees" in API group "ibm.com" at the cluster scope
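One quick way to confirm this is an RBAC gap (rather than a missing CRD) is to impersonate the controller's service account with `oc auth can-i`; the service account name below is taken from the "forbidden" error above, and these commands of course need a live cluster:

```shell
# Check whether the mcad-controller service account can list quotasubtrees
# (service account name taken from the "forbidden" error above).
oc auth can-i list quotasubtrees.ibm.com \
  --as=system:serviceaccount:kube-system:mcad-controller
# "no" here would confirm the ClusterRole is missing an ibm.com/quotasubtrees rule.

# Separately confirm the CRD itself exists:
oc get crd quotasubtrees.ibm.com
```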

asm582 commented Jul 13, 2023

@jbusche This PR might help: #475

jbusche commented Jul 14, 2023

OK, I did the following:

  1. Clone MCAD:
git clone https://github.com/project-codeflare/multi-cluster-app-dispatcher.git
cd multi-cluster-app-dispatcher/deployment/mcad-controller
  2. Fix the ClusterRole:
vi templates/deployment.yaml

Search for "xqueuejobs" and add ibm.com and quotasubtrees just above it, like this:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  name: system:controller:xqueuejob-controller
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
rules:
- apiGroups:
  - mcad.ibm.com
  - ibm.com
  resources:
  - quotasubtrees
  - xqueuejobs
  - queuejobs
  - schedulingspecs
  - appwrappers
  - appwrappers/finalizers
  - appwrappers/status
  3. Install MCAD with the latest built image:
helm upgrade --install mcad-controller . --namespace kube-system --wait \
  --set image.repository=quay.io/asmalvan/quota-mgmt-0712 --set image.tag=latest \
  --set configMap.name=mcad-controller-configmap --set configMap.podCreationTimeout='"120000"' \
  --set coscheduler.rbac.apiGroup=scheduling.sigs.k8s.io --set coscheduler.rbac.resource=podgroups \
  --set loglevel=10 --set resources.limits.cpu=1500m --set resources.requests.cpu=1160m \
  --set configMap.quotaEnabled='"true"' --set configMap.preemptionEnabled='"true"'
  4. Start a few AppWrappers:
../../test/perf-test/perf.sh
  5. Monitor the MCAD controller log:

oc logs -f mcad-controller-c6bc4cd4b-cmnqp | grep ScheduleNext

I0714 00:22:14.513152       1 queuejob_controller_ex.go:1459] [ScheduleNext] [Agent Mode] backing off app wrapper 'default/defaultaw-schd-spec-with-timeout-1' after waiting for 424.792µs activeQ=false Unsched=true &qj=0xc00294ef00 Version=71472 Status={Pending:0 Running:0 Succeeded:0 Failed:0 MinAvailable:0 CanRun:false IsDispatched:false State:Pending Message: SystemPriority:9 QueueJobState:HeadOfLine ControllerFirstTimestamp:2023-07-14 00:21:34.428277 +0000 UTC ControllerFirstDispatchTimestamp:0001-01-01 00:00:00 +0000 UTC FilterIgnore:true Sender:before ScheduleNext - setHOL Local:false Conditions:[{Type:Init Status:True LastUpdateMicroTime:2023-07-14 00:21:34.428444 +0000 UTC LastTransitionMicroTime:2023-07-14 00:21:34.428444 +0000 UTC Reason: Message:} {Type:Queueing Status:True LastUpdateMicroTime:2023-07-14 00:21:34.428665 +0000 UTC LastTransitionMicroTime:2023-07-14 00:21:34.428665 +0000 UTC Reason:AwaitingHeadOfLine Message:} {Type:HeadOfLine Status:True LastUpdateMicroTime:2023-07-14 00:21:34.439785 +0000 UTC LastTransitionMicroTime:2023-07-14 00:21:34.439785 +0000 UTC Reason:FrontOfQueue. Message:} {Type:Backoff Status:True LastUpdateMicroTime:2023-07-14 00:21:34.454989 +0000 UTC LastTransitionMicroTime:2023-07-14 00:21:34.454989 +0000 UTC Reason:AppWrapperNotRunnable. consumer NAMESPACE_default_AWNAME_defaultaw-schd-spec-with-timeout-1 already allocated on forest MCAD-CONTROLLER-FOREST Message:Insufficient quota to dispatch AppWrapper.}] PendingPodConditions:[] TotalCPU:0 TotalMemory:0 TotalGPU:0 RequeueingTimeInSeconds:0 NumberOfRequeueings:0}

asm582 commented Jul 14, 2023

@jbusche Thanks. It looks like there is no quota tree: https://github.com/project-codeflare/multi-cluster-app-dispatcher/blob/main/test/e2e-kuttl/install-quota-subtree.yaml. Can you add this quota tree, delete the old AW, and submit a new one?
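For context (sketch only — the label values below assume the sample tree in install-quota-subtree.yaml linked above), when quota management is enabled the AppWrapper has to carry quota designation labels that the quota manager can map onto the tree; without them MCAD backs off with "Missing required quota designation". The working AW later in this thread ends up labeled like this:

```yaml
# Sketch: quota designation labels on an AppWrapper. The "default" values
# assume the sample quota tree from install-quota-subtree.yaml.
apiVersion: mcad.ibm.com/v1beta1
kind: AppWrapper
metadata:
  name: defaultaw-schd-spec-with-timeout-1
  namespace: default
  labels:
    quota_context: default
    quota_service: default
```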

asm582 commented Jul 14, 2023

If possible, please share the status of the AWs and the corresponding status of the pods.

jbusche commented Jul 14, 2023

Hey, that worked!

  • On my OC 4.13.4 cluster:
oc version
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.13.4
Kubernetes Version: v1.26.5+7d22122
  • I've got MCAD running with image
    Image:         quay.io/asmalvan/quota-mgmt-0712:latest
  • The ClusterRole now has ibm.com and quotasubtrees added:
oc describe clusterrole system:controller:xqueuejob-controller |grep quotasubtrees
  quotasubtrees.ibm.com                []                 []              [create delete deletecollection get list patch update watch]
  quotasubtrees.mcad.ibm.com           []                 []              [create delete deletecollection get list patch update watch]
  • I applied the multi-cluster-app-dispatcher/test/e2e-kuttl/install-quota-subtree.yaml manifest:
oc apply -f multi-cluster-app-dispatcher/test/e2e-kuttl/install-quota-subtree.yaml
  • I kicked off a few AppWrappers:
test/perf-test/perf.sh

which results in two completed pods in the default namespace:

oc get appwrappers,jobs,pods -n default
NAME                                                         AGE
appwrapper.mcad.ibm.com/defaultaw-schd-spec-with-timeout-1   5m52s
appwrapper.mcad.ibm.com/defaultaw-schd-spec-with-timeout-2   5m51s

NAME                                           COMPLETIONS   DURATION   AGE
job.batch/defaultaw-schd-spec-with-timeout-1   1/1           16s        5m51s
job.batch/defaultaw-schd-spec-with-timeout-2   1/1           17s        5m49s

NAME                                           READY   STATUS      RESTARTS   AGE
pod/defaultaw-schd-spec-with-timeout-1-7gptc   0/1     Completed   0          5m51s
pod/defaultaw-schd-spec-with-timeout-2-pxzt9   0/1     Completed   0          5m49s
  • Describing one of the appwrappers, the status looks good:
    oc describe appwrapper defaultaw-schd-spec-with-timeout-1 -n default
Status:
  Succeeded:  1
  Conditions:
    Last Transition Micro Time:      2023-07-14T17:48:37.798780Z
    Last Update Micro Time:          2023-07-14T17:48:37.798780Z
    Status:                          True
    Type:                            Init
    Last Transition Micro Time:      2023-07-14T17:48:37.798971Z
    Last Update Micro Time:          2023-07-14T17:48:37.798971Z
    Reason:                          AwaitingHeadOfLine
    Status:                          True
    Type:                            Queueing
    Last Transition Micro Time:      2023-07-14T17:48:37.809700Z
    Last Update Micro Time:          2023-07-14T17:48:37.809700Z
    Reason:                          FrontOfQueue.
    Status:                          True
    Type:                            HeadOfLine
    Last Transition Micro Time:      2023-07-14T17:48:38.205067Z
    Last Update Micro Time:          2023-07-14T17:48:38.205067Z
    Reason:                          AppWrapperRunnable
    Status:                          True
    Type:                            Dispatched
    Last Transition Micro Time:      2023-07-14T17:48:42.995906Z
    Last Update Micro Time:          2023-07-14T17:48:42.995906Z
    Reason:                          PodsRunning
    Status:                          True
    Type:                            Running
    Last Transition Micro Time:      2023-07-14T17:48:58.052701Z
    Last Update Micro Time:          2023-07-14T17:48:58.052701Z
    Reason:                          PodsCompleted
    Status:                          True
    Type:                            Completed
  Controllerfirstdispatchtimestamp:  2023-07-14T17:48:48.012632Z
  Controllerfirsttimestamp:          2023-07-14T17:48:37.798579Z
  Filterignore:                      true
  Number Of Requeueings:             0
  Queuejobstate:                     Completed
  Requeueing Time In Seconds:        0
  Sender:                            before [manageQueueJob] setCompleted
  State:                             Completed
  Systempriority:                    9
  Totalcpu:                          10
  Totalmemory:                       10485760
Events:                              <none>

@jbusche jbusche self-assigned this Jul 14, 2023
@jbusche jbusche moved this to In Progress in Project CodeFlare Sprint Board Jul 14, 2023
jbusche commented Jul 18, 2023

This behaves the same on OC 4.12.22 as well (needs @asm582's PR #475 plus the quota trees).

oc version
Client Version: 4.12.22
Kustomize Version: v4.5.7
Server Version: 4.12.22
Kubernetes Version: v1.25.10+8c21020
oc describe appwrapper defaultaw-schd-spec-with-timeout-1
Name:         defaultaw-schd-spec-with-timeout-1
Namespace:    default
Labels:       quota_context=default
              quota_service=default
Annotations:  <none>
API Version:  mcad.ibm.com/v1beta1
Kind:         AppWrapper
Metadata:
  Creation Timestamp:  2023-07-18T20:11:29Z
  Generation:          2
  Managed Fields:
    API Version:  mcad.ibm.com/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:priority:
        f:resources:
          .:
          f:Items:
        f:schedulingSpec:
          .:
          f:minAvailable:
          f:requeuing:
            .:
            f:growthType:
            f:maxNumRequeuings:
            f:maxTimeInSeconds:
            f:numRequeuings:
            f:timeInSeconds:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2023-07-18T20:11:29Z
    API Version:  mcad.ibm.com/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:quota_context:
          f:quota_service:
      f:spec:
        f:resources:
          f:GenericItems:
          f:metadata:
        f:schedulingSpec:
          f:dispatchDuration:
        f:service:
          .:
          f:spec:
      f:status:
        .:
        f:controllerfirsttimestamp:
        f:filterignore:
        f:numberOfRequeueings:
        f:requeueingTimeInSeconds:
        f:systempriority:
    Manager:      Go-http-client
    Operation:    Update
    Time:         2023-07-18T20:11:31Z
    API Version:  mcad.ibm.com/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:Succeeded:
        f:conditions:
        f:controllerfirstdispatchtimestamp:
        f:queuejobstate:
        f:sender:
        f:state:
        f:totalcpu:
        f:totalmemory:
    Manager:         Go-http-client
    Operation:       Update
    Subresource:     status
    Time:            2023-07-18T20:11:57Z
  Resource Version:  80436
  UID:               00b6c6b8-00c0-409d-b61c-f0dac5a40bf4
Spec:
  Priority:  9
  Resources:
    Generic Items:
      Allocated:         0
      Completionstatus:  Complete
      Custompodresources:
        Limits:
          Cpu:             500m
          Memory:          128M
          nvidia.com/gpu:  0
        Replicas:          1
        Requests:
          Cpu:             10m
          Memory:          10M
          nvidia.com/gpu:  0
      Generictemplate:
        API Version:  batch/v1
        Kind:         Job
        Metadata:
          Name:       defaultaw-schd-spec-with-timeout-1
          Namespace:  default
        Spec:
          Completions:  1
          Parallelism:  1
          Template:
            Metadata:
              Labels:
                appwrapper.mcad.ibm.com:  defaultaw-schd-spec-with-timeout-1
              Namespace:                  default
            Spec:
              Containers:
                Args:
                  sleep 10
                Command:
                  /bin/bash
                  -c
                  --
                Image:  ubi8-minimal:latest
                Name:   defaultaw-schd-spec-with-timeout-1
                Resources:
                  Limits:
                    Cpu:     500m
                    Memory:  128Mi
                  Requests:
                    Cpu:       10m
                    Memory:    10Mi
              Restart Policy:  Never
      Metadata:
      Priority:       0
      Priorityslope:  0
      Replicas:       1
    Items:
    Metadata:
  Scheduling Spec:
    Dispatch Duration:
    Min Available:  1
    Requeuing:
      Growth Type:          exponential
      Max Num Requeuings:   0
      Max Time In Seconds:  0
      Num Requeuings:       0
      Time In Seconds:      120
  Service:
    Spec:
Status:
  Succeeded:  1
  Conditions:
    Last Transition Micro Time:      2023-07-18T20:11:29.019683Z
    Last Update Micro Time:          2023-07-18T20:11:29.019683Z
    Status:                          True
    Type:                            Init
    Last Transition Micro Time:      2023-07-18T20:11:29.019903Z
    Last Update Micro Time:          2023-07-18T20:11:29.019903Z
    Reason:                          AwaitingHeadOfLine
    Status:                          True
    Type:                            Queueing
    Last Transition Micro Time:      2023-07-18T20:11:29.029506Z
    Last Update Micro Time:          2023-07-18T20:11:29.029506Z
    Reason:                          FrontOfQueue.
    Status:                          True
    Type:                            HeadOfLine
    Last Transition Micro Time:      2023-07-18T20:11:29.045201Z
    Last Update Micro Time:          2023-07-18T20:11:29.045201Z
    Message:                         Insufficient quota to dispatch AppWrapper.
    Reason:                          AppWrapperNotRunnable. Missing required quota designation: quota_service, quota_context.
    Status:                          True
    Type:                            Backoff
    Last Transition Micro Time:      2023-07-18T20:11:31.483367Z
    Last Update Micro Time:          2023-07-18T20:11:31.483367Z
    Reason:                          AppWrapperRunnable
    Status:                          True
    Type:                            Dispatched
    Last Transition Micro Time:      2023-07-18T20:11:41.650059Z
    Last Update Micro Time:          2023-07-18T20:11:41.650059Z
    Reason:                          PodsRunning
    Status:                          True
    Type:                            Running
    Last Transition Micro Time:      2023-07-18T20:11:57.616541Z
    Last Update Micro Time:          2023-07-18T20:11:57.616541Z
    Reason:                          PodsCompleted
    Status:                          True
    Type:                            Completed
  Controllerfirstdispatchtimestamp:  2023-07-18T20:11:47.481787Z
  Controllerfirsttimestamp:          2023-07-18T20:11:29.019473Z
  Filterignore:                      true
  Number Of Requeueings:             0
  Queuejobstate:                     Completed
  Requeueing Time In Seconds:        0
  Sender:                            before [manageQueueJob] setCompleted
  State:                             Completed
  Systempriority:                    9
  Totalcpu:                          10
  Totalmemory:                       10485760
Events:                              <none>

jbusche commented Aug 17, 2023

These are in now, so I'm closing this issue.

@jbusche jbusche closed this as completed Aug 17, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Project CodeFlare Sprint Board Aug 17, 2023