Create e2e test to emulate deployment use cases for MCAD #399

Closed
z103cb opened this issue May 31, 2023 · 15 comments · Fixed by #400
Labels: enhancement (New feature or request), quota-management

Comments

@z103cb
Contributor

z103cb commented May 31, 2023

Description

We need to ensure that MCAD functions correctly in the following deployment scenarios, which are the ones most likely to be encountered when upgrading to or deploying a new version of MCAD with the quota management feature enabled. It is assumed that the changes in #396 are incorporated and that the "default" quota node is available.

To achieve the goals, we'll need to augment our kuttl end to end tests to support the following scenarios. It is assumed that the same kind configuration / cluster setup as the existing end to end tests is used. The scenarios below assume that the kind cluster is up and running.

Deployment Scenarios

Deploy in cluster with running app wrappers where quota management was not previously deployed

Purpose

To validate that we can deploy MCAD with the quota management feature turned on in a cluster where there are app wrappers in a RUNNING, COMPLETED, or RUNNING HOLD COMPLETION state which did not specify quota labels. Once MCAD is re-deployed with the quota management feature enabled, it is expected that existing app wrappers in the RUNNING and RUNNING HOLD COMPLETION states without any quota management labels will use the quota from the "default" quota node.

Steps

  1. Deploy MCAD on cluster with quota management turned off
  2. Submit 1 app wrapper with a deployment running echo server to emulate long running AppWrappers with the expectation that the existing app wrappers and associated resources will continue to run. Expect that the submitted app wrapper will be in a "RUNNING" state.
  3. Submit 1 app wrapper with one job that is expected to run for a limited time. Expect that the submitted app wrapper will be in a "COMPLETED" state.
  4. Submit 1 app wrapper that would end up in a "RUNNING HOLD COMPLETION" state.
  5. Redeploy MCAD on cluster with quota management turned on, and no quota trees submitted with the expectation that the existing app wrappers and associated resources will continue to run and MCAD continues to function.
  6. Submit quota trees with a "default" node and additional quota nodes with the expectation that the existing app wrappers are associated with the "default" quota node and MCAD continues to function.
  7. Submit a new app wrapper with enough resource request to use up the remainder of the quota allocation for the "default" node. Expect the app wrapper to be in a "Pending" state.
  8. Delete the app wrappers submitted in steps 2 and 3. Expect that the app wrapper from the previous step will be in a "RUNNING" state.
  9. Submit additional app wrappers consuming quota nodes other than default with the expectation that they would be in a "RUNNING" or "COMPLETED" state.
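
A minimal sketch of the quota accounting that steps 6 through 8 rely on, written as a toy Python model rather than MCAD's implementation. The `QuotaNode` class, the 1000m limit, and the app wrapper names and sizes are all hypothetical; the only behavior taken from the steps above is that unlabeled app wrappers are charged to the "default" node and that a request larger than the remaining quota stays pending until quota is released.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class QuotaNode:
    """Toy hard-quota node (hypothetical); CPU is tracked in millicores."""
    name: str
    limit_m: int
    used: Dict[str, int] = field(default_factory=dict)  # app wrapper -> millicores

    def available(self) -> int:
        return self.limit_m - sum(self.used.values())

    def charge(self, aw: str, cpu_m: int) -> None:
        """Account for an app wrapper that was already running (step 6)."""
        self.used[aw] = cpu_m

    def submit(self, aw: str, cpu_m: int) -> str:
        """Return the state a newly submitted app wrapper is expected to reach."""
        if cpu_m <= self.available():
            self.used[aw] = cpu_m
            return "Running"
        return "Pending"

    def delete(self, aw: str) -> None:
        self.used.pop(aw, None)


default = QuotaNode("default", limit_m=1000)
default.charge("aw-echo-server", 800)          # pre-existing, unlabeled (step 2)
state_before = default.submit("aw-new", 500)   # step 7: 500m > 200m available
default.delete("aw-echo-server")               # step 8
state_after = default.submit("aw-new", 500)    # the same request now fits
print(state_before, state_after)
```

In a real test the retry after the delete would be MCAD's own dispatch loop; here it is an explicit second `submit` call.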

Deploy in cluster with running app wrappers where quota management was previously deployed

Purpose

To validate that we can deploy and re-deploy MCAD with the quota management feature turned on in a cluster where there are app wrappers in a RUNNING, COMPLETED, or RUNNING HOLD COMPLETION state with specific quota labels. Once MCAD is re-deployed with the quota management feature enabled, it is expected that existing app wrappers in the RUNNING and RUNNING HOLD COMPLETION states would use up the quota from their respective nodes.

Steps:

  1. Deploy CRDs in the cluster
  2. Deploy quota trees in the cluster, with a "default" node.
  3. Deploy MCAD on cluster with quota management feature flag turned on.
  4. Submit additional app wrappers consuming quota nodes other than default with the expectation that they would be in a "RUNNING" and "COMPLETED" state.
  5. Submit additional app wrappers with no quota nodes specified. Expect that the app wrappers would be in a "Running" and "Completed" state and they would use up the "default" node quota.
  6. Redeploy MCAD on cluster with quota management turned on or kill the MCAD pod
  7. Submit a new app wrapper with enough resource request to use up the remainder of the quota allocation for the "default" node. Expect the app wrapper to be in a "Pending" state.
  8. Submit a new app wrapper with enough resource request to use up the remainder of the quota allocation for a node other than default. Expect the app wrapper to be in a "Pending" state.
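
One way to picture what step 6 puts at risk: after a redeploy or pod kill, per-node quota usage has to be reconstructed from the app wrappers already in the cluster before steps 7 and 8 can be judged. The sketch below is a toy Python model, not MCAD code. The `AppWrapper` record, its `quota_node` field, the node names, and the assumption that only Running app wrappers hold quota are all hypothetical; the thread does not settle whether Completed ones do.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AppWrapper:
    name: str
    state: str                 # "Running", "Completed", "Pending", ...
    quota_node: Optional[str]  # None when no quota node was specified
    cpu_m: int                 # requested CPU in millicores


def rebuild_usage(aws: List[AppWrapper]) -> Dict[str, int]:
    """Sum CPU per quota node; unlabeled app wrappers fall back to "default"."""
    usage: Dict[str, int] = {}
    for aw in aws:
        if aw.state != "Running":  # assumption: only Running holds quota
            continue
        node = aw.quota_node or "default"
        usage[node] = usage.get(node, 0) + aw.cpu_m
    return usage


cluster = [
    AppWrapper("aw-a", "Running", "node-a", 300),
    AppWrapper("aw-b", "Completed", "node-a", 200),
    AppWrapper("aw-c", "Running", None, 400),
]
usage = rebuild_usage(cluster)
print(usage)
```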

Note: This scenario should probably be incorporated into the existing quota management e2e test.

Update quota trees in a running cluster

Purpose

To validate that updates to the quota trees can be processed while MCAD is running, that there is no disruption in processing for existing app wrappers, and that new app wrappers can be submitted.

Steps

  1. Deploy CRDs in the cluster
  2. Deploy quota trees in the cluster, with a "default" node.
  3. Deploy MCAD on cluster with quota management feature flag turned on.
  4. Submit additional app wrappers consuming quota nodes other than default with the expectation that they would be in a "RUNNING" or "COMPLETED" state.
  5. Submit additional app wrappers with no quota nodes specified with the expectation that they would be in a "Running" and "Completed" state.
  6. Submit CRD that adds a new quota node to the tree defined in step 2. Expect that the existing app wrappers will continue to execute and MCAD is still functional.
  7. Submit additional app wrappers for the quota node defined in step 6 with the expectation that they would be in a "RUNNING" or "COMPLETED" state.
@tardieu
Member

tardieu commented May 31, 2023

The most critical scenario is the clean, orchestrated MCAD deployment and configuration.

  1. deploy CRDs and MCAD controller,
  2. wait for controller to be ready,
  3. deploy 3-node quota tree at once,
  4. test AppWrapper deployment, possibly deploying first an AppWrapper under quota then after completion an AppWrapper over quota.

We can then relax this scenario by deploying either:

  • the CRDs, the controller, the quota tree without waiting for the controller to be ready,
  • the CRDs then the quota tree then the controller,
  • the CRDs then part of the tree then the controller then the rest of the tree.

We can then consider scenarios including existing AppWrappers (running or not).

@dmatch01
Collaborator

dmatch01 commented Jun 1, 2023

Deploy in cluster with running app wrappers where quota management was not previously deployed

Steps:

Deploy MCAD on cluster with quota management turned off
Submit 2 app wrappers with a deployment running echo server to emulate long running AppWrappers with the expectation that the existing app wrappers and associated resources will continue to run. Expect that the submitted app wrapper will be in a "Running" state.
Redeploy MCAD on cluster with quota management turned on, and no quota trees submitted with the expectation that the existing app wrappers and associated resources will continue to run and MCAD continues to function.
Submit quota trees with a "default" node and additional quota nodes with the expectation that the existing app wrappers are associated with the "default" quota node and MCAD continues to function.

the above step is a 3rd AW, correct? Based on the next step there is still available quota but not enough for the AW in the next step?

Submit a new app wrapper with enough resource request to use up the remainder of the quota allocation for the "default" node. Expect the app wrapper to be in a "Pending" state.

the above step is 4th AW, correct? This is assuming available quota < 4th AW quota demand?

Delete app wrappers submitted in step 2, expect that the app wrapper in step 5 will be in a running state

this is assuming the 4th AW, correct? implying 4th AW quota demand <= current available quota + quota demand from step 2, correct?

Submit additional app wrappers consuming quota nodes other than default with the expectation that they would be in a "Running" or "Completed" state

this is assuming a 5th AW quota demand <= quota node entitlement + borrowing if applicable?

@dmatch01
Collaborator

dmatch01 commented Jun 1, 2023

Deploy in cluster with running app wrappers where quota management was previously deployed

Steps:

Deploy CRDs in the cluster
Deploy quota trees in the cluster, with a "default" node.
Deploy MCAD on cluster with quota management feature flag turned on.
Submit additional app wrappers consuming quota nodes other than default with the expectation that they would be in a "Running" and "Completed" state.
Submit additional app wrappers with no quota nodes specified with the expectation that they would be in a "Running" and "Completed" state.

Would this test case have different behavior if the AW is Completed vs. Running?

Redeploy MCAD on cluster with quota management turned on or kill the MCAD pod
Submit a new app wrapper with enough resource request to use up the remainder of the quota allocation for the "default" node. Expect the app wrapper to be in a "Pending" state.
Submit a new app wrapper with enough resource request to use up the remainder of the quota allocation for a node other than default. Expect the app wrapper to be in a "Pending" state.

@z103cb could you explain a bit more about what this test case is testing? Plus can I assume the # of AWs submitted is the same as the previous test case?

@dmatch01
Collaborator

dmatch01 commented Jun 1, 2023

Update quota trees in a running cluster

Steps:

Deploy CRDs in the cluster
Deploy quota trees in the cluster, with a "default" node.
Deploy MCAD on cluster with quota management feature flag turned on.
Submit additional app wrappers consuming quota nodes other than default with the expectation that they would be in a "Running" and "Completed" state.
Submit additional app wrappers with no quota nodes specified with the expectation that they would be in a "Running" and "Completed" state.
Submit CRD that adds a new quota node to the tree defined in step 2. Expect that the existing app wrappers will continue to execute and MCAD is still functional
Submit additional app wrappers for the quota node defined in step 6 with the expectation that they would be in a "Running" and "Completed" state.

this is assuming this final AW quota demand <= quota node entitlement + borrowing if applicable?

@dmatch01
Collaborator

dmatch01 commented Jun 1, 2023

Thanks @z103cb for the test case descriptions above. A couple of scenarios, not necessarily related to default behavior, are the following:

  1. create tree: MyTree with following tree structure
    Root(10CPU)->NodeA(10CPU-Soft)
  2. create AW1 with label MyTree=NodeA and demanding 8 CPUs until running state
  3. update MyTree with following tree structure
    Root(10CPU)->NodeA(10CPU-Soft)
    I think the design here was to leave AW1 running as is. If AW1 gets preempted, the job will stay queued due to insufficient quota.

Another Scenario

  1. create tree: MyTree with following tree structure
    Root(10CPU)->NodeA(10CPU-Soft)
  2. create AW1 with label MyTree=NodeA and demanding 8 CPUs until running state
  3. create AW2 with label MyTree=NodeA and demanding 12 CPUs until pending state
  4. update MyTree with following tree structure
    Root(20CPU)->GroupA(20CPU)->NodeA(10CPU-Soft)
    both AW1 and AW2 should be running.
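
The second scenario can be walked through with a toy quota tree. The sketch below is an assumption-laden Python model, not MCAD's quota manager: `Node` and `try_admit` are hypothetical names, a soft node is modeled as one that may exceed its own allocation as long as every hard ancestor still has room, and updating the tree is modeled by rebuilding it and re-charging the running app wrapper.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Node:
    name: str
    cpu: int                        # allocation in whole CPUs
    soft: bool = False              # soft nodes may borrow from ancestors
    parent: Optional["Node"] = None
    used: int = 0


def path_to_root(node: Optional[Node]) -> Iterator[Node]:
    while node is not None:
        yield node
        node = node.parent


def try_admit(leaf: Node, demand: int) -> bool:
    """Charge `demand` along the path to the root if no hard limit is exceeded."""
    path = list(path_to_root(leaf))
    if any(not n.soft and n.used + demand > n.cpu for n in path):
        return False
    for n in path:
        n.used += demand
    return True


# Steps 1-3: Root(10CPU) -> NodeA(10CPU-Soft)
root = Node("Root", 10)
node_a = Node("NodeA", 10, soft=True, parent=root)
aw1_before = try_admit(node_a, 8)    # AW1 runs
aw2_before = try_admit(node_a, 12)   # AW2 pending: 8 + 12 > Root's 10

# Step 4: Root(20CPU) -> GroupA(20CPU) -> NodeA(10CPU-Soft)
root = Node("Root", 20)
group_a = Node("GroupA", 20, parent=root)
node_a = Node("NodeA", 10, soft=True, parent=group_a)
aw1_after = try_admit(node_a, 8)     # re-charge the running AW1
aw2_after = try_admit(node_a, 12)    # AW2 now fits: 8 + 12 <= 20
print(aw1_before, aw2_before, aw1_after, aw2_after)
```

Under these assumptions AW2 is blocked only by the root's hard limit, which is why raising the root (and inserting GroupA) to 20 CPUs is enough to let both run.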

@z103cb
Contributor Author

z103cb commented Jun 6, 2023

The most critical scenario is the clean, orchestrated MCAD deployment and configuration.

  1. deploy CRDs and MCAD controller,
  2. wait for controller to be ready,
  3. deploy 3-node quota tree at once,
  4. test AppWrapper deployment, possibly deploying first an AppWrapper under quota then after completion an AppWrapper over quota.

We can then relax this scenario by deploying either:

  • the CRDs, the controller, the quota tree without waiting for the controller to be ready,
  • the CRDs then the quota tree then the controller,
  • the CRDs then part of the tree then the controller then the rest of the tree.

We can then consider scenarios including existing AppWrappers (running or not).

@tardieu I believe that these scenarios are mostly covered by the existing e2e test. A slight modification to tests would be needed to flip the order of steps in the test setup as such:

  - script: helm upgrade  --install mcad-controller deployment/mcad-controller --skip-crds --namespace kube-system --wait --set loglevel=${LOG_LEVEL} --set resources.requests.cpu=1000m --set resources.requests.memory=1024Mi --set resources.limits.cpu=4000m --set resources.limits.memory=4096Mi --set image.repository=$IMAGE_REPOSITORY_MCAD --set image.tag=$IMAGE_TAG_MCAD --set image.pullPolicy=$MCAD_IMAGE_PULL_POLICY --set configMap.quotaEnabled='"true"' --set quotaManagement.rbac.apiGroup=ibm.com --set quotaManagement.rbac.resource=quotasubtrees  --set configMap.name=mcad-controller-configmap --set configMap.preemptionEnabled='"true"' 
  - command: kubectl apply -f ./e2e-kuttl/install-quota-subtree.yaml

Is this alteration sufficient, or do you think we need a separate test to cover the scenario you have described?

@z103cb
Contributor Author

z103cb commented Jun 6, 2023

Deploy in cluster with running app wrappers where quota management was not previously deployed
Steps:

Deploy MCAD on cluster with quota management turned off
Submit 2 app wrappers with a deployment running echo server to emulate long running AppWrappers with the expectation that the existing app wrappers and associated resources will continue to run. Expect that the submitted app wrapper will be in a "Running" state.
Redeploy MCAD on cluster with quota management turned on, and no quota trees submitted with the expectation that the existing app wrappers and associated resources will continue to run and MCAD continues to function.
Submit quota trees with a "default" node and additional quota nodes with the expectation that the existing app wrappers are associated with the "default" quota node and MCAD continues to function.

the above step is a 3rd AW, correct?
No, the step above adds the CRDs that define the quota trees. There should be no new app wrappers submitted. It is expected that existing app wrappers will associate with the "default" node and use up all the resource quota (cpu) defined for that node.
Based on the next step there is still available quota but not enough for the AW in the next step?
Correct, the next submitted app wrapper should be put in a "PENDING" state due to insufficient available quota. Quota should be freed up for this app wrapper once the existing app wrappers are deleted.

Submit a new app wrapper with enough resource request to use up the remainder of the quota allocation for the "default" node. Expect the app wrapper to be in a "Pending" state.

the above step is 4th AW, correct? This is assuming available quota < 4th AW quota demand?
Correct.

Delete app wrappers submitted in step 2, expect that the app wrapper in step 5 will be in a running state

this is assuming the 4th AW, correct? implying 4th AW quota demand <= current available quota + quota demand from step 2, correct?
Yes

Submit additional app wrappers consuming quota nodes other than default with the expectation that they would be in a "Running" or "Completed" state

this is assuming a 5th AW quota demand <= quota node entitlement + borrowing if applicable?
Yes.
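
The conditions confirmed in the answers above can be stated as small predicates. This is a sketch in Python, not MCAD code; the function and parameter names are hypothetical, and the CPU amounts in the usage lines are arbitrary.

```python
def stays_pending(demand: int, available: int) -> bool:
    """4th AW before the deletions: available quota < quota demand."""
    return available < demand


def fits_after_delete(demand: int, available: int, freed: int) -> bool:
    """4th AW after the deletions: demand <= available quota + quota freed from step 2."""
    return demand <= available + freed


def fits_on_node(demand: int, entitlement: int, borrowable: int = 0) -> bool:
    """5th AW: demand <= quota node entitlement + borrowing, if applicable."""
    return demand <= entitlement + borrowable


# Arbitrary numbers in millicores: 200m left on "default", 800m held by the step 2 AWs.
print(stays_pending(demand=500, available=200))
print(fits_after_delete(demand=500, available=200, freed=800))
print(fits_on_node(demand=700, entitlement=500, borrowable=300))
```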

@z103cb
Contributor Author

z103cb commented Jun 6, 2023

Update quota trees in a running cluster
Steps:

Deploy CRDs in the cluster
Deploy quota trees in the cluster, with a "default" node.
Deploy MCAD on cluster with quota management feature flag turned on.
Submit additional app wrappers consuming quota nodes other than default with the expectation that they would be in a "Running" and "Completed" state.
Submit additional app wrappers with no quota nodes specified with the expectation that they would be in a "Running" and "Completed" state.
Submit CRD that adds a new quota node to the tree defined in step 2. Expect that the existing app wrappers will continue to execute and MCAD is still functional
Submit additional app wrappers for the quota node defined in step 6 with the expectation that they would be in a "Running" and "Completed" state.

this is assuming this final AW quota demand <= quota node entitlement + borrowing if applicable?

Yes, that is correct.

@z103cb
Contributor Author

z103cb commented Jun 6, 2023

Thanks @z103cb for the test case descriptions above. A couple of scenarios, not necessarily related to default behavior, are the following:

  1. create tree: MyTree with following tree structure
    Root(10CPU)->NodeA(10CPU-Soft)
  2. create AW1 with label MyTree=NodeA and demanding 8 CPUs until running state
  3. update MyTree with following tree structure
    Root(10CPU)->NodeA(10CPU-Soft)
    I think the design here was to leave AW1 running as is. If AW1 gets preempted, the job will stay queued due to insufficient quota.

Another Scenario

  1. create tree: MyTree with following tree structure
    Root(10CPU)->NodeA(10CPU-Soft)
  2. create AW1 with label MyTree=NodeA and demanding 8 CPUs until running state
  3. create AW2 with label MyTree=NodeA and demanding 12 CPUs until pending state
  4. update MyTree with following tree structure
    Root(20CPU)->GroupA(20CPU)->NodeA(10CPU-Soft)
    both AW1 and AW2 should be running.

@dmatch01 I think that these scenarios should perhaps be incorporated into the existing end to end tests. I feel that they are outside the scope / purpose of this issue, which is to define the test cases that would allow us to validate that MCAD works when it is deployed with or without the quota management feature enabled and with pre-existing running / submitted app wrappers. Do you mind if I create a separate issue to track their implementation?

@dmatch01
Collaborator

dmatch01 commented Jun 8, 2023

@dmatch01 I think that these scenarios should perhaps be incorporated into the existing end to end tests. I feel that they are outside the scope / purpose of this issue, which is to define the test cases that would allow us to validate that MCAD works when it is deployed with or without the quota management feature enabled and with pre-existing running / submitted app wrappers. Do you mind if I create a separate issue to track their implementation?

No problem, thank you!

@z103cb
Contributor Author

z103cb commented Jun 13, 2023

After merging #396, scenario 1 fails. Created issue #409.

@tardieu
Member

tardieu commented Jun 14, 2023

@tardieu I believe that these scenarios are mostly covered by the existing e2e test. A slight modification to tests would be needed to flip the order of steps in the test setup as such:

  - script: helm upgrade  --install mcad-controller deployment/mcad-controller --skip-crds --namespace kube-system --wait --set loglevel=${LOG_LEVEL} --set resources.requests.cpu=1000m --set resources.requests.memory=1024Mi --set resources.limits.cpu=4000m --set resources.limits.memory=4096Mi --set image.repository=$IMAGE_REPOSITORY_MCAD --set image.tag=$IMAGE_TAG_MCAD --set image.pullPolicy=$MCAD_IMAGE_PULL_POLICY --set configMap.quotaEnabled='"true"' --set quotaManagement.rbac.apiGroup=ibm.com --set quotaManagement.rbac.resource=quotasubtrees  --set configMap.name=mcad-controller-configmap --set configMap.preemptionEnabled='"true"' 
  - command: kubectl apply -f ./e2e-kuttl/install-quota-subtree.yaml

Is this alteration sufficient, or do you think we need a separate test to cover the scenario you have described?

Yes this should cover the 1st scenario I described.

@tardieu
Member

tardieu commented Jun 14, 2023

AFAIK scenarios 1 and 2 are targeting hard quotas, since they assume some AWs will remain pending, which implies they cannot successfully borrow.

We should also test the introduction of soft quotas, borrowing, and preemption.

Variation of scenario 1 should be something like:

  • submit long-running AW1 with no quota
  • turn on and define quota so that AW1 is over soft default quota but under root quota
  • AW1 should continue to run
  • submit AW2 on a different quota node, below or at quota, to trigger preemption of AW1 (AW1 request + AW2 request > root quota).
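
The preemption trigger in the last bullet can be sketched as a toy Python model. Nothing here is MCAD's preemption logic: `RunningAW`, `preemption_candidates`, the node name "other", and all CPU numbers are hypothetical, and the model assumes the incoming request is within its own node's quota so that only borrowers (app wrappers on a node that is over its soft quota) are considered as victims.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunningAW:
    name: str
    node: str
    cpu: int


def preemption_candidates(running: List[RunningAW], node_quota: Dict[str, int],
                          root_cpu: int, new_cpu: int) -> List[str]:
    """Pick borrowers to preempt until the new request fits under the root quota."""
    overflow = sum(aw.cpu for aw in running) + new_cpu - root_cpu
    if overflow <= 0:
        return []  # everything fits under the root, no preemption needed
    usage: Dict[str, int] = {}
    for aw in running:
        usage[aw.node] = usage.get(aw.node, 0) + aw.cpu
    chosen: List[str] = []
    for aw in running:
        if overflow <= 0:
            break
        if usage[aw.node] > node_quota[aw.node]:  # this node is borrowing
            chosen.append(aw.name)
            usage[aw.node] -= aw.cpu
            overflow -= aw.cpu
    return chosen


# AW1 uses 7 CPUs on "default" (soft quota 4), under the root quota of 10.
running = [RunningAW("AW1", "default", 7)]
quota = {"default": 4, "other": 6}
at_quota = preemption_candidates(running, quota, root_cpu=10, new_cpu=6)  # 7 + 6 > 10
small = preemption_candidates(running, quota, root_cpu=10, new_cpu=3)     # 7 + 3 <= 10
print(at_quota, small)
```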

For the restart scenario:

  • submit long-running AW1 exceeding soft quota but below root quota,
  • restart MCAD
  • trigger preemption akin to previous scenario.

If we want to cover all possible states at the time we either (1) enable quotas or (2) restart MCAD, we also need to consider states where we have pending AWs: in (1) because of lack of resources, in (2) because of lack of resources and/or lack of quota. In particular for (2), we would like the running AWs to continue running and pending AWs to remain pending.
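
The combinations described above can be enumerated so that none is missed when writing the tests. This is only a sketch in Python: the event and state labels are hypothetical names, and the expectation function covers only the restart case, which is the one spelled out above.

```python
from typing import Dict, List, Tuple

# Reasons an AW can be pending at the time of each event, per the paragraph above.
PENDING_REASONS: Dict[str, List[str]] = {
    "enable-quotas": ["resources"],
    "restart-mcad": ["resources", "quota", "resources+quota"],
}


def cases() -> List[Tuple[str, str]]:
    """Every (event, AW state at the time of the event) pair to cover."""
    out: List[Tuple[str, str]] = []
    for event, reasons in PENDING_REASONS.items():
        out.append((event, "running"))
        out.extend((event, f"pending:{reason}") for reason in reasons)
    return out


def expected_after_restart(state: str) -> str:
    """Running AWs keep running and pending AWs remain pending across a restart."""
    return "running" if state == "running" else "pending"


all_cases = cases()
for event, state in all_cases:
    print(event, state)
```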

While this is probably less important in practice, we could also have a default hard quota that is below the volume of running AWs at the time we enable quotas.

@tardieu
Member

tardieu commented Jun 14, 2023

Looking more carefully at scenario 1 in #400, it looks like you already have soft and hard quotas, so the variation of scenario 1 above could probably instead be an extension of the implemented scenario, where step 10 verifies that the AW launched at step 6 has been preempted by the AW launched at step 9. This can probably be achieved by making bronze soft and generous and the last submitted AW larger.

@z103cb
Contributor Author

z103cb commented Jun 14, 2023

Blocked by #297

@z103cb z103cb moved this from In Progress to Ready For Review in Project CodeFlare Sprint Board Jul 13, 2023
@github-project-automation github-project-automation bot moved this from Ready For Review to Done in Project CodeFlare Sprint Board Jul 14, 2023