Updates to the controller logic to better handle failures in etcd updates #424

Merged
merged 26 commits into from
Jul 12, 2023

Conversation

z103cb
Contributor

@z103cb z103cb commented Jun 20, 2023

Updated the etcd update functions.
Added error handling.
Reduced the number of update calls.
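
One way to reduce update calls is to skip the write when the status has not changed since the last successful write. The sketch below is only an illustration of that idea, not the code in this PR: the `status` type, the `statusWriter` type, and `writeIfChanged` are hypothetical names, and the write itself is simulated by a counter.

```go
package main

import (
	"fmt"
	"reflect"
)

// status is a hypothetical, simplified stand-in for an AppWrapper status.
type status struct {
	State   string
	Running int
}

// statusWriter remembers the last status it wrote and counts writes.
// A real controller would send the status to the API server instead.
type statusWriter struct {
	last   status
	writes int
}

// writeIfChanged performs a (simulated) write only when next differs from
// the last written status, and reports whether a write happened.
func (w *statusWriter) writeIfChanged(next status) bool {
	if reflect.DeepEqual(w.last, next) {
		return false
	}
	w.last = next
	w.writes++
	return true
}

func main() {
	w := &statusWriter{}
	updates := []status{{"Pending", 0}, {"Pending", 0}, {"Running", 1}}
	for _, s := range updates {
		fmt.Println(s.State, w.writeIfChanged(s))
	}
	fmt.Println("writes:", w.writes)
}
```

Note that in this sketch a first status equal to the zero value would not be written; a real implementation would need to track whether any write has happened yet.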

@z103cb z103cb requested review from metalcycling and asm582 June 20, 2023 14:30
Member

@asm582 asm582 left a comment

Thanks for the PR. Please address the review comments.

@z103cb z103cb requested review from asm582 and dmatch01 and removed request for metalcycling June 22, 2023 21:43
@z103cb z103cb self-assigned this Jun 22, 2023
@z103cb z103cb requested review from asm582 and removed request for dmatch01 June 23, 2023 08:02
@z103cb z103cb marked this pull request as ready for review June 26, 2023 08:00
@asm582
Member

asm582 commented Jun 27, 2023

Something is off with this PR. I built an image from your branch and submitted just one AppWrapper (AW), and it does not run. Log messages:

0627 11:04:45.616792 1 queuejob_controller_ex.go:1752] [Informer-addQJ] enqueue defaultaw-schd-spec-with-timeout-191 &qj=0xc002a8a500 Version=4842895 Status={Pending:0 Running:0 Succeeded:0 Failed:0 MinAvailable:0 CanRun:false IsDispatched:false State: Message: SystemPriority:9 QueueJobState:Init ControllerFirstTimestamp:2023-06-27 11:04:45.616785091 +0000 UTC m=+8.773521703 ControllerFirstDispatchTimestamp:0001-01-01 00:00:00 +0000 UTC FilterIgnore:false Sender: Local:false Conditions:[{Type:Init Status:True LastUpdateMicroTime:2023-06-27 11:04:45.616785429 +0000 UTC m=+8.773522036 LastTransitionMicroTime:2023-06-27 11:04:45.616785482 +0000 UTC m=+8.773522079 Reason: Message:}] PendingPodConditions:[]}
I0627 11:04:45.616856 1 queuejob_controller_ex.go:2403] [getAppWrapper] geting a copy of 'default/defaultaw-schd-spec-with-timeout-191' when called by '[syncQueueJob] get fresh appwrapper '.
I0627 11:04:45.616861 1 queuejob_controller_ex.go:2410] [getAppWrapper] get a copy of 'default/defaultaw-schd-spec-with-timeout-191' suceeded when called by '[syncQueueJob] get fresh appwrapper '
I0627 11:04:45.616895 1 queuejob_controller_ex.go:1475] [updateStatusInEtcd] trying to update 'default/defaultaw-schd-spec-with-timeout-191' called by 'manageQueueJob - setQueueing'
E0627 11:04:45.618282 1 queuejob_controller_ex.go:2091] [manageQueueJob] Failed to updated etcd for AppWrapper Job 'default/defaultaw-schd-spec-with-timeout-191', err=appwrappers.mcad.ibm.com "defaultaw-schd-spec-with-timeout-191" not found
E0627 11:04:45.618295 1 queuejob_controller_ex.go:1924] [worker] Failed to sync AppWrapper 'default/defaultaw-schd-spec-with-timeout-191', err &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"appwrappers.mcad.ibm.com \"defaultaw-schd-spec-with-timeout-191\" not found", Reason:"NotFound", Details:(*v1.StatusDetails)(0xc00668e1e0), Code:404}}
W0627 11:04:45.618314 1 queuejob_controller_ex.go:1933] [worker] Fail to process item from eventQueue, err appwrappers.mcad.ibm.com "defaultaw-schd-spec-with-timeout-191" not found. Attempting to re-enqueque...
W0627 11:04:45.618321 1 queuejob_controller_ex.go:1937] [worker] Item re-enqueued%!(EXTRA *errors.StatusError=appwrappers.mcad.ibm.com "defaultaw-schd-spec-with-timeout-191" not found)
I0627 11:04:45.618342 1 queuejob_controller_ex.go:2403] [getAppWrapper] geting a copy of 'default/defaultaw-schd-spec-with-timeout-191' when called by '[syncQueueJob] get fresh appwrapper '.
I0627 11:04:45.618350 1 queuejob_controller_ex.go:2410] [getAppWrapper] get a copy of 'default/defaultaw-schd-spec-with-timeout-191' suceeded when called by '[syncQueueJob] get fresh appwrapper '

Image used: podman pull quay.io/asmalvan/z103cb_issue297

z103cb added 2 commits July 5, 2023 12:09
Fixed warning in go.mod file
Fixed failed test
Log message improvements
@asm582
Member

asm582 commented Jul 5, 2023

I restarted the build; it was still failing with some updateEtcd failures.

@z103cb z103cb requested a review from astefanutti July 6, 2023 18:50
@tardieu tardieu mentioned this pull request Jul 10, 2023
Member

@asm582 asm582 left a comment

All test cases pass on my laptop. The "Scheduling fail fast" test case now passes with these changes.

@asm582
Member

asm582 commented Jul 12, 2023

All tests pass except one. That test is pending by design: it submits a bad AW, and the system rejects it. The test is shown below:

  Create AppWrapper  - Bad Generic Item Only
  /home/travis/gopath/src/github.com/project-codeflare/multi-cluster-app-dispatcher/test/e2e/queue.go:356
------------------------------
P [PENDING]

The test should be changed so that it passes when such an AW is never dispatched. Until the test is changed, we need to figure out how to stop failing builds because of pending tests.

@asm582 asm582 merged commit 089cf9f into project-codeflare:main Jul 12, 2023
asm582 added a commit that referenced this pull request Jul 12, 2023
3 participants