
Conversation

smarterclayton
Contributor

@smarterclayton smarterclayton commented Mar 25, 2019

The sync loop currently uses a mix of:

  1. Try forever (top level sync loop)
  2. Cancel sync when a timeout is reached
  3. Retry individual Apply actions in the sync workers up to a max
    retry
  4. Retry within Apply actions

This led to bugs in initialization where transient errors caused a prereq
step to fail and never complete, leaving other operators unable to go live.

This commit simplifies the structure of our retry logic to be:

  1. Try to sync forever
  2. Cancel sync after a certain period of failure and go into backoff
  3. Retry individual Applies forever with capped backoff
  4. Retry within Apply actions as long as the context is not cancelled

This implies that every top-level sync has a context with a deadline, which is
the case today; some test cases are changed to account for that.

This should also reduce the lag between cancel (on user action, for
instance) and the sync loop terminating, since everything is context-gated.
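
Roughly, the retry shape for an individual Apply looks like the sketch below. This is illustrative only, not the actual worker code; the retryApply name, applyFn signature, and backoff constants are placeholders.

package retry

import (
	"context"
	"time"
)

// retryApply retries a single apply until it succeeds or the sync context is
// cancelled, with a capped exponential delay between attempts.
func retryApply(ctx context.Context, applyFn func(context.Context) error) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 15 * time.Second

	for {
		if err := applyFn(ctx); err == nil {
			return nil
		}
		// Stop as soon as the caller's context (which carries the sync
		// deadline) is done; otherwise wait out the current backoff.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		// Double the delay between attempts, but never exceed the cap.
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}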

Also adds a Makefile because I like consistency.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1691513

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 25, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2019
@smarterclayton
Contributor Author

@abhinavdahiya with this the sync loop is context aware all the way down and retries until context is cancelled. Retry and bail didn't have much value for us anyway except for status updates and we should gather those on cancel (I need to verify that and we can make apply report partial status if necessary).

@smarterclayton smarterclayton changed the title Bug 1691513: sync should retry until cancelled Bug 1691513: CVO should retry initialization until cancelled to avoid wedging because of failing dependencies Mar 25, 2019
@wking
Member

wking commented Mar 25, 2019

unit:

E0325 19:56:04.262847    4132 task.go:68] error running apply for test "file-yml" (3 of 3): unable to proceed
SIGQUIT: quit

@abhinavdahiya
Contributor

9221dec Also adds a Makefile because I like consistency. :( but ok!
1212873 LGTM

@abhinavdahiya with this the sync loop is context aware all the way down and retries until context is cancelled. Retry and bail didn't have much value for us anyway except for status updates and we should gather those on cancel

The context we have is the global one (the leader-election ctx); I didn't see any deadlines.

go optr.configSync.Start(ctx, 16)

ctx, cancelFn := context.WithCancel(ctx)

return w.syncOnce(ctx, work, maxWorkers, reporter)

return w.apply(ctx, w.payload, work, maxWorkers, reporter)

err := payload.RunGraph(ctx, graph, maxWorkers, func(ctx context.Context, tasks []*payload.Task) error {

if err := task.Run(ctx, version, w.builder, work.State); err != nil {

(I need to verify that and we can make apply report partial status if necessary).

Yeah, we will only see status updates on success or global cancel...

@smarterclayton
Contributor Author

The context we have is the global one (the leader-election ctx); I didn't see any deadlines.

There should have been one; I'm going to add that. It was meaningless while we looped, but it is meaningful now.
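
Something like the sketch below is the rough shape of that change: derive a per-attempt deadline from the long-lived leader-election context before starting a sync, so every context-gated retry underneath it stops when the attempt times out. The syncTimeout value and the syncOnce signature here are placeholders, not the merged code.

package retry

import (
	"context"
	"time"
)

// runSyncAttempt gives one top-level sync attempt its own deadline derived
// from the long-lived (leader-election) context.
func runSyncAttempt(parent context.Context, syncOnce func(context.Context) error) error {
	const syncTimeout = 5 * time.Minute // placeholder value, not the real number

	ctx, cancel := context.WithTimeout(parent, syncTimeout)
	defer cancel() // release the deadline timer even on early return

	return syncOnce(ctx)
}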

Also adds a Makefile because I like consistency. :( but ok!

Every other repo has a simple one; it helps orient people and defines some minimum carry-over between them. I don't expect it to grow or be heavy, but it saves time for folks working across every repo.

@smarterclayton
Contributor Author

Yeah, we will only see status updates on success or global cancel...

We'll see a progress percentage that increases with every group; it's only crash loops that will be delayed. I'll tune the sync loop timeout based on that progress, with upgrade distinct from sync distinct from initializing.

Simply covers the existing common actions
Will be used by test infra to cancel sync.
@smarterclayton
Contributor Author

For updates we may want to keep going until we see repeated errors, and then report those up. One thing I think we need is to summarize the errors from multiple nodes in the graph into a stable output message that doesn't flip back and forth. Reconcile could also accumulate every error that occurs during a pass, without retries, and then summarize any failures in the Progressing message, potentially moving to Failing if two successive syncs have any errors.
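
A minimal sketch of that kind of stable summarization (illustrative only, not CVO code; all names here are placeholders): deduplicate and sort the messages from one pass so the condition text doesn't change with graph traversal order.

package status

import (
	"sort"
	"strings"
)

// summarizeErrors deduplicates and sorts the error messages from one pass and
// joins them into a single, order-stable summary string.
func summarizeErrors(errs []error) string {
	seen := map[string]struct{}{}
	msgs := make([]string, 0, len(errs))
	for _, err := range errs {
		if err == nil {
			continue
		}
		if _, ok := seen[err.Error()]; ok {
			continue // repeated task failures appear only once
		}
		seen[err.Error()] = struct{}{}
		msgs = append(msgs, err.Error())
	}
	sort.Strings(msgs) // stable ordering regardless of graph traversal order
	return strings.Join(msgs, "; ")
}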

Initialization should report no-progress after 10 minutes, upgrade
should report faster, and we should document that reconciling needs
to be improved to do a pass over the content. The timeouts are
higher than the current numbers.
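
Roughly, the per-mode timeout choice described in that commit message could look like the sketch below. Only the 10-minute initialization value comes from the message above; the other durations, the mode type, and the function name are placeholders.

package cvo

import "time"

type mode int

const (
	initializing mode = iota
	updating
	reconciling
)

// noProgressTimeout picks how long a sync may run without progress before we
// report it, depending on what the operator is doing.
func noProgressTimeout(m mode) time.Duration {
	switch m {
	case initializing:
		return 10 * time.Minute // initialization reports no-progress after 10 minutes
	case updating:
		return 5 * time.Minute // upgrades should report faster (placeholder value)
	default:
		return 2 * time.Minute // reconciling (placeholder value)
	}
}
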
@smarterclayton
Contributor Author

We don't need to exit sync in order to report status, but we should keep in mind that we want to guarantee two things during reconcile (rough sketch after the list):

  1. we eventually try every object
  2. we want to guarantee an upper bound on the interval between any two reconcile attempts of an object
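
A rough sketch of a reconcile pass that gives both guarantees (illustrative only; applyAll and the interval are placeholders): every pass walks every object, and a new pass starts on a fixed interval, so the gap between two attempts at any one object is bounded by the interval plus the duration of a pass.

package reconcile

import (
	"context"
	"time"
)

// reconcileLoop walks the whole payload every interval until the context is
// cancelled.
func reconcileLoop(ctx context.Context, interval time.Duration, applyAll func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		// Walk every object without bailing on the first error, so a single
		// failing object cannot starve the rest of the payload.
		_ = applyAll(ctx)

		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}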

@abhinavdahiya
Contributor

838794a LGTM

/retest

@abhinavdahiya
Contributor

/test e2e-aws

1 similar comment
@abhinavdahiya
Contributor

/test e2e-aws

@wking
Member

wking commented Mar 27, 2019

e2e-aws:

Failing tests:

[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] subPath should be able to unmount after the subpath directory is deleted [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Inline-volume (default fs)] subPath should be able to unmount after the subpath directory is deleted [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Pre-provisioned PV (default fs)] subPath should be able to unmount after the subpath directory is deleted [Suite:openshift/conformance/parallel] [Suite:k8s]

/retest

@wking
Member

wking commented Mar 27, 2019

e2e-aws hit openshift/origin#22412 again.

@wking
Member

wking commented Mar 27, 2019

openshift/origin#22412 landed.

/retest

@wking
Member

wking commented Mar 27, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 2b66cef into openshift:master Mar 27, 2019
@wking
Member

wking commented Mar 27, 2019

🎉
