
Conversation

smarterclayton
Contributor

@smarterclayton smarterclayton commented Mar 25, 2019

The sync loop currently uses a mix of:

  1. Try forever (top level sync loop)
  2. Cancel sync when a timeout is reached
  3. Retry individual Apply actions in the sync workers up to a max
    retry
  4. Retry within Apply actions

This led to bugs in initialization where transient errors caused a prereq
step to fail and never complete, leaving other operators unable to go live.

This commit simplifies the structure of our retry logic to be:

  1. Try to sync forever
  2. Cancel sync after a certain period of failure and go into backoff
  3. Retry individual Applies forever with capped backoff
  4. Retry within Apply actions as long as the context is not cancelled

This implies that every top-level sync has a context with a deadline, which is
the case today; some test cases are changed to account for that.

This should also reduce the lag between cancel (on user action, for
instance) and the sync loop terminating, since everything is context-gated.
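
Roughly, the retry shape for an individual Apply looks like the sketch below. This is illustrative only, not the actual worker code; the retryApply name, applyFn signature, and backoff constants are placeholders.

package retry

import (
	"context"
	"time"
)

// retryApply retries a single apply until it succeeds or the sync context is
// cancelled, with a capped exponential delay between attempts.
func retryApply(ctx context.Context, applyFn func(context.Context) error) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 15 * time.Second

	for {
		if err := applyFn(ctx); err == nil {
			return nil
		}
		// Stop as soon as the caller's context (which carries the sync
		// deadline) is done; otherwise wait out the current backoff.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		// Double the delay between attempts, but never exceed the cap.
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}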

Also adds a Makefile because I like consistency.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1691513

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 25, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2019
@smarterclayton
Contributor Author

@abhinavdahiya with this the sync loop is context aware all the way down and retries until context is cancelled. Retry and bail didn't have much value for us anyway except for status updates and we should gather those on cancel (I need to verify that and we can make apply report partial status if necessary).

@smarterclayton smarterclayton changed the title Bug 1691513: sync should retry until cancelled Bug 1691513: CVO should retry initialization until cancelled to avoid wedging because of failing dependencies Mar 25, 2019
@wking
Member

wking commented Mar 25, 2019

unit:

E0325 19:56:04.262847    4132 task.go:68] error running apply for test "file-yml" (3 of 3): unable to proceed
SIGQUIT: quit

@abhinavdahiya
Contributor

9221dec Also adds a Makefile because I like consistency. :( but ok!
1212873 LGTM

@abhinavdahiya with this the sync loop is context aware all the way down and retries until context is cancelled. Retry and bail didn't have much value for us anyway except for status updates and we should gather those on cancel

The context we have is the global one (the leader-election ctx); I didn't see any deadlines.

go optr.configSync.Start(ctx, 16)

ctx, cancelFn := context.WithCancel(ctx)

return w.syncOnce(ctx, work, maxWorkers, reporter)

return w.apply(ctx, w.payload, work, maxWorkers, reporter)

err := payload.RunGraph(ctx, graph, maxWorkers, func(ctx context.Context, tasks []*payload.Task) error {

if err := task.Run(ctx, version, w.builder, work.State); err != nil {

(I need to verify that and we can make apply report partial status if necessary).

Yeah, we will only see status updates on success or global cancel...

@smarterclayton
Contributor Author

The context we have is the global one (the leader-election ctx); I didn't see any deadlines.

There should have been one; I'm going to add that. It was meaningless while we looped, but it is meaningful now.
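
Something like the sketch below is the rough shape of that change: derive a per-attempt deadline from the long-lived leader-election context before starting a sync, so every context-gated retry underneath it stops when the attempt times out. The syncTimeout value and the syncOnce signature here are placeholders, not the merged code.

package retry

import (
	"context"
	"time"
)

// runSyncAttempt gives one top-level sync attempt its own deadline derived
// from the long-lived (leader-election) context.
func runSyncAttempt(parent context.Context, syncOnce func(context.Context) error) error {
	const syncTimeout = 5 * time.Minute // placeholder value, not the real number

	ctx, cancel := context.WithTimeout(parent, syncTimeout)
	defer cancel() // release the deadline timer even on early return

	return syncOnce(ctx)
}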

Also adds a Makefile because I like consistency. :( but ok!

Every other repo has a simple one; it helps orient people and defines some minimum carry-over between them. I don't expect it to grow or be heavy, but it saves time for folks working across every repo.

@smarterclayton
Contributor Author

Yeah, we will only see status updates on success or global cancel...

We'll see a progress percentage that increases with every group; it's only crash loops that will be delayed. I'll tune the sync loop timeout based on that progress, with upgrade distinct from sync distinct from initializing.

Simply covers the existing common actions
Will be used by test infra to cancel sync.
@smarterclayton
Contributor Author

For updates we may want to keep going until we see repeated errors, and then report those up. One thing I think we need is to summarize the errors from multiple nodes in the graph into a stable output message that doesn't flip back and forth. Reconcile could also accumulate every error that occurs during a pass, without retries, and then summarize any failures in the Progressing message, potentially moving to Failing if two successive syncs have any errors.
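
A minimal sketch of that kind of stable summarization (illustrative only, not CVO code; all names here are placeholders): deduplicate and sort the messages from one pass so the condition text doesn't change with graph traversal order.

package status

import (
	"sort"
	"strings"
)

// summarizeErrors deduplicates and sorts the error messages from one pass and
// joins them into a single, order-stable summary string.
func summarizeErrors(errs []error) string {
	seen := map[string]struct{}{}
	msgs := make([]string, 0, len(errs))
	for _, err := range errs {
		if err == nil {
			continue
		}
		if _, ok := seen[err.Error()]; ok {
			continue // repeated task failures appear only once
		}
		seen[err.Error()] = struct{}{}
		msgs = append(msgs, err.Error())
	}
	sort.Strings(msgs) // stable ordering regardless of graph traversal order
	return strings.Join(msgs, "; ")
}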

Initialization should report no-progress after 10 minutes, upgrade
should report faster, and we should document that reconciling needs
to be improved to do a pass over the content. The timeouts are
higher than the current numbers.
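
Roughly, the per-mode timeout choice described in that commit message could look like the sketch below. Only the 10-minute initialization value comes from the message above; the other durations, the mode type, and the function name are placeholders.

package cvo

import "time"

type mode int

const (
	initializing mode = iota
	updating
	reconciling
)

// noProgressTimeout picks how long a sync may run without progress before we
// report it, depending on what the operator is doing.
func noProgressTimeout(m mode) time.Duration {
	switch m {
	case initializing:
		return 10 * time.Minute // initialization reports no-progress after 10 minutes
	case updating:
		return 5 * time.Minute // upgrades should report faster (placeholder value)
	default:
		return 2 * time.Minute // reconciling (placeholder value)
	}
}
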
@smarterclayton
Contributor Author

We don't need to exit sync in order to report status, but we should keep in mind that we want to guarantee two things during reconcile (rough sketch after the list):

  1. we eventually try every object
  2. we want to guarantee an upper bound on the interval between any two reconcile attempts of an object
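
A rough sketch of a reconcile pass that gives both guarantees (illustrative only; applyAll and the interval are placeholders): every pass walks every object, and a new pass starts on a fixed interval, so the gap between two attempts at any one object is bounded by the interval plus the duration of a pass.

package reconcile

import (
	"context"
	"time"
)

// reconcileLoop walks the whole payload every interval until the context is
// cancelled.
func reconcileLoop(ctx context.Context, interval time.Duration, applyAll func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		// Walk every object without bailing on the first error, so a single
		// failing object cannot starve the rest of the payload.
		_ = applyAll(ctx)

		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}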

@abhinavdahiya
Contributor

838794a LGTM

/retest

@abhinavdahiya
Contributor

/test e2e-aws

1 similar comment
@abhinavdahiya
Contributor

/test e2e-aws

@wking
Member

wking commented Mar 27, 2019

e2e-aws:

Failing tests:

[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] subPath should be able to unmount after the subpath directory is deleted [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Inline-volume (default fs)] subPath should be able to unmount after the subpath directory is deleted [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Pre-provisioned PV (default fs)] subPath should be able to unmount after the subpath directory is deleted [Suite:openshift/conformance/parallel] [Suite:k8s]

/retest

@wking
Member

wking commented Mar 27, 2019

e2e-aws hit openshift/origin#22412 again.

@wking
Member

wking commented Mar 27, 2019

openshift/origin#22412 landed.

/retest

@wking
Member

wking commented Mar 27, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 2b66cef into openshift:master Mar 27, 2019
@wking
Member

wking commented Mar 27, 2019

🎉
