Bug 1691513: CVO should retry initialization until cancelled to avoid wedging because of failing dependencies #141
Conversation
@abhinavdahiya with this, the sync loop is context-aware all the way down and retries until the context is cancelled. Retry-and-bail didn't have much value for us anyway except for status updates, and we should gather those on cancel (I need to verify that, and we can make apply report partial status if necessary).
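For illustration, a minimal Go sketch of a sync attempt that retries until its context is cancelled; the names `runSync` and `syncOnce` are hypothetical stand-ins, not the CVO's actual API:

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// runSync retries syncOnce until it succeeds or ctx is cancelled,
// instead of bailing out after a fixed number of attempts.
func runSync(ctx context.Context, syncOnce func(context.Context) error) error {
	for {
		err := syncOnce(ctx)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Report the last sync error alongside the cancellation so
			// status can still be gathered on cancel.
			return fmt.Errorf("sync cancelled: %v (last error: %v)", ctx.Err(), err)
		case <-time.After(5 * time.Second):
			// Transient failure: pause briefly, then try again.
		}
	}
}
```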
unit:
Force-pushed from 3d7d6bc to b08c55c (compare)
9221dec
The context we have is the global one (the leader-election ctx); I didn't see any deadlines. (See cluster-version-operator/pkg/cvo/cvo.go, line 219, at 3d7d6bc.)
Yeah, we will only see status updates on success or global cancel...
There should have been one; I'm going to add that. It was meaningless while we looped, but it is meaningful now.
Every other repo has a simple one; it helps orient and defines some minimum carry-over between them. I don't expect it to grow or be heavy, but it saves time for folks working across all of them.
We'll see a progress percentage for every group that increases; it's only crash loops that will be delayed. I'll tune the sync loop timeout based on that progress, treating upgrade as distinct from sync and from initializing.
Simply covers the existing common actions
Will be used by test infra to cancel sync.
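As a rough sketch of what that enables (assumed names, not the actual test code), the test harness owns the cancel func and can stop a running sync at any point:

```go
package sketch

import (
	"context"
	"time"
)

// startSync stands in for the operator's sync loop; it runs until its
// context is cancelled.
func startSync(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return // the test cancelled us; exit promptly
		case <-time.After(time.Second):
			// one sync pass
		}
	}
}

// exampleCancelFromTest shows the harness driving and then cancelling sync.
func exampleCancelFromTest() {
	ctx, cancel := context.WithCancel(context.Background())
	go startSync(ctx)
	time.Sleep(3 * time.Second) // drive the scenario under test
	cancel()                    // startSync observes ctx.Done() and exits
}
```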
The sync loop currently mixes a combination of:

1. Try forever (top level sync loop)
2. Cancel sync when a timeout is reached
3. Retry individual Apply actions in the sync workers up to a max retry
4. Retry within Apply actions

This led to bugs in initialization where transient errors cause a prereq step to fail and never complete, leading to other operators never going live.

This commit simplifies the structure of our retry logic to be:

1. Try to sync forever
2. Cancel sync after a certain period of failure and go into backoff
3. Retry individual Applies forever with capped backoff
4. Retry applies as long as context is not cancelled

This implies all top level sync has a context with a deadline, which is the case today; some test cases are changed for it.

This should also reduce the lag between cancel (on user action, for instance) and the sync loop terminating, since everything is context gated.
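A minimal sketch of points 3 and 4 above, assuming a per-node Apply function (`applyWithBackoff` and `apply` are illustrative names): each Apply retries forever with capped exponential backoff, and the only way out is success or a cancelled context, which carries the top-level sync deadline:

```go
package sketch

import (
	"context"
	"time"
)

// applyWithBackoff retries a single Apply action forever with capped
// exponential backoff, stopping only when ctx is cancelled.
func applyWithBackoff(ctx context.Context, apply func(context.Context) error) error {
	delay := time.Second
	const maxDelay = 30 * time.Second
	for {
		err := apply(ctx)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// The deadline from the top-level sync (or a user cancel)
			// is the only thing that stops the retries.
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```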
Force-pushed from b08c55c to 005e1fa (compare)
For updates we may want to keep going until we see repeated errors, and then report those up. One thing I think we need is to summarize the errors from multiple nodes in the graph into a stable output message that isn't going back and forth. Reconcile could also accumulate all the errors that occur during a pass, without retries, and then summarize any failures in the progressing message, potentially moving to failing if two successive syncs have any errors.
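One way to get a stable message, as a hedged sketch (`summarizeErrors` is a hypothetical helper, not what the CVO ships): deduplicate and sort the per-node errors so successive syncs that hit the same failures produce the same string:

```go
package sketch

import (
	"fmt"
	"sort"
	"strings"
)

// summarizeErrors collapses errors from multiple graph nodes into one
// deterministic message, so the condition text doesn't flap between
// syncs that hit the same failures in a different order.
func summarizeErrors(errs []error) string {
	if len(errs) == 0 {
		return ""
	}
	seen := map[string]bool{}
	msgs := make([]string, 0, len(errs))
	for _, err := range errs {
		if m := err.Error(); !seen[m] {
			seen[m] = true
			msgs = append(msgs, m)
		}
	}
	sort.Strings(msgs) // stable ordering across syncs
	if len(msgs) == 1 {
		return msgs[0]
	}
	return fmt.Sprintf("%d distinct errors: %s", len(msgs), strings.Join(msgs, "; "))
}
```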
Initialization should report no-progress after 10 minutes, upgrade should report faster, and we should document that reconciling needs to be improved to do a pass over the content. The timeouts are higher than the current numbers.
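A sketch of how per-mode deadlines could be derived from the global leader-election context; only the 10-minute initialization window comes from the comment above, and the other values are placeholder assumptions:

```go
package sketch

import (
	"context"
	"time"
)

// Per-mode no-progress windows. Only the initialization value is from
// the discussion above; the others are placeholders.
const (
	initializingTimeout = 10 * time.Minute
	upgradeTimeout      = 5 * time.Minute
	reconcileTimeout    = 5 * time.Minute
)

// syncContext derives a per-attempt deadline from the global context,
// so each mode reports no-progress on its own schedule instead of
// waiting for a global cancel.
func syncContext(parent context.Context, initializing, upgrading bool) (context.Context, context.CancelFunc) {
	switch {
	case initializing:
		return context.WithTimeout(parent, initializingTimeout)
	case upgrading:
		return context.WithTimeout(parent, upgradeTimeout)
	default:
		return context.WithTimeout(parent, reconcileTimeout)
	}
}
```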
Force-pushed from 005e1fa to 838794a (compare)
We don't need to exit sync in order to report status, but we should keep in mind that we want to guarantee two things during reconcile:
838794a LGTM.
/retest
/test e2e-aws
/retest
e2e-aws hit openshift/origin#22412 again. |
openshift/origin#22412 landed. /retest |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: smarterclayton, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
🎉
The sync loop currently mixes a combination of:

1. Try forever (top level sync loop)
2. Cancel sync when a timeout is reached
3. Retry individual Apply actions in the sync workers up to a max retry
4. Retry within Apply actions
This led to bugs in initialization where transient errors cause a prereq
step to fail and never complete, leading to other operators never going
live.
This commit simplifies the structure of our retry logic to be:

1. Try to sync forever
2. Cancel sync after a certain period of failure and go into backoff
3. Retry individual Applies forever with capped backoff
4. Retry applies as long as context is not cancelled
This implies all top level sync has a context with a deadline, which is the case today; some test cases are changed for it.
This should also reduce the lag between cancel (on user action, for
instance) and the sync loop terminating, since everything is context gated.
Also adds a Makefile because I like consistency.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1691513