✨ release leader election lease on manager cancellation #1689

joelanford
Member

Description

Fixes #1687

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@joelanford joelanford requested a review from a team as a code owner January 31, 2025 21:27

netlify bot commented Jan 31, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit c83e3ed
🔍 Latest deploy log https://app.netlify.com/sites/olmv1/deploys/679d5b4c40d8ad0008a37224
😎 Deploy Preview https://deploy-preview-1689--olmv1.netlify.app

@joelanford joelanford added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2025
@joelanford
Member Author

I added the do-not-merge/work-in-progress label. Let's get consensus that this is a safe change before merge. I know @camilamacedo86 wanted to go over the details more closely.

codecov bot commented Jan 31, 2025

Codecov Report

Attention: Patch coverage is 46.15385% with 7 lines in your changes missing coverage. Please review.

Project coverage is 67.74%. Comparing base (9b08aea) to head (c83e3ed).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
catalogd/cmd/catalogd/main.go 0.00% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1689      +/-   ##
==========================================
- Coverage   67.74%   67.74%   -0.01%     
==========================================
  Files          57       57              
  Lines        4620     4622       +2     
==========================================
+ Hits         3130     3131       +1     
- Misses       1265     1266       +1     
  Partials      225      225              
Flag Coverage Δ
e2e 53.36% <100.00%> (-0.08%) ⬇️
unit 54.45% <0.00%> (-0.05%) ⬇️

@joelanford
Member Author

Ha. The latest release doesn't release the lease. So we'll have to make the changes in the main.go files, then release again, then revert the e2e changes.

@joelanford joelanford force-pushed the leader-election-release-on-cancel branch from a008922 to c83e3ed Compare January 31, 2025 23:22
@perdasilva
Contributor

Just for my understanding.

The root-cause was something like this:

  • deploy latest release
  • an o-c and a catd controller get elected as leaders
  • upgrade olm by patching o-c and catd deployments with new changes
  • old o-c and catd pods get killed, but hold on to the lease for up to the (now longer) timeout
  • new o-c and catd pods come up, but are waiting longer to begin operations as they wait to get the lease -> we blow out the eventually timeout

The proposed fix is to set LeaderElectionReleaseOnCancel to true. This means that if the pod is shut down properly, it will give up its lease. Is that right? What are the risks?

Member

@LalatenduMohanty LalatenduMohanty left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 3, 2025
@joelanford
Member Author

This means that if the pod is shut down properly, it will give up its lease. Is that right? What are the risks?

Correct, and if it is not shut down properly, it likely won't give up its lease because giving up its lease is the last thing it does.

The main risk is that we go out of our way to run a separate goroutine that does leader-y things whose lifetime goes beyond that of the manager. If we do that, then the manager will release the lease, some other process could assume leadership, and then our other goroutine and the new leader will be duking it out.

I don't think that's a big risk though.
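
Illustrative sketch (not part of the PR): one way to avoid the risk described above is to tie any leader-only goroutine to the same context as the manager and wait for it to return before the lease is given up. `runThenRelease` and `shutdownOrder` are hypothetical helpers written with the standard library only; they are not controller-runtime APIs, and the `release` callback merely stands in for the lease release.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// runThenRelease is a hypothetical helper, not controller-runtime code.
// It runs leader-only work, waits for cancellation, waits for the work to
// return, and only then calls release.
func runThenRelease(ctx context.Context, work func(context.Context), release func()) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		work(ctx)
	}()
	<-ctx.Done() // shutdown requested
	wg.Wait()    // leader work has fully stopped
	release()    // now another process may safely take over
}

// shutdownOrder simulates one graceful shutdown and returns the event order.
func shutdownOrder() []string {
	var mu sync.Mutex
	var events []string
	record := func(e string) {
		mu.Lock()
		defer mu.Unlock()
		events = append(events, e)
	}

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // stand-in for SIGTERM cancelling the manager's context

	runThenRelease(ctx,
		func(ctx context.Context) { <-ctx.Done(); record("leader work stopped") },
		func() { record("lease released") },
	)
	return events
}

func main() {
	fmt.Println(shutdownOrder())
}
```

A goroutine started with its own background context would not be covered by the wait, which is the "duking it out" case: it could still be running after the release step.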

Contributor

@camilamacedo86 camilamacedo86 left a comment

👍

@camilamacedo86
Contributor

/lgtm

Metrics: metricsServerOptions,
HealthProbeBindAddress: probeAddr,
LeaderElection: enableLeaderElection,
LeaderElectionID: "9c4404e7.operatorframework.io",
Contributor

Nit: Could we not use operator-controller-lock here as well like we have in catalogd?
Maybe for another PR

Member Author

I don't think we can do that, because then old code and new code could both get elected leader under different leader election IDs.
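
Illustrative sketch (not part of the PR): the concern can be shown with a toy in-memory model in which each leader election ID maps to its own lease. The `leases` type and `tryAcquire` method are hypothetical and much simpler than real lease handling; the two IDs are the ones mentioned in this thread, and the pod names are made up.

```go
package main

import "fmt"

// leases is a toy in-memory stand-in for lease objects, keyed by leader
// election ID. The value is the current holder.
type leases map[string]string

// tryAcquire grants the lease for id to candidate only if nobody holds it.
func (l leases) tryAcquire(id, candidate string) bool {
	if _, held := l[id]; held {
		return false
	}
	l[id] = candidate
	return true
}

func main() {
	l := leases{}

	// Old code is running and holds the lease under the existing ID.
	fmt.Println(l.tryAcquire("9c4404e7.operatorframework.io", "old-pod"))

	// New code with a renamed ID contends for a different lease, so it is
	// also elected while the old pod is still leader.
	fmt.Println(l.tryAcquire("operator-controller-lock", "new-pod"))

	// With the ID unchanged, the new pod would have been refused.
	fmt.Println(l.tryAcquire("9c4404e7.operatorframework.io", "new-pod"))
}
```

Because mutual exclusion only holds per ID, renaming the ID during a rolling upgrade leaves a window with two leaders in this model.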

@joelanford joelanford removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2025
@joelanford joelanford added this pull request to the merge queue Feb 3, 2025
Merged via the queue into operator-framework:main with commit 1a52e2e Feb 3, 2025
20 of 22 checks passed
@joelanford joelanford deleted the leader-election-release-on-cancel branch February 14, 2025 19:43
Labels
lgtm Indicates that a PR is ready to be merged.
Development

Successfully merging this pull request may close these issues.

Release leader election lease on shutdown
4 participants