
OCPBUGS-37982: Bug fix: Reduce Frequency of Update Requests for Copied CSVs #3497


Open · wants to merge 6 commits into master

Conversation

@bentito (Contributor) commented Jan 22, 2025

Description of the change:

Please check out this doc scoping out this change: https://docs.google.com/document/d/1P4cSYEP05vDyuhBfilyuWgL5d5OOD5z7JlAyOxpPqps

In this PR we are resurrecting #3411 with the intent to fix what that PR was originally going to fix. Follow-on work will address the then-revealed problem with metadata.generation|resourceVersion, as per the doc comment by @tmshort

Motivation for the change:
[from the linked doc, "How Did We Get Here":]

  • Original Change for Memory Optimization: Sixteen months ago, we merged a PR in OLMv0 that converted the cache of copied ClusterServiceVersions (CSVs) to use PartialObjectMetadata types instead of caching the full CSV objects. This change was crucial for memory utilization performance gains, enabling OLM to run efficiently on MicroShift, a lightweight Kubernetes distribution designed for edge computing.
  • Limited Access to Spec/Status: By using PartialObjectMetadata, we only have access to the metadata of copied CSVs, not their spec or status fields. This means the operator lacks the information needed to compare the full content of the copied CSVs with the originals.
  • Removal of “Hash and Compare” Logic: The change inadvertently removed a core piece of the “hash and compare” logic. Previously, the operator used annotations containing hashes of the non-status and status fields of the original CSV to determine if a copied CSV needed updating. These annotations were not set on the copied CSVs after the change.
  • Resulting in Excessive Updates: Without the ability to compare hashes, the operator began issuing updates for copied CSVs 100% of the time, regardless of whether they were in sync with the originals. This behavior introduced a high load on the Kubernetes API server, especially in environments with many namespaces and CSVs installed in AllNamespace mode. The increased load also led to higher audit log volumes, impacting users with increased logging costs.

Architectural changes:

  • Reintroducing Annotations: A proposed fix PR adds back the annotations olm.operatorframework.io/nonStatusCopyHash and olm.operatorframework.io/statusCopyHash to the copied CSVs. These annotations store hashes of the non-status and status fields of the original CSV, respectively.
  • Reducing Unnecessary Updates: By comparing these hashes, the operator can determine if the copied CSVs are out of sync with the originals and only issue updates when necessary (a minimal sketch of this comparison follows this list). This reduces the frequency of update requests to the API server and lowers audit log volumes.
  • Uncovering a New Bug: However, reintroducing the hash comparison logic will uncover a bug due to the use of PartialObjectMetadata for caching copied CSVs. Since we only have access to metadata, if a user manually modifies the spec or status of a copied CSV without changing the hash annotations, the operator cannot detect the change. The operator would incorrectly assume the copied CSV is in sync with the original, leading to potential inconsistencies.
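
For illustration, here is a minimal Go sketch of the hash-and-compare flow described above. The annotation keys are the ones named in this PR; the hashOf/needsUpdate helpers and the FNV hashing are assumptions made for the example, not the operator's actual code.

// Minimal sketch of the hash-and-compare idea; helper names are illustrative only.
package main

import (
    "fmt"
    "hash/fnv"
)

const (
    nonStatusCopyHashAnnotation = "olm.operatorframework.io/nonStatusCopyHash"
    statusCopyHashAnnotation    = "olm.operatorframework.io/statusCopyHash"
)

// hashOf stands in for however the operator hashes the relevant CSV fields.
func hashOf(s string) string {
    h := fnv.New64a()
    h.Write([]byte(s))
    return fmt.Sprintf("%x", h.Sum64())
}

// needsUpdate compares the hashes recorded on an existing copied CSV's
// annotations with hashes computed from the original CSV; only a mismatch
// triggers an update request, which is what cuts API server traffic.
func needsUpdate(copiedAnnotations map[string]string, originalNonStatus, originalStatus string) (specOutOfSync, statusOutOfSync bool) {
    specOutOfSync = copiedAnnotations[nonStatusCopyHashAnnotation] != hashOf(originalNonStatus)
    statusOutOfSync = copiedAnnotations[statusCopyHashAnnotation] != hashOf(originalStatus)
    return specOutOfSync, statusOutOfSync
}

func main() {
    copied := map[string]string{
        nonStatusCopyHashAnnotation: hashOf("spec-v1"),
        statusCopyHashAnnotation:    hashOf("status-v1"),
    }
    // The non-status fields are unchanged but the status moved on, so only a
    // status update would be issued for this copy.
    spec, status := needsUpdate(copied, "spec-v1", "status-v2")
    fmt.Println(spec, status) // false true
}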

Testing remarks:

Apart from the expected changes around the inability to track copied CSV changes made by a user, we should be careful to test the following:

  • Cannot Revert Memory Optimization: Reverting the changes that introduced PartialObjectMetadata caching is not feasible. The memory optimization is critical for running OLM on MicroShift and supporting edge computing scenarios where resources are constrained.

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

} else {
// Even if they're the same, ensure the returned prototype is annotated.
prototype.Annotations[statusCopyHashAnnotation] = status
updated = prototype
}
@camilamacedo86 (Contributor) commented Feb 12, 2025

From the code implemented in this PR to the current state, the main addition seems to be this else block (beyond tests).

I’m not entirely sure I fully understand: are we also looking to implement what’s outlined in the Proposed Fixes section of this document? How/where are we addressing the concerns raised in the "Why don’t we just merge the [fix PR](https://github.com/operator-framework/operator-lifecycle-manager/pull/3411) as-is?" section?

Contributor Author

This is a first pass: basically, just merge the old PR. With this PR we're taking the path of #4 in the scoping doc: merge the PR with some possible problems, which should be a minor use case (users changing the copied CSVs).

But the else is not the only thing done here; the main thing added is the tracking hashes, so we can tell what's in need of update.

@camilamacedo86 (Contributor) commented Mar 7, 2025

this is first pass, basically, just merge the old PR.

We previously agreed that the old PR wasn't quite the right approach, correct? Given that, I’m not sure it makes sense to merge it as-is.

If we need to do a release before we have the proper solution in place, we might include a change we don’t want. That doesn’t seem ideal to me. That is a case where I would request changes, since it does not provide the desired solution or fix the problem as defined in the doc. Note that the doc has a section about this: "Why don’t we just merge the PR as it is?"

It’s fine to add it as you did, but what do you think about creating a commit on top with the solution we intend to use? Could we focus on implementing the correct fix for the bug instead?
Is there any reason we need to merge the old PR without the correct fix? Can we not do that as proposed?

c/c @tmshort

Contributor Author

Please see @tmshort's comment on the doc. The idea being: merging this PR is a first step, it gives some relief, then we'll make another pass after this settles. Settling involves seeing much less API activity, especially on clusters with many namespaces for the CSV to be copied to.

Contributor Author

And settling also involves seeing whether the things mentioned in the doc, primarily OLM not correcting user-modified copied CSVs, will be a real-world problem.

Collaborator

My understanding is we have two problems:

  1. spamming logs and api server requests
  2. changes to copied csvs won't be detected and therefore will linger

The proposed approach is to separate these two problems by resolving 1 (which has a material impact on the cluster and customer's bottom line) and later handling 2.

It's my understanding that changes to copied csvs don't carry any behavioral changes in the system anyway. They only exist to make it possible to discover which operators are available in a particular namespace with a kubectl command. Also, I'd assume that write access to CSVs will be restricted in most real world cases to the cluster admin and the namespace admin. If these two assumptions hold, I think the blast radius of modifying the copied CSV and not having it reconcile back to the intended spec should be pretty small. So, I tend to agree with the approach here. Let's address the big problem of api server/log spamming, then worry about the relatively small problem of inconsistent copied csvs.

@tmshort (Contributor) left a comment

I think there might be some simplification that can be done with the setting of the status/nonstatus annotations.

Contributor

I'm assuming all the changes here are due to lint?

Contributor Author

yeah, and I just ran make lint locally to make sure nothing changed. nothing changed.

@@ -803,6 +808,7 @@ func (a *Operator) copyToNamespace(prototype *v1alpha1.ClusterServiceVersion, ns

existing, err := a.copiedCSVLister.Namespace(nsTo).Get(prototype.GetName())
if apierrors.IsNotFound(err) {
prototype.Annotations[nonStatusCopyHashAnnotation] = nonstatus
Contributor

Because copyToNamespace is called in a loop, prototype, being a pointer, is reused multiple times, which means that these annotations may already be set. Is there any reason why these annotations simply aren't set in ensureCSVsInNamespaces(), where the hashes are calculated?

Contributor Author

good point possibly. checking...

Contributor Author

So looking at it closer it seems like we shouldn't change it, here's my reasoning:

Keeping the annotation logic here, in copyToNamespace(), encapsulates the update semantics so each call handles its own CSV's state reliably.

We're reusing prototype and accounting for possibly already-set annotations. If we move the logic to ensureCSVsInNamespaces(), we'll have to duplicate the annotation-checking logic, because the logic for handling those annotations is tightly coupled with the CSV's create/update lifecycle.

In copyToNamespace() we need to:
• Distinguish between a new creation (where the annotations don’t exist yet) and an update (where the annotations might already be set but could be outdated).
• Apply the updates in a specific order (first updating the non-status hash, then the status hash, including a status update to avoid mismatches).
• Ensure that each target CSV reflects the current state as expected for that specific namespace.

Aside from the hash handling, we'd still need to be doing the above work in copyToNamespace().
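
To make that concrete, here is a rough, hypothetical sketch of the create/update flow described above. The struct, the function signature, and the injected create/update/updateStatus callbacks are simplified stand-ins rather than the operator's real types; only the annotation keys and the ordering follow the discussion in this PR.

// Rough sketch of the create/update semantics; stand-in types, not OLM code.
package sketch

const (
    nonStatusCopyHashAnnotation = "olm.operatorframework.io/nonStatusCopyHash"
    statusCopyHashAnnotation    = "olm.operatorframework.io/statusCopyHash"
)

// csv is a stand-in for v1alpha1.ClusterServiceVersion.
type csv struct {
    Annotations map[string]string
}

// copyToNamespaceSketch: a missing copy is created and annotated with the
// non-status hash; an existing copy only receives the updates whose hash
// annotations are stale; the status hash is stamped after the status write so
// it reflects the status that was actually persisted.
func copyToNamespaceSketch(prototype, existing *csv, nonstatus, status string,
    create, update, updateStatus func(*csv) error) error {
    if prototype.Annotations == nil {
        prototype.Annotations = map[string]string{}
    }
    prototype.Annotations[nonStatusCopyHashAnnotation] = nonstatus

    if existing == nil {
        // Creation path (mirrors the IsNotFound branch quoted earlier): the
        // non-status hash is set before the create; the status subresource and
        // its hash annotation follow.
        if err := create(prototype); err != nil {
            return err
        }
        if err := updateStatus(prototype); err != nil {
            return err
        }
        prototype.Annotations[statusCopyHashAnnotation] = status
        return update(prototype)
    }

    if existing.Annotations[nonStatusCopyHashAnnotation] != nonstatus {
        // Non-status drift: push the updated copy (hash annotation included).
        if err := update(prototype); err != nil {
            return err
        }
    }
    if existing.Annotations[statusCopyHashAnnotation] != status {
        // Status drift: update the status subresource, then persist the new
        // status hash annotation so future syncs see the copy as in sync.
        if err := updateStatus(prototype); err != nil {
            return err
        }
        prototype.Annotations[statusCopyHashAnnotation] = status
        return update(prototype)
    }
    // Even if everything matches, make sure the returned prototype carries the
    // status hash annotation (mirrors the else block quoted earlier).
    prototype.Annotations[statusCopyHashAnnotation] = status
    return nil
}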

camilamacedo86

This comment was marked as outdated.

@camilamacedo86 (Contributor) left a comment

Hi @bentito @tmshort

Could we address the proper solution on this one, as defined in the doc, instead of merging the old PR fix? See: #3497 (comment)

@camilamacedo86 camilamacedo86 dismissed their stale review March 10, 2025 17:47

I do not want to block anything with the requested changes regarding #3497 (review), so I am dismissing it.

@camilamacedo86 camilamacedo86 self-requested a review March 10, 2025 17:47
everettraven and others added 6 commits April 29, 2025 09:10
by adding annotations to copied CSVs that are populated with
hashes of the non-status fields and the status fields.

This seems to be how this was intended to work, but was not actually
working this way because the annotations never actually existed on the
copied CSV. This resulted in a hot loop of update requests being made
on all copied CSVs.

Signed-off-by: everettraven <[email protected]>
Signed-off-by: everettraven <[email protected]>
Signed-off-by: everettraven <[email protected]>
Signed-off-by: Brett Tofel <[email protected]>
Since we switched to a PartialObjectMetadata cache to save memory, we lost visibility into copied CSV spec and status fields, and the reintroduced nonStatusCopyHash/statusCopyHash annotations only partially solved the problem. Manual edits to a copied CSV could still go undetected, causing drift without reconciliation.

This commit adds two new annotations: olm.operatorframework.io/observedGeneration and olm.operatorframework.io/observedResourceVersion. It implements a mechanism to guard against metadata drift at the top of the existing-copy path in copyToNamespace. If a stored observedGeneration or observedResourceVersion no longer matches the live object, the operator now:

      • Updates the spec and hash annotations
      • Updates the status subresource
      • Records the new generation and resourceVersion in the guard annotations

Because the guard only fires when its annotations are already present, all existing unit tests pass unchanged. We preserve the memory benefits of the metadata-only informer, avoid extra GETs, and eliminate unnecessary API churn.

Future work may explore a WithTransform informer to regain full object visibility with minimal memory impact.

Signed-off-by: Brett Tofel <[email protected]>
Verifies that exactly three updates (spec, status, guard) are issued when the observedGeneration doesn’t match.

Signed-off-by: Brett Tofel <[email protected]>
@perdasilva (Collaborator)

In general, I think it lgtm. I'm curious why we need the annotations at all. Would it be sufficient to do something like:

generate the desired copied csv from the parent csv
get the existing copied csv
if not found => create desired copied csv and return
else => compute desired and existing non-status hashes
if desiredNonStatusHash != existingNonStatusHash => update copied csv and return
else, compute desired and existing status hashes
if desiredStatusHash != existingStatusHash => update status and return
else, return

I think the answer might be because the sync function is being driven off parent/non-copied csv events. Even so, we should be able to use the general approach outlined above (minus the returns). wdyt? I guess the only major difference is swapping memory (storing the hashes in the annotations) for computation (calculating both desired and current hashes). Maybe that's cleaner and adds fewer implementation concerns to the copied csv?

@everettraven (Contributor)

In general, I think it lgtm. I'm curious why we need the annotations at all. Would it be sufficient to do something like:

generate the desired copied csv from the parent csv
get the existing copied csv
if not found => create desired copied csv and return
else => compute desired and existing non-status hashes
if desiredNonStatusHash != existingNonStatusHash => update copied csv and return
else, compute desired and existing status hashes
if desiredStatusHash != existingStatusHash => update status and return
else, return

I think the answer might be because the sync function is being driven off parent/non-copied csv events. Even so, we should be able to use the general approach outlined above (minus the returns). wdyt? I guess the only major difference is swapping memory (storing the hashes in the annotations) for computation (calculating both desired and current hashes). Maybe that's cleaner and adds fewer implementation concerns to the copied csv?

I've not looked at the exact changes in the PR, but having been originally involved in the bug when it first rolled around, the reason I had determined we couldn't do this is that we only ever cached the copied CSV metadata and not the whole copied CSV.

IIRC the bug was originally concerned with the numerous update calls, which were happening because we always decided we needed to update the CSVs.

At the time, it was also determined that we couldn't change what we cached, because edge-computing solutions like MicroShift needed the reduction in memory footprint to run OLM.

Because of this, you would always have to make a GET call to the kube-apiserver for every copied CSV you need to evaluate. This could lead to at worst making twice the API calls we were already making -- a GET followed by an UPDATE as opposed to always issuing an UPDATE. At best, this was exactly the same -- always making a GET request.

If the constraints of not being able to change how the data is cached are still in play, you'll end up with the same problem. The bug was originally filed as "OLM makes too many calls to the Kubernetes API server, inflating the audit logs and thus our log ingestion/monitoring bill". Switching from always issuing an UPDATE to always issuing a GET and possibly an UPDATE would make this worse than just not fixing the bug.

@perdasilva (Collaborator)

Because of this, you would always have to make a GET call to the kube-apiserver for every copied CSV you need to evaluate. This could lead to at worst making twice the API calls we were already making -- a GET followed by an UPDATE as opposed to always issuing an UPDATE. At best, this was exactly the same -- always making a GET request.

Gotcha. I missed that we weren't doing a live GET call but rather pulling it from cache. That all makes sense. Thank you =D

prototype.UID = existing.UID
// sync hash annotations
prototype.Annotations[nonStatusCopyHashAnnotation] = nonstatus
prototype.Annotations[statusCopyHashAnnotation] = status
Collaborator

should we update the status hash post status update, i.e. with the observed resource version and generation annotations (for the same reason described in line 897)?

Contributor Author

Done in soon-to-be-pushed commit 4b21aa6e.

@perdasilva (Collaborator) commented May 5, 2025

It would be cool if we could add a regression test of some sort to measure the number of API server requests or log size or something that helps us verify/measure the effectiveness of the fix (and make sure we don't fall into the same trap again). Though only if there is a relatively easy way to do it. Otherwise, could we post up some manual verification? i.e. pre-fix logs are big, post-fix logs are ok. Or maybe grabbing some API server stats or something...? wdyt?

@bentito (Contributor Author) commented May 8, 2025

@per replying to your comment:

Would it be sufficient to do something like:
generate the desired copied csv from the parent csv

Why annotations vs “just re-hash everything on the fly”

  1. Performance & churn
    Marshaling the entire CSV spec+status and hashing it on every parent-CSV event could get expensive (especially in big clusters with many CSVs) and can lead to unnecessary Update calls (and status flaps) if you miss a subtle drift.

  2. Persistence across restarts
    We need to remember what we last applied, even if the controller pod restarts or skips an event. Annotations live on the resource itself, so we don’t need an external cache or to recompute “last applied” from scratch.

  3. Robust “drift-guard”
    By also storing observed generation & resourceVersion, we catch manual edits made directly to the copied CSV (spec drift or status drift) even if no new parent-CSV event fires. Without annotations you’d have to separately watch the copied CSVs, trigger extra reconciles, and still figure out whether you should stomp on user edits or not.

  4. Simpler reconciliation logic
    Instead of a big “compute desired, compute existing, diff non-status, diff status, decide what to do” every single time, the annotations let you do a cheap string compare in most cases (“is desiredHash == lastAppliedHash?”) and skip all the heavy lifting.

    You’re absolutely right that a pure-compute approach is conceptually possible, but to get the same guarantees (no missed updates, no unnecessary writes, survive restarts, catch manual drift) you’d end up
    re-implementing a lot of what annotations give you “for free.” That was the driving motivation here: swap a tiny bit of metadata on the resource for a much simpler, more robust controller loop.
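
As a rough illustration of points 3 and 4 above, here is a hedged Go sketch of what the cheap compare plus drift guard could look like against a metadata-only cache entry. The annotation keys match the ones discussed in this PR; the inSync helper and its exact checks are assumptions for the example, not the merged implementation.

// Illustrative sketch of the cheap hash compare plus the metadata drift guard.
package sketch

import (
    "strconv"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
    nonStatusCopyHashAnnotation       = "olm.operatorframework.io/nonStatusCopyHash"
    observedGenerationAnnotation      = "olm.operatorframework.io/observedGeneration"
    observedResourceVersionAnnotation = "olm.operatorframework.io/observedResourceVersion"
)

// inSync reports whether a copied CSV (seen only as PartialObjectMetadata)
// still matches what the operator last applied. The hash compare catches
// parent-CSV changes; the generation/resourceVersion guard catches manual
// edits to the copy itself. The guard only fires when its annotations exist.
func inSync(copied *metav1.PartialObjectMetadata, desiredNonStatusHash string) bool {
    ann := copied.GetAnnotations()
    if ann[nonStatusCopyHashAnnotation] != desiredNonStatusHash {
        return false // the parent CSV changed since the copy was last written
    }
    if gen, ok := ann[observedGenerationAnnotation]; ok &&
        gen != strconv.FormatInt(copied.GetGeneration(), 10) {
        return false // the copy's spec was edited directly (generation bumped)
    }
    if rv, ok := ann[observedResourceVersionAnnotation]; ok &&
        rv != copied.GetResourceVersion() {
        return false // the copy changed out from under us (e.g. a status edit)
    }
    return true
}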
