Skip to content

InformerEventSource throwing NPE if the watched resource is deleted #830

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
vmuzikar opened this issue Jan 13, 2022 · 16 comments · Fixed by #833
Closed

InformerEventSource throwing NPE if the watched resource is deleted #830

vmuzikar opened this issue Jan 13, 2022 · 16 comments · Fixed by #833
Assignees
Labels
fabric8 kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@vmuzikar
Copy link
Contributor

Bug Report

What did you do?

  1. Configured the event source:
    @Override
    public List<EventSource> prepareEventSources(EventSourceContext<Keycloak> context) {
        SharedIndexInformer<Deployment> deploymentInformer =
                client.apps().deployments().inAnyNamespace()
                        .withLabels(Constants.DEFAULT_LABELS)
                        .runnableInformer(0);

        return List.of(new InformerEventSource<>(
                deploymentInformer, d -> {
            var ownerReferences = d.getMetadata().getOwnerReferences();
            if (!ownerReferences.isEmpty()) {
                return Set.of(new ResourceID(ownerReferences.get(0).getName(),
                        d.getMetadata().getNamespace()));
            } else {
                return Collections.emptySet();
            }
        }));
    }
  1. Created the secondary resource (Deployment) in a reconciliation loop.
  2. Deleted the Deployment manually to simulate user unintentionally deleting it.

What did you expect to see?

No NPE. ;)

What did you see instead? Under which circumstances?

2022-01-13 18:16:00,804 ERROR [io.fab.kub.cli.inf.cac.SharedProcessor] (OkHttp https://127.0.0.1:52740/...) Failed invoking io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource$1@18998b60 event handler: null: java.lang.NullPointerException
	at io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource$1.onUpdate(InformerEventSource.java:88)
	at io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource$1.onUpdate(InformerEventSource.java:78)
	at io.fabric8.kubernetes.client.informers.cache.ProcessorListener$UpdateNotification.handle(ProcessorListener.java:85)
	at io.fabric8.kubernetes.client.informers.cache.ProcessorListener.add(ProcessorListener.java:47)
	at io.fabric8.kubernetes.client.informers.cache.SharedProcessor.lambda$distribute$0(SharedProcessor.java:79)
	at io.fabric8.kubernetes.client.informers.cache.SharedProcessor.lambda$distribute$1(SharedProcessor.java:101)
	at io.fabric8.kubernetes.client.utils.SerialExecutor.lambda$execute$0(SerialExecutor.java:40)
	at io.fabric8.kubernetes.client.utils.SerialExecutor.scheduleNext(SerialExecutor.java:52)
	at io.fabric8.kubernetes.client.utils.SerialExecutor.execute(SerialExecutor.java:46)
	at io.fabric8.kubernetes.client.informers.cache.SharedProcessor.distribute(SharedProcessor.java:98)
	at io.fabric8.kubernetes.client.informers.cache.SharedProcessor.distribute(SharedProcessor.java:79)
	at io.fabric8.kubernetes.client.informers.cache.ProcessorStore.update(ProcessorStore.java:48)
	at io.fabric8.kubernetes.client.informers.cache.ProcessorStore.update(ProcessorStore.java:29)
	at io.fabric8.kubernetes.client.informers.cache.Reflector$ReflectorWatcher.eventReceived(Reflector.java:134)
	at io.fabric8.kubernetes.client.informers.cache.Reflector$ReflectorWatcher.eventReceived(Reflector.java:114)
	at io.fabric8.kubernetes.client.utils.WatcherToggle.eventReceived(WatcherToggle.java:49)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.eventReceived(AbstractWatchManager.java:203)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onMessage(AbstractWatchManager.java:269)
	at io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener.onMessage(WatcherWebSocketListener.java:68)
	at io.fabric8.kubernetes.client.okhttp.OkHttpWebSocketImpl$BuilderImpl$1.onMessage(OkHttpWebSocketImpl.java:92)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:322)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:273)
	at okhttp3.internal.ws.RealWebSocket$1.onResponse(RealWebSocket.java:209)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Environment

Kubernetes cluster type:

vanilla

$ Mention java-operator-sdk version from pom.xml file

2.0.0-RC1 (commit: fd6e493)

$ java -version

openjdk version "11.0.12" 2021-07-20
OpenJDK Runtime Environment Homebrew (build 11.0.12+0)
OpenJDK 64-Bit Server VM Homebrew (build 11.0.12+0, mixed mode)

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:33:37Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.3", GitCommit:"c92036820499fedefec0f847e2054d824aea6cd1", GitTreeState:"clean", BuildDate:"2021-10-27T18:35:25Z", GoVersion:"go1.16.9", Compiler:"gc", Platform:"linux/amd64"}

Possible Solution

Additional context

@csviri
Copy link
Collaborator

csviri commented Jan 13, 2022

This seems to be an issues with informer in fabric8 client, will implement a quick fix on our side, and address it there.

@csviri
Copy link
Collaborator

csviri commented Jan 14, 2022

For now was not able to reproduce this locally, will try further. Putting @manusa @shawkins on cc,

What happens that the new object is null on update:
https://github.com/java-operator-sdk/java-operator-sdk/blob/62ce303e107108a2acdf05249f3105b0eda5d758/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/InformerEventSource.java#L88

Is it expected in any situation that the new object is null?
(in other words Is this a bug or a feature? :) )

@metacosm
Copy link
Collaborator

I think the question is why onUpdate is called when the object is deleted. Shouldn't the informer only call onDelete in that case?

@csviri
Copy link
Collaborator

csviri commented Jan 14, 2022

true, it should be delete called.

@manusa
Copy link
Contributor

manusa commented Jan 14, 2022

I think the question is why onUpdate is called when the object is deleted. Shouldn't the informer only call onDelete in that case?

Following the watcher logic (no idea if this is what happened on the Informer side), for regular deletes, the first step is to add the deletionTimestamp field. So at least a modified event should be received prior to the deletion.

@csviri
Copy link
Collaborator

csviri commented Jan 14, 2022

Following the watcher logic (no idea if this is what happened on the Informer side), for regular deletes, the first step is to add the deletionTimestamp field. So at least a modified event should be received prior to the deletion.

yes, but the old and new object pair should be still there, the new object will be in this case what is "marked for deletion". At least this is what I saw before.

@manusa
Copy link
Contributor

manusa commented Jan 14, 2022

yes, but the old and new object pair should be still there, the new object will be in this case what is "marked for deletion". At least this is what I saw before.

From the watcher perspective yes. But I'm not sure if the Informer might be merging events.
The only explanation that occurs to me (without checking the informer code), is that the resource is deleted from the cache before the update event is emitted. i.e. modified + delete event are received by the informer watcher, when the modified event is composed+emitted, the resource was already deleted because of the watcher delete event.

@csviri csviri linked a pull request Jan 14, 2022 that will close this issue
@csviri
Copy link
Collaborator

csviri commented Jan 14, 2022

ok, thank you!!
created a PR, with a fix and integration test on our side - what did not reproduce it, so it might be a concurrency issue, not sure.
will investigate further a little bit later.

@csviri csviri reopened this Jan 14, 2022
@csviri csviri self-assigned this Jan 14, 2022
@shawkins
Copy link
Collaborator

It's not generally possible for the newObject to be null. The incoming event is processed by the reflector. There is an explicit null check there: https://github.com/fabric8io/kubernetes-client/blob/9e8ae39711143648158f555757296c613175c819/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/informers/cache/Reflector.java#L122

That same object reference will be passed all the way to your event handler with the stacktrace you are showing.

The resourceVersion is read by the Watcher as well, so we know that the same call was not an NPE at this point:
https://github.com/fabric8io/kubernetes-client/blob/9e8ae39711143648158f555757296c613175c819/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/informers/cache/Reflector.java#L140

Since the resource is added to the store prior to the handler being called it is possible that other logic is modifying the object first - is it possible that the operator framework is setting the metadata to null? Or are you sure that the new object is null - that would be very surprising.

@csviri
Copy link
Collaborator

csviri commented Jan 14, 2022

thx @shawkins !
@vmuzikar could you retest it with the current snapshot version?
Also share the code which are you using for testing?

@metacosm
Copy link
Collaborator

Maybe we should split the line on which the NPE occurs so that we have a better idea of what is null?

@vmuzikar
Copy link
Contributor Author

vmuzikar commented Jan 17, 2022

The issue still persists for me but with a little bit more debugging I believe I found the root cause:
https://github.com/java-operator-sdk/java-operator-sdk/blob/09180abaeb93646bd7bbe8e924fd2336e61808f8/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/InformerEventSource.java#L94-L95
Happens when oldObject.getMetadata().getResourceVersion() == null.

The thing is that I'm also creating or updating the Deployment in the reconciliation loop. And the loop is triggered by the Development being deleted, then created in the reconciliation loop, etc. Probably I'm just doing something wrong. :D But it's basically a race condition, happens only sometimes when the reconciliation is too fast.

@shawkins
Copy link
Collaborator

So this is happening due to modifications to the cached resource.

It might not even be the operator logic making the modification. This was touched on in #3078 but not really addressed. Essentially the logic in createOrReplace, replace, or patch is allowed to modify the passed in object - in particular the resourceVersion. For example using an item directly from the cache in createOrReplace will set the resourceVersion to null - https://github.com/fabric8io/kubernetes-client/blob/32c6a88f029ba64b1f4225c5f122f2247b1b74d7/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/utils/CreateOrReplaceHelper.java#L46

I'm thinking we need fabric8 to not do this. To be safe we'll need to clone those objects before modifying: fabric8io/kubernetes-client#3756

It has also been discussed whether the objects obtained from the cache should already be cloned or write protected to prevent any future modifications - only cloning would be easy to implement. The only downside would be the general performance overhead - it's also possible to consider adding methods that would differentiate get vs getDirect.

@metacosm
Copy link
Collaborator

Thanks for the updates!

@jmrodri jmrodri added kind/bug Categorizes issue or PR as related to a bug. fabric8 labels Feb 17, 2022
@github-actions
Copy link

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 19, 2022
@github-actions
Copy link

github-actions bot commented May 4, 2022

This issue was closed because it has been stalled for 14 days with no activity.

@github-actions github-actions bot closed this as completed May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fabric8 kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants