Operator task suddenly shuts down and stops receiving events after that #533

Closed
flavioschuindt opened this issue Sep 13, 2021 · 18 comments
Labels
java Pull requests that update Java code kind/bug Categorizes issue or PR as related to a bug.

Comments

@flavioschuindt

Bug Report

What did you do?

  • Created a simple operator that manages Kubernetes Jobs;
  • In the beginning, everything works well. I tried multiple performance tests: 3k, 5k, 8k and 15k CustomResource creations, and the operator worked very well;
  • Then I tried one last test with 30k CustomResources. Everything again works great, but then suddenly I see this in my operator stack trace:
2021-09-13 11:42:48.113 ERROR [-] [] 1 --- [/10.233.0.1/...] i.f.k.c.d.internal.AbstractWatchManager  : Unhandled exception encountered in watcher event handler

java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@f74a29b[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@594c4226[Wrapped task = io.javaoperatorsdk.operator.processing.ExecutionConsumer@7b06f65d]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@4c40de55[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 42]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:705)
	at io.javaoperatorsdk.operator.processing.DefaultEventHandler.executeBufferedEvents(DefaultEventHandler.java:149)
	at io.javaoperatorsdk.operator.processing.DefaultEventHandler.handleEvent(DefaultEventHandler.java:128)
	at io.javaoperatorsdk.operator.processing.event.internal.CustomResourceEventSource.eventReceived(CustomResourceEventSource.java:102)
	at io.javaoperatorsdk.operator.processing.event.internal.CustomResourceEventSource.eventReceived(CustomResourceEventSource.java:29)
	at io.fabric8.kubernetes.client.utils.WatcherToggle.eventReceived(WatcherToggle.java:49)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.eventReceived(AbstractWatchManager.java:178)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onMessage(AbstractWatchManager.java:234)
	at io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener.onMessage(WatcherWebSocketListener.java:93)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.kt:290)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.kt:220)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.kt:100)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.kt:247)
	at okhttp3.internal.ws.RealWebSocket$connect$1.onResponse(RealWebSocket.kt:181)
	at okhttp3.RealCall$AsyncCall.run(RealCall.kt:138)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
  • After this stack trace, no new CR events reach my operator code; it simply does not work anymore. The only option is to restart it.

What did you expect to see?

I expect the operator to stay up and running all the time, even under high load.

What did you see instead? Under which circumstances?

As per the logs, it looks like the ExecutionConsumer task is rejected because the thread pool executor managed by the Java Operator SDK has been terminated.

Environment

Kubernetes cluster type:

Vanilla

java-operator-sdk version (from pom.xml):

1.9.6

$ java -version

Java 11

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:16:05Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.7", GitCommit:"b4455102ef392bf7d594ef96b97a4caa79d729d9", GitTreeState:"clean", BuildDate:"2020-06-17T11:32:20Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.17) exceeds the supported minor version skew of +/-1

Possible Solution

Additional context

The operator receives some contextual information and starts and monitors a Kubernetes Job on the user's behalf. 30,000 is a huge load and Kubernetes only allows running ~100 pods per node. With smaller batches I see around ~200 CRs in the cluster at the same time and it works just fine. With 30k, I saw numbers like ~400 CRs in the cluster at a single point in time. This is definitely because the cluster can't keep pace with the number of executions (lots of Pods in Pending state waiting for node capacity). It seems to me that this makes the list of currently active CRs grow, which impacts the Java Operator SDK and eventually causes the thread pool to terminate, and it never comes back again.

@csviri
Collaborator

csviri commented Sep 14, 2021

Hi @flavioschuindt, thanks for the bug report, we will try to take a look ASAP.

@flavioschuindt
Author

flavioschuindt commented Sep 14, 2021

Thanks, @csviri. Just to add more information: I am running more tests here in batches of 10k. So far, I have run 3 batches of 10k, i.e., 30k CRs in total. All good. Now, I went on to the fourth batch. I haven't seen the above stack trace in this batch yet, but at a certain point in time the CR events become super slow in my code, it takes forever... This really impacts the overall processing of the operator.

@metacosm
Collaborator

Would it be possible to give us access to your testing code?

@metacosm
Collaborator

Another thing: have you tried increasing the reconciliation pool size? You can configure it by changing the value returned by ConfigurationService.concurrentReconciliationThreads.

@flavioschuindt
Author

Hi, @metacosm.

Unfortunately, I can't put the test code here as there is IP involved. But what I can tell you is that the code is simple and entirely based on the sample you provided in the repo. In a nutshell (a rough sketch follows the list):

  • Same as the highlighted code from the link above: create a bean that registers the controllers;
  • Create a controller bean that implements ResourceController;
  • The controller bean then holds my operator logic: it has a createOrUpdateResource method with my business logic, which essentially creates K8s Jobs on users' behalf and feeds their results into a third-party system. This bean also has a deleteResource method that just returns DeleteControl.DEFAULT_DELETE.
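
Roughly, the controller looks something like this. This is just a sketch: the class and CR names are placeholders, the CR class itself is omitted, and the imports and method signatures are my best recollection of the 1.9.x ResourceController API rather than my exact code:

import io.javaoperatorsdk.operator.api.Context;
import io.javaoperatorsdk.operator.api.Controller;
import io.javaoperatorsdk.operator.api.DeleteControl;
import io.javaoperatorsdk.operator.api.ResourceController;
import io.javaoperatorsdk.operator.api.UpdateControl;

// WorkloadExecution is a placeholder for my actual CustomResource class
// (its spec carries the Job definition and some metadata).
@Controller
public class WorkloadExecutionController implements ResourceController<WorkloadExecution> {

    @Override
    public UpdateControl<WorkloadExecution> createOrUpdateResource(WorkloadExecution resource,
                                                                   Context<WorkloadExecution> context) {
        // Business logic: create the K8s Job described by the CR spec and
        // feed its result into the third-party system (details omitted).
        return UpdateControl.noUpdate();
    }

    @Override
    public DeleteControl deleteResource(WorkloadExecution resource, Context<WorkloadExecution> context) {
        return DeleteControl.DEFAULT_DELETE;
    }
}

// The registration bean from the first bullet is essentially the sample code:
//   Operator operator = new Operator(client, DefaultConfigurationService.instance());
//   operator.register(new WorkloadExecutionController());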

My performance test then just creates this huge amount of CRs (~30k) and checks that my operator is rock solid and can process all of them, which is where I am facing the issues I reported before.

I know it is not the best answer, as it would be much better for you to take a look at the code, but it is what I can do for now.

About ConfigurationService.concurrentReconciliationThreads: where am I supposed to change it? I see that the operator accepts a ConfigurationService as its second argument: Operator operator = new Operator(client, DefaultConfigurationService.instance());. Are you saying to create a ConfigurationService instance and pass it as the second argument? Could you please share an example? Thank you!

@metacosm
Collaborator

I know it is not the best answer, as it would be much better for you to take a look at the code, but it is what I can do for now.

No problem! 😃

About ConfigurationService.concurrentReconciliationThreads: where am I supposed to change it? I see that the operator accepts a ConfigurationService as its second argument: Operator operator = new Operator(client, DefaultConfigurationService.instance());. Are you saying to create a ConfigurationService instance and pass it as the second argument? Could you please share an example? Thank you!

Yes, you would need to pass your own implementation of ConfigurationService in that case. This is indeed not trivial to do and should be made easier. I created a PR for this: #543, which also contains an example of how to use it.

If you're using the Quarkus extension, you can also configure this via application.properties using the quarkus.operator-sdk.concurrent-reconciliation-threads property, passing it the number of threads you want to use.
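
For example, a single line in application.properties is enough (the value shown here is just illustrative):

quarkus.operator-sdk.concurrent-reconciliation-threads=20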

@flavioschuindt
Author

flavioschuindt commented Sep 17, 2021

Thanks for the input, @metacosm. So, I was able to parameterize the number of threads:

int numConcurrentReconciliationThreads = k8OperatorsProperties.getOperator().getConcurrentReconciliationThreads();
Operator operator = new Operator(client,
                ConfigurationServiceOverrider.override(DefaultConfigurationService.instance())
                        .withConcurrentReconciliationThreads(numConcurrentReconciliationThreads)
                        .build());

But the end result is still the same... I tried with various numbers of threads: 20, 30, 100. Just to give an example, see this particular CR instance:

Name:         we-cr-06e75fc5-4083-489c-98b4-91b63e0ded81
Namespace:    test
Labels:       <none>
Annotations:  <none>
Metadata:
  Creation Timestamp:  2021-09-17T06:46:29Z

This means this CR was created in the cluster 18 minutes ago and the first line of code in the operator still hasn't even executed. Something is blocking it...

@metacosm
Collaborator

Thanks for checking. I will look into it. May I inquire about your use case? It's always interesting to hear from users what they use the SDK for… 😄

@flavioschuindt
Author

Sure. My use case is to build an operator that can start and monitor a Kubernetes Job execution on the user's behalf. In a nutshell (a rough sketch of the Job handling follows the list):

  • The user drops the CR into the cluster; it contains some metadata, including the Job spec;
  • The operator receives the CR definition, reads the spec and starts the Job in the cluster;
  • The operator registers a SharedInformer so it can receive Job events;
  • When the Job is done (regardless of its status), the operator receives the event from the informer and notifies another system of the Job execution results.
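
The Job-handling part looks roughly like this. Again, just a sketch: class names are placeholders and the fabric8 calls are written from memory, so treat them as approximations rather than my exact code:

import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;
import io.fabric8.kubernetes.client.informers.SharedInformerFactory;

public class JobMonitor {

    private final KubernetesClient client;

    public JobMonitor(KubernetesClient client) {
        this.client = client;
    }

    public void startAndWatch(Job jobFromCrSpec, String namespace) {
        // 1. Start the Job built from the CR spec.
        client.batch().v1().jobs().inNamespace(namespace).create(jobFromCrSpec);

        // 2. Register a shared informer so the operator receives Job events.
        SharedInformerFactory factory = client.informers();
        SharedIndexInformer<Job> informer = factory.sharedIndexInformerFor(Job.class, 30 * 1000L);
        informer.addEventHandler(new ResourceEventHandler<Job>() {
            @Override
            public void onAdd(Job job) {
                // nothing to do on creation
            }

            @Override
            public void onUpdate(Job oldJob, Job job) {
                // 3. When the Job finishes (succeeded or failed), notify the other system.
                if (job.getStatus() != null
                        && (job.getStatus().getSucceeded() != null || job.getStatus().getFailed() != null)) {
                    notifyResultSystem(job);
                }
            }

            @Override
            public void onDelete(Job job, boolean deletedFinalStateUnknown) {
                // ignore
            }
        });
        factory.startAllRegisteredInformers();
    }

    private void notifyResultSystem(Job job) {
        // push the Job result to the external system (omitted)
    }
}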

It's very simple and works well. But for the scenario I am describing above, with a huge amount of CRs being added continuously, I start to see these issues :)

Hope that makes sense and thanks again for the great support here!

@metacosm
Collaborator

Based on the error you get, it seems that the ThreadPoolExecutor is getting shut down for some reason. I suspect an out of memory issue but it's hard to tell simply from this. Are you able to see such errors when you run your operator? Could you give #546 a try to see if this helps?

@flavioschuindt
Author

Hi, @metacosm.

Looking at Grafana dashboards with data collected by Prometheus, I don't see any memory issue. In which version of the library is #546 available?

@metacosm
Collaborator

metacosm commented Oct 1, 2021

#546 should be available in 1.9.7. That said, if you're not seeing any memory issue then I'm not sure why the ThreadPoolExecutor is getting shut down… :(

@csviri
Collaborator

csviri commented Oct 4, 2021

Today this happened to me too, i.e. the executor was shut down. In this case it was in an integration test, and the reason was probably that the process running the unit test was about to exit.

So it might be the case that K8s tries to kill the pod for some reason; on a SIGTERM this can happen.

@flavioschuindt
Author

Quick update: I was able to test with 1.9.7 this weekend and the previous test (30k Jobs) passed this time. I am not quite sure what improved in 1.9.7 or whether it was just a coincidence, but at least this was the first time I was able to run my 30k CRs. I went further after this and tried two batches of 30k CRs in parallel, and that didn't work though. However, I think I exaggerated with that one and put the cluster in a really bad situation (API server taking forever to respond, etc.), so I'm not considering it an issue...

Today this happened to me too, i.e. the executor was shut down. In this case it was in an integration test, and the reason was probably that the process running the unit test was about to exit.

So it might be the case that K8s tries to kill the pod for some reason; on a SIGTERM this can happen.

I would agree if some evidence in the Pod lifecycle indeed pointed to a memory problem. Memory is not a compressible resource, so in my understanding the OOMKiller (which is driven by the cgroup limits) would immediately kill the container. In my specific case, though, I didn't see any memory issue or the container being killed. Not sure about yours.

@metacosm
Collaborator

metacosm commented Oct 5, 2021

Quick update: I was able to test with 1.9.7 this weekend and the previous test (30k Jobs) passed this time. I am not quite sure what improved in 1.9.7 or whether it was just a coincidence, but at least this was the first time I was able to run my 30k CRs. I went further after this and tried two batches of 30k CRs in parallel, and that didn't work though. However, I think I exaggerated with that one and put the cluster in a really bad situation (API server taking forever to respond, etc.), so I'm not considering it an issue...

1.9.7 has switched to a shared thread pool across all controllers instead of having one per controller, which should make it easier to reason about threads and save some memory. Hopefully, there's also better logging as to what is going on when the executor is shut down. Do you have any more information in the logs?

Today this happened to me too, i.e. the executor was shut down. In this case it was in an integration test, and the reason was probably that the process running the unit test was about to exit.
So it might be the case that K8s tries to kill the pod for some reason; on a SIGTERM this can happen.

I would agree if some evidence in the Pod lifecycle indeed pointed to a memory problem. Memory is not a compressible resource, so in my understanding the OOMKiller (which is driven by the cgroup limits) would immediately kill the container. In my specific case, though, I didn't see any memory issue or the container being killed. Not sure about yours.

Yes, it's difficult to tell why the executor gets shut down if you're not observing it as it happens… unfortunately.

@jmrodri added the kind/bug and java labels on Oct 14, 2021
@jmrodri
Member

jmrodri commented Oct 14, 2021

@flavioschuindt Did the update to 1.9.7 resolve your issue with the 30k CustomResources? If not, could you submit more logs or information?

@flavioschuindt
Author

@flavioschuindt Did the update to 1.9.7 resolve your issue with the 30k CustomResources? If not, could you submit more logs or information?

Hey @jmrodri, yeah, as I said above, it looks like I have better results now. I will close this issue then, and in case I observe any other related issue I can reopen it. Thank you all for the help!

@metacosm
Collaborator

metacosm commented Nov 4, 2021

@flavioschuindt could you open an issue with the details regarding your most recent problem, please?
