Operator task suddenly shuts down and stops receiving events after that #533

Closed
flavioschuindt opened this issue Sep 13, 2021 · 18 comments
Labels
java Pull requests that update Java code kind/bug Categorizes issue or PR as related to a bug.

Comments

@flavioschuindt

Bug Report

What did you do?

  • Created a simple operator that manages Kubernetes Jobs;
  • In the beginning, everything works well. I tried multiple performance tests: 3k, 5k, 8k and 15k CustomResource creations, and the operator worked very well;
  • Then I tried one last test with 30k CustomResources. Everything again works great, but then suddenly I see this in my operator stack trace:
2021-09-13 11:42:48.113 ERROR [-] [] 1 --- [/10.233.0.1/...] i.f.k.c.d.internal.AbstractWatchManager  : Unhandled exception encountered in watcher event handler

java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@f74a29b[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@594c4226[Wrapped task = io.javaoperatorsdk.operator.processing.ExecutionConsumer@7b06f65d]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@4c40de55[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 42]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:705)
	at io.javaoperatorsdk.operator.processing.DefaultEventHandler.executeBufferedEvents(DefaultEventHandler.java:149)
	at io.javaoperatorsdk.operator.processing.DefaultEventHandler.handleEvent(DefaultEventHandler.java:128)
	at io.javaoperatorsdk.operator.processing.event.internal.CustomResourceEventSource.eventReceived(CustomResourceEventSource.java:102)
	at io.javaoperatorsdk.operator.processing.event.internal.CustomResourceEventSource.eventReceived(CustomResourceEventSource.java:29)
	at io.fabric8.kubernetes.client.utils.WatcherToggle.eventReceived(WatcherToggle.java:49)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.eventReceived(AbstractWatchManager.java:178)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onMessage(AbstractWatchManager.java:234)
	at io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener.onMessage(WatcherWebSocketListener.java:93)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.kt:290)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.kt:220)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.kt:100)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.kt:247)
	at okhttp3.internal.ws.RealWebSocket$connect$1.onResponse(RealWebSocket.kt:181)
	at okhttp3.RealCall$AsyncCall.run(RealCall.kt:138)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
  • After this stack trace, no new CR events reach my operator code; it simply does not work anymore. The only option is to restart it.

What did you expect to see?

I expect the operator to stay up and running all the time, even under high load.

What did you see instead? Under which circumstances?

As per the logs, it looks like the ExecutionConsumer task is rejected because the thread pool executor managed by the Java Operator SDK has been terminated.

Environment

Kubernetes cluster type:

Vanilla

java-operator-sdk version (from pom.xml):

1.9.6

$ java -version

Java 11

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:16:05Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.7", GitCommit:"b4455102ef392bf7d594ef96b97a4caa79d729d9", GitTreeState:"clean", BuildDate:"2020-06-17T11:32:20Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.17) exceeds the supported minor version skew of +/-1

Possible Solution

Additional context

The operator receives some contextual information and starts and monitors a Kubernetes Job on the user's behalf. 30,000 is a huge load and Kubernetes only allows running ~100 pods per node. With smaller batches I see around ~200 CRs in the cluster at the same time and it works just fine. With 30k, I saw numbers like ~400 CRs in the cluster at a single point in time. This is definitely because the cluster can't keep pace with the number of executions (lots of Pods in Pending state waiting for node capacity). It seems to me that this makes the list of currently active CRs grow, which impacts the Java Operator SDK and eventually causes the thread pool to terminate, and it never comes back again.

@csviri
Collaborator

csviri commented Sep 14, 2021

Hi @flavioschuindt, thanks for the bug report, we will try to take a look ASAP.

@flavioschuindt
Author

flavioschuindt commented Sep 14, 2021

Thanks, @csviri. Just to add more information: I am running more tests here in batches of 10k. So far, I have run 3 batches of 10k, i.e., 30k CRs in total. All good. Now, I went on to the fourth batch. I haven't seen the above stack trace in this batch yet, but at a certain point in time the CR events become super slow in my code, it takes forever... This really impacts the overall processing of the operator.

@metacosm
Collaborator

Would it be possible to give us access to your testing code?

@metacosm
Collaborator

Another thing: have you tried increasing the reconciliation pool size? You can configure it by changing the value returned by ConfigurationService.concurrentReconciliationThreads.

@flavioschuindt
Author

Hi, @metacosm.

Unfortunately, I can't put the test code here as there is IP involved. But what I can tell you is that the code is simple and entirely based on the sample you provided in the repo. In a nutshell (a rough sketch follows the list):

  • Same as the highlighted code from the link above: create a bean that registers the controllers;
  • Create a controller bean that implements ResourceController;
  • The controller bean then holds my operator logic: it has a createOrUpdateResource method with my business logic, which essentially creates K8s Jobs on users' behalf and feeds their results into a third-party system. This bean also has a deleteResource method that just returns DeleteControl.DEFAULT_DELETE.
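
Roughly, the controller looks something like this. This is just a sketch: the class and CR names are placeholders, the CR class itself is omitted, and the imports and method signatures are my best recollection of the 1.9.x ResourceController API rather than my exact code:

import io.javaoperatorsdk.operator.api.Context;
import io.javaoperatorsdk.operator.api.Controller;
import io.javaoperatorsdk.operator.api.DeleteControl;
import io.javaoperatorsdk.operator.api.ResourceController;
import io.javaoperatorsdk.operator.api.UpdateControl;

// WorkloadExecution is a placeholder for my actual CustomResource class
// (its spec carries the Job definition and some metadata).
@Controller
public class WorkloadExecutionController implements ResourceController<WorkloadExecution> {

    @Override
    public UpdateControl<WorkloadExecution> createOrUpdateResource(WorkloadExecution resource,
                                                                   Context<WorkloadExecution> context) {
        // Business logic: create the K8s Job described by the CR spec and
        // feed its result into the third-party system (details omitted).
        return UpdateControl.noUpdate();
    }

    @Override
    public DeleteControl deleteResource(WorkloadExecution resource, Context<WorkloadExecution> context) {
        return DeleteControl.DEFAULT_DELETE;
    }
}

// The registration bean from the first bullet is essentially the sample code:
//   Operator operator = new Operator(client, DefaultConfigurationService.instance());
//   operator.register(new WorkloadExecutionController());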

My performance test then just creates this huge amount of CRs (~30k) and checks that my operator is rock solid and can process all of them, which is where I am facing the issues I reported before.

I know it is not the best answer, as it would be much better for you to take a look at the code, but it is what I can do for now.

About ConfigurationService.concurrentReconciliationThreads: where am I supposed to change it? I see that the operator accepts a ConfigurationService as its second argument: Operator operator = new Operator(client, DefaultConfigurationService.instance());. Are you saying to create a ConfigurationService instance and pass it as the second argument? Could you please share an example? Thank you!

@metacosm
Collaborator

I know it is not the best answer, as it would be much better for you to take a look at the code, but it is what I can do for now.

No problem! 😃

About ConfigurationService.concurrentReconciliationThreads: where am I supposed to change it? I see that the operator accepts a ConfigurationService as its second argument: Operator operator = new Operator(client, DefaultConfigurationService.instance());. Are you saying to create a ConfigurationService instance and pass it as the second argument? Could you please share an example? Thank you!

Yes, you would need to pass your own implementation of ConfigurationService in that case. This is indeed not trivial to do and should be made easier. I created a PR for this: #543, which also contains an example of how to use it.

If you're using the Quarkus extension, you can also configure this via application.properties using the quarkus.operator-sdk.concurrent-reconciliation-threads property, passing it the number of threads you want to use.
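
For example, a single line in application.properties is enough (the value shown here is just illustrative):

quarkus.operator-sdk.concurrent-reconciliation-threads=20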

@flavioschuindt
Author

flavioschuindt commented Sep 17, 2021

Thanks for the input, @metacosm. So, I was able to parameterize the number of threads:

int numConcurrentReconciliationThreads = k8OperatorsProperties.getOperator().getConcurrentReconciliationThreads();
Operator operator = new Operator(client,
                ConfigurationServiceOverrider.override(DefaultConfigurationService.instance())
                        .withConcurrentReconciliationThreads(numConcurrentReconciliationThreads)
                        .build());

But the end result is still the same... I tried with various numbers of threads: 20, 30, 100. Just to give an example, see this particular CR instance:

Name:         we-cr-06e75fc5-4083-489c-98b4-91b63e0ded81
Namespace:    test
Labels:       <none>
Annotations:  <none>
Metadata:
  Creation Timestamp:  2021-09-17T06:46:29Z

This means this CR was created in the cluster 18 minutes ago and the first line of code in the operator still hasn't even executed. Something is blocking it...

@metacosm
Collaborator

Thanks for checking. I will look into it. May I inquire about your use case? It's always interesting to hear from users what they use the SDK for… 😄

@flavioschuindt
Author

Sure. My use case is to build an operator that can start and monitor a Kubernetes Job execution on the user's behalf. In a nutshell (a rough sketch of the Job handling follows the list):

  • The user drops the CR into the cluster; it contains some metadata, including the Job spec;
  • The operator receives the CR definition, reads the spec and starts the Job in the cluster;
  • The operator registers a SharedInformer so it can receive Job events;
  • When the Job is done (regardless of its status), the operator receives the event from the informer and notifies another system of the Job execution results.
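
The Job-handling part looks roughly like this. Again, just a sketch: class names are placeholders and the fabric8 calls are written from memory, so treat them as approximations rather than my exact code:

import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;
import io.fabric8.kubernetes.client.informers.SharedInformerFactory;

public class JobMonitor {

    private final KubernetesClient client;

    public JobMonitor(KubernetesClient client) {
        this.client = client;
    }

    public void startAndWatch(Job jobFromCrSpec, String namespace) {
        // 1. Start the Job built from the CR spec.
        client.batch().v1().jobs().inNamespace(namespace).create(jobFromCrSpec);

        // 2. Register a shared informer so the operator receives Job events.
        SharedInformerFactory factory = client.informers();
        SharedIndexInformer<Job> informer = factory.sharedIndexInformerFor(Job.class, 30 * 1000L);
        informer.addEventHandler(new ResourceEventHandler<Job>() {
            @Override
            public void onAdd(Job job) {
                // nothing to do on creation
            }

            @Override
            public void onUpdate(Job oldJob, Job job) {
                // 3. When the Job finishes (succeeded or failed), notify the other system.
                if (job.getStatus() != null
                        && (job.getStatus().getSucceeded() != null || job.getStatus().getFailed() != null)) {
                    notifyResultSystem(job);
                }
            }

            @Override
            public void onDelete(Job job, boolean deletedFinalStateUnknown) {
                // ignore
            }
        });
        factory.startAllRegisteredInformers();
    }

    private void notifyResultSystem(Job job) {
        // push the Job result to the external system (omitted)
    }
}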

It's very simple and works well. But for the scenario I am describing above, with a huge amount of CRs being added continuously, I start to see these issues :)

Hope that makes sense and thanks again for the great support here!

@metacosm
Collaborator

Based on the error you get, it seems that the ThreadPoolExecutor is getting shut down for some reason. I suspect an out of memory issue but it's hard to tell simply from this. Are you able to see such errors when you run your operator? Could you give #546 a try to see if this helps?

@flavioschuindt
Author

Hi, @metacosm.

Looking at Grafana dashboards with data collected by Prometheus, I don't see any memory issue. In which version of the library is #546 available?

@metacosm
Collaborator

metacosm commented Oct 1, 2021

#546 should be available in 1.9.7. That said, if you're not seeing any memory issue then I'm not sure why the ThreadPoolExecutor is getting shut down… :(

@csviri
Collaborator

csviri commented Oct 4, 2021

Today this happened to me too, i.e. the executor was shut down. In this case it was in an integration test, and the reason was probably that the process running the unit test was about to exit.

So it might be the case that K8s tries to kill the pod for some reason; on a SIGTERM this can happen.

@flavioschuindt
Author

Quick update: I was able to test with 1.9.7 this weekend and the previous test (30k Jobs) passed this time. I am not quite sure what improved in 1.9.7 or whether it was just a coincidence, but at least this was the first time I was able to run my 30k CRs. I went further after this and tried two batches of 30k CRs in parallel, and that didn't work though. However, I think I exaggerated with that one and put the cluster in a really bad situation (API server taking forever to respond, etc.), so I'm not considering it an issue...

Today this happened to me too, i.e. the executor was shut down. In this case it was in an integration test, and the reason was probably that the process running the unit test was about to exit.

So it might be the case that K8s tries to kill the pod for some reason; on a SIGTERM this can happen.

I would agree if some evidence in the Pod lifecycle indeed pointed to a memory problem. Memory is not a compressible resource, so in my understanding the OOMKiller (which is driven by the cgroup limits) would immediately kill the container. In my specific case, though, I didn't see any memory issue or the container being killed. Not sure about yours.

@metacosm
Collaborator

metacosm commented Oct 5, 2021

Quick update: I was able to test with 1.9.7 this weekend and the previous test (30k Jobs) passed this time. I am not quite sure what improved in 1.9.7 or whether it was just a coincidence, but at least this was the first time I was able to run my 30k CRs. I went further after this and tried two batches of 30k CRs in parallel, and that didn't work though. However, I think I exaggerated with that one and put the cluster in a really bad situation (API server taking forever to respond, etc.), so I'm not considering it an issue...

1.9.7 has switched to a shared thread pool across all controllers instead of having one per controller, which should make it easier to reason about threads and save some memory. Hopefully, there's also better logging as to what is going on when the executor is shut down. Do you have any more information in the logs?

Today this happened to me too, i.e. the executor was shut down. In this case it was in an integration test, and the reason was probably that the process running the unit test was about to exit.
So it might be the case that K8s tries to kill the pod for some reason; on a SIGTERM this can happen.

I would agree if some evidence in the Pod lifecycle indeed pointed to a memory problem. Memory is not a compressible resource, so in my understanding the OOMKiller (which is driven by the cgroup limits) would immediately kill the container. In my specific case, though, I didn't see any memory issue or the container being killed. Not sure about yours.

Yes, it's difficult to tell why the executor gets shut down if you're not observing it as it happens… unfortunately.

@jmrodri added the kind/bug and java labels on Oct 14, 2021
@jmrodri
Member

jmrodri commented Oct 14, 2021

@flavioschuindt Did the update to 1.9.7 resolve your issue with the 30k CustomResources? If not, could you submit more logs or information?

@flavioschuindt
Author

@flavioschuindt Did the update to 1.9.7 resolve your issue with the 30k CustomResources? If not, could you submit more logs or information?

Hey @jmrodri, yeah, as I said above, it looks like I have better results now. I will close this issue then, and in case I observe any other related issue I can reopen it. Thank you all for the help!

@metacosm
Collaborator

metacosm commented Nov 4, 2021

@flavioschuindt could you open an issue with the details regarding your most recent problem, please?
