Operator task suddenly shuts down and does not receive any events after that #533
Comments
Hi @flavioschuindt, thanks for the bug report, we will try to take a look ASAP.
Thanks, @csviri. Just to add more information: I am running more tests here in batches of 10k. So far, I ran 3 batches of 10k, i.e., 30k CRs in total. All good. Now, I went to the fourth batch. I didn't see the above stack trace in this batch yet, but at a certain point in time the CR events become super slow in my code; it takes forever... This really impacts the overall processing of the operator.
Would it be possible to give us access to your testing code?
Another thing: have you tried increasing the reconciliation pool size? You can configure it by changing the value returned by …
Hi, @metacosm. Unfortunately, I can't put the test code here as there is IP involved. But what I can tell you is that the code is simple and entirely based on the sample you guys provided in the repo. In a nutshell:
Then my performance test just creates this huge amount of CRs (~30k) and makes sure that my operator is rock solid and can process all of them, which is where I am facing the issues that I reported before. I know it is not the best answer, as it would be much better for you to take a look at the code, but it is what I can do for now. About the …
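For context, a load test along those lines can be a simple loop that creates the CRs with the fabric8 client. This is only an illustrative sketch, not the reporter's code: it assumes a fabric8 5.x client (the generation used with JOSDK 1.9.x), and the `JobExecution` type, its group/version, and the `test` namespace are made-up placeholders.

```java
import io.fabric8.kubernetes.api.model.Namespaced;
import io.fabric8.kubernetes.api.model.ObjectMetaBuilder;
import io.fabric8.kubernetes.client.CustomResource;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.model.annotation.Group;
import io.fabric8.kubernetes.model.annotation.Version;

import java.util.UUID;

// Hypothetical CR type handled by the operator; group/version are placeholders.
@Group("example.com")
@Version("v1")
class JobExecution extends CustomResource<Void, Void> implements Namespaced {}

public class LoadGenerator {
  public static void main(String[] args) {
    try (KubernetesClient client = new DefaultKubernetesClient()) {
      for (int i = 0; i < 30_000; i++) {
        // Each CR gets a unique name, mirroring the "we-cr-<uuid>" naming seen later in the thread.
        JobExecution cr = new JobExecution();
        cr.setMetadata(new ObjectMetaBuilder()
            .withName("we-cr-" + UUID.randomUUID())
            .withNamespace("test")
            .build());
        client.customResources(JobExecution.class).inNamespace("test").create(cr);
      }
    }
  }
}
```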
No problem! 😃
Yes, you would need to pass your own implementation of … If you're using the Quarkus extension, you can also configure this via …
Thanks for the inputs, @metacosm. So, I was able to parametrize the number of threads:

```java
int numConcurrentReconciliationThreads = k8OperatorsProperties.getOperator().getConcurrentReconciliationThreads();
Operator operator = new Operator(client,
        ConfigurationServiceOverrider.override(DefaultConfigurationService.instance())
                .withConcurrentReconciliationThreads(numConcurrentReconciliationThreads)
                .build());
```

But the end result is still the same... I tried with several numbers of threads: 20, 30, 100. Just to give an example, see this particular CR instance:

```
Name:         we-cr-06e75fc5-4083-489c-98b4-91b63e0ded81
Namespace:    test
Labels:       <none>
Annotations:  <none>
Metadata:
  Creation Timestamp:  2021-09-17T06:46:29Z
```

This means this CR was created in the cluster 18 minutes ago and the first line of code in the operator still hasn't even executed. Something is blocking it...
Thanks for checking. I will look into it. May I inquire about your use case? It's always interesting to hear from users and what they use the SDK for… 😄
Sure. My use case is to build an operator that can start and monitor a Kubernetes Job execution on the user's behalf. In a nutshell:
It's very simple and works well. But for the scenario that I am describing above, with a huge amount of CRs being added continuously, I start to see these issues :) Hope that makes sense, and thanks again for the great support here!
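Not the actual code, but a rough sketch of what such a controller can look like with the v1-era `ResourceController` API used by JOSDK 1.9.x and a fabric8 5.x client. The `JobExecution` CR type, its group/version, and the container image are hypothetical placeholders; exact class names may differ slightly between SDK versions.

```java
import io.fabric8.kubernetes.api.model.Namespaced;
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;
import io.fabric8.kubernetes.client.CustomResource;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.model.annotation.Group;
import io.fabric8.kubernetes.model.annotation.Version;
import io.javaoperatorsdk.operator.api.Context;
import io.javaoperatorsdk.operator.api.Controller;
import io.javaoperatorsdk.operator.api.DeleteControl;
import io.javaoperatorsdk.operator.api.ResourceController;
import io.javaoperatorsdk.operator.api.UpdateControl;

// Hypothetical CR; in the real operator its spec would carry the contextual information for the Job.
@Group("example.com")
@Version("v1")
class JobExecution extends CustomResource<Void, Void> implements Namespaced {}

@Controller
public class JobExecutionController implements ResourceController<JobExecution> {

  private final KubernetesClient client;

  public JobExecutionController(KubernetesClient client) {
    this.client = client;
  }

  @Override
  public UpdateControl<JobExecution> createOrUpdateResource(JobExecution cr, Context<JobExecution> context) {
    String ns = cr.getMetadata().getNamespace();
    String name = cr.getMetadata().getName();

    // Start the Job on the user's behalf if it does not exist yet.
    if (client.batch().v1().jobs().inNamespace(ns).withName(name).get() == null) {
      Job job = new JobBuilder()
          .withNewMetadata().withName(name).withNamespace(ns).endMetadata()
          .withNewSpec()
            .withNewTemplate()
              .withNewSpec()
                .addNewContainer().withName("worker").withImage("busybox").endContainer()
                .withRestartPolicy("Never")
              .endSpec()
            .endTemplate()
          .endSpec()
          .build();
      client.batch().v1().jobs().inNamespace(ns).create(job);
    }

    // Monitoring the Job and reflecting its state in the CR status is omitted here.
    return UpdateControl.noUpdate();
  }

  @Override
  public DeleteControl deleteResource(JobExecution cr, Context<JobExecution> context) {
    // Rely on owner references or explicit cleanup of the Job.
    return DeleteControl.DEFAULT_DELETE;
  }
}
```

The real operator presumably also watches the Job and updates the CR status as it progresses; that part is left out of the sketch.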
Based on the error you get, it seems that the …
#546 should be available in 1.9.7. That said, if you're not seeing any memory issue then I'm not sure why the …
Today this happened to me, i.e. the executor was shut down. In this case it was in an integration test, and the reason was probably that the process running the unit test was about to exit. So it might be the case that K8S tries to kill the pod for some reason; this can happen on a SIGTERM.
Quick update: I was able to test with 1.9.7 this weekend, and the previous test (30k jobs) passed this time. I am not quite sure what improved in 1.9.7 or if that was just a coincidence, but at least this was the first time I was able to run my 30k CRs. I went further after this and tried two batches of 30k CRs in parallel, and that didn't work, though. However, in that case I think I exaggerated and put the cluster in a really bad situation (API server taking forever to respond, etc.), so I am not considering it an issue...
I would agree if some evidence in the pod lifecycle indeed pointed to a memory problem. Memory is not a compressible resource, so in my understanding the OOM killer inside the container (which is driven by the cgroups limits) would immediately kill the container. In my specific case, though, I didn't see any memory issue or the container being killed. Not sure about yours.
1.9.7 has switched to a shared thread pool across all controllers instead of having one per controller, which should make it easier to reason about threads and save some memory. Hopefully, there's also better logging as to what is going on when the executor is shut down. Do you have any more information in the logs?
Yes, it's difficult to tell why the executor gets shut down if you're not observing it as it happens… unfortunately.
@flavioschuindt Did the update to 1.9.7 resolve your issue with the 30k CustomResources? If not, could you submit more logs or information?
Hey @jmrodri, yeah, as I said above, it looks like I have better results now. I will close this issue then, and in case I observe any other issue related to this I can reopen it. Thank you guys for all the help!
@flavioschuindt could you open an issue with the details regarding your most recent problem, please?
Bug Report
What did you do?
What did you expect to see?
I expect the operator to be up and running all the time even on high loads.
What did you see instead? Under which circumstances?
As per the logs, it looks like the ExecutionConsumer task is rejected because the thread pool executor managed by the Java Operator SDK is terminated.
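For context, this is plain JDK behavior rather than anything SDK-specific: once an `ExecutorService` has been shut down, any further task submission is rejected with a `RejectedExecutionException`, which is the failure mode the stack trace points to. A minimal standalone reproduction (not the SDK's code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedAfterShutdown {
  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(5);
    pool.shutdownNow(); // once shut down, the pool never accepts tasks again

    try {
      pool.execute(() -> System.out.println("never runs"));
    } catch (RejectedExecutionException e) {
      // Same kind of rejection as the ExecutionConsumer task hitting the terminated SDK pool.
      System.err.println("Task rejected: " + e);
    }
  }
}
```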
Environment
Kubernetes cluster type:
Vanilla
$ Mention java-operator-sdk version from pom.xml file
1.9.6
$ java -version
Java 11
$ kubectl version
Possible Solution
Additional context
The operator receives some contextual information and starts and monitors a Kubernetes Job on the user's behalf. 30,000 is a huge load, and Kubernetes only allows running ~100 pods per node. With smaller batches I see around ~200 CRs in the cluster at the same time, and it just works fine. With 30k, I saw numbers like ~400 CRs in the cluster at a single point in time. This is definitely because the cluster can't keep pace with the amount of executions (lots of pods in Pending state waiting for node capacity). It seems to me that this makes the list of currently active CRs grow, which impacts the Java Operator SDK and causes the thread pool to terminate, and it never comes back again.
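As an aside, the Pending backlog described above is easy to quantify from code. A small hedged sketch with the fabric8 client; the `test` namespace is an assumption taken from the CR example earlier in the thread:

```java
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PendingPods {
  public static void main(String[] args) {
    try (KubernetesClient client = new DefaultKubernetesClient()) {
      // Count pods that are scheduled but still waiting for node capacity.
      long pending = client.pods().inNamespace("test").list().getItems().stream()
          .filter(p -> "Pending".equals(p.getStatus().getPhase()))
          .count();
      System.out.println("Pods waiting for node capacity: " + pending);
    }
  }
}
```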