Handle configuration errors for Operators #1340
Comments
This is more of a feature request than a bug. Copy-pasting my initial thoughts from Discord on how to check access rights and how that relates to liveness probes: Startup probes would be useful in cases where event sources take a long time to sync. I am thinking about how we could automatically check whether the service account has the correct access rights. Probably this API would be best to use: However, automatically determining which resources and verbs an operator uses is not that simple. It might be doable automatically for managed dependent resources (from the annotation), see: https://javaoperatorsdk.io/docs/dependent-resources#managed-dependent-resources Or at a lower layer, if we implement event source definitions with annotations and make some assumptions about those resources: Maybe another way would be to have an annotation that provides this information explicitly. There may be more approaches we could take here. Currently we don't run a servlet container, and (probably) don't even want to in the core. Instead we could provide methods to query the operator for its state, maybe just a method to
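A minimal sketch of the "explicit annotation" idea from the comment above. Everything here is hypothetical and not part of the Java Operator SDK: the `@RequiredAccess` annotation, the `AccessChecker` helper, and the resource names are made up for illustration. The `canI` predicate stands in for whatever cluster-side access check would actually be used; the sketch only shows how declared resources and verbs could be read back and compared against it.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Repeatable;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

class AccessCheckDemo {
    public static void main(String[] args) {
        // Stand-in for a real access check: everything is allowed except "delete".
        BiPredicate<String, String> canI = (resource, verb) -> !verb.equals("delete");
        // Prints: [delete deployments]
        System.out.println(AccessChecker.missingPermissions(ExampleReconciler.class, canI));
    }
}

/** Hypothetical annotation: one resource plus the verbs the reconciler needs on it. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@Repeatable(RequiredAccessList.class)
@interface RequiredAccess {
    String resource();
    String[] verbs();
}

/** Container type required by Java for repeatable annotations. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface RequiredAccessList {
    RequiredAccess[] value();
}

/** Hypothetical reconciler that declares its access needs explicitly. */
@RequiredAccess(resource = "flinkdeployments", verbs = {"get", "list", "watch"})
@RequiredAccess(resource = "deployments", verbs = {"create", "delete"})
class ExampleReconciler {}

final class AccessChecker {
    /** Returns "verb resource" entries that the supplied check rejects. */
    static List<String> missingPermissions(Class<?> type, BiPredicate<String, String> canI) {
        List<String> missing = new ArrayList<>();
        for (RequiredAccess access : type.getAnnotationsByType(RequiredAccess.class)) {
            for (String verb : access.verbs()) {
                if (!canI.test(access.resource(), verb)) {
                    missing.add(verb + " " + access.resource());
                }
            }
        }
        return missing;
    }
}
```

A startup probe could stay unready while the returned list is non-empty. The same declarations could in principle also feed RBAC generation.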
Note that some of this is handled in the Quarkus extension, which supports MicroProfile Health using https://github.com/smallrye/smallrye-health. There's also automated creation of RBACs.
But RBAC is just for
This is how it is done in Go, but we could be more intelligent about this, since there are the annotations: dependent resources, event sources (?), and maybe we could add some dedicated ones.
No, we try to generate as much as possible. Dependent resources help, but even without them we generate some of the RBACs.
So I see here more issues:
The problem with detecting access rights at startup is that it's equivalent to being able to generate the RBACs: you basically need to check that the operator actually has the rights to perform all the API calls it makes, and if you can do that, well, you can probably generate the RBACs as well… 😄
Yes, although generated does not necessarily mean applied :) But it's a good question whether both are needed, or worth implementing.
The operator already fails on startup if it doesn't have access to the configured custom resources. I think this is almost good enough, but we need some error hooks to react to errors that happen later, while the operator is running or when the namespaces are reconfigured. With such hooks it would be straightforward to add a simple health check endpoint.
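To illustrate the error-hook idea, here is a rough sketch of how an operator author might wire such a hook to a health endpoint. `OperatorHealth`, `HealthServer`, the `onError`/`onRecovered` callbacks, and the `/healthz` path are all assumptions made for this example, not SDK API. The JDK's built-in `com.sun.net.httpserver.HttpServer` is used only to keep the example dependency-free.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

class HealthEndpointDemo {
    public static void main(String[] args) throws IOException {
        OperatorHealth health = new OperatorHealth();
        HttpServer server = HealthServer.start(health, 0); // port 0 = any free port
        System.out.println("before error: " + health.describe());
        // In a real operator this would be called from the (hypothetical) error hook.
        health.onError(new IllegalStateException("watch on namespace 'flink' forbidden"));
        System.out.println("after error: " + health.describe());
        server.stop(0);
    }
}

/** Holds the most recent runtime error reported through the hypothetical hook. */
final class OperatorHealth {
    private final AtomicReference<Throwable> lastError = new AtomicReference<>();

    void onError(Throwable error) { lastError.set(error); }

    void onRecovered() { lastError.set(null); }

    boolean isHealthy() { return lastError.get() == null; }

    String describe() {
        Throwable error = lastError.get();
        return error == null ? "OK" : "FAILING: " + error.getMessage();
    }
}

final class HealthServer {
    /** Serves /healthz: 200 while no error is recorded, 503 otherwise. */
    static HttpServer start(OperatorHealth health, int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/healthz", exchange -> {
            String status = health.describe();
            byte[] body = status.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(status.equals("OK") ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

A Kubernetes liveness or readiness probe could then point at `/healthz`. Whether an error should clear automatically or only through an explicit `onRecovered` call is a design choice left open here.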
Added this functionality in for v4.1: In #1594, an option is added to fine-tune the liveness probe based on the health of the event sources / informers. These two should cover this issue too. We don't explicitly check the RBAC, but the behavior in response to informer issues will be highly configurable. So after the PR is merged I intend to close this issue. Please let me know if you see that additional functionality is needed.
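As a rough illustration of what "liveness based on informer health" could look like from the operator author's side, the sketch below aggregates per-informer statuses under a configurable policy. It does not use the SDK's actual types: `InformerStatus`, `LivenessPolicy`, and the `eventSource/namespace` naming scheme are invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

class LivenessDemo {
    public static void main(String[] args) {
        Map<String, InformerStatus> statuses = new LinkedHashMap<>();
        statuses.put("flinkdeployments/default", InformerStatus.HEALTHY);
        statuses.put("flinkdeployments/staging", InformerStatus.UNHEALTHY);

        // Strict policy: every informer is critical, so one failure means "not alive".
        System.out.println(LivenessPolicy.isAlive(statuses, name -> true));
        // Lenient policy: only informers for the "default" namespace are critical.
        System.out.println(LivenessPolicy.isAlive(statuses, name -> name.endsWith("/default")));
    }
}

/** Hypothetical health states for a single informer. */
enum InformerStatus { HEALTHY, UNHEALTHY, UNKNOWN }

final class LivenessPolicy {
    /**
     * The operator counts as alive unless an informer selected by
     * {@code critical} is UNHEALTHY. UNKNOWN is tolerated in this sketch.
     */
    static boolean isAlive(Map<String, InformerStatus> informers, Predicate<String> critical) {
        return informers.entrySet().stream()
                .noneMatch(e -> critical.test(e.getKey())
                        && e.getValue() == InformerStatus.UNHEALTHY);
    }
}
```

With the sample data, the strict policy reports `false` and the lenient one reports `true`. This is the kind of per-deployment tuning that a misconfigured dynamic namespace would call for.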
This is great, thank you @csviri
Feature Request
We are working on the Flink Kubernetes Operator (https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/) and would like to extend the operator with probes that detect the health of the deployment.
We have observed cases recently where even though the operator pod was running, it could not do its job due to missing rolebindings or misconfigured dynamic namespaces.
What did you do?
Deployed the Flink Kubernetes Operator
What did you expect to see?
Currently users just see that the operator itself is running, but the deployments / jobs are not created as expected.
What did you see instead? Under which circumstances?
I would like to help users detect, from the status of the operator, that the Flink Kubernetes Operator is not working correctly.
Environment
Not sure about the exact environment ATM
Will try to collect the info
Java operator version: 3.0.3
Possible Solution
We are thinking about implementing probes (liveness/readiness/startup)
Additional context