Skip to content

Handle configuration errors for Operators #1340

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pvary opened this issue Jul 14, 2022 · 11 comments · Fixed by #1594
Closed

Handle configuration errors for Operators #1340

pvary opened this issue Jul 14, 2022 · 11 comments · Fixed by #1594
Assignees
Labels
needs-discussion Issue needs to be discussed more before working on it
Milestone

Comments

@pvary
Copy link

pvary commented Jul 14, 2022

Feature Request

Working on the Flink Kubernetes Operator (https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/), and we would like to extend the operator with probes detecting the health of the deployment.

We have observed cases recently where even though the operator pod was running, it could not do its job due to missing rolebindings or misconfigured dynamic namespaces.

What did you do?

Deployed the Flink Kubernetes Operator

What did you expect to see?

Currently they just see that the operator itself is running, but the deployments / jobs are not created as expected.

What did you see instead? Under which circumstances?

I would like to help the users detect the issue that the Flink Kubernetes Operator is not working correctly from status of the operator

Environment

Not sure about the exact environment ATM
Will try to collect the info

Java operator version: 3.0.3

Possible Solution

We are thinking about implementing probes (liveliness/readiness/statup)

Additional context

@csviri
Copy link
Collaborator

csviri commented Jul 14, 2022

This is more like a feature request not a bug. Copy pasting the initial thoughts from discord regarding how to check access right and relations to liveness probes:

Startup probes would be useful in case for event sources would take long to sync.

Thinking how it would be possible to automatically check if the service account has correct access rights. Probably this API would be best to use:
https://kubernetes.io/docs/reference/access-authn-authz/authorization/#checking-api-access
(I don't have XP with it).

However automatically determine what resources and what verbs are utilized by an operators is not that simple. It might be doable automatically for managed dependent resources (from the annotation), see: https://javaoperatorsdk.io/docs/dependent-resources#managed-dependent-resources

Or for a lower layer, in case we implement event source definition with annotations, and make some assumptions about those resources:
#1298 (pls let us know what do you think about this issue)

Maybe an other way would be just to have an annotation that provides this information explicitly.

if there are more approaches we could take here, so currently we don't run a servlet container, and don't even want to in the core (probably). But rather could provide methods to query the operator for it's state, maybe just a method to
checkAccessPermissionForReconciler Which could be utility outside of the operator (for this case now I lean to this, after 3 minutes of thinking 🙂 ) - so no running operator required.
An another approach is just simply don't run the operator, e.g. exit the process, and create and event that explains the problem (missing create permission for a deployment for the operator)

@metacosm
Copy link
Collaborator

Note that some of this is handled in the Quarkus extension, which provides support for MicroProfile Health support using https://github.com/smallrye/smallrye-health. There's also automated creation of RBACs.

@csviri
Copy link
Collaborator

csviri commented Jul 14, 2022

There's also automated creation of RBACs.

But RBAC is just for dependent resources, right?

@csviri
Copy link
Collaborator

csviri commented Jul 14, 2022

Maybe an other way would be just to have an annotation that provides this information explicitly.

This is how it is done in go, but we could be more intelligent about this, since there are the annotations. dependent resource, event sources?, and maybe add some dedicated.

@metacosm
Copy link
Collaborator

There's also automated creation of RBACs.

But RBAC is just for dependent resources, right?

No, we try to generate as much as possible. Dependent resources help but even without them we generate some of the RBACs.

@csviri
Copy link
Collaborator

csviri commented Jul 15, 2022

So I see here more issues:

  1. generating the RBAC
  2. checking if required access rights are present on startup
  3. integration with the probes.

@metacosm
Copy link
Collaborator

Problem with detecting access rights at startup is that it's equivalent to be able to generate the RBACs because you basically need to check that the operator has actually the rights to perform all the API calls it's doing and if you can do that, well, you probably can generate the RBACs as well… 😄

@csviri
Copy link
Collaborator

csviri commented Jul 15, 2022

Problem with detecting access rights at startup is that it's equivalent to be able to generate the RBACs because you basically need to check that the operator has actually the rights to perform all the API calls it's doing and if you can do that, well, you probably can generate the RBACs as well… smile

Yes, although, if generated not necessary means it is applied :)

But good question if both needed, or worth it to implement.

@csviri csviri added the needs-discussion Issue needs to be discussed more before working on it label Jul 19, 2022
@gyfora
Copy link

gyfora commented Aug 15, 2022

The operator already fails on startup if it doesn't have access to the configured custom resources, I think this is almost good enough but we need some error hooks to react to any errors that happen later while the operator is running, or the namespaces are reconfigured etc.

That way it is straightforward to add a simple health check endpoint.

@csviri
Copy link
Collaborator

csviri commented Nov 8, 2022

Added this functionality in for v4.1:
https://javaoperatorsdk.io/docs/patterns-best-practices#stopping-or-not-operator-in-case-of-informer-errors-and-cache-sync-timeouts

In #1594 a possibility is added to fine grain the liveness probe based on the health of the event sources / informers. These 2 should cover this issue too. We don't explicitly check the rbac, but based on issues with informers the behavior will be highly configurable. So after the PR is merged I intend to close this issue. So pls let me know if you see that there is additional functionality is needed.

@gyfora
Copy link

gyfora commented Nov 8, 2022

This is great, thank you @csviri

@csviri csviri closed this as completed Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-discussion Issue needs to be discussed more before working on it
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants