Handle configuration errors for Operators #1340

pvary · 2022-07-14T13:25:25Z

Feature Request

Working on the Flink Kubernetes Operator (https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/), and we would like to extend the operator with probes detecting the health of the deployment.

We have observed cases recently where even though the operator pod was running, it could not do its job due to missing rolebindings or misconfigured dynamic namespaces.

What did you do?

Deployed the Flink Kubernetes Operator

What did you expect to see?

Currently they just see that the operator itself is running, but the deployments / jobs are not created as expected.

What did you see instead? Under which circumstances?

I would like to help the users detect the issue that the Flink Kubernetes Operator is not working correctly from status of the operator

Environment

Not sure about the exact environment ATM
Will try to collect the info

Java operator version: 3.0.3

Possible Solution

We are thinking about implementing probes (liveliness/readiness/statup)

Additional context

csviri · 2022-07-14T14:01:53Z

This is more like a feature request not a bug. Copy pasting the initial thoughts from discord regarding how to check access right and relations to liveness probes:

Startup probes would be useful in case for event sources would take long to sync.

Thinking how it would be possible to automatically check if the service account has correct access rights. Probably this API would be best to use:
https://kubernetes.io/docs/reference/access-authn-authz/authorization/#checking-api-access
(I don't have XP with it).

However automatically determine what resources and what verbs are utilized by an operators is not that simple. It might be doable automatically for managed dependent resources (from the annotation), see: https://javaoperatorsdk.io/docs/dependent-resources#managed-dependent-resources

Or for a lower layer, in case we implement event source definition with annotations, and make some assumptions about those resources:
#1298 (pls let us know what do you think about this issue)

Maybe an other way would be just to have an annotation that provides this information explicitly.

if there are more approaches we could take here, so currently we don't run a servlet container, and don't even want to in the core (probably). But rather could provide methods to query the operator for it's state, maybe just a method to
checkAccessPermissionForReconciler Which could be utility outside of the operator (for this case now I lean to this, after 3 minutes of thinking 🙂 ) - so no running operator required.
An another approach is just simply don't run the operator, e.g. exit the process, and create and event that explains the problem (missing create permission for a deployment for the operator)

metacosm · 2022-07-14T17:07:05Z

Note that some of this is handled in the Quarkus extension, which provides support for MicroProfile Health support using https://github.com/smallrye/smallrye-health. There's also automated creation of RBACs.

csviri · 2022-07-14T18:36:44Z

There's also automated creation of RBACs.

But RBAC is just for dependent resources, right?

csviri · 2022-07-14T18:42:52Z

Maybe an other way would be just to have an annotation that provides this information explicitly.

This is how it is done in go, but we could be more intelligent about this, since there are the annotations. dependent resource, event sources?, and maybe add some dedicated.

metacosm · 2022-07-15T07:04:40Z

There's also automated creation of RBACs.

But RBAC is just for dependent resources, right?

No, we try to generate as much as possible. Dependent resources help but even without them we generate some of the RBACs.

csviri · 2022-07-15T07:58:10Z

So I see here more issues:

generating the RBAC
checking if required access rights are present on startup
integration with the probes.

metacosm · 2022-07-15T08:19:40Z

Problem with detecting access rights at startup is that it's equivalent to be able to generate the RBACs because you basically need to check that the operator has actually the rights to perform all the API calls it's doing and if you can do that, well, you probably can generate the RBACs as well… 😄

csviri · 2022-07-15T08:33:35Z

Problem with detecting access rights at startup is that it's equivalent to be able to generate the RBACs because you basically need to check that the operator has actually the rights to perform all the API calls it's doing and if you can do that, well, you probably can generate the RBACs as well… smile

Yes, although, if generated not necessary means it is applied :)

But good question if both needed, or worth it to implement.

gyfora · 2022-08-15T08:46:00Z

The operator already fails on startup if it doesn't have access to the configured custom resources, I think this is almost good enough but we need some error hooks to react to any errors that happen later while the operator is running, or the namespaces are reconfigured etc.

That way it is straightforward to add a simple health check endpoint.

csviri · 2022-11-08T15:06:50Z

Added this functionality in for v4.1:
https://javaoperatorsdk.io/docs/patterns-best-practices#stopping-or-not-operator-in-case-of-informer-errors-and-cache-sync-timeouts

In #1594 a possibility is added to fine grain the liveness probe based on the health of the event sources / informers. These 2 should cover this issue too. We don't explicitly check the rbac, but based on issues with informers the behavior will be highly configurable. So after the PR is merged I intend to close this issue. So pls let me know if you see that there is additional functionality is needed.

gyfora · 2022-11-08T15:26:16Z

This is great, thank you @csviri

csviri added the needs-discussion Issue needs to be discussed more before working on it label Jul 19, 2022

gyfora mentioned this issue Aug 22, 2022

Allow operator to start even if not all watched namespaces are accessible #1405

Closed

csviri added this to the 4.2 milestone Nov 8, 2022

csviri mentioned this issue Nov 8, 2022

feat: runtime info for health probes #1594

Merged

csviri linked a pull request Nov 8, 2022 that will close this issue

feat: runtime info for health probes #1594

Merged

csviri self-assigned this Nov 8, 2022

csviri closed this as completed Nov 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Handle configuration errors for Operators #1340

Handle configuration errors for Operators #1340

pvary commented Jul 14, 2022 •

edited

Loading

csviri commented Jul 14, 2022 •

edited

Loading

Uh oh!

metacosm commented Jul 14, 2022

Uh oh!

csviri commented Jul 14, 2022

Uh oh!

csviri commented Jul 14, 2022

Uh oh!

metacosm commented Jul 15, 2022

Uh oh!

csviri commented Jul 15, 2022

Uh oh!

metacosm commented Jul 15, 2022

Uh oh!

csviri commented Jul 15, 2022

Uh oh!

gyfora commented Aug 15, 2022

Uh oh!

csviri commented Nov 8, 2022

Uh oh!

gyfora commented Nov 8, 2022

Uh oh!

Handle configuration errors for Operators #1340

Handle configuration errors for Operators #1340

Comments

pvary commented Jul 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Feature Request

What did you do?

What did you expect to see?

What did you see instead? Under which circumstances?

Environment

Possible Solution

Additional context

csviri commented Jul 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

metacosm commented Jul 14, 2022

Uh oh!

csviri commented Jul 14, 2022

Uh oh!

csviri commented Jul 14, 2022

Uh oh!

metacosm commented Jul 15, 2022

Uh oh!

csviri commented Jul 15, 2022

Uh oh!

metacosm commented Jul 15, 2022

Uh oh!

csviri commented Jul 15, 2022

Uh oh!

gyfora commented Aug 15, 2022

Uh oh!

csviri commented Nov 8, 2022

Uh oh!

gyfora commented Nov 8, 2022

Uh oh!

pvary commented Jul 14, 2022 •

edited

Loading

csviri commented Jul 14, 2022 •

edited

Loading