
CustomPluginMonitor should support multiple "permanent" rules for the same node condition. #664


Closed
jason1028kr opened this issue Apr 5, 2022 · 14 comments
Labels: lifecycle/rotten

Comments

@jason1028kr
Contributor

jason1028kr commented Apr 5, 2022

Hello,

This is to reopen plan/discussion around an issue where CustomPluginMonitor behaves in an unexpected way (#566).

In short, when there are multiple rules for the same condition in the configuration, the NPD will generate events for all of them, causing the condition to flip-flop.

Current behavior

Described in #566

Suggestions/Ideas

As discussed in the original issue, I believe it would be reasonable for CustomPluginMonitor to behave as follows:

  • Produce one event for each condition.
  • If different rules for a condition return both NonOK/True and OK/False, resolve to the "worst" case (condition is NonOK/True).
  • If multiple rules evaluate to NonOK/True, the reason/rule already set on the current condition takes precedence. If that particular reason/rule happens to return OK/False in the current interval, take the first rule that evaluates to NonOK/True instead (see the sketch below).
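
A minimal sketch of the resolution order described above, assuming a simplified stand-in for the `Result` type quoted later in this thread (the real CustomPluginMonitor types use `*CustomRule` and live in `cpmtypes`; this is not the actual NPD code):

```go
package main

import "fmt"

// Status mirrors the custom-plugin exit statuses discussed in this thread
// (0 = OK, non-zero = NonOK). It is a simplified stand-in, not the real
// cpmtypes package.
type Status int

const (
	OK Status = iota
	NonOK
)

// Result is a simplified version of the struct quoted later in this thread;
// Rule here is just the configured reason string instead of *CustomRule.
type Result struct {
	Rule       string
	ExitStatus Status
	Message    string
}

// resolveCondition picks a single result per condition per interval:
// any NonOK wins over OK, the currently active reason keeps precedence if it
// is still NonOK, otherwise the first NonOK result is used.
func resolveCondition(currentReason string, results []Result) Result {
	var firstNonOK *Result
	for i := range results {
		r := &results[i]
		if r.ExitStatus < NonOK {
			continue // rule reported OK; ignore it for worst-case resolution
		}
		if r.Rule == currentReason {
			return *r // the active reason is still firing; keep it
		}
		if firstNonOK == nil {
			firstNonOK = r
		}
	}
	if firstNonOK != nil {
		return *firstNonOK // a new problem: first NonOK rule wins
	}
	// every rule reported OK, so the condition resolves back to OK/False
	return Result{ExitStatus: OK, Message: "all rules returned OK"}
}

func main() {
	results := []Result{
		{Rule: "KernelDeadlock", ExitStatus: OK},
		{Rule: "ReadonlyFilesystem", ExitStatus: NonOK, Message: "fs is read-only"},
	}
	fmt.Println(resolveCondition("KernelDeadlock", results))
}
```

Keeping the active reason sticky while it is still NonOK is what would avoid the flip-flopping between rules described in #566.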
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jul 4, 2022
@jason1028kr
Contributor Author

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Jul 18, 2022
@Random-Liu
Member

In short, when there are multiple rules for the same condition in the configuration, the NPD will generate events for all of them, causing the condition to flip-flop.

In this case, instead of complicating the NPD logic, why couldn't the user consolidate the 2 plugins behind 1 entrypoint and configure only one rule?

@jason1028kr
Contributor Author

jason1028kr commented Jul 28, 2022

Hi @Random-Liu! The idea was that, for a certain set of mutually exclusive rules/plugins, the user would only observe one node condition/event per interval.

For instance, if the user wants to take different remediating actions depending on which of a certain set of node conditions is active, it would be ideal to keep the full set of possible rules but resolve them to only one result per interval.

@jason1028kr
Contributor Author

If by consolidating you mean having 1 plugin script generate multiple results, I would probably have to expand the `result.ExitStatus >= cpmtypes.NonOK` check. But I'm not sure how to do that in a generalized way that isn't specific to one use case.
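
Purely for illustration, a rough sketch of what "expanding" that check could mean, using a hypothetical mapping from plugin exit codes to reasons (neither `exitCodeReasons` nor this behavior exists in NPD today):

```go
package main

import "fmt"

// Status stands in for cpmtypes.Status; only the "0 = OK, non-zero = NonOK"
// convention is taken from this thread, the rest is invented.
type Status int

const (
	OK Status = iota
	NonOK
)

// exitCodeReasons is a hypothetical mapping from plugin exit codes to
// reasons; nothing like this exists in NPD today.
var exitCodeReasons = map[int]string{
	1: "FilesystemCorruption",
	2: "KernelDeadlock",
	3: "NetworkUnavailable",
}

// classify generalizes the single "ExitStatus >= NonOK" check: zero is OK,
// known non-zero codes map to specific reasons, anything else falls back to
// a generic reason.
func classify(exitCode int) (Status, string) {
	if exitCode == 0 {
		return OK, ""
	}
	if reason, ok := exitCodeReasons[exitCode]; ok {
		return NonOK, reason
	}
	return NonOK, "UnknownProblem"
}

func main() {
	for _, code := range []int{0, 2, 7} {
		status, reason := classify(code)
		fmt.Printf("exit=%d status=%d reason=%q\n", code, status, reason)
	}
}
```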

@Random-Liu
Member

Random-Liu commented Aug 1, 2022

Hm, in the example in #566, the user should just consolidate the 2 scripts.

And for other problems with different reasons, can't you just set up different conditions for them?

It would be good if we could support different reasons for a condition, but it is probably better to get that from the custom plugin directly, instead of relying on multiple different rules.

Otherwise, if we define multiple rules for the same condition, we have to decide how to consolidate the different results, which introduces unnecessary complexity.

However, if you send a change to get formatted output from the custom plugin, we can do that. :)
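
As an illustration of getting the reason from the plugin directly, structured plugin output might look something like the snippet below; this format is hypothetical and is not something NPD currently parses:

```json
{
  "status": "NonOK",
  "reason": "FilesystemCorruption",
  "message": "fsck reported errors on /dev/sdb1"
}
```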

@jason1028kr
Contributor Author

jason1028kr commented Aug 1, 2022

Ah, I see. Yes, there should be some agreed-upon way of consolidating the different results.

Regarding your comment "And for other problems with different reasons, can't you just set up different conditions for them?": this is actually our current setup. The only issue is that more than one (condition + reason) pair could resolve to true, and we want some conditions to be mutually exclusive, so we would like to logically group the conditions and reduce their number.

If by 'get that from the custom plugin directly' you mean the plugin script, its output is currently put into the `Message` field of the `Result` type: `type Result struct { Rule *CustomRule; ExitStatus Status; Message string }`.

Wouldn't changing this behavior be a bigger design change, since we would no longer be taking the 'reason' field from the JSON config but from the script output instead?
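
For reference, the `Result` struct quoted above, annotated with where each field is populated from as described in this thread (`CustomRule` and `Status` are defined elsewhere in the same package):

```go
// Result as quoted above, with comments reflecting the current behavior
// discussed in this thread (not a proposed change).
type Result struct {
	Rule       *CustomRule // the rule from the monitor's JSON config; today the condition's reason comes from this rule
	ExitStatus Status      // exit status returned by the plugin script (OK / NonOK / ...)
	Message    string      // output of the plugin script
}
```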

@BusyOppo

This comment was marked as spam.

@jason1028kr
Contributor Author

jason1028kr commented Aug 14, 2022

@BusyOppo I see, that makes sense. I'll make a PR to support this behavior. It'll change CPM behavior a bit, but it's the right way to address the multiple-rules issue.

@jason1028kr
Contributor Author

jason1028kr commented Sep 4, 2022

@Random-Liu, does enabling the 'reasonMapping' above for multiple (3+) exit-code/reason pairings sound appropriate? I wanted to confirm that this is the right direction before making a PR. I could make it optional so it doesn't break current behavior (a rough sketch is below).
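
The comment that introduced 'reasonMapping' is hidden above, so its exact shape isn't visible in this thread. Purely as an illustration, an optional per-rule mapping from exit codes to reasons in the CPM rule config could look roughly like this (the field name and the surrounding rule fields are simplified/invented, not an existing NPD option):

```json
{
  "rules": [
    {
      "type": "permanent",
      "condition": "ExampleUnhealthy",
      "reason": "ExampleDefaultReason",
      "path": "./config/plugin/check_example.sh",
      "reasonMapping": [
        { "exitCode": 1, "reason": "FilesystemCorruption" },
        { "exitCode": 2, "reason": "KernelDeadlock" },
        { "exitCode": 3, "reason": "NetworkUnavailable" }
      ]
    }
  ]
}
```

Making the mapping optional would keep existing configs, which carry a single reason per rule, working unchanged.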

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Dec 29, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 28, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot closed this as not planned on Feb 27, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
