CustomPluginMonitor with multiple "permanent" rules for the same condition leads to confusing behavior #566


Closed
Genki-S opened this issue May 24, 2021 · 7 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


Genki-S commented May 24, 2021

Hello,

I realized that having multiple "permanent" rules for the same node condition can cause that condition to flip-flop.

This is not a bug report; rather, I aim to trigger some design discussion. Let me first describe what I see now, and then I'll write my questions and ideas.

Current behavior description

For example, let's say I have the following configuration:

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "60s",
    "timeout": "60s"
  },
  "source": "test-custom-plugin-monitor",
  "metricsReporting": true,
  "conditions": [
    {
      "type": "TestConditionHasProblem",
      "reason": "TestIsOK",
      "message": "test condition is OK"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "TestConditionHasProblem",
      "reason": "TestIsIntentionallyFailing",
      "path": "/custom-scripts/test_condition_fail.sh",
      "timeout": "60s"
    },
    {
      "type": "permanent",
      "condition": "TestConditionHasProblem",
      "reason": "TestIsIntentionallySucceeding",
      "path": "/custom-scripts/test_condition_succeed.sh",
      "timeout": "60s"
    }
  ]
}

/custom-scripts/test_condition_fail.sh has the following content:

#!/bin/sh
exit 1

/custom-scripts/test_condition_succeed.sh has the following content:

#!/bin/sh
exit 0

Now, when I run NPD with the above setup, the node condition TestConditionHasProblem flip-flops. kubectl describe node NODE shows the following events:

Events:
  Type    Reason                      Age                    From                        Message
  ----    ------                      ----                   ----                        -------
  Normal  TestIsIntentionallyFailing  5s (x5 over 4m5s)  test-custom-plugin-monitor  Node condition TestConditionHasProblem is now: True, reason: TestIsIntentionallyFailing
  Normal  TestIsOK                    5s (x4 over 3m5s)  test-custom-plugin-monitor  Node condition TestConditionHasProblem is now: False, reason: TestIsOK

From the NPD log, I can see that it generates a new status for both of the above rules in each iteration:

I0524 03:58:21.417806       1 custom_plugin_monitor.go:276] New status generated: &{Source:test-custom-plugin-monitor Events:[{Severity:info Timestamp:2021-05-24 03:58:21.417741207 +0000 UTC m=+420.067800328 Reason:TestIsOK Message:Node condition TestConditionHasProblem is now: False, reason: TestIsOK}] Conditions:[{Type:TestConditionHasProblem Status:False Transition:2021-05-24 03:58:21.417741207 +0000 UTC m=+420.067800328 Reason:TestIsOK Message:test condition is OK}]}
I0524 03:58:21.419425       1 custom_plugin_monitor.go:276] New status generated: &{Source:test-custom-plugin-monitor Events:[{Severity:info Timestamp:2021-05-24 03:58:21.419371143 +0000 UTC m=+420.069430265 Reason:TestIsIntentionallyFailing Message:Node condition TestConditionHasProblem is now: True, reason: TestIsIntentionallyFailing}] Conditions:[{Type:TestConditionHasProblem Status:True Transition:2021-05-24 03:58:21.419371143 +0000 UTC m=+420.069430265 Reason:TestIsIntentionallyFailing Message:}]}

Questions / Ideas

I think making a node condition flip-flop is a bad thing. Is this reasonable?

If so, I'd like to see some improvements to NPD so that it's difficult for users to cause this flip-flop. Some ideas:

  • Document that having multiple "permanent" rules for the same condition is bad practice
  • Let NPD emit a warning log (or even crash) when it detects multiple "permanent" rules for the same condition (a rough sketch of such a check follows this list)
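
To illustrate the second idea, here is a minimal sketch (not actual NPD code) of what such a startup check might look like. The Rule struct is hypothetical and only mirrors the fields from the JSON config above:

package main

import "fmt"

// Rule is a hypothetical struct mirroring the "rules" entries in the
// custom plugin monitor config above; it is not the real NPD type.
type Rule struct {
	Type      string
	Condition string
	Reason    string
	Path      string
}

// checkDuplicatePermanentRules returns an error as soon as it sees a second
// "permanent" rule targeting a condition that already has one.
func checkDuplicatePermanentRules(rules []Rule) error {
	seen := map[string]bool{}
	for _, r := range rules {
		if r.Type != "permanent" {
			continue
		}
		if seen[r.Condition] {
			return fmt.Errorf("multiple permanent rules found for condition %q", r.Condition)
		}
		seen[r.Condition] = true
	}
	return nil
}

func main() {
	rules := []Rule{
		{Type: "permanent", Condition: "TestConditionHasProblem", Reason: "TestIsIntentionallyFailing"},
		{Type: "permanent", Condition: "TestConditionHasProblem", Reason: "TestIsIntentionallySucceeding"},
	}
	if err := checkDuplicatePermanentRules(rules); err != nil {
		fmt.Println("config warning:", err) // NPD could log this warning or refuse to start
	}
}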

I think we could even change NPD's behavior. At first, I intuitively assumed that NPD would take the "worst" status of the rules. For example, if (as in the example above) one rule claims OK and the other claims NonOK for TestConditionHasProblem in one iteration, NPD could mark TestConditionHasProblem as True.
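
Here is a rough, hypothetical illustration of that "worst status wins" idea (not actual NPD code): collect the per-rule results for one condition in an iteration and report the condition as True if any rule reported a problem.

package main

import "fmt"

// RuleResult is a hypothetical per-iteration result for one rule of a
// condition: HasProblem is true when the rule's script signalled a problem.
type RuleResult struct {
	Reason     string
	HasProblem bool
}

// worstStatus reports true (condition should be True) if any rule reported a
// problem, together with the reason of the first such rule.
func worstStatus(results []RuleResult) (bool, string) {
	for _, r := range results {
		if r.HasProblem {
			return true, r.Reason
		}
	}
	return false, ""
}

func main() {
	results := []RuleResult{
		{Reason: "TestIsIntentionallySucceeding", HasProblem: false},
		{Reason: "TestIsIntentionallyFailing", HasProblem: true},
	}
	problem, reason := worstStatus(results)
	fmt.Println(problem, reason) // true TestIsIntentionallyFailing
}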

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Aug 22, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Sep 21, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shanalily

I wanted to reopen this since I've seen the same issue and can reproduce it with the example above.

I had the same assumption about the behavior, namely that NPD would take the "worst" status. In the example, I would expect the condition TestConditionHasProblem to always have status True, with no flapping between True and False.

One of the reasons for my assumption is that the filelog plugin supports multiple permanent condition reasons, as shown in the KernelDeadLock config. I think this works because the status is only updated if a rule evaluates to true, though it's a little unclear to me how it gets set back to false. The custom plugin monitor, however, updates any time a rule result differs from the current status, including the case where the current status is true but the rule evaluates to false/unknown.

Since multiple reasons per permanent custom condition don't work right now, I think changing the behavior is reasonable, since it should be possible to do so without changing the behavior for a single reason.

What still needs to be defined is the behavior when (a) multiple rules evaluate to something other than OK, and (b) one rule evaluates to False and another evaluates to Unknown.

My thoughts on these:
For (a), I can see this being either the first rule evaluated, so that rules are listed in order of precedence in the config file, or the last rule, so that an event is generated for every rule that is not OK and the last event shows the current condition status, which I think would also be consistent with the log monitor plugin.
For (b), this could simply reuse whatever precedence is decided on for (a); otherwise, I can see False taking precedence over Unknown. A rough sketch of the "first rule wins" variant of (a) follows.
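
To make option (a) a bit more concrete, here is a hypothetical standalone sketch (not actual NPD code, and the Status type is my own stand-in for the plugin result) of the "first rule in config order wins" variant: among the results that are not OK, the earliest rule in the config decides the condition.

package main

import "fmt"

// Status is a hypothetical three-valued plugin result.
type Status string

const (
	OK      Status = "OK"
	NonOK   Status = "NonOK"
	Unknown Status = "Unknown"
)

type RuleResult struct {
	Reason string
	Status Status
}

// pickResult implements the "first rule in config order" variant of (a): the
// first result that is not OK decides the condition; if every rule is OK,
// the condition returns to OK.
func pickResult(results []RuleResult) RuleResult {
	for _, r := range results {
		if r.Status != OK {
			return r
		}
	}
	return RuleResult{Status: OK}
}

func main() {
	results := []RuleResult{
		{Reason: "TestIsIntentionallyFailing", Status: NonOK},
		{Reason: "TestIsIntentionallySucceeding", Status: OK},
	}
	fmt.Printf("%+v\n", pickResult(results)) // {Reason:TestIsIntentionallyFailing Status:NonOK}
}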

@shanalily

/reopen

@k8s-ci-robot
Contributor

@shanalily: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
