CustomPluginMonitor with multiple "permanent" rules for the same condition leads to confusing behavior #566


Closed
Genki-S opened this issue May 24, 2021 · 7 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


Genki-S commented May 24, 2021

Hello,

I realized that having multiple "permanent" rules for the same node condition can cause that condition to flip-flop.

This is not a bug report; rather, I aim to trigger some design discussion. Let me first describe what I see now, and then I'll write my questions and ideas.

Current behavior description

For example, let's say I have the following configuration:

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "60s",
    "timeout": "60s"
  },
  "source": "test-custom-plugin-monitor",
  "metricsReporting": true,
  "conditions": [
    {
      "type": "TestConditionHasProblem",
      "reason": "TestIsOK",
      "message": "test condition is OK"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "TestConditionHasProblem",
      "reason": "TestIsIntentionallyFailing",
      "path": "/custom-scripts/test_condition_fail.sh",
      "timeout": "60s"
    },
    {
      "type": "permanent",
      "condition": "TestConditionHasProblem",
      "reason": "TestIsIntentionallySucceeding",
      "path": "/custom-scripts/test_condition_succeed.sh",
      "timeout": "60s"
    }
  ]
}

/custom-scripts/test_condition_fail.sh has the following content:

#!/bin/sh
exit 1

/custom-scripts/test_condition_succeed.sh has the following content:

#!/bin/sh
exit 0

Now, when I run NPD with the above setup, the node condition TestConditionHasProblem flip-flops. kubectl describe node NODE shows the following events:

Events:
  Type    Reason                      Age                    From                        Message
  ----    ------                      ----                   ----                        -------
  Normal  TestIsIntentionallyFailing  5s (x5 over 4m5s)  test-custom-plugin-monitor  Node condition TestConditionHasProblem is now: True, reason: TestIsIntentionallyFailing
  Normal  TestIsOK                    5s (x4 over 3m5s)  test-custom-plugin-monitor  Node condition TestConditionHasProblem is now: False, reason: TestIsOK

From the NPD log, I can see that it generates a new status for both of the above rules in each iteration:

I0524 03:58:21.417806       1 custom_plugin_monitor.go:276] New status generated: &{Source:test-custom-plugin-monitor Events:[{Severity:info Timestamp:2021-05-24 03:58:21.417741207 +0000 UTC m=+420.067800328 Reason:TestIsOK Message:Node condition TestConditionHasProblem is now: False, reason: TestIsOK}] Conditions:[{Type:TestConditionHasProblem Status:False Transition:2021-05-24 03:58:21.417741207 +0000 UTC m=+420.067800328 Reason:TestIsOK Message:test condition is OK}]}
I0524 03:58:21.419425       1 custom_plugin_monitor.go:276] New status generated: &{Source:test-custom-plugin-monitor Events:[{Severity:info Timestamp:2021-05-24 03:58:21.419371143 +0000 UTC m=+420.069430265 Reason:TestIsIntentionallyFailing Message:Node condition TestConditionHasProblem is now: True, reason: TestIsIntentionallyFailing}] Conditions:[{Type:TestConditionHasProblem Status:True Transition:2021-05-24 03:58:21.419371143 +0000 UTC m=+420.069430265 Reason:TestIsIntentionallyFailing Message:}]}

Questions / Ideas

I think making a node condition flip-flop is a bad thing. Is this reasonable?

If so, I'd like to see some improvements to NPD so that it's difficult for users to cause this flip-flop. Some ideas:

  • Document that having multiple "permanent" rules for the same condition is bad practice
  • Let NPD emit a warning log (or even crash) when it detects multiple "permanent" rules for the same condition (a rough sketch of such a check follows this list)
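
To illustrate the second idea, here is a minimal sketch (not actual NPD code) of what such a startup check might look like. The Rule struct is hypothetical and only mirrors the fields from the JSON config above:

package main

import "fmt"

// Rule is a hypothetical struct mirroring the "rules" entries in the
// custom plugin monitor config above; it is not the real NPD type.
type Rule struct {
	Type      string
	Condition string
	Reason    string
	Path      string
}

// checkDuplicatePermanentRules returns an error as soon as it sees a second
// "permanent" rule targeting a condition that already has one.
func checkDuplicatePermanentRules(rules []Rule) error {
	seen := map[string]bool{}
	for _, r := range rules {
		if r.Type != "permanent" {
			continue
		}
		if seen[r.Condition] {
			return fmt.Errorf("multiple permanent rules found for condition %q", r.Condition)
		}
		seen[r.Condition] = true
	}
	return nil
}

func main() {
	rules := []Rule{
		{Type: "permanent", Condition: "TestConditionHasProblem", Reason: "TestIsIntentionallyFailing"},
		{Type: "permanent", Condition: "TestConditionHasProblem", Reason: "TestIsIntentionallySucceeding"},
	}
	if err := checkDuplicatePermanentRules(rules); err != nil {
		fmt.Println("config warning:", err) // NPD could log this warning or refuse to start
	}
}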

I think we could even change NPD's behavior. At first, I intuitively assumed that NPD would take the "worst" status of the rules. For example, if (as in the example above) one rule claims OK and the other claims NonOK for TestConditionHasProblem in one iteration, NPD could mark TestConditionHasProblem as True.
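
Here is a rough, hypothetical illustration of that "worst status wins" idea (not actual NPD code): collect the per-rule results for one condition in an iteration and report the condition as True if any rule reported a problem.

package main

import "fmt"

// RuleResult is a hypothetical per-iteration result for one rule of a
// condition: HasProblem is true when the rule's script signalled a problem.
type RuleResult struct {
	Reason     string
	HasProblem bool
}

// worstStatus reports true (condition should be True) if any rule reported a
// problem, together with the reason of the first such rule.
func worstStatus(results []RuleResult) (bool, string) {
	for _, r := range results {
		if r.HasProblem {
			return true, r.Reason
		}
	}
	return false, ""
}

func main() {
	results := []RuleResult{
		{Reason: "TestIsIntentionallySucceeding", HasProblem: false},
		{Reason: "TestIsIntentionallyFailing", HasProblem: true},
	}
	problem, reason := worstStatus(results)
	fmt.Println(problem, reason) // true TestIsIntentionallyFailing
}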

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Aug 22, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Sep 21, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shanalily

I wanted to reopen this since I've seen the same issue and can reproduce it with the example above.

I had the same assumption about the behavior, namely that NPD would take the "worst" status. In the example, I would expect the condition TestConditionHasProblem to always have status True, with no flapping between True and False.

One of the reasons for my assumption is that the filelog plugin supports multiple permanent condition reasons, as shown in the KernelDeadLock config. I think this works because the status is only updated if a rule evaluates to true, though it's a little unclear to me how it gets set back to false. The custom plugin monitor, however, updates any time a rule result differs from the current status, including the case where the current status is true but the rule evaluates to false/unknown.

Since multiple reasons per permanent custom condition don't work right now, I think changing the behavior is reasonable, since it should be possible to do so without changing the behavior for a single reason.

What still needs to be defined is the behavior when (a) multiple rules evaluate to something other than OK, and (b) one rule evaluates to False and another evaluates to Unknown.

My thoughts on these:
For (a), I can see this being either the first rule evaluated, so that rules are listed in order of precedence in the config file, or the last rule, so that an event is generated for every rule that is not OK and the last event shows the current condition status, which I think would also be consistent with the log monitor plugin.
For (b), this could simply reuse whatever precedence is decided on for (a); otherwise, I can see False taking precedence over Unknown. A rough sketch of the "first rule wins" variant of (a) follows.
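
To make option (a) a bit more concrete, here is a hypothetical standalone sketch (not actual NPD code, and the Status type is my own stand-in for the plugin result) of the "first rule in config order wins" variant: among the results that are not OK, the earliest rule in the config decides the condition.

package main

import "fmt"

// Status is a hypothetical three-valued plugin result.
type Status string

const (
	OK      Status = "OK"
	NonOK   Status = "NonOK"
	Unknown Status = "Unknown"
)

type RuleResult struct {
	Reason string
	Status Status
}

// pickResult implements the "first rule in config order" variant of (a): the
// first result that is not OK decides the condition; if every rule is OK,
// the condition returns to OK.
func pickResult(results []RuleResult) RuleResult {
	for _, r := range results {
		if r.Status != OK {
			return r
		}
	}
	return RuleResult{Status: OK}
}

func main() {
	results := []RuleResult{
		{Reason: "TestIsIntentionallyFailing", Status: NonOK},
		{Reason: "TestIsIntentionallySucceeding", Status: OK},
	}
	fmt.Printf("%+v\n", pickResult(results)) // {Reason:TestIsIntentionallyFailing Status:NonOK}
}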

@shanalily

/reopen

@k8s-ci-robot
Contributor

@shanalily: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
