
Node conditions status reset on problem detector restart. #304


Closed
eatwithforks opened this issue Jun 28, 2019 · 26 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@eatwithforks

When the node problem detector is restarted, all of the node's condition statuses reset to False until the plugins complete and resubmit the correct status.

This means there is a gap where the node conditions are all False while the problem detector restarts, leading to false reports.

Is there a way to make the problem detector not reset a node's conditions on restart?

@grosser
Contributor

grosser commented Jun 28, 2019

reset to False -> reset to their initial status

@xueweiz
Contributor

xueweiz commented Jun 28, 2019

One idea is to use the problemclient.Client.GetConditions() function to retrieve the conditions' initial state upon NPD startup.

Basically, each time NPD starts up, it could call GetConditions() to determine what the node conditions were before startup, and use the result as the initial status for the node conditions.
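A rough sketch of what I mean, assuming a client with a GetConditions-style method (the interface and signature below are illustrative, not the exact problemclient API):

```go
package logmonitor

import (
	v1 "k8s.io/api/core/v1"
)

// conditionGetter is the subset of problemclient.Client assumed here.
type conditionGetter interface {
	GetConditions(types []v1.NodeConditionType) ([]*v1.NodeCondition, error)
}

// initialConditions seeds the monitor's starting conditions from what the
// node already reports, falling back to the config defaults only for
// condition types the node does not report yet (e.g. a brand-new node).
func initialConditions(c conditionGetter, defaults []v1.NodeCondition) ([]v1.NodeCondition, error) {
	types := make([]v1.NodeConditionType, 0, len(defaults))
	for _, d := range defaults {
		types = append(types, d.Type)
	}
	existing, err := c.GetConditions(types)
	if err != nil {
		return nil, err
	}
	byType := make(map[v1.NodeConditionType]*v1.NodeCondition, len(existing))
	for _, cond := range existing {
		if cond != nil {
			byType[cond.Type] = cond
		}
	}
	result := make([]v1.NodeCondition, 0, len(defaults))
	for _, d := range defaults {
		if cur, ok := byType[d.Type]; ok {
			result = append(result, *cur) // keep what the node already reports
		} else {
			result = append(result, d) // new node: fall back to the default
		}
	}
	return result, nil
}
```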

Would that work?

@eatwithforks
Author

Sounds good, that would work.

@grosser
Contributor

grosser commented Jun 28, 2019

The only downside could be something like a log-based status that then gets reset because the offending line is no longer in the log, but that's just a minor issue 🤷‍♂

@xueweiz
Contributor

xueweiz commented Jun 28, 2019

/cc @andyxning
Hi Andy, do you think the idea above makes sense? If so, I'm happy to make a patch to implement it.

The only downside could be something like a log-based status that then gets reset because the offending line is no longer in the log

True. But if that's the case (i.e. some problem daemon resets the condition when it can no longer detect the permanent problem), then even without an NPD restart it will still reset that condition eventually (when the log gets deleted/rotated).
So, in my opinion, the reset behavior (feature or bug) is out of scope for this issue. :P

Do you have any suggestions @grosser ?

@grosser
Contributor

grosser commented Jun 28, 2019

Good enough if it does not reset it on restart 👍
(that also avoids the confusion of lastTransition getting updated when nothing actually happened)

@grosser
Contributor

grosser commented Jul 2, 2019

@wangzhen127 got thoughts on this? (since @andyxning doesn't seem to be around)

@wangzhen127
Member

Are you referring to the log monitor? NPD resets the condition because the config defines default conditions. For example:
https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json#L10

	"conditions": [
		{
			"type": "KernelDeadlock",
			"reason": "KernelHasNoDeadlock",
			"message": "kernel has no deadlock"
		},
		{
			"type": "ReadonlyFilesystem",
			"reason": "FilesystemIsNotReadOnly",
			"message": "Filesystem is not read-only"
		}
	],

The above sets the default conditions for KernelDeadlock and ReadonlyFilesystem.

In the code, this is done at:
https://github.com/kubernetes/node-problem-detector/blob/master/pkg/systemlogmonitor/log_monitor.go#L180
which reads and sets the default conditions.

I guess you can just remove the default conditions from the config for your use case.
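For example, a trimmed kernel-monitor.json would keep its rules but seed no defaults (a sketch of just the changed field, not the full file):

	"conditions": [],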

@grosser
Contributor

grosser commented Jul 5, 2019

I confirmed that with the defaults removed, the default condition no longer appears when a new node comes up, and lastTransitionTime is no longer updated on deploy either.

We'd prefer the best of both: the default is set when a node is new, but existing conditions are not overridden on update ... and in particular lastTransitionTime is not bumped when the condition is already set to exactly that reason/status.
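Roughly this merge rule, as a sketch (the helper is hypothetical, not existing NPD code):

```go
package logmonitor

import (
	v1 "k8s.io/api/core/v1"
)

// mergeCondition applies desired on top of current, but keeps the original
// LastTransitionTime when nothing actually transitioned, so a restart or
// deploy does not look like a status change to watchers.
func mergeCondition(current *v1.NodeCondition, desired v1.NodeCondition) v1.NodeCondition {
	if current != nil && current.Status == desired.Status && current.Reason == desired.Reason {
		desired.LastTransitionTime = current.LastTransitionTime
	}
	return desired
}
```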

@wangzhen127
Member

That makes sense. Making NPD check the conditions upon startup sounds good to me. @xueweiz and @grosser, I am happy to review your PR to fix this. :)

@grosser
Contributor

grosser commented Jul 9, 2019

@xueweiz please make a PR if you can; if not, I can try too, but that might take a bit and be very clumsy :D

@xueweiz
Contributor

xueweiz commented Jul 12, 2019

Hi @grosser, I'm happy to work on the PR :)
Please let me submit #300 first, since the initialization code of the system-log-monitor is being changed a bit there. I'll wait for that to land, then work on this :)

@Random-Liu
Member

Random-Liu commented Jul 26, 2019

@wangzhen127 @xueweiz @grosser @eatwithforks
I think this is working as intended (WAI), at least initially, for the system log monitor.

If a node gets restarted, the problem is usually resolved. However, there is currently no way for the system log monitor to set the condition back, so we always set the default condition when NPD comes up.

We can probably keep that behavior only in the system log monitor, because for custom plugins we have a well-defined way to set conditions back.

@grosser
Contributor

grosser commented Jul 26, 2019

just to clarify: this is not about node restarting, but the detector itself crashing / getting restarted manually

@grosser
Contributor

grosser commented Jul 26, 2019

also "for custom plugin we have a well-defined way to set conditions back", I'd want for the plugin to not reset on restart if no default condition is given ... but that's blocked by #306

@Random-Liu
Member

Random-Liu commented Jul 29, 2019

just to clarify: this is not about node restarting, but the detector itself crashing / getting restarted manually

The problem is that today NPD doesn't know whether it is a process restart or a node reboot. We could probably put some state into the /run directory to distinguish the two.
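A rough sketch of that idea: /run is a tmpfs that is cleared on reboot, so a marker file there survives an NPD process restart but not a node reboot (the path and helper names below are made up for illustration):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const bootMarker = "/run/node-problem-detector/started"

// isProcessRestart reports whether NPD has already run since the last node
// boot; the first run after a reboot creates the marker and returns false.
func isProcessRestart() (bool, error) {
	if _, err := os.Stat(bootMarker); err == nil {
		return true, nil // marker survived: same boot, so this is a process restart
	} else if !os.IsNotExist(err) {
		return false, err
	}
	if err := os.MkdirAll(filepath.Dir(bootMarker), 0o755); err != nil {
		return false, err
	}
	if err := os.WriteFile(bootMarker, nil, 0o644); err != nil {
		return false, err
	}
	return false, nil // fresh boot: applying default conditions is safe
}

func main() {
	restart, err := isProcessRestart()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("process restart since boot:", restart)
}
```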

@grosser
Contributor

grosser commented Jul 29, 2019

I think "do not set a condition when the custom plugin does not define one" would already mostly solve this too :)
#306

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 26, 2019
@grosser
Contributor

grosser commented Nov 26, 2019

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 26, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 24, 2020
@grosser
Contributor

grosser commented Feb 24, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 24, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 24, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 23, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
