
Node conditions status reset on problem detector restart. #304


Closed
eatwithforks opened this issue Jun 28, 2019 · 26 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@eatwithforks

When the node problem detector is restarted, all of the node's condition statuses reset to False until the plugins complete and resubmit the correct status.

This means there is a gap where the node conditions are all False while the problem detector restarts, leading to false reports.

Is there a way to make the problem detector not reset a node's conditions on restart?

@grosser
Contributor

grosser commented Jun 28, 2019

reset to False -> reset to their initial status

@xueweiz
Contributor

xueweiz commented Jun 28, 2019

One idea is to use the problemclient.Client.GetConditions() function to retrieve the conditions' initial state upon NPD startup.

Basically, each time NPD starts up, it could call GetConditions() to determine what the node conditions were before startup, and use the result as the initial status for the node conditions.
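A rough sketch of what I mean, assuming a client with a GetConditions-style method (the interface and signature below are illustrative, not the exact problemclient API):

```go
package logmonitor

import (
	v1 "k8s.io/api/core/v1"
)

// conditionGetter is the subset of problemclient.Client assumed here.
type conditionGetter interface {
	GetConditions(types []v1.NodeConditionType) ([]*v1.NodeCondition, error)
}

// initialConditions seeds the monitor's starting conditions from what the
// node already reports, falling back to the config defaults only for
// condition types the node does not report yet (e.g. a brand-new node).
func initialConditions(c conditionGetter, defaults []v1.NodeCondition) ([]v1.NodeCondition, error) {
	types := make([]v1.NodeConditionType, 0, len(defaults))
	for _, d := range defaults {
		types = append(types, d.Type)
	}
	existing, err := c.GetConditions(types)
	if err != nil {
		return nil, err
	}
	byType := make(map[v1.NodeConditionType]*v1.NodeCondition, len(existing))
	for _, cond := range existing {
		if cond != nil {
			byType[cond.Type] = cond
		}
	}
	result := make([]v1.NodeCondition, 0, len(defaults))
	for _, d := range defaults {
		if cur, ok := byType[d.Type]; ok {
			result = append(result, *cur) // keep what the node already reports
		} else {
			result = append(result, d) // new node: fall back to the default
		}
	}
	return result, nil
}
```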

Would that work?

@eatwithforks
Author

Sounds good, that would work.

@grosser
Contributor

grosser commented Jun 28, 2019

The only downside could be something like a log-based status that then gets reset because the offending line is no longer in the log, but that's just a minor issue 🤷‍♂

@xueweiz
Contributor

xueweiz commented Jun 28, 2019

/cc @andyxning
Hi Andy, do you think the idea above makes sense? If so, I'm happy to make a patch to implement it.

The only downside could be something like a log-based status that then gets reset because the offending line is no longer in the log

True. But if that's the case (i.e. some problem daemon resets the condition when it can no longer detect the permanent problem), then even without an NPD restart it will still reset that condition eventually (when the log gets deleted/rotated).
So, in my opinion, the reset behavior (feature or bug) is out of scope for this issue. :P

Do you have any suggestions @grosser ?

@grosser
Contributor

grosser commented Jun 28, 2019

Good enough if it does not reset it on restart 👍
(that also avoids the confusion of lastTransition getting updated when nothing actually happened)

@grosser
Contributor

grosser commented Jul 2, 2019

@wangzhen127 got thoughts on this? (since @andyxning doesn't seem to be around)

@wangzhen127
Member

Are you referring to the log monitor? NPD resets the condition because the config defines default conditions. For example:
https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json#L10

	"conditions": [
		{
			"type": "KernelDeadlock",
			"reason": "KernelHasNoDeadlock",
			"message": "kernel has no deadlock"
		},
		{
			"type": "ReadonlyFilesystem",
			"reason": "FilesystemIsNotReadOnly",
			"message": "Filesystem is not read-only"
		}
	],

The above sets the default conditions for KernelDeadlock and ReadonlyFilesystem.

In the code, this is done at:
https://github.com/kubernetes/node-problem-detector/blob/master/pkg/systemlogmonitor/log_monitor.go#L180
which reads and sets the default conditions.

I guess you can just remove the default conditions from the config for your use case.
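For example, a trimmed kernel-monitor.json would keep its rules but seed no defaults (a sketch of just the changed field, not the full file):

	"conditions": [],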

@grosser
Contributor

grosser commented Jul 5, 2019

I confirmed that with the defaults removed, the default condition no longer appears when a new node comes up, and lastTransitionTime is no longer updated on deploy either.

We'd prefer the best of both: the default is set when a node is new, but existing conditions are not overridden on update ... and in particular lastTransitionTime is not bumped when the condition is already set to exactly that reason/status.
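Roughly this merge rule, as a sketch (the helper is hypothetical, not existing NPD code):

```go
package logmonitor

import (
	v1 "k8s.io/api/core/v1"
)

// mergeCondition applies desired on top of current, but keeps the original
// LastTransitionTime when nothing actually transitioned, so a restart or
// deploy does not look like a status change to watchers.
func mergeCondition(current *v1.NodeCondition, desired v1.NodeCondition) v1.NodeCondition {
	if current != nil && current.Status == desired.Status && current.Reason == desired.Reason {
		desired.LastTransitionTime = current.LastTransitionTime
	}
	return desired
}
```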

@wangzhen127
Member

That makes sense. Making NPD check the conditions upon startup sounds good to me. @xueweiz and @grosser, I am happy to review your PR to fix this. :)

@grosser
Contributor

grosser commented Jul 9, 2019

@xueweiz please make a PR if you can; if not, I can try too, but that might take a bit and be very clumsy :D

@xueweiz
Contributor

xueweiz commented Jul 12, 2019

Hi @grosser, I'm happy to work on the PR :)
Please let me submit #300 first, since the initialization code of the system-log-monitor is being changed a bit there. I'll wait for that to land, then work on this :)

@Random-Liu
Member

Random-Liu commented Jul 26, 2019

@wangzhen127 @xueweiz @grosser @eatwithforks
I think this is working as intended (WAI), at least initially, for the system log monitor.

If a node gets restarted, the problem is usually resolved. However, there is currently no way for the system log monitor to set the condition back, so we always set the default condition when NPD comes up.

We can probably keep that behavior only in the system log monitor, because for custom plugins we have a well-defined way to set conditions back.

@grosser
Contributor

grosser commented Jul 26, 2019

just to clarify: this is not about node restarting, but the detector itself crashing / getting restarted manually

@grosser
Contributor

grosser commented Jul 26, 2019

also "for custom plugin we have a well-defined way to set conditions back", I'd want for the plugin to not reset on restart if no default condition is given ... but that's blocked by #306

@Random-Liu
Member

Random-Liu commented Jul 29, 2019

just to clarify: this is not about node restarting, but the detector itself crashing / getting restarted manually

The problem is that today NPD doesn't know whether it is a process restart or a node reboot. We could probably put some state into the /run directory to distinguish the two.
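A rough sketch of that idea: /run is a tmpfs that is cleared on reboot, so a marker file there survives an NPD process restart but not a node reboot (the path and helper names below are made up for illustration):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const bootMarker = "/run/node-problem-detector/started"

// isProcessRestart reports whether NPD has already run since the last node
// boot; the first run after a reboot creates the marker and returns false.
func isProcessRestart() (bool, error) {
	if _, err := os.Stat(bootMarker); err == nil {
		return true, nil // marker survived: same boot, so this is a process restart
	} else if !os.IsNotExist(err) {
		return false, err
	}
	if err := os.MkdirAll(filepath.Dir(bootMarker), 0o755); err != nil {
		return false, err
	}
	if err := os.WriteFile(bootMarker, nil, 0o644); err != nil {
		return false, err
	}
	return false, nil // fresh boot: applying default conditions is safe
}

func main() {
	restart, err := isProcessRestart()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("process restart since boot:", restart)
}
```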

@grosser
Contributor

grosser commented Jul 29, 2019

I think "do not set a condition when the custom plugin does not define one" would already mostly solve this too :)
#306

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 26, 2019
@grosser
Contributor

grosser commented Nov 26, 2019

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 26, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 24, 2020
@grosser
Contributor

grosser commented Feb 24, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 24, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 24, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 23, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
