Node conditions status reset on problem detector restart. #304
One idea is to re-run detection on startup. Basically, each time NPD starts up, it could have the problem daemons re-check for existing problems (for the log monitor, by looking back through recent log history) rather than resetting the conditions to their defaults. Would that work?
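A minimal sketch of that idea, assuming a plain-text log file and a single hand-written rule loosely modeled on the kernel monitor's DockerHung pattern (the path, rule, and function names are illustrative, not NPD's actual implementation):

```go
// Illustrative only: re-derive conditions from recent log history on startup
// instead of starting from the configured defaults.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// rule maps a log pattern to the condition and reason it should set.
type rule struct {
	pattern   *regexp.Regexp
	condition string
	reason    string
}

// rescan reads the log once and records which conditions are still evidenced.
func rescan(path string, rules []rule) (map[string]string, error) {
	conditions := map[string]string{} // condition type -> reason
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		for _, r := range rules {
			if r.pattern.MatchString(scanner.Text()) {
				conditions[r.condition] = r.reason
			}
		}
	}
	return conditions, scanner.Err()
}

func main() {
	rules := []rule{{
		pattern:   regexp.MustCompile(`blocked for more than \d+ seconds`),
		condition: "KernelDeadlock",
		reason:    "DockerHung",
	}}
	conds, err := rescan("/var/log/kern.log", rules) // hypothetical log path
	if err != nil {
		panic(err)
	}
	fmt.Println(conds)
}
```

The obvious limitation is the one raised below: anything whose evidence has already rotated out of the log would not be re-detected.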
sounds good, that would work.
only downside could be something like a log-derived condition that then gets reset because the offending line is no longer in the log, but that's just a minor issue 🤷‍♂️
/cc @andyxning
True. But if that's the case (i.e. some problem daemon resets a condition when it can no longer detect the permanent problem), then even if NPD is not restarted, it will still reset that condition eventually (when the log gets deleted/rotated). Do you have any suggestions, @grosser?
Good enough if it does not reset it on restart 👍
@wangzhen127 got thoughts on this? (since @andyxning seems to not be around)
Are you referring to the log monitor? NPD resets the condition because the config carries a default condition for it. For example, the stock kernel-monitor.json declares a default KernelDeadlock condition with reason KernelHasNoDeadlock, and that default is what the log monitor applies when it initializes its status on startup. I guess you can just remove the default conditions from the config for your use case.
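A simplified sketch of that startup behavior, not the actual NPD source (the Condition type and the initialConditions name are approximations of what the repo uses): the log monitor seeds its status from the configured defaults, which is what overwrites whatever condition was already on the node.

```go
package main

import "fmt"

// Condition approximates NPD's condition type.
type Condition struct {
	Type    string
	Status  string // "True", "False" or "Unknown"
	Reason  string
	Message string
}

// initialConditions copies the defaults from the monitor config and marks
// every condition healthy, i.e. Status "False" with the default reason.
func initialConditions(defaults []Condition) []Condition {
	conditions := make([]Condition, len(defaults))
	copy(conditions, defaults)
	for i := range conditions {
		conditions[i].Status = "False"
	}
	return conditions
}

func main() {
	defaults := []Condition{{
		Type:    "KernelDeadlock",
		Reason:  "KernelHasNoDeadlock",
		Message: "kernel has no deadlock",
	}}
	fmt.Printf("%+v\n", initialConditions(defaults))
}
```

With the defaults removed from the config, that slice is simply empty, so nothing is overwritten on startup.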
I confirmed that with the defaults removed, the default condition no longer appears when a new node comes up, and lastTransitionTime is no longer bumped on deploy either. We'd prefer the best of both: the default is set when the node is new, but it does not override the condition when updating, and especially does not bump lastTransitionTime when the condition is already set to exactly that reason/status.
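Roughly what that would look like (my sketch of the requested behavior, not existing NPD code; needsUpdate is a made-up helper):

```go
package conditionutil

import v1 "k8s.io/api/core/v1"

// needsUpdate reports whether the desired condition should be written to the
// node. Skipping the write when status and reason are unchanged keeps the
// existing lastTransitionTime intact across deploys and restarts.
func needsUpdate(current *v1.NodeCondition, desired v1.NodeCondition) bool {
	if current == nil {
		return true // condition missing, e.g. a brand new node: set the default
	}
	return current.Status != desired.Status || current.Reason != desired.Reason
}
```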
@xueweiz please make a PR if you can; if not, I can try too, but that might take a bit and be very clumsy :D
@wangzhen127 @xueweiz @grosser @eatwithforks If a node gets restarted, the problem is usually resolved. However, there is currently no way for the system log monitor to set the condition back, so we always set the default condition when NPD comes up. We can probably keep that behavior only in the system log monitor, because for custom plugins we have a well-defined way to set conditions back.
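For context, NPD's custom plugin protocol lets a plugin report both the problem and its absence through its exit status (0 = OK, 1 = NonOK, 2 = Unknown), which is that "well-defined way to set conditions back". A toy plugin, with the marker-file check invented for the example:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Hypothetical check: treat the presence of a marker file as the problem.
	if _, err := os.Stat("/var/run/example-problem"); err == nil {
		fmt.Println("example problem detected")
		os.Exit(1) // NonOK: NPD sets the condition to True
	}
	fmt.Println("no problem detected")
	os.Exit(0) // OK: NPD can set the condition back to False
}
```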
just to clarify: this is not about node restarting, but the detector itself crashing / getting restarted manually |
also "for custom plugin we have a well-defined way to set conditions back", I'd want for the plugin to not reset on restart if no default condition is given ... but that's blocked by #306 |
The problem is that today NPD doesn't know whether it is a process restart or a node reboot. We can probably put some state somewhere on the node so that it can tell the difference.
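One way that state could work, purely as an illustration (the marker path and the assumption that /run is tmpfs are mine, not from this thread): keep a file on a tmpfs mount, which survives a process restart but disappears after a reboot.

```go
package main

import (
	"fmt"
	"os"
)

const marker = "/run/node-problem-detector.started" // hypothetical path on tmpfs

// freshBoot reports whether this looks like the first NPD start since boot,
// creating the marker so later process restarts can tell the difference.
func freshBoot() (bool, error) {
	if _, err := os.Stat(marker); err == nil {
		return false, nil // marker still there: just a process restart
	} else if !os.IsNotExist(err) {
		return false, err
	}
	f, err := os.Create(marker)
	if err != nil {
		return false, err
	}
	return true, f.Close()
}

func main() {
	fresh, err := freshBoot()
	if err != nil {
		panic(err)
	}
	fmt.Println("fresh boot:", fresh)
}
```

Default conditions would then only be applied on a fresh boot, not on a mere process restart.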
I think the "do not set a condition when the custom plugin does not define a default condition" change would already mostly solve this too :)
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
When the node problem detector is restarted, all of the node's condition statuses reset to False until the plugins complete and resubmit the correct status.
This means there is a gap where the node's conditions are all False whenever the problem detector restarts, leading to false reports.
Is there a way to make the problem detector not reset a node's conditions on restart?