database,redis: More verbosity in OnRetryableError #124


Merged
merged 1 commit into main from logging-small-rework on May 9, 2025

Conversation

oxzi
Member

@oxzi oxzi commented Apr 15, 2025

database,redis: More verbosity in OnRetryableError

The OnRetryableError functions came with conditions to only log the
error if it had changed since the last iteration. While this keeps the
log file clean, it hides recurring errors and makes it harder to
identify specific errors.

The changed version always logs the recurring error and also includes
the duration since the start and the current attempt.
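
A minimal sketch of how such an always-logging callback could look, assuming a zap SugaredLogger and a hypothetical callback shape (the actual OnRetryableError signature in the retry package may differ):

```go
package main

import (
	"errors"
	"time"

	"go.uber.org/zap"
)

func main() {
	logger := zap.NewExample().Sugar()
	defer logger.Sync()

	start := time.Now()
	errConn := errors.New("dial tcp 127.0.0.1:5432: connect: connection refused")

	// Hypothetical callback: always log the recurring error together with
	// the elapsed time since the first attempt and the attempt counter.
	onRetryableError := func(attempt uint64, err error) {
		logger.Warnw("Can't connect to database. Retrying",
			zap.Error(err),
			zap.Duration("after", time.Since(start)),
			zap.Uint64("attempt", attempt),
		)
	}

	// Simulate three failing attempts.
	for attempt := uint64(1); attempt <= 3; attempt++ {
		onRetryableError(attempt, errConn)
		time.Sleep(100 * time.Millisecond)
	}
}
```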

@cla-bot added the cla/signed label (CLA is signed by all contributors of a PR) on Apr 15, 2025
@oxzi oxzi changed the title from "Logging small rework" to "Small Logging Rework: Reduce error wrapping in retry.WithBackoff and more verbosity in OnRetryableError" on Apr 15, 2025
@oxzi
Member Author

oxzi commented Apr 16, 2025

Removing "retry deadline exceeded" from the error message also results in more useful error messages for users. Take Icinga/icingadb#900 as an example.

@oxzi oxzi requested a review from lippserd May 7, 2025 07:32
@oxzi oxzi added this to the 0.6.4 milestone May 7, 2025
Member

@lippserd lippserd left a comment

retry.WithBackoff: Reduce error wrapping
...
But this additional layer does not really carry any useful information.

This helps to easily distinguish non-retryable from "timeout" errors.

Removing "retry deadline exceeded" from the error message also results in more useful error messages for users. Take Icinga/icingadb#900 as an example.

That's a different "retry deadline exceeded" error:

https://github.com/Icinga/icingadb/blob/main/pkg/icingadb/ha.go#L166

That doesn't mean I don't want to change anything, but I'm not a fan of just removing the wrapping. The ha.go:166 error could get more context, a more understandable message, or whatever. And if the wrapping is such a problem, I would change the renderer to output the error messages first and then the stack trace(s).
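
For illustration only (not the library's actual code): an extra "retry deadline exceeded" wrapping layer is not strictly needed to tell timeouts apart from non-retryable errors, since errors.Is already sees through wrapped errors. A minimal sketch:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	<-ctx.Done()

	// Wrapping with %w keeps the cause inspectable without callers having to
	// parse a "retry deadline exceeded" prefix out of the message.
	err := fmt.Errorf("can't connect to database: %w", ctx.Err())

	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("timeout:", err)
	} else {
		fmt.Println("non-retryable:", err)
	}
}
```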

database/redis: More verbosity in OnRetryableError
...
The changed version always logs the recurring error and also includes
the duration since the start and the current attempt.

If I am not mistaken, this leads to a large number of error messages in the first few seconds, as there is not much time between retries. Can this be optimized, for example, to only log again after at least 10 seconds if the error message has not changed?
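
One rough sketch of how such throttling could look, with hypothetical names and no claim about the retry package's actual API: suppress a repeated message unless it changed or at least 10 seconds have passed since it was last written.

```go
package main

import (
	"fmt"
	"time"
)

// throttledLogger suppresses a repeated message unless it changed or the
// configured interval has elapsed since it was last written.
type throttledLogger struct {
	interval time.Duration
	lastMsg  string
	lastAt   time.Time
}

func (t *throttledLogger) Warn(msg string) {
	if msg == t.lastMsg && time.Since(t.lastAt) < t.interval {
		return // same message, logged recently: stay quiet
	}
	t.lastMsg, t.lastAt = msg, time.Now()
	fmt.Println("WARN", msg)
}

func main() {
	l := &throttledLogger{interval: 10 * time.Second}
	for i := 0; i < 5; i++ {
		l.Warn("Can't connect to database. Retrying") // printed once, then suppressed
		time.Sleep(time.Second)
	}
}
```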

@oxzi
Member Author

oxzi commented May 9, 2025

retry.WithBackoff: Reduce error wrapping
...
But this additional layer does not really carry any useful information.

This helps to easily distinguish non-retryable from "timeout" errors.

Fair enough. I just dislike having another entry in the stack trace carrying the position of this wrapping. I am going to remove this.

Removing "retry deadline exceeded" from the error message also results in more useful error messages for users. Take Icinga/icingadb#900 as an example.

That's a different "retry deadline exceeded" error:

https://github.com/Icinga/icingadb/blob/main/pkg/icingadb/ha.go#L166

You are totally right. I have successfully confused myself.

database/redis: More verbosity in OnRetryableError
...
The changed version always logs the recurring error and also includes
the duration since the start and the current attempt.

If I am not mistaken, this leads to a large number of error messages in the first few seconds, as there is not much time between retries. Can this be optimized, for example, to only log again after at least 10 seconds if the error message has not changed?

Yes, this results in more WARNING log entries, but that was my intention.

IMO, if something is broken, it should be shown and not hidden. The current behavior only shows an error once, hiding it for up to five minutes before crashing. The suggested behavior would always show the error, resulting in a few recurring entries at first - but only if the error does not recover right away.

The OnRetryableError functions came with conditions to only log the
error if it had changed since the last iteration. While this keeps the
log file clean, it hides recurring errors and makes it harder to
identify specific errors.

The changed version always logs the recurring error and also includes
the duration since the start and the current attempt.
@oxzi oxzi force-pushed the logging-small-rework branch from aa60419 to 0c7ca17 on May 9, 2025 08:59
@oxzi oxzi changed the title from "Small Logging Rework: Reduce error wrapping in retry.WithBackoff and more verbosity in OnRetryableError" to "database,redis: More verbosity in OnRetryableError" on May 9, 2025
@oxzi oxzi requested a review from lippserd May 9, 2025 08:59
@lippserd
Member

lippserd commented May 9, 2025

retry.WithBackoff: Reduce error wrapping
...
But this additional layer does not really carry any useful information.

This helps to easily distinguish non-retryable from "timeout" errors.

Fair enough. I just dislike having another entry in the stack trace carrying the position of this wrapping. I am going to remove this.

I completely understand your point. The wrapping helps us to understand errors better, but it can be confusing for users since the messages end up on multiple lines. But as I said, I think we can improve this by adjusting how we render the errors.

database/redis: More verbosity in OnRetryableError
...
The changed version always logs the recurring error and also includes
the duration since the start and the current attempt.

If I am not mistaken, this leads to a large number of error messages in the first few seconds, as there is not much time between retries. Can this be optimized, for example, to only log again after at least 10 seconds if the error message has not changed?

Yes, this results in more WARNING log entries, but that was my intention.

IMO, if something is broken, it should be shown and not hidden. The current behavior only shows an error once, hiding it for up to five minutes before crashing. The suggested behavior would always show the error, resulting in a few recurring entries at first - but only if the error does not recover right away.

I thought the log would get cluttered because of the frequent retries in the first few seconds, but there are only six messages in the first 10 seconds, so that's perfectly fine:

2025-05-09T11:33:17.565+0200	WARN	database	Can't connect to database. Retrying	{"error": "dial tcp 127.0.0.1:5432: connect: connection refused", "after": "1.30975ms", "attempt": 1}
2025-05-09T11:33:17.717+0200	WARN	database	Can't connect to database. Retrying	{"error": "dial tcp 127.0.0.1:5432: connect: connection refused", "after": "153.499917ms", "attempt": 2}
2025-05-09T11:33:18.184+0200	WARN	database	Can't connect to database. Retrying	{"error": "dial tcp 127.0.0.1:5432: connect: connection refused", "after": "620.013209ms", "attempt": 3}
2025-05-09T11:33:19.001+0200	WARN	database	Can't connect to database. Retrying	{"error": "dial tcp 127.0.0.1:5432: connect: connection refused", "after": "1.436866667s", "attempt": 4}
2025-05-09T11:33:20.775+0200	WARN	database	Can't connect to database. Retrying	{"error": "dial tcp 127.0.0.1:5432: connect: connection refused", "after": "3.211682125s", "attempt": 5}
2025-05-09T11:33:22.843+0200	WARN	database	Can't connect to database. Retrying	{"error": "dial tcp 127.0.0.1:5432: connect: connection refused", "after": "5.279387875s", "attempt": 6}
2025-05-09T11:33:30.280+0200	WARN	database	Can't connect to database. Retrying	{"error": "dial tcp 127.0.0.1:5432: connect: connection refused", "after": "12.716074292s", "attempt": 7}
2025-05-09T11:33:45.382+0200	WARN	database	Can't connect to database. Retrying	{"error": "dial tcp 127.0.0.1:5432: connect: connection refused", "after": "27.817010459s", "attempt": 8}
2025-05-09T11:34:12.283+0200	WARN	database	Can't connect to database. Retrying	{"error": "dial tcp 127.0.0.1:5432: connect: connection refused", "after": "54.717846s", "attempt": 9}

@oxzi oxzi merged commit b05aeed into main May 9, 2025
15 checks passed
@oxzi oxzi deleted the logging-small-rework branch May 9, 2025 11:24