Log can be flooded by warning messages #2218
Comments
For me, when Tarantool starts to "spam" these messages, CPU usage goes to 100% and Tarantool stops responding at all.
To fix this issue, we need to introduce
I think we should raise the priority of this issue, because I'm seeing customers complain about logs quickly inflating to dozens of gigabytes because of millions of messages like this:
2018-11-12 03:44:02.144 [20] main/420404/main vy_quota.c:245 W> waited for 760 bytes of vinyl memory quota for too long: 6.874 sec
This makes the problem really difficult to investigate. We should suppress similar messages, something like this:
2018-11-12 03:44:23.360 [20] iproto iproto.cc:521 W> stopping input on connection fd 68, aka 10.0.0.1:3800, readahead limit is reached
Also, printing those useless messages seems to eat quite a bit of CPU time.
'readahead limit' also makes logs unreadable. Completely.
FFmpeg is able to compress similar/same messages like this:
It looks rather simple when considering the case when the messages are all exactly the same, because then you can just keep the last message string in-memory at all times, along with some counter, and when an attempt is made to log the same message a second time, just increment the counter and suppress the message.
then print out
Obviously this gets much more complicated when filtering similar-but-not-equal messages.
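The exact-match case described above can be sketched roughly as follows. This is only an illustration, not FFmpeg's or Tarantool's code: `struct dedup_log`, `dedup_log_write()`, `dedup_log_flush()`, the 256-byte buffer and the wording of the summary line are all made up for the example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: suppresses consecutive identical messages. */
struct dedup_log {
	char last[256];   /* last message actually written (truncated copy) */
	unsigned repeats; /* identical messages suppressed since then */
	FILE *out;        /* where messages go */
};

/* Print the pending "repeated" summary, if any, and reset the counter. */
static void
dedup_log_flush(struct dedup_log *l)
{
	if (l->repeats > 0) {
		/* The summary wording is invented for this sketch. */
		fprintf(l->out, "(previous message repeated %u more times)\n",
			l->repeats);
		l->repeats = 0;
	}
}

/*
 * Log msg unless it equals the previous one. Returns 1 if the message
 * was written, 0 if it was suppressed. Only the first 255 bytes are
 * compared, so longer messages sharing that prefix count as equal.
 */
static int
dedup_log_write(struct dedup_log *l, const char *msg)
{
	if (l->last[0] != '\0' &&
	    strncmp(l->last, msg, sizeof(l->last) - 1) == 0) {
		l->repeats++;
		return 0;
	}
	dedup_log_flush(l);
	fprintf(l->out, "%s\n", msg);
	snprintf(l->last, sizeof(l->last), "%s", msg);
	return 1;
}
```

A caller would keep one `struct dedup_log` per log destination and call `dedup_log_flush()` once more at shutdown, so a trailing run of repeats is not lost. As noted above, this does nothing for messages that differ only in a number or an address.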
We will use it to limit the rate of log messages. Needed for #2218
There are a few warning messages that can easily flood the log, making it more difficult to figure out what causes the problem. Those are
- too long WAL write
- waited for ... bytes of vinyl memory quota for too long
- get/select(...) => ... took too long
- readahead limit is reached
- net_msg_max limit is reached
Actually, it's pointless to print each and every one of them, because all messages of the same kind are similar and don't convey any additional information. So this patch limits the rate at which those messages may be printed.
To achieve that, it introduces the say_ratelimited() helper, which works exactly like say() except it does nothing if too many messages of the same kind have already been printed in the last few seconds. The implementation is trivial - say_ratelimited() defines a static ratelimit state variable at its call site (it's a macro) and checks it before logging anything. If the ratelimit state says that an event may be emitted, it will log the message, otherwise it will skip it and eventually print the total number of skipped messages instead.
The rate limit is set to 10 messages per 5 seconds for each kind of warning message enumerated above.
Here's how it looks in the log:
2018-12-11 18:07:21.830 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.831 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.831 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.831 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.831 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.832 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.832 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.832 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.832 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.832 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:26.851 [30404] iproto iproto.cc:524 W> 9635 messages suppressed
Closes #2218
(cherry picked from commit e6ebd5e)
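The scheme from the commit message (a static state at each call site, 10 messages per 5 seconds, then a count of what was skipped) could look roughly like the sketch below. It is not the actual Tarantool code: `struct rl_state`, `rl_check()`, `monotonic_now()` and `log_ratelimited()` are hypothetical names, and the sketch writes to stderr with `fprintf` instead of going through say().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical rate limit state, one instance per kind of message. */
struct rl_state {
	double interval; /* window length, in seconds */
	int burst;       /* max messages emitted per window */
	int emitted;     /* messages emitted in the current window */
	int suppressed;  /* messages skipped in the current window */
	double start;    /* time the current window started */
};

/*
 * Returns true if an event may be emitted at time now. When a new
 * window starts, *suppressed receives the number of events skipped
 * in the previous window; otherwise it is set to 0.
 */
static bool
rl_check(struct rl_state *rl, double now, int *suppressed)
{
	*suppressed = 0;
	if (now > rl->start + rl->interval) {
		/* The window is over: report and reset the counters. */
		*suppressed = rl->suppressed;
		rl->emitted = 0;
		rl->suppressed = 0;
		rl->start = now;
	}
	if (rl->emitted < rl->burst) {
		rl->emitted++;
		return true;
	}
	rl->suppressed++;
	return false;
}

/* Current monotonic time in seconds. */
static double
monotonic_now(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

/*
 * The static state lives at the call site, so every place that uses
 * the macro gets its own limit: 10 messages per 5 seconds here.
 */
#define log_ratelimited(...) do {					\
	static struct rl_state rl_ = { 5.0, 10, 0, 0, 0.0 };		\
	int suppressed_ = 0;						\
	bool ok_ = rl_check(&rl_, monotonic_now(), &suppressed_);	\
	if (suppressed_ > 0)						\
		fprintf(stderr, "%d messages suppressed\n", suppressed_); \
	if (ok_)							\
		fprintf(stderr, __VA_ARGS__);				\
} while (0)
```

Note that in this sketch the skipped count is only reported when the next message of the same kind arrives after the window has ended. If the flood stops for good, the last batch of suppressed messages is never reported.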
For instance, when Tarantool hits the limit on open files, the log is flooded with thousands of messages like below, which makes the log difficult to read.
We need to come up with a way of limiting the rate at which messages can be printed to the log.