
Log can be flooded by warning messages #2218


Closed
locker opened this issue Mar 21, 2017 · 6 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@locker
Member

locker commented Mar 21, 2017

For instance, when tarantool hits the limit on open files, the log is flooded with thousands of messages like the ones below, which makes it difficult to read the log.

2017-03-18 10:09:40.788 [19872] iproto sio.cc:251 !> SystemError accept, called on fd 13, aka 0.0.0.0:3301: Too many open files
2017-03-18 10:09:40.788 [19872] iproto sio.cc:251 !> SystemError accept, called on fd 13, aka 0.0.0.0:3301: Too many open files
2017-03-18 10:09:40.788 [19872] iproto sio.cc:251 !> SystemError accept, called on fd 13, aka 0.0.0.0:3301: Too many open files
2017-03-18 10:09:40.788 [19872] iproto sio.cc:251 !> SystemError accept, called on fd 13, aka 0.0.0.0:3301: Too many open files
2017-03-18 10:09:40.788 [19872] iproto sio.cc:251 !> SystemError accept, called on fd 13, aka 0.0.0.0:3301: Too many open files
2017-03-18 10:09:40.788 [19872] iproto sio.cc:251 !> SystemError accept, called on fd 13, aka 0.0.0.0:3301: Too many open files

We need to come up with a way of limiting the rate at which messages can be printed to the log.

@locker locker self-assigned this Mar 21, 2017
@kostja kostja added this to the 1.7.5 milestone Mar 28, 2017
@Igorjan94

which makes it difficult to read the log

For me, when tarantool starts "spamming" these messages, CPU usage goes to 100% and tarantool stops responding altogether.

@kostja kostja added feature A new functionality prio3 labels Apr 17, 2017
@locker
Member Author

locker commented Aug 17, 2017

To fix this issue, we need to introduce say_ratelimited(), which takes a rate limit state as an additional argument and suppresses the message if the rate limit is exceeded, similarly to how it is done here. This way, for each particular say() call site, we can decide whether it needs rate limiting and, if it does, how many messages may be printed per unit of time at most.

SocketError, which is the source of log spamming in this issue, should have a rate limit state shared among all objects of this class. The class should then override the Exception::log() method so that it uses say_ratelimited() with the shared rate limit state instead of say(). This would resolve the log spamming problem described here and lay the foundation for fixing similar issues that may be reported in the future.
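A rate limit state of the kind described above can be as simple as a fixed time window with a message budget. The sketch below is purely illustrative: the struct ratelimit layout and the ratelimit_check() helper are assumptions made here, not the API that was eventually committed.

/*
 * Hypothetical sketch of a rate limit state: a fixed time window with
 * a budget of messages per window. Names and layout are illustrative
 * only, not the actual Tarantool implementation.
 */
#include <stdbool.h>
#include <time.h>

struct ratelimit {
        double interval;    /* window length, in seconds */
        int burst;          /* max messages emitted per window */
        double start;       /* start time of the current window */
        int emitted;        /* messages emitted in the current window */
        int suppressed;     /* messages dropped in the current window */
};

static double
monotonic_seconds(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/*
 * Return true if the caller may emit a message now. When a new window
 * begins, *suppressed receives the number of messages dropped in the
 * previous window so the caller can report it.
 */
static bool
ratelimit_check(struct ratelimit *rl, int *suppressed)
{
        double now = monotonic_seconds();
        *suppressed = 0;
        if (now > rl->start + rl->interval) {
                *suppressed = rl->suppressed;
                rl->start = now;
                rl->emitted = 0;
                rl->suppressed = 0;
        }
        if (rl->emitted < rl->burst) {
                rl->emitted++;
                return true;
        }
        rl->suppressed++;
        return false;
}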

@rtsisyk rtsisyk modified the milestones: 1.7.6, 1.7.5 Aug 22, 2017
@rtsisyk rtsisyk modified the milestones: 1.7.6, 1.7.7 Oct 21, 2017
@kostja kostja modified the milestones: 1.7.7, 1.7.8, 1.8.6, wishlist Oct 24, 2017
@kyukhin kyukhin removed the prio3 label Nov 14, 2018
@kyukhin kyukhin modified the milestones: wishlist, 1.10.3 Nov 14, 2018
@locker
Member Author

locker commented Nov 14, 2018

I think we should raise the priority of this issue, because I'm seeing customers complain about logs quickly inflating to dozens of gigabytes because of millions of messages like these:

2018-11-12 03:44:02.144 [20] main/420404/main vy_quota.c:245 W> waited for 760 bytes of vinyl memory quota for too long: 6.874 sec
2018-11-12 03:44:04.273 [20] main/402386/main txn.c:295 W> too long WAL write: 1 rows at LSN 1216903745: 0.823 sec
2018-11-12 03:44:04.508 [20] iproto/101/main iproto.cc:532 W> stopping input on connection fd 646, aka 10.0.0.1:3800, peer of 10.0.0.3:52656, net_msg_max limit is reached
2018-11-12 03:44:23.360 [20] iproto iproto.cc:521 W> stopping input on connection fd 68, aka 10.0.0.1:3800, readahead limit is reached

This makes the problem really difficult to investigate. We should suppress repeated messages so that the log looks something like this:

2018-11-12 03:44:23.360 [20] iproto iproto.cc:521 W> stopping input on connection fd 68, aka 10.0.0.1:3800, readahead limit is reached
2018-11-12 03:44:23.360 [20] iproto iproto.cc:521 W> 1021 messages suppressed

@locker
Member Author

locker commented Nov 14, 2018

Also, printing those useless messages seems to eat quite a bit of CPU time.

@Gerold103
Collaborator

'readahead limit' also makes logs unreadable. Completely.

@gdrbyKo1
Contributor

FFmpeg is able to compress similar/same messages like this:

[mpeg2video @ 0x7faf3b824200] Invalid frame dimensions 0x0.
Last message repeated 14 times

The case where the messages are all exactly the same looks rather simple: keep the last message string in memory at all times, along with a counter, and when an attempt is made to log the same message again, just increment the counter and suppress the message.
Continue until either:

  • a different message shows up
  • some arbitrary time passes, like 2 seconds or so

then print out Last message repeated n times, reset the counter, and set the stored message to the most recent one.

Obviously this gets much more complicated when filtering similar-but-not-equal messages.
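A minimal sketch of that last-message scheme, assuming a hypothetical log_line() entry point and a 2-second flush window (both made up for illustration):

#include <stdio.h>
#include <string.h>
#include <time.h>

#define LAST_MSG_MAX 1024
#define FLUSH_TIMEOUT 2.0 /* seconds, as suggested above */

static char last_msg[LAST_MSG_MAX];
static int repeat_count;
static time_t last_time;

static void
log_line(const char *msg)
{
        time_t now = time(NULL);
        if (strcmp(msg, last_msg) == 0 &&
            difftime(now, last_time) < FLUSH_TIMEOUT) {
                /* Same message again within the window: count it silently. */
                repeat_count++;
                return;
        }
        /* Different message (or timeout expired): flush the counter. */
        if (repeat_count > 0) {
                fprintf(stderr, "Last message repeated %d times\n", repeat_count);
                repeat_count = 0;
        }
        fprintf(stderr, "%s\n", msg);
        snprintf(last_msg, sizeof(last_msg), "%s", msg);
        last_time = now;
}

Note that in this sketch the repeat counter is only flushed on the next log_line() call; a real implementation would probably also want a timer so the summary line appears even if nothing else is logged for a while.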

@locker locker added bug Something isn't working and removed feature A new functionality labels Nov 23, 2018
locker added a commit that referenced this issue Dec 11, 2018
We will use it to limit the rate of log messages.

Needed for #2218
locker added a commit that referenced this issue Dec 11, 2018
There are a few warning messages that can easily flood the log, making
it more difficult to figure out what causes the problem. Those are
 - too long WAL write
 - waited for ... bytes of vinyl memory quota for too long
 - get/select(...) => ... took too long
 - readahead limit is reached
 - net_msg_max limit is reached

Actually, it's pointless to print each and every one of them, because all
messages of the same kind are similar and don't convey any additional
information.

So this patch limits the rate at which those messages may be printed.
To achieve that, it introduces say_ratelimited() helper, which works
exactly like say() except it does nothing if too many messages of
the same kind have already been printed in the last few seconds.
The implementation is trivial - say_ratelimited() defines a static
ratelimit state variable at its call site (it's a macro) and checks it
before logging anything. If the ratelimit state says that an event may
be emitted, it will log the message, otherwise it will skip it and
eventually print the total number of skipped messages instead.

The rate limit is set to 10 messages per 5 seconds for each kind of
warning message enumerated above.

Here's how it looks in the log:

2018-12-11 18:07:21.830 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.831 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.831 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.831 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.831 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.832 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.832 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.832 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.832 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:21.832 [30404] iproto iproto.cc:524 W> stopping input on connection fd 15, aka 127.0.0.1:12345, peer of 127.0.0.1:59212, readahead limit is reached
2018-12-11 18:07:26.851 [30404] iproto iproto.cc:524 W> 9635 messages suppressed

Closes #2218
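For reference, a call-site macro along the lines described in the commit message could look roughly like the sketch below. It reuses the hypothetical struct ratelimit / ratelimit_check() from the earlier sketch, and say(level, fmt, ...) stands in for the regular logging call; none of these names are necessarily what actually landed in the tree.

/*
 * Hypothetical sketch of say_ratelimited(): one static ratelimit state
 * per call site (the macro expands separately everywhere it is used),
 * 10 messages per 5 seconds, plus a summary of what was suppressed in
 * the previous window.
 */
#define say_ratelimited(level, ...) do {                                \
        static struct ratelimit rl = { .interval = 5, .burst = 10 };    \
        int suppressed = 0;                                             \
        if (ratelimit_check(&rl, &suppressed))                          \
                say(level, __VA_ARGS__);                                \
        if (suppressed > 0)                                             \
                say(level, "%d messages suppressed", suppressed);       \
} while (0)

Because the macro expands separately at every call site, each warning gets its own independent rate limit state, which matches the per-message-kind limit described above.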
locker added a commit that referenced this issue Dec 12, 2018
We will use it to limit the rate of log messages.

Needed for #2218
@locker locker closed this as completed in e6ebd5e Dec 12, 2018
locker added a commit that referenced this issue Dec 12, 2018
We will use it to limit the rate of log messages.

Needed for #2218

(cherry picked from commit c1198e2)
locker added a commit that referenced this issue Dec 12, 2018
Same commit message as the one above ("There are a few warning messages that can easily flood the log ..."); closes #2218.

(cherry picked from commit e6ebd5e)
@kyukhin kyukhin added the tmp label Mar 27, 2019