In trying to debug problems this week, I have found that both log01 and the server itself do not have all the logs that should have been reported. In most cases they were in the journal but had been dropped
Dec 02 00:25:10 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: begin to drop messages due to rate-limiting Dec 02 00:30:13 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: 25919 messages lost due to rate-limiting Dec 02 00:34:21 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: begin to drop messages due to rate-limiting Dec 02 00:40:14 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: 29598 messages lost due to rate-limiting Dec 02 00:45:03 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: begin to drop messages due to rate-limiting Dec 02 00:48:04 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: journal reloaded... [v8.24.0-57.el7_9 try http://www.rsyslog.com/e/0 ] Dec 02 00:48:04 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: journal reloaded... [v8.24.0-57.el7_9 try http://www.rsyslog.com/e/0 ]
We need to track down why the journal and rsyslogd are throttling each other.
https://access.redhat.com/solutions/1417483
may be helpful
Metadata Update from @smooge: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: medium-gain, medium-trouble, ops
Metadata Update from @smooge: - Issue untagged with: medium-trouble - Issue tagged with: high-trouble
Metadata Update from @smooge: - Issue untagged with: medium-gain - Issue tagged with: high-gain
In general we also should try and make our servers log less stuff. koji is particularly bad...
FYI, this only appears to be a small handfull of hosts:
119 notifs-backend02.iad2.fedoraproject.org 118 notifs-backend01.iad2.fedoraproject.org 118 bastion01.iad2.fedoraproject.org 10 oci-registry02.iad2.fedoraproject.org 9 oci-registry01.iad2.fedoraproject.org 4 pkgs01.iad2.fedoraproject.org 2 mailman01.iad2.fedoraproject.org 2 autosign01.iad2.fedoraproject.org 1 koji02.iad2.fedoraproject.org
We think this might be a good ticket to have someone research and see what our options are.
@austinpowered said he might be able to look
According to the Knowlegebase article referenced in the ticket opening comments:
If rsyslogd is reporting imjournal messages like: "imjournal: begin to drop messages due to rate-limiting", modify $imjournalRatelimitInterval and/or $imjournalRatelimitBurst in /etc/rsyslog.conf
imjournal: begin to drop messages due to rate-limiting
/etc/rsyslog.conf
From /usr/share/doc/rsyslog-doc-7.4.7/imjournal.html provided by the rsyslog-doc package:
/usr/share/doc/rsyslog-doc-7.4.7/imjournal.html
rsyslog-doc
$imjournalRatelimitInterval (legacy directive) equivalent to: ratelimit.interval seconds (default: 600) Specifies the interval in seconds onto which rate-limiting is to be applied. If more than ratelimit.burst messages are read during that interval, further messages up to the end of the interval are discarded. The number of messages discarded is emitted at the end of the interval (if there were any discards). Setting this to value zero turns off ratelimiting. Note that it is not recommended to turn of ratelimiting, except that you know for sure journal database entries will never be corrupted. Without ratelimiting, a corrupted systemd journal database may cause a kind of denial of service (we are stressing this point as multiple users have reported us such problems with the journal database - information current as of June 2013). $imjournalRatelimitBurst (legacy directive) equivalent to: ratelimit.burst messages (default: 20000) Specifies the maximum number of messages that can be emitted within the ratelimit.interval interval. For futher information, see description there. Lowering the interval from the 600 second default (to e.g., 300 secs) will allow more messages through Raising the burst from the default of 20000 (to e.g., 30000) will allow more messages through As with systemd-journald, disabling rate-limiting altogether is not recommended. Note the following: Warning: Some versions of journald have problems with database corruption, which leads to the journal returning the same data endlessly in a tight loop. This can result in a large number of log messages getting duplicated inside the logs managed by rsyslog. It can also lead to denial-of-service if this results in 100% CPU or memory usage. There are two ways to specify these module directives, however you can use only one of the configuration syntax. If you chose to use new-style, make sure to comment options related to legacy syntax. Assuming rsyslog.conf hasn't been modified, simply use the legacy configuration directives Example excerpt: # File to store the position in the journal $IMJournalStateFile imjournal.state $imjournalRatelimitInterval 300 $imjournalRatelimitBurst 30000 Otherwise if rsyslog.conf has been modified to use new-style module-loading syntax, well, stick with that Example excerpt: module(load="imjournal" StateFile="/var/lib/rsyslog/imjournal.state" ratelimit.interval="300" ratelimit.burst="30000")
$imjournalRatelimitInterval (legacy directive) equivalent to: ratelimit.interval seconds (default: 600) Specifies the interval in seconds onto which rate-limiting is to be applied. If more than ratelimit.burst messages are read during that interval, further messages up to the end of the interval are discarded. The number of messages discarded is emitted at the end of the interval (if there were any discards). Setting this to value zero turns off ratelimiting. Note that it is not recommended to turn of ratelimiting, except that you know for sure journal database entries will never be corrupted. Without ratelimiting, a corrupted systemd journal database may cause a kind of denial of service (we are stressing this point as multiple users have reported us such problems with the journal database - information current as of June 2013).
ratelimit.interval
$imjournalRatelimitBurst (legacy directive) equivalent to: ratelimit.burst messages (default: 20000) Specifies the maximum number of messages that can be emitted within the ratelimit.interval interval. For futher information, see description there.
ratelimit.burst
Lowering the interval from the 600 second default (to e.g., 300 secs) will allow more messages through
Raising the burst from the default of 20000 (to e.g., 30000) will allow more messages through
As with systemd-journald, disabling rate-limiting altogether is not recommended. Note the following: Warning: Some versions of journald have problems with database corruption, which leads to the journal returning the same data endlessly in a tight loop. This can result in a large number of log messages getting duplicated inside the logs managed by rsyslog. It can also lead to denial-of-service if this results in 100% CPU or memory usage.
systemd-journald
journald
rsyslog
There are two ways to specify these module directives, however you can use only one of the configuration syntax. If you chose to use new-style, make sure to comment options related to legacy syntax.
Assuming rsyslog.conf hasn't been modified, simply use the legacy configuration directives Example excerpt:
rsyslog.conf
# File to store the position in the journal $IMJournalStateFile imjournal.state $imjournalRatelimitInterval 300 $imjournalRatelimitBurst 30000
# File to store the position in the journal
$IMJournalStateFile imjournal.state
$imjournalRatelimitInterval 300
$imjournalRatelimitBurst 30000
Otherwise if rsyslog.conf has been modified to use new-style module-loading syntax, well, stick with that Example excerpt:
module(load="imjournal" StateFile="/var/lib/rsyslog/imjournal.state" ratelimit.interval="300" ratelimit.burst="30000")
TL/DR Given the guidance above, perhaps editing the appropriate roles/base/files/rsyslog/rsyslog.conf.* files in the ansible repo to adjust the rate-limiting defaults will help with this issue.
roles/base/files/rsyslog/rsyslog.conf.*
If you would like me to proceed along this path, let me know and I can prepare a PR for review.
Thoughts?
Thanks for diging into this!
Yes, please do propose a PR
Just posted fedora-infra/ansible PR#873
Note: The previous comment includes a link to the closed PR#873 on the fedora-infrastructure repo.
The proposed solution is a PR on the fedora-infra/ansible repo.
Merged PR and rolled it out. Thanks!
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
Log in to comment on this ticket.