#9512 various Fedora servers are not logging correctly.
Closed: Fixed 3 years ago by kevin. Opened 4 years ago by smooge.

Describe what you would like us to do:


While trying to debug problems this week, I have found that neither log01 nor the servers themselves have all the logs that should have been recorded. In most cases the missing messages were in the journal but had been dropped by rsyslogd's rate-limiting:

Dec 02 00:25:10 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: begin to drop messages due to rate-limiting
Dec 02 00:30:13 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: 25919 messages lost due to rate-limiting
Dec 02 00:34:21 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: begin to drop messages due to rate-limiting
Dec 02 00:40:14 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: 29598 messages lost due to rate-limiting
Dec 02 00:45:03 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: begin to drop messages due to rate-limiting
Dec 02 00:48:04 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: journal reloaded... [v8.24.0-57.el7_9 try http://www.rsyslog.com/e/0 ]
Dec 02 00:48:04 mailman01.iad2.fedoraproject.org rsyslogd[1016]: imjournal: journal reloaded... [v8.24.0-57.el7_9 try http://www.rsyslog.com/e/0 ]

When do you need this to be done by? (YYYY/MM/DD)


We need to track down why the journal and rsyslogd are throttling each other.

This Red Hat Knowledgebase article may be helpful:

https://access.redhat.com/solutions/1417483
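
To see whether a given host is currently hitting this limit, something along these lines can be run on the host. This is only a rough sketch: the rsyslogd identifier matches what shows up in the journal excerpt above, and the 24-hour window is an arbitrary choice.

    #!/usr/bin/env python3
    """Report recent imjournal rate-limiting messages from the local journal."""
    import subprocess

    # Pull the last day of messages logged by rsyslogd itself.
    out = subprocess.run(
        ["journalctl", "-t", "rsyslogd", "--since=-24h", "--no-pager", "-o", "short-iso"],
        capture_output=True, text=True, check=True,
    ).stdout

    hits = [line for line in out.splitlines()
            if "imjournal" in line and "rate-limiting" in line]
    print(f"{len(hits)} rate-limiting messages from rsyslogd in the last 24 hours")
    for line in hits:
        print(line)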


Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

4 years ago

Metadata Update from @smooge:
- Issue untagged with: medium-trouble
- Issue tagged with: high-trouble

4 years ago

Metadata Update from @smooge:
- Issue untagged with: medium-gain
- Issue tagged with: high-gain

4 years ago

In general we should also try to make our servers log less stuff. koji is particularly bad...
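
One quick way to see which programs are the noisiest on a host is to tally journal entries per syslog identifier. A rough sketch (the 24-hour window and the top-20 cutoff are arbitrary choices):

    #!/usr/bin/env python3
    """Tally journal entries per syslog identifier to spot the noisiest loggers."""
    import json
    import subprocess
    from collections import Counter

    # Export the last day of the local journal as one JSON object per line.
    proc = subprocess.run(
        ["journalctl", "--since=-24h", "-o", "json", "--no-pager"],
        capture_output=True, text=True, check=True,
    )

    counts = Counter()
    for line in proc.stdout.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        counts[entry.get("SYSLOG_IDENTIFIER") or entry.get("_COMM", "unknown")] += 1

    for ident, n in counts.most_common(20):
        print(f"{n:8d} {ident}")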

FYI, this only appears to affect a small handful of hosts:

    119 notifs-backend02.iad2.fedoraproject.org
    118 notifs-backend01.iad2.fedoraproject.org
    118 bastion01.iad2.fedoraproject.org
     10 oci-registry02.iad2.fedoraproject.org
      9 oci-registry01.iad2.fedoraproject.org
      4 pkgs01.iad2.fedoraproject.org
      2 mailman01.iad2.fedoraproject.org
      2 autosign01.iad2.fedoraproject.org
      1 koji02.iad2.fedoraproject.org

We think this might be a good ticket to have someone research and see what our options are.

@austinpowered said he might be able to look

According to the Knowledgebase article referenced in the ticket's opening comment:

If rsyslogd is reporting imjournal messages like: "imjournal: begin to drop messages due to rate-limiting", modify $imjournalRatelimitInterval and/or $imjournalRatelimitBurst in /etc/rsyslog.conf

From /usr/share/doc/rsyslog-doc-7.4.7/imjournal.html provided by the rsyslog-doc package:

$imjournalRatelimitInterval (legacy directive) equivalent to:
ratelimit.interval seconds (default: 600)
Specifies the interval in seconds onto which rate-limiting is to be applied. If more than
ratelimit.burst messages are read during that interval, further messages up to the end of
the interval are discarded. The number of messages discarded is emitted at the end of the
interval (if there were any discards). Setting this to value zero turns off ratelimiting.
Note that it is not recommended to turn off ratelimiting, except that you know for sure
journal database entries will never be corrupted. Without ratelimiting, a corrupted systemd
journal database may cause a kind of denial of service (we are stressing this point as
multiple users have reported us such problems with the journal database - information
current as of June 2013).

$imjournalRatelimitBurst (legacy directive) equivalent to:
ratelimit.burst messages (default: 20000)
Specifies the maximum number of messages that can be emitted within the ratelimit.interval
interval. For further information, see description there.

  • Lowering the interval from the 600-second default (to, e.g., 300 seconds) will allow more messages through for a given burst size

  • Raising the burst from the default of 20000 (to, e.g., 30000) will allow more messages through per interval. Taken together, these two changes raise the effective ceiling from roughly 33 messages/second (20000/600) to 100 messages/second (30000/300)

  • As with systemd-journald, disabling rate-limiting altogether is not recommended. Note the following:
    Warning: Some versions of journald have problems with database corruption, which leads to the journal returning the same data endlessly in a tight loop. This can result in a large number of log messages getting duplicated inside the logs managed by rsyslog. It can also lead to denial-of-service if this results in 100% CPU or memory usage.

  • There are two ways to specify these module directives; however, only one of the two configuration syntaxes can be used at a time. If you choose the new-style syntax, make sure to comment out the options written in the legacy syntax.

    • Assuming rsyslog.conf hasn't been modified, simply use the legacy configuration directives
      Example excerpt:

      # File to store the position in the journal
      $IMJournalStateFile imjournal.state
      $imjournalRatelimitInterval 300
      $imjournalRatelimitBurst 30000

    • Otherwise, if rsyslog.conf has been modified to use the new-style module-loading syntax, stick with that
      Example excerpt:

      module(load="imjournal" StateFile="/var/lib/rsyslog/imjournal.state" ratelimit.interval="300" ratelimit.burst="30000")

TL;DR
Given the guidance above, perhaps editing the appropriate roles/base/files/rsyslog/rsyslog.conf.* files in the ansible repo to adjust the rate-limiting defaults will help with this issue.

If you would like me to proceed along this path, let me know and I can prepare a PR for review.

Thoughts?

Thanks for digging into this!

Yes, please do propose a PR

Note: The previous comment includes a link to the closed PR#873 on the fedora-infrastructure repo.

The proposed solution is a PR on the fedora-infra/ansible repo.

Merged the PR and rolled it out. Thanks!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago
