Went to see why fedoraplanet was not updating and found that no logs had been written to /var/log/ since reboot. rsyslogd was dying on reboot due to too many open files. After much searching, found that the cause was the number of user journals and other files it was trying to open.
rsyslogd: imjournal: rename() failed for new path: '/var/lib/rsyslog/imjournal.state': No such file or directory [v8.24.0-52.el7_8.2 try http://www.rsyslog.com/e/0 ]
rsyslogd: imjournal: rename() failed for new path: '/var/lib/rsyslog/imjournal.state': No such file or directory [v8.24.0-52.el7_8.2 try http://www.rsyslog.com/e/0 ]
rsyslogd: imjournal: rename() failed for new path: '/var/lib/rsyslog/imjournal.state': No such file or directory [v8.24.0-52.el7_8.2 try http://www.rsyslog.com/e/0 ]
rsyslogd: imjournal: rename() failed for new path: '/var/lib/rsyslog/imjournal.state': No such file or directory [v8.24.0-52.el7_8.2 try http://www.rsyslog.com/e/0 ]
rsyslogd: imjournal: rename() failed for new path: '/var/lib/rsyslog/imjournal.state': No such file or directory [v8.24.0-52.el7_8.2 try http://www.rsyslog.com/e/0 ]

# audit2allow -a
#============= syslogd_t ==============
allow syslogd_t var_run_t:file { read unlink };
Ran restorecon and checked other variables, but that did not 'fix' anything. ls -lZ showed that all files have the same, appropriate SELinux context.
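For anyone wanting to confirm the "too many open files" diagnosis on a similar host, here is a minimal sketch that compares a process's open file descriptor count with its soft limit. It assumes a Linux /proc filesystem and enough privileges to read another process's entries; the function names are hypothetical and not part of any existing tooling.

```python
import os


def parse_soft_nofile_limit(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None when the limit is reported as 'unlimited'.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Fields after the label are: soft limit, hard limit, units.
            soft = line[len(label):].split()[0]
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid):
    """Return (open_fd_count, soft_limit) for a pid. Assumes Linux /proc."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as handle:
        soft_limit = parse_soft_nofile_limit(handle.read())
    return open_fds, soft_limit
```

Calling fd_usage() with the rsyslogd pid (as root) would show whether the open descriptor count is close to the soft limit.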
@smooge could you give more context about 'monitor rsyslogd is running on all hosts'? Did you mean checking the status of the daemon? I have restricted access to people02, so I cannot tell how rsyslogd behaved during the incident.
Metadata Update from @mohanboddu: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: medium-gain, medium-trouble, ops
@smooge did you see my last comment?
Sorry, I am on vacation for a while. I am just looking for a status check that rsyslogd is running on the system.
So, I think we want something like the check_varnish_proc but a check_rsyslog_proc
And then we want it to run on all machines, so look at how say the 'mail_queue' check is set...
OK, I see. I will work on it and open a PR.
@smooge @kevin which threshold ranges would you like to use? The same as for the varnishd daemon?
Same as we use for varnishd I think:
-c 1:2
i.e., if it's not running, it's critical...
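To spell out what a critical range of 1:2 means, here is a rough sketch of the usual Nagios-style range rule: the check alerts when the process count falls outside min:max, inclusive. This is a simplified illustration, not the plugin's actual code; it only handles the 'min:max' and 'min:' forms, and the function names are made up for the example.

```python
def outside_range(value, spec):
    """Return True when value is outside an inclusive 'min:max' range.

    Simplified: an empty min is treated as 0, an empty max as infinity.
    """
    low_text, high_text = spec.split(":")
    low = float(low_text) if low_text else 0.0
    high = float(high_text) if high_text else float("inf")
    return value < low or value > high


def check_proc_count(count, critical="1:2"):
    """Return a (status_code, message) pair; 0 is OK, 2 is CRITICAL."""
    if outside_range(count, critical):
        return 2, f"PROCS CRITICAL: {count} processes"
    return 0, f"PROCS OK: {count} processes"
```

With -c 1:2, zero rsyslogd processes is critical, one or two is OK, and three or more is critical again.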
This PR was merged and I rolled it out. There's a few hosts to fix for various reasons, but overall it's working and good!
Thanks for the PR!
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)