The container registry is listed as having Important SLE, yet one of our registries was down for about 11 hours (see #9230 for details) and we didn't get any Nagios notification about the issue. Monitoring should be improved so that we are notified about this kind of issue sooner.
I would like to work on it.
Metadata Update from @smooge: - Issue assigned to nasirhm - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: medium-gain, medium-trouble, ops
@mizdebsk I think we can get a notification when a systemd-monitored service enters the failed state if we add OnFailure= to the unit. For more details about the option, see https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Specifiers
We have an existing Nagios setup that would be trivial to configure to cover the OCI registry, for example by adding checks for:
- presence of the word "fedora" under https://registry.fedoraproject.org/v2/_catalog (covers the v2 API)
- presence of the word "rawhide" under https://registry.fedoraproject.org/repo/fedora/tags/ (covers the web interface)
Hint: the relevant file in ansible.git is roles/nagios_server/templates/nagios/services/websites.cfg.j2
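For illustration only, the two content checks suggested above can be sketched as a small standalone script. This is not the change that ends up in websites.cfg.j2 and not how the existing Nagios checks are implemented; the names `CHECKS`, `fetch`, `run_checks` and `main` are hypothetical. The only convention borrowed from Nagios is the plugin exit status (0 for OK, 2 for CRITICAL).

```python
from typing import Callable, List, Sequence, Tuple
from urllib.request import urlopen

# (URL, expected word) pairs taken from the comment above.
CHECKS: List[Tuple[str, str]] = [
    ("https://registry.fedoraproject.org/v2/_catalog", "fedora"),
    ("https://registry.fedoraproject.org/repo/fedora/tags/", "rawhide"),
]


def fetch(url: str, timeout: float = 10.0) -> str:
    """Return the response body as text; raises on network or HTTP errors."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def run_checks(
    checks: Sequence[Tuple[str, str]],
    fetcher: Callable[[str], str] = fetch,
) -> List[str]:
    """Run every check and return failure messages (empty list means all passed)."""
    failures: List[str] = []
    for url, word in checks:
        try:
            body = fetcher(url)
        except Exception as exc:  # any fetch error counts as a failed check
            failures.append(f"{url}: request failed ({exc})")
            continue
        if word not in body:
            failures.append(f"{url}: expected word {word!r} not found")
    return failures


def main(fetcher: Callable[[str], str] = fetch) -> int:
    """Print a one-line status and return a Nagios-style exit code."""
    failures = run_checks(CHECKS, fetcher)
    if failures:
        print("CRITICAL: " + "; ".join(failures))
        return 2
    print("OK: registry endpoints returned the expected content")
    return 0
```

To use it as a script, one would end the file with `raise SystemExit(main())`. The fetcher is passed in as a parameter so the logic can be exercised without touching the network.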
Thank you very much @mizdebsk and @seddik for the pointers. I will work on it after work today.
Take your time. We are in beta freeze, so changes to monitoring will need to wait until the freeze ends, or follow the FBR SOP.
Hi,
Could you give an update?
I can work on this if needed.
Monitoring for container registry is still needed, patches are welcome. Please let me know if you need any help with implementing this.
Related PR fedora-infra/ansible#321 has been merged.
The change has been deployed and can be seen e.g. here and here. Thank you for your contribution. This issue is resolved.
Metadata Update from @mizdebsk: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)