fedora-infrastructure

#10671 Alert admins of the openshift apps when the pod crashes

Closed: Fixed a year ago by kevin. Opened 2 years ago by zlopez.

Describe what you would like us to do:

When a pod in OpenShift crashes, there is currently no way for the maintainer of that OpenShift project to know (except for crashloops). It would be nice to have an e-mail sent (ideally with logs) to the maintainer each time the pod crashes and is restarted. For crashloops it should be enough to get just a summary with the log from the first crash.

When do you need this to be done by? (YYYY/MM/DD)

Not urgent, just a nice thing to have for openshift apps.

kevin commented 2 years ago

CC: @dkirwan

Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 years ago

dkirwan commented 2 years ago

It's possible to determine if a pod has crashed/restarted a number of times within a short period etc. We can create PrometheusRule objects which track such things. We can target specific triggers using the severity field [1].

Alertmanager within the cluster can be configured to do something custom when an alert fires with a custom severity, e.g. the trigger can be configured to send email to an email list containing the admins of the openshift-app etc.

Let's not attach logs to emails. The alert emails can link receivers back to the OpenShift web console, where they must first authenticate before being able to read the logs etc.

[1] https://github.com/davidkirwan/asset_monitoring/blob/master/kubernetes/prometheusrules.yaml#L19
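
For illustration, a rule of the kind described above could look roughly like the sketch below. It is not the rule linked in [1]: the namespace `myapp`, the alert name, the 15-minute window, the severity value and the use of the kube-state-metrics counter `kube_pod_container_status_restarts_total` are all assumptions. The manifest is built as a Python dict and printed as JSON, which `oc apply -f -` accepts as well as YAML.

```python
import json


def pod_restart_rule(namespace: str, severity: str, window: str = "15m") -> dict:
    """Build a PrometheusRule manifest that fires when a container restarts.

    The custom `severity` label is what Alertmanager can later route on.
    """
    # Assumed metric: the restart counter exported by kube-state-metrics.
    expr = (
        "increase(kube_pod_container_status_restarts_total"
        f'{{namespace="{namespace}"}}[{window}]) > 0'
    )
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": f"{namespace}-pod-restarts", "namespace": namespace},
        "spec": {
            "groups": [
                {
                    "name": f"{namespace}.rules",
                    "rules": [
                        {
                            "alert": "PodRestarted",  # hypothetical alert name
                            "expr": expr,
                            "labels": {"severity": severity},
                            "annotations": {
                                # Left as a Prometheus template, not a Python one.
                                "summary": "Pod {{ $labels.pod }} restarted in "
                                f"{namespace}",
                            },
                        }
                    ],
                }
            ]
        },
    }


# Hypothetical app and severity names, for illustration only.
rule = pod_restart_rule("myapp", severity="myapp-owners")
print(json.dumps(rule, indent=2))
```

Using a severity value that is unique per app is what lets a later Alertmanager route pick out only that app's alerts.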

zlopez commented 2 years ago

This sounds like something that could be done. Do we have a mailing list that contains the owners of openshift-apps? Could we instead use an e-mail address configured somewhere in the OpenShift project directly?

A link instead of attached logs sounds fine. How long until the link expires and the log becomes hard to find? For a restarting pod you usually don't have access to old logs, just the current and previous run, which doesn't help :-/

dkirwan commented 2 years ago

> This sounds like something that could be done. Do we have a mailing list that contains the owners of openshift-apps? Could we instead use an e-mail address configured somewhere in the OpenShift project directly?

Yep, I think we have an easy way of configuring app owner groups to have an email list. E.g. we currently use this feature for the openshift-sysadmin group to receive all alerts on the clusters.

> A link instead of attached logs sounds fine. How long until the link expires and the log becomes hard to find? For a restarting pod you usually don't have access to old logs, just the current and previous run, which doesn't help :-/

The link I'm thinking of would just point to the project. Users with permission to log in and access the pods would know from the original alert which pod is having trouble, and should then be able to retrieve the logs from the crashing pods [1]. As for how long the logs stick around and whether we might lose the original error: yep, that could and does happen! There are many solutions we could look at here, depending on how many resources we want to throw at it.

[1] https://docs.openshift.com/container-platform/4.7/support/troubleshooting/investigating-pod-issues.html

kevin commented 2 years ago

I think another list might be overkill here. OpenShift knows the app owners by username, couldn't the alert just be sent to all those usernames @fedoraproject.org?

dkirwan commented 2 years ago

> I think another list might be overkill here. OpenShift knows the app owners by username, couldn't the alert just be sent to all those usernames @fedoraproject.org?

I'll have to double check if it's possible to send an alert to multiple email addresses at the same time, might be possible!

zlopez commented 2 years ago

@dkirwan Did you have time to check this?

seddik commented 2 years ago

I think we have a way to do this: sending the same alert to multiple receivers through Prometheus.
@dkirwan @kevin right?

dkirwan commented a year ago

Sorry folks, I was on PTO and missed all these pings. Yep, it's definitely possible to set up multiple receivers, and it seems to be possible to put multiple email addresses in one receiver too [1].

[1] https://access.redhat.com/solutions/5324191
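
As a rough sketch of the "multiple addresses in one receiver" option combined with the earlier usernames@fedoraproject.org idea: the helper below turns a list of owner usernames into a single Alertmanager-style receiver. The function name, the `<app>-owners` naming scheme and the example usernames are hypothetical, and it assumes the `to` field of `email_configs` takes a comma-separated address list, as [1] suggests.

```python
def owner_receiver(app: str, usernames: list[str],
                   domain: str = "fedoraproject.org") -> dict:
    """Build one Alertmanager-style receiver that mails every owner of an app."""
    addresses = [f"{user}@{domain}" for user in usernames]
    return {
        "name": f"{app}-owners",  # hypothetical naming scheme
        # A single email_configs entry; `to` carries a comma-separated list.
        "email_configs": [{"to": ", ".join(addresses)}],
    }


# Example owners, for illustration only.
receiver = owner_receiver("myapp", ["kevin", "zlopez"])
print(receiver["email_configs"][0]["to"])
```

Deriving the addresses from the usernames OpenShift already knows would avoid maintaining a separate mailing list per app.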

dkirwan commented a year ago

Think you have the basic technical requirements figured out, but it would be good to get a POC in place and document the process, so that other app owners can then follow it to begin implementing.

Basically we then need:

- App owners/maintainers to create PrometheusRules objects in the namespace where their app runs. This might require app owners/maintainers to be given new permissions for CRUD operations on PrometheusRules within their namespace, not sure. Ideally they should do this alongside the ansible/role which deployed the app, so they have the required permissions.
- To specify a unique name in the severity field of the PrometheusRules [1].
- For each app, to create a receiver in prometheus/alertmanager which listens for the unique severity created previously, and add the list of emails which expect alerts.

[1] https://github.com/davidkirwan/asset_monitoring/blob/master/kubernetes/prometheusrules.yaml#L19
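
The pieces above fit together roughly as in this sketch: one route per app matches on that app's unique severity and hands the alert to the app's receiver, and everything else falls through to a default. The `match`/`receiver`/`email_configs` keys follow the Alertmanager config layout (newer releases prefer `matchers` over `match`), but the helper names, the app name and the `openshift-sysadmin` default receiver name are assumptions, and `pick_receiver` is only a simplified first-match lookup, not Alertmanager's actual routing logic.

```python
def build_routing(apps: dict[str, list[str]], default_receiver: str) -> dict:
    """Return a config fragment with one route and one receiver per app."""
    routes = []
    receivers = [{"name": default_receiver}]
    for app, emails in apps.items():
        name = f"{app}-owners"  # hypothetical naming scheme
        # The app name doubles as the unique severity value here.
        routes.append({"match": {"severity": app}, "receiver": name})
        receivers.append(
            {"name": name, "email_configs": [{"to": ", ".join(emails)}]}
        )
    return {
        "route": {"receiver": default_receiver, "routes": routes},
        "receivers": receivers,
    }


def pick_receiver(config: dict, labels: dict[str, str]) -> str:
    """Simplified first-match lookup, only to illustrate the idea."""
    for route in config["route"]["routes"]:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            return route["receiver"]
    return config["route"]["receiver"]


config = build_routing(
    {"myapp": ["kevin@fedoraproject.org", "zlopez@fedoraproject.org"]},
    default_receiver="openshift-sysadmin",
)
print(pick_receiver(config, {"severity": "myapp"}))
print(pick_receiver(config, {"severity": "critical"}))
```

Alerts carrying the app's severity go to that app's owners, while any other severity ends up with the default receiver.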

Edited a year ago by dkirwan

kevin commented a year ago

@darknao worked on this and came up with some simple alerting rules that we can generically apply to all applications. ;)

This should alert/send emails to appowners (or to alert_users if that is set).

It should alert on cronjobs failing, pods crashing, etc. (See ./roles/openshift/project/templates/prometheusRules.yml ).

I am going to mail the infra list a heads up about this and then run playbooks for at least the projects we manage.

Thanks @darknao!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a year ago

Metadata

Assignee

None

Tags

OpenShift, ops, medium-gain, medium-trouble

Blocking

None

Depending on

None

Priority

Waiting on Assignee

Boards 1

ops Status: Backlog

Related Pull Requests

#1316 Merged a year ago
#1299 Merged a year ago
#1296 Merged a year ago
