Issue #7588: OpenShift app monitoring with Nagios - fedora-infrastructure

Closed: Initiative Worthy 3 years ago by smooge. Opened 5 years ago by mizdebsk.

I would like to implement monitoring for OpenShift apps using Nagios. I know there are plans to replace Nagios with something else, but that hasn't happened yet and Nagios is already in place. For me this is a blocker for moving Koschei to OpenShift: I am not comfortable running production Koschei without monitoring that is integrated with our existing alert system (email/IRC notifications).

I would like to start with monitoring the number of pods. Nagios would check the number of pods matching a configured selector and compare it with a configured range of expected values. The result of the check would be defined as follows:

if the number of pods is within the expected range: OK
if the number of pods is equal to zero: CRITICAL
otherwise: WARNING

Example:

configured selector: pods in namespace "koschei" with label "service: frontend", in state "running"
configured expected number of pods: range from 2 to 3
0 matching pods -> CRITICAL
1 matching pod -> WARNING
2 to 3 matching pods -> OK
4 or more matching pods -> WARNING
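
The rule above can be sketched as a small function. This is only an illustration, assuming the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL); the name `pod_count_status` and its parameters are hypothetical and not taken from any existing plugin.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def pod_count_status(count, minimum, maximum):
    """Map a pod count to a Nagios status (hypothetical helper).

    The conditions are checked in the order given in the proposal:
    inside the expected range is OK, zero pods is CRITICAL,
    anything else is WARNING.
    """
    if minimum <= count <= maximum:
        return OK
    if count == 0:
        return CRITICAL
    return WARNING


# The example from the proposal: expected range from 2 to 3 pods.
names = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}
for n in range(5):
    print(n, "matching pod(s) ->", names[pod_count_status(n, 2, 3)])
```

A real plugin would print a one-line status message and exit with the returned code, since Nagios reads the result from the plugin's exit status.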

Implementation: a Nagios plugin, non-NRPE. A service account would be created for Nagios. The account would have minimal privileges, allowing it to list pods and nothing else. Credentials for the account would be stored on noc01 and noc02. The Nagios plugin would use the Kubernetes REST API to communicate with OpenShift. noc01 would talk directly to each of the masters using internal addresses/names. noc02 would talk to OpenShift over the public interface.
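
A rough sketch of how such a plugin could query the pod list, using only the Python standard library. It assumes the standard Kubernetes endpoint `GET /api/v1/namespaces/<namespace>/pods` with the `labelSelector` and `fieldSelector` query parameters, and bearer-token authentication with the service account token. The function names are hypothetical, and the API host and token would come from whatever is configured on noc01/noc02.

```python
import json
import ssl
import urllib.parse
import urllib.request


def build_pod_list_url(api_base, namespace, label_selector, phase="Running"):
    """Build the Kubernetes API URL that lists pods matching the selector."""
    query = urllib.parse.urlencode({
        "labelSelector": label_selector,
        "fieldSelector": "status.phase=" + phase,
    })
    return "{}/api/v1/namespaces/{}/pods?{}".format(
        api_base.rstrip("/"), urllib.parse.quote(namespace, safe=""), query)


def fetch_pod_list(url, token, ca_file=None):
    """GET the pod list, authenticating with the service account token.

    ca_file is an optional CA bundle for verifying the API server
    certificate; by default the system trust store is used.
    """
    request = urllib.request.Request(url, headers={
        "Authorization": "Bearer " + token,
        "Accept": "application/json",
    })
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(request, context=context, timeout=10) as resp:
        return json.load(resp)


def count_pods(pod_list):
    """Count the pods in a PodList response (an empty list counts as 0)."""
    return len(pod_list.get("items") or [])
```

For the example above, `build_pod_list_url(api_base, "koschei", "service=frontend")` gives the URL to fetch, and `count_pods(fetch_pod_list(url, token))` gives the number to compare against the expected range. Note that filtering on `status.phase=Running` only checks the pod phase; it does not guarantee that the pods are ready to serve requests.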

What do you think about this idea?

smooge commented 5 years ago

This sounds like a good idea. The plugin I looked at was:

https://github.com/appuio/nagios-plugins-openshift

Another example was

https://github.com/jmferrer/nagios-openshift

kevin commented 5 years ago

Sounds good to me. Either a basic script or leveraging one of those plugins...

Metadata Update from @mizdebsk:
- Issue assigned to mizdebsk

5 years ago

mizdebsk commented 5 years ago

> This sounds like a good idea. The plugin I looked at was: https://github.com/appuio/nagios-plugins-openshift
> Another example was https://github.com/jmferrer/nagios-openshift

Of the two plugins above, I like nagios-plugins-openshift better. Its approach is almost the same as mine; one difference is that it uses the oc command to communicate with OpenShift, while I would use curl. If we want to use this plugin, I can try to package it and build it for epel7-infra (I don't want to maintain this package in EPEL 7 myself). Or I can write my own plugin and put it in ansible.git. We can talk about this during one of the future meetings.

Metadata Update from @mizdebsk:
- Issue priority set to: Waiting on Assignee (was: Next Meeting)

5 years ago

mizdebsk commented 5 years ago

Nagios is frozen. I'll try to work on this ticket after final freeze (F30 GA).

Metadata Update from @mizdebsk:
- Issue tagged with: unfreeze

5 years ago

mizdebsk commented 5 years ago

Update: the freeze is over now, I am planning to work on this issue some time next week.

Metadata Update from @mizdebsk:
- Issue untagged with: unfreeze

5 years ago

mizdebsk commented 5 years ago

Currently I don't have time to work on this due to other priorities and an upcoming vacation. Lack of monitoring is still blocking Koschei from moving to OpenShift, so I would still like this feature to be implemented, but it will need to wait a few months unless someone else wants to work on it.

cverna commented 4 years ago

@mizdebsk I believe you have done that for Koschei. Is there a short "how to" for doing the same for other applications?

Metadata Update from @cverna:
- Assignee reset

4 years ago

smooge commented 3 years ago

Going to close this, as we aren't moving on it and it should be rolled into the monitoring initiative.

Metadata Update from @smooge:
- Issue close_status updated to: Initiative Worthy
- Issue status updated to: Closed (was: Open)

3 years ago

Metadata

Assignee: None
Tags: monitoring
Blocking: None
Depending on: None
Priority: Waiting on Assignee