As an ARC team initiative we want to investigate Prometheus and Zabbix as our new monitoring and metrics solutions.
In the process we want to be able to answer the questions posed in the latest mailing list thread, and by the end have a setup that can lead directly into migrating us away from Nagios. The questions (mostly from Kevin):
How can we provision both of them automatically from Ansible? Ideally, when we add a new host we just run something and it configures the needed places.
Can we get Zabbix to pull from Prometheus? It might be nice if at least all the alerting could be in Zabbix.
Can Zabbix handle our number of machines? A long time ago, when we tried to deploy Zabbix, it couldn't keep up. So perhaps some kind of load testing, or adding all the builders to it?
How flexible is the alerting? I think we may want to revisit things from our current Nagios setup. It has some good things (alerts always happen on IRC first, so if someone sees one they can look) and some bad things (alerts get acked and the problem gets worse, checks get disabled and never turned back on, etc.). It would be nice to divvy things up into at least a few broad SLEs, so that mirrorlists being down would wake the world, but badges would just send email/IRC until someone looked.
Can Zabbix/Prometheus cover any of our metrics needs?
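The tiering idea in the alerting question could be sketched roughly as below. This is only an illustration in Python, not how Zabbix or Prometheus would be configured: the tier names, the channel names and the `channels_for` helper are all made up for the example, and only the mirrorlists/badges split comes from the question itself.

```python
# Hypothetical tiers and channels, for illustration only.
# IRC is listed first in every tier, matching "alerts always happen on IRC first".
TIER_CHANNELS = {
    "critical": ["irc", "email", "page"],  # wakes the world
    "low": ["irc", "email"],               # waits until someone looks
}

# Hypothetical mapping of services to tiers.
SERVICE_TIERS = {
    "mirrorlists": "critical",
    "badges": "low",
}

# Assumption: services without an explicit tier never page anyone.
DEFAULT_TIER = "low"


def channels_for(service):
    """Return the ordered list of notification channels for a service."""
    tier = SERVICE_TIERS.get(service, DEFAULT_TIER)
    return list(TIER_CHANNELS[tier])


if __name__ == "__main__":
    for name in ("mirrorlists", "badges", "some-new-app"):
        print(name, "->", ", ".join(channels_for(name)))
```

Keeping the service-to-tier mapping as plain data means it could live next to the host inventory and be reviewed like any other change, whichever tool ends up sending the alerts.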
To do this we will need:
to be able to install the operator that will install/configure Prometheus, which requires configuring clusterroles and clusterrolebindings
Can this be done via ansible?
The idea being the usual: if openshift's server goes down, we get new hardware, run the playbook and everything comes back to life.
Yes, the installation of the Prometheus operator is automated already, and porting the shell script to Ansible shouldn't be hard.
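For reference, a minimal sketch of what the clusterrole/clusterrolebinding pair for a Prometheus service account could look like, generated from Python as JSON that `oc apply -f -` can read. The names `prometheus-demo`, `monitoring` and `prometheus` are placeholders, and the rule list is the common read-only set used for service discovery; the permissions the operator itself needs may well differ, so treat this as an assumption rather than what the existing script does.

```python
import json


def rbac_manifests(name, namespace, service_account):
    """Build a ClusterRole and a ClusterRoleBinding as plain dicts.

    All arguments are placeholders chosen by the caller.
    """
    cluster_role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {
                # Assumed read-only access for target discovery.
                "apiGroups": [""],
                "resources": ["nodes", "services", "endpoints", "pods"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": name},
        # Grants the ClusterRole above to the given service account.
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": name,
        },
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": namespace,
            }
        ],
    }
    return [cluster_role, binding]


if __name__ == "__main__":
    items = rbac_manifests("prometheus-demo", "monitoring", "prometheus")
    # Wrap both objects in a v1 List so they can be applied in one go.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

The same two objects could be expressed directly as tasks in a playbook; the point here is only to show which fields have to line up (the `roleRef` name and the service account subject).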
Metadata Update from @smooge: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: medium-gain, medium-trouble, ops, request-for-resources
I was added to sysadmin-noc and have access to the correct VM. This can be closed (from my perspective).
Cool, thanks!
Metadata Update from @pingou: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)