fedora-infrastructure

#11393 Replace Nagios with Zabbix in Fedora Infrastructure

Opened 11 months ago by zlopez. Modified 2 months ago

Describe what you would like us to do:

Fedora Infrastructure currently uses Nagios to monitor services. We want to switch to Zabbix because it is better maintained than Nagios.

When do you need this to be done by? (YYYY/MM/DD)

No rush

Metadata Update from @zlopez:
- Issue assigned to dkirwan

11 months ago

dkirwan commented 11 months ago

Ansible role: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/zabbix/zabbix_server
Server deployed in staging for testing: https://zabbix.stg.fedoraproject.org/

Currently testing with the wiki01.stg.iad2.fedoraproject.org host, updating and debugging the zabbix_agentd.conf to get it working with the deployed server. Hoping to get this config added to the Ansible role: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/zabbix/zabbix_agent

smooge commented 11 months ago

@darknao and @aheath1992 had been looking at zabbix in the past. I think they are already talking with @dkirwan but wanted to mention them in the ticket.

Also :100:

seddik commented 10 months ago

@dkirwan we cannot log in with a FAS account yet?

dkirwan commented 10 months ago

Not yet! @seddik I'll look at getting that configured with accounts.stg asap though!

Metadata Update from @dkirwan:
- Issue unmarked as blocking: #11245

10 months ago

dkirwan commented 10 months ago

Debugged the zabbix_agentd configuration; clients will now auto-register and begin sharing system information with the server. Will get this into the zabbix_agent role.
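For anyone following along, a minimal sketch of what an active-agent configuration for auto-registration might look like, rendered from Python. `Server`, `ServerActive`, `HostnameItem` and `HostMetadata` are standard zabbix_agentd.conf parameters, but the values here (the server address and the `fedora-staging` metadata string) are assumptions for illustration, not the contents of the actual role. Auto-registration also needs a matching autoregistration action on the server side.

```python
def render_agent_conf(server: str, host_metadata: str) -> str:
    """Render a minimal zabbix_agentd.conf for active-agent auto-registration.

    The values passed in are illustrative assumptions, not the real role config.
    """
    lines = [
        f"Server={server}",                # address allowed to query the agent (passive checks)
        f"ServerActive={server}",          # address the agent reports to (active checks)
        "HostnameItem=system.hostname",    # let the agent derive its own host name
        f"HostMetadata={host_metadata}",   # string an autoregistration action can match on
    ]
    return "\n".join(lines) + "\n"


conf = render_agent_conf("zabbix.stg.fedoraproject.org", "fedora-staging")
print(conf)
```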

Metadata Update from @dkirwan:
- Issue marked as blocking: #11245

10 months ago

dkirwan commented 10 months ago

Updated the zabbix_agent role with the latest config. It's configured to auto-enrol instances.

Currently only configured on the following hosts in staging:

playbooks/groups/oci-registry.yml
playbooks/groups/wiki.yml

Edited 10 months ago by dkirwan

jsteffan commented 9 months ago

@dkirwan I'm available to help work on this. I've been looking for something that will get me up to speed on the FI systems/tools and provide some value. I have extensive experience with monitoring solutions (and Ansible), so I would only need guidance on how you'd like to see the rest of the work accomplished, what needs doing, and ensuring I've got the needed access to things. Right now, I only have sysadmin-devel level access. Please let me know if this is of interest to you.

dkirwan commented 9 months ago

Nice :D So what's blocking me at the moment: I'm trying to figure out how to get Zabbix hooked into our FAS system. It needs to use SAML [1]. I'm currently experimenting before testing out a configuration on the staging ipsilon.

Once this work is complete and we're able to authenticate members, the real work can begin: we need to start migrating everything from Nagios over to Zabbix. A good place to start is getting familiar with the nagios_server [2] and nagios_client [3] roles.

At the moment it might require a lot of deep diving and research: finding out how something is monitored currently, and how it might best be done in the Zabbix ecosystem. Once we have the answer we can make the change via Ansible and test it out in staging.

[1] SAML https://www.zabbix.com/documentation/6.0/en/manual/web_interface/frontend_sections/administration/authentication?hl=SAML
[2] nagios server: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/nagios_server
[3] nagios client: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/nagios_client

darknao commented 9 months ago

I can help with the SAML2 part if needed. Just let me know :)

dkirwan commented 9 months ago

@darknao yes please! ;D

darknao commented 9 months ago

Here is the config to set up on the Zabbix side:

IdP entity ID: https://id.fedoraproject.org/saml2/metadata
SSO service URL: https://id.fedoraproject.org/saml2/SSO/Redirect
SLO service URL: leave empty
Username attribute: username
SP entity ID: https://zabbix.stg.fedoraproject.org
SP name ID format: urn:oasis:names:tc:SAML:2.0:nameid-format:transient
Sign: untick all
Encrypt: untick all
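The same settings could in principle be applied through the Zabbix API instead of the web UI. The sketch below only builds the JSON-RPC request body and does not send it. The `saml_*` field names are my understanding of the 6.0 `authentication` object and should be checked against the API documentation for the deployed version; the `<api-token>` placeholder is hypothetical, and nothing in this ticket says the API was used here.

```python
import json

# Values copied from the list above; field names are assumed from the
# Zabbix 6.0 "authentication" API object and should be verified.
SAML_SETTINGS = {
    "saml_auth_enabled": 1,
    "saml_idp_entityid": "https://id.fedoraproject.org/saml2/metadata",
    "saml_sso_url": "https://id.fedoraproject.org/saml2/SSO/Redirect",
    "saml_slo_url": "",  # SLO service URL: leave empty
    "saml_username_attribute": "username",
    "saml_sp_entityid": "https://zabbix.stg.fedoraproject.org",
    "saml_nameid_format": "urn:oasis:names:tc:SAML:2.0:nameid-format:transient",
    # Sign: untick all
    "saml_sign_messages": 0,
    "saml_sign_assertions": 0,
    "saml_sign_authn_requests": 0,
    "saml_sign_logout_requests": 0,
    "saml_sign_logout_responses": 0,
    # Encrypt: untick all
    "saml_encrypt_nameid": 0,
    "saml_encrypt_assertions": 0,
}


def build_request(method: str, params: dict, auth_token: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request body in the shape the Zabbix 6.0 API expects."""
    return {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "auth": auth_token,  # hypothetical placeholder, not a real token
        "id": request_id,
    }


request = build_request("authentication.update", SAML_SETTINGS, "<api-token>")
print(json.dumps(request, indent=2))
```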

darknao commented 9 months ago

SAML2 configuration is complete.
Note that in 6.0, a user must already exist in Zabbix before they can log in with FAS.
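Since accounts have to exist up front, any sync job boils down to working out which members of a FAS group have no Zabbix account yet. A rough sketch of that step, using made-up usernames; how the group membership and the existing Zabbix users are fetched is left out and would depend on whatever approach is chosen.

```python
def accounts_to_create(group_members, zabbix_users):
    """Return group members that have no Zabbix account yet, sorted for stable output."""
    return sorted(set(group_members) - set(zabbix_users))


# Hypothetical data: members of a FAS group and usernames already present in Zabbix.
fas_group = ["alice", "bob", "carol"]
existing = ["bob", "Admin", "guest"]

missing = accounts_to_create(fas_group, existing)
print(missing)  # ['alice', 'carol']
```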

dkirwan commented 9 months ago

So awesome thanks for that @darknao

Ok, regarding the users needing to be created, I've a few ideas in mind for how to manage that. Let me write something up on discussion, including a report on the current state of Zabbix in staging, and then invite feedback from everyone.

dkirwan commented 8 months ago

Ok, we've rolled out the Zabbix agent to the majority of the staging instances. There are a few here and there where we can't easily meet dependencies, e.g. the rhel7 boxes, and some staging instances are not technically staging, or at least not iad2 based.

We also have 2 instances that are showing as inaccessible after the rollout, so we will troubleshoot these quickly before we start writing up the SOPs to cover what we have so far.

dkirwan commented 8 months ago

All instances are showing as accessible in staging now, and SOPs related to Zabbix have been added to the Fedora Infra sysadmin guide.

Staging work is complete. After code freeze, we should be ready to deploy a server in production, then begin the work of replacing Nagios monitoring service by service.

seddik commented 8 months ago

I would love to help on this, replacing Nagios services with Zabbix.

dkirwan commented 7 months ago

Just waiting until F39 full release and end of freeze before getting Zabbix running in production.

Currently debugging some issues showing in the staging instance related to network and disk load on the bvmhost machines in staging.

To make it easier for others to contribute, we also need to break the workload down into small pieces, and perhaps have a ticket for every service being migrated from Nagios.

Using releng as a prototype, as it's a green field situation: there is little to no monitoring currently in place. We'll implement Zabbix checks and document how we did it, so it can become a reference for others wanting to get involved and take on some of this work later.

Can follow along in the following ticket: https://pagure.io/fedora-infrastructure/issue/11577

dkirwan commented 6 months ago

With freeze over, started work on the creation of the Zabbix VMs for production. We've created the VM, now debugging the networking/TLS via Apache/HAProxy.

Once completed will deploy Zabbix using our playbook/roles already developed.

dkirwan commented 6 months ago

Prod instance deployed: https://zabbix.fedoraproject.org

Need to debug some issues with user sync for the sysadmin-noc group. In the meantime Guest access is also enabled if you want to log in for a look.

seddik commented 6 months ago

We still cannot connect with a FAS account? Is Guest access allowed for contributors?

dkirwan commented 6 months ago

@seddik currently members of the sysadmin-noc group have accounts created on the Zabbix server with elevated privileges and can then log in via FAS, but everyone may log in via the Guest user.

Will soon have a SOP that contributors can follow as a reference if they wish to contribute to adding Zabbix monitoring to the various services in Fedora Infra.

dkirwan commented 4 months ago

Just before Christmas, opened tickets with RH IT to open up the Zabbix server and agent ports between the networks.

Managed to get the rabbitmq production hosts to auto-enrol with the server. Still debugging some iptables rule issues.

dkirwan commented 4 months ago

All prod hosts now have agents running, but not all are accessible! A few remaining firewall issues to debug, on the releng hosts and ipsilon.

dkirwan commented 4 months ago

Just capturing some requirements @kevin listed on the releng ticket:

Awesome. :)

So, a few other things:

Can you nuke that vnet* rule? I see those still alerting... and as we figured in staging, due to whatever quirk, bridges show up as 10MB connections so it alerts on them all the time.

All the iad2 hosts are in, but we still have all the non iad2 ones. ;) There's several ways we could address those so we should discuss options. We could just connect to them over our vpn (all the external hosts should be on the vpn I think). We would need to add a vpn endpoint on the zabbix01 vm and then add all the other ones with '$name.vpn.fedoraproject.org'. Or I think zabbix has proxy / spoke things? We could look at setting up some new vms in each external dc and have those monitor the local machines and phone back to the hub. That sounds like it would be more work to me, but I'm not sure how those work entirely.

I think @ryanlerch found some matrix integration for zabbix. We should look at that and what it would take to set it up.

And there's likely a bunch of old one-off special nagios checks we need to consider. I am not sure how best to approach that. Perhaps we just make a list of all hosts, then go through them slowly over time one by one and see if there's any 'non standard' checks? Or we could try and untangle those out of the nagios ansible stuff. Ideas welcome on that. ;)

Thanks for all the work on this...

Oh, and thinking about it more, the external hosts we probably should just directly connect to. If we use the vpn, then all monitoring on all of those will depend on the vpn being up and working. Currently nagios directly connects to them, I am pretty sure. So, that's just adding the '$name.fedoraproject.org' hosts that nagios has (all the external ones not in iad2).
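To make the two addressing options concrete, here is a small sketch of how the address the server connects to would differ per host. The `proxy01` name is a made-up example, and the function is purely illustrative; it is not part of any existing role.

```python
def monitoring_address(name: str, via_vpn: bool = False) -> str:
    """Return the address the Zabbix server would use to reach an external host.

    via_vpn=True follows the '$name.vpn.fedoraproject.org' pattern,
    otherwise the direct '$name.fedoraproject.org' pattern is used.
    """
    domain = "vpn.fedoraproject.org" if via_vpn else "fedoraproject.org"
    return f"{name}.{domain}"


# "proxy01" is a hypothetical host name used only for illustration.
print(monitoring_address("proxy01"))                # direct connection
print(monitoring_address("proxy01", via_vpn=True))  # over the vpn
```

With the direct form, monitoring of external hosts does not depend on the vpn being up, which is the concern raised above.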

dkirwan commented 4 months ago

Regarding the vnet* rule, nuking it is tricky. The mechanism which autodiscovers interfaces and then applies the rule is pretty complex, but I'll keep looking at it! The way we configure this on the centos side is very different! I'd rather not just delete everything and implement it the same way, as we would lose a lot of other fancy things! For the moment I've reduced that particular alert's severity to Information, and @ryanlerch has already figured out how to prevent such alerts firing via the matrix bot :)

I'll have to figure out which iface type corresponds to these bridges, then I can probably modify the alert rule to ignore an interface if it matches.

Edited 4 months ago by dkirwan

dkirwan commented 2 months ago

Ok, got the vnet* rules sorted, no longer showing up. I found exactly where it can be configured in the base template applied to all the current hosts:

Linux by Zabbix agent active > macros >

{$NET.IF.IFNAME.MATCHES}
{$NET.IF.IFNAME.NOT_MATCHES}
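As a rough illustration of how these two macros act as a discovery filter: an interface is kept only if its name matches the first regex and does not match the second. The patterns below are assumptions chosen for the example, not the values actually set on the template, and Python's `re` stands in for the regex engine Zabbix uses.

```python
import re


def keep_interface(name: str, matches: str, not_matches: str) -> bool:
    """Mimic the discovery filter: name must match `matches` and must not match `not_matches`."""
    return re.search(matches, name) is not None and re.search(not_matches, name) is None


# Hypothetical macro values, for illustration only.
MATCHES = r"^.*$"                   # stand-in for {$NET.IF.IFNAME.MATCHES}
NOT_MATCHES = r"^(lo|vnet[0-9]+)$"  # stand-in for {$NET.IF.IFNAME.NOT_MATCHES}

interfaces = ["eth0", "lo", "vnet0", "vnet12", "br0"]
kept = [i for i in interfaces if keep_interface(i, MATCHES, NOT_MATCHES)]
print(kept)  # ['eth0', 'br0']
```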

I think I have the vpn configured on the zabbix01, but might be best to go with direct connections in that case. The zabbix agent should already be installed on the proxies outside iad2 too, so I shouldn't need a freeze break request thankfully. I'll see if I can add one and get it working from the server side with a direct connection.

Metadata

Assignee

dkirwan

Tags

high-gain high-trouble ops monitoring

Blocking

None

Depending on

None

Priority

Waiting on Assignee

Boards 1

ops Status: Backlog

Attachments 2

Screenshot_from_2023-07-04_10-17-36.png

Attached 10 months ago

Screenshot_2024-01-17_at_11.28.16.png

Attached 4 months ago