From d34d5e053ce6ff97de0468d0174702fc409cbef7 Mon Sep 17 00:00:00 2001
From: David Kirwan
Date: Jul 12 2024 13:41:16 +0000
Subject: bugzilla2fedmsg SOP updated
Signed-off-by: David Kirwan
---
diff --git a/modules/sysadmin_guide/pages/bugzilla2fedmsg.adoc b/modules/sysadmin_guide/pages/bugzilla2fedmsg.adoc
index 8091cda..ee72a81 100644
--- a/modules/sysadmin_guide/pages/bugzilla2fedmsg.adoc
+++ b/modules/sysadmin_guide/pages/bugzilla2fedmsg.adoc
@@ -12,60 +12,55 @@ Owner::
 Contact::
  #fedora-apps, #fedora-fedmsg, #fedora-admin, #fedora-noc
 Servers::
- bugzilla2fedmsg01
+ STG/PROD OpenShift Clusters
 Purpose::
  Rebroadcast bugzilla events on our bus.
 == Description
-bugzilla2fedmsg is a small service running as the 'moksha-hub' process
-which receives events from bugzilla via the RH "unified messagebus" and
-rebroadcasts them to our fedmsg bus.
+bugzilla2fedmsg is a small service running as a container in OpenShift, in the `bugzilla2fedmsg` project. It receives events from bugzilla via the RH "unified messagebus" and rebroadcasts them to our fedmsg bus.
-[NOTE]
-====
-Unlike _all_ of our other fedmsg services, this one runs as the
-'moksha-hub' process and not as the 'fedmsg-hub'.
-====
+== Resources
-The bugzilla2fedmsg package provides a plugin to the moksha-hub that
-connects out over the STOMP protocol to a 'fabric' of JBOSS activemq
-FUSE brokers living in the Red Hat DMZ. We authenticate with a cert/key
-pair that is kept in _/etc/pki/fedmsg/_. Those brokers should push
-bugzilla events over STOMP to our moksha-hub daemon. When a message
-arrives, we query bugzilla about the change to get some 'more
-interesting' data to stuff in our payload, then we sign the message
-using a fedmsg cert and fire it off to the rest of our bus.
+- [1] Ansible Playbook: https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/bugzilla2fedmsg.yml
+- [2] Ansible Role: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/openshift-apps/bugzilla2fedmsg
+- [3] Code: https://github.com/fedora-infra/bugzilla2fedmsg
-This service has no database, no memcached usage. It depends on those
-STOMP brokers and being able to query bugzilla.rh.com.
-
-== Relevant Files
+== Useful Commands
-All managed by ansible, of course:
+To look at logs, first authenticate with OpenShift. Log in to the web console and retrieve a token: at the top right of the web console, click `copy login command`.
+For example:
 ....
-STOMP config: /etc/moksha/production.ini
-fedmsg config: /etc/fedmsg.d/
-certs: /etc/pki/fedmsg
-code: /usr/lib/python2.7/site-packages/bugzilla2fedmsg.py
-....
+# Log in with the token
+oc login --token=sha256~_XXXXXXXXXXX --server=https://api.ocp.stg.fedoraproject.org:6443
-== Useful Commands
+# Switch to the bugzilla2fedmsg project
+oc project bugzilla2fedmsg
+Now using project "bugzilla2fedmsg" on server "https://api.ocp.stg.fedoraproject.org:6443".
-To look at logs, run:
+# Retrieve a list of pods running in the project
+oc get pods
+NAME                       READY   STATUS    RESTARTS   AGE
+bugzilla2fedmsg-32-58px2   1/1     Running   0          43h
+# Retrieve the logs from the bugzilla2fedmsg-32-58px2 pod
+oc logs -f bugzilla2fedmsg-32-58px2
 ....
-$ journalctl -u moksha-hub -f
-....
-
+
 To restart the service, run:
 ....
-$ systemctl restart moksha-hub
+# List the deploymentconfigs in the bugzilla2fedmsg project
+oc get dc
+NAME              REVISION   DESIRED   CURRENT   TRIGGERED BY
+bugzilla2fedmsg   32         1         1         config,image(bugzilla2fedmsg:latest)
+
+# Start a new rollout of the deploymentconfig
+oc rollout latest dc/bugzilla2fedmsg
 ....
 == Internal Contacts
-If we need to contact someone from the RH internal "unified messagebus"
-team, search for "unified messagebus" in source.
+If we need to contact someone from the RH internal "unified messagebus" team, search for "unified messagebus" in source.
diff --git a/modules/sysadmin_guide/pages/hardware_troubleshooting_power.adoc b/modules/sysadmin_guide/pages/hardware_troubleshooting_power.adoc
new file mode 100644
index 0000000..481476e
--- /dev/null
+++ b/modules/sysadmin_guide/pages/hardware_troubleshooting_power.adoc
@@ -0,0 +1,88 @@
+== Hardware Troubleshooting Power Issue
+
+
+=== Overview
+This SOP shows some of the steps required to troubleshoot and diagnose a power issue with one of our servers. The original case is tracked in Infra ticket: https://pagure.io/fedora-infrastructure/issue/11950
+
+Symptoms:
+- This server is not responding at all, and will not power on.
+- Getting to the management interface of RDU2-CC devices is a bit trickier than for IAD2. We have a private management VLAN there, but it’s only reachable via cloud-noc-os01.rdu-cc.fedoraproject.org. I usually use the ‘sshuttle’ package/command/app to transparently forward my traffic to devices on that network. That looks something like: `sshuttle 172.23.1.0/24 -r cloud-noc-os01.rdu-cc.fedoraproject.org`
+
+ The devices are all in the 172.23.1 network. There’s a list of them in `ansible-private/docs/rdu-networks.txt`, but this host is `172.23.1.105`.
+ The management password can be obtained from the Bitwarden Vault.
+- Logs show issues with voltages not being in the correct range.
+- At RDU2-CC we have a contact: `James Gibson`.
+
+
+=== Contact Information
+
+Owner::
+ Fedora Infrastructure Team
+Contact::
+ #fedora-admin, sysadmin-main
+Purpose::
+ Provide basic orientation and introduction to the sysadmin group
+
+
+=== Requirements
+
+- sshuttle to access the network at RDU2-CC
+- Bitwarden Vault Access - Access to the vault is under discussion. For now, consult the sysadmin-main team for the login credentials.
+- Access to ansible-private repo.
+
+
+=== Troubleshooting Steps
+
+.Connect to the management VLAN for the RDU2-CC network:
+This is only required because this server is not in the IAD2 datacenter. Use sshuttle from your laptop directly (not from batcave01) to connect to the 172.23.1.0/24 management network: `sshuttle 172.23.1.0/24 -r cloud-noc-os01.rdu-cc.fedoraproject.org`
+
+.SSH to batcave01 and retrieve the IP address for this machine
+SSH to batcave01, access the ansible-private repo, and read the IP address for this machine from `docs/rdu-networks.txt`.
+
+.Open the Management Console
+With the IP address, visit https://IP in a browser to access the iDRAC management console, for example: https://172.23.1.105/
+
+.Retrieve the username and password from Bitwarden
+This is a prod machine, so use the username and password from Bitwarden to log in.
+
+.Once logged in, retrieve the service tag for this server
+Get the service tag (XXXXXXX); it is on the summary page of the management console. It is required to prove to Dell tech support that the server is under warranty.
+
+.Open a tech support ticket with Dell
+Open a ticket with tech support chat: https://www.dell.com/support/incidents-online/en-ie/ContactUs/Dynamic?spestate
+
+.Collect logs from the server for Dell
+See https://www.dell.com/support/kbdoc/en-us/000126308/export-a-supportassist-collection-via-idrac9 for how to collect logs for tech support.
+
+.Dell requested firmware updates on the iDRAC and server, along with a reseat of the OCP card.
+Contacted James Gibson internally and opened a ticket in ServiceNow, requesting that he arrange a trip to the datacenter to reseat the OCP card.
+Updated the firmware on the iDRAC itself successfully, but could not update the firmware on the server because it won't turn on.
+
+.OCP reseat carried out
+James went out to the RDU2 data center and carried out this work. Reseating the OCP card had no effect. He troubleshot further: with one PSU removed, the server was still stuck in a reboot cycle; with that PSU reattached and the other removed, the server booted fine. So we think we have identified a faulty PSU.
+
+.Request to reupload logs
+The first request was to generate the zipped TSR logs and forward them to Dell.
+Use the following site to upload the TSR, as it might be too big to attach to an email: https://tdm.dell.com/file-upload
+This requires a service request, so be sure to ask the Dell technician for a service request number in order to use this form.
+
+.Swap PSU1 with PSU2
+Dell requested the following check be carried out:
+Please swap PSU1 with PSU2 and check if the server will power up.
+If the issue persists, test PSU2 in slot 1 and confirm.
+Once completed, collect logs and share them so we can proceed with action.
+
+.Both PSUs seem functional
+James Gibson swapped the PSUs in this server on Friday, and the server is powering on as normal. So it appears both PSUs are in fact working; perhaps something is wrong with the chassis the units are going into? Informed Dell and are waiting on an update to see what to troubleshoot next.
+
+.Dell suggested plugging the hardware into different power points
+Since both ports have been tested, I'm thinking this could be an external issue or a configuration issue.
+Are the PSUs set to redundant?
+When plugged in at the same time, are they plugged into the same outlet/UPS?
+If so, can we test by plugging them into different outlets/UPS?
+
+.This appears to have resolved our issue.
+Forwarded information to James Gibson to see what he thinks.
+We have moved the power to different power points, with the 2nd PSU reattached, and the server appears to be working correctly now.
+Closed the ticket with Dell.
+
diff --git a/modules/sysadmin_guide/pages/index.adoc b/modules/sysadmin_guide/pages/index.adoc
index affe60f..6b637f5 100644
--- a/modules/sysadmin_guide/pages/index.adoc
+++ b/modules/sysadmin_guide/pages/index.adoc
@@ -80,6 +80,7 @@ xref:developer_guide:sops.adoc[Developing Standard Operating Procedures].
 * xref:blockerbugs.adoc[Blockerbugs Infrastructure]
 * xref:bodhi-deploy.adoc[Bodhi Infrastructure - Deployment]
 * xref:bodhi.adoc[Bodhi Infrastructure - Releng]
+* xref:bugzilla2fedmsg.adoc[Bugzilla 2 Fedmsg]
 * xref:bugzilla2fedmsg.adoc[bugzilla2fedmsg]
 * xref:collectd.adoc[Collectd]
 * xref:compose-tracker.adoc[Compose Tracker]
@@ -116,6 +117,7 @@ xref:developer_guide:sops.adoc[Developing Standard Operating Procedures].
 * xref:guestdisk.adoc[Guest Disk Resize]
 * xref:guestedit.adoc[Guest Editing]
 * xref:haproxy.adoc[Haproxy Infrastructure]
+* xref:hardware_troubleshooting_power.adoc[Hardware Troubleshoot Power Issue]
 * xref:hotfix.adoc[HOTFIXES]
 * xref:hotness.adoc[The New Hotness]
 * xref:infra-git-repo.adoc[Infrastructure Git Repos]
@@ -169,7 +171,6 @@ xref:developer_guide:sops.adoc[Developing Standard Operating Procedures].
 * xref:scmadmin.adoc[SCM Admin]
 * xref:selinux.adoc[SELinux Infrastructure]
 * xref:sigul-upgrade.adoc[Sigul servers upgrades/reboots]
-* xref:sop_hardware_troubleshooting_power.adoc[Hardware Troubleshoot Power Issue SOP]
 * xref:sshaccess.adoc[SSH Access Infrastructure]
 * xref:sshknownhosts.adoc[SSH known hosts Infrastructure]
 * xref:ssl-certificates.adoc[SSL Certificates]
diff --git a/modules/sysadmin_guide/pages/sop_hardware_troubleshooting_power.adoc b/modules/sysadmin_guide/pages/sop_hardware_troubleshooting_power.adoc
deleted file mode 100644
index 481476e..0000000
--- a/modules/sysadmin_guide/pages/sop_hardware_troubleshooting_power.adoc
+++ /dev/null
@@ -1,88 +0,0 @@
-== Hardware Troubleshooting Power Issue
-
-
-=== Overview
-This SOP shows some of the steps required to troubleshoot and diagnose a power issue with one of our servers. A ticket was opened Infra Ticket: https://pagure.io/fedora-infrastructure/issue/11950
-
-Symptoms:
-- This server is not responding at all, and will not power on.
-- To get to mgmt of RDU2-CC devices it’s a bit trickier than IAD2. We have a private management vlan there, but it’s only reachable via cloud-noc-os01.rdu-cc.fedoraproject.org. I usually use the ‘sshuttle’ package/command/app to transparently forward my traffic to devices on that network. That looks something like: `sshuttle 172.23.1.0/24 -r cloud-noc-os01.rdu-cc.fedoraproject.org`
-
- The devices are all in the 172.23.1 network. There’s a list of them in `ansible-private/docs/rdu-networks.txt` but this host is: `172.23.1.105`.
- In the Bitwarden Vault, the management password can be obtained.
-- Logs show issues with voltages not being in the correct range.
-- At RDU2-CC we have a contact: `James Gibson`.
-
-
-=== Contact Information
-
-Owner::
- Fedora Infrastructure Team
-Contact::
- #fedora-admin, sysadmin-main
-Purpose::
- Provide basic orientation and introduction to the sysadmin group
-
-
-=== Requirements
-
-- sshuttle to access the network at RDU2-CC
-- Bitwarden Vault Access - Access to the vault is under discussion. For now, consult the sysadmin-main team for the login credentials.
-- Access to ansible-private repo.
-
-
-=== Troubleshooting Steps
-
-.Connect to the management VLAN for the RDU2-CC network:
-This is only required because this server is not in IAD2 datacenter. Use sshuttle to make a connection to the 172.23.1.0/24 (from your laptop directly, not from the batcave01 to the management network). `sshuttle 172.23.1.0/24 -r cloud-noc-os01.rdu-cc.fedoraproject.org`
-
-.SSH to the batcave01 and retrieve the ip address for this machine
-Ssh to the batcave01, access the ansible-private repo and read the IP address for this machine from the `docs/rdu-networks.txt`
-
-.Open the Management Console
-With the IP address, visit https://IP in browser to access the idrac management console. Like so: https://172.23.1.105/
-
-.Retrieve the username and password from Bitwarden
-This is a prod machine so use the username and password from Bitwarden to login.
-
-.Once Logged in, retrieve the service tag for this server
-Get the service tag: XXXXXXX its on the summary page on the management console. This is required in order to prove to Dell tech support that the server is under warranty.
-
-.Open a tech support ticket with Dell
-Open a ticket with tech support chat: https://www.dell.com/support/incidents-online/en-ie/ContactUs/Dynamic?spestate
-
-.Collect logs from the server for Dell
-https://www.dell.com/support/kbdoc/en-us/000126308/export-a-supportassist-collection-via-idrac9 how to collect logs for tech support.
-
-.Dell requested firmware updates on the idrac and server, along with reseat of OCP card to be carried out.
-Contacted James Gibson internally and opened a ticket in servicenow. Requested that he arrange a trip to the datacenter in order to reseat this OCP card.
-Updated the firmware on the idrac itself successfully, but failed to update the firmware on the server obviously as it wont turn on.
-
-.OCP reseat carried out
-James finally managed to get out to the rdu-2 data center and carry out this work. Reseating the OCP had no effect, however he did troubleshoot further and removed one PSU, and still rebooting cycle, reattached and removed the other, and the server is booting fine. So we think we have identified a faulty PSU.
-
-.Request to reupload logs
-First request was to get the zip TSR logs generated and forwarded to Dell.
-Use the following site to upload the TSR as it might be too big to attach to email https://tdm.dell.com/file-upload
-This requires a service request, so be sure to ask the Dell technician for a service request number in order to use this form.
-
-.Swap PSU1 with PSU2
-Dell requested the following check be carried out:
-Please Swap PSU1 with PSU2 and check if the server will power up.
-if the issue persisit, test PSU2 on slot 1 and confirm
-Once completed collect logs and share so we can proceed with action.
-
-.Both PSUs seem functional
-James Gibson, swapped the PSU units in this server on Friday, and the server is powering on as normal. So appears both PSU units are in fact working, perhaps something wrong with the chassis the units are going into ? Informed Dell just waiting on update to see what to troubleshoot next.
-
-.Dell suggest use different power point to plug hardware into
-Since both ports has been test, I'm thinking this could be an external issue or a configuration issue.
-Are the PSUs set to redundant?
-When plugged at the same time, are them being plug to the same outlet/UPS?
-If so, can we test by plugging them to different outlets/UPS ?
-
-.This appears to have resolved our issue.
-Forwarded information to James Gibson to see what he thinks.
-We have moved the power to different power points, with the 2nd PSU reattached and the server appears to be working correctly now.
-Closed the ticket with Dell.
-