From the emails and configs, it looks like sdh and sdb have failed on the hardware. Filing a ticket because this could cause a big outage that will need to be tracked.
This is an automatically generated mail message from mdadm
running on bvmhost-x86-03.iad2.fedoraproject.org

A Fail event had been detected on md device /dev/md/0.

It could be related to component device /dev/sdb1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdh2[7](F) sdg2[6] sdf2[5] sde2[4] sdc2[2] sda2[0] sdd2[3]
      488384 blocks super 1.0 [8/6] [U_UUUUU_]
      bitmap: 0/1 pages [0KB], 65536KB chunk

bvmhost-x86-03.iad2.fedoraproject.org:compose-iot01.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:compose-rawhide01.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:koji02.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:mbs-backend01.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:oci-registry02.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:odcs-backend01.iad2.fedoraproject.org:running:1
bvmhost-x86-03.iad2.fedoraproject.org:sign-bridge01.iad2.fedoraproject.org:running:1
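For anyone reading the mdstat output above, here is a rough sketch of how the `[8/6] [U_UUUUU_]` status can be turned into a list of missing slots. The function name `degraded_arrays` and the `SAMPLE` constant are made up for illustration. This is not a script that runs on the host, and it only handles the simple layout shown in the mail.

```python
import re

def degraded_arrays(mdstat_text):
    """Return {array: (expected, active, missing_slots)} for degraded arrays.

    Hypothetical helper: parses only the simple /proc/mdstat layout shown
    in the alert mail, not every format the kernel can emit.
    """
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header, e.g. "md1 : active raid1 sdh2[7](F) ..."
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status, e.g. "[8/6] [U_UUUUU_]": 8 slots expected, 6 active
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            # "_" marks a slot with no working member
            missing = [i for i, flag in enumerate(status.group(3)) if flag == "_"]
            if missing:
                result[current] = (expected, active, missing)
    return result

# The mdstat excerpt from the alert mail above
SAMPLE = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdh2[7](F) sdg2[6] sdf2[5] sde2[4] sdc2[2] sda2[0] sdd2[3]
      488384 blocks super 1.0 [8/6] [U_UUUUU_]
      bitmap: 0/1 pages [0KB], 65536KB chunk
"""

print(degraded_arrays(SAMPLE))  # {'md1': (8, 6, [1, 7])}
```

Slots 1 and 7 line up with the member list: there is no `[1]` member at all (sdb is gone), and `sdh2[7]` is marked `(F)`.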
I'll try to call Dell tomorrow about this.
One drive is completely gone (doesn't even show in mgmt). The other one shows fine in mgmt, but throws errors in Linux.
Ideally they would replace both drives, but failing that, they would replace the one that's completely offline, and we would reboot and re-add it, then do the same for the other.
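As a rough illustration of the re-add step, the sketch below only builds and prints the mdadm manage-mode commands; it does not run anything. The helper name `readd_commands` is hypothetical, and `/dev/md1` and `/dev/sdb2` are assumed from the mdstat output in the alert mail. The replacement disk would need a matching partition layout first, and every array on the host would need the same treatment.

```python
import shlex

def readd_commands(array, new_member, failed_member=None):
    """Build (but do not run) mdadm commands to bring a member back.

    Hypothetical helper. The device names passed in are assumptions and
    must be checked against the real host before use.
    """
    commands = []
    if failed_member:
        # A member still listed as faulty is marked failed and removed first
        commands.append(["mdadm", "--manage", array, "--fail", failed_member])
        commands.append(["mdadm", "--manage", array, "--remove", failed_member])
    # Add the partition on the replacement disk; the array then resyncs
    commands.append(["mdadm", "--manage", array, "--add", new_member])
    return commands

# Assumed names: array md1, replacement partition sdb2
for cmd in readd_commands("/dev/md1", "/dev/sdb2"):
    print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere and leaves the actual change to whoever is on the host.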
In the event of doom, all the VMs on this server could be redeployed on another one; none of them should have local data.
Metadata Update from @zlopez: - Issue tagged with: medium-gain, medium-trouble, ops
Dell is sending a tech to swap 2 drives and some memory. This will happen later today.
We will try to keep downtime to a minimum.
Drive 1 was replaced. Drive 7 actually didn't produce errors on reboot, so we think it might have just been in a bad state or the memory issue was affecting it.
Memory was replaced.
Machine is back up and running.
Metadata Update from @kevin: - Issue close_status updated to: Fixed with Explanation - Issue status updated to: Closed (was: Open)