Planned Outage - openqa / openqa-lab - 2024-10-08 21:00UTC
There will be an outage starting at 2024-10-08 21:00UTC, which will last approximately 3 hours.
To convert UTC to your local time, take a look at http://fedoraproject.org/wiki/Infrastructure/UTCHowto or run:
date -d '2024-10-08 21:00UTC'
Reason for outage:
We will be reinstalling some openqa virthosts and database hosts, as well as reinstalling workers to use a common partitioning and networking setup.
Affected Services:
openqa / openqa-lab. During the outage, updates may not go stable while waiting for testing. After the outage is over, openqa will test all pending updates (no need to resubmit).
Ticket Link:
https://pagure.io/fedora-infrastructure/issue/12206
Please join #fedora-admin or #fedora-noc on irc.libera.chat or #admin:fedoraproject.org / #noc:fedoraproject.org on matrix. Please add comments to the ticket for this outage above.
Updated status for this outage may be available at https://www.fedorastatus.org/
Things we should line up before this:
common kickstart / grub.cfg for installs so we pass the right network naming and use the same storage config.
Before the outage we need to save off all the libvirt xml from the guests on the qvmhosts (see the sketch after this list).
Before upgrading the database server we need to stop it, copy off /var/lib/pgsql, install the new release, sync the data back, and pg_upgrade it to the latest version (see the sketch after this list).
I don't know if we need to collect / do anything to be able to re-run tests once things are back up.
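A minimal sketch of the libvirt XML save-off mentioned in the list above, assuming a hypothetical /root/libvirt-backup destination that then gets copied off-host before the reinstall; run on each qvmhost:

```
# dump the XML definition of every defined guest, running or shut off
mkdir -p /root/libvirt-backup
for dom in $(virsh list --all --name); do
    virsh dumpxml "$dom" > "/root/libvirt-backup/${dom}.xml"
done
```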
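Similarly, a rough sketch of the database step, assuming Fedora's postgresql-upgrade tooling and a hypothetical /backup path that survives (or lives outside) the reinstall:

```
systemctl stop postgresql
rsync -aHAX /var/lib/pgsql/ /backup/pgsql/     # copy the data dir somewhere safe
# ... reinstall the host, install postgresql-server and postgresql-upgrade ...
rsync -aHAX /backup/pgsql/ /var/lib/pgsql/     # sync the old data back
postgresql-setup --upgrade                     # wraps pg_upgrade for the packaged layout
systemctl start postgresql
```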
CC: @adamwill
the network stuff is a bit tricky because each host's host vars will need to be adjusted; there's a bunch of stuff in there to try and bring up only the appropriate interfaces, plus some iptables config for the tap worker hosts. other than that, I think things should work again as soon as everything is back up and the networking is correct.
Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops
Minor change before you email: s/reinstlaling/reinstalling/
oops. Already sent... oh wait, I can kill it from the mod queue.
Also, the 'X hours' needs changing to 3 hours
argh, already passed through. Oh well, updated the initial comment here.
so, here's the current status of all worker hosts:
| prod worker | disks | layout | network naming |
| --- | --- | --- | --- |
| openqa-x86-worker01 | 6 | XFS-on-LVM-on-LUKS-on-md | biosdevname |
| openqa-x86-worker02 | 10 | XFS-on-LVM-on-md | biosdevname |
| openqa-x86-worker06 | 10 | btrfs-native-raid | biosdevname |
| openqa-a64-worker04 | 6n | XFS-on-LVM-on-LUKS-on-md | eth |

| stg worker | disks | layout | network naming |
| --- | --- | --- | --- |
| openqa-a64-worker01 | offline | offline | offline |
| openqa-a64-worker02 | 1 | XFS | eth |
| openqa-a64-worker03 | 1 | XFS | eth |
| openqa-p09-worker01 | 8 | btrfs-on-LUKS-on-md | udev |
| openqa-p09-worker02 | 8 | XFS-on-LVM-on-md | eth |
| openqa-x86-worker04 | 10 | btrfs-on-LUKS-on-md | biosdevname |
| openqa-x86-worker05 | 10 | btrfs-on-LUKS-on-md | udev |
| openqa-x86-worker03 | 10 | btrfs-on-LUKS-on-md | udev |
a64-worker01 is offline because of some bad memory.
So, obviously we can't use only one kickstart: we have 1-disk, 6-disk, 6-disk NVME, 8-disk and 10-disk cases to worry about.
For storage I think we should go with btrfs-native-raid-on-LUKS or btrfs-on-LUKS-on-md for all hosts, just for consistency. This will mean updating all the kickstarts we intend to use. For the single-disk hosts we can just go with btrfs-on-LUKS, I guess.
For network naming, I think we should go with udev on all hosts. This involves making sure biosdevname is not installed and that no net.ifnames=0 arg is passed at install time or to the installed system. Then we need to update the host vars for boxes not currently using udev naming.
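As a quick post-install sanity check, something along these lines (plain standard tooling, nothing openQA-specific) should confirm a host ended up on udev naming:

```
rpm -q biosdevname        # should report the package is not installed
grep -o 'net\.ifnames=[01]' /proc/cmdline || echo "no net.ifnames override"
ip -o link show | awk -F': ' '{print $2}'    # expect enp*/eno* style names, not eth0/em1
```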
looking into it a bit: for btrfs-native-raid-on-LUKS I think we would use the openqa-worker-fedora-btrfsraid kickstart as a base, but add --encrypted --passphrase=(passphrase) on each of the part btrfs lines, and have 6, 6nvme, 8 and 10 variants of that kickstart, plus a slightly different 1-disk variant with no RAID for the old a64 workers (not sure what kickstart they were initially deployed with).
The downside there is that each btrfs partition is separately encrypted, I guess. I don't know if there's a better way to do that. (Also, I don't entirely know if that works; it's not a config we've tried before.) If we want to avoid that, we can go with btrfs-on-LUKS-on-md, which I think means using openqa-worker-fedora-10disk as a base and having variants for the other disk sizes.
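For illustration, a trimmed two-disk sketch of the btrfs-native-raid-on-LUKS idea; disk names, sizes, and the passphrase placeholder are assumptions, and as noted above this layering hasn't been tried here before:

```
# boot/EFI partitions omitted for brevity
part btrfs.01 --grow --ondisk=sda --encrypted --passphrase=CHANGEME
part btrfs.02 --grow --ondisk=sdb --encrypted --passphrase=CHANGEME
# assemble the members into one btrfs volume; the real 6/8/10-disk variants
# would just list more members (and could use raid10, which needs 4+ disks)
btrfs none --label=fedora --data=raid1 --metadata=raid1 btrfs.01 btrfs.02
btrfs / --subvol --name=root LABEL=fedora
```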
One additional thing here: I think btrfs raid5/6 is still not recommended? So raid10? But that won't work on hosts with too few disks.
for openQA the purpose of the raid is primarily performance, we don't need redundancy (they're entirely transient deployments after all). so...whichever level gives best performance. although 0 might be a bit too dangerous with this many disks i guess.
Sure, also, it looks like there's really not that much space used on workers.
ok, all but the workers are done. We will be doing them over time since they won't cause outages (generally).
Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)
workers are now all done except a64-03; kevin is working on that one.
a64-03 failed on the nbde stuff (which I guess makes sense, since it's a single disk rather than RAID with md0)
ah. I guess we need to tweak it to work on a single disk setup. I'll maybe look at that tomorrow.
https://pagure.io/fedora-infra/ansible/pull-request/2299 should fix the nbde issue, does that look sound to you @kevin ?
yep looks good
ok, that worked, all is done here.