Help with bisecting the kernel to find a bug.
Background:
openqa-x86-worker04.iad2.fedoraproject.org has an issue with recent kernels we're trying to track down. At least it seems to follow the kernel.
I posted an inquiry to the XFS upstream list, and XFS upstream developers would like to know the following: https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
@adamwill got me some of these things, but not all of them; the complete storage stack details are still missing. I can help with the commands to get all of this if someone wants to ping me.
But before we do that, I'm wondering:
Some notes on reprovisioning:
raid --chunk 64
However, since the problem is clearly binary (works with 5.11, doesn't with 5.12+), I'm skeptical that reprovisioning per above will do anything but make it less bad, and maybe even harder to track down.
I also had an idea of using Btrfs, which would remove XFS and LVM and possibly mdadm from the storage stack layers. The problem could still be in the dm-crypt layer though. And if the problem then doesn't happen, we still haven't learned whether it's XFS, LVM, mdadm, or some combination of suboptimal provisioning and a bug.
My two cents is to keep the current configuration, and do the kernel bisect. And get all the details the XFS devs want from the XFS FAQ.
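For gathering the details, something along these lines might be a starting point. It is only a sketch: the choice of `/proc` files and commands is an assumption loosely modeled on the kind of information the XFS FAQ asks for, not the FAQ's authoritative list, and the helper names (`read_file`, `run_command`, `collect_report`) are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed selection of sources; adjust to match the XFS FAQ checklist.
PROC_FILES = ["/proc/meminfo", "/proc/mounts", "/proc/partitions", "/proc/mdstat"]
COMMANDS = [
    ["uname", "-a"],       # running kernel version
    ["xfs_repair", "-V"],  # xfsprogs version
    ["lsblk"],             # block device tree (md, dm-crypt, LVM layers)
]


def read_file(path):
    """Return the file contents, or a short note if it cannot be read."""
    try:
        return Path(path).read_text(errors="replace")
    except OSError as exc:
        return f"(unreadable: {exc})"


def run_command(argv):
    """Run a command and return its output, or a note if it is unavailable."""
    if shutil.which(argv[0]) is None:
        return "(command not found)"
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"(failed: {exc})"
    return result.stdout + result.stderr


def collect_report():
    """Collect every source into a dict keyed by file path or command line."""
    report = {}
    for path in PROC_FILES:
        report[path] = read_file(path)
    for argv in COMMANDS:
        report[" ".join(argv)] = run_command(argv)
    return report


if __name__ == "__main__":
    for name, text in collect_report().items():
        print(f"===== {name} =====")
        print(text.rstrip())
```

Device-specific details (the md array, the LVM volumes, `xfs_info` on the affected mount point) need arguments that depend on this host's layout, so they are left out here and would have to be added by hand.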
On the one hand, it's running kernel 5.11 fc34 and is working OK for now, so it's not urgent. On the other hand, it's running kernel 5.11, which has been EOL for a long time, and that means these setups are stuck on Fedora 34 until this is resolved.
I'm not sure there's any point having an issue as well as a bug. I'm responsible for the openQA worker machines, infra is only responsible for getting them on the network and giving me access to them.
There are several worker hosts spread across staging and production. The machines of each arch are mostly identical to each other.
The systems aren't stuck on Fedora 34. They are all running Fedora 35 already. I'm just running older kernels on F35.
Doing a kernel bisect is possible but it will take a large chunk of someone's time as it involves waiting for kernel builds, rebooting the system via a management interface - which takes about three minutes every time, because these are enterprise systems that take forever to get through the firmware stage of boot - and waiting to see if the problem reproduces.
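To put a rough number on that cost, here is a minimal sketch of the halving search a bisect performs, plus a time estimate. It treats history as a simple ordered list, which real kernel history with merges is not, so the step count is only approximate. The commit count, build time, and reproduction time in the example are hypothetical placeholders; only the three-minute reboot comes from the description above.

```python
import math


def first_bad(candidates, is_bad):
    """Binary search for the first bad entry.

    Assumes candidates are ordered oldest to newest, that everything before
    the culprit is good, and that the culprit and everything after it is bad
    (so the last entry is known bad). In practice is_bad would mean: build
    that kernel, reboot into it, and wait to see whether the problem shows up.
    Returns (first bad entry, number of tests run).
    """
    lo, hi = 0, len(candidates) - 1
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(candidates[mid]):
            hi = mid        # culprit is at mid or earlier
        else:
            lo = mid + 1    # culprit is after mid
    return candidates[lo], tests


def estimate_hours(commit_count, build_min, reboot_min, repro_min):
    """Return (approximate steps, total hours) for a bisect over commit_count."""
    steps = math.ceil(math.log2(commit_count)) if commit_count > 1 else 0
    return steps, steps * (build_min + reboot_min + repro_min) / 60


if __name__ == "__main__":
    # Toy run: 16 candidates, the 11th introduced the problem.
    print(first_bad(list(range(1, 17)), lambda c: c >= 11))

    # Hypothetical inputs: 14000 commits, 60 min build, 120 min to reproduce.
    # Only the 3 minute reboot is taken from the ticket.
    print(estimate_hours(14000, build_min=60, reboot_min=3, repro_min=120))
```

The point of the estimate is that the number of rounds grows only with the logarithm of the range, so the per-round cost (build, reboot, wait for reproduction) dominates the total.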
Oops, I'll close this then.
Metadata Update from @chrismurphy: - Issue close_status updated to: Invalid - Issue status updated to: Closed (was: Open)