#10555 openqa-x86-worker04.iad2 eventually breaks on kernel 5.12+
Closed: Invalid 2 years ago by chrismurphy. Opened 2 years ago by chrismurphy.

Describe what you would like us to do:

Help with bisecting the kernel to find a bug.

Background:

openqa-x86-worker04.iad2.fedoraproject.org has an issue with recent kernels that we're trying to track down. At least, the problem seems to follow the kernel version.

I posted an inquiry to the XFS upstream list, and the XFS developers would like the information described here:
https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

@adamwill got me some of these things, but not all of them (notably the complete storage stack details). I can help with the commands to gather the rest if someone wants to ping me.
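As a rough sketch, these are the kinds of commands that cover most of what that FAQ asks for; the device name and mount point below are placeholders and will differ on these hosts:

```
# kernel and xfsprogs versions
uname -a
xfs_repair -V

# CPU, memory, mounts, partitions
nproc
cat /proc/meminfo /proc/mounts /proc/partitions

# storage stack layout: block devices, md RAID, LVM, device-mapper
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT
cat /proc/mdstat
mdadm --detail /dev/md127      # placeholder md device name
pvs; vgs; lvs
dmsetup table                  # shows LVM and dm-crypt mappings

# filesystem geometry and recent kernel messages
xfs_info /var/lib/openqa       # placeholder mount point
dmesg | tail -n 200
```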

But before we do that, I'm wondering:

  • we need to confirm the first Fedora kernel the problem appeared in, and then bisect; the bug is bad enough, and still unfixed in the 5.16 series, that we really ought to find it if we can. Is it possible to do a kernel bisect (see the sketch after this list)? I've done tons of these and can help, but I don't have access to these hosts.
  • there are production and staging setups; are their storage stacks identical? If not, since the problem happens on both, we should work on the one with the simpler setup. What about the Power 9 system? If it isn't failing as reliably or as often, it's probably not as good a candidate.
  • these systems are effectively stuck on Fedora 34, since that's the only current Fedora release with a 5.11 kernel that works for this workload; this probably needs to be figured out before upgrading or reprovisioning.
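
On the bisect question: the usual upstream workflow is a git bisect between the last known-good and first known-bad releases, building and booting a kernel at each step. A minimal sketch, assuming v5.11 is good, v5.12 is bad, and the host's existing kernel config is copied in as .config:

```
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git bisect start
git bisect bad v5.12               # first known-bad release
git bisect good v5.11              # last known-good release

# at each step: build, install, reboot, and exercise the openQA workload
cp /boot/config-"$(uname -r)" .config
make olddefconfig
make -j"$(nproc)"
sudo make modules_install install

# after testing, report the result and repeat until the first bad commit is found
git bisect good                    # or: git bisect bad
```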

Some notes on reprovisioning:

  • use a kickstart specifying raid --chunk 64 so that the mdadm chunk size is 64 KiB. The default mdadm chunk size is 512 KiB, which the XFS devs have never liked and always complain about, especially for metadata-heavy workloads like VMs. (See the kickstart sketch after this list.)
  • specify the target file system size in the kickstart, rather than resizing it later. From my December inquiry about this problem, the XFS devs discovered that the main volume was small at mkfs time and then grown, so it now has too small a journal, which they consider a contributing factor in the issue. xfs_growfs can't change the journal size; it's set only at mkfs time.
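
A rough sketch of what the relevant kickstart storage section might look like, with hypothetical disk names, RAID level, volume names, and sizes; the exact chunk-size option spelling (--chunk vs. --chunksize) should be double-checked against the pykickstart version in the installer being used:

```
# RAID member partitions (hypothetical: four disks sda-sdd)
part raid.01 --size=1 --grow --ondisk=sda
part raid.02 --size=1 --grow --ondisk=sdb
part raid.03 --size=1 --grow --ondisk=sdc
part raid.04 --size=1 --grow --ondisk=sdd

# striped md RAID with a 64 KiB chunk size instead of the 512 KiB default
raid pv.01 --level=RAID10 --device=md0 --chunksize=64 raid.01 raid.02 raid.03 raid.04

# LVM on top; create the logical volume at its final size so the XFS journal
# is sized correctly at mkfs time (xfs_growfs can't change it later)
volgroup vg_openqa pv.01
logvol /var/lib/openqa --vgname=vg_openqa --name=openqa --fstype=xfs --size=1000000
```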

However, since the problem is clearly binary (works with 5.11, doesn't with 5.12+), I'm skeptical that reprovisioning as above will do anything but make the problem less severe, and maybe even harder to track down.

I also had the idea of using Btrfs, which would remove XFS, LVM, and possibly mdadm from the storage stack. The problem could still be in the dm-crypt layer, though. And if the problem then doesn't happen, we still haven't learned whether it's XFS, LVM, mdadm, or some combination of suboptimal provisioning and a bug.

My two cents: keep the current configuration, do the kernel bisect, and gather all the details the XFS devs ask for in their FAQ.

When do you need this to be done by? (YYYY/MM/DD)

On the one hand, it's running the 5.11 fc34 kernel and working OK for now, so this isn't urgent. On the other hand, 5.11 has been EOL for a long time, and it means these setups are stuck on Fedora 34 until this is resolved.


I'm not sure there's any point having an issue as well as a bug. I'm responsible for the openQA worker machines, infra is only responsible for getting them on the network and giving me access to them.

There are several worker hosts spread across staging and production. The machines of each arch are mostly identical to each other.

The systems aren't stuck on Fedora 34. They are all running Fedora 35 already. I'm just running older kernels on F35.

Doing a kernel bisect is possible, but it will take a large chunk of someone's time: each iteration involves waiting for a kernel build, rebooting the system via a management interface - which takes about three minutes every time, because these are enterprise systems that take forever to get through the firmware stage of boot - and waiting to see whether the problem reproduces.

Oops, I'll close this then.

Metadata Update from @chrismurphy:
- Issue close_status updated to: Invalid
- Issue status updated to: Closed (was: Open)

2 years ago
