I updated my laptop to Fedora Silverblue 30 today, and it boots up successfully about half the time - the other half of the time falling back to the systemd recovery prompt. The basic symptom of the failure at that /sysroot and /var fail to mount, but the actual point of divergence between a successful and failed boot seems to occur a bit earlier.
As much as I could determine, during the switch root from initrd to the main system, there's a point where systemd sees all devices going away:
systemd[1]: Invoking unit catchup() handlers… systemd[1]: sys-devices-platform-serial8250-tty-ttyS2.device: Changed plugged -> dead [... many more ...]
When things succeed, the LUKS device goes away first, then systemd sees the underlying partition going away, adds "stop the LUKS device" to the list of tasks, and "stop the bind mounts from the LUKS device" to the list of tasks, realizes that the the LUKS device was already stopped (stopping the LUKS device is "redundant"), and unwinds the bind mounts (usr.mount, sysroot.mount) as well.
systemd[1]: dev-mapper-luks<partuuid>.device: Changed plugged -> dead systemd[1]: dev-disk-by\x2did-dm\x2dname\x2dluks<partuuid>.device: Changed plugged -> dead systemd[1]: systemd-cryptsetup@luks<partuuid>.service: Trying to enqueue job systemd-cryptsetup@luks<partuuid>.service/stop/replace systemd[1]: Added job systemd-cryptsetup@luks<partuuid>.service/stop to transaction. systemd[1]: Pulling in dev-mapper-luks<partuuid>.device/stop from systemd-cryptsetup@luks<partuuid>.service/stop systemd[1]: Added job dev-mapper-luks<partuuid>.device/stop to transaction. systemd[1]: Pulling in sysroot.mount/stop from dev-mapper-luks<partuuid>.device/stop systemd[1]: Added job sysroot.mount/stop to transaction. systemd[1]: Pulling in usr.mount/stop from dev-mapper-luks<partuuid>.device/stop systemd[1]: Added job usr.mount/stop to transaction. systemd[1]: Found redundant job dev-mapper-luks<partuuid>.device/stop, dropping from transaction. systemd[1]: Garbage collecting job usr.mount/stop systemd[1]: Garbage collecting job sysroot.mount/stop
But when things fail, systemd first sees the underlying partitions going away, and before it sees the luks device going away, adds jobs to unmount the /usr and /sysroot bind mounts.
systemd[1]: Invoking unit catchup() handlers… systemd[1]: dev-sda2.device: Changed plugged -> dead systemd[1]: dev-disk-by\x2duuid-d3147813\x2dbee2\x2d4914\x2da8ff\x2d6768fb5822f3.device: Changed plugged -> dead systemd[1]: systemd-cryptsetup@luks<partuuid>.service: Trying to enqueue job systemd-cryptsetup@luks<partuuid>.service/stop/replace systemd[1]: Added job systemd-cryptsetup@luks<partuuid>.service/stop to transaction. systemd[1]: Pulling in dev-mapper-luks<partuuid>.device/stop from systemd-cryptsetup@luks<partuuid>.service/stop systemd[1]: Added job dev-mapper-luks<partuuid>.device/stop to transaction. systemd[1]: Pulling in usr.mount/stop from dev-mapper-luks<partuuid>.device/stop systemd[1]: Added job usr.mount/stop to transaction. systemd[1]: Pulling in sysroot.mount/stop from dev-mapper-luks<partuuid>.device/stop systemd[1]: Added job sysroot.mount/stop to transaction. systemd[1]: Pulling in cryptsetup.target/stop from systemd-cryptsetup@luks<partuuid>.service/stop systemd[1]: Keeping job usr.mount/stop because of dev-mapper-luks<partuuid>.device/stop systemd[1]: Keeping job sysroot.mount/stop because of dev-mapper-luks<partuuid>.device/stop systemd[1]: Keeping job dev-mapper-luks<partuuid>.device/stop because of systemd-cryptsetup@luks<partuuid>.service/stop systemd[1]: usr.mount: Installed new job usr.mount/stop as 188 systemd[1]: sysroot.mount: Installed new job sysroot.mount/stop as 189 systemd[1]: systemd-cryptsetup@luks<partuuid>.service: Installed new job systemd-cryptsetup@luks<partuuid>.service/stop as 186 systemd[1]: dev-mapper-luks<partuid>.device: Installed new job dev-mapper-luks<partuuid>.device/stop as 187
but only sees the LUKS device going later:
systemd[1]: dev-mapper-luks<partuuid>: Changed plugged -> dead systemd[1]: dev-mapper-luks<partuuid>.device: Job 187 dev-mapper-luks<partuuid>.device/stop finished, result=done systemd[1]: Stopped /dev/mapper/luks-<partuuid>.
so the unmount of /usr runs and fails:
umount[672]: umount: /usr: target is busy. systemd[1]: usr.mount: Mount process exited, code=exited, status=32/n/a systemd[1]: Failed unmounting /usr.
and that cascades to the final failure. I'll attach the full logs as well - it's possible that the above is a red herring - it seemed to be repeatedly the first significant looking divergence between a successful and failed build - but that could be coincidence or it could be this is a symptom of the real problem.
@zbyszek @walters - any ideas why this would have started happening?
<img alt="gzip'ed log when it worked" src="/fedora-silverblue/issue/raw/files/0f47b279333f75749d86f1302352e22cb8532e8393d71e674e117fdab294935d-debug-worked.log.gz" /> <img alt="gzip'ed log when it didn't" src="/fedora-silverblue/issue/raw/files/bc76ce874d2a0eb3d449e71f3b64df594535f3f34e155c18dc9fd0f4df846e18-debug-failure.log.gz" />
What I determined from most of a day of looking at this is:
Filed https://github.com/systemd/systemd/issues/11997
Yu Watanabe submitted a pull upstream to fix the reported issue - https://github.com/systemd/systemd/pull/12013 - I rebuilt the F30 systemd package with that patch set and layered that on my system with rpm-ostree override replace; rpm-ostree initramfs --enable and the issue seems resolved.
rpm-ostree override replace; rpm-ostree initramfs --enable
@zbyszek - it would be great to have a backport of this in our systemd packages once it lands upstream - I don't know how widespread these problems are going to be, but I'm certainly not the only person using ostree on top of LUKS.
Metadata Update from @tpopela: - Issue status updated to: Closed (was: Open)
Log in to comment on this ticket.