#8 Fedora Silverblue 30 + LUKS + intermittent booting
Closed 3 years ago by tpopela. Opened 5 years ago by otaylor.

I updated my laptop to Fedora Silverblue 30 today, and it boots up successfully about half the time - the other half of the time falling back to the systemd recovery prompt. The basic symptom of the failure at that /sysroot and /var fail to mount, but the actual point of divergence between a successful and failed boot seems to occur a bit earlier.

As much as I could determine, during the switch root from initrd to the main system, there's a point where systemd sees all devices going away:

 systemd[1]: Invoking unit catchup() handlers…
 systemd[1]:  sys-devices-platform-serial8250-tty-ttyS2.device: Changed plugged -> dead
 [... many more ...]

When things succeed, the LUKS device goes away first, then systemd sees the underlying partition going away, adds "stop the LUKS device" to the list of tasks, and "stop the bind mounts from the LUKS device" to the list of tasks, realizes that the the LUKS device was already stopped (stopping the LUKS device is "redundant"), and unwinds the bind mounts (usr.mount, sysroot.mount) as well.

systemd[1]: dev-mapper-luks<partuuid>.device: Changed plugged -> dead
systemd[1]: dev-disk-by\x2did-dm\x2dname\x2dluks<partuuid>.device: Changed plugged -> dead
systemd[1]: systemd-cryptsetup@luks<partuuid>.service: Trying to enqueue job systemd-cryptsetup@luks<partuuid>.service/stop/replace
systemd[1]: Added job systemd-cryptsetup@luks<partuuid>.service/stop to transaction.
systemd[1]: Pulling in dev-mapper-luks<partuuid>.device/stop from systemd-cryptsetup@luks<partuuid>.service/stop
systemd[1]: Added job dev-mapper-luks<partuuid>.device/stop to transaction.
systemd[1]: Pulling in sysroot.mount/stop from dev-mapper-luks<partuuid>.device/stop
systemd[1]: Added job sysroot.mount/stop to transaction.
systemd[1]: Pulling in usr.mount/stop from dev-mapper-luks<partuuid>.device/stop
systemd[1]: Added job usr.mount/stop to transaction.
systemd[1]: Found redundant job dev-mapper-luks<partuuid>.device/stop, dropping from transaction.
systemd[1]: Garbage collecting job usr.mount/stop
systemd[1]: Garbage collecting job sysroot.mount/stop

But when things fail, systemd first sees the underlying partitions going away, and before it sees the luks device going away, adds jobs to unmount the /usr and /sysroot bind mounts.

systemd[1]: Invoking unit catchup() handlers…
systemd[1]: dev-sda2.device: Changed plugged -> dead
systemd[1]: dev-disk-by\x2duuid-d3147813\x2dbee2\x2d4914\x2da8ff\x2d6768fb5822f3.device: Changed plugged -> dead
systemd[1]: systemd-cryptsetup@luks<partuuid>.service: Trying to enqueue job systemd-cryptsetup@luks<partuuid>.service/stop/replace
systemd[1]: Added job systemd-cryptsetup@luks<partuuid>.service/stop to transaction.
systemd[1]: Pulling in dev-mapper-luks<partuuid>.device/stop from systemd-cryptsetup@luks<partuuid>.service/stop
systemd[1]: Added job dev-mapper-luks<partuuid>.device/stop to transaction.
systemd[1]: Pulling in usr.mount/stop from dev-mapper-luks<partuuid>.device/stop
systemd[1]: Added job usr.mount/stop to transaction.
systemd[1]: Pulling in sysroot.mount/stop from dev-mapper-luks<partuuid>.device/stop
systemd[1]: Added job sysroot.mount/stop to transaction.
systemd[1]: Pulling in cryptsetup.target/stop from systemd-cryptsetup@luks<partuuid>.service/stop
systemd[1]: Keeping job usr.mount/stop because of dev-mapper-luks<partuuid>.device/stop
systemd[1]: Keeping job sysroot.mount/stop because of dev-mapper-luks<partuuid>.device/stop
systemd[1]: Keeping job dev-mapper-luks<partuuid>.device/stop because of systemd-cryptsetup@luks<partuuid>.service/stop
systemd[1]: usr.mount: Installed new job usr.mount/stop as 188
systemd[1]: sysroot.mount: Installed new job sysroot.mount/stop as 189
systemd[1]: systemd-cryptsetup@luks<partuuid>.service: Installed new job systemd-cryptsetup@luks<partuuid>.service/stop as 186
systemd[1]: dev-mapper-luks<partuid>.device: Installed new job dev-mapper-luks<partuuid>.device/stop as 187

but only sees the LUKS device going later:

systemd[1]:  dev-mapper-luks<partuuid>: Changed plugged -> dead
systemd[1]: dev-mapper-luks<partuuid>.device: Job 187 dev-mapper-luks<partuuid>.device/stop finished, result=done
systemd[1]: Stopped /dev/mapper/luks-<partuuid>.

so the unmount of /usr runs and fails:

umount[672]: umount: /usr: target is busy.
systemd[1]: usr.mount: Mount process exited, code=exited, status=32/n/a
systemd[1]: Failed unmounting /usr.

and that cascades to the final failure. I'll attach the full logs as well - it's possible that the above is a red herring - it seemed to be repeatedly the first significant looking divergence between a successful and failed build - but that could be coincidence or it could be this is a symptom of the real problem.

@zbyszek @walters - any ideas why this would have started happening?


What I determined from most of a day of looking at this is:

  • The plugged -> dead -> plugged transition during Switch Root is not expected (according to @zbyszek) - so the details of how it happens as described above are irrelevant.
  • it occurs with systemd-241 and systemd-240 for both ostree and non-ostree Fedora (though it only obviously causes problems for ostree systems.)
  • It doesn't occur with systemd-239
  • The logs above are missing some lines at reexec because of kernel message throttling - (I didn't know I needed printk.devkmsg=on)
  • the initial call to device_enumerate() after reexec finds no devices (probably because /run/udev/tags doesn't exist?) - which is why all the devices that are deserialized get the plugged -> dead transition
  • the regression might plausibly be related to all the sd-device related refactoring that landed in systemd-240, but I couldn't immediately find suspect code.

Yu Watanabe submitted a pull upstream to fix the reported issue - https://github.com/systemd/systemd/pull/12013 - I rebuilt the F30 systemd package with that patch set and layered that on my system with rpm-ostree override replace; rpm-ostree initramfs --enable and the issue seems resolved.

@zbyszek - it would be great to have a backport of this in our systemd packages once it lands upstream - I don't know how widespread these problems are going to be, but I'm certainly not the only person using ostree on top of LUKS.

Metadata Update from @tpopela:
- Issue status updated to: Closed (was: Open)

3 years ago

Log in to comment on this ticket.

Metadata
Attachments 2