#259 Rescue boot entry not updated on OS upgrades
Closed: Won't fix 2 years ago by chrismurphy. Opened 2 years ago by axels.

I recently upgraded to Fedora 35 and not long after tried to use rescue mode: the entry at the GRUB bootloader that boots without a graphical interface and lets you log in directly as root. I was able to boot into rescue mode, but found something surprising.

My rescue boot entry is labelled:
Fedora (0-rescue-2ec787e75c0843b69d1d8f66082910a6) 30 (Workstation Edition)

which is consistent with the fact the first version of Fedora installed on this laptop was Fedora 30.

So it appears that although the OS has been updated through 5 major versions, the rescue entry was never updated, and still relies on a 4.x kernel.

This doesn't seem right and should probably be fixed.


I'm pretty sure this is working as intended. In the short term, maybe the kernel package script that initially creates the rescue vmlinuz+initramfs pair could be re-executed during offline upgrades, so the rescue entry is kept up to date automatically?

I like the idea of reimagining it as an entry that can successfully boot, mount the root file system entirely read-only, and put it on a volatile overlay, similar to LiveOS boots. Or, alternatively, boot from a small independent partition containing a copy of our Live installer, or some variant of it.

The script that makes a new rescue kernel does so on the first kernel install after the existing rescue kernel has been deleted. On a fresh install this is easy, because there never was a rescue kernel: either the system boots and it works, or it doesn't matter that the rescue kernel is bad, because the system won't boot the installed image anyway. For future updates, it would be very wise to ensure that a kernel works before making it the rescue kernel, so you want to boot it first, before deleting a known-good rescue kernel. I don't see a realistic way to automate this: what happens if there is a regression in the new kernel, so it seems to work initially but has bigger problems long term? Any update to the rescue kernel needs to be intentional.

I wonder if we could tie in with grub-boot-success.service somehow. If we get a successful boot, then delete the existing rescue vmlinuz+initramfs, and then run the rescue kernel script.
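A minimal sketch of what such a hook might run (all paths and names here are assumptions, not an existing implementation; the rescue pair is normally created by dracut's kernel-install hook, and kernel-install itself needs root, so that step is only printed):

```shell
#!/bin/sh
# Hypothetical helper that a boot-success hook could invoke: once a
# boot has been marked good, drop the old rescue pair so the dracut
# rescue install hook recreates it from the currently running kernel.
# boot_dir/machine_id/kver are parameters so the logic can be
# exercised against a scratch directory instead of the real /boot.
refresh_rescue() {
    boot_dir=$1
    machine_id=$2
    kver=$3
    # Remove the stale rescue vmlinuz+initramfs pair.
    rm -f "${boot_dir}/vmlinuz-0-rescue-${machine_id}" \
          "${boot_dir}/initramfs-0-rescue-${machine_id}.img"
    # Re-running the kernel install scripts for the running kernel
    # would rebuild the rescue image; printed rather than executed,
    # since kernel-install requires root.
    echo "kernel-install add ${kver} /lib/modules/${kver}/vmlinuz"
}
```

On a real system this would be wired to fire from something like grub-boot-success.service, only after the boot has been marked good.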

> I wonder if we could tie in with grub-boot-success.service somehow. If we get a successful boot, then delete the existing rescue vmlinuz+initramfs, and then run the rescue kernel script.

That doesn't particularly help if the kernel boots but something is horribly broken. There is no harm in a very old rescue kernel in most cases; it would only really matter if you replaced hardware. There can be real harm in a rescue kernel that doesn't work. Remember, the only real purpose of the rescue kernel is to get your system out of something completely unusable. It isn't meant to be a full runtime.

I could see a case for tying something into GNOME Software that would prompt to replace the rescue kernel if it is very old and out of date, after a certain amount of uptime. But it would then need to create the new rescue kernel as soon as that is approved, before any package updates, and it would also need to make sure this is done from the currently running kernel (even if that is not the newest).

> the only real purpose of the rescue kernel is to get your system out of something completely unusable. It isn't meant to be a full runtime.

True. But there is also a disconnect between how useful the rescue kernel entry is initially and how useful it is once the matching /usr/lib/modules tree has been removed.

For some time after a clean install, the rescue entry boots to a working desktop. Once the installation-time kernel version is uninstalled (due to dnf.conf's installonly_limit=3), the system fails to mount local file systems. The vfat and zram modules aren't built into the kernel, and even though they're in the rescue initramfs, they're not available after switchroot for some reason. The user experience is a dracut emergency prompt when choosing the rescue menu entry.

We could make the vfat and zram modules built-in, but I'm not sure how much chasing we'd have to do to reliably boot more than 80% of systems to a graphical login. I guess we could try it? vfat and zram are used by default in Cloud, Server, and all the desktops anyway.

e.g.

(filtered lsinitrd for rescue initramfs based on 5.15.5)

-rw-r--r--   1 root     root         7764 Oct 28 15:55 usr/lib/modules/5.15.5-200.fc35.x86_64/kernel/fs/fat/vfat.ko.xz
-rw-r--r--   1 root     root        10744 Oct 28 15:55 usr/lib/modules/5.15.5-200.fc35.x86_64/kernel/drivers/block/zram/zram.ko.xz

(filtered journal for failed boot once /usr/lib/modules is gone following removal of the 5.15.5 kernel packages)

[    2.968491] systemd[1]: Switching root.
...
[    4.123393] systemd[1]: systemd-modules-load.service: Main process exited, code=exited, status=1/FAILURE
[    4.123923] systemd[1]: systemd-modules-load.service: Failed with result 'exit-code'.
[    4.131188] systemd-udevd[858]: Using default interface naming scheme 'v249'.
[    4.139971] systemd[1]: Failed to start Load Kernel Modules.
...
[    4.467088] mount[947]: mount: /boot/efi: unknown filesystem type 'vfat'.
[    4.467213] systemd[1]: Mounting /boot/efi...
[    4.467311] systemd[1]: boot-efi.mount: Mount process exited, code=exited, status=32/n/a
[    4.467404] audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-journal-flush comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[    4.467458] systemd[1]: boot-efi.mount: Failed with result 'exit-code'.
[    4.467519] systemd[1]: Failed to mount /boot/efi.
[    4.467564] systemd[1]: Dependency failed for Local File Systems.
...
[    4.468119] systemd[1]: Started Emergency Shell.
[    4.468158] systemd[1]: Reached target Emergency Mode.
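If the built-in route floated above were tried, it would amount to flipping these options from =m to =y in the Fedora kernel config (option names taken from the upstream Kconfig; the FAT and zsmalloc dependencies would have to follow along):

```
CONFIG_FAT_FS=y
CONFIG_VFAT_FS=y
CONFIG_ZSMALLOC=y
CONFIG_ZRAM=y
```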

We discussed this issue at today's WG meeting. The current behaviour isn't ideal. There are some ideas about how to improve it, but it's unclear exactly which one would be best, or precisely how to improve the rescue "experience" overall. @chrismurphy has kindly offered to start a discussion about this on the devel list.

Metadata Update from @aday:
- Issue assigned to chrismurphy
- Issue tagged with: pending-action

2 years ago

It might make more sense to change the dracut behavior in some way so that the rescue kernel is never uninstalled. I am not a fan of going down the path of building modules into the kernel.

Yeah I thought about somehow pinning the first installed kernel. I forgot to mention it in the devel@ thread I just started.

https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/HP5V5THLK4W5S66UZZ4CBCTO5E4SST35/
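As a blunt interim workaround for anyone who wants to keep the installation-time kernel around, dnf already has a knob in this area (it keeps every kernel, at the cost of /boot space, rather than pinning just one):

```
# /etc/dnf/dnf.conf
[main]
# 0 disables the installonly limit entirely, so older kernels
# (including the installation-time one) are never auto-removed.
installonly_limit=0
```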

Since the discussion has moved to devel@ and this is not strictly a Workstation working group concern, I'll close this issue now.

Metadata Update from @chrismurphy:
- Issue untagged with: pending-action

2 years ago

Metadata Update from @chrismurphy:
- Issue close_status updated to: Won't fix
- Issue status updated to: Closed (was: Open)

2 years ago
