Issue #163: 2m shutdown timer is too long - fedora-workstation

Closed: Fixed a year ago by catanzaro. Opened 3 years ago by catanzaro.

When a user service hangs at shutdown, it can delay shutdown for up to 2 minutes. This is frustrating. (It's also difficult to debug, since the system is shutting down.)

I'd like to reduce the timeout to 20 seconds instead. If any service takes longer than 20 seconds to stop on Workstation, it deserves to be killed. Maybe the longer timeout is suitable for servers (but I'm skeptical).

chrismurphy commented 3 years ago

I think so too, but before that can happen we need to make certain that all file systems can be properly unmounted. Anything that's running from such real file systems (i.e. excluding initramfs or volatile device) that also exempts itself from being killed, can prevent clean unmounts from happening. For example, plymouth:

Jul 20 10:24:05 fmac.local systemd[1]: /usr/lib/systemd/system/plymouth-start.service:15: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.

Edited 3 years ago by chrismurphy

aday commented 3 years ago

> I think so too, but before that can happen we need to make certain that all file systems can be properly unmounted. Anything that's running from such real file systems (i.e. excluding initramfs or volatile device) that also exempts itself from being killed, can prevent clean unmounts from happening. For example, plymouth:
>
> Jul 20 10:24:05 fmac.local systemd[1]: /usr/lib/systemd/system/plymouth-start.service:15: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.

What are the action steps here? Links to required tracking issues would be welcome.

chrismurphy commented 3 years ago

Probably ask about the consequences/feasibility of doing this on the systemd-devel list. In particular how to isolate it to just the desktop/laptop use case. And is it really going to be a hard cut off, no matter what's running? What are the exceptions if any? And what happens if tools just wrongly exempt themselves?

And then a devel@ thread. And eventually a Fedora 34 feature proposal.

What's the time frame for plymouth no longer using KillMode=none? And what's the time frame for the deprecation of KillMode=none?

mclasen commented 3 years ago

We should not allow random services to block shutdown like this. It is just wrong.

chrismurphy commented 3 years ago

Yeah I've supported deprecating KillMode=none for years. I don't know if other processes use it.

Nevertheless, I can't guess what services might legitimately need a slow shutdown of more than 20 seconds. That might be vanishingly rare on the desktop. If someone's running a particularly large or complex database, though? Is 20 seconds enough? I don't know. Yes, it should be crash safe, but basically we'd be depending on crash safety.

Quite a lot of services do have a wait timer, 90s to 2m. If that's not going to be honored, then what does this setting even mean anymore? Does it become a server-only indication?

johannbg commented 3 years ago

The only application that I have seen blocking shutdown is PackageKit. The fact is that end users will press the power button after whatever time feels longer than usual to them. One fun story about that: I once had to literally rescue a Fedora user who was stuck in a reboot loop.

Basically, what happened was that shutdown took too long, and he reacted by hard-pressing the power button. That triggered a filesystem check at the next power-on. Fedora did nothing to indicate to him that it was performing a filesystem check (the spinner just kept spinning), so after what he felt was a longer-than-usual bootup he of course took the same course of action and pressed the power button again. The filesystem check ran again, since it had never completed, and the cycle repeated. He ended up "fighting" Fedora for a good hour (which felt like a lifetime for him, since he had to use his computer) until he finally gave up and decided to call me before he threw the computer off his balcony. I of course pressed ESC (which he had no idea he had to do), saw what was going on, allowed the filesystem check to complete, and everything was fine.

That said, this 90s timeout is of course ridiculous from an end-user usability perspective, and regardless of what decision Fedora comes to, end users will press the power button after whatever they feel is taking longer than usual.

Now, upstream won't budge on lowering this, and Fedora is about to be preloaded on hardware and will thus reach users who are neither administrators themselves nor have one on speed dial. So this issue (along with displaying the filesystem check graphically at bootup, if that has not been addressed already) will need to be addressed in one of two ways.

Either by figuring out and fixing what's causing the delay in the first place. Start by checking culprits that might be downloading remotely or checking for updates: packagekit, fwupd, dnf-makecache (this should just be disabled), dbxtool (this should also be disabled if users don't have Secure Boot enabled), and anything that has to do with remote filesystem mounting.

Or the Workstation group can decide whatever it considers a reasonable time by dropping a configuration snippet into either /usr/lib/systemd/system.conf.d/ or /etc/systemd/system.conf.d/ with an override, as in /etc/systemd/system.conf.d/system.conf
replace 90s with whatever is decided.
DefaultTimeoutStopSec=90s
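
A minimal sketch of generating such a drop-in, for illustration only. It assumes the setting lives under the [Manager] section, as it does in system.conf itself. The helper names, the file name 10-timeout.conf, and the 20s value (the figure proposed at the top of this issue) are placeholders, not anything the Workstation group has decided.

```python
from pathlib import Path
import tempfile


def render_dropin(timeout_sec: int) -> str:
    """Return the text of a system.conf drop-in setting the default stop timeout."""
    if timeout_sec <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    # Assumption: DefaultTimeoutStopSec= is read from the [Manager] section.
    return f"[Manager]\nDefaultTimeoutStopSec={timeout_sec}s\n"


def write_dropin(conf_dir: Path, timeout_sec: int,
                 name: str = "10-timeout.conf") -> Path:
    """Write the drop-in into conf_dir and return the path of the new file.

    conf_dir would be /etc/systemd/system.conf.d or
    /usr/lib/systemd/system.conf.d on a real system; the file name is
    a hypothetical choice.
    """
    conf_dir.mkdir(parents=True, exist_ok=True)
    path = conf_dir / name
    path.write_text(render_dropin(timeout_sec))
    return path


if __name__ == "__main__":
    # Write into a scratch directory so the sketch can run unprivileged.
    with tempfile.TemporaryDirectory() as tmp:
        dropin = write_dropin(Path(tmp) / "system.conf.d", 20)
        print(dropin.read_text(), end="")
```

As far as I know this only covers units run by the system manager; user units are governed by the per-user manager, which reads the corresponding user.conf.d directories, so a second drop-in there would likely be needed as well.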

Edited 3 years ago by johannbg

catanzaro commented 3 years ago

> The only application that I have seen blocking shutdown is PackageKit. The fact is that end users will press the power button after whatever time feels longer than usual to them. One fun story about that: I once had to literally rescue a Fedora user who was stuck in a reboot loop.

In my personal experience, it's usually a user session service blocking shutdown. (PackageKit is a system service.) It's impossible to know which one, because systemd only prints the exact service name when it's a system service. For user services, it only prints the user ID that's running the service that's blocking shutdown.

But I have seen PackageKit blocking shutdown far too often as well.

> Basically, what happened was that shutdown took too long, and he reacted by hard-pressing the power button. That triggered a filesystem check at the next power-on. Fedora did nothing to indicate to him that it was performing a filesystem check (the spinner just kept spinning), so after what he felt was a longer-than-usual bootup he of course took the same course of action and pressed the power button again. The filesystem check ran again, since it had never completed, and the cycle repeated. He ended up "fighting" Fedora for a good hour (which felt like a lifetime for him, since he had to use his computer) until he finally gave up and decided to call me before he threw the computer off his balcony. I of course pressed ESC (which he had no idea he had to do), saw what was going on, allowed the filesystem check to complete, and everything was fine.
>
> That said, this 90s timeout is of course ridiculous from an end-user usability perspective, and regardless of what decision Fedora comes to, end users will press the power button after whatever they feel is taking longer than usual.

I'm not surprised. This is a good example of why the timeout needs to be way lower.

> figuring out and fixing what's causing the delay in the first place (start by checking culprits that might be downloading remotely or checking for updates: packagekit, fwupd, dnf-makecache (this should just be disabled), dbxtool (this should also be disabled if users don't have Secure Boot enabled), and anything that has to do with remote filesystem mounting)

I think systemd should print the identity of which user service is blocking shutdown, otherwise we cannot plausibly solve this. As for system services, I only remember noticing PackageKit as problematic, but fixing PackageKit alone is not going to be enough.

> or the Workstation group can decide whatever it considers a reasonable time by dropping a configuration snippet into either /usr/lib/systemd/system.conf.d/ or /etc/systemd/system.conf.d/ with an override, as in /etc/systemd/system.conf.d/system.conf
> replace 90s with whatever is decided.
> DefaultTimeoutStopSec=90s

I suggest 15s-20s.

chrismurphy commented 3 years ago

> I think systemd should print the identity of which user service is blocking shutdown, otherwise we cannot plausibly solve this.

Yeah. I expect that service unit shutdown must happen before sysroot can be remounted read-only (functionally achieves the same result as a clean umount). Until it is remounted read-only, the journal can log ... something.

chrismurphy commented 3 years ago

@zbyszek @rstrode

Is 20s a reasonable target for forcing a reboot? What needs to happen to make it possible?

The old (and fixed) bug I'm thinking of had this pernicious side effect of disallowing read-only remount of sysroot, then systemd reboots anyway, so the file system journal was dirty necessitating log replay at next boot. While I'm not worried about that specific problem anymore, I'm wondering if 20s is really enough time to kill everything off so that remount read-only succeeds before reboot. At the moment plymouth is exempt, so it can still hold up the reboot, but plymouth in turn might be waiting on other things, correct?

zbyszek commented 3 years ago

> Is 20s a reasonable target for forcing a reboot? What needs to happen to make it possible?

I think it is a reasonable goal. I would expect shutdown to happen within a few seconds on any fairly recent hardware. Anything more than that is bad.

That said, just reducing the timeout is not going to lead to a good experience. There is a whole hierarchy of timeouts. Off the top of my head: the hardware watchdog, pid1 job timeouts, pid1 watchdog timeouts (currently disabled in Fedora by default), user manager job timeouts, internal application timeouts. For things to function reasonably, each one must be lower than the previous one, with a lot of fudge, or it stops being useful.
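
To make the ordering constraint concrete, here is a rough sketch that checks whether each inner timeout leaves enough headroom below the one outside it. The layer names follow the list above, but the numbers and the 1.5x margin (standing in for "a lot of fudge") are made up for illustration and are not actual Fedora or systemd defaults.

```python
def check_timeout_hierarchy(layers, margin=1.5):
    """Return a list of problems in a chain of nested timeouts.

    layers: (name, seconds) pairs ordered from outermost to innermost.
    margin: hypothetical safety factor; an inner timeout multiplied by
            margin must still fit inside the timeout that encloses it.
    """
    problems = []
    for (outer_name, outer), (inner_name, inner) in zip(layers, layers[1:]):
        if inner * margin > outer:
            problems.append(
                f"{inner_name} ({inner}s) is too close to {outer_name} ({outer}s)"
            )
    return problems


if __name__ == "__main__":
    # Illustrative values only.
    layers = [
        ("hardware watchdog", 600),
        ("pid1 job timeout", 120),
        ("user manager job timeout", 120),
        ("application timeout", 30),
    ]
    for problem in check_timeout_hierarchy(layers):
        print(problem)
```

With these example numbers the only complaint is about the user manager job timeout, which is equal to the pid1 job timeout enclosing it, so pid1 could give up on the user manager at the same moment the user manager gives up on its own units.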

So I think we should first figure out what services are slow, and why, and fix them. And if we are in a state where the timeouts are very rarely needed, then we could consider lowering them, since then they'd only be used when something is really broken.

In particular, if PackageKit has this issue, we should figure out what it is doing when this happens: if it is installing packages, interrupting it is going to be bad and we should let it finish. If it is just downloading stuff or updating metadata, it should allow itself to be interrupted. Essentially, this is an application issue and making systemd forcibly kill the application would only be papering over the real problem.

> It's impossible to know which one, because systemd only prints the exact service name when it's a system service. For user services, it only prints the user ID that's running the service that's blocking shutdown.

We could improve this. The user managers already communicate state to pid1 through sd-notify, and we could extend this to provide additional information about internal unit state. This could apply to other services too:
- Job user@1000.service/stop running (waiting for job packagekit.service/stop)
- Job systemd-journald.service/stop running (flushing data to disk)
- ...

chrismurphy commented 3 years ago

Is there a reason why sometimes there's no report of what's holding things up? e.g.

A stop job is running for User Manager for UID 1000 ( 30s / 2min)

I don't know what that implies, or how to troubleshoot it.

The PackageKit specific bug goes back years, but this bug is current and open:
stop job is running for PackageKit Daemon, holds up shutdown for 1m30s

johannbg commented 3 years ago

It implies that a stop job (for a service that runs inside the systemd instance of user 1000) is still running for the user with UID 1000.

To debug this you need to enable the debug shell and reproduce the error, and/or press Alt-F9 and log in as the user with UID 1000 who is triggering the bug, then run systemctl --user list-jobs to find the faulty process (very novice end-user friendly, right o_O).

Don't be surprised by what you find: people are masking the tracker service, removing imsettings, and disabling the lvm service (which should not be installed if you're not using LVM), since all kinds of fun stuff seems to be blocking shutdown (it's not just packagekit that might be triggering this). To make matters worse, on the opposite side of the coin, packagekit is segfaulting on bootup on F33, which triggers systemd-coredump, which in turn delays the boot-to-login process. That's a bug which I think hughsie has already fixed upstream but which does not seem to have propagated downstream...

chrismurphy commented 3 years ago

I've got a consistent reproducer, so far, in Rawhide
https://bugzilla.redhat.com/show_bug.cgi?id=1909556

catanzaro commented 3 years ago

I'd like to change the timeout to 5 seconds. Any service that needs longer than 5 seconds to quit is broken.

Edited 3 years ago by catanzaro

Metadata Update from @catanzaro:
- Issue tagged with: meeting-request

3 years ago

chrismurphy commented 3 years ago

If we have sufficient troubleshooting information in the console shown by ESC, about what's holding up the reboot/shutdown, we can probably whittle down services and get them to respond to SIGTERM in 5s or less.

SIGKILL, sysrq+b, reboot -f, and equivalents will result in data loss. So what's the mechanism for services to opt out? For example a VM? If we just force a reboot of a host before the guest is properly shut down, it could be bad. Bad as in inconsistent database, inconsistent file system, either of which might be non-repairable even aside from whatever in-flight data was lost.

Guaranteed 5s shutdown without data loss? That's hibernation.

germano commented 3 years ago

For your info, even if not strictly related to this ticket. Some days ago I opened this systemd request for enhancement
RFE: to accomodate for UPS add global clamp on unit stop timeouts, plus a time-scheduled "immediate" shutdown

Edited 3 years ago by germano

chrismurphy commented 3 years ago

This is from Workstation Rawhide, and appears to be a list of items that did not respond to SIGTERM for 2 minutes, then became subject to SIGKILL.

[  302.813904] systemd[1]: user@1000.service: State 'stop-sigterm' timed out. Killing.
[  302.824203] systemd[1]: user@1000.service: Killing process 1541 (systemd) with signal SIGKILL.
[  302.825409] systemd[1]: user@1000.service: Killing process 2719 (dbus-broker-lau) with signal SIGKILL.
[  302.826859] systemd[1]: user@1000.service: Killing process 2720 (dbus-broker) with signal SIGKILL.
[  302.828200] systemd[1]: user@1000.service: Killing process 2010 (pipewire) with signal SIGKILL.
[  302.828949] systemd[1]: user@1000.service: Killing process 2208 (pipewire-media-) with signal SIGKILL.
[  302.833455] systemd[1]: user@1000.service: Killing process 1772 (pipewire-pulse) with signal SIGKILL.
[  302.835513] systemd[1]: user@1000.service: Killing process 2025 (BluejeansHelper) with signal SIGKILL.
[  302.840053] systemd[1]: user@1000.service: Killing process 2060 (BluejeansHelper) with signal SIGKILL.
[  302.845167] systemd[1]: user@1000.service: Main process exited, code=killed, status=9/KILL
[  302.847460] systemd[1]: user@1000.service: Killing process 2010 (pipewire) with signal SIGKILL.
[  302.847953] systemd[1]: user@1000.service: Killing process 2208 (pipewire-media-) with signal SIGKILL.
[  302.851035] systemd[1]: user@1000.service: Killing process 1772 (pipewire-pulse) with signal SIGKILL.
[  302.855799] systemd[1]: user@1000.service: Failed with result 'timeout'.
[  302.931680] systemd[1]: Stopped User Manager for UID 1000.

It takes another 4s to shut down the remaining services, which I guess were held up because the above weren't responding to SIGTERM?

We know the bluejeans helper can just be clobbered at 5 seconds after a restart is requested, but what if that were some kind of database or a VM? How do these services indicate they have or have not properly shut down, so that we're not causing data loss or corruption by forcing a shutdown at 5s?

Also, is there a way to get some kind of dependency chain? I can't tell from the journal whether pipewire is ignoring SIGTERM, which presumably would be a bug, or if it's waiting on something else to let go, like the bluejeans helper? In other words, I don't know how to file a useful bug report.
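
For anyone collecting the same evidence from a previous boot's journal, a small sketch that pulls the SIGKILLed processes out of lines like the ones above and drops the repeats. The regular expression is derived only from this excerpt, so it is an assumption that other systemd versions word the message identically, and the function name is hypothetical. It shows which processes were left at kill time, not which of them was waiting on which.

```python
import re

# Pattern based on the journal excerpt in this comment (assumed stable wording).
KILL_RE = re.compile(
    r"systemd\[1\]: (?P<unit>\S+): Killing process (?P<pid>\d+) "
    r"\((?P<comm>[^)]*)\) with signal SIGKILL\."
)


def killed_processes(lines):
    """Map each unit to the (pid, name) pairs systemd reported killing.

    Order of first appearance is kept; repeated reports are dropped.
    """
    result = {}
    for line in lines:
        match = KILL_RE.search(line)
        if match is None:
            continue
        entry = (int(match["pid"]), match["comm"])
        procs = result.setdefault(match["unit"], [])
        if entry not in procs:
            procs.append(entry)
    return result


if __name__ == "__main__":
    sample = [
        "[  302.813904] systemd[1]: user@1000.service: State 'stop-sigterm' timed out. Killing.",
        "[  302.824203] systemd[1]: user@1000.service: Killing process 1541 (systemd) with signal SIGKILL.",
        "[  302.828200] systemd[1]: user@1000.service: Killing process 2010 (pipewire) with signal SIGKILL.",
        "[  302.847460] systemd[1]: user@1000.service: Killing process 2010 (pipewire) with signal SIGKILL.",
    ]
    for unit, procs in killed_processes(sample).items():
        for pid, comm in procs:
            print(unit, pid, comm)
```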

Edited 3 years ago by chrismurphy

Metadata Update from @catanzaro:
- Issue untagged with: meeting-request
- Issue tagged with: meeting

3 years ago

catanzaro commented 3 years ago

Action: Zbigniew will reduce the timeouts somewhat in systemd, but reducing too much will hide the underlying issues that need to be fixed:

- PackageKit is clearly broken and must be fixed. Action: ??? We could ask Richard to investigate, but I don't think he has much time for PackageKit nowadays.
- Some user unit is also broken, but we don't know what. systemd should tell us by propagating this information from the user manager to the session manager. Action: ??? find a volunteer to fix this?

Metadata Update from @catanzaro:
- Issue untagged with: meeting

3 years ago

zbyszek commented 3 years ago

> systemd should tell us by propagating this information from the user manager to the session manager.

The general idea is to propagate this information from the user manager to the system manager using sd-notify calls. The system manager would then extend the information it is currently printing (the "A stop job is running for User Manager for UID 1000 ..." part). This shouldn't be too complicated, but the details need to be figured out, in particular how to fit all this information into a line of text that fits on 80-character terminals.

kparal commented 3 years ago

> PackageKit is clearly broken and must be fixed. Action: ??? We could ask Richard to investigate, but I don't think he has much time for PackageKit nowadays.

I see PackageKit hanging very often when I log in and quickly (< 30s or so) reboot/shutdown. It might happen only when PK decides to refresh the repos or something, I don't know, but I do see it very often. Ping me if Richard has time to look into this and needs some debugging/logs, I can try to help.

chrismurphy commented 3 years ago

PackageKit is also part of our memory leak woes, it regularly and persistently is using 800+MB of real memory, almost none of that is swappable. It's going to be replaced by the DNF team, but I don't know if it's happening at the same time as DNF 5 in Fedora 35, or soon thereafter.

Therefore I wonder if this service could just be restarted periodically, and whether as a side effect it'd be more responsive to SIGTERM? I see no negative effects from just running systemctl restart packagekit.service: it never refuses or delays, and is pretty fast, < 1s. Launching gnome-software following this restart of pk, it too is immediately usable, without the frequent refresh delays.

Maybe put it on a timer and restart it every hour? :P

catanzaro commented 3 years ago

> Maybe put it on a timer and restart it every hour? :P

I think prepared upgrades will disappear for a few minutes after a PackageKit restart, which isn't great.

zbyszek commented 3 years ago

https://github.com/systemd/systemd/pull/18386

kparal commented 3 years ago

> Maybe put it on a timer and restart it every hour? :P

I don't think that's a good idea; it would lead to hard-to-reproduce race conditions. Bad things can happen when the user is interacting with gnome-software at the very same time, or even when a transaction is currently running.

aday commented 3 years ago

Adding pending-action tag for https://github.com/systemd/systemd/pull/18386 .

Metadata Update from @aday:
- Issue tagged with: pending-action

3 years ago

chrismurphy commented 3 years ago

Is the mechanism for isolating the reduced shutdown time to just desktops implemented via the user session manager? Does the change need any communication to other editions to make sure they aren't surprised by the change in behavior, in particular Server?

zbyszek commented 3 years ago

> Does the change need any communication to other editions to make sure they aren't surprised by the change in behavior, in particular Server?

I want to change the timeout, if at all, universally for the user session. But this shouldn't require any special handling from editions. The new settings should "just work".

https://github.com/systemd/systemd/pull/18386

Unfortunately the short timeout causes our functional tests to completely fail in Ubuntu's autopkgtests (equivalent to our autoqa). I'll need to figure out what is going wrong there.

chrismurphy commented 3 years ago

@feborges I wonder about GNOME Boxes, what's going to happen if a user mode VM is left running and the user reboots? Can Boxes or libvirtd prevent reboot until the guest is properly shut down?

I see it under

     CGroup: /user.slice/user-1000.slice/user@1000.service
             ...
             │ ├─dbus-:1.2-org.gnome.Boxes@0.service
             │ │ ├─2622 /usr/bin/gnome-boxes --gapplication-service
             │ │ └─2639 /usr/sbin/libvirtd --timeout=120

chrismurphy commented 3 years ago

> Unfortunately the short timeout causes our functional tests to completely fail in Ubuntu's autopkgtests (equivalent to our autoqa). I'll need to figure out what is going wrong there.

Would it be helpful to change the kill timer to 20 seconds? Eventually the goal is shorter, in which case there isn't enough time for most users to press ESC to get to the console and take a cell phone photo for debugging. I think we'll need to depend on journalctl -b-1 output to show a list of user session processes that remain at kill time, in particular as the kill time is shortened. This would also obviate the need to "fit all this information into a line of text that fits on 80 character terminals": just don't depend on the hidden console at all; ask the user to please wait instead, and after the reboot to post the full prior journal, which should have sufficient information for diagnosis.

Edited 3 years ago by chrismurphy

tablepc commented 3 years ago

Perhaps a solution to prevent data loss would be to do two things. First, shorten the timeout. Second, and most importantly, modify the GNOME software that receives the click for shutdown or restart so that the code checks for incomplete storage and network transactions. If a problem is found, inform the user of the delay and the reason. After the situation is resolved, GNOME can send the usual signal for shutdown or reboot.

chrismurphy commented 3 years ago

There's a ~1 year old proposal on test@ list to have a release criterion for this, and it's been resurrected today (same thread).

criterion proposal: prevent services timing out on system shutdown

catanzaro commented 3 years ago

Action: Zbigniew's merge request is causing CI failures that need to be fixed.

feborges commented 3 years ago

> @feborges I wonder about GNOME Boxes, what's going to happen if a user mode VM is left running and the user reboots? Can Boxes or libvirtd prevent reboot until the guest is properly shut down?
>
> I see it under
>
>      CGroup: /user.slice/user-1000.slice/user@1000.service
>              ...
>              │ ├─dbus-:1.2-org.gnome.Boxes@0.service
>              │ │ ├─2622 /usr/bin/gnome-boxes --gapplication-service
>              │ │ └─2639 /usr/sbin/libvirtd --timeout=120

AFAIK libvirt-guests is the proper way of doing this. It is provided by the libvirt-client package, which currently doesn't seem installed in the default workstation. I could look into making it a Boxes dependency (already requested for https://bugzilla.redhat.com/1868818)

See also $ less /etc/sysconfig/libvirt-guests

Metadata Update from @catanzaro:
- Issue assigned to zbyszek

3 years ago

aday commented 2 years ago

Any updates on this, @zbyszek ? It'd be good to have it fixed.

Metadata Update from @aday:
- Issue set to the milestone: Fedora 36

2 years ago

catanzaro commented 2 years ago

@zbyszek's pull request is approved upstream, but it seems the CI is still red.

aday commented 2 years ago

We discussed this ticket during yesterday's working group meeting. Two agreed actions there:

- @mclasen has agreed to reach out to relevant parties regarding the systemd PR
- @chrismurphy and others have agreed to try to reproduce the PackageKit shutdown delays that have been anecdotally reported

mclasen commented 2 years ago

I'll reach out to systemd one more time

Metadata Update from @aday:
- Issue untagged with: pending-action
- Issue set to the milestone: Fedora 37 (was: Fedora 36)

2 years ago

Metadata Update from @ngompa:
- Issue tagged with: experience

2 years ago

ngompa commented 2 years ago

For reference, a similar ticket is present on the Fedora KDE SIG tracker: fedora-kde/SIG#184

catanzaro commented 2 years ago

Agreed at today's meeting: for Fedora Workstation, the shutdown timer for both system AND user units must be no longer than 15 seconds. Anything that takes longer than that needs to be killed regardless of consequences.

CC @zbyszek, can you implement this upstream, or do you want us to provide a downstream patch to the systemd package?

Metadata Update from @catanzaro:
- Issue tagged with: pending-action

2 years ago

aday commented 2 years ago

> Agreed at today's meeting: for Fedora Workstation, the shutdown timer for both system AND user units must be no longer than 15 seconds. Anything that takes longer than that needs to be killed regardless of consequences.

Note the "for Fedora Workstation" here: we are not proposing that this change be made for server or IoT.

catanzaro commented 2 years ago

Hi @zbyszek, let's try to implement this in time for F37 beta. Do you want to look into this yourself, or do you prefer that we provide a patch?

Note: if we provide a patch, it would probably be just changing the hardcoded numbers to 15 seconds, which would affect other Fedora editions. If you want the timer to be longer than 15 seconds on other Fedora editions, we'd need to figure out some way to do that.

aday commented 2 years ago

We discussed this issue briefly at today's WG meeting. We would really like this issue to be resolved for F37, but haven't had a response from upstream systemd.

@catanzaro is therefore going to propose a downstream systemd PR. Hopefully this will be accepted, but if it doesn't get attention we can refer it to FESCo.

Metadata Update from @aday:
- Issue assigned to catanzaro (was: zbyszek)

2 years ago

aday commented 2 years ago

@catanzaro created the downstream PR here: https://src.fedoraproject.org/rpms/systemd/pull-request/85

This was done a week ago and there hasn't been a response yet. We discussed this at yesterday's WG meeting and agreed that the next step is to refer it to FESCo.

aday commented 2 years ago

FESCo ticket: https://pagure.io/fesco/issue/2853

aday commented 2 years ago

FESCo would like to see a change proposal for this change, and it's too late to do that for F37. Unless the issue is fixed upstream, it looks like we'll need to push the change back to F38 and do a change proposal.

Metadata Update from @aday:
- Issue untagged with: pending-action

2 years ago

catanzaro commented a year ago

I dropped the ball here so Allan wound up drafting the change proposal. I've edited it a bit and we submitted it.

catanzaro commented a year ago

https://fedoraproject.org/wiki/Changes/Shorter_Shutdown_Timer

Metadata Update from @catanzaro:
- Issue tagged with: pending-action

a year ago

catanzaro commented a year ago

The FESCo ticket https://pagure.io/fesco/issue/2928 has been approved with a change to 45s instead of 15s. We're allowed to request a lower timeout again in the future. Allan wants to get down to at least 30s, and I agree that would be nice.

One other change in this proposal is that processes that don't quit in time will be crashed instead of killed, which was positively received.

catanzaro commented a year ago

Fixed in Fedora 38.

Metadata Update from @catanzaro:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a year ago

Metadata

Assignee: catanzaro
Tags: experience, pending-action
Blocking: None
Depending on: None
Milestone: Fedora 37