Issue #2928: Change: Shorter Shutdown Timer - fesco

Closed: Accepted a year ago by zbyszek. Opened a year ago by bcotton.

A downstream configuration change to reduce the systemd unit timeout from 2 minutes to 15 seconds.

Owners, do not implement this work until the FESCo vote has explicitly ended.
The Fedora Program Manager will create a tracking bug in Bugzilla for this Change, which is your indication to proceed.
See the FESCo ticket policy and the Changes policy for more information.

ngompa commented a year ago

decathorpe commented a year ago

-1

I'd rather see misbehaving services fixed - or apply shorter, service-specific timeouts when that's not possible - rather than possibly causing issues for other components that shouldn't be terminated early.

I also asked about the possibility of showing systemd information for hung services at shutdown in plymouth (so users could see that something is actually happening, instead of just staring at the spinner spinning endlessly), but this question seems to have been ignored. Even if this proposal is accepted, doing that would be a nice improvement for user experience IMO.

PS: Feel free to also disregard this -1 vote when it comes to adding the "meeting" tag. I won't change my mind on this, so discussing this during a meeting is unnecessary from my point of view.

catanzaro commented a year ago

> I also asked about the possibility of showing systemd information for hung services at shutdown in plymouth (so users could see that something is actually happening, instead of just staring at the spinner spinning endlessly), but this question seems to have been ignored. Even if this proposal is accepted, doing that would be a nice improvement for user experience IMO.

Sorry for missing the comment regarding plymouth. Plymouth's graphical shutdown should remain free of technical information and honestly text in general (besides the Fedora logo and manufacturer logo). That's not a good place to report anything: even just displaying text is risky since it requires localization, fonts, and complex text rendering support, which I don't think currently exists in the initramfs? We don't want to display English text to a user whose locale is Chinese or Hindi, for example. And don't want to display service names ever. But informed users who know English can press Escape to peer behind plymouth and see what's going on.

I just updated the change proposal with a change to use SIGTERM -> SIGABRT -> SIGKILL instead of just SIGTERM -> SIGKILL. This means systemd will crash the service using SIGABRT if the timer is hit, so we can get a core dump and can therefore use gdb to show precisely what the misbehaving service was doing at the time it was crashed, plus a noisy report from ABRT on next boot so the problem cannot go unnoticed. I think it makes sense to do this even if the timeout change is not approved (but we should really do both). No doubt this will result in a (hopefully temporary) increase in crash reports, but surfacing bugs seems better than hiding them.
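
To make the escalation concrete, here is a minimal Python sketch of a SIGTERM -> SIGABRT -> SIGKILL sequence applied to an ordinary child process. It is an illustration only, not how systemd implements it: the helper name `stop_with_escalation`, the stand-in child that ignores SIGTERM, and the short timeouts are all assumptions made for the demo.

```python
import signal
import subprocess
import sys

# Stand-in for a service that hangs on shutdown: it ignores SIGTERM.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_with_escalation(proc, stop_timeout, abort_timeout):
    """Stop proc with SIGTERM, then SIGABRT, then SIGKILL.

    Returns the name of the last signal that had to be sent.
    """
    for sig, timeout in ((signal.SIGTERM, stop_timeout),
                         (signal.SIGABRT, abort_timeout)):
        proc.send_signal(sig)
        try:
            proc.wait(timeout=timeout)
            return sig.name
        except subprocess.TimeoutExpired:
            continue  # still running: escalate to the next signal
    proc.kill()  # SIGKILL cannot be caught or ignored
    proc.wait()
    return signal.SIGKILL.name


def demo():
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()  # wait until the child ignores SIGTERM
    result = stop_with_escalation(proc, stop_timeout=0.5, abort_timeout=5.0)
    proc.stdout.close()
    return result, proc.returncode
```

In this demo the child ignores SIGTERM, so it is the SIGABRT step that ends it; on a system with core dumps enabled, that is the step that would leave something to inspect with gdb.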

Edited a year ago by catanzaro

dcantrell commented a year ago

-1

Echoing @decathorpe here. The correct thing to do is address the problem services. @ngompa did this 5 months ago explaining the issue around dnf and the underlying use of a GPG agent. We should be addressing the problem there rather than papering over the problem.

bcotton commented a year ago

After a week, the vote is (+1,0,-2). Tagging for next week's meeting.

Metadata Update from @bcotton:
- Issue tagged with: meeting

a year ago

zbyszek commented a year ago

This will be discussed during the meeting today.

@catanzaro, @aday ^

zbyszek commented a year ago

> I'd rather see misbehaving services fixed

This should be happening already. There is nothing stopping people from making various services shut down quickly. This doesn't mean that we shouldn't reduce the overall timeout too.

chrismurphy commented a year ago

I don't think it's practical for change owners to become responsible for fixing problem services. The change owners are advocating for an overall improved user experience. Any risk of premature shutdown of services can be mitigated by that service inhibiting the shutdown on an opt-in basis, thus actual risk is low. And I expect most upstreams would say, "the program isn't misbehaving, and we don't want to commit resources to figuring out why quitting takes a while, just kill the process if you want it gone sooner..."

Like the dnf case, sometimes these things are rather complicated to fix, and yet at the same time there's low risk in just killing the offending service without ceremony. And dnf being fixed is very much in the handwavy future, 1-2 years in the best case.

Therefore "fix the service" just becomes kicking the can down the road indefinitely, as a form of risk aversion, which I don't find very compelling as a way to actually fix the problem. Instead, we need to mitigate legitimate concerns with (reboot/shutdown) inhibit, but otherwise actually perform the task the user requested, which is a reboot/shutdown, without unnecessary delay.

Edited a year ago by chrismurphy

zbyszek commented a year ago

This was discussed during the meeting today:
AGREED: Change is approved with a timeout of 45 s and the caveat that editions must be able to override the change (+8,0,-1)

Metadata Update from @zbyszek:
- Issue close_status updated to: Accepted
- Issue status updated to: Closed (was: Open)

a year ago

rom1dep commented a year ago

I'm clearly out of my depth here, but couldn't a well-behaving service sometimes require more than 45s (or 2min, …) to shut down? I can think of large business applications doing lots of disk/network writes upon shutdown (e.g. to persist their state, check data consistency when done, optimizing cache for next run, …)

Wouldn't it make sense to also consider the service activity (CPU/RAM/IO/…) as an indication of whether it's too early for it to be SIGKILLed?

catanzaro commented a year ago

> I'm clearly out of my depth here, but couldn't a well-behaving service sometimes require more than 45s (or 2min, …) to shut down? I can think of large business applications doing lots of disk/network writes upon shutdown (e.g. to persist their state, check data consistency when done, optimizing cache for next run, …)

Yes, but such cases are exceptions rather than the norm, so it doesn't make sense to consider them when setting the default value. Here's what you can do:

As a service developer or package maintainer, you can (a) talk to systemd using systemd APIs to extend the timeout (I'm told this is systemd-inhibit, but I'm not sure; is that really true? It doesn't seem quite right). Or you can (b) just change TimeoutStopSec= for the service, which is easy. The default TimeoutStopSec will change to 45 seconds, but each service can still override the default with its own limit. E.g. Postgres and virt-manager both set the limit to 0 (infinity) to disable the timeout entirely.
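
On option (a): one mechanism documented in sd_notify(3) is the `EXTEND_TIMEOUT_USEC=` notification, which a service can send to ask the service manager for more time. The sketch below speaks the notification protocol directly from Python. The function names are hypothetical, and it assumes the unit has notification access enabled (for example `Type=notify`) so that `NOTIFY_SOCKET` is set in the service's environment. Whether this is the API the commenter was told about is not confirmed anywhere in this thread.

```python
import os
import socket


def sd_notify(message):
    """Send one notification datagram to the service manager.

    Returns False when NOTIFY_SOCKET is unset (not running under systemd).
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        path = "\0" + path[1:]  # leading "@" means an abstract socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), path)
    return True


def extend_stop_timeout(seconds):
    """Report that we are stopping and ask for `seconds` more time."""
    usec = int(seconds * 1_000_000)
    return sd_notify(f"STOPPING=1\nEXTEND_TIMEOUT_USEC={usec}")
```

A service with a long flush on shutdown could call `extend_stop_timeout(30)` repeatedly while it is still making progress; each message has to arrive before the previously requested extension runs out.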

As a sysadmin, you can either configure it for any individual service or change the default timeout to something else. The sysadmin or computer owner always has full control.
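
As a rough illustration of those two knobs, the Python sketch below renders the text of two drop-in files: a per-service `TimeoutStopSec=` override and a system-wide `DefaultTimeoutStopSec=` setting. The helper `render_dropin`, the unit name `example.service`, the file paths in the comments, and the chosen values are all assumptions for the example; nothing here writes to disk.

```python
def render_dropin(section, settings):
    """Render a drop-in file: one [Section] header plus Key=Value lines."""
    lines = [f"[{section}]"]
    lines += [f"{key}={value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


# Per-service override; hypothetical target path:
#   /etc/systemd/system/example.service.d/timeout.conf
service_override = render_dropin("Service", {"TimeoutStopSec": "5min"})

# System-wide default; hypothetical target path:
#   /etc/systemd/system.conf.d/timeout.conf
manager_default = render_dropin("Manager", {"DefaultTimeoutStopSec": "2min"})

print(service_override)
print(manager_default)
```

After placing files like these, systemd would need to be told to reload its configuration before the new values take effect.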

> Wouldn't it make sense to also consider the service activity (CPU/RAM/IO/…) as an indication of whether it's too early for it to be SIGKILLed?

I don't think so, because the activity could very easily be a logic error like a bad loop.

rom1dep commented a year ago

> As a sysadmin, you can either configure it for any individual service or change the default timeout to something else. The sysadmin or computer owner always has full control.

Perhaps a middle ground could be found by giving distro-shipped services a distinct (and lower) timeout than user-defined ones (which could keep the current value)?

(sorry if I'm rehashing some arguments that were brought-up during previous discussions)

> I don't think so, because the activity could very easily be a logic error like a bad loop.

It doesn't need to be perfect, it only needs to keep users on the safe side/avoid the worst case scenario (of killing prematurely a legit process and incurring irrecoverable loss)

catanzaro commented a year ago

(Correction to previous post: I meant libvirt, not virt-manager.)

> > As a sysadmin, you can either configure it for any individual service or change the default timeout to something else. The sysadmin or computer owner always has full control.

> Perhaps a middle ground could be found by giving distro-shipped services a distinct (and lower) timeout than user-defined ones (which could keep the current value)?

> (sorry if I'm rehashing some arguments that were brought-up during previous discussions)

Why would you expect user-defined services to want a longer timeout than distro-provided services? But also, systemd doesn't work this way.

> > I don't think so, because the activity could very easily be a logic error like a bad loop.

> It doesn't need to be perfect, it only needs to keep users on the safe side/avoid the worst case scenario (of killing prematurely a legit process and incurring irrecoverable loss)

45 seconds is still extremely lax, in my opinion. I think we can expect admins to know how to use TimeoutStopSec when required. This should really be pretty rare: databases and hypervisors are the main users.