#358 systemd-oomd seems overly aggressive
Closed: Fixed 5 months ago by catanzaro. Opened a year ago by adamwill.

@catanzaro asked me to report this, so here I am!

Since systemd-oomd landed in Fedora, I've personally found it to be far too aggressive, and I've turned it off on all my systems. For a few months my main machine was a system with 8G of RAM (6.5G effective, after 1.5G was eaten by the GPU...), and it was especially unbearable there. My typical desktop session has Firefox, Evolution, a terminal, and gedit; very often, adding anything at all intensive to that mix (e.g. trying to run a VM, or compile something) caused systemd-oomd to start killing stuff. Often it killed the VM, which was fun when it was halfway through a test.

After turning it off completely, it's not like the system started deadlocking or anything. In similar situations with systemd-oomd disabled, the system would usually grind for a while (probably hitting swap) and then keep working OK. I much prefer that behaviour.

My primary system is now one with 16G of RAM again, but even there I ran into some systemd-oomd kills that really just didn't seem necessary when I first deployed it, so I turned it off again. I have not once hit an actual OOM situation since.
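For reference, "turning it off" here is just the usual systemctl step; a minimal sketch (masking is optional, but it stops presets or updates from quietly re-enabling the service):

```
# Stop systemd-oomd now and keep it from starting at boot
sudo systemctl disable --now systemd-oomd.service

# Optional: mask it so nothing can re-enable it behind your back
sudo systemctl mask systemd-oomd.service
```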

I'm sorry I don't have any more technical details, but it's never been at the top of my priority list to really establish the parameters and note the log messages and stuff.

I'm not alone, though - there are lots of similar reports in the wild:

* https://ask.fedoraproject.org/t/how-to-configure-systemd-oomd-to-be-less-aggressive/31192
* https://www.reddit.com/r/Fedora/comments/w2hl0k/systemdoomd_is_insanely_aggressive/
* https://www.reddit.com/r/Fedora/comments/mbmiz1/how_do_i_permanently_disable_systemdoomd/
* https://bugzilla.redhat.com/show_bug.cgi?id=1941170 (folks were still reporting over-aggression there recently, even after the initial policy changes from early in the bug)
* https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdOomdNowDisabled
* https://news.ycombinator.com/item?id=33894469 (has some dumb systemd bashing, but also useful stuff: a claim that "It's not systemd-oomd that is the main problem here, it's Fedora's implementation/application of it to all the user@.service.", and experiences like "Same here; been disabling it since it was introduced to Fedora... a release or two ago. My desktops have 64GB+ of memory typically, they're virtualization workhorses. I don't know what it is in oomd, but any time I make actual real use of it, things started getting killed.")
* https://news.ycombinator.com/item?id=33896729 has some useful references to other distros' choices.
* https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1972159 (an Ubuntu issue suggesting "A user's browser, desktop session, or some other desktop application may be killed by systemd-oomd when SwapUsedLimit is reached, but system performance otherwise appears unaffected.")


So I guess the problem is likely our ManagedOOMMemoryPressureLimit=50%? That's probably too aggressive, and the solution would be to raise it to a substantially higher value? But I wonder what value would work well in practice....
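(For anyone following along, the currently effective limit can be inspected on a running system; a quick sketch, where `user@1000.service` is just an example and the unit carrying the limit has moved between Fedora releases:)

```
# Ask systemd what pressure limit is applied to a given unit
systemctl show -p ManagedOOMMemoryPressureLimit user@1000.service

# Dump systemd-oomd's own view of the cgroups it monitors and their limits
oomctl
```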

Another thread full of complaints. It seems users don't understand that the behavior is configurable.

I mean, I knew it was configurable but my thought process was: I never have OOM problems anyway, why would I bother configuring this thing when I can just turn it off and go back to the previous behaviour which was causing me zero problems?

It is somewhat dependent on your hardware and swap situation.

I removed the swap-based kill policy 5 months ago, so some of the older complaints from when systemd-oomd first came out are not necessarily valid today. The swap-based policy was indeed too aggressive since Fedora's default swap configuration is small. At that time I also moved the memory pressure limit to the per-user slices and set it to 50% (it used to be on the user manager as a whole). Both of these changes should be in F37.

I think it is hard to strike the balance between how much pressure for how long is ideal. But if we want something even more conservative as the default, I think we can switch the memory pressure config back to the top-level user.slice (instead of the per-user-slice level). That would require user.slice as a whole to be under pressure, instead of looking at per-slice pressure; per-slice monitoring is more likely to target specific applications.

Alternatively, upping the pressure duration from 20s to 5m would also make it more conservative since you need sustained pressure for 5m before a kill happens. But I think at that point users would probably reboot if it got really bad.
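(For reference, the duration knob lives in oomd.conf; a sketch with a purely illustrative value, not a recommendation:)

```
# /etc/systemd/oomd.conf, or a drop-in under /etc/systemd/oomd.conf.d/
# Illustrative: require sustained pressure for 5 minutes before acting
[OOM]
DefaultMemoryPressureDurationSec=5min
```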

My experience with the 8G system was from about August to November 2022, using F37 Silverblue. From the logs on that system: on Sep 28 it killed Evolution, on Oct 28 it killed my entire desktop, so on Oct 31 I disabled it. Somehow it got turned on again, and on Nov 19 it killed Firefox, so I turned it off again on Dec 14. I feel like there must be more to the story: I can't find the detailed logs I remember seeing before about pressure levels and so on, and I remember more kills happening (the dates support that too; why would I wait several days between a kill happening and disabling the service?), but I can't find anything more in the logs. Possibly other events wound up killing systemd itself and the logs got lost? I dunno.
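For what it's worth, the kill events above came out of the systemd-oomd journal; this is roughly the digging involved (the boot index is just an example):

```
# oomd's own log for the current boot
journalctl -u systemd-oomd

# List previous boots, then check a specific one
journalctl --list-boots
journalctl -b -2 -u systemd-oomd
```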

You don't think raising the ManagedOOMMemoryPressureLimit= to a higher value would be useful?

> It is somewhat dependent on your hardware and swap situation.

We should optimize for the default case: small amount of swap on zram.
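(Quick way to see what the default swap-on-zram situation actually looks like on a given machine; a sketch:)

```
# Show active swap devices (zram shows up as /dev/zram0) and their sizes
swapon --show
zramctl
free -h
```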

> I think it is hard to strike the balance between how much pressure for how long is ideal. But if we want something even more conservative as the default, I think we can switch the memory pressure config back to the top-level user.slice (instead of the per-user-slice level). That would require user.slice as a whole to be under pressure, instead of looking at per-slice pressure; per-slice monitoring is more likely to target specific applications.

I think I understand: this would allow memory pressure to slow down particular applications, but not allow it to slow down the entire user session. Is that correct?

> Alternatively, upping the pressure duration from 20s to 5m would also make it more conservative since you need sustained pressure for 5m before a kill happens. But I think at that point users would probably reboot if it got really bad.

Yeah that's probably too conservative. We should still aim to kill something quickly when the system is truly out of control, before the system locks up.

In terms of desktops, I'm wondering whether we're expecting too much of oomd? There are examples of oomd being both too aggressive and not aggressive enough (see systemd#25596), and I'm not sure we can fix both?

Ostensibly oomd should kill well before the kernel OOM killer, in all cases. And yet oomd should not kill at all when there's enough memory or swap available, in all cases. That sounds difficult. Am I wrong?

I think desktop users would like a system that preserves a minimum amount of responsiveness, while merely notifying them of cgroups/processes that are using excessive resources or behaving suspiciously. I even wonder if these cgroups should be frozen instead of killed, with the user then given the option to unfreeze them (accepting whatever consequences that entails) or to kill the resource hog.

> I think I understand: this would allow memory pressure to slow down particular applications, but not allow it to slow down the entire user session. Is that correct?

Yes pretty much.

> Ostensibly oomd should kill well before the kernel OOM killer, in all cases. And yet oomd should not kill at all when there's enough memory or swap available, in all cases. That sounds difficult. Am I wrong?

I don't think we can guarantee that oomd will execute before the kernel OOM killer in all cases. But if we don't want any kills at all when there is, e.g., > 5% memory and swap free, that is possible.
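(The swap side of that is already expressible in oomd.conf; a sketch with an illustrative threshold, only relevant where a swap-based kill policy (ManagedOOMSwap=kill) is in effect:)

```
# /etc/systemd/oomd.conf.d/swap.conf  (drop-in path illustrative)
# Illustrative: only consider swap-based kills once 95% of swap is used,
# i.e. leave roughly 5% of swap as headroom
[OOM]
SwapUsedLimit=95%
```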

> I think desktop users would like a system that preserves a minimum amount of responsiveness, while merely notifying them of cgroups/processes that are using excessive resources or behaving suspiciously. I even wonder if these cgroups should be frozen instead of killed, with the user then given the option to unfreeze them (accepting whatever consequences that entails) or to kill the resource hog.

I like the idea of trying things like switching to a notification model. We could still include a more conservative configuration for killing while notifying about cgroups that have exceeded thresholds.

Let me take these 2 points back to our Resource Control team and see if they have thoughts on this.

Metadata Update from @catanzaro:
- Issue tagged with: meeting-request

a year ago

There's a related mailing list thread where systemd-oomd kills dnf during a system upgrade even though there is still RAM available.

> Let me take these 2 points back to our Resource Control team and see if they have thoughts on this.

Hi Anita, any update?

I followed up on the latest Fedora blocker and remembered to come back to this one with thoughts from the team.

So w.r.t. freezing cgroups: this could be hard to pull off successfully because you would need a good interface to unfreeze. If the whole desktop is frozen, it could be harder to recover from than just killing or restarting. And it's unclear whether most apps can gracefully handle being frozen and unfrozen. Dan Schatzberg made this great point:

> My big worry is that freezer doesn't actually free memory, it just stops the processes from using it. This allows kernel reclaim to take over - so if you have ample swap or lots of file cache then it would actually result in reducing memory pressure. But Fedora also defaults to quite a small amount of swap, and I bet most OOM scenarios are due to anon memory. So I'm dubious that freezing would actually solve the pressure situation.

On the topic of notifying at lower thresholds: this idea was well received. We can do a D-Bus notification at a lower limit, and increase the kill threshold to a higher pressure limit. It would require notification hooks in systemd(-oomd) to support this. I believe GNOME looks for the D-Bus notification for systemd-oomd kills and we can probably do the same for a new pressure notification too.

Tangentially, something that came up in discussions was how we can figure out what the "critical" system services are and how we can control the policy for omitting OOM kills on them. Omit/avoid exists in systemd-oomd but I don't believe anyone sets it.
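(For concreteness, the knob in question is ManagedOOMPreference=; a sketch of how a critical service could be excluded, with the unit name purely illustrative:)

```
# /etc/systemd/system/important.service.d/oomd.conf  (unit name illustrative)
[Service]
# "avoid" deprioritizes this cgroup as a kill candidate; "omit" excludes it entirely
ManagedOOMPreference=omit
```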

> On the topic of notifying at lower thresholds: this idea was well received. We can do a D-Bus notification at a lower limit, and increase the kill threshold to a higher pressure limit.

So we just need to decide on the limit. I will arbitrarily propose ManagedOOMMemoryPressureLimit=85%, or 90%, or 95%, to make it much less likely to kill for memory pressure than it is currently. But I'm not sure how well that would actually work. The goal would be to allow the desktop to slow down, but not to become completely unresponsive....

Should I schedule this topic for next week's Workstation WG meeting at 10:00 EDT? Or do you think we can sort it out on the issue tracker?

> It would require notification hooks in systemd(-oomd) to support this. I believe GNOME looks for the D-Bus notification for systemd-oomd kills and we can probably do the same for a new pressure notification too.

A D-Bus notification would probably need to be integrated into low-memory-monitor. @hadess, any opinion?

> Tangentially, something that came up in discussions was how we can figure out what the "critical" system services are and how we can control the policy for omitting OOM kills on them. Omit/avoid exists in systemd-oomd but I don't believe anyone sets it.

Looking at systemd-cgls output, I think @benzea's intent was that services directly underneath session.slice are supposed to be protected. But I'm not sure.
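(That can be eyeballed with systemd-cgls; a sketch, with UID 1000 as an example:)

```
# Show what actually runs directly under the user's session.slice
systemd-cgls /user.slice/user-1000.slice/user@1000.service/session.slice
```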

Hi @anitazha, @zbyszek, are you OK with setting ManagedOOMMemoryPressureLimit=90%? This value is completely arbitrary and I have no clue whether it's a good choice or not, but I bet it's closer to what we want than the current value of 50%. Hopefully the system is still able to make progress and not hang when memory pressure is at 90%? I assume that means everything is 10x slower, but not completely frozen?

(I'm aware of the recent changes, but I'm skeptical that will be enough to prevent premature killing of applications?)

Metadata Update from @catanzaro:
- Issue untagged with: meeting-request
- Issue tagged with: meeting

a year ago

I would be on board with trying a higher pressure limit if we intend to keep the time window small (i.e. the way it is). Perhaps something on the order of your initial numbers, 80-85%?

Sure, let's try 80% then.
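For anyone who wants to try this before a package lands, a sketch of an override; the drop-in target is illustrative, since the unit carrying Fedora's stock limit has moved between releases, so check where the shipped default actually lives first:

```
# /etc/systemd/system/user@.service.d/99-oomd-limit.conf  (target unit illustrative)
[Service]
ManagedOOMMemoryPressureLimit=80%
```

followed by a `systemctl daemon-reload`.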

Metadata Update from @catanzaro:
- Issue untagged with: meeting

a year ago

Metadata Update from @catanzaro:
- Issue tagged with: pending-action

a year ago

This change has been merged in rawhide.

Metadata Update from @catanzaro:
- Issue untagged with: pending-action

a year ago

Metadata Update from @aday:
- Issue assigned to catanzaro
- Issue set to the milestone: Fedora 39

a year ago

Change is present in systemd-253.4-1.fc38.

@adamwill please keep an eye out. We don't know how well this change will work in practice. Presumably systemd-oomd should kill things less, but whether it's sufficient to resolve this issue, I'm not sure. User feedback will be important here.

Metadata Update from @catanzaro:
- Issue tagged with: testing

a year ago

I'll try to check in on feedback in a while. I haven't personally been running into the issue lately as I'm not on the 8G system any more (current system is 16G; I haven't had any issues with this on it).

Time to close this issue?

I'm not sure. I noticed another user complaining about systemd-oomd recently. That user was on Fedora 38, and the ManagedOOMMemoryPressureLimit change was implemented there back in May, so this user almost certainly had our latest changes already. That's not a good sign.

I wonder if other users are still complaining.

If we aren't able to specify the conditions under which we would consider this issue to be fixed, I think that we should close it. I'll give it a week!

Well, it's been more than a week. Closing.

Metadata Update from @catanzaro:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 months ago
