Systemd upstream is getting close to releasing systemd-240. For the longest time, systemd was not updated in released Fedora versions, because there were always incompatible changes. This caused the need to backport patches for various bugs and made it hard to provide timely fixes for more complicated issues.
Upstream has been putting effort into keeping things backwards compatible as much as possible and into providing contingency mechanisms for the cases where this is not possible. systemd-240 is the first version which should be essentially backwards compatible (some details below). The version of systemd which is currently in F29 already contains a number of patches (303 as of systemd-239-7.fc29), so some of the stuff in systemd-240 is already in F29. I think that updating in F29 would allow our users and developers to take advantage of the new features in systemd-240, without causing stability problems or incompatible changes. It will also allow more fixes to be pulled in from upstream. Thanks to fuzzing and other testing upstream, we now have a lot of hardening patches, but they cannot be reasonably backported to systemd-239 (and more are expected).
Despite the 300+ patches, there are many other fixes that haven't been backported because the changes are more intrusive or build upon other functionality which would then have to be backported too. Some examples of stuff that has not been backported: linking of nss modules is now much better (as requested in https://bugzilla.redhat.com/show_bug.cgi?id=1284325), the nscd cache is flushed when appropriate (with nss-mymachines), much less entropy is used during boot (which should help with entropy-starved embedded devices), work-arounds for kernel API breakage (device node creation in user namespaces since 4.18, "bind"/"unbind" events in 4.15+, but driver dependent), systemd-resolved DNS request routing fixes, etc.
Some potentially incompatible changes in systemd-240 and work-arounds:
the udev naming scheme has been updated again, for infiniband devices and onboard PCI with index=0. But we introduced a net.naming-scheme= switch to request the behaviour from a specific systemd version. The F29 package would be compiled with net.naming-scheme=239 to keep the names stable.
global file descriptor limits have been neutered. There's a compatibility configuration switch, which could be used in the Fedora package (bump-proc-sys-fs-file-max=no and bump-proc-sys-fs-nr-open=no). I'm not sure yet if this would be desired.
NoNewPrivileges=yes has been set for all long-running services implemented by systemd. This requires a selinux policy update, which has been merged in Fedora, but not yet released. This change is fairly risky, even with the updated policy, so I'd just revert this particular commit, which in itself is very simple.
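For anyone who wants to compare a machine before and after such an update, here is a minimal Python sketch that reports the settings touched by the first two items. It assumes that net.naming-scheme= is given on the kernel command line (read from /proc/cmdline) and that the bump-proc-sys-fs-* switches affect the values visible in /proc/sys/fs/file-max and /proc/sys/fs/nr_open; the helper names are made up for illustration and are not part of systemd.

```python
from pathlib import Path
from typing import Optional


def naming_scheme_from_cmdline(cmdline: str) -> Optional[str]:
    """Return the value of net.naming-scheme= from a kernel command line, if present."""
    value = None
    for token in cmdline.split():
        if token.startswith("net.naming-scheme="):
            # Assumption: if the option is repeated, this sketch keeps the last one.
            value = token.split("=", 1)[1]
    return value


def read_sysctl_file(path: str) -> Optional[int]:
    """Read an integer from a /proc/sys file; return None if it cannot be read."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""
    scheme = naming_scheme_from_cmdline(cmdline)
    print("net.naming-scheme:", scheme or "(not set on the command line)")
    # Assumption: these are the files affected by the bump-proc-sys-fs-* switches.
    print("fs.file-max:", read_sysctl_file("/proc/sys/fs/file-max"))
    print("fs.nr_open:", read_sysctl_file("/proc/sys/fs/nr_open"))
```

Running it on the same machine before and after the update should show whether the limits changed and whether a naming scheme was requested explicitly at boot.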
Of course, as with any systemd release, there are various new features. See https://github.com/systemd/systemd/blob/master/NEWS for an overview. They should all be backwards compatible and mostly invisible if unused.
My plan would be to first create the package in rawhide as usual after the upstream release. If no major issues are found (and this ticket is accepted), create a package for F29. As necessary, update both the F30 and F29-updates-testing packages based on feedback from testers. Once the F29 update has been in updates-testing for two to three weeks without reports of regressions, submit it for stable.
This ticket is just for this specific version. If it works out as I expect, I'll probably ask for a more general exception like for firefox and kernel.
This is still pretty scary. I wonder if we could perhaps get some extra testing from QA here?
I feel like this is a place where OSTree could be helpful... would it be possible to get this into Atomic/Silverblue first (where it's more easily reverted) and then into Fedora 29 proper later?
@dustymabe @sinnykumari As a more general question, I wonder if it might be possible (with Koji tagging) to get certain packages from Rawhide tagged for an Atomic stable release. That way we could try things like this in the real world.
In our current setup we are just building updated ostrees from the updates bodhi repos + the release day repo. In the past we built updated Atomic Host artifacts in a separate compose and we could create new repos from other koji tags, but we moved away from that.
In general I think I'd just recommend an extended stay in updates-testing for something like this? I guess if you wanted it to just affect atomic host or silverblue updates testing you could put an 'excludepkg' directive into the yum.repo files for updates-testing, and that way dnf based systems following updates testing wouldn't pick up the new packages until we switched it back.
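A rough Python sketch of what adding such an exclusion to a repo definition could look like is shown here. It assumes the dnf option meant is the one spelled `excludepkgs`, and uses a made-up sample repo section and the pattern `systemd*` purely for illustration; the helper name is hypothetical, and `configparser` drops comments when it rewrites a file.

```python
import configparser
import io

# Hypothetical, shortened repo definition used only for this example.
SAMPLE_REPO = """[updates-testing]
name=Fedora $releasever - $basearch - Test Updates
enabled=1
"""


def add_exclude(repo_text: str, repo_id: str, pattern: str) -> str:
    """Return repo_text with `pattern` added to the excludepkgs list of repo_id."""
    # interpolation=None so that characters like % in URLs are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(repo_text)
    if not parser.has_section(repo_id):
        raise KeyError(f"no [{repo_id}] section in repo file")
    existing = parser.get(repo_id, "excludepkgs", fallback="")
    patterns = existing.replace(",", " ").split()
    if pattern not in patterns:
        patterns.append(pattern)
    parser.set(repo_id, "excludepkgs", " ".join(patterns))
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    print(add_exclude(SAMPLE_REPO, "updates-testing", "systemd*"))
```

Note that a glob like `systemd*` would also match subpackages, so a real exclusion list would need to be chosen with care.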
The problem with updates-testing is that no one ever uses it. The same very-limited set of people run tests on it there, but the majority of the Fedora user base will never see any of its contents until it's actually (presumed) stable. So I was thinking we could potentially roll that out via Atomic since at least there we have an easy way to back out bad updates.
We've always given a lot of leeway to maintainers to decide when it makes sense to update things in stable releases. And the kernel proves that it is possible to update major system components during a release successfully.
That being said, I don't think moving to a model where multiple major system components are rolling-released on uncoordinated schedules is a good idea. I could easily imagine a scenario where the systemd in updates-testing works with the NetworkManager in updates-testing but not with the one in updates, systemd goes stable first, and nobody can get on the network to update.
Without the ability to a) test the OS as a whole as users will see it and b) automatically roll back to a working state, I think we should generally discourage mid-stream updates to system components. From the point of view of most users, F30 will be along in the blink of an eye.
I guess if you wanted it to just affect atomic host or silverblue updates testing you could put an 'excludepkg' directive into the yum.repo files for updates-testing, and that way dnf based systems following updates testing wouldn't pick up the new packages until we switched it back.
That sounds super confusing! :-)
I imagine the people who have chosen not to use updates-testing (i.e. the silverblue/atomic host users that are on the stable ref) would prefer to not be given a package on the stable ref that wasn't even good enough for updates-testing for a non-ostree user. I could see this being ok for the updates-testing ref, since they have an easy way to back out and had previously agreed to run an unstable ref by rebasing to it.
While that seems like a reasonable middle-ground, we still have the same problem: a very small set of self-selected users that is willing to try out a preview. The reality we face is that that subset of users never catches all of the real-world problems.
I think my ideal approach here would be if we could somehow do a staged deployment, similar to how modern server deployments work. Basically, stuff would spend some time in updates-testing for everyone (maybe lower the time to a few days), then when it goes to "stable", we allow a subset of our users to get access to it. Then a few days later a larger subset gets it and so on until it's out for everyone. (With the caveat that we should have a mechanism for pushing urgent fixes to everyone immediately.)
That being said, maybe somewhere in the middle could work: reduce the minimum time in updates-testing, push things to Atomic first and then push to the general stable compose after 5-7 days out in the Atomic world? It's a slapdash approach to the above with the staged roll-out, but it ensures that at least the people who get it first are the ones most capable of reverting it safely and being able to continue operating while we fix things.
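To make the staged roll-out idea above a bit more concrete, here is a minimal Python sketch of one common way to do it: each machine is hashed into a stable bucket, and an update becomes visible once the current stage's percentage covers that bucket. Nothing like this exists in Fedora's update tooling as far as this thread shows; the stage schedule, the identifiers, and all function names are assumptions for illustration only.

```python
import hashlib

# Hypothetical schedule: (days since the push to stable, percent of machines eligible).
STAGES = [(0, 10), (3, 50), (7, 100)]


def percent_for_day(days_since_stable: int) -> int:
    """Return the share of machines that may see the update on a given day."""
    percent = 0
    for start_day, stage_percent in STAGES:
        if days_since_stable >= start_day:
            percent = stage_percent
    return percent


def rollout_bucket(machine_id: str, update_id: str) -> int:
    """Map a machine to a stable bucket in [0, 100) for one particular update."""
    digest = hashlib.sha256(f"{update_id}:{machine_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_eligible(machine_id: str, update_id: str, days_since_stable: int,
                urgent: bool = False) -> bool:
    """Decide whether a machine should be offered the update yet."""
    if urgent:
        # Urgent fixes bypass the staging and go to everyone immediately.
        return True
    return rollout_bucket(machine_id, update_id) < percent_for_day(days_since_stable)


if __name__ == "__main__":
    for day in (0, 3, 7):
        print(day, is_eligible("example-machine", "example-update", day))
```

Because the bucket depends on both the machine and the update, a machine that is eligible at an early stage stays eligible at later ones, and different updates reach different early cohorts.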
Metadata Update from @bowlofeggs: - Issue tagged with: meeting
We will discuss this in the FESCo meeting on Monday at 15:00UTC in #fedora-meeting-1 on irc.freenode.net.
It turns out that systemd-240 has some unexpected regressions. In particular, the udev work-around we merged for the bind/unbind breakage was broken (the package in rawhide now has that patch reverted). Also, there's some issue with selinux on live images (#1663040). For whatever reason, those bugs are proving hard to figure out. Nevertheless, I expect that we'll resolve them soon, and that the fixes will turn out to be simple. (Except for the bind/unbind issue, but we'd need to backport some fix for this anyway.)
My request still stands, but it will probably apply to systemd-241, which we expect to release soonish as a bugfix release.
In today's discussion we decided to wait and see how systemd-241 does in Rawhide.
Metadata Update from @sgallagh: - Issue untagged with: meeting
It turns out that the release wasn't as smooth as I expected and v241 is still a few days out. Because of the delay with introducing v240/v241 in rawhide, all the security patches that I was hoping to avoid backporting have already been backported, so the motivation to rebase in F29 is partially gone. I'll close this for now. (I'm not giving up on the idea of rebasing in stable fully, but it seems that this is not the release to start the process with.)
Metadata Update from @zbyszek: - Issue close_status updated to: Rejected - Issue status updated to: Closed (was: Open)