Issue #3084: Re-evaluate -fno-omit-frame-pointer compile flag for F40 - fesco

Considering that the deadline of 2 releases has passed and that we have not been pointed to a single way in which this flag has improved the performance of Fedora, whereas it is constantly causing a very real loss of performance for all Fedora users, it is time to finally drop this failed experiment.

ngompa commented 7 months ago

Actually, that is not the case.

We have two separate reports indicating that this feature was a major quality of life improvement:

Additionally, Phoronix benched Fedora Linux 38 after it was implemented and found that the performance difference between F37 and F38 was basically flat.

I think it's proving to be worth retaining going forward.

kkofler commented 7 months ago

The @rjones blog post is just theoretical, showing a better flame graph (though both versions have issues), but no concrete performance improvement resulting from it.

So we now have one report (the @hergertme one) of actual performance improvements obtained using the frame pointers. (And as far as I can tell, this was not posted on Planet Fedora, only on Planet GNOME, so technically "we have not been pointed to [it]" was correct at the time I wrote it, before you posted the link here.) That was posted 17 days ago. So it took almost the entire deadline for one developer to report improvements. And there are no benchmarks proving that the performance improvements are actually noticeable at all, let alone that they are more noticeable than the performance regressions inherently coming from the frame pointers themselves. There is also no reason given why that one developer cannot just use rebuilt packages of the GTK stack from a Copr or some other third-party repository, as opposed to forcing all end users to use the debug-grade binaries.

As for the Phoronix benchmark, https://www.phoronix.com/review/fedora-38-beta-benchmarks/4 shows that GCC is up to 10% slower in Fedora 38, matching the reports from when the frame pointer feature was discussed. This means all our Koji package builds take longer, wasting a lot of maintainer time, and all local builds also take longer, wasting the time of both Fedora packagers and third-party developers.

zbyszek commented 7 months ago

https://www.phoronix.com/review/fedora-38-beta-benchmarks/4 shows that GCC is up to 10% slower in Fedora 38

IIUC, this benchmark compares gcc 12 and gcc 13, which means it's not really relevant for us.

I think it would have been nicer if there was more public feedback about the feature. I was hoping for more benchmarks and more reports… Alas.

Status quo is that the estimates that were made when the feature was approved were proven to be correct: the performance impact is not significant, and the feature is useful for benchmarking. Unless there's some more substantial feedback, I propose that we let status quo be.

rjones commented 7 months ago

I pushed a patch to qemu a few days ago which gives a 6% performance improvement: https://gitlab.com/qemu-project/qemu/-/commit/614c9466a238641480332b707a7a20a3593bdfb7

It led to a bunch of similar patches in qemu although we didn't quantify the benefit of those.

This change is great for profiling, please leave it alone.

rjones commented 7 months ago

& yes that was found by using flamegraphs which rely on having working frame pointers (as the other methods are simply broken). I described how here:

https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg02022.html

I've also spent many, many hours over the past few weeks looking at qemu flamegraphs to provide other optimizations which may in future give us gains there.
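For readers unfamiliar with the technique: a flame graph is drawn from stack samples that have been collapsed into one count per unique call chain, which is why every sample needs a complete stack. The sketch below only illustrates that folding step, assuming the samples are already available as lists of function names. The helper names and the sample data are made up for illustration and are not taken from qemu or from any profiler's code.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse stack samples (outermost frame first) into 'a;b;c' -> count."""
    return Counter(";".join(stack) for stack in samples)

def inclusive_share(folded, function):
    """Fraction of all samples in which `function` appears anywhere on the stack."""
    total = sum(folded.values())
    hits = sum(count for stack, count in folded.items()
               if function in stack.split(";"))
    return hits / total if total else 0.0

# Hypothetical samples; the function names are placeholders.
samples = [
    ["main", "run_cpu", "translate"],
    ["main", "run_cpu", "translate"],
    ["main", "run_cpu", "type_check"],
    ["main", "io_loop"],
]

folded = fold_stacks(samples)
for stack, count in sorted(folded.items()):
    print(stack, count)
print(inclusive_share(folded, "run_cpu"))
```

The width of a box in the graph corresponds to the inclusive share of that call chain, so a sample whose stack is cut short ends up in the wrong place or is lost.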

rjones commented 7 months ago

I just recompiled my qemu-system-riscv64 with -fomit-frame-pointer and ran the same tests I was using from above inside, and they run about 1% faster, although that might also be noise on the machine or in the test. So for qemu at least the impact of frame pointers is very small, while the benefits in visibility are larger.
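One rough way to judge whether a difference of about 1% is real or noise is to repeat each configuration several times and compare the difference in means with the run-to-run variation. The sketch below is a crude heuristic under that assumption, not a proper significance test. The function names are hypothetical and the timings are invented placeholders, not measurements from this ticket.

```python
from statistics import mean, stdev

def compare_runs(baseline, candidate):
    """Return (relative_change, noise) for two lists of wall-clock times.

    relative_change < 0 means the candidate is faster than the baseline.
    noise is the larger of the two coefficients of variation.
    """
    b, c = mean(baseline), mean(candidate)
    change = (c - b) / b
    noise = max(stdev(baseline) / b, stdev(candidate) / c)
    return change, noise

def is_distinguishable(baseline, candidate):
    """Crude check: is the change larger than the run-to-run variation?"""
    change, noise = compare_runs(baseline, candidate)
    return abs(change) > noise

# Invented placeholder timings in seconds (NOT real measurements).
with_fp = [100.0, 102.0, 98.0, 101.0, 99.0]
without_fp = [99.0, 101.0, 97.0, 100.0, 98.0]

change, noise = compare_runs(with_fp, without_fp)
print(f"change={change:+.2%} noise={noise:.2%}")
print(is_distinguishable(with_fp, without_fp))
```

With placeholder numbers like these, a 1% difference is smaller than the spread between runs, so more runs or a quieter machine would be needed before drawing a conclusion either way.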

I think your time would be better spent looking at perf output of GCC (or other programs you think are slow) to identify problems and submit fixes. There are easy wins to be had.

kkofler commented 7 months ago

@zbyszek:

https://www.phoronix.com/review/fedora-38-beta-benchmarks/4 shows that GCC is up to 10% slower in Fedora 38

IIUC, this benchmark compares gcc 12 and gcc 13, which means it's not really relevant for us.

But where is the evidence that this is caused by the jump in GCC version? For now, this is just a first guess by the Phoronix author(s) and now also by you. But back when the F38 change was originally discussed, a GCC performance hit consistent with the above benchmark was found just from recompiling the exact same GCC with frame pointers. So to me, it looks a lot more likely that this is caused by the frame pointers than by the GCC version.

I think it would have been nicer if there was more public feedback about the feature. I was hoping for more benchmarks and more reports… Alas.

But when the feature was provisionally approved, exactly that (i.e., evidence that this helps improve performance more than it hurts it) was the condition for prolonging the experiment. So the proponents have failed to deliver and hence the experiment should automatically stop here.

Status quo is that the estimates that were made when the feature was approved were proven to be correct: the performance impact is not significant, and the feature is useful for benchmarking.

I see the exact opposite: no significant performance gains from the profiling improvements, a huge performance hit to GCC, and a small but significant performance hit to globally everything.

Unless there's some more substantial feedback, I propose that we let status quo be.

Since the change was approved as a time-limited experiment, with the option to extend or permanently retain it if and only if its effectiveness is proven, the relevant status quo ought to be the status quo from before the change.

kkofler commented 7 months ago

@rjones:

I pushed a patch to qemu a few days ago which gives a 6% performance improvement: https://gitlab.com/qemu-project/qemu/-/commit/614c9466a238641480332b707a7a20a3593bdfb7

It should be noted that this is for RISC-V software emulation. Software emulation is inherently very slow; you just managed to make it less slow. Your performance improvement will not help at all with what is nowadays the most common use case of QEMU, which is hardware-assisted same-architecture virtualization using KVM.

Also, you get a 6% speedup, but (as you state yourself) the frame pointers cause a 1% slowdown on the host (and that does not even include the system libraries such as glibc, since you recompiled only QEMU for that particular benchmark) and, since you are using GCC to benchmark, probably up to 10% slowdown on the guest side.

This change is great for profiling, please leave it alone.

You have still not explained why you need the QEMU binaries delivered to end users built that way. For your profiling needs, only you need a binary built with frame pointers. That should be opt-in, for developers only. You are not collecting profiles from end users, so why do they need profiling-enabled binaries?

I think your time would be better spent looking at perf output of GCC (or other programs you think are slow) to identify problems and submit fixes. There are easy wins to be had.

perf is very bad at pinpointing the precise location of the hotspots compared to Callgrind/Cachegrind, which can point us to the individual assembly instruction where the time is spent.

rjones commented 7 months ago

perf annotate shows the hotspots down to the instruction level.

You are making wild claims about slowdowns of 10% which are not justifiable. Show evidence please.

decathorpe commented 7 months ago

Guys, this is a ticket that was filed so FESCo doesn't forget to do the actual reevaluation in time for F40. Please move actual discussion to either the devel list or discussions.fp.o.

kkofler commented 7 months ago

I wrote:

But when the feature was provisionally approved, exactly that (i.e., evidence that this helps improve performance more than it hurts it) was the condition for prolonging the experiment. So the proponents have failed to deliver and hence the experiment should automatically stop here.

PS: IMHO, it is already too late to deliver evidence now. The idea was that the evidence should flow in during the year the experiment was running. This has not happened. It should not be the task of FESCo or Fedora in general to actually go and request and collect evidence. It would have been the job of the change proponents to deliver said evidence, and this has not happened at all.

kkofler commented 7 months ago

@rjones:

perf annotate shows the hotspots down to the instruction level.

Good to know, but this will be with a lot of measurement noise given perf's inherently stochastic nature.
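How much noise a sampling profiler has depends mostly on how many samples were collected. As a rough sketch, if samples are treated as independent (an assumption that real profiles only approximate), the observed share of a hotspot behaves like a binomial proportion, and its standard error shrinks with the square root of the sample count. The function name below is hypothetical.

```python
import math

def sampling_error(share, samples):
    """Standard error of an observed sample share.

    Assumes independent samples (binomial model): sqrt(p * (1 - p) / n).
    """
    if samples <= 0:
        raise ValueError("samples must be positive")
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    return math.sqrt(share * (1.0 - share) / samples)

# A hotspot seen in 5% of samples, at increasing sample counts.
for n in (100, 10_000, 1_000_000):
    print(n, f"{sampling_error(0.05, n):.5f}")
```

Under this model, collecting 100 times more samples cuts the error by a factor of 10, so a longer recording or a higher sampling frequency reduces the noise.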

You are making wild claims about slowdowns of 10% which are not justifiable. Show evidence please.

See the compilation times here:
https://www.phoronix.com/review/fedora-38-beta-benchmarks/4

rjones commented 7 months ago

Here's an older case where I used a variety of benchmarking techniques including perf with frame-pointers to identify problems in qemu's NBD layer. This work has since been taken over by Eric Blake and he's also using flamegraphs (hence frame pointers) to profile his more recent work.

https://lore.kernel.org/qemu-devel/20230309113946.1528247-1-rjones@redhat.com/

rjones commented 7 months ago

The Phoronix stuff is not evidence of anything, you don't need to keep bringing it up.

kkofler commented 7 months ago

Oh, and that 5-6% QEMU speedup you claim (you wrote 6% here, 5% in the commit message) is actually only relevant at all if a debugging feature (QOM cast debugging) is enabled, which should not be the case for release binaries. So there, too, the answer is to not deliver debug binaries to end users, same as for the frame pointers. If the end users are getting binaries with QOM cast debugging disabled, they will not notice any performance improvement at all from your patch, even if they use RISC-V software emulation on their x86_64 (or aarch64 or whatever) host.

Edited 7 months ago by kkofler

rjones commented 7 months ago

Here's a case where I used flamegraphs to analyse the performance of curl handles:

https://listman.redhat.com/archives/libguestfs/2023-February/030618.html

Here's a case where I used flamegraphs to study the performance of kTLS (not on a public mailing list unfortunately):
http://oirase.annexia.org/nbd-plaintext.svg
http://oirase.annexia.org/nbd-tls-no-ktls.svg
http://oirase.annexia.org/nbd-ktls.svg

Here's a case (similar to the original one above) where I examined the effect of gobject cast debugging on Fedora qemu:
https://listman.redhat.com/archives/virt-tools-list/2023-October/017812.html

kkofler commented 7 months ago

@rjones:

The Phoronix stuff is not evidence of anything, you don't need to keep bringing it up.

You asked for evidence of the slowdowns caused by frame pointers, so I provided it. If you prefer to keep living in denial and celebrating yourself for a QEMU speedup only relevant at all for QEMU developers (because it only affects builds with QOM cast debugging enabled), that is up to you.

rjones commented 7 months ago

We fixed the QOM cast debugging issue by making a single targeted change to the code, guided by profiles, so we benefit from the debugging in Fedora but get the performance gains.

kkofler commented 7 months ago

So you deliberately continue shipping debug builds of QEMU in Fedora? 🤦

Binaries delivered to end users should not have debugging features for developers enabled. Neither frame pointers, nor cast debugging in QEMU, nor any other debugging feature that slows things down and brings no visible benefit to the end user.

rjones commented 7 months ago

If the debugging doesn't cost anything - as we proved using frame pointers and fixing one hotspot - then we should keep the debugging in place as it provides early warning of problems which we can report upstream.

I'm looking forward to your careful study on the effects of enabling frame pointers on GCC. Until then I don't really have anything else to say.

kkofler commented 7 months ago

So Fedora users are again being abused as beta testers for RHEL.

zbyszek commented 7 months ago

@rjones Thanks for those examples. This is the kind of report that I was looking for.

IIUC, this benchmark compares gcc 12 and gcc 13, which means it's not really relevant for us.

But where is the evidence that this is caused by the jump in GCC version?

The jump in gcc version means that we can draw no conclusions either way. (I.e. gcc13 being slower than gcc12, frame pointers making gcc slower, and the combination of both things are all valid explanations for the observed change.)

no significant performance gains from the profiling improvements

Hmm, such claims make the discussion very tedious. Various examples of significant gains were linked and people who actually work on full-system profiling are saying that they find this useful. I have no idea what you want to gain by refusing to acknowledge this. You certainly are not going to convince anybody who has read the full discussion.

a huge performance hit to GCC

The feature was approved with the requirement that individual packages can opt out. Python 3.11 did opt out, and Python 3.12 is opting back in. If you have measurements that show "a huge performance hit", then I suggest talking with the maintainers of gcc. I'm sure they'd be happy to take a patch.

churchyard commented 7 months ago

Guys, this is a ticket that was filed so FESCo doesn't forget to do the actual reevaluation in time for F40. Please move actual discussion to either the devel list or discussions.fp.o.

Please.

rjones commented 7 months ago

(Last thing I'll say here). GCC has been compiled without frame pointers for the last 9 months, so any slowdown you've seen isn't caused by this.

ngompa commented 7 months ago

Another data point from @hergertme was the blog post on the improvements to VTE he made recently.

kkofler commented 7 months ago

@zbyszek:

Various examples of significant gains were linked

Various examples of gains were linked. Not of significant gains.

The only one that had actual numbers quantifying the gains is the QEMU QOM cast debugging one (where the debugging feature should never have shipped in Fedora to begin with, at least not since upstream made it optional a decade ago, with the expectation that it be turned off in release builds), and even that only for software emulation of RISC-V (a niche use case).

and people who actually work on full-system profiling are saying that they find this useful.

But that is a very tiny minority of Fedora users that would be served just as well by a dedicated targeted repository for this purpose, without degrading the performance for all other users.

hergertme commented 7 months ago

I hesitate to respond, because I don't feel like I need to justify my work, especially when it feels like the other side of the conversation is not interested in hearing us and is nitpicking every detail.

Myself and others have fixed numerous things in GLib which will easily surpass the 1% overhead deep in the type system. This affects every single GObject-based application, from GNOME Shell to every GTK application.

GTK itself gets designed with Sysprof so every single new API is getting tested regularly.

Every fix in VTE was driven by Sysprof/perf, and that is in the 40-50% performance improvement range while also doubling the frame rates. Given that the terminal is the number one used desktop application in GNOME, I would expect this to be valuable.

Sysprof itself was built using Sysprof, and I couldn't have even built it as well as I did without frame pointers working across the stack. The Sysprof in F39 is significantly better because of F38's change.

But the point that seems to be completely missed here is that you can nitpick anything that has been done as "well that's done, so who cares now?". The real value is the ability to make more improvements every cycle.

I couldn't even make use of the frame pointers effectively until the F39 beta came out and this was all done in a matter of weeks from that.

Edited 7 months ago by hergertme

hergertme commented 7 months ago

https://blogs.gnome.org/chergert/2023/07/28/how-to-use-sysprof-again/
https://blogs.gnome.org/chergert/2023/08/02/writing-fast-search/
https://blogs.gnome.org/chergert/2023/08/04/more-sysprofing/
https://blogs.gnome.org/chergert/2023/08/10/profiling-with-medium-aged-hardware/

All of these were related to having frame pointers. So a bunch of double-digit performance improvements in search providers across GNOME were due to F39 having frame pointers I could test with.

I should note that I don't regularly work on GLib, VTE, or search providers. I only came across these things while fixing other issues in other projects. So had they not shown up on profiles with quality frame pointers, I would have never seen them nor had the inkling to even go fix them.

On top of that, if we don't have quality frame pointers throughout the stack, it doesn't matter if we go compile our library or program with frame pointers, because the stack unwinding will inevitably break, causing all your recordings to be close to worthless. This is of course all laid out in my blog posts on frame pointers from the beginning of the year.
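To illustrate why a single library built without frame pointers can spoil the unwind for everything that calls through it, here is a toy simulation of an x86_64-style stack. It is a simplified model and not how perf or Sysprof are implemented. The function names, addresses, and the `clobbers_rbp` switch are all made up for the example.

```python
def sample_stack(chain, clobbers_rbp=True):
    """Simulate calls, then unwind by following saved frame pointers.

    chain: list of (function_name, keeps_frame_pointer), outermost first.
    Returns the function names an unwinder would report, innermost first.
    """
    mem = {}          # simulated stack memory: address -> value
    rsp = 0x8000      # stack pointer, grows downwards
    rbp = 0           # frame pointer register; 0 marks the end of the chain
    caller = "_start"
    for name, keeps_fp in chain:
        rsp -= 8
        mem[rsp] = caller          # "call" pushes the return address
        if keeps_fp:
            rsp -= 8
            mem[rsp] = rbp         # push rbp
            rbp = rsp              # mov rbp, rsp
        elif clobbers_rbp:
            rbp = 0xDEAD           # rbp reused as a scratch register
        caller = name

    # The sample is taken while the innermost function is running.
    trace = [chain[-1][0]]
    fp = rbp
    while fp in mem and (fp + 8) in mem:
        trace.append(mem[fp + 8])  # return address sits above the saved rbp
        fp = mem[fp]               # follow the saved frame pointer
    return trace

full = [("main", True), ("lib_func", True), ("callback", True), ("leaf", True)]
mixed = [("main", True), ("lib_func", False), ("callback", True), ("leaf", True)]

print(sample_stack(full))                       # complete chain up to _start
print(sample_stack(mixed))                      # chain ends at lib_func
print(sample_stack(mixed, clobbers_rbp=False))  # main is silently skipped
```

In this model the program's own code (`main`, `callback`, `leaf`) keeps frame pointers, yet the recorded stack is either truncated or missing `main` because `lib_func` in the middle does not.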

Edited 7 months ago by hergertme

rjones commented 7 months ago

That stuff Christian is doing is amazing.

tstellar commented 7 months ago

Guys, this is a ticket that was filed so FESCo doesn't forget to do the actual reevaluation in time for F40. Please move actual discussion to either the devel list or discussions.fp.o.

Please.

Can we create an official thread for discussion?

tstellar commented 7 months ago

Voted on during the 2023-10-26 Meeting

AGREED: FESCo indicates that there is nothing that indicates we need to do anything for -fno-omit-frame-pointer change. (+7, 0, -1)

zbyszek commented 6 months ago

Profiling of systemd during boot, incl. a flamegraph: https://github.com/systemd/systemd/pull/29821

Closed: Accepted 7 months ago by tstellar. Opened 7 months ago by mhayden.