#2817 Change proposal: Add -fno-omit-frame-pointer to default compilation flags
Closed: Rejected 2 years ago by sgallagh. Opened 2 years ago by bcotton.

Fedora will add -fno-omit-frame-pointer to the default C/C++ compilation flags, which will improve the effectiveness of profiling and debugging tools.

Owners, do not implement this work until the FESCo vote has explicitly ended.
The Fedora Program Manager will create a tracking bug in Bugzilla for this Change, which is your indication to proceed.
See the FESCo ticket policy and the Changes policy for more information.


I am open to the possibility that the system-wide performance impact may be negligible, but I am concerned that we do not really know if that is the case or not. Some benchmarks for similar changes are mentioned in the proposal, but they are specialized (Firefox JIT) and/or difficult to reproduce (proprietary internal workloads).

I am not sure we should commit to this change without something like a mini mass rebuild into COPR as suggested in the devel thread, with a well-documented benchmarking effort across various applications and workloads.

Allowing individual packagers to opt out after the real mass rebuild is a welcome and important part of the proposal, but it does not really address the risk of system-wide performance regressions—most individual packagers won’t be prepared to organize their own private package-specific benchmarks.

I’m just now catching up on devel list discussion regarding a recent Phoronix article on the topic. I haven’t formed an opinion on the results yet, but I’m happy to see some concrete benchmarking efforts and will continue to follow the discussion closely.

The performance side worries me on this one, but it ends up being a trade-off that we must weigh. A performance reduction, even a small one, affects every user in some way. I would venture to guess that profiling and debugging are done by a relatively small percentage of Fedora users.

If someone was having difficulty with debugging/profiling an application, couldn't they temporarily add -fno-omit-frame-pointer to the spec file and rebuild it? (I'm absolutely not an expert in compilers, so someone please correct me if I am missing something important.)

If someone was having difficulty with debugging/profiling an application, couldn't they temporarily add -fno-omit-frame-pointer to the spec file and rebuild it? (I'm absolutely not an expert in compilers, so someone please correct me if I am missing something important.)

I hope the change owners will correct me if I’m wrong, but as I understand it the goal is to better support profiling of large applications with perf and similar sampling profilers, without having to rebuild the entire dependency tree in order to get detailed insight into time spent in libraries.

On behalf of the Red Hat GCC team: we are strongly against this change (chiefly for performance reasons). Sorry.

On behalf of the Red Hat GCC team: we are strongly against this change (chiefly for performance reasons). Sorry.

Your input is appreciated!

This is what I worry about most: a performance reduction that affects nearly everyone, weighed against debugging and profiling being harder for a smaller number of users. I've been on both ends of these issues before, and I lean toward maintaining our current performance.

After a week, there are no votes, but the two FESCo members who have commented appear to be leaning toward -1, so I'm tagging this for the next meeting.

Metadata Update from @bcotton:
- Issue tagged with: meeting

2 years ago

On behalf of the Red Hat GCC team: we are strongly against this change (chiefly for performance reasons). Sorry.

It'd be great if you could provide some extra context here. Can you quantify the performance reasons? Is there a specific performance threshold where this change would be acceptable?

After a week, there are no votes...

Let me explicitly vote -1 for now to avoid autoapproval.

It'd be great if you could provide some extra context here. Can you quantify the performance reasons? Is there a specific performance threshold where this change would be acceptable?

I believe the burden of proving that -fno-omit-frame-pointer does not hurt generated-code performance falls on the people proposing this feature. The performance impact should have been reported with the proposal from the start.

SPEC2017 is the most credible benchmark compiler people use to measure performance. For context: in my experience, a 1% SPEC performance improvement usually requires a year of work from the entire GCC community.

One can say that we can ignore a 1% performance difference because it is insignificant. I can give a counter-argument: assuming 10% of all electricity is consumed by computers, a 1% performance improvement in energy-proportional computing would mean saving about 20 TWh annually.

This was discussed during today's FESCo meeting:
ACTION: DaanDeMeyer to add more information to the proposal

See the logs for lots of interesting discussion (https://meetbot.fedoraproject.org/fedora-meeting/2022-07-05/fesco.2022-07-05-17.00.log.html).

Metadata Update from @zbyszek:
- Issue untagged with: meeting

2 years ago

Some benchmarks for similar changes are mentioned in the proposal, but they are specialized (Firefox JIT) and/or difficult to reproduce (proprietary internal workloads).

As @decathorpe has pointed out on the mailing list, the benchmarks also only build the leaf applications with frame pointers and not the entire distribution, so they do not actually measure the complete performance impact (but in fact most likely significantly underestimate it).

Added more details to the proposal on the different use cases. Will focus on a more thorough benchmark of Fedora next, now that I've finished trying to reproduce the phoronix results.

Status update, I'm working on the benchmarks but I'll likely need until next week to finish them properly (https://github.com/DaanDeMeyer/fpbench)

In addition to his comments on the mailing list, Christian (sysprof maintainer) posted a very concise summary of the problem for profilers in this comment.

Shall we check in on this in the meeting? If there's no new info we can revisit another week out.

Metadata Update from @kevin:
- Issue tagged with: meeting

2 years ago

I'm still working on the benchmarks, I'm going to do my best to finish them by next week's meeting

Given that @daandemeyer needs more time, but the F37 mass rebuild is scheduled to start tomorrow, would it make sense to withdraw this Change proposal for Fedora 37 and resubmit it for Fedora 38, with more complete data on the performance impact? I don't think it would make much sense to push a change in default compiler flags after the mass rebuild.

That seems like the best way to go, I still plan to finish the benchmarks, but the proposal itself can be moved to F38.

Will remove meeting keyword and review this when more data is available.

Metadata Update from @kevin:
- Issue untagged with: meeting

2 years ago

Metadata Update from @bcotton:
- Issue set to the milestone: Fedora Linux 38 (was: Fedora Linux 37)

2 years ago

Metadata Update from @churchyard:
- Issue tagged with: stalled

2 years ago

It's been two months since the last update. Any progress?

I finished the benchmarks but haven't found time to post the results yet. I'm trying to run them again on Fedora 37 but am running into some issues. I plan to post the final results once I get the Fedora 37 COPR working.

I updated the change proposal with the benchmark results

I see up to almost 10% slowdowns! (9.5% on scimark_sparse_mat_mult.) I do not see how that can be anywhere near acceptable.

Now that benchmark results are available, I've removed the "stalled" tag. FESCo members, please provide your votes and feedback on this proposal.

Metadata Update from @bcotton:
- Issue untagged with: stalled

2 years ago

I assume cryptography and compression libraries use hand-tuned assembly for core operations? That might explain why they aren't affected too badly (but that means they also won't really benefit from this change, either ...).

But everything else shows slow-downs mostly between 2% and 10%, which looks really bad to me - particularly because it seems to affect things that people will notice.

For example, I don't think people will be happy if stuff in Python becomes noticeably slower - the json module, regular expressions, rendering Jinja2 templates, the XML parser - all widely used, and all 4-8% slower :(
And the scimark Python benchmarks look particularly bad, with all results showing as 5-10% slower, which would make the data science people very sad, and probably make them either download other Python binaries, or switch distro entirely.

So while I appreciate that the change owner has now invested the time to provide benchmark results, I don't think I can in good conscience approve of this change. -1

GNOME people are asking what happens to the results if python is built without frame pointers while everything else uses -fno-omit-frame-pointer. I wonder if the python slowdown is caused by building python itself with frame pointers, or if it's from something lower level. If the poor results are not encountered outside python, and rebuilding just python alone without frame pointers is enough to fix it, then we could plausibly use -fno-omit-frame-pointer for most of the distro, except python. Edit: Actually, the suggestion is to build the entire python stack without frame pointers as a starting point, not just the python package alone. It looks like all the non-python results are satisfactory.

GNOME performance folks are still watching this issue with concern because (a) serious application profiling on Fedora is not possible currently unless you build all of your application's dependencies yourself, which is a very strong disincentive to use Fedora for performance-related development, (b) desktop-level profiling is not possible at all, but it's essential to improving performance of Fedora as a whole, and (c) our performance experts still do not believe that suggestions to develop profiling tools that do not depend on frame pointers could be workable.

I'm in the same boat as @decathorpe, unfortunately.

-1

Metadata Update from @ngompa:
- Issue tagged with: meeting

2 years ago

serious application profiling on Fedora is not possible currently unless you build all of your application's dependencies yourself

As I had already pointed out in the mailing list threads, Valgrind + Callgrind/Cachegrind + KCachegrind works pretty well for application-level profiling. Yes, it will slow down your application a lot. But its performance model does not actually depend on wallclock time, so the slowdown will not break the profiling. And the model is also independent of the exact CPU on which you happen to run the profiling.

Of course, it is not a solution for systemwide profiling, but for applications, Callgrind does the job.

(Could gnome- or gnome-adjacent folks refresh us on a specific benchmark scenario of profiling that is 'not currently possible' because of poor performance? There may be significant performance improvements possible at the tooling side.)

While I appreciate wanting to improve debug/profiling capabilities, I think the benchmarks show that the performance hit is too significant.

-1

I think this mail and this mail more or less summarize the problem here.

While I appreciate wanting to improve debug/profiling capabilities, I think the benchmarks show that the performance hit is too significant.

-1

I'm confused here. The benchmarks look pretty good to me, except for the impact on python. We already know not everything will be able to use -fno-omit-frame-pointer: there will need to be some exceptions.

What is good about the benchmarks? Everyday applications such as Blender and GCC have a performance hit around 2%. Scientific benchmarks up to 10%. If you really think this is in any way specific to Python (I doubt it), then we need benchmarks for scientific code in C or C++. The pyperformance benchmarks are the only scientific computing ones in the results that the Change owner has published.

And I would argue that slowing down GCC compilations by 2.4% is also unacceptable.

then we need benchmarks for scientific code in C or C++

E.g., try benchmarking the C version of SciMark:
https://math.nist.gov/scimark2/download_c.html
instead of the Python port bundled in pyperformance.

And I would argue that slowing down GCC compilations by 2.4% is also unacceptable.

Yes, definitely.

then we need benchmarks for scientific code in C or C++

E.g., try benchmarking the C version of SciMark:
https://math.nist.gov/scimark2/download_c.html
instead of the Python port bundled in pyperformance.

I did try on my dev machine (so not a completely sterile environment, but no other workload was running). There is a bunch of variability in the results, but the -fno-omit-frame-pointer and -fomit-frame-pointer versions seem to be very close: sometimes the frame-pointer build wins, sometimes the no-frame-pointer build wins.

NO-FP
=====

          Run 1     Run 2
NORMAL   2452.95   2447.71
LARGE    2461.07   2509.69
HUGE     2527.93   2435.89

FP
==

          Run 1     Run 2
NORMAL   2473.04   2536.27
LARGE    2472.11   2473.25
HUGE     2450.22   2495.14

Two example runs:
https://gist.github.com/anakryiko/06582919b42be043eb6bf0e57158351b
https://gist.github.com/anakryiko/f74e38038825464846cfe02db43fd89c

You can check gist to see what CFLAGS were used for compilation and what tests were run.

And I would argue that slowing down GCC compilations by 2.4% is also unacceptable.

What would be acceptable, though? I don't think anyone has ever said what the criteria and thresholds are. In my view, waiting an extra 6 seconds for a full kernel rebuild that takes 4 minutes is not a big deal if instead I get a system that is traceable and profileable with tons of BPF-based tools like bpftrace and many of the BCC tools, which rely on user-space stack traces.

Even if we lose 1-2% of benchmark performance, what we gain is a lot of systems enthusiasts who can now do ad-hoc profiling and investigation without needing to recompile the system or application in a special configuration. It's extremely underappreciated how big a barrier it is to contributing to performance and efficiency work when even trying to do anything useful in this space takes tons of effort. If we want the community to contribute, we need to make it simple for that community to observe applications. And saying that all this is possible with valgrind, recompiling apps, etc. -- that's not how it works for most people in practice. I wouldn't even bother trying to help; too much hassle.

There is a reason companies like Meta compile their internal applications with frame pointers. Whatever the initial cost might have been (and we looked at pretty performance- and latency-sensitive systems when turning this on 5 years ago and didn't see any meaningful degradation - perhaps because we didn't focus on synthetic benchmarks, but rather tested on real workloads), we've recouped it many times over thanks to effortlessly available high-quality performance and observability data.

So if Fedora and the larger open-source community want ad-hoc enthusiasts to help improve the efficiency of applications, remember that frame pointers significantly lower the barrier to entry for contributing in this area.

As for pyperformance, such noticeable degradation is surprising to me, and I'd say it warrants digging into why it happens and where the costs come from. We had even more surprising results with Phoronix posting a 3x (or was it 10x?) slowdown in botan, and when we went looking it turned out to be due to wrong optimization levels. I'm not saying it's anything that obvious for pyperformance, but losing up to 9% just because %rbp is used to maintain the frame base pointer is weird. To see a 9% degradation, you'd have to have tons of almost-no-op functions being called (in which case the question is whether we should remove that waste), or some function that is very sensitive to having just one extra register (in which case maybe the code should be changed and split to reduce that sensitivity).

Unfortunately, the feeling I get is that there is a generally negative predisposition to this change, which makes that kind of time investment in investigation much less appealing.

I'm glad that GNOME community is sending a similarly strong message on how important all this is for their ecosystem. This shows that frame pointers are important not just for big companies with lots of software, it's important for big projects and ecosystems that care about long-term performance work in general.

All the proposed "alternatives" (DWARF, valgrind, whatnot) do not stand the test of practical reality.

I think this mail and this mail more or less summarize the problem here.

I appreciate the link/reminder, but what would really help is a runnable recipe or script for running a profiling scenario that has unacceptable performance. (Especially helpful to those of us who are not already sysprof users.)

If you've never used sysprof before, just open it, check Profile Entire System, and click the blue Record button. You're not going to notice unacceptable performance. Instead, what you'll notice is the profiling results for Fedora software are crap due to lack of useful backtraces. You won't make much progress if you're hoping to debug performance problems. Contrast that to the results you'll see for Flatpak applications distributed by GNOME or Flathub, where you can actually see useful backtraces because they're compiled with frame pointers. Providing useful backtraces is not possible otherwise because sysprof has to construct those backtraces really fast and doing it without frame pointers is slow. It can easily take 40 seconds or more for gdb to print a simple backtrace for a program linked to WebKit, but sysprof has to be able to do that many many many times per second.

At least, I think so. Important disclaimer: I don't actually know what I'm talking about. Just trying to summarize my understanding of this. What I do know is that our performance engineers really want the frame pointers.

You're not going to notice unacceptable performance.

Do you have instructions for how to invoke sysprof in such a way that it produces useful results but at "not currently possible" = unacceptable performance? That way we have a target we could analyze and try to improve.

That's all well and good, we all want nice things. However, assuming that adding these compiler flags to the defaults in Fedora improves profiling, who's going to actually do the work to get back the 0-10% performance losses we'd have across the whole system, in thousands of upstream projects? The limitation here is not lack of good data, but as always, a lack of manpower ...

Do you have instructions for how to invoke sysprof in such a way that it produces useful results but at "not currently possible" = unacceptable performance? That way we have a target we could analyze and try to improve.

Please reread the mails that I linked to above. Our developers do not want to waste time developing something they know will never work. It's not a reasonable request.

That's all well and good, we all want nice things. However, assuming that adding these compiler flags to the defaults in Fedora improves profiling, who's going to actually do the work to get back the 0-10% performance losses we'd have across the whole system, in thousands of upstream projects? The limitation here is not lack of good data, but as always, a lack of manpower ...

I'm pretty confident that Fedora actually has the largest developer community of any Linux distribution. Developers who do want to profile Fedora cannot plausibly do so currently because there are no frame pointers.

Clearly something is wrong with python that needs to be better understood before we can enable frame pointers for python packages. @anakryiko has already pointed out that the results are unexpected and require further investigation.

Then the worst-case non-python result, the gcc benchmark, is 2.4%, not 10%. Desktop users will not notice a 2.4% slowdown unless they're paying way too much attention to Phoronix, so we don't have to do anything about it. What users will actually notice are severe performance problems that have noticeable impact on system responsiveness. What Fedora users actually complain about is "why is GNOME so slow?" or "why is scrolling so slow?" or "why is video playback choppy?" Without frame pointers, these questions cannot be answered. A 2.4% or 24% or even 240% speedup will likely not be enough to fix noticeable performance problems; when something is slow enough for users to notice a problem, it's because the code is orders of magnitude too slow and needs to be 100x faster or more. From this perspective, worrying about 2.4% just doesn't make a whole lot of sense.

I appreciate the detailed benchmarks.

The crypto benchmarks generally show no impact, which is not surprising since these libraries likely spend most of their time in assembly routines. I would expect similar results for any scientific applications that are dominated by FFT performance and use a highly-optimized library like FFTW. So it’s valuable to confirm this, but these parts of the benchmarks shouldn’t be taken as indicators of performance in other kinds of software.

I think that the blender, gcc, pgbench, redis, and zstd benchmarks represent a decent cross-section of typical CPU-intensive applications. From this, I infer that 0.5%–2.5% is the typical impact.

I wish I understood why the Python impact was so high. I am not convinced that this can be hand-waved away with “we’ll just opt out Python and all of its extensions.” If it’s not clear why Python is special, then it’s not clear how many other applications or ecosystems that aren’t benchmarked here might also be disproportionately affected. That bothers me.

I personally care quite a bit about Fedora’s usefulness for developers, including for upstreams such as GNOME. I also think the kind of profiling-in-production use case that @daandemeyer is trying to enable is a reasonable thing for people to want to be able to do. That means I would like to find a way for this change to be viable.

On the other hand, I also care quite a bit about the “long tail” of applications that are performance-sensitive but for which there will probably never be anyone who steps up to do the kind of profiling and optimization this would enable. Those will get only the downside and not the potential benefit of this change. This kind of echoes @decathorpe’s point above…

In response to @anakryiko, my “seat of the pants” intuition is that 1% performance regression is probably acceptable (though some will disagree), 5% is too much (though some will disagree) and unlikely to ever be repaid in hypothetical performance patches enabled by better profiling, and 2-3% is difficult: some people will find it obviously acceptable and other people will find it obviously unacceptable, depending on their priorities.

Two example runs:
https://gist.github.com/anakryiko/06582919b42be043eb6bf0e57158351b
https://gist.github.com/anakryiko/f74e38038825464846cfe02db43fd89c

You can check gist to see what CFLAGS were used for compilation and what tests were run.

Those are both with -fomit-frame-pointer. Also, you have apparently recompiled only the benchmark and not the entire distribution (in particular, glibc). The benchmark is only useful if you compare a distribution (OS) and an application, both together compiled once with -fomit-frame-pointer and once with -fno-omit-frame-pointer. Recompiling only the application will not show the complete performance impact.

I wish I understood why the Python impact was so high. I am not convinced that this can be hand-waved away with “we’ll just opt out Python and all of its extensions.” If it’s not clear why Python is special, then it’s not clear how many other applications or ecosystems that aren’t benchmarked here might also be disproportionately affected. That bothers me.

FWIW I agree. This makes sense to me.

Those are both with -fomit-frame-pointer.

There are actually results for -fno-omit-frame-pointer later down.

Also, you have apparently recompiled only the benchmark and not the entire distribution (in particular, glibc). The benchmark is only useful if you compare a distribution (OS) and an application, both together compiled once with -fomit-frame-pointer and once with -fno-omit-frame-pointer. Recompiling only the application will not show the complete performance impact.

Good catch.

Two example runs:
https://gist.github.com/anakryiko/06582919b42be043eb6bf0e57158351b
https://gist.github.com/anakryiko/f74e38038825464846cfe02db43fd89c

You can check gist to see what CFLAGS were used for compilation and what tests were run.

Those are both with -fomit-frame-pointer.

No, they are not. There are scimark4-fp and scimark4-nofp, and I printed out the CFLAGS used for compilation, with the -fomit-frame-pointer and -fno-omit-frame-pointer flags listed explicitly. I also summarized a small table of results from two runs of both variants, in case you didn't want to read the gist carefully - which it seems you didn't. But you neither trusted nor even looked at that small table, unfortunately.

Also, you have apparently recompiled only the benchmark and not the entire distribution (in particular, glibc). The benchmark is only useful if you compare a distribution (OS) and an application, both together compiled once with -fomit-frame-pointer and once with -fno-omit-frame-pointer. Recompiling only the application will not show the complete performance impact.

If you check scimark source code, you'll see it doesn't really use glibc much. It links against libc and libm, it uses a whole 5 stdlib functions:
- clock, measuring elapsed time, once per iteration (not a hot path at all);
- printf for information output (very not hot path, few lines of informational output);
- malloc, a whopping 348 times (I counted with bpftrace) for default run, it doesn't matter relative to the useful work scimark4 is doing;
- sin and sqrt. So I tried to trace all functions called sin and sqrt (they are IFUNCs in ELF), and I got 8 runtime hits in total. I didn't find a disassembly of sqrt and sin in libm.so; the closest was sqrtf64, but sqrtf64 wasn't called at runtime at all. I'd expect a good compiler to optimize sin and sqrt down to native CPU instructions, but I'm not a compiler expert and don't know exactly how this is done in practice.

With the above, I'd argue that how my OS or glibc is compiled is completely immaterial to this specific benchmark. Which is what I'd expect from computationally-heavy scientific benchmark. It is not supposed to benchmark libc overhead.

This will be discussed during today's meeting at 17:00 UTC.

I'm not sure if anyone from Red Hat Platform Tools will be able to attend because of the short notice, sorry.

FWIW, CPython upstream is likely to recommend these flags for Python 3.12 (main Python for Fedora 39), and I think it would be best to follow the recommendation regardless of Fedora-wide defaults.
I posted more on python-devel@lists.f.p.o.

FWIW, CPython upstream is likely to recommend these flags for Python 3.12 (main Python for Fedora 39), and I think it would be best to follow the recommendation regardless of Fedora-wide defaults. I posted more on python-devel@lists.f.p.o.

Do you have an official source for python recommending always building with frame pointers aside from the perf docs?

FWIW, CPython upstream is likely to recommend these flags for Python 3.12 (main Python for Fedora 39), and I think it would be best to follow the recommendation regardless of Fedora-wide defaults.

Seeing how Python is one of the worst hit applications, I would not think so, sorry.

This was discussed during today's FESCo meeting. We agreed to punt until next meeting (two weeks hence).

See also: https://lists.fedoraproject.org/archives/list/python-devel@lists.fedoraproject.org/thread/6TQYCHMX4FZLF27U5BCEC7IFV6XNBKJP/

Do you have an official source for python recommending always building with frame pointers aside from the perf docs?

Nothing yet, I expect an official recommendation to come with Python 3.12.

I wish I understood why the Python impact was so high. I am not convinced that this can be hand-waved away with “we’ll just opt out Python and all of its extensions.” If it’s not clear why Python is special, then it’s not clear how many other applications or ecosystems that aren’t benchmarked here might also be disproportionately affected. That bothers me.

So I did look a bit at Python with and without frame pointers, trying to understand the pyperformance regressions.

First, the perf data suggests that a big chunk of CPU time is spent in _PyEval_EvalFrameDefault, so I looked specifically into it. (We also had to use DWARF mode for perf for an apples-to-apples comparison, and a bunch of stack traces weren't symbolized properly - which is yet another reminder of why having frame pointers is important.)

perf annotation of _PyEval_EvalFrameDefault didn't show any obvious hot spots; the work seemed to be distributed pretty similarly with or without frame pointers. Scrolling through the _PyEval_EvalFrameDefault disassembly likewise showed that the instruction patterns of the fp and no-fp versions are very similar.

But just a few interesting observations.

The size of the _PyEval_EvalFrameDefault function specifically (none of the other functions changed much in that regard) increased very significantly, from 46104 to 53592 bytes - a considerable increase of about 16%. Looking deeper, I believe it's all due to extra stack spills and reloads caused by having one fewer register available to keep local variables in registers instead of on the stack.

Looking at the _PyEval_EvalFrameDefault C code, it is one humongous function with a gigantic switch statement that implements Python's instruction-handling logic. So the function itself is big, and it has a lot of local state in its many branches, which to me explains why there is so much stack spill/reload.

Grepping for instructions of the form mov -0xf0(%rbp),%rcx or mov 0x50(%rsp),%r10 (and their reverse variants), I see that there is already a substantial amount of stack spill/reload in the _PyEval_EvalFrameDefault disassembly in the default no-frame-pointer variant (1870 of 11181 total instructions in that function, 16.7%), and it increases further in the frame-pointer version (2341 of 11733 instructions, 20%).

One more interesting observation. With no frame pointers, GCC generates stack accesses using %rsp with small positive offsets, which results in a pretty compact binary instruction encoding, e.g.:

0x00000000001cce40 <+44160>: 4c 8b 54 24 50          mov    0x50(%rsp),%r10

This uses 5 bytes. But if frame pointers are enabled, GCC switches to using %rbp-relative offsets, which are all negative. An offset like -0xf0 no longer fits in the one-byte displacement form, so the instruction takes a four-byte displacement - 7 bytes instead of 5:

0x00000000001d3969 <+53065>: 48 8b 8d 10 ff ff ff    mov    -0xf0(%rbp),%rcx

I found it pretty interesting. I'd imagine GCC should be capable of keeping %rsp-relative addressing just fine regardless of %rbp and saving on instruction size, but apparently it doesn't - I'm not sure why. But this instruction growth, coupled with the increased number of spills/reloads, actually explains the huge increase in the byte size of _PyEval_EvalFrameDefault: (2341 - 1870) * 7 + 1870 * 2 = 7037 (2 extra bytes for each of the existing 1870 instructions that switched from %rsp + positive offset to %rbp + negative offset, plus 7 bytes for each of the 471 new instructions). I'm no compiler expert, but it would be nice for someone from the GCC community to check this as well (please CC the relevant folks if you know them).

In summary, to put it bluntly, there is just more work for the CPU saving/restoring state to/from the stack. But I don't think the _PyEval_EvalFrameDefault example is typical of how application code is written, nor is it, generally speaking, a good idea to do so much within a single gigantic function. So I believe it's more of an outlier than a typical case.

Looking also at @pviktori's response in https://lists.fedoraproject.org/archives/list/python-devel@lists.fedoraproject.org/message/IKPBMMFIQDZFEG72LOW7VHO3LFWQRFDM/ I agree that microbenchmarking matrix multiplication in pure Python, instead of offloading it to native C/C++ libraries as is typically done in practice in Python apps for CPU-heavy work, is about as far from a real-world benchmark as we could get. So I hope that these microbenchmarks won't be the primary driver of the decision on enabling frame pointers.

cc @pviktori just in case he and Python community would like to look deeper into this and maybe consider splitting up _PyEval_EvalFrameDefault into subfunctions somehow to facilitate better register allocation in compilers.

I hope this was helpful.

In response to @anakryiko, my “seat of the pants” intuition is that 1% performance regression is probably acceptable (though some will disagree), 5% is too much (though some will disagree) and unlikely to ever be repaid in hypothetical performance patches enabled by better profiling, and 2-3% is difficult: some people will find it obviously acceptable and other people will find it obviously unacceptable, depending on their priorities.

Thanks for specific numbers as a guideline!

TBH, I think this decision can't be made purely based on benchmarks, as real-world workloads and microbenchmarks are different things. Building a reliable real-world-like benchmark is hard and not always possible or practical. When we at Meta enabled frame pointers fleet-wide 5 years ago, we deployed frame-pointer-enabled binaries to a separate fleet of production hosts and compared aggregated performance data between two comparable sets of hosts; the results were neutral for production work. I believe this holds in most if not all cases, even for Python apps.

And then there are also cumulative longer-term effects and benefits from the improved performance and observability data available thanks to frame pointers, which are not easily quantifiable.

So anyways, looking forward to the voting next week. Thanks!

Looking also at @pviktori's response in https://lists.fedoraproject.org/archives/list/python-devel@lists.fedoraproject.org/message/IKPBMMFIQDZFEG72LOW7VHO3LFWQRFDM/ I agree that microbenchmarking matrix multiplication in pure Python, instead of offloading it to native C/C++ libraries as is typically done in practice in Python apps for CPU-heavy work, is about as far from a real-world benchmark as we could get. So I hope that these microbenchmarks won't be the primary driver of the decision on enabling frame pointers.

Well, as it stands, we don't have anything else to give us reasonable data, so yes, the microbenchmarks are the primary driver. Note that @daandemeyer put together the benchmarks in the first place, so to some degree I have to trust that he made the benchmarks representative of the type of workloads that can be executed. And frankly, the performance drop for Python is significant because of how much Linux tooling is either mostly or purely in Python.

So anyways, looking forward to the voting next week. Thanks!

We're not meeting next week because it's Thanksgiving in the United States. We'll be meeting in two weeks.

I enabled frame pointers at Netflix, for Java and glibc, and summarized the effect in BPF Performance Tools (page 40):

"Last time I studied the performance gain from frame pointer
omission in our production environment, it was usually less than one percent, and it
was often so close to zero that it was difficult to measure. Many microservices at
Netflix are running with the frame pointer reenabled, as the performance wins found
by CPU profiling outweigh the tiny loss of performance."

I've spent a lot of time analyzing frame pointer performance, and I did the original work to add them to the JVM (which became -XX:+PreserveFramePointer). I was also working with another major Linux distro to make frame pointers the default in glibc, although I have since changed jobs and that work has stalled. I'll pick it up again, but I'd be happy to see Fedora enable it in the meantime and be the first to do so.

We need frame pointers enabled by default because of performance. Enterprise environments are monitored, continuously profiled, and analyzed on a regular basis, so this capability will indeed be put to use. It enables a world of debugging and new performance tools, and once you find a 500% perf win you have a different perspective about the <1% cost. Off-CPU flame graphs in particular need to walk the pthread functions in glibc as most blocking paths go through them; CPU flame graphs need them as well to reconnect the floating glibc tower of futex/pthread functions with the developers' code frames.

I see the comments about benchmark results of up to 10% slowdowns. It's good to look out for regressions, although in my experience all benchmarks are wrong or deeply misleading. You'll need to do cycle analysis (PEBS-based) to see where the extra cycles are, and if that makes any sense. Benchmarks can be super sensitive to degrading a single hot function (like "CPU benchmarks" that really just hammer one function in a loop), and if extra instructions (function prologue) bump it over a cache line or beyond L1 cache-warmth, then you can get a noticeable hit. This will happen to the next developer who adds code anyway (assuming such a hot function is real world) so the code change gets unfairly blamed. It will only regress in this particular scenario, and regression is inevitable. Hence why you need the cycle analysis ("active benchmarking") to make sense of this.

There was one microservice that was an outlier and had a 10% performance loss with Java frame pointers enabled (not glibc, I've never seen a big loss there). 10% is huge. This was before PMCs were available in the cloud, so I could do little to debug it. Initially the microservice ran a "flame graph canary" instance with FPs for flame graphs, but the developers eventually just enabled FPs across the whole microservice as the gains they were finding outweighed the 10% cost. This was the only noticeable (as in, >1%) production regression we saw, and it was a microservice that was bonkers for a variety of reasons, including stack traces that were over 1000 frames deep (and that was after inlining! Over 3000 deep without. ACME added the perf_event_max_stack sysctl just so Netflix could profile this microservice, as the prior limit was 128). So one possibility is that the extra function prologue instructions add up if you frequently walk 1000 frames of stack (although I still don't entirely buy it). Another attribute was that the microservice had over 1 Gbyte of instruction text (!), and we may have been flying close to the edge of hardware cache warmth, where adding a bit more instructions caused a big drop. Both scenarios are debuggable with PMCs/PEBS, but we had none at the time.

So while I think we need to debug those rare 10%s, we should also bear in mind that customers can recompile without FPs to get that performance back. (Although for that microservice, the developers chose to eat the 10% because it was so valuable!) I think frame pointers should be the default for enterprise OSes, and to opt out if/when necessary, and not the other way around. It's possible that some math functions in glibc should opt out of frame pointers (possibly fixing scimark, FWIW), but the rest (especially pthread) needs them.

In the distant future, all runtimes should come with an eBPF stack walker, and the kernel should support hopping between FPs, ORC, LBR, and eBPF stack walking as necessary. We may reach a point where we can turn off FPs again. Or maybe that work will never get done. Turning on FPs now is an improvement we can do, and then we can improve it more later.

For some more background: Eric Schrock (my former colleague at Sun Microsystems) described the then-recent gcc change in 2004 as "a dubious optimization that severely hinders debuggability", adding that "it's when people start compiling /usr/bin/* without frame pointers that it gets out of control". I recommend reading his post: [0].

The original omit-FP change was done for i386, which had only four general-purpose registers and saw big gains from freeing up a fifth; it assumed stack walking was a solved problem thanks to gdb(1), without considering real-time tracers, and the original change cites the need to compete with icc [1]. We have a different circumstance today -- 18 years later -- and it's time we updated this change.

[0] http://web.archive.org/web/20131215093042/https://blogs.oracle.com/eschrock/entry/debugging_on_amd64_part_one
[1] https://gcc.gnu.org/ml/gcc-patches/2004-08/msg01033.html

Folks, I am glad that there's more and more info, but could we please move any technical discussion to the devel mailing list? It has a much greater audience than this ticket, which was opened for process reasons.

@brendangregg, I appreciate that you offered such valuable context and feedback. It also answered a question that I couldn't find an answer to: why we omit frame pointers in the first place!

You make a reasonably compelling case for making this change, and I'm even willing to vote in favor of this change, but the problem I have is that we're going to be roasted for taking a statistically significant performance hit across the entire distribution.

All of you at Meta, Netflix, and elsewhere have the wonderful benefit of opacity, where nobody can see the consequences of your decisions in extreme detail unless you explicitly talk about it. And your conveyance is obviously filtered through the lens of the story you want to tell.

We don't have that benefit. Worse, we have things like Phoronix and other sites who will benchmark Fedora and put out headlines saying that we're the slowest Linux distribution, which is tremendously bad press. Nuance gets lost. Everyone here has been talking about the benefit for profiling and performance gains, but nobody here is saying that they'd use Fedora if we did it. Nobody is saying they'd help regain the losses if we did this. Not even the GNOME developers (who say they're interested in it) are willing to do that.

You yourself say that your developers were willing to eat a 10% performance loss. But you had the benefit of choosing not to eat that loss. Once we do that distro-wide, there's no choice there. Critically, if nobody is stepping up to commit to claw back the performance losses, then it will likely not survive to RHEL either.

Overall, I'm willing to say I'm +1 for the Change, but I'm generally concerned how we're going to be eaten alive for doing this. I'm also concerned that the Change may not survive to downstreams if we don't somehow come up with enough community interest to claw back the lost performance.

Ok, so if we assume that this change would attract lots of developers to use Fedora as their development environment, I'm willing to accept that this change will be good in the long run. As such, I'm not going to block it, and change my vote to 0 (especially if some of the worst-affected projects like Python will enable this in the future anyway).

Do you plan to make follow-up changes to compiler flags for other languages? For example, I just found out that rustc with our current default compiler flags will also omit frame pointers. The same will probably be true for some other LLVM based languages (Swift? Julia? not sure)

I sent an update to python-devel, concluding that:

I think we should treat Python 3.11 and 3.12 as entirely separate when it comes to performance with no-omit-frame-pointer.

I think that Fedora should ignore the Python benchmarks when evaluating the distro default -- and if Fedora switches to no-omit-frame-pointer, Python 3.11 should be an exception (to be re-evaluated for 3.12).
(Most of the current benchmarks [7] are from Python, so more might be needed.)


@anakryiko, I let faster-CPython devs know about your post. Not sure how it plays into their plans – that function is being overhauled, possibly to better allow optimizing the points you found, across all the cases.
If you want to contact them, IMO the best way is an issue on https://github.com/faster-cpython/ideas. I told them to ping you on GitHub if they want to follow up.

I think we can't get much more useful information from benchmarks. The work that @daandemeyer did is useful to establish that there's a potential for single-digit slowdowns, but maybe more realistic cases will not be measurably impacted. The cases where the biggest slowdowns happen seem to be caused by register pressure, and that can be improved by refactoring and/or by changes to the compiler. But to get meaningful results, we need to recompile a significant chunk of the distro with frame pointers. This will allow more realistic measurements and will also allow people to actually make use of the feature for debugging. Right now we're not making much progress on the decision because there is very little practical data, and we won't get that until we try. I think it makes sense to flip the flag in Fedora to test this. If it is enabled, I also expect that egregious cases will be noticed and fixed. I wouldn't like to see a permanent slowdown of this magnitude, but a temporary one is OK if it allows better development.

Proposal: accept the Change, but with the limitation that after two releases (i.e. before F40 is branched), we'll look at the status and decide whether to revert the change. The decision will be based on distro benchmarks and on reports from people doing profiling if they find this useful.

(In particular, I expect that Phoronix will oblige and publish detailed benchmarks once the mass rebuild is done and F38 beta is out.)

@anakryiko commented:

The size of the _PyEval_EvalFrameDefault function specifically (the other functions didn't change much in that regard) increased very significantly, from 46104 to 53592 bytes, a considerable 16% increase. Looking deeper, I believe it's all due to additional stack spills and reloads caused by having one fewer register available for keeping local variables in registers rather than on the stack.

This leads me to an important question: what is the overall impact of frame pointers on code size? Unless I missed it, this does not seem to have been discussed here at all. If 15% size increases are widespread, I would consider this a no-go no matter how small the performance impact is.

From my experience with pocket calculators using Motorola 68000 CPUs, code built with -fomit-frame-pointer always tended to be smaller than with -fno-omit-frame-pointer, but I would like to see the numbers for the architectures we care about in Fedora, i.e., mainly x86_64 and aarch64. (The former is what most people use, and the latter is where size matters most.) So, recompile all packages in Rawhide with -fno-omit-frame-pointer, and then compare the sizes.
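Such a comparison does not need anything fancy. A sketch along these lines (the directory layout and function names are hypothetical) would answer the size question per package or per architecture:

```python
import os

def tree_size(root: str) -> int:
    """Total size in bytes of the regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so nothing is double-counted.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total

def size_delta_pct(baseline_dir: str, fp_dir: str) -> float:
    """Percent size change between two unpacked build trees."""
    before, after = tree_size(baseline_dir), tree_size(fp_dir)
    return 100.0 * (after - before) / before
```

For example, point the two arguments at the unpacked contents of a package built without and with -fno-omit-frame-pointer, then aggregate across the distro.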

And then there are also cumulative longer-term effects and benefits from the improved performance and observability data available thanks to frame pointers, which are not easily quantifiable.

There is no guarantee that those will happen, at all. Nor that they would not also happen without forcing frame pointers on millions of Fedora users, most of whom will never touch a profiler in their life.


@brendangregg commented:

I enabled frame pointers at Netflix, for Java and glibc, and summarized the effect in BPF Performance Tools (page 40):

"Last time I studied the performance gain from frame pointer
omission in our production environment, it was usually less than one percent, and it
was often so close to zero that it was difficult to measure. Many microservices at
Netflix are running with the frame pointer reenabled, as the performance wins found
by CPU profiling outweigh the tiny loss of performance."

As was already pointed out several times, that applies in a closed corporate environment like the ones you have at Netflix or Meta. It does not apply the same way to a distribution shipping upstream projects, many of which do not even use the same distribution. Upstream developers using, e.g., Ubuntu or Arch could not care less about whether Fedora enables frame pointers or not. Nor is every single upstream project going to have time to run profiling on their application at all. Yet, their application will be affected by the performance hit anyway. So the tradeoff is not quite the same.

We need frame pointers enabled by default because of performance. Enterprise environments are monitored, continuously profiled, and analyzed on a regular basis, so this capability will indeed be put to use. It enables a world of debugging and new performance tools, and once you find a 500% perf win you have a different perspective about the <1% cost. Off-CPU flame graphs in particular need to walk the pthread functions in glibc as most blocking paths go through them; CPU flame graphs need them as well to reconnect the floating glibc tower of futex/pthread functions with the developers' code frames.

Those enterprise environments are free to recompile glibc with frame pointers enabled. They are in fact already doing so. Hence, I do not see why they would need Fedora to do it for them.

I see the comments about benchmark results of up to 10% slowdowns. It's good to look out for regressions, although in my experience all benchmarks are wrong or deeply misleading. You'll need to do cycle analysis (PEBS-based) to see where the extra cycles are, and if that makes any sense. Benchmarks can be super sensitive to degrading a single hot function (like "CPU benchmarks" that really just hammer one function in a loop), and if extra instructions (function prologue) bump it over a cache line or beyond L1 cache-warmth, then you can get a noticeable hit. This will happen to the next developer who adds code anyway (assuming such a hot function is real world) so the code change gets unfairly blamed. It will only regress in this particular scenario, and regression is inevitable. Hence why you need the cycle analysis ("active benchmarking") to make sense of this.

Adding extra instructions also means a size increase. That means longer downloads (and, for people unlucky enough to be on metered Internet connections, more expensive ones), more disk space use, more RAM use, and, for the reasons you describe above, also slower execution. Sure, any added code can push code above critical thresholds, but that is exactly why we should not add unnecessary code.

You argue that some later change could also push that code above the threshold anyway. To that, I reply that, with the many projects Fedora ships, there will always be some code that the added instructions from your change will push above the threshold, and that the combination of the frame pointers and later unrelated changes can end up pushing the code above the threshold together.

So while I think we need to debug those rare 10%s, we should also bear in mind that customers can recompile without FPs to get that performance back.

That is absolutely not realistic if the entire distribution is recompiled with -fno-omit-frame-pointer. Those people who care about running profilers are the ones able to recompile the entire distribution. The end users who will be hit by the performance loss are not. So I do not see why it should not be on the ones who want frame pointers enabled (i.e., not omitted) to recompile the entire distribution and not on everyone else. Also considering that the latter is by far the majority.

(Although for that microservice, the developers chose to eat the 10% because it was so valuable!)

That may make sense in your corporate environment. Not in a worldwide distribution where you will never see profiler output from all those machines that will get the 10% slowdown, so having the frame pointers on those adds no value whatsoever.

I think frame pointers should be the default for enterprise OSes,

But Fedora is most definitely not an enterprise OS!

and to opt out if/when necessary, and not the other way around.

See above why that is completely backwards.

In the distant future, all runtimes should come with an eBPF stack walker, and the kernel should support hopping between FPs, ORC, LBR, and eBPF stack walking as necessary. We may reach a point where we can turn off FPs again. Or maybe that work will never get done. Turning on FPs now is an improvement we can do, and then we can improve it more later.

If we revert to the legacy hack (frame pointers), we will never get the tooling fixed. Only if we resist the temptation to take the easy way out (with all its drawbacks), there will be enough pressure and motivation to improve the situation properly.

The original omit FP change was done for i386 that only had four general-purpose registers and saw big gains freeing up a fifth

In my experience with a much less register-starved architecture (Motorola 68000: 8 data registers, 8 address registers, of which one is inherently reserved for the stack pointer), omitting the frame pointer still helped register allocation (reducing spilling) a lot.


@ngompa commented:

@brendangregg, I appreciate that you offered such valuable context and feedback. It also answered a question that I couldn't find an answer to: why we omit frame pointers in the first place!

Well, yes, this was an upstream GCC change. IMHO, a change done for good reasons, and whose usefulness is not limited to 32-bit i686.

You make a reasonably compelling case for making this change,

I am not convinced at all. Please see my replies above.

and I'm even willing to vote in favor of this change, but the problem I have is that we're going to be roasted for taking a statistically significant performance hit across the entire distribution.

I do not understand that at all: You clearly see that there is a problem and that it will make Fedora look really bad (and IMHO, also be really bad, not just look), but you are still willing to vote for the Change. Is that not a contradiction?

I am also really disappointed that you are making a complete U-turn on your position from just a few days ago.

All of you at Meta, Netflix, and elsewhere have the wonderful benefit of opacity, where nobody can see the consequences of your decisions in extreme detail unless you explicitly talk about it. And your conveyance is obviously filtered through the lens of the story you want to tell.

We don't have that benefit. Worse, we have things like Phoronix and other sites who will benchmark Fedora and put out headlines saying that we're the slowest Linux distribution, which is tremendously bad press. Nuance gets lost. Everyone here has been talking about the benefit for profiling and performance gains, but nobody here is saying that they'd use Fedora if we did it. Nobody is saying they'd help regain the losses if we did this. Not even the GNOME developers (who say they're interested in it) are willing to do that.

You yourself say that your developers were willing to eat a 10% performance loss. But you had the benefit of choosing not to eat that loss. Once we do that distro-wide, there's no choice there. Critically, if nobody is stepping up to commit to claw back the performance losses, then it will likely not survive to RHEL either.

In the above paragraphs, you are making a very clear case for rejecting this Change, yet…

Overall, I'm willing to say I'm +1 for the Change,

Huh? How does that fit together? I am sorry, but I genuinely do not understand!

but I'm generally concerned how we're going to be eaten alive for doing this. I'm also concerned that the Change may not survive to downstreams if we don't somehow come up with enough community interest to claw back the lost performance.

Then why are you in favor of the Change?


@decathorpe commented:

Ok, so if we assume that this change would attract lots of developers to use Fedora as their development environment, I'm willing to accept that this change will be good in the long run.

But that is a pretty strong assumption that remains to be proven.

I do not think that this outweighs the impact on the vast majority of our users, who are not developers (no matter whom Workstation officially claims to target). And of those who are developers, many are developing using, e.g., web technologies that are not affected at all by this issue. (Those who came up with the Developer persona for Workstation know that too and have gone to great lengths to accommodate these developers.) And even those who are developing native code do not necessarily need frame pointers: I do C/C++ development and I have never missed them.

As such, I'm not going to block it, and change my vote to 0

Another sudden flip in opinion (even if it is not a complete reversal) that leaves me very disappointed.

(especially if some of the worst-affected projects like Python will enable this in the future anyway).

Then surely the answer needs to be to ban the Fedora packages of those projects from following the upstream change there (requiring them to use the Fedora build flags), rather than to shrug it off and use it as an excuse to degrade the performance of the rest of the distro along with Python (and others?). Your reaction sounds entirely backwards to me.

Do you plan to make follow-up changes to compiler flags for other languages? For example, I just found out that rustc with our current default compiler flags will also omit frame pointers. The same will probably be true for some other LLVM based languages (Swift? Julia? not sure)

Has anyone benchmarked how much the performance and the code size will be degraded for those languages?


@zbyszek commented:

I think we can't get much more useful information from benchmarks. The work that @daandemeyer did is useful to establish that there's a potential for single-digit slowdowns, but maybe more realistic cases will not be measurably impacted. The cases where the biggest slowdowns happen seem to be caused by register pressure, and that can be improved by refactoring and/or by changes to the compiler. But to get meaningful results, we need to recompile a significant chunk of the distro with frame pointers. This will allow more realistic measurements and will also allow people to actually make use of the feature for debugging. Right now we're not making much progress on the decision because there is very little practical data, and we won't get that until we try.

So far, this makes sense, but:

I think it makes sense to flip the flag in Fedora to test this.

The production release is not the place for such experiments! The rebuilds should be done elsewhere and put up as an unofficial Fedora Remix somewhere. If Meta and Netflix care so much about this feature, how about they provide the infrastructure for creating and hosting the rebuilds? Those companies have loads of money.

If it is enabled, I also expect that egregious cases will be noticed and fixed.

There is no evidence that this will happen at all.

I wouldn't like to see a permanent slowdown of this magnitude, but a temporary one is OK if it allows better development.

I have to disagree on that point. Even more if I see your definition of "temporary" below.

Proposal: accept the Change, but with the limitation that after two releases (i.e. before F40 is branched), we'll look at the status and decide whether to revert the change. The decision will be based on distro benchmarks and on reports from people doing profiling if they find this useful.

I am sorry, but I think this makes no sense whatsoever. You are willing to let two entire releases ship with potentially severely degraded performance, possibly also with severely increased code size, and only then decide whether the hit was actually unacceptable? This means that, even if the performance is deemed unacceptable, we will be shipping the slow code as the current code for a whole year! And the last fast release from before the Change will reach its end of life before the next fast release after it is reverted. If it even actually gets reverted at all, no matter how bad it turns out to be, because by then everyone will just have sucked up the degraded performance and lowered their expectations accordingly.

Can we please leave the experiments out of stable Fedora releases?

(In particular, I expect that Phoronix will oblige and publish detailed benchmarks once the mass rebuild is done and F38 beta is out.)

Sure they will. They will write a sensationalist article about how horrible the Fedora performance is, and the whole press will publish articles citing it and saying Fedora is a slow and bad distribution. Is that really what you want?

@kkofler We're talking about slowdowns of 0%–1%, or maybe a bit higher. (We don't actually know, because we don't have enough data about real systems.) So it's not necessary to make sensationalist remarks about "massive slowdowns" and "horrible performance".

And yes, I do believe that there's potential for long-term benefits from improved debugging and profiling. Perf is a very very cool technology but most people are not using it. I would love us to change that and have a great profiling experience on Fedora.

@kkofler We're talking about slowdowns of 0%–1%, or maybe a bit higher. (We don't actually know, because we don't have enough data about real systems.)

Well, that is exactly the issue, you do not actually know. The only benchmark that has been posted here has much bigger impact.

I have also not seen any data about the size impact.

So it's not necessary to make sensationalist remarks about "massive slowdowns" and "horrible performance".

I do not think that I am being sensationalist. I think the Meta and Netflix people are using clever wording to downplay the issue.

When that Phoronix article that you are hoping for will come out, you will see what is really sensationalist language.

And yes, I do believe that there's potential for long-term benefits from improved debugging and profiling.

Of course there is potential, but how much will it actually be used?

Perf is a very very cool technology but most people are not using it. I would love us to change that and have a great profiling experience on Fedora.

But that should not happen at literally everyone else's expense. We can ship a separate profiling-optimized Fedora, just like we ship ELN with RHEL flags. In fact, maybe ELN is even the place to enable this, if we agree with @brendangregg's claim that "frame pointers should be the default for enterprise OSes". But Fedora is not.

Proposal: accept the Change, but with the limitation that after two releases (i.e. before F40 is branched), we'll look at the status and decide whether to revert the change. The decision will be based on distro benchmarks and on reports from people doing profiling if they find this useful.

From the perspective of the change proposal authors, we'd be happy with this approach.

From my perspective, there are four reasons why I'm in favor of this Change despite the beating we'll take initially for doing this:

  1. Making perf and related tools more useful in Fedora will make it more attractive for developers.
  2. KDE developers that use Hotspot and GNOME developers that use Sysprof will tremendously benefit just in day-to-day debugging and development, especially around performance analysis.
  3. Python 3.12 is going to force including frame pointers anyway, so we're going to take that hit no matter what next year.
  4. Red Hat Enterprise Linux won't ship these flags unless Fedora does it first and we see take-up by developers to leverage the new data to do performance profiling and improve performance across the board. If we want more folks in our ecosystem, especially enterprise contributors (which I firmly believe we do), we should make our platform better for them too.

Thus, I'm +1 to @zbyszek's proposal, because it gives us an opportunity to broadly expose this and see whether we can prove the thesis that providing more working tools for performance analysis will lead to people actually improving performance of the software in the Linux platform (especially on the Linux desktop!).

From my perspective, there are four reasons why I'm in favor of this Change despite the beating we'll take initially for doing this:

Thank you for at least trying to lift my confusion, your explanation makes me understand better what your rationale is, though I still do not agree with it.

1. Making perf and related tools more useful in Fedora will make it more attractive for developers.

But if that comes at the expense of Fedora's overall performance, size, and reputation, it is a high price to pay.

2. KDE developers that use Hotspot and GNOME developers that use Sysprof will tremendously benefit just in day-to-day debugging and development, especially around performance analysis.

Do you know how many KDE developers use Hotspot (a KDAB tool hosted in KDAB's GitHub namespace) vs. KCachegrind (the official kdesdk profiling tool)?

3. Python 3.12 is going to force including frame pointers anyway, so we're going to take that hit no matter what next year.

As I understand it, Python upstream is going to recommend it, not force it. They also have no way to force it as long as Python is Free Software. To avoid taking the hit, FESCo can ban Python in Fedora from unilaterally shipping with frame pointers enabled.

4. Red Hat Enterprise Linux won't ship these flags unless Fedora does it first and we see take-up by developers to leverage the new data to do performance profiling and improve performance across the board. If we want more folks in our ecosystem, especially enterprise contributors (which I firmly believe we do), we should make our platform better for them too.

If RHEL wants this, it should be enabled in ELN, not in Fedora proper. Just like the reduced hardware support (another user-unfriendly compiler flag change) was.

Thus, I'm +1 to @zbyszek's proposal, because it gives us an opportunity to broadly expose this and see whether we can prove the thesis that providing more working tools for performance analysis will lead to people actually improving performance of the software in the Linux platform (especially on the Linux desktop!).

Why can we not have that opportunity by shipping the packages with frame pointers in ELN or in a dedicated side repository?

From my perspective, there are four reasons why I'm in favor of this Change despite the beating we'll take initially for doing this:

Thank you for at least trying to lift my confusion, your explanation makes me understand better what your rationale is, though I still do not agree with it.

  1. Making perf and related tools more useful in Fedora will make it more attractive for developers.

But if that comes at the expense of Fedora's overall performance, size, and reputation, it is a high price to pay.

  2. KDE developers that use Hotspot and GNOME developers that use Sysprof will tremendously benefit just in day-to-day debugging and development, especially around performance analysis.

Do you know how many KDE developers use Hotspot (a KDAB tool hosted in KDAB's GitHub namespace) vs. KCachegrind (the official kdesdk profiling tool)?

At least a couple I've talked to who use Fedora Linux for development have indicated they use Hotspot for this purpose.

  3. Python 3.12 is going to force including frame pointers anyway, so we're going to take that hit no matter what next year.

As I understand it, Python upstream is going to recommend it, not force it. They also have no way to force it as long as Python is Free Software. To avoid taking the hit, FESCo can ban Python in Fedora from unilaterally shipping with frame pointers enabled.

The Python maintenance team has already indicated they're going to follow the recommendation. From our perspective, that means the Python stack is taking the hit no matter what.

  4. Red Hat Enterprise Linux won't ship these flags unless Fedora does it first and we see take-up by developers to leverage the new data to do performance profiling and improve performance across the board. If we want more folks in our ecosystem, especially enterprise contributors (which I firmly believe we do), we should make our platform better for them too.

If RHEL wants this, it should be enabled in ELN, not in Fedora proper. Just like the reduced hardware support (another user-unfriendly compiler flag change) was.

Thus, I'm +1 to @zbyszek's proposal, because it gives us an opportunity to broadly expose this and see whether we can prove the thesis that providing more working tools for performance analysis will lead to people actually improving performance of the software in the Linux platform (especially on the Linux desktop!).

Why can we not have that opportunity by shipping the packages with frame pointers in ELN or in a dedicated side repository?

The problem with ELN is that it doesn't build the whole distribution. It only builds enough for RHEL and select stuff that has been added as "workloads". It's also not generally accessible or usable for people to deploy and develop on.

As for a dedicated side repository: our build system infrastructure is horrible for that. It is unbelievably difficult to do this correctly with Koji. We don't have the same kind of superpowers that the openSUSE Build Service has, and there's no appetite from the Red Hat build system team to add the equivalent functionality to make it possible. A third party contributing it isn't going to work either, because the Koji "team" is essentially @mikem and @tkopecek and there's just no bandwidth for them to review and integrate work from outside that would enable that capability right now.

For better or worse, this is the best way to roll it out.

The Python maintenance team has already indicated they're going to follow the recommendation. From our perspective, that means the Python stack is taking the hit no matter what.

This could easily be overruled by a mandate from FESCo or FPC.

This decision is clearly within FESCo's decision competence, so I do not see why you folks are willing to accept whatever the Python package maintainers want to do as a given.

As for a dedicated side repository: our build system infrastructure is horrible for that.

As I already mentioned: The users who want this are Meta (Facebook) and Netflix. Companies with deep pockets! I do not see why it should not be on those companies (who want that Change so badly) to provide the infrastructure.

For better or worse, this is the best way to roll it out.

I have to disagree. This is the absolute worst way, treating the users as guinea pigs and considering a revert only after one year of the regressions shipping in stable releases. Experiments need to be done in experimental repositories.

We have already shipped way too many failed experiments that eventually had to be reverted, e.g., Modularity.

Proposal: Reject the Change as is. Meta and/or Netflix should provide infrastructure for a side repository in which the change can be tested and benchmarked and the code size measured. Packages in Fedora, including but not limited to Python, SHOULD NOT enable frame pointers before the evaluation is done, and MUST NOT do so without a FESCo-approved exception. Considering the known performance impact, such an exception will NOT be granted for Python, which as a result MUST ship without frame pointers until the evaluation is done. The Change will be reevaluated for Fedora 40. If the impact on performance or code size turns out to be unacceptable, it will be rejected permanently.

As for a dedicated side repository: our build system infrastructure is horrible for that. It is unbelievably difficult to do this correctly with Koji.

The rebuilds do not necessarily have to be done with Koji. They can be done, e.g., with a CLI mass-rebuild script.

The Python maintenance team has already indicated they're going to follow the recommendation. From our perspective, that means the Python stack is taking the hit no matter what.

This could easily be overruled by a mandate from FESCo or FPC.

This decision is clearly within FESCo's decision competence, so I do not see why you folks are willing to accept whatever the Python package maintainers want to do as a given.

I would not support such a motion, because in this case, Python wants it because they're actively doing something that necessitates it. They're doing what I want with that extra data.

As for a dedicated side repository: our build system infrastructure is horrible for that.

As I already mentioned: The users who want this are Meta (Facebook) and Netflix. Companies with deep pockets! I do not see why it should not be on those companies (who want that Change so badly) to provide the infrastructure.

This is actually blocked on Fedora's governance. Multiple offers for infrastructure contributions have been made over the years, but it gets stuck because of how Fedora is set up.

The machine with the rebuilds does not have to be integrated with Fedora infrastructure at all. They just need to take the Fedora SRPMs, run them through a rebuild script, and upload them as a Fedora Remix. Kinda like how Rocky Linux and AlmaLinux are working (and CentOS and Scientific Linux used to work). I do not see why this requires any kind of approval by or interaction with Fedora governance.

Python wants it because they're actively doing something that necessitates it.

It is a mode that is not the default mode of operation of Python, but only enabled through a special environment variable. I think it would be fine for such a mode to require using python-debug or a dedicated python-perf (I would be fine with FESCo granting an exception for such a subpackage as long as it is not the Python installed by default), or the Fedora Remix I suggest creating (which would also provide the frame pointers for glibc etc., not just Python).
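For reference, that opt-in mode is roughly this (a sketch assuming Python 3.12+, where the perf trampoline support landed; `busy.py` is a made-up example workload):

```shell
# Create a small workload to profile (hypothetical example script).
cat > busy.py <<'EOF'
print(sum(i * i for i in range(1_000_000)))
EOF

# The trampoline is off by default; it is enabled per run via an
# environment variable or interpreter option (Python 3.12+):
PYTHONPERFSUPPORT=1 python3 busy.py
# equivalently: python3 -X perf busy.py

# With perf installed, a profiling session would then look like:
#   perf record -g -- env PYTHONPERFSUPPORT=1 python3 busy.py
#   perf report
```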

For better or worse, this is the best way to roll it out.

I have to disagree. This is the absolute worst way, treating the users as guinea pigs and considering a revert only after one year of the regressions shipping in stable releases. Experiments need to be done in experimental repositories.

We have already shipped way too many failed experiments that eventually had to be reverted, e.g., Modularity.

Modularity was a different case, where Red Hat tried to do something without data, without proper engineering funding, and without a desire to adapt to community needs. It also technically was not reverted.

As for other "experiments", I think that means the process is working. We do things, it works or it doesn't, and we respond accordingly.

The machine with the rebuilds does not have to be integrated with Fedora infrastructure at all. They just need to take the Fedora SRPMs, run them through a rebuild script, and upload them as a Fedora Remix. Kinda like how Rocky Linux and AlmaLinux are working (and CentOS and Scientific Linux used to work). I do not see why this requires any kind of approval by or interaction with Fedora governance.

That is basically saying they should take their ball and go home. Moreover, it eliminates the principal benefit of doing it in Fedora: broad exposure and usefulness. I would not prefer they do that when they can do it here, integrate their improvements across the distribution, and continue to work here as a preferred place to build the best Linux platform.

Modularity was a different case, where Red Hat tried to do something without data,

This Change is also being done almost without data. The one benchmark that was posted leaves many open questions.

without proper engineering funding,

This Change has no engineering funding from Red Hat at all, does it?

and without a desire to adapt to community needs.

Nor do I see such a desire in this Change. This is clearly putting the needs of some developers over everyone else's.

It also technically was not reverted.

It was ruled that packages must not be in modules only, there are no modules installed by default, and it was made easy for users to remove the modular repository altogether. So central parts of the original Change were reverted. It was turned into an option rather than something forced onto all users, exactly what I would like to see for this Change as well.

As for other "experiments", I think that means the process is working. We do things, it works or it doesn't, and we respond accordingly.

If the experiment is so broken that it has to be reverted, it should never have reached a purportedly stable release.

With such statements, are you really surprised that the press calls Fedora users RHEL beta testers?

That is basically saying they should take their ball and go home.

No, it is saying (to stick with the analogy) that they should take their ball to the soccer pitch in the recess courtyard and invite people to play soccer there (and so will we) instead of attempting to play soccer in the classroom, causing noise in the entire school and risking breaking things.

Moreover, it eliminates the principal benefit of doing it in Fedora: broad exposure and usefulness.

Not if we actively send interested developers to the Remix. If it does not include anything violating copyrights or patents (and I hope we can trust Meta or Netflix on that), I see no reason why we would not be able to do that, even if the packages are built outside of Fedora infrastructure. Just like Fedora is now actively promoting Flathub and some (but not all) third-party RPM repositories.

I would not prefer they do that

Why not? It would bring the best of both worlds: regular users can keep enjoying the performance (and probably smaller size, though we are missing data on that) of Fedora, whereas developers who are interested in fast profiling can easily opt into the rebuilt packages.

when they can do it here,

But they cannot do it here in a way that does not lead to regressions for our end users.

integrate their improvements across the distribution,

But the Change is not an improvement from an end user point of view.

and continue to work here as a preferred place to build the best Linux platform.

I do not see why a Remix would not work at least as well.

And if (and only if) the benchmarks on the Remix do not show a significant performance loss or a significant size increase, then we can consider making the change in Fedora proper. (But if they do, then we are better off leaving it in an opt-in Remix forever.)

My proposal does not preclude making this decision for Fedora 40 in any way. It just says we should not ship the untested Change to our end users for an entire year, and it also provides an alternative path (keeping the Remix) to follow if the results end up as bad as I fear, unlike @zbyszek's proposal, which leaves us with nothing if we end up having to revert.

Proposal: Reject the Change as is. Meta and/or Netflix should provide infrastructure for a side repository in which the change can be tested and benchmarked and the code size measured. Packages in Fedora, including but not limited to Python, SHOULD NOT enable frame pointers before the evaluation is done, and MUST NOT do so without a FESCo-approved exception. Considering the known performance impact, such an exception will NOT be granted for Python, which as a result MUST ship without frame pointers until the evaluation is done. The Change will be reevaluated for Fedora 40. If the impact on performance or code size turns out to be unacceptable, it will be rejected permanently.

I am all for a technical discussion (as much as I dislike having it in this ticket), but you ultimately just proposed to ban Python Maint from doing something instead of having a conversation with us. That is not productive.

Would you be more willing to approve the proposal leaving off the sentence "Considering the known performance impact, such an exception will NOT be granted for Python, which as a result MUST ship without frame pointers until the evaluation is done.", keeping the path of a FESCo-approved exception for Python open?

And I am sorry that I seem to have offended you with my proposal, it was not my intention.

The impression I have gotten so far was that it was @pviktori from Python Maint who basically withdrew from the conversation towards what I would paraphrase as "upstream will recommend frame pointers, so we will ship them no matter what" (hoping that is an accurate summary and not a strawman). I would very much like there to be a conversation involving all players involved: Python Maint, FESCo, and the Fedora community at large. But I would like to urge Python Maint to refrain from making unilateral changes, even if the recommendation comes from upstream, without discussing the impact with other affected players. Python is used by basically all Fedora users (DNF 5 notwithstanding).

PS: I am pretty sure that an acceptable compromise could be found, such as a non-default frame-pointer-enabled build of the Python interpreter in a subpackage (or as a separate SRPM if it makes things technically easier).

In https://lists.fedoraproject.org/archives/list/python-devel@lists.fedoraproject.org/message/ZVDEXGPU6JQFXB3XHYZ4IXVQNNR3YM3V/ @pviktori basically said that "if Fedora switches to no-omit-frame-pointer, Python 3.11 should be an exception (to be re-evaluated for 3.12). "

So if that ("to be re-evaluated") is really Python Maint's position, why do at least 2 FESCo members (@decathorpe and @ngompa) assume that it is already a given that Python will be built with frame pointers enabled no matter the outcome of this issue? There seem to be quite some misunderstandings.

So if that ("to be re-evaluated") is really Python Maint's position, why do at least 2 FESCo members (@decathorpe and @ngompa) assume that it is already a given that Python will be built with frame pointers enabled no matter the outcome of this issue? There seem to be quite some misunderstandings.

Because Python's compilation flags already differ from the rest of the distribution, and as @pviktori is a member of Python upstream, it is incredibly unlikely he will deviate from what upstream recommends.

So to be clear, I believe that there should definitely be discussion among the involved players, and I am pretty sure that a reasonable compromise can be reached. As I said, Python on Fedora could offer both a default build without frame pointers (possibly even with perf integration disabled altogether, so people attempting to use it get a warning pointing them to the alternate build) and an alternate build with frame pointers. But the possibility of a FESCo mandate should not be preemptively precluded.

In any case, I do not believe that the decision on this Change should depend on the Python Maint team's plans, because those plans are not final and can be overruled.

If we exclude Python, then we have no significant reason not to do it. Python was the most painful benchmark, and the rest have insignificant hits that could easily be made back with the benefit of the performance tooling this Change enables.

There were almost no non-Python benchmarks. The ones that were there, GCC and Blender, had around a 2% performance hit, which is twice the upper end of the 0-1% @zbyszek is claiming, and in @music's middle range ("2-3% is difficult: some people will find it obviously acceptable and other people will find it obviously unacceptable, depending on their priorities").

And "could easily be made back" is theoretical: it does not mean that it will ever happen, nor that it could not also happen in another way (e.g., through rebuilds with frame pointers in an unofficial Remix as I am suggesting, through another distribution, or through a tool like Craft or JHBuild), allowing us to have both the win from omitting the frame pointer and the benefits of profiling.

As I already mentioned: The users who want this are Meta (Facebook) and Netflix. Companies with deep pockets! I do not see why it should not be on those companies (who want that Change so badly) to provide the infrastructure.

GNOME developers still want this too.

OK, but if we can get Meta or Netflix to provide the infrastructure (and Meta are the ones who pushed for this Change to begin with), GNOME will also benefit for free.

This change should be rejected, because Facebook and other big corporations have enough resources to build their own package set with debugging profile enabled.

99.999% shouldn't suffer because of 0.001%. 2.5%+ performance regression is unacceptable for a general purpose distribution.

This change should be rejected, because Facebook have enough resources to build their own package set with debugging profile enabled.

99.999% shouldn't suffer because of 0.001%. 2.5%+ performance regression is unacceptable for a general purpose distribution.

This is hyperbolic and also not true. As @catanzaro points out, GNOME wants this too. And KDE developers using profiling tools will benefit from it too. Those two stakeholders make it much more than "0.001%". If anything, that probably makes up a little under a quarter of Fedora's developer community (back of the napkin estimate).

Maybe 0.001% is hyperbolically low, but I doubt our "developer community" as a whole is more than 1% of the total user base. And you estimate that only a quarter of those would benefit.

As @catanzaro points out, GNOME wants this too. And KDE developers using profiling tools will benefit from it too.

Fedora is a general purpose, not a developer-oriented distribution.

Those two stakeholders make it much more than "0.001%".

I don't think so. Millions of users vs. 10-20 developers (in theory). Incommensurable.

The impact on CPython benchmarks can be anywhere from 1-10% depending on the specific benchmark

It looks terrible.

Not to mention that, as has already been pointed out more than once, developers can rebuild the packages for their needs much more easily than end users. So the binary packages should be optimized for end users, not developers.

Not to mention that, as has already been pointed out more than once, developers can rebuild the packages for their needs much more easily than end users.

True. Also, developers can use COPR to build whatever they want with flags they need.

As evidenced by a number of other changes of this scope that have passed through before, it's been made very clear that COPR does not work well for this case. That's why it's not being used for the C99 work, and it wasn't used for other previous compiler changes either.

As evidenced by a number of other changes of this scope that have passed through before, it's been made very clear that COPR does not work well for this case.

  1. Create a new COPR repository.
  2. Change the default flags in redhat-rpm-config, bump its version, release, or epoch, and build it in this COPR.
  3. Build whatever you want with overridden flags.
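Concretely, the workflow could look something like this (a sketch; the repo name `fp-test` is hypothetical, and the `copr-cli` calls are illustrative only):

```shell
# Hypothetical COPR workflow for benchmarking the flag change:
#   copr-cli create fp-test --chroot fedora-rawhide-x86_64
#   # rebuild redhat-rpm-config with -fno-omit-frame-pointer added to its
#   # flag macros and a bumped Release, then build it into the repo:
#   copr-cli build fp-test redhat-rpm-config-*.src.rpm
#   # later builds in fp-test then inherit the new default flags:
#   copr-cli build fp-test some-benchmark-package.src.rpm

# The flag's effect is easy to confirm locally (x86-64 GCC assumed):
echo 'void f(void) {}' > t.c
gcc -O2 -fno-omit-frame-pointer -S -o - t.c | grep -c rbp  # frame set up
gcc -O2 -fomit-frame-pointer    -S -o - t.c | grep -c rbp  # prints 0
```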

Indeed, I do not see why Copr could not be used for this.

And even if that were the case, I do not see why we would need to all take the penalty for limitations in Copr. Rebuilds can be done in external infrastructure as well. E.g., the tools Rocky Linux is using (Rocky Devtools) are public.

OK, but if we can get Meta or Netflix to provide the infrastructure (and Meta are the ones who pushed for this Change to begin with), GNOME will also benefit for free.

To be clear, Meta would be happy to contribute to Fedora Infrastructure, both specifically around this Change and in general. We've actually tried doing so in the past, but as @ngompa mentioned there are complicating factors in play. And in case it's not obvious, @daandemeyer has been leveraging both Fedora (notably copr) and Meta resources for the benchmarking efforts that went into this Change.

On the subject of a Remix: this isn't a matter of resourcing, but of opportunity IMO. The point of this Change is to allow folks running Fedora to leverage modern profiling tools on their systems to troubleshoot and fix issues as they happen. An unofficial Remix built on external infra (which is all this could be with the present constraints) wouldn't be particularly useful for folks running Fedora -- one would need to run the Remix to see any benefit, which isn't something I see happening in practice, especially for desktop use cases. It's not just a matter of cherry-picking or rebuilding individual packages -- for profiling to be useful (and especially continuous profiling), the whole dependency chain needs to use frame pointers, and practically speaking that extends to the entire distribution for any non-trivial scenario.

It would actually be very easy for Fedora users to migrate to your Remix: they just need to install the .repo file one way or another. If you bump the EVRs while rebuilding (just appending a repotag should be sufficient), then dnf update or dnf distro-sync will pull everything in; if you keep the same EVRs, then disabling the Fedora repositories and running dnf reinstall '*' will do the trick.
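As a sketch (the repo URL and repotag are hypothetical; the dnf calls are shown but commented out, since there is no such Remix today):

```shell
# Appending a repotag to Release makes the rebuilt EVR sort newer, so a
# plain update pulls in the Remix packages. Ordering illustrated below
# (sort -V approximates rpm's version comparison for this case):
printf '1.0-2.fc39\n1.0-2.fc39.remix1\n' | sort -V   # remix EVR sorts last

# Bumped EVRs: add the repo and update (hypothetical URL):
#   sudo dnf config-manager --add-repo https://example.org/fp-remix.repo
#   sudo dnf distro-sync
# Identical EVRs: swap the package contents instead:
#   sudo dnf --disablerepo='fedora*' reinstall '*'
```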

A separate developer-oriented Fedora remix, maintained by a big company, is the best choice, IMO. Developers who need extended profiling can use it.

Maybe 0.001% is hyperbolically low, but I doubt our "developer community" as a whole is more than 1% of the total user base. And you estimate that only a quarter of those would benefit.

This line of argumentation is really not sitting well with me, because when our developers are able to make performance improvements, that benefits our users.

If the first step to doing performance work is "replace all your RPM packages with alternative versions provided by a remix repository," then that performance work is just not going to happen. Also, maintaining such a remix sounds like a tremendously huge effort.

If the first step to doing performance work is "replace all your RPM packages with alternative versions provided by a remix repository," then that performance work is just not going to happen.

Why not? You are going to have to do other preparations anyway, such as installing perf and/or other tools, installing all the -debuginfo packages, etc.

If the first step to doing performance work is "replace all your RPM packages with alternative versions provided by a remix repository," then that performance work is just not going to happen.

Why not? You are going to have to do other preparations anyway, such as installing perf and/or other tools, installing all the -debuginfo packages, etc.

Yes to installing perf, but these days installing -debuginfo packages is no longer required.

Shouldn't this discussion happen on the devel list?

If the first step to doing performance work is "replace all your RPM packages with alternative versions provided by a remix repository," then that performance work is just not going to happen. Also, maintaining such a remix sounds like a tremendously huge effort.

@catanzaro I'm worried that Fedora itself ends up as that remix, that is, everyone will just use openSUSE or Debian for day-to-day use because of the few extra % they get from not enabling certain performance debugging tools. Basically, Fedora as a research tool for other distributions (curiously except Red Hat Enterprise Linux—our performance team doesn't seem to be concerned that much by the lack of frame pointers).

but these days installing -debuginfo packages is no longer required.

Sure, if you are fine with every single user on a multi-user machine having a copy of the same debugging information, then it is not required. Any reasonable system administrator is going to disable debuginfod, because doing this per user is just wrong.
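(For reference, the toggle in question is just an environment variable; a sketch assuming the elfutils debuginfod client, whose per-user cache is what the objection above is about:)

```shell
# debuginfod is driven by DEBUGINFOD_URLS; emptying it disables the
# per-user on-demand download cache entirely for this shell/session:
export DEBUGINFOD_URLS=""

# The cache each user would otherwise accumulate their own copy of:
ls ~/.cache/debuginfod_client 2>/dev/null || echo "no cache"
```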

At Google we use -fno-omit-frame-pointer with -momit-leaf-frame-pointer; the recollection is that adding the second option is worth less than 1% in performance, but worth it at data center scale. The numbers are from older chips; newer chips have technologies like memory renaming, which mean that a hot memory location can be speculatively promoted to a register. I suspect the slowdowns observed are because the compilers haven't been properly tuned and owe more to inlining choices than to the loss of a register. Java was mentioned: in Java, R15 is dedicated as a current-thread pointer. It is possible to win this register back by using a selector (%fs or %gs), as is done in the ART runtime on Android and in regular thread-local storage.

My points are:
- if one register is so important, we have ways to win it back by being smarter (you could even move the frame pointer to thread-local storage),
- big companies run with -fno-omit-frame-pointer, but -momit-leaf-frame-pointer may be worth exploring,
- performance currently looks worse due to a lack of tuning for what is an uncommon compiler option in GCC,
- with newer hardware, memory renaming blurs the distinction between a memory and a register cost, making it hard to assert that memory accesses will definitely be slow.

Note, with -momit-leaf-frame-pointer the last allocated register is RBP in order to allow more stack traces, albeit with the caller of the leaf function missing.

-fno-omit-frame-pointer -momit-leaf-frame-pointer sounds to me like a worst-of-both-worlds compromise: You get neither the reliable backtraces without using unwinding information (as you pointed out, a function will always be missing in the traces, and sometimes the register can be used by the leaf function for something else entirely) nor the performance of -fomit-frame-pointer (though the performance hit is slightly less than with frame pointers everywhere).

As evidenced by a number of other changes of this scope that have passed through before, it's been made very clear that COPR does not work well for this case. That's why it's not being used for the C99 work, and it wasn't used for other previous compiler changes either.

I may have agreed with this a few years ago, but COPR has improved tremendously over the last few years, and I know that I, personally, would use it if I were implementing a system-wide compiler change.

I also don't think that it's true that it won't be used for the C99 work.

This line of argumentation is really not sitting well with me, because when our developers are able to make performance improvements, that benefits our users.

You know that it's not true. Only a few big corporations will get benefit from this change. They will use Fedora in development, but switch to another distribution in production because the 2.5%+ performance penalty per server is a huge price to pay and will cost them money.

Fedora will gain the reputation of being the slowest and most debug-oriented distro, and we will also lose a lot of end users.

Also, maintaining such a remix sounds like a tremendously huge effort.

Corporations want benefits for themselves, but don't want to spend human resources on implementing their own test branch. And everyone else will suffer.

As an alternative, I suggest creating the Rawhide Debug branch with automatic Koji rebuilds with debug flags present (just like Fedora ELN). Everyone will be happy.

At Google we use -fno-omit-frame-pointer with -momit-leaf-frame-pointer; the recollection is that adding the second option is worth less than 1% in performance, but worth it at data center scale.

Would you please qualify this statement a bit? I looked at the precompiled _amd64.syso files in the Go source tree (for TSAN and BoringCrypto), and they happily clobber %rbp, so they are not even frame-pointer-preserving. libssn3.so.1d in the Chrome distribution does not seem to be frame-pointer-preserving, either. (Any use of %ebp is a red flag.) The Go compiler builds binaries with frame pointers, though.

Maybe you could also elaborate why skipped frames with -momit-leaf-frame-pointer aren't a problem for you?

@brendangregg Maybe you could comment on this as well? You said you built glibc with frame pointers, but I assume you didn't patch the string functions to add frame pointers. This must lead to skipped frames, too. By cycle count, the string functions are probably the most extensively used glibc functions (particularly if you do not use glibc malloc), so this must have been visible.

  • performance currently looks worse due to a lack of tuning for what is, for GCC, an uncommon compiler option,

It's the default for AArch64 and POWER among others, so the generic parts are well-exercised.

@anakryiko's analysis above (search for _PyEval_EvalFrameDefault) suggests that GCC on x86-64 indeed has gaps. To me, this strongly suggests that we should make the change in upstream GCC once GCC is ready, not just downstream.

I wasn't aware that there were such huge differences between GCC and Clang. It means that we can't readily take experience reports from Clang users (which I believe includes many parts of Google) and assume that they apply to a GCC-built distribution such as Fedora.

Note: with -momit-leaf-frame-pointer, RBP is the last register to be allocated, in order to allow more stack traces, albeit with the caller of the leaf function missing.

I suspect there must also be compilers and similar tools out there that do not clobber RBP in non-leaf functions unless it is used as a frame pointer. Then you also get missed frames in the middle of the stack. For example, the Python perf trampoline (which is used to justify switching on frame pointers in Python) does not use a frame pointer, it merely preserves it. As far as I understand it, it means that the immediately calling function is dropped from backtraces.
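To make the skipped-frame mechanics concrete, here is a toy model of a frame-pointer walk (my own sketch, not a real unwinder; the function names are made up). It shows both cases discussed above: a leaf compiled with -momit-leaf-frame-pointer, and an RBP-preserving trampoline in the middle of the stack.

```python
# Toy model of x86-64 frame-pointer unwinding (a sketch, not a real
# unwinder). A frame record is (saved_rbp, return_address): exactly what
# "push %rbp; mov %rsp, %rbp" leaves on the stack.

def unwind(pc, rbp, frames):
    """Walk the frame-pointer chain, as perf's --call-graph=fp mode does."""
    trace = [pc]
    while rbp is not None:
        saved_rbp, ret = frames[rbp]
        trace.append(ret)
        rbp = saved_rbp
    return trace

# Simulated stack for main -> work -> trampoline -> callee, where the
# trampoline preserves RBP but pushes no frame record of its own.
frames = {
    "F_main":   (None, "_start"),          # main's record: return into _start
    "F_work":   ("F_main", "main"),        # work's record: return into main
    # trampoline: no frame record at all
    "F_callee": ("F_work", "trampoline"),  # saved RBP skips the trampoline
}

# Case 1: RBP-preserving trampoline. The trampoline itself shows up (via
# callee's return address), but "work" -- its immediate caller -- is gone.
print(unwind("callee", "F_callee", frames))
# ['callee', 'trampoline', 'main', '_start']

# Case 2: -momit-leaf-frame-pointer. A leaf pushes nothing, so at sample
# time RBP still points at the caller's frame; the leaf's caller is gone.
print(unwind("leaf", "F_work", frames))
# ['leaf', 'main', '_start']
```

In both cases the walk succeeds without errors, which is exactly why these gaps are easy to miss: the trace looks plausible, it is just missing a frame.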

You know that's not true. Only a few big corporations will benefit from this change. They will use Fedora in development, but switch to another distribution in production, because the 2.5%+ performance penalty per server is a huge price to pay and will cost them money.

This is false, all the software that Meta builds itself is built with frame pointers. And if CentOS gets to a point where its packages are built with frame pointers, we will deploy those packages without recompiling to our entire fleet as well. This is exactly the point we're trying to make: The performance improvements we can make by being able to profile software that's running in production far outweigh the performance impact of building all software with frame pointers.

This line of argumentation really doesn't sit well with me, because when our developers are able to make performance improvements, that benefits our users.

You know that's not true. Only a few big corporations will benefit from this change. They will use Fedora in development, but switch to another distribution in production, because the 2.5%+ performance penalty per server is a huge price to pay and will cost them money.

Wasting developer time costs way more than even a 10% performance loss. If developer efficiency and productivity can be improved and there's a possibility to make back the loss, usually that's considered a win.

Would a survey possibly help inform this decision? After all, we're trying to balance different unknowns. So maybe ask the user base:

  • would you be concerned about a 1%, 5%, or 10% slowdown?
  • how likely are you personally to run these profiling tools?
  • how long would you accept a slowdown until profiling-based or other improvements win it back? 1 release? 2? Rawhide only?

Would a survey possibly help inform this decision?

Unfortunately no. I'm pretty sure that if F38 was slower by 2% for an unannounced reason, nobody would even notice, because such an amount is hard to notice. But if you ask people if they are willing to give up 2% of performance, many will make it a hill to die on. (E.g.: do you see Fedora users picketing to disable CPU bug mitigations in our kernels? I don't. And the slowdowns from that are much bigger in many scenarios.)

Also, not many people use profiling tools. If we accept this change, we should certainly drum up awareness by publishing articles about whole-stack profiling on Fedora. We create both demand and supply at the same time.

Also, we could ask about 1/5/10%, but we don't really know what the number will be. Once builds with the new flags are being made, we'll get a feedback loop of improvements to packages (and maybe the compiler), and the numbers will change. Using surveys to resolve complex technical questions just doesn't work.

Overall, I think the concerns are massively overblown. Both about the impact of the slowdowns (which I expect will be not-readily-noticeable in normal use), and the response from reviewers/users/media (if we explain why we're doing this), and the long-term risks (because if it turns out more costly than expected, we can just undo the flag change).

E.g.: do you see Fedora users picketing to disable CPU bug mitigations in our kernels? I don't. And the slowdowns from that are much bigger in many scenarios.

All my friends use mitigations=off. Me too.

@zbyszek:

Would a survey possibly help inform this decision?

Unfortunately no. I'm pretty sure that if F38 was slower by 2% for an unannounced reason, nobody would even notice, because such an amount is hard to notice. But if you ask people if they are willing to give up 2% of performance, many will make it a hill to die on.

I consider this "we know better than the users" attitude to be really non-constructive.

Not only will attentive users notice a 2% performance hit, but the sum of that with other sources of performance decrease (e.g., creeping software bloat, or the CPU bug mitigations that you brought up) will make it even more noticeable.

(E.g.: do you see Fedora users picketing to disable CPU bug mitigations in our kernels? I don't. And the slowdowns from that are much bigger in many scenarios.)

I do not believe that this is a fair comparison:

  • CPU bug mitigations are about security. Frame pointers are not. Users will be really careful when security is affected, and often willing to accept any kind of inconvenience for it (see also SELinux). And in the case of Spectre, the potential impact is so far-reaching that it is really hard to know whether one is affected. When this initially came out, I thought that my single-user machine would surely not be affected and that I should just disable those mitigations, but then reports came out that Spectre can even be exploited by JavaScript on a website! (IMHO, allowing websites to execute client-side code was a huge mistake to begin with, but alas, that ship has sailed; the modern-day web has become unusable without JavaScript.)
  • Many of those mitigations, at least the ones in the kernel, can be disabled at runtime with a simple kernel command-line option. Frame pointers cannot; one has to rebuild the entire distribution to get rid of them.

Also, not many people use profiling tools.

That is actually a reason to not accept the change, since it means "not many people" (your words!) will benefit from it.

If we accept this change, we should certainly drum up awareness by publishing articles about whole-stack profiling on Fedora. We create both demand and supply at the same time.

There is no evidence that this will actually lead to more people using profiling. The lack of demand is more likely to be due to the fact that people are simply not interested in profiling, and no amount of advertising will change that.

Non-developer end users will have no use for profiling because they will not be able to do anything with the data.

Also, we could ask about 1/5/10%, but we don't really know what the number will be. Once builds with the new flags are being made, we'll get a feedback loop of improvements to packages (and maybe the compiler), and the numbers will change.

That is why the builds with the changed flags should be made in an opt-in side repository so they can be benchmarked against the official unchanged Fedora before being merged into it.

Doing things the way you propose means:

  • you will be releasing unbenchmarked builds to end users, and
  • you will be removing (overwriting) the baseline for comparison, so there is no way to do an apples-to-apples benchmark (short of rebuilding everything without frame pointers in a side repository, so a side repository would be needed anyway).
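To sketch what an apples-to-apples comparison could actually report, here is a hypothetical helper (the numbers are made up; real data would come from paired runs of the same benchmark against the unchanged baseline repository and the frame-pointer side repository):

```python
# Hypothetical helper for comparing a baseline build against a
# frame-pointer build of the same package: run the same benchmark
# repeatedly against each, then report the median relative slowdown.
from statistics import median

def relative_slowdown(baseline_runs, candidate_runs):
    """Fraction by which the candidate is slower (negative = faster)."""
    b, c = median(baseline_runs), median(candidate_runs)
    return (c - b) / b

# Made-up wall-clock times in seconds, five runs per configuration.
baseline = [10.1, 10.0, 10.2, 10.0, 10.1]
with_fp  = [10.3, 10.2, 10.4, 10.3, 10.2]
print(f"{relative_slowdown(baseline, with_fp):+.1%}")
# +2.0%
```

Using the median of paired runs is one simple way to damp run-to-run noise; the point is that such a number can only be produced while an unchanged baseline still exists to run against.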

(And by the way, I object to the term "new flags", because frame pointers are actually the old way of doing things, to which this Change proposes reverting. They used to be the default everywhere. This was changed by upstream GCC for a reason.)

Using surveys to resolve complex technical questions just doesn't work.

See my reply to the first paragraph.

Overall, I think the concerns are massively overblown. Both about the impact of the slowdowns (which I expect will be not-readily-noticeable in normal use),

I doubt that. And there is no data to prove either of us wrong. We need rebuilds of the current Rawhide in a side repository (in parallel to the unchanged Rawhide that serves as the baseline) so we can do fair benchmarks. Otherwise, the only safe thing to do is to reject the Change.

and the response from reviewers/users/media (if we explain why we're doing this),

I doubt that, too. I know how some sites (cough Phoronix cough) always come out with sensationalist headlines as soon as one distribution loses 0.01% performance on their synthetic benchmark. And this Change has orders of magnitude higher impact than that.

and the long-term risks (because if it turns out more costly than expected, we can just undo the flag change).

We cannot "just undo the flag change" in a stable release because it means mass-rebuilding the entire distribution! So if we ship that, we are stuck with it for at least 6 months!

I checked with the Platform Tools team at Red Hat (who maintain binutils/gcc/gdb/glibc/…), and the team remains opposed to switching to -fno-omit-frame-pointer.

#agreed REJECTED - Add -fno-omit-frame-pointer to default compilation flags (+2, 1, -4)

Metadata Update from @sgallagh:
- Issue untagged with: meeting

2 years ago

Metadata Update from @sgallagh:
- Issue close_status updated to: Rejected
- Issue status updated to: Closed (was: Open)

2 years ago

I would like to have this discussed again in a meeting specifically with having the desktop performance team and the toolchain team present with the Change owners. @catanzaro knows who those folks are and can bring them in.

I really feel like we're making a bad choice here by rejecting this because the toolchain team doesn't understand exactly how broken the real-time observability problem is for Linux software. Real-time tracing and profiling is basically impossible on Linux for substantially non-trivial applications because of this. The cloud native world is going nuts adding custom instrumentation because they can't rely on the built-in capabilities in Linux. The desktop world has nowhere near the funding to pull off something like that.

Ironically, because Meta doesn't use Kubernetes or CNCF tooling for their platform, they actually try to leverage the stuff Linux provides, which is how this Change proposal started.

I really feel like @codonell and @fweimer have seriously missed this point (as did @kkofler in this ticket).

Metadata Update from @ngompa:
- Issue tagged with: meeting

2 years ago

Can we please stop beating a dead horse? There was a vote. It did not come out the way you would have liked it, but there was a result, finally, after 5 months (!) of discussion.

Can we please stop beating a dead horse? There was a vote. It did not come out the way you would have liked it, but there was a result, finally, after 5 months (!) of discussion.

Yes, but the other stakeholder I wanted there didn't even know it was on the agenda yesterday, which meant we mostly heard only one side (the toolchain people).

(Actually I didn't know either, but that's because exhaustion from work kept me from reading my email like I normally do...)

This is a toolchain decision. The toolchain people are the most qualified experts on the topic.

This is a toolchain decision. The toolchain people are the most qualified experts on the topic.

They are up to a point, and they also have their own biases about how their stuff should be used. There's been a whole new field that was created with real-time tracing. As we've moved into more complex systems where issues can only be found by just-in-time observability and tracing, a whole new discipline and set of tools around performance analysis has built up.

I've seen amazing work done with these capabilities, and while I'm nowhere near as skilled as @brendangregg, @daandemeyer, and others, I very much appreciate when I'm able to do it.

Also, here's a blunt truth: the reason we take a performance hit is because the toolchain folks haven't built any optimizations to work around the usage of an extra register on x86_64. And they don't want to do that work. Fine, my understanding is that @daandemeyer's team was willing to do that work if they could be pointed to concrete issues (they're currently not patching anything for this and haven't observed impactful performance issues). For that matter, when I used to do that stuff, I didn't either.

Can we please stop beating a dead horse? There was a vote. It did not come out the way you would have liked it, but there was a result, finally, after 5 months (!) of discussion.

Also, stones and glass houses, dude.

Our desktop performance expert's most relevant comments on this topic are here, here, and here.

This is a toolchain decision. The toolchain people are the most qualified experts on the topic.

Honestly, I no longer trust the toolchain developers to make rational decisions regarding real-world performance impact due to their handling of this issue. They are hyper-focused on benchmarks at the expense of pragmatism.

Anyway, Kevin's right about one thing: FESCo has voted, and we should accept the result.

This is a toolchain decision. The toolchain people are the most qualified experts on the topic.

They are up to a point, and they also have their own biases about how their stuff should be used. There's been a whole new field that was created with real-time tracing. As we've moved into more complex systems where issues can only be found by just-in-time observability and tracing, a whole new discipline and set of tools around performance analysis has built up.

I've seen amazing work done with these capabilities, and while I'm nowhere near as skilled as @brendangregg, @daandemeyer, and others, I very much appreciate when I'm able to do it.

Also, here's a blunt truth: the reason we take a performance hit is because the toolchain folks haven't built any optimizations to work around the usage of an extra register on x86_64. And they don't want to do that work.

I believe I am qualified to reply to this one. I am actually also coming from the toolchain side. You may have noticed the @tigcc.ticalc.org e-mail address I use sometimes. That comes from my (unpaid volunteer) work on the (unpaid volunteer) project TIGCC. I spent quite some time there working on GCC optimizations, debugging information, GDB backtraces (I managed to integrate GDB and the GDB frontend Insight into an emulator), etc. Though it was for an m68k target, not x86_64.

Being able to do backtraces without frame pointers was a part of that work. -fomit-frame-pointer just made the code both faster and smaller, both of which happened a lot on the calculators. And the huge unwinding information was never actually sent to the calculator, but put into a split debuginfo file (actually a COFF file containing both a copy of the program code with relocations and the debugging information in DWARF 2 format) that was loaded directly by the emulator. Nobody ever tried producing backtraces on the calculator anyway.

But to get back to our x86_64 computers: No amount of optimizations is going to "work around" the loss of a usable register. The toolchain might be able to do optimizations that recover the percentage performance loss, but those optimizations are also going to apply to the -fomit-frame-pointer case, so the performance difference is not going to magically go away.
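The register argument can be put as a back-of-envelope model (my own illustration, not a real register allocator): with k usable registers and a peak of n simultaneously live values, at least max(0, n - k) values must spill to the stack, and that gap does not close no matter what optimizations are applied to both configurations.

```python
def min_spills(peak_live, usable_regs):
    # Lower bound on spilled values: whatever doesn't fit in registers
    # must live on the stack.
    return max(0, peak_live - usable_regs)

# x86-64 has 16 general-purpose integer registers; RSP is always
# reserved, leaving 15. Dedicating RBP as a frame pointer leaves 14.
for peak in (12, 15, 18):
    print(peak, min_spills(peak, 15), min_spills(peak, 14))
# 12 0 0   -> low register pressure: the frame pointer costs nothing
# 15 0 1   -> right at the boundary: frame pointers start to cost
# 18 3 4   -> high pressure: both configurations spill, the gap remains
```

This also suggests why the measured impact varies so much between workloads: only code whose hot paths sit at or above the register-pressure boundary pays for the reserved register.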

Fine, my understanding is that @daandemeyer's team was willing to do that work if they could be pointed to concrete issues (they're currently not patching anything for this and haven't observed impactful performance issues). For that matter, when I used to do that stuff, I didn't either.

That assumes the work is possible at all, which I do not believe to be the case, see above.

Can we please stop beating a dead horse? There was a vote. It did not come out the way you would have liked it, but there was a result, finally, after 5 months (!) of discussion.

Also, stones and glass houses, dude.

Well, everyone complains when I do that. ;-)

Our desktop performance expert's most relevant comments on this topic are here, here, and here.

And now also here.

Right now eBPF is the hotness, with new startups cropping up every couple of weeks. No OS yet has a perfect eBPF experience, which would include stack walking support, and I see that Fedora has decided against it. This is like Fedora choosing not to support containers at the peak of Docker, and letting another OS be first.

At large companies this is a performance team decision: Considering the performance gains provided versus the performance loss for real world applications. If you have not done off-CPU analysis with eBPF, then you don't have firsthand experience with the cost of this decision.

[I'

The cost of a possible Phoronix article is a marketing decision, with input from the performance team, and needs to consider the alternate: Articles recommending moving off Fedora for the best eBPF experience.

I speak about eBPF and OSes a lot in public: How would you recommend I explain this decision?

"The holy grail of performance analysis

... Turns out that tab then space will post an unfinished comment, and there's no edit button. Brevity is good, so I'll leave it as is!

@brendangregg You can edit comments, even the latest one. Just reload the page, and the button should appear. (I would love you to edit your comment. Right now it's a bit hard to parse and it sounds interesting.)
