hngopher.com

       [HN Gopher] Limitations of frame pointer unwinding
       ___________________________________________________________________
        
       Limitations of frame pointer unwinding
        
       Author : rwmj
       Score  : 104 points
       Date   : 2024-11-04 11:25 UTC (11 hours ago)
        
 (HTM) web link (developers.redhat.com)
 (TXT) w3m dump (developers.redhat.com)
        
       | dap wrote:
       | This reads to me like FUD. Isn't the fraction of profile samples
       | in a prologue heavily workload dependent? And whichever way you
       | go on frame pointers, there are winners and losers to including
       | them by default.
        
         | thegeomaster wrote:
         | There are no performance winners if you include them by
         | default. There will be an additional >0% overhead when you are
         | executing additional code in the prologue and epilogue, and
         | increasing the register pressure by removing rbp from being
         | ever allocated.
         | 
         | There are only "winners" in the sense that people will be able
         | to more easily see why their never-tuned system is so slow. On
         | the other hand, you're punishing all perf-critical usecases
         | with unnecessary overhead.
         | 
         | I believe if you have a slow system, it's up to you to profile
         | and optimize it, and that includes even recompiling some
         | software with different flags to enable profiling. It's not the
         | job of upstream to make this easier for you if it means
         | punishing those workloads where teams have diligently profiled
         | and optimized through the years so that there is no, as the
         | author says, low-hanging fruit to find.
        
           | dap wrote:
           | I've been around long enough to have had frame pointers
           | pretty ubiquitously, then lost them, and now starting to have
           | them again. The dark times in the middle were painful. For
           | the software I've worked on, the easy dynamic profiling using
           | frame pointers (eg using DTrace) has given far more in
           | performance wins than omitting them would have. (Part of my
           | beef with the article is that while edge cases do break some
           | samples, in practice it's a very small fraction, and almost
           | by definition not the important ones if you're trying to find
           | heavy on-CPU code paths.)
           | 
           | I get that some use cases may be better without frame
           | pointers. A well-resourced team can always recompile the
           | world, whichever the default is. It's just that my experience
           | is that most software is not already perfectly tuned and I'd
           | much rather the default be more easily observable.
        
             | thegeomaster wrote:
             | Look, it's likely we just come from different backgrounds.
             | Most of my perf-sensitive work was optimizing inner loops
             | with SIMD, allowing the compiler to inline hot functions,
             | creating better data structures to make use of the CPU
             | cache, etc. Frame pointer prologue overhead was measurable
             | on most of our use-cases. I have a smaller amount of
             | experience on profiling systems where calls trace across
             | multiple processes, so maybe I haven't felt this pain
             | enough. Though I still think the onus should be on teams to
             | be able to comfortably recompile---not the world---but some
             | part of it. After all, a lot of tuning can only be done
             | through compile flags, such as turning off
             | codepaths/capabilities which are unnecessary.
        
               | dap wrote:
               | Makes sense.
               | 
               | I wasn't exaggerating about recompiling the world,
               | though. Even if we say I'm only interested in profiling
               | my application, a single library compiled without frame
               | pointers makes useless any samples where code in that
               | library was at the top of the stack. I've seen that be
               | libc, openssl, some random Node module or JNI thing, etc.
               | You can't just throw out those samples because they might
               | still be your application's problem. For me in those
               | situations, I would have needed to recompile most of the
               | packages we got from both the OS distro and the
               | supplemental package repo.
        
               | audidude wrote:
               | I think your viewpoint is valid.
               | 
               | My experience is on performance tuning the other side you
               | mention. Cross-application, cross-library, whole-system,
               | daemons, etc. Basically, "the whole OS as it's shipped to
               | users".
               | 
               | For my case, I need the whole system set up correctly
               | before it even starts to be useful. For your case, you
               | only need the specific library or application compiled
               | correctly. The rest of the system is negligible and
               | probably not even used. Who would optimize SIMD routines
               | next to function calls anyway?
        
       | audidude wrote:
       | I added support to Sysprof this weekend for unwinding using
       | libdwfl and DWARF/CFI/eh_frame/etc techniques that Serhei did in
       | eu-stacktrace.
       | 
       | The overhead is about 10% of samples. But at least you can unwind
       | on systems without frame-pointers. Personally I'll take the
       | statistical anomalies of frame-pointers which still allow you to
       | know what PID/TID are your cost center even if you don't get
       | perfect unwinds. Everyone seems motivated towards SFrame going
       | forward, which is good.
       | 
       | https://blogs.gnome.org/chergert/2024/11/03/profiling-w-o-fr...
        
       | Brian_K_White wrote:
       | Is this a response to Alma Kitten?
       | 
       | In any event I don't understand why frame pointers need to be in
       | by default instead of developers enabling where needed.
       | 
       | Having Kitten include pointers by default seems reasonable
       | enough, since Kitten is a devel system.
        
         | ithkuil wrote:
         | It's useful to be able to profile on production workloads
        
         | rwmj wrote:
         | The real benefit is being able to turn on profiling when a
         | problem is spotted, or in some cases to be able to profile
         | continuously in production (as apparently they do at Netflix).
        
           | thegeomaster wrote:
           | I get it. This _frustrated me to no end_. But still I did
           | what I had to do --- recompiled random software throughout
           | the stack, enabled random flags, etc. It was doable and now I
           | can do it much faster. I don't think it's fair for upstream
           | to disable a useful optimization just so _I_ don't have to
           | do this additional work to fix and optimize _my system_.
        
             | rwmj wrote:
             | Doing real world, whole system profiling, we've found
             | performance was affected by completely unexpected software
             | running on the system. Recompiling the entire distribution,
             | or even the subset of all software installed, is not
             | realistic for most people. Besides, I have measured the
             | overhead of frame pointers and it's less than 1%, so
             | there's not really any trade-off here.
             | 
             | Anyway, soon we'll have SFrame support in the userspace
             | tools and the whole issue will go away.
        
               | thegeomaster wrote:
               | In one of my jobs, a 1% perf regression (on a more
               | stable/reproducible system, not PCs) was a reason for a
               | customer raising a ticket, and we'd have to look into it.
               | For dynamically dispatched but short functions, the
               | overhead is easily more than 1% too. So, there _is_ a
               | trade-off, just not one that affects you.
        
             | Brian_K_White wrote:
             | I think it comes down to numbers. What are most installed
             | systems used for? Do more than 50% of installed systems
             | need to be doing this profiling all the time on just all
             | binaries such that they just need to be already built this
             | way without having to identify them and prepare them ahead
             | of time?
             | 
             | If so, then it should be the default.
             | 
             | If it's a close call, then there should be 2 versions of
             | the iso and repos.
             | 
             | As many developers and service operators as there are, as
             | much as everyone on this page is including both you and I,
             | I still do not believe the profiling use case is the
             | majority use case.
             | 
             | The way I am trying to judge "majority" is: Pick a binary
             | at random from a distribution. Now imagine all running
             | instances of that binary everywhere. How many of those
             | instances need to be profiled? Is it really most of them?
             | 
             | So it's not just unsympathetic "F developers/services
             | problems". I are one myself.
        
               | nemetroid wrote:
               | Do 50% of users need to be able to:
               | 
               | * modify system services?
               | 
               | * run a compiler?
               | 
               | * add custom package repositories?
               | 
               | * change the default shell?
               | 
               | I believe the answer to all of the above is "no".
        
               | redox99 wrote:
               | All those things are free in terms of performance though.
        
               | Brian_K_White wrote:
               | I don't see how this applies. Some shell has to be the
               | default one, and all systems don't pick the same one
               | even. Most systems don't install a compiler by default.
               | Thank you for making my point?
        
               | nemetroid wrote:
               | All these things are _possible_ to do, even though only
               | developers need them. Why shouldn't the same be true for
               | useful profiling abilities? Because of the 1-2% penalty?
        
               | Brian_K_White wrote:
               | Are you serious?
               | 
               | Visa makes billions per year off of nothing but
               | collecting a mere 2%-3% tax on everything else.
        
               | oasisaimlessly wrote:
               | I don't see how Visa is in any way relevant here.
        
               | Brian_K_White wrote:
               | I don't see why not.
               | 
               | The whole point of an analogy is to expose a blind spot
               | by showing the same thing in some other context where it
               | is recognized or perceived differently.
        
               | recursivecaveat wrote:
               | Everyone benefits from the net performance wins that come
               | from an ecosystem where everyone can easily profile
               | things. I have no doubt that works out to more than a 1%
               | lifetime improvement. Same reason you log stuff on your
               | servers. 99.9% pure overhead, never even seen by a human.
               | Slows stuff down, even causes uptime issues sometimes
               | from bugs or full disks. It's still worthwhile though
               | because occasionally it makes fixes or enhancements
               | possible that are so much larger than the cost of the
               | observability.
        
           | Brian_K_White wrote:
           | Then Netflix can enable it for their systems? Are they
           | actually still profiling cat and ls that come from the os or
           | are they profiling their own applications and the
           | interpreters and daemons they run on?
           | 
           | This does not explain why a distribution should have such a
           | feature on by default. It only explains why Netflix wants it
           | on some of their systems.
        
             | the_mitsuhiko wrote:
             | > Then Netflix can enable it for their systems?
             | 
             | And they did.
             | 
             | The question is though why only Netflix should benefit from
             | that. It takes a lot of effort to recompile an entire Linux
             | distribution.
        
             | soraminazuki wrote:
             | People across the industry are suffering from incomplete
             | stacktraces because their applications call into libraries
             | like glibc or OpenSSL that have frame pointer optimization
             | enabled by their distro. It's pretty ridiculous to have to
             | pull off a Linux from Scratch on CentOS just to get a
             | decent stacktrace. Needless to say, this has nothing at all
             | to do with profiling cat and ls.
        
               | pkhuong wrote:
               | OpenSSL is the worst because some configurations execute
               | asm generated by a specialised program. That code
               | clobbers the frame pointer (gotta go fast!) but isn't
               | annotated with dwarf unwinding info (what do you mean you
               | want to know what led to your app crashing in
               | OpenSSL?)...
        
             | Brian_K_White wrote:
             | Quoting my other comment in this thread:
             | 
             | ---
             | 
             | I think it comes down to numbers. What are most installed
             | systems used for? Do more than 50% of installed systems
             | need to be doing this profiling all the time on just all
             | binaries such that they just need to be already built this
             | way without having to identify them and prepare them ahead
             | of time?
             | 
             | If so, then it should be the default.
             | 
             | If it's a close call, then there should be 2 versions of
             | the iso and repos.
             | 
             | As many developers and service operators as there are, as
             | much as everyone on this page is including both you and I,
             | I still do not believe the profiling use case is the
             | majority use case.
             | 
             | The way I am trying to judge "majority" is: Pick a binary
             | at random from a distribution. Now imagine all running
             | instances of that binary everywhere. How many of those
             | instances need to be profiled? Is it really most of them?
             | 
             | So it's not just unsympathetic "F developers/services
             | problems". I are one myself.
             | 
             | ---
             | 
             | "people across the industry" is a meaningless and valueless
             | term and is an empty argument.
        
           | loeg wrote:
           | Meta also continuously profiles in production, FWIW.
        
         | adrian_b wrote:
         | In reality all this discussion has its origin in a design
         | mistake made by Intel already in the 8086 CPU, when it was
         | launched in 1978.
         | 
         | They designed the instruction set in such a way that two
         | distinct registers were necessary for fulfilling the roles of
         | the stack pointer and of the frame pointer.
         | 
         | In better designed instruction sets, for example in IBM POWER,
         | a single register is enough for fulfilling both roles,
         | simultaneously being both stack pointer and frame pointer.
         | 
         | Unfortunately, the Intel designers did not think at all
         | about this problem; in 1978 they just followed the
         | example of the architectures popular at that time, e.g. DEC
         | VAX, which had also made the same mistake of reserving two
         | distinct registers for the roles of stack pointer and of frame
         | pointer.
         | 
         | In the architectures where a single register plays both roles,
         | the stack pointer always points to a valid stack frame that is
         | a part of a linked list of all stack frames. For this to work,
         | there must be an atomic instruction for both creating a new
         | stack frame (which consists in storing the old frame pointer in
         | the right place of the new stack frame) and updating the stack
         | pointer to point to the new stack frame. The Intel/AMD ISA does
         | not have such an atomic instruction, and this is the reason why
         | two registers are needed for creating a new stack frame in a
         | safe way (safe means that the frame pointer always points to a
         | valid stack frame and the stack pointer always points to the
         | top of stack).
        
         | brenns10 wrote:
         | > I don't understand why frame pointers need to be in by
         | default instead of developers enabling where needed
         | 
         | If you enable frame pointers, you need to recompile every
         | library your executable depends on. Otherwise, the unwind will
         | fail at the first function that's not part of your executable.
         | Usually library function calls (like glibc) are at the top of
         | the stack, so for a large portion of the samples in a typical
         | profile, you won't get any stack unwind at all.
         | 
         | In many (most?) cases recompiling all those libraries is just
         | infeasible for the application developers, which is why the
         | distro would need to do it. Developers can still choose whether
         | to include frame pointers in their own applications (and so
         | they can still pick up those 1-2% performance gains in their
         | own code). But they're stuck with frame pointers enabled on all
         | the distro provided code.
         | 
         | So the choice developers get to make is more along the lines
         | of: should they use a distro with FP or without. Which is
         | definitely not ideal, but that's life.
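
The failure mode described in this comment can be illustrated with a toy model of a frame-pointer walk. This is only a sketch: the addresses, the function names (`app_work`, `lib_helper`) and the `unwind` helper are invented for illustration, and real unwinders such as perf or the kernel's do more validation. It assumes the usual x86-64 layout where a frame holds the saved frame pointer at `[fp]` and the return address at `[fp + 8]`.

```python
# Toy model of frame-pointer unwinding on x86-64. All addresses and function
# names are invented for illustration; real unwinders (perf, the kernel) differ.

def unwind(memory, symbols, pc, fp, max_depth=64):
    """Follow the saved-frame-pointer chain and return the function names found."""
    trace = [symbols.get(pc, "??")]
    for _ in range(max_depth):
        # A frame stores the caller's frame pointer at [fp]
        # and the return address at [fp + 8].
        if fp not in memory or fp + 8 not in memory:
            break  # null or garbage frame pointer: the walk ends here
        trace.append(symbols.get(memory[fp + 8], "??"))
        fp = memory[fp]
    return trace

# Stack for _start -> main -> app_work -> lib_helper (currently executing).
memory = {
    0x7000: 0,      0x7008: 0x100,  # main's frame: end of chain, returns into _start
    0x6000: 0x7000, 0x6008: 0x200,  # app_work's frame: returns into main
    0x5000: 0x6000, 0x5008: 0x300,  # lib_helper's frame: returns into app_work
}
symbols = {0x100: "_start", 0x200: "main", 0x300: "app_work", 0x400: "lib_helper"}

# Library built with frame pointers: rbp points at lib_helper's own frame.
full = unwind(memory, symbols, pc=0x400, fp=0x5000)

# Library built without them and using rbp as a scratch register.
clobbered = unwind(memory, symbols, pc=0x400, fp=0xDEADBEEF)

# Library built without them but leaving rbp untouched: rbp still points at
# app_work's frame, so the walk silently skips app_work.
skipped = unwind(memory, symbols, pc=0x400, fp=0x6000)

print(full)
print(clobbered)
print(skipped)
```

In the second case the sample only identifies the library function, which is the "no stack unwind at all" situation; in the third the trace looks plausible but is missing the immediate caller.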
        
       | fooblaster wrote:
       | I have always had issues with the perf call trace sampling with
       | frame pointers, even when virtually everything in userspace is
       | compiled with -fno-omit-frame-pointer. It doesn't look like any of
       | the failure modes listed in the article to me though. Shrug.
       | 
       | FYI, if you happen to be running on an intel cpu, --call-graph
       | lbr uses some specialized hardware and often delivers a far
       | superior result, with some notable failure modes. Really looking
       | forward to when AMD implements a similar feature.
        
         | rwmj wrote:
         | The problem with Intel LBR (last branch records) is that the
         | depth of the call stack is relatively limited. It depends on
         | the generation of CPU, but LWN has a table here:
         | https://lwn.net/Articles/680985/ Anything less than 32 is
         | fairly useless for profiling from the kernel through to
         | userspace.
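
The depth limit can be pictured with a fixed-size buffer. The sketch below is a simplified model, not a description of the hardware: real LBR entries are branch source/target pairs, and the `lbr_call_stack` helper, the 40-deep call chain and the function names are assumptions made up for the example.

```python
from collections import deque

def lbr_call_stack(events, depth):
    """Replay call/return events into a fixed-depth buffer (toy call-stack mode)."""
    stack = deque(maxlen=depth)  # appending to a full deque drops the oldest entry
    for kind, name in events:
        if kind == "call":
            stack.append(name)
        elif kind == "ret" and stack:
            stack.pop()
    return list(stack)

# Hypothetical 40-deep call chain: f0 calls f1, which calls f2, and so on.
events = [("call", f"f{i}") for i in range(40)]

shallow = lbr_call_stack(events, depth=16)
deep = lbr_call_stack(events, depth=32)

print(shallow[0], len(shallow))
print(deep[0], len(deep))
```

In both cases the outermost frames (`f0` and its neighbours) are gone, so the truncated stack still shows where the sample landed but not how execution got there from the root.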
        
           | fooblaster wrote:
           | Yeah, right. I still seem to get far more "comprehensible"
           | traces when using it, even with this limitation. It's often
           | really easy to localize where a trace is coming from, even
           | when truncated. It probably breaks flamegraphs though.
        
         | Sesse__ wrote:
         | I've tried --call-graph lbr a bunch of times, but often, it...
         | just returns junk? I don't fully understand why, it sometimes
         | returns wild pointers even if you don't have deep stacks.
        
           | fooblaster wrote:
           | I often get junk when sampling without lbr. Which kernel are
           | you running? The quality of perf and the associated
           | perf_events varies wildly across kernel versions.
        
             | Sesse__ wrote:
             | A variety of kernels over the last five years, on a
             | multitude of Intel CPUs. :-) I last tested this on 6.10, I
             | think.
             | 
             | It's certainly true that there can be junk in --call-graph
             | fp, too.
        
       | laserbeam wrote:
       | "enabling frame pointers is a 1-2% performance loss, which
       | translates to the loss of about 1 or 2 years of compiler
       | improvements"
       | 
       | Wait, are we really that close to the maximum of what a compiler
       | can optimize that we're getting barely 1% performance
       | improvements per year with new versions?
        
         | clausecker wrote:
         | Yeah, compilers are already pretty close to the limit of what
         | is possible, unless your code is unusually poorly written.
        
           | londons_explore wrote:
           | Clearly this isn't the case. Plenty of neat C++ "reference
           | implementation" code ends up 5x faster when hand optimized,
           | parallelized, vectorized, etc.
           | 
           | There are some transformations that compilers are really bad
           | at. Rearranging data structures, switching out algorithms for
           | equivalent ones with better big-O complexity, generating &
           | using lookup tables, bit-packing things, using caches, hash
           | tables and bloom filters for time/memory trade offs, etc.
           | 
           | The spec doesn't prevent such optimizations, but current
           | compilers aren't smart enough to find them.
        
             | adrianN wrote:
             | Imagine the outcry if compilers switched algorithms. How
             | can the compiler know my input size and input distribution?
             | Maybe my dumb algorithm is optimal for my data.
        
               | londons_explore wrote:
               | Compilers can easily runtime-detect the size and shape of
               | the problem, and run different code for different problem
               | sizes. Many already do for loop-unrolling. Ie. if you
               | memcpy 2 bytes, they won't even branch into the fancy
               | SIMD version.
               | 
               | This would just be an extension of that. If the code
               | creates and uses a linked list, yet the list is 1M items
               | long and being accessed entirely by index, branch to a
               | different version of the code which uses an array, etc.
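
Written by hand, the kind of size-based dispatch described here might look like the sketch below. It is not something current compilers generate for arbitrary data structures; the `contains` function and the `SMALL` crossover value are hypothetical, and a real threshold would have to come from measurement.

```python
import bisect

SMALL = 16  # hypothetical crossover point; a real one would come from measurement

def contains(sorted_items, key):
    """Membership test on a sorted list, choosing the algorithm by input size."""
    if len(sorted_items) <= SMALL:
        # Tiny input: a plain linear scan
        return any(x == key for x in sorted_items)
    # Larger input: binary search
    i = bisect.bisect_left(sorted_items, key)
    return i < len(sorted_items) and sorted_items[i] == key

print(contains([1, 3, 5], 3))
print(contains(list(range(0, 1000, 2)), 501))
```

Both branches return the same answer for the same input; only the cost differs, which is what makes this sort of substitution safe in principle.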
        
             | laserbeam wrote:
             | That's my question. I'm also under the impression that
             | optimizations CAN be made manually, but I find it
             | surprising that "current compilers aren't smart enough to
             | find them" isn't improving.
        
               | londons_explore wrote:
               | The percentage of all software engineers working on
               | compilers is probably lower now than it ever has been...
        
         | thechao wrote:
         | As a part time compiler author I'm extremely skeptical we're
         | getting a global 1-2%/yr. I'd've thought more like a tenth to
         | half that? I've not seen any numbers, so I'm just making shit
         | up.
         | 
         | However, for sure, if compiler optimizations disappeared, HW
         | would pick up the slack in a few years.
        
         | fanf2 wrote:
         | Proebsting's Law suggests 4% per year, but as a satirical joke
         | it seems to have underdone its cynicism.
         | 
         | https://gwern.net/doc/cs/algorithm/2001-scott.pdf
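
The arithmetic behind these figures can be checked with a short calculation. It assumes the usual statement of Proebsting's Law (compiler advances double performance every 18 years); the helper names are made up for the example, and the rates are the ones quoted in the thread, not measurements.

```python
import math

def annual_rate(doubling_years):
    """Yearly improvement implied by a given doubling time."""
    return 2 ** (1 / doubling_years) - 1

def years_to_recover(loss, rate):
    """Years of compounding `rate` needed to cancel a one-off slowdown of `loss`."""
    return -math.log(1 - loss) / math.log(1 + rate)

proebsting = annual_rate(18)  # assumed 18-year doubling time

print(f"{proebsting:.1%} per year")
print(f"{years_to_recover(0.01, 0.01):.2f} years for a 1% loss at 1%/yr")
print(f"{years_to_recover(0.02, proebsting):.2f} years for a 2% loss at the Proebsting rate")
```

Under the 18-year assumption the rate comes out just under 4% per year, so a 1-2% frame-pointer cost would correspond to months of compiler progress, not the one or two years quoted from the article.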
        
         | variadix wrote:
         | There's likely a lot of performance still on the table if
         | compilers were permitted to change data structure layout, but I
         | think doing this effectively is an open problem.
         | 
         | Current compilers could do a lot better with vectorization, but
         | it will often be limited by the data structure layout.
        
       | clausecker wrote:
       | The "function prologue is at least 8 bytes long" bit only applies
       | if CET is used. If it is not used, the endbr64 instruction is not
       | emitted and the prologue is only 4 bytes long.
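
The byte counts follow from the instruction encodings. The snippet below just adds them up for the standard `push %rbp; mov %rsp,%rbp` prologue; the variable names are made up for the example.

```python
# x86-64 encodings of the instructions in a frame-pointer prologue.
ENDBR64 = bytes.fromhex("f30f1efa")    # endbr64, emitted only when CET is in use
PUSH_RBP = bytes.fromhex("55")         # push %rbp
MOV_RSP_RBP = bytes.fromhex("4889e5")  # mov %rsp,%rbp

with_cet = ENDBR64 + PUSH_RBP + MOV_RSP_RBP
without_cet = PUSH_RBP + MOV_RSP_RBP

print(len(with_cet), len(without_cet))
```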
        
       | elteto wrote:
       | Didn't really get the point of the post as it just presents
       | something without a conclusion.
       | 
       | 9X% of users do not care about a <1% drop in performance. I
       | suspect we get the same variability just by going from one kernel
       | version to another. The impact from all the Intel mitigations
       | that are now enabled by default is much worse.
       | 
       | However I do care about nice profiles and stack traces without
       | having to jump through hoops.
       | 
       | Asking people to recompile an _entire_ distribution just to get
       | sane defaults is wrong. Those who care about the last drop should
       | build their custom systems as they see fit, and they probably
       | already do.
        
         | yosefk wrote:
         | it does present a conclusion. once the kernel supports .sframe
         | it will be all-around superior to -fomit-frame-pointer, and a
         | better default for distros to use.
        
           | audidude wrote:
           | It does cause more memory pressure because the kernel will
           | have to look at the user-space memory for decoding registers.
           | 
           | So yes it will be faster than alternatives to frame-pointers,
           | but it still won't be as fast as frame pointers.
        
         | Brian_K_White wrote:
         | But does what you care about matter enough to be the default?
         | 
         | Are you the majority?
         | 
         | Evaluate "majority" this way: For every/any random binary in a
         | distro, out of all the currently running instances of that
         | binary in the world at any given moment, how many of those need
         | to be profiled?
         | 
         | There is no way the answer is "most of them".
         | 
         | You have a job where you profile things, and maybe even you
         | profile almost everything you touch. Your whole world has a
         | high quotient of profiling in it. So you want the whole system
         | built for profiling by default. How convenient for you. But
         | your whole world is not _the_ whole world.
         | 
         | But it's not just you, there are, zomg thousands, tens of
         | thousands, maybe even _hundreds of thousands_ of developers and
         | ops admins the same as you.
         | 
         | Yes and? Is even that most installed instances of any given
         | executable? No way.
         | 
         | Or maybe yes. It's possible. Can you show that somehow? But I
         | will guess no way and not even close.
        
           | audidude wrote:
           | I regularly have users run Sysprof and upload it to issues.
           | It's immensely powerful to be able to see what is going on
           | systems which are having issues. I'd argue it's one of the
           | major reasons GNOME performance has gotten so much better in
           | the recent-past.
           | 
           | You can't do that when step one is reinstall another distro
           | and reproduce your problem.
           | 
           | Additionally, the overhead for performance related things
           | that could fall into the 1% range (hint, it's not much)
           | rarely are using the system libraries in such a way anyway
           | that would cause this. They can compile that app with frame-
           | pointers disabled. And for stuff where they do use system
           | libraries (qsort, bsearch, strlen, etc) the frame pointer is
           | negligible to the work being performed. Your margin of
           | error is way larger than the theoretical overhead.
        
             | Brian_K_White wrote:
             | 1% is a ton. 1% is crazy. Visa owns the world off just a 3%
             | tax on everything else. Brokers make billions off of just
             | 1% or even far less.
             | 
             | 1% _of all activity_ is only rational if you get more than
             | 1% _of all activity_ back out from those times and places
             | where it was used.
             | 
             | 1%, when it's of everything, is an absolutely stupendous
             | colossal number that is absolutely crazy to try to treat
             | as trivial.
        
               | ploxiln wrote:
               | Better analogy: you're paying 30% to apple, and over 50%
               | in bad payday loans, and you're worried about the 3%
               | visa/stripe overhead ... that's kinda crazy. But that's
               | where we are in computer performance, there's 10x, 100x,
               | and even greater inefficiencies everywhere, 1% for better
               | backtraces is nothing.
        
               | audidude wrote:
               | Absolutely. We've gotten numerous double digit
               | performance improvements across applications, libraries,
               | and system daemons because of frame-pointers in Fedora
               | (and that's just from me).
        
           | wbl wrote:
           | Performance problems matter to the people who have them, who
           | often are in an inconvenient place. Having the ability for
           | profiling to just work means that it's easy to help these
           | people.
        
           | elteto wrote:
           | I think you are trying to make this out to be something that
           | it isn't.
           | 
           | Visibility at the "cost" of negligible impact is more
           | important than raw performance. That's it.
           | 
           | I'm a regular user of Linux with some performance sensitivity
           | that does not go as far as "I _need_ that extra register!".
           | That's what the majority of developers working on Linux are
           | like. I think it's up to _you_ to prove the contrary.
        
           | PittleyDunkin wrote:
           | This seems like a ridiculous attempt to bury your head in the
           | sand. Is there any evidence _anyone_ doesn't want frame
           | pointers?
        
             | Brian_K_White wrote:
             | I think it's ridiculous to question that since obviously,
             | yes, many people have decided exactly that. I see no point
             | myself and I'm even in the field. And I am not in charge of
             | all the distributions which disabled it by default.
             | 
             | So, "yes". In fact "yes, duh?" Talk about head in sand...
        
               | PittleyDunkin wrote:
               | Ok, where's the evidence?
               | 
               | > I see no point myself and I'm even in the field.
               | 
               | You don't see the point of readable stack traces?
        
               | Brian_K_White wrote:
               | Nope. Not on 99.999% of installed binaries in existence
               | and running at a given moment.
        
               | PittleyDunkin wrote:
               | That strikes me as an insane take (not to mention
               | blatantly inaccurate), but I take your point that this is
               | a common one for distribution-maintainers to have.
        
           | dap wrote:
           | > Evaluate "majority" this way: For every/any random binary
           | in a distro, out of all the currently running instances of
           | that binary in the world at any given moment, how many of
           | those need to be profiled?
           | 
           | > There is no way the answer is "most of them".
           | 
           | This is an absurd way to evaluate it. All it takes is one
           | savvy user to report a performance problem that developers
           | are able to root-cause using stack traces from the user's
           | system. Suppose they're able to make a 5% performance
           | improvement to the program. Now _all_ users' programs are 5%
           | faster because of the frame pointers on this one user's
           | system.
           | 
           | At this point people usually ask: but couldn't developers
           | have done that on their own systems with debug code? But the
           | performance of debug code is not the same as the performance
           | of shipping code. And not all problems manifest the same on
           | all systems. This is why you need shipping code to be
           | debuggable (or instrumentable or profileable or whatever you
           | want to call it).
        
         | baq wrote:
         | fifty independent 1% performance drops nobody cares about
         | compound to a ~40% reduction.
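
The arithmetic behind that figure can be checked with a short sketch. It assumes the drops are independent and multiplicative, so each one keeps 99% of whatever performance is left; the function name `compounded_reduction` is made up for illustration.

```python
def compounded_reduction(drop: float, count: int) -> float:
    """Fraction of the original performance lost after `count` independent
    slowdowns, each of which keeps (1 - drop) of the remaining performance."""
    return 1.0 - (1.0 - drop) ** count


if __name__ == "__main__":
    # Fifty 1% drops: 1 - 0.99**50
    loss = compounded_reduction(0.01, 50)
    print(f"{loss:.1%}")  # roughly 39.5%, i.e. the "~40%" quoted above
```

Under an additive model the same fifty drops would cost 50%, so the multiplicative assumption matters for the exact number.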
        
         | josefx wrote:
         | > 9X% of users do not care about a <1% drop in performance.
         | 
         | Except Python got opted out of the frame pointer change due to
         | benchmarks showing slowdowns of up to 10%. The discussion
         | around that had the great idea of just adding a pragma to flat
          | out override the build setting. So in the end that "1%"
         | reduction claim only holds if everything even remotely affected
         | silently ignores the flag.
        
           | audidude wrote:
           | This is a bit of a mischaracterization of the Python side of
           | things.
           | 
           | They only opted out for 3.11 which did not yet have the perf-
           | integration fixes anyway. 3.12 uses frame-pointers just fine.
        
             | josefx wrote:
             | Any link to the fix or documentation about it? I could find
             | added perf support but did not see anything about improved
             | performance related to frame pointer use.
        
       | jeffbee wrote:
       | Complaining about frame pointers is like complaining about the
       | budget of the Bureau of Labor Statistics. Yes, it's pure
       | overhead, but also yes, it's good to know what is going on.
        
       | ot wrote:
       | I broadly agree with the thesis of the post, which if I
       | understand correctly is that frame pointers are a temporary
       | compromise until the whole ecosystem gets its act together and
       | manages to agree on some form of out-of-band tracking of frame
       | pointers, and it seems that we'll eventually get there.
       | 
       | Some of the statements in the post seem odd to me though.
       | 
       | - 5% of system-wide cycles spent in function prologues/epilogues?
        | That is _wild_, it can't be right.
       | 
       | - Is using the whole 8 bytes right for the estimate? Pushing the
       | stack pointer is the first instruction in the prologue and it's
       | literally 1 byte. Epilogue is symmetrical.
       | 
       | - Even if we're in the prologue, we know that we're in a leaf
       | call, we can still resolve the instruction pointer to the
       | function, and we can read the return address to find the parent,
       | so what information is lost?
       | 
       | When it comes to future alternatives, while frame pointers have
       | their own problems, I think that there are still a few open
       | questions:
       | 
       | - Shadow stacks are cool but aren't they limited to a fixed
       | number of entries? What if you have a deeper stack?
       | 
       | - Is the memory overhead of lookup tables for very large programs
       | acceptable?
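
On the "what information is lost?" question, here is a toy model of a naive frame-pointer walk. It assumes the usual x86-64 layout (saved %rbp at [rbp], return address at [rbp+8]) and an unwinder that does not know the sample landed in a prologue. The addresses and the `unwind` helper are invented for illustration, and return addresses are labelled with the function they return into.

```python
def unwind(ip, rbp, mem):
    """Walk the saved-%rbp chain: [rbp] holds the caller's %rbp,
    [rbp + 8] holds the return address into the caller."""
    trace = [ip]
    while rbp != 0:
        trace.append(mem[rbp + 8])
        rbp = mem[rbp]
    return trace


# Hypothetical stack for _start -> main -> parent -> leaf (grows downward).
STACK = {
    0x1008: "_start",  # return address pushed when _start called main
    0x1000: 0,         # main's saved %rbp (0 terminates the chain)
    0x0FF8: "main",    # return address pushed when main called parent
    0x0FF0: 0x1000,    # parent's saved %rbp, points at main's frame
    0x0FE8: "parent",  # return address pushed when parent called leaf
    0x0FE0: 0x0FF0,    # leaf's saved %rbp, written by leaf's `push %rbp`
}

# Sample taken at leaf's first instruction: %rbp is still parent's frame.
in_prologue = unwind("leaf", 0x0FF0, STACK)
# Sample taken after `mov %rsp,%rbp`: %rbp is leaf's own frame.
in_body = unwind("leaf", 0x0FE0, STACK)

print(in_prologue)  # ['leaf', 'main', '_start']
print(in_body)      # ['leaf', 'parent', 'main', '_start']
```

In this model the leaf and the rest of the chain survive, but the immediate caller drops out of the prologue sample. An unwinder that knows it is in a prologue could recover it by reading the return address at the top of the stack instead.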
        
         | audidude wrote:
         | > Shadow stacks are cool but aren't they limited to a fixed
         | number of entries?
         | 
          | On currently available hardware, yes. But I think some of the
          | future Intel stuff was going to allow for much larger depth.
         | 
         | > Is the memory overhead of lookup tables for very large
         | programs acceptable?
         | 
         | I don't think SFrame is as "dense" as DWARF as a format so you
         | trade a bit of memory size for a much faster unwind experience.
         | But you are definitely right that this adds memory pressure
         | that could otherwise be ignored.
         | 
         | Especially if the anomalies are what they sound like, just
         | account for them statistically. You get a PID for cost
         | accounting in the perf_event frame anyway.
        
         | rwmj wrote:
         | > - Is using the whole 8 bytes right for the estimate? Pushing
         | the stack pointer is the first instruction in the prologue and
         | it's literally 1 byte. Epilogue is symmetrical.
         | 
         | I believe it's because of the landing pad for Control Flow
         | Integrity which basically all functions now need. Grabbing
         | main() from a random program on Fedora (which uses frame
          | pointers):
          | 
          |     0000000000007000 <main>:
          |         7000:       f3 0f 1e fa       endbr64          ; landing pad
          |         7004:       55                push   %rbp      ; set up frame pointer
          |         7005:       48 89 e5          mov    %rsp,%rbp
         | 
         | It's not much of an issue in practice as the stack trace will
         | still be nearly correct, enough for you to identify the
         | problematic area of the code.
         | 
         | > - Shadow stacks are cool but aren't they limited to a fixed
         | number of entries? What if you have a deeper stack?
         | 
          | Yes, shadow stacks are limited to 32 entries on the most recent
         | Intel CPUs (and as little as 4 entries on very old ones).
         | However they are basically cost free so that's a big advantage.
         | 
         | I think SFrame is a sensible middle ground here. It's saner
         | than DWARF and has a long history of use in the kernel so we
         | know it will work.
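
The 8-byte figure follows from the instruction sizes in the listing above. The sketch below adds them up and classifies a sampled instruction pointer. It assumes every function starts with exactly this prologue, which real code does not guarantee; `PROLOGUE`, `frame_ready_offset` and `rbp_is_callers` are illustrative names only.

```python
# (mnemonic, encoded size in bytes), taken from the main() listing above.
PROLOGUE = [
    ("endbr64", 4),        # f3 0f 1e fa
    ("push %rbp", 1),      # 55
    ("mov %rsp,%rbp", 3),  # 48 89 e5
]


def frame_ready_offset(prologue=PROLOGUE):
    """Offset from function start at which %rbp points at the new frame."""
    return sum(size for _, size in prologue)


def rbp_is_callers(func_start, ip, prologue=PROLOGUE):
    """True if a sample at `ip` still sees the caller's %rbp, i.e. the
    `mov %rsp,%rbp` has not yet executed."""
    return func_start <= ip < func_start + frame_ready_offset(prologue)


print(frame_ready_offset())            # 8
print(rbp_is_callers(0x7000, 0x7005))  # True: about to run the mov
print(rbp_is_callers(0x7000, 0x7008))  # False: frame is set up
```

Without the `endbr64` landing pad the same sum gives 4 bytes, which is why the estimate depends on whether Control Flow Integrity is enabled.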
        
           | Sesse__ wrote:
           | If you're limited to 32 entries, why not just use LBR, then?
           | It has basically the same pros and cons.
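
The shared drawback of any fixed-depth call record can be illustrated with a toy model. It takes the 32-entry figure from the comments above at face value and assumes the oldest entry is overwritten once the record is full. It is not a description of how any particular CPU implements this, and `BoundedCallStack` is a made-up name.

```python
from collections import deque


class BoundedCallStack:
    """Toy fixed-capacity call record: once full, each new call
    silently evicts the oldest (outermost) entry."""

    def __init__(self, capacity=32):
        self._entries = deque(maxlen=capacity)

    def call(self, name):
        self._entries.append(name)

    def ret(self):
        if self._entries:
            self._entries.pop()

    def snapshot(self):
        """Entries from outermost surviving frame to innermost."""
        return list(self._entries)


record = BoundedCallStack(capacity=32)
for depth in range(40):  # a call chain 40 frames deep
    record.call(f"f{depth}")

frames = record.snapshot()
print(len(frames))            # 32
print(frames[0], frames[-1])  # f8 f39: the outermost eight frames are gone
```

The innermost frames, which usually matter most for a profile, survive; what is lost is the path from the program's entry point down to them.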
        
         | Sesse__ wrote:
         | > - 5% of system-wide cycles spent in function
         | prologues/epilogues? That is wild, it can't be right.
         | 
         | TBH I wouldn't be surprised on x86. There are so many registers
         | to be pushed and popped due to the ABI, so every time I profile
         | stuff I get depressed... Aarch64 seems to be better, the
         | prologues are generally shorter when I look at those. (There's
         | probably a reason why Intel APX introduces push2/pop2
         | instructions.)
        
           | manwe150 wrote:
           | This sounds to me more like an inlining problem than an ABI
           | problem. If the calls take as much time as the running,
           | perhaps you just need a better language that doesn't
           | arbitrarily prevent inlining due to compilation boundaries
           | (e.g. basically any modern language that isn't in the C/C++
           | family, before LTO)
        
             | Sesse__ wrote:
             | I see this in LTO/PGO binaries as well. If a function is 20
             | instructions long, it's not like you can inline it
             | uncritically, yet a five-cycle prologue and a five-cycle
             | epilogue will hurt. (Also, recursive functions etc.)
        
         | quotemstr wrote:
         | > temporary compromise until the whole ecosystem gets its act
         | together and manages to agree on some form of out-of-band
         | tracking of frame pointers,
         | 
         | Temporary solutions have a way of becoming permanent. I was
         | against the recent frame pointer enablement on the grounds of
         | moral hazard. I still think it would have been better to force
         | the ecosystem to get its act together first.
         | 
         | Another factor nobody is talking about is JITed and interpreted
         | languages. Whatever the long-term solution might be, it should
         | enable stack traces that interleave accurate source-level frame
         | information from native and managed code. The existing perf
         | /tmp file hack is inadequate in many ways, including security,
         | performance, and compatibility with multiple language runtimes
         | coexisting in a single process.
        
           | audidude wrote:
           | It's a disaster no doubt.
           | 
           | But, at least from the GNOME side of things, we've been
           | complaining about it for roughly 15 years and kept getting
           | push-back in the form of "we'll make something better".
           | 
           | Now that we have frame-pointers enabled in Fedora, Ubuntu,
           | Arch, etc., we're starting to see movement on realistic
           | alternatives. So in many ways, I think the moral hazard was
           | waiting until 2023 to enable them.
        
       | tempfile wrote:
       | The JS constantly grabbing the anchor and updating it is
       | absolutely appalling UX. It took me something like 11 back button
       | presses to get back to where I was. Borderline malware.
        
       ___________________________________________________________________
       (page generated 2024-11-04 23:00 UTC)