[HN Gopher] Limitations of frame pointer unwinding
___________________________________________________________________
Limitations of frame pointer unwinding
Author : rwmj
Score : 104 points
Date : 2024-11-04 11:25 UTC (11 hours ago)
(HTM) web link (developers.redhat.com)
(TXT) w3m dump (developers.redhat.com)
| dap wrote:
| This reads to me like FUD. Isn't the fraction of profile samples
| in a prologue heavily workload dependent? And whichever way you
| go on frame pointers, there are winners and losers to including
| them by default.
| thegeomaster wrote:
| There are no performance winners if you include them by
| default. There will be an additional >0% overhead when you are
| executing additional code in the prologue and epilogue, and
| increasing the register pressure by removing rbp from being
| ever allocated.
|
| There are only "winners" in the sense that people will be able
| to more easily see why their never-tuned system is so slow. On
| the other hand, you're punishing all perf-critical usecases
| with unnecessary overhead.
|
| I believe if you have a slow system, it's up to you to profile
| and optimize it, and that includes even recompiling some
| software with different flags to enable profiling. It's not the
| job of upstream to make this easier for you if it means
| punishing those workloads where teams have diligently profiled
| and optimized through the years so that there is no, as the
| author says, low-hanging fruit to find.
| dap wrote:
| I've been around long enough to have had frame pointers
| pretty ubiquitously, then lost them, and now starting to have
| them again. The dark times in the middle were painful. For
| the software I've worked on, the easy dynamic profiling using
| frame pointers (eg using DTrace) has given far more in
| performance wins than omitting them would have. (Part of my
| beef with the article is that while edge cases do break some
| samples, in practice it's a very small fraction, and almost
| by definition not the important ones if you're trying to find
| heavy on-CPU code paths.)
|
| I get that some use cases may be better without frame
| pointers. A well-resourced team can always recompile the
| world, whichever the default is. It's just that my experience
| is that most software is not already perfectly tuned and I'd
| much rather the default be more easily observable.
| thegeomaster wrote:
| Look, it's likely we just come from different backgrounds.
| Most of my perf-sensitive work was optimizing inner loops
| with SIMD, allowing the compiler to inline hot functions,
| creating better data structures to make use of the CPU
| cache, etc. Frame pointer prologue overhead was measurable
| on most of our use-cases. I have a smaller amount of
| experience on profiling systems where calls trace across
| multiple processes, so maybe I haven't felt this pain
| enough. Though I still think the onus should be on teams to
| be able to comfortably recompile---not the world---but some
| part of it. After all, a lot of tuning can only be done
| through compile flags, such as turning off
| codepaths/capabilities which are unnecessary.
| dap wrote:
| Makes sense.
|
| I wasn't exaggerating about recompiling the world,
| though. Even if we say I'm only interested in profiling
| my application, a single library compiled without frame
| pointers makes useless any samples where code in that
| library was at the top of the stack. I've seen that be
| libc, openssl, some random Node module or JNI thing, etc.
| You can't just throw out those samples because they might
| still be your application's problem. For me in those
| situations, I would have needed to recompile most of the
| packages we got from both the OS distro and the
| supplemental package repo.
| audidude wrote:
| I think your viewpoint is valid.
|
| My experience is on performance tuning the other side you
| mention. Cross-application, cross-library, whole-system,
| daemons, etc. Basically, "the whole OS as it's shipped to
| users".
|
| For my case, I need the whole system set up correctly
| before it even starts to be useful. For your case, you
| only need the specific library or application compiled
| correctly. The rest of the system is negligible and
| probably not even used. Who would optimize SIMD routines
| next to function calls anyway?
| audidude wrote:
| I added support to Sysprof this weekend for unwinding using
| libdwfl and DWARF/CFI/eh_frame/etc techniques that Serhei did in
| eu-stacktrace.
|
| The overhead is about 10% of samples. But at least you can unwind
| on systems without frame-pointers. Personally I'll take the
| statistical anomalies of frame-pointers which still allow you to
| know what PID/TID are your cost center even if you don't get
| perfect unwinds. Everyone seems motivated towards SFrame going
| forward, which is good.
|
| https://blogs.gnome.org/chergert/2024/11/03/profiling-w-o-fr...
| Brian_K_White wrote:
| Is this a response to Alma Kitten?
|
| In any event I don't understand why frame pointers need to be in
| by default instead of developers enabling where needed.
|
| Having Kitten include pointers by default seems reasonable
| enough, since Kitten is a devel system.
| ithkuil wrote:
| It's useful to be able to profile on production workloads
| rwmj wrote:
| The real benefit is being able to turn on profiling when a
| problem is spotted, or in some cases to be able to profile
| continuously in production (as apparently they do at Netflix).
| thegeomaster wrote:
| I get it. This _frustrated me to no end_. But still I did
| what I had to do --- recompiled random software throughout
| the stack, enabled random flags, etc. It was doable and now I
| can do it much faster. I don't think it's fair for upstream
| to disable a useful optimization just so _I_ don't have to
| do this additional work to fix and optimize _my system_.
| rwmj wrote:
| Doing real world, whole system profiling, we've found
| performance was affected by completely unexpected software
| running on the system. Recompiling the entire distribution,
| or even the subset of all software installed, is not
| realistic for most people. Besides, I have measured the
| overhead of frame pointers and it's less than 1%, so
| there's not really any trade-off here.
|
| Anyway, soon we'll have SFrame support in the userspace
| tools and the whole issue will go away.
| thegeomaster wrote:
| In one of my jobs, a 1% perf regression (on a more
| stable/reproducible system, not PCs) was a reason for a
| customer raising a ticket, and we'd have to look into it.
| For dynamically dispatched but short functions, the
| overhead is easily more than 1% too. So, there _is_ a
| trade-off, just not one that affects you.
| Brian_K_White wrote:
| I think it comes down to numbers. What are most installed
| systems used for? Do more than 50% of installed systems
| need to be doing this profiling all the time on just all
| binaries such that they just need to be already built this
| way without having to identify them and prepare them ahead
| of time?
|
| If so, then it should be the default.
|
| If it's a close call, then there should be 2 versions of
| the iso and repos.
|
| As many developers and service operators as there are, as
| much as everyone on this page is including both you and I,
| I still do not believe the profiling use case is the
| majority use case.
|
| The way I am trying to judge "majority" is: Pick a binary
| at random from a distribution. Now imagine all running
| instances of that binary everywhere. How many of those
| instances need to be profiled? Is it really most of them?
|
| So it's not just unsympathetic "F developers/services
| problems". I are one myself.
| nemetroid wrote:
| Do 50% of users need to be able to:
|
| * modify system services?
|
| * run a compiler?
|
| * add custom package repositories?
|
| * change the default shell?
|
| I believe the answer to all of the above is "no".
| redox99 wrote:
| All those things are free in terms of performance though.
| Brian_K_White wrote:
| I don't see how this applies. Some shell has to be the
| default one, and all systems don't pick the same one
| even. Most systems don't install a compiler by default.
| Thank you for making my point?
| nemetroid wrote:
| All these things are _possible_ to do, even though only
| developers need them. Why shouldn't the same be true for
| useful profiling abilities? Because of the 1-2% penalty?
| Brian_K_White wrote:
| Are you serious?
|
| Visa makes billions per year off of nothing but
| collecting a mere 2%-3% tax on everything else.
| oasisaimlessly wrote:
| I don't see how Visa is in any way relevant here.
| Brian_K_White wrote:
| I don't see why not.
|
| The whole point of an analogy is to expose a blind spot
| by showing the same thing in some other context where it
| is recognized or perceived differently.
| recursivecaveat wrote:
| Everyone benefits from the net performance wins that come
| from an ecosystem where everyone can easily profile
| things. I have no doubt that works out to more than a 1%
| lifetime improvement. Same reason you log stuff on your
| servers. 99.9% pure overhead, never even seen by a human.
| Slows stuff down, even causes uptime issues sometimes
| from bugs or full disks. It's still worthwhile though
| because occasionally it makes fixes or enhancements
| possible that are so much larger than the cost of the
| observability.
| Brian_K_White wrote:
| Then Netflix can enable it for their systems? Are they
| actually still profiling cat and ls that come from the os or
| are they profiling their own applications and the
| interpreters and daemons they run on?
|
| This does not explain why a distribution should have such a
| feature on by default. It only explains why Netflix wants it
| on some of their systems.
| the_mitsuhiko wrote:
| > Then Netflix can enable it for their systems?
|
| And they did.
|
| The question is though why only Netflix should benefit from
| that. It takes a lot of effort to recompile an entire Linux
| distribution.
| soraminazuki wrote:
| People across the industry are suffering from incomplete
| stacktraces because their applications call into libraries
| like glibc or OpenSSL that have frame pointer optimization
| enabled by their distro. It's pretty ridiculous to have to
| pull off a Linux from Scratch on CentOS just to get a
| decent stacktrace. Needless to say, this has nothing at all
| to do with profiling cat and ls.
| pkhuong wrote:
| OpenSSL is the worst because some configurations execute
| asm generated by a specialised program. That code
| clobbers the frame pointer (gotta go fast!) but isn't
| annotated with dwarf unwinding info (what do you mean you
| want to know what led to your app crashing in
| OpenSSL?)...
| Brian_K_White wrote:
| Quoting my other comment in this thread:
|
| ---
|
| I think it comes down to numbers. What are most installed
| systems used for? Do more than 50% of installed systems
| need to be doing this profiling all the time on just all
| binaries such that they just need to be already built this
| way without having to identify them and prepare them ahead
| of time?
|
| If so, then it should be the default.
|
| If it's a close call, then there should be 2 versions of
| the iso and repos.
|
| As many developers and service operators as there are, as
| much as everyone on this page is including both you and I,
| I still do not believe the profiling use case is the
| majority use case.
|
| The way I am trying to judge "majority" is: Pick a binary
| at random from a distribution. Now imagine all running
| instances of that binary everywhere. How many of those
| instances need to be profiled? Is it really most of them?
|
| So it's not just unsympathetic "F developers/services
| problems". I are one myself.
|
| ---
|
| "people across the industry" is a meaningless and valueless
| term and is an empty argument.
| loeg wrote:
| Meta also continuously profiles in production, FWIW.
| adrian_b wrote:
| In reality all this discussion has its origin in a design
| mistake made by Intel already in the 8086 CPU, when it was
| launched in 1978.
|
| They have designed the instruction set in such a way that two
| distinct registers were necessary for fulfilling the roles of
| the stack pointer and of the frame pointer.
|
| In better designed instruction sets, for example in IBM POWER,
| a single register is enough for fulfilling both roles,
| simultaneously being both stack pointer and frame pointer.
|
| Unfortunately, the Intel designers have not thought at all
| about this problem, but in 1978 they have just followed the
| example of the architectures popular at that time, e.g. DEC
| VAX, which had also made the same mistake of reserving two
| distinct registers for the roles of stack pointer and of frame
| pointer.
|
| In the architectures where a single register plays both roles,
| the stack pointer always points to a valid stack frame that is
| a part of a linked list of all stack frames. For this to work,
| there must be an atomic instruction for both creating a new
| stack frame (which consists in storing the old frame pointer in
| the right place of the new stack frame) and updating the stack
| pointer to point to the new stack frame. The Intel/AMD ISA does
| not have such an atomic instruction, and this is the reason why
| two registers are needed for creating a new stack frame in a
| safe way (safe means that the frame pointer always points to a
| valid stack frame and the stack pointer always points to the
| top of stack).
| brenns10 wrote:
| > I don't understand why frame pointers need to be in by
| default instead of developers enabling where needed
|
| If you enable frame pointers, you need to recompile every
| library your executable depends on. Otherwise, the unwind will
| fail at the first function that's not part of your executable.
| Usually library function calls (like glibc) are at the top of
| the stack, so for a large portion of the samples in a typical
| profile, you won't get any stack unwind at all.
|
| In many (most?) cases recompiling all those libraries is just
| infeasible for the application developers, which is why the
| distro would need to do it. Developers can still choose whether
| to include frame pointers in their own applications (and so
| they can still pick up those 1-2% performance gains in their
| own code). But they're stuck with frame pointers enabled on all
| the distro provided code.
|
| So the choice developers get to make is more along the lines
| of: should they use a distro with FP or without. Which is
| definitely not ideal, but that's life.
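
As a rough illustration of the unwind failure described above, the sketch below simulates a frame-pointer walk in Python. The memory layout, the addresses, and the `unwind` helper are hypothetical and chosen only for this example; a real unwinder reads the sampled process's stack rather than a dictionary.

```python
from typing import Dict, List

WORD = 8  # size of one stack slot on x86-64


def unwind(memory: Dict[int, int], ip: int, rbp: int, max_depth: int = 64) -> List[int]:
    """Follow the chain of saved frame pointers and collect return addresses.

    Assumes each frame that keeps a frame pointer stores the caller's rbp at
    [rbp] and the return address at [rbp + WORD].
    """
    trace = [ip]
    while len(trace) < max_depth:
        if rbp not in memory or rbp + WORD not in memory:
            break  # rbp does not point at a readable frame record
        trace.append(memory[rbp + WORD])
        next_rbp = memory[rbp]
        if next_rbp <= rbp:
            break  # outermost frame reached, or a corrupt chain
        rbp = next_rbp
    return trace


# Hypothetical stack for main -> app_func -> lib_func (addresses are made up).
MAIN_FRAME, APP_FRAME, LIB_FRAME = 0x7000_0100, 0x7000_00C0, 0x7000_0080
memory = {
    MAIN_FRAME: 0, MAIN_FRAME + WORD: 0x1000,         # returns into _start
    APP_FRAME: MAIN_FRAME, APP_FRAME + WORD: 0x2010,  # returns into main
    LIB_FRAME: APP_FRAME, LIB_FRAME + WORD: 0x3020,   # returns into app_func
}
SAMPLED_IP = 0x4004  # somewhere inside lib_func

# Everything built with frame pointers: the whole chain is recovered.
full = unwind(memory, SAMPLED_IP, LIB_FRAME)

# lib_func built without frame pointers and using rbp as a scratch register:
# rbp holds an arbitrary value, so the walk stops at the sampled address.
clobbered = unwind(memory, SAMPLED_IP, 0xDEADBEEF)

# lib_func built without frame pointers but leaving rbp untouched: rbp still
# points at app_func's frame record, so the return into app_func is skipped.
skipped = unwind(memory, SAMPLED_IP, APP_FRAME)

print([hex(a) for a in full])
print([hex(a) for a in clobbered])
print([hex(a) for a in skipped])
```

In this toy model, one library frame without a frame pointer is enough to either cut the trace short or drop the caller from it, which is why the samples cannot simply be trusted or thrown away.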
| fooblaster wrote:
| I have always had issues with the perf call trace sampling with
| frame pointers, even when virtually everything in userspace
| is compiled with -fno-omit-frame-pointer. It doesn't look like any of
| the failure modes listed in the article to me though. Shrug.
|
| FYI, if you happen to be running on an intel cpu, --call-graph
| lbr uses some specialized hardware and often delivers a far
| superior result, with some notable failure modes. Really looking
| forward to when AMD implements a similar feature.
| rwmj wrote:
| The problem with Intel LBR (last branch records) is that the
| depth of the call stack is relatively limited. It depends on
| the generation of CPU, but LWN has a table here:
| https://lwn.net/Articles/680985/ Anything less than 32 is
| fairly useless for profiling from the kernel through to
| userspace.
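
The depth limit can be pictured with a toy model: a fixed-size buffer that records call frames and silently drops the oldest ones once it is full. This is only a sketch in Python; `record_call_stack` and the frame names are hypothetical, and it does not model how the hardware is actually programmed.

```python
from collections import deque
from typing import List


def record_call_stack(call_chain: List[str], depth: int) -> List[str]:
    """Keep only the `depth` most recent frames, outermost first."""
    buffer = deque(maxlen=depth)  # appending to a full deque drops the oldest entry
    for frame in call_chain:
        buffer.append(frame)
    return list(buffer)


# A made-up 40-frame chain from the outermost frame down to the leaf.
chain = [f"frame_{i}" for i in range(40)]

recorded = record_call_stack(chain, depth=32)
lost = len(chain) - len(recorded)
print(lost, recorded[0])
```

With 40 frames and a depth of 32, the eight outermost frames are missing from the record, so such samples can no longer be attributed to a common root.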
| fooblaster wrote:
| Yeah, right. I still seem to get far more "comprehensible"
| traces when using it, even with this limitation. It's often
| really easy to localize where a trace is coming from, even
| when truncated. It probably breaks flamegraphs though.
| Sesse__ wrote:
| I've tried --call-graph lbr a bunch of times, but often, it...
| just returns junk? I don't fully understand why, it sometimes
| returns wild pointers even if you don't have deep stacks.
| fooblaster wrote:
| I often get junk when sampling without lbr. Which kernel are
| you running? The quality of perf and the associated
| perf_events varies wildly across kernel versions.
| Sesse__ wrote:
| A variety of kernels over the last five years, on a
| multitude of Intel CPUs. :-) I last tested this on 6.10, I
| think.
|
| It's certainly true that there can be junk in --call-graph
| fp, too.
| laserbeam wrote:
| "enabling frame pointers is a 1-2% performance loss, which
| translates to the loss of about 1 or 2 years of compiler
| improvements"
|
| Wait, are we really that close to the maximum of what a compiler
| can optimize that we're getting barely 1% performance
| improvements per year with new versions?
| clausecker wrote:
| Yeah, compilers are already pretty close to the limit of what
| is possible, unless your code is unusually poorly written.
| londons_explore wrote:
| Clearly this isn't the case. Plenty of neat C++ "reference
| implementation" code ends up 5x faster when hand optimized,
| parallelized, vectorized, etc.
|
| There are some transformations that compilers are really bad
| at. Rearranging data structures, switching out algorithms for
| equivalent ones with better big-O complexity, generating &
| using lookup tables, bit-packing things, using caches, hash
| tables and bloom filters for time/memory trade offs, etc.
|
| The spec doesn't prevent such optimizations, but current
| compilers aren't smart enough to find them.
| adrianN wrote:
| Imagine the outcry if compilers switched algorithms. How
| can the compiler know my input size and input distribution?
| Maybe my dumb algorithm is optimal for my data.
| londons_explore wrote:
| Compilers can easily runtime-detect the size and shape of
| the problem, and run different code for different problem
| sizes. Many already do for loop-unrolling. Ie. if you
| memcpy 2 bytes, they won't even branch into the fancy
| SIMD version.
|
| This would just be an extension of that. If the code
| creates and uses a linked list, yet the list is 1M items
| long and being accessed entirely by index, branch to a
| different version of the code which uses an array, etc.
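
The idea of running different code for different problem sizes can be sketched by hand in Python. This is not something a compiler emits; `size_dispatched_sort` and the cutoff value are hypothetical, and a real threshold would have to be measured.

```python
from typing import List

SMALL_INPUT_THRESHOLD = 16  # hypothetical cutoff; a real one would be measured


def insertion_sort(values: List[int]) -> List[int]:
    """Simple quadratic sort with low overhead, suited to tiny inputs."""
    result = list(values)
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result


def merge_sort(values: List[int]) -> List[int]:
    """O(n log n) sort that pays off on larger inputs."""
    if len(values) <= 1:
        return list(values)
    middle = len(values) // 2
    left = merge_sort(values[:middle])
    right = merge_sort(values[middle:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def size_dispatched_sort(values: List[int]) -> List[int]:
    """Check the problem size at run time and branch to the matching version."""
    if len(values) <= SMALL_INPUT_THRESHOLD:
        return insertion_sort(values)
    return merge_sort(values)
```

Both branches return the same result; only the cost profile differs, which is what makes this kind of dispatch safe to apply automatically.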
| laserbeam wrote:
| That's my question. I'm also under the impression that
| optimizations CAN be made manually, but I find it
| surprising that "current compilers aren't smart enough to
| find them" isn't improving.
| londons_explore wrote:
| The percentage of all software engineers working on
| compilers is probably lower now than it ever has been...
| thechao wrote:
| As a part time compiler author I'm extremely skeptical we're
| getting a global 1-2%/yr. I'd've thought more like a tenth to
| half that? I've not seen any numbers, so I'm just making shit
| up.
|
| However, for sure, if compiler optimizations disappeared, HW
| would pick up the slack in a few years.
| fanf2 wrote:
| Proebsting's Law suggests 4% per year, but as a satirical joke
| it seems to have underdone its cynicism.
|
| https://gwern.net/doc/cs/algorithm/2001-scott.pdf
| variadix wrote:
| There's likely a lot of performance still on the table if
| compilers were permitted to change data structure layout, but I
| think doing this effectively is an open problem.
|
| Current compilers could do a lot better with vectorization, but
| it will often be limited by the data structure layout.
| clausecker wrote:
| The "function prologue is at least 8 bytes long" bit only applies
| if CET is used. If it is not used, the endbr64 instruction is not
| emitted and the prologue is only 4 bytes long.
| elteto wrote:
| Didn't really get the point of the post as it just presents
| something without a conclusion.
|
| 9X% of users do not care about a <1% drop in performance. I
| suspect we get the same variability just by going from one kernel
| version to another. The impact from all the Intel mitigations
| that are now enabled by default is much worse.
|
| However I do care about nice profiles and stack traces without
| having to jump through hoops.
|
| Asking people to recompile an _entire_ distribution just to get
| sane defaults is wrong. Those who care about the last drop should
| build their custom systems as they see fit, and they probably
| already do.
| yosefk wrote:
| it does present a conclusion. once the kernel supports .sframe
| it will be all-around superior to -fomit-frame-pointer, and a
| better default for distros to use.
| audidude wrote:
| It does cause more memory pressure because the kernel will
| have to look at the user-space memory for decoding registers.
|
| So yes it will be faster than alternatives to frame-pointers,
| but it still won't be as fast as frame pointers.
| Brian_K_White wrote:
| But does what you care about matter enough to be the default?
|
| Are you the majority?
|
| Evaluate "majority" this way: For every/any random binary in a
| distro, out of all the currently running instances of that
| binary in the world at any given moment, how many of those need
| to be profiled?
|
| There is no way the answer is "most of them".
|
| You have a job where you profile things, and maybe even you
| profile almost everything you touch. Your whole world has a
| high quotient of profiling in it. So you want the whole system
| built for profiling by default. How convenient for you. But
| your whole world is not _the_ whole world.
|
| But it's not just you, there are, zomg thousands, tens of
| thousands, maybe even _hundreds of thousands_ of developers and
| ops admins the same as you.
|
| Yes and? Is even that most installed instances of any given
| executable? No way.
|
| Or maybe yes. It's possible. Can you show that somehow? But I
| will guess no way and not even close.
| audidude wrote:
| I regularly have users run Sysprof and upload it to issues.
| It's immensely powerful to be able to see what is going on
| systems which are having issues. I'd argue it's one of the
| major reasons GNOME performance has gotten so much better in
| the recent-past.
|
| You can't do that when step one is reinstall another distro
| and reproduce your problem.
|
| Additionally, the overhead for performance related things
| that could fall into the 1% range (hint, it's not much)
| rarely are using the system libraries in such a way anyway
| that would cause this. They can compile that app with frame-
| pointers disabled. And for stuff where they do use system
| libraries (qsort, bsearch, strlen, etc) the frame pointer is
| negligible to the work being performed. Your margin of
| error is way larger than the theoretical overhead.
| Brian_K_White wrote:
| 1% is a ton. 1% is crazy. Visa owns the world off just a 3%
| tax on everything else. Brokers make billions off of just
| 1% or even far less.
|
| 1% _of all activity_ is only rational if you get more than
| 1% _of all activity_ back out from those times and places
| where it was used.
|
| 1%, when it's of everything, is an absolutely stupendous
| colossal number that is absolutely crazy to try to treat
| as trivial.
| ploxiln wrote:
| Better analogy: you're paying 30% to apple, and over 50%
| in bad payday loans, and you're worried about the 3%
| visa/stripe overhead ... that's kinda crazy. But that's
| where we are in computer performance, there's 10x, 100x,
| and even greater inefficiencies everywhere, 1% for better
| backtraces is nothing.
| audidude wrote:
| Absolutely. We've gotten numerous double digit
| performance improvements across applications, libraries,
| and system daemons because of frame-pointers in Fedora
| (and that's just from me).
| wbl wrote:
| Performance problems matter to the people who have them, who
| often are in an inconvenient place. Having the ability for
| profiling to just work means that it's easy to help these
| people.
| elteto wrote:
| I think you are trying to make this out something that it
| isn't.
|
| Visibility at the "cost" of negligible impact is more
| important than raw performance. That's it.
|
| I'm a regular user of Linux with some performance sensitivity
| that does not go as far as "I _need_ that extra register!".
| That's what the majority of developers working on Linux are
| like. I think it's up to _you_ to prove the contrary.
| PittleyDunkin wrote:
| This seems like a ridiculous attempt to bury your head in the
| sand. Is there any evidence _anyone_ doesn't want frame
| pointers?
| Brian_K_White wrote:
| I think it's ridiculous to question that since obviously,
| yes, many people have decided exactly that. I see no point
| myself and I'm even in the field. And I am not in charge of
| all the distributions which disabled it by default.
|
| So, "yes". In fact "yes, duh?" Talk about head in sand...
| PittleyDunkin wrote:
| Ok, where's the evidence?
|
| > I see no point myself and I'm even in the field.
|
| You don't see the point of readable stack traces?
| Brian_K_White wrote:
| Nope. Not on 99.999% of installed binaries in existence
| and running at a given moment.
| PittleyDunkin wrote:
| That strikes me as an insane take (not to mention
| blatantly inaccurate), but I take your point that this is
| a common one for distribution-maintainers to have.
| dap wrote:
| > Evaluate "majority" this way: For every/any random binary
| in a distro, out of all the currently running instances of
| that binary in the world at any given moment, how many of
| those need to be profiled?
|
| > There is no way the answer is "most of them".
|
| This is an absurd way to evaluate it. All it takes is one
| savvy user to report a performance problem that developers
| are able to root-cause using stack traces from the user's
| system. Suppose they're able to make a 5% performance
| improvement to the program. Now _all_ users' programs are 5%
| faster because of the frame pointers on this one user's
| system.
|
| At this point people usually ask: but couldn't developers
| have done that on their own systems with debug code? But the
| performance of debug code is not the same as the performance
| of shipping code. And not all problems manifest the same on
| all systems. This is why you need shipping code to be
| debuggable (or instrumentable or profileable or whatever you
| want to call it).
| baq wrote:
| fifty independent 1% performance drops nobody cares about
| compound to a ~40% reduction.
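
The compounding figure can be checked with a few lines of Python. The function name is hypothetical; the inputs (fifty drops of 1% each) come from the comment above.

```python
def remaining_performance(drop: float, count: int) -> float:
    """Fraction of the original performance left after `count` independent drops."""
    return (1.0 - drop) ** count


remaining = remaining_performance(0.01, 50)
reduction = 1.0 - remaining
print(f"remaining: {remaining:.3f}, reduction: {reduction:.1%}")
```

Since 0.99 to the 50th power is roughly 0.605, the total reduction comes out at about 39.5%, in line with the "~40%" estimate.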
| josefx wrote:
| > 9X% of users do not care about a <1% drop in performance.
|
| Except Python got opted out of the frame pointer change due to
| benchmarks showing slowdowns of up to 10%. The discussion
| around that had the great idea of just adding a pragma to flat
| out override the build setting. So in the end that "1%"
| reduction claim only holds if everything even remotely affected
| silently ignores the flag.
| audidude wrote:
| This is a bit of a mischaracterization of the Python side of
| things.
|
| They only opted out for 3.11 which did not yet have the perf-
| integration fixes anyway. 3.12 uses frame-pointers just fine.
| josefx wrote:
| Any link to the fix or documentation about it? I could find
| added perf support but did not see anything about improved
| performance related to frame pointer use.
| jeffbee wrote:
| Complaining about frame pointers is like complaining about the
| budget of the Bureau of Labor Statistics. Yes, it's pure
| overhead, but also yes, it's good to know what is going on.
| ot wrote:
| I broadly agree with the thesis of the post, which if I
| understand correctly is that frame pointers are a temporary
| compromise until the whole ecosystem gets its act together and
| manages to agree on some form of out-of-band tracking of frame
| pointers, and it seems that we'll eventually get there.
|
| Some of the statements in the post seem odd to me though.
|
| - 5% of system-wide cycles spent in function prologues/epilogues?
| That is _wild_ , it can't be right.
|
| - Is using the whole 8 bytes right for the estimate? Pushing the
| stack pointer is the first instruction in the prologue and it's
| literally 1 byte. Epilogue is symmetrical.
|
| - Even if we're in the prologue, we know that we're in a leaf
| call, we can still resolve the instruction pointer to the
| function, and we can read the return address to find the parent,
| so what information is lost?
|
| When it comes to future alternatives, while frame pointers have
| their own problems, I think that there are still a few open
| questions:
|
| - Shadow stacks are cool but aren't they limited to a fixed
| number of entries? What if you have a deeper stack?
|
| - Is the memory overhead of lookup tables for very large programs
| acceptable?
| audidude wrote:
| > Shadow stacks are cool but aren't they limited to a fixed
| number of entries?
|
| Current available hardware yes. But I think some of the future
| Intel stuff was going to allow for much larger depth.
|
| > Is the memory overhead of lookup tables for very large
| programs acceptable?
|
| I don't think SFrame is as "dense" as DWARF as a format so you
| trade a bit of memory size for a much faster unwind experience.
| But you are definitely right that this adds memory pressure
| that could otherwise be ignored.
|
| Especially if the anomalies are what they sound like, just
| account for them statistically. You get a PID for cost
| accounting in the perf_event frame anyway.
| rwmj wrote:
| > - Is using the whole 8 bytes right for the estimate? Pushing
| the stack pointer is the first instruction in the prologue and
| it's literally 1 byte. Epilogue is symmetrical.
|
| I believe it's because of the landing pad for Control Flow
| Integrity which basically all functions now need. Grabbing
| main() from a random program on Fedora (which uses frame
| pointers):
|
|     0000000000007000 <main>:
|         7000: f3 0f 1e fa    endbr64          ; landing pad
|         7004: 55             push %rbp        ; set up frame pointer
|         7005: 48 89 e5       mov %rsp,%rbp
|
| It's not much of an issue in practice as the stack trace will
| still be nearly correct, enough for you to identify the
| problematic area of the code.
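
The 8-byte figure can be reproduced by adding up the instruction encodings from the disassembly above. This is a small Python sketch; the `prologue_length` helper is hypothetical, and it assumes the prologue consists of exactly these three instructions.

```python
from typing import List, Tuple

# Instruction encodings copied from the disassembly above.
PROLOGUE: List[Tuple[str, bytes]] = [
    ("endbr64", bytes.fromhex("f30f1efa")),      # CET landing pad
    ("push %rbp", bytes.fromhex("55")),          # save the caller's frame pointer
    ("mov %rsp,%rbp", bytes.fromhex("4889e5")),  # establish the new frame pointer
]


def prologue_length(with_cet: bool) -> int:
    """Total prologue size in bytes, optionally without the endbr64 landing pad."""
    return sum(
        len(code) for name, code in PROLOGUE if with_cet or name != "endbr64"
    )


print(prologue_length(with_cet=True))   # 8
print(prologue_length(with_cet=False))  # 4
```

The landing pad accounts for half of the window, so the same prologue is only 4 bytes long when endbr64 is not emitted.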
|
| > - Shadow stacks are cool but aren't they limited to a fixed
| number of entries? What if you have a deeper stack?
|
| Yes shadow stacks are limited to 32 entries on the most recent
| Intel CPUs (and as little as 4 entries on very old ones).
| However they are basically cost free so that's a big advantage.
|
| I think SFrame is a sensible middle ground here. It's saner
| than DWARF and has a long history of use in the kernel so we
| know it will work.
| Sesse__ wrote:
| If you're limited to 32 entries, why not just use LBR, then?
| It has basically the same pros and cons.
| Sesse__ wrote:
| > - 5% of system-wide cycles spent in function
| prologues/epilogues? That is wild, it can't be right.
|
| TBH I wouldn't be surprised on x86. There are so many registers
| to be pushed and popped due to the ABI, so every time I profile
| stuff I get depressed... Aarch64 seems to be better, the
| prologues are generally shorter when I look at those. (There's
| probably a reason why Intel APX introduces push2/pop2
| instructions.)
| manwe150 wrote:
| This sounds to me more like an inlining problem than an ABI
| problem. If the calls take as much time as the running,
| perhaps you just need a better language that doesn't
| arbitrarily prevent inlining due to compilation boundaries
| (eg. basically any modern language that isn't in the C/C++
| family, before LTO)
| Sesse__ wrote:
| I see this in LTO/PGO binaries as well. If a function is 20
| instructions long, it's not like you can inline it
| uncritically, yet a five-cycle prologue and a five-cycle
| epilogue will hurt. (Also, recursive functions etc.)
| quotemstr wrote:
| > temporary compromise until the whole ecosystem gets its act
| together and manages to agree on some form of out-of-band
| tracking of frame pointers,
|
| Temporary solutions have a way of becoming permanent. I was
| against the recent frame pointer enablement on the grounds of
| moral hazard. I still think it would have been better to force
| the ecosystem to get its act together first.
|
| Another factor nobody is talking about is JITed and interpreted
| languages. Whatever the long-term solution might be, it should
| enable stack traces that interleave accurate source-level frame
| information from native and managed code. The existing perf
| /tmp file hack is inadequate in many ways, including security,
| performance, and compatibility with multiple language runtimes
| coexisting in a single process.
| audidude wrote:
| It's a disaster no doubt.
|
| But, at least from the GNOME side of things, we've been
| complaining about it for roughly 15 years and kept getting
| push-back in the form of "we'll make something better".
|
| Now that we have frame-pointers enabled in Fedora, Ubuntu,
| Arch, etc we're starting to see movement on realistic
| alternatives. So in many ways, I think the moral hazard was
| waiting until 2023 to enable them.
| tempfile wrote:
| The JS constantly grabbing the anchor and updating it is
| absolutely appalling UX. It took me something like 11 back button
| presses to get back to where I was. Borderline malware.
___________________________________________________________________
(page generated 2024-11-04 23:00 UTC)