[HN Gopher] The return of the frame pointers
___________________________________________________________________
The return of the frame pointers
Author : mfiguiere
Score : 517 points
Date : 2024-03-17 03:59 UTC (19 hours ago)
(HTM) web link (www.brendangregg.com)
(TXT) w3m dump (www.brendangregg.com)
| adsharma wrote:
| I was at Google in 2005 on the other side of the argument. My
| view back then was simple:
|
| Even if $BIG_COMPANY makes a decision to compile everything with
| frame pointers, the rest of the community is not. So we'll be
| stuck fighting an unwinnable argument with a much larger
| community. Turns out that it was a ~20 year argument.
|
| I ended up writing some patches to make libunwind work for
| gperftools and maintained libunwind for some number of years as a
| consequence of that work.
|
| Having moved on to other areas of computing, I'm now a passive
| observer. But it's fascinating to read history from the other
| perspective.
| starspangled wrote:
| > So we'll be stuck fighting an unwinnable argument with a much
| larger community.
|
| In what way would you be stuck? What functional problems does
| _adding_ frame pointers introduce?
| tempay wrote:
| It "wastes" a register when you're not actively using them.
| On x86 that can make a big difference, though with the added
| registers of x86_64 it much less significant.
| nlewycky wrote:
| It caused a problem when building inline assembly heavy
| code that tried to use all the registers, frame pointer
| register included.
| starspangled wrote:
| Right, but I was asking about functional problems (being
| "stuck"), which sounded like a big issue for the choice.
| charleshn wrote:
| It's not just the loss of an architectural register, it's
| also the added cost to the prologue/epilogue. Even on
| x86_64, it can make a difference, in particular for small
| functions, which might not be inlined for a variety of
| reasons.
| Asooka wrote:
| If your small function is not getting inlined, you should
| investigate why that is instead of globally breaking
| performance analysis of your code.
| Sesse__ wrote:
| A typical case would be C++ virtual member functions.
| (They can sometimes be devirtualized, or speculatively
| partially devirtualized, using LTO+PGO, but there are
| lots of legitimate cases where they cannot.)
| kaba0 wrote:
 | CPUs spend an enormous amount of time waiting for IO and
 | memory, and push/pop and similar instructions are insanely
 | well optimized. As the article also claims, I would be very
 | surprised to see any effect, unless the extra instructions
 | themselves spill the I-cache.
| charleshn wrote:
| I've seen around 1-3% on non micro benchmarks, real
| applications.
|
 | See also this benchmark from Phoronix [0]:
 |
 | > Of the 100 tests carried out for this article, when
 | > taking the geometric mean of all these benchmarks it
 | > equated to about a 14% performance penalty of the
 | > software with -O2 compared to when adding -fno-omit-
 | > frame-pointer.
|
| I'm not saying these benchmarks or the workloads I've
| seen are representative of the "real world", but people
| keep repeating that frame pointers are basically free,
| which is just not the case.
|
| [0] https://www.phoronix.com/review/fedora-frame-pointer
| inkyoto wrote:
 | Wasting a register on comparatively more modern ISAs (PA-
 | RISC 2.0, MIPS64, POWER, aarch64 etc. - they are all more
 | modern and have an abundance of general purpose registers)
 | is not a concern.
 |
 | The actual "wastage" is in having to generate a prologue
 | and an epilogue for each function - 2x instructions to
 | preserve the old frame pointer and set a new one up, and 2x
 | instructions at the point of return to restore the previous
 | frame pointer.
 |
 | Generally, it is not a big deal, with the exception of the
 | pathological case of a very large number of very small
 | functions calling each other frequently, where the extra 4x
 | instructions for each such function will fill up the L1
 | instruction cache unnecessarily.
| weebull wrote:
| Those pathological cases are really what inlining is for,
| with the exception of any tiny recursive functions that
| can't be tail call optimised.
| adsharma wrote:
| I wasn't talking about functional problems. It was a simple
| observation that big companies were not going to convince
| Linux distributors to add frame pointers anytime soon and
| that what those distributors do is relevant.
|
 | All of the companies involved believed that they were special
 | and decided to build their own (poorly managed) distribution
 | called "third party code", and having to deal with it was not
 | my best experience working at these companies.
| starspangled wrote:
| Oh, I just assumed you were talking about Google's Linux
| distribution and applications it runs on its fleet. I must
| have mis-assumed. Re-reading... maybe you weren't talking
| about any builds but just whether or not to oppose kernel
| and toolchain defaulting to omit frame pointers?
| adsharma wrote:
| Google didn't have a Linux distribution for a long time
| (the one everyone used on the desktop was an outdated rpm
| based distro, we mostly ignored it for development
| purposes).
|
| What existed was a x86 to x86 cross compilation
| environment and the libraries involved were manually
| imported by developers who needed that particular
| library.
|
| My argument was about the cost of ensuring that those
| libraries were compiled with frame pointers when much of
| the open source community was defaulting to omit-fp.
| dooglius wrote:
 | Would it not be easier to patch compilers to always
 | assume the equivalent of -fno-omit-frame-pointer?
| adsharma wrote:
| That was done in 2005. But the task of auditing the
| supply chain to ensure that every single shared library
| you ever linked with was compiled a certain way was still
| hard. Nothing prevented an intern or a new employee from
| checking in a library without frame pointers into the
| third-party repo.
|
| In 2024, you'd probably create a "build container" that
| all developers are required to use to build binaries or
| pay a linux distributor to build that container.
|
| But cross compilation was the preferred approach back
 | then. So all binaries had an rpath (a runtime search path
| to look for shared library) that ignored the distributor
| supplied libraries.
|
 | Having come from an open source background, I found this
| system hard to digest. But there was a lot of social
| pressure to work as a bee in a system that thousands of
| other very competent engineers are using (quite
| successfully).
|
| I remember briefly talking to a chrome OS related group
| who were using the "build your own custom distro"
| approach, before deciding to move to another faang.
| cruffle_duffle wrote:
| > or pay a linux distributor to build that container.
|
| What does this mean?
| adsharma wrote:
| I didn't mean anything nefarious here :)
|
| Since Google would rather have the best brains in the
| industry build the next search indexing algorithm or the
| browser, they didn't have the time to invest human
| capital into building a better open source friendly dev
| environment.
|
| A natural alternative is to contract out the work. Linux
| distributors were good candidates for such contract work.
|
 | But the vibe back then was that Google could build better
 | alternatives to some of these libraries, and therefore
| bridging the gap between dev experience as an open source
| dev vs in house software engineer wasn't important.
|
| You could see the same argument play out in git vs
| monorepo etc, where people take up strong positions on a
| particular narrow tech topic, whereas the larger issue
| gets ignored as a result of these belief systems.
| rwmj wrote:
 | You do get occasional regressions, e.g. we found an extremely
| obscure bug involving enabling frame pointers, valgrind,
| glibc ifuncs and inlining (all at the same time):
|
| https://bugzilla.redhat.com/show_bug.cgi?id=2267598
 | https://github.com/tukaani-project/xz/commit/82ecc538193b380...
| brcmthrowaway wrote:
| What area?
| pajko wrote:
 | There's another option: https://lesenechal.fr/en/linux/unwinding-the-stack-the-hard-...
| loeg wrote:
| Brendan mentions DWARF unwinding, actually, and briefly
| mentions why he considers it insufficient.
| haberman wrote:
| The biggest objection seems to be the Java/JIT case. eh_frame
| supports a "personality function" which is AIUI basically a
| callback for performing custom unwinding. If the personality
| function could also support custom logic for producing
| backtraces, then the profiling sampler could effectively read
| the JVM's own metadata about the JIT'ted code, which I assume
| it must have in order to produce backtraces for the JVM
| itself.
| loeg wrote:
| This also seems like a big objection:
|
| > The overhead to walk DWARF is also too high, as it was
| designed for non-realtime use.
| kouteiheika wrote:
| Not a problem in practice. The way you solve it is to
| just translate DWARF into a simpler representation that
| doesn't require you to walk anything. (But I understand
| why people don't want to do it. DWARF is insanely complex
| and annoying to deal with.)
|
| Source: I wrote multiple profilers.
| loeg wrote:
| In this thread[1] we're discussing problems with using
| DWARF directly for unwinding, not possible translations
| of the metadata into other formats (like ORC or
| whatever).
|
| [1]: https://news.ycombinator.com/item?id=39732010
| kouteiheika wrote:
| I wasn't talking about other formats. I was talking about
| preloading the information contained in DWARF into a more
| efficient in-memory representation once when your
| profiler starts, and then the problem of "the overhead is
| too high for realtime use" disappears.
 | menaerus wrote:
 | From https://fzn.fr/projects/frdwarf/frdwarf-oopsla19.pdf
 |
 | > DWARF-based unwinding can be a bottleneck for time-
 | > sensitive program analysis tools. For instance the perf
 | > profiler is forced to copy the whole stack on taking each
 | > sample and to build the backtraces offline: this solution
 | > has a memory and time overhead but also serious
 | > confidentiality and security flaws.
 |
 | So if I get this correctly, the problem with DWARF is that
 | building the backtrace online (on each sample) is expensive
 | compared to frame pointers, which can be mitigated by
 | building the backtrace offline at the expense of copying
 | the stack.
 |
 | However, the paper also mentions:
 |
 | > Similarly, the Linux kernel by default relies on a frame
 | > pointer to provide reliable backtraces. This incurs in a
 | > space and time overhead; for instance it has been reported
 | > (https://lwn.net/Articles/727553/) that the kernel's
 | > .text size increases by about 3.2%, resulting in a broad
 | > kernel-wide slowdown.
 |
 | and:
 |
 | > Measurements have shown a slowdown of 5-10% for some
 | > workloads (https://lore.kernel.org/lkml/20170602104048.jkkzssljsompjdwy@suse.de/T/#u).
| haberman wrote:
| But that one has at least some potential mitigation. Per
| his analysis, the Java/JIT case is the only one that has
| no mitigation:
|
| > Javier Honduvilla Coto (Polar Signals) did some
| interesting work using an eBPF walker to reduce the
| overhead, but...Java.
| rwmj wrote:
| DWARF unwinding isn't practical:
| https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwar...
| rfoo wrote:
| TBH this sounds more like perf's implementation is bad.
|
 | I'm waiting for this to happen: https://github.com/open-telemetry/community/issues/1918
| javierhonduco wrote:
| There's always room for improvement, for example, Samply
| [0] is a wonderful profiler that uses the same APIs that
| `perf` uses, but unwinds the stacks as they come rather
| than dumping them all to disk and then having to process
| them in bulk.
|
| Samply unwinds significantly faster than `perf` because it
| caches unwind information.
|
| That being said, this approach still has some limitations,
| such as that very deep stacks won't be unwound, as the size
| of the process stack the kernel sends is quite limited.
|
| - [0]: https://github.com/mstange/samply
| dap wrote:
| Good post!
|
| > Profiling has been broken for 20 years and we've only now just
| fixed it.
|
| It was a shame when they went away. Lots of people, certainly on
| other systems and probably Linux too, have found the absence of
| frame pointers painful this whole time and tried to keep them
| available in as many environments as possible. It's validating
| (if also kind of frustrating) to see mainstream Linux bring them
| back.
| trws wrote:
| I'm sincerely curious. While I realize that using dwarf for
| unwinding is annoying, why is it so bad that it's worth
| pessimizing all code on the system? It's slow on Debian
| derivatives because they package only the slow unwinding path
| for perf for example, for license reasons, but with decent
| tooling I barely notice the difference. What am I missing?
| ngcc_hk wrote:
 | It said gcc. I note that llvm has defaulted to frame
 | pointers since 2011. Is this mainly a gcc issue?
| bawolff wrote:
| It doesn't really matter what the default of the compiler is,
| but what distros chose.
| WalterBright wrote:
| Guess I'll add it back in to the DMD code generator!
| Joker_vD wrote:
| Of course, if you cede RBP to be a frame pointer, you may as well
| have two stacks, one which is pointed into by RBP and stores the
| activation frames, and the other one which is pointed into by RSP
| and stores the return addresses only. At this point, you don't
| even need to "walk the stack" because the call stack is literally
| just a flat array of return addresses.
|
| Why do we normally store the return addresses near to the local
| variables in the first place, again? There are so many downsides.
| naasking wrote:
 | It simplifies storage management. A stack frame is a simple
 | bump pointer which is always in cache, with only one guard
 | page for overflow; in your proposal you need two guard
 | pages, double the stack manipulations, and double the
 | chance of a cache miss.
| Joker_vD wrote:
| Yes, two guard pages are needed. No, the stack management
| stays the same: it's just "CALL func" at the call site, "SUB
| RBP, <frame_size>" at the prologue and "ADD RBP,
| <frame_size>; RET" at the epilogue. As for chances of a cache
| miss... probably, but I guess you also double them up when
 | you enable CET/Shadow Stack so eh.
|
| In exchange, it becomes very difficult for the stack smashing
| to corrupt the return address.
| imtringued wrote:
| The reduceron had five stacks and it was faster because of
| it.
| dan-robertson wrote:
| Note the 'shadow stacks' CPU feature mentioned briefly in the
| article, though it's more for security reasons. It's pretty
| similar to what you describe.
| rwmj wrote:
| Shadow stacks have been proposed as an alternative, although
| it's my understanding that in current CPUs they hold only a
| limited number of frames, like 16 or 32?
| amluto wrote:
| You may be thinking of the return stack buffer. The shadow
| stack holds every return address.
| astrobe_ wrote:
 | You may be ready for Forth [1] ;-). Strangely, the Wikipedia
 | article apparently doesn't mention that Forth allows access
 | to both the parameter and the return stacks, which is a
 | major feature of the model.
|
| [1] https://en.wikipedia.org/wiki/Forth_(programming_language)
| mikewarot wrote:
| Forth has a parameter stack, return stack, vocabulary stack
|
| STOIC, a variant of Forth, includes a file stack when loading
| words
| samatman wrote:
| I'm not sure what you're referring to with "vocabulary
| stack" here, perhaps the dictionary? More of a linked list,
| really a distinctive data structure of its own.
| astrobe_ wrote:
| Maybe OP refers to vocabulary search order manipulation
| [1]. It's sort of like namespaces, but "stacked". There's
| also the old MARKER and FORGET pair [2].
|
| The dictionary pointer can also be manipulated in some
| dialects. That can be used directly as the stack variant
| of the arena allocator idea. It is particularly useful
| for text concatenation.
|
| [1] https://forth-standard.org/standard/search [2]
| https://forth-standard.org/standard/core/MARKER
| samatman wrote:
| That does seem like a significant oversight. >r and r>, and
| cousins, are part of ANSI Forth, and I've never used a Forth
| which doesn't have them.
| stefan_ wrote:
| While here, why do we grow the stack the wrong way so
| misbehaved programs cause security issues? I know the reason of
| course, like so many things it last made sense 30 years ago,
| but the effects have been interesting.
| sweetjuly wrote:
| >Why do we normally store the return addresses near to the
| local variables in the first place, again? There are so many
| downsides.
|
| The advantage of storing them elsewhere is not quite clear
| (unless you have hardware support for things like shadow
| stacks).
|
| You'd have to argue that the cost of moving things to this
| other page and managing two pointers (where one is less
| powerful in the ISA) is meaningfully cheaper than the other
| equally effective mitigation of stack cookies/protectors which
| are already able to provide protection only where needed. There
| is no real security benefit to doing this over what we
| currently have with stack protectors since an arbitrary
| read/write will still lead to a CFI bypass.
| weebull wrote:
| > The advantage of storing them elsewhere is not quite clear
| (unless you have hardware support for things like shadow
| stacks).
|
| The classic buffer overflow issue should spring immediately
| to mind. By having a separate return address stack it's far
| less vulnerable to corruption through overflowing your data
| structures. This stops a bunch of attacks which purposely put
| crafted return addresses into position that will jump the
| program to malicious code.
|
| It's not a panacea, but generally keeping code pointers away
| from data structures is a good idea.
| claytonwramsey wrote:
| That's very interesting to me - I had seen the `[unknown]`
| mountain in my profiles but never knew why. I think it's a tough
| thing to justify: 2% performance is actually a pretty big
| difference.
|
| It would be really nice to have fine-grained control over frame
| pointer inclusion: provided fine-grained profiling, we could
| determine whether we needed the frame pointers for a given
| function or compilation unit. I wouldn't be surprised if we see
| that only a handful of operations are dramatically slowed by
| frame pointer inclusion while the rest don't really care.
| naasking wrote:
| > 2% performance is actually a pretty big difference.
|
| No it's not, particularly when it can help you identify
| hotspots via profiling that can net you improvements of 10% or
| more.
| pm215 wrote:
| Sure, but how many of the people running distro compiled code
| do perf analysis? And how many of the people who need to do
| perf analysis are unable to use a with-frame-pointers version
| when they need to? And how many of those 10% perf
| improvements are in common distro code that get upstreamed to
| improve general user experience, as opposed to being in
| private application code?
|
| If you're netflix then "enable frame pointers" is a no-
| brainer. But if you're a distro who's building code for
| millions of users, many of whom will likely never need to
| fire up a profiler, I think the question is at least a little
| trickier. The overall best tradeoff might end up being still
| to enable frame pointers, but I can see the other side too.
| jart wrote:
| It's not a technical tradeoff, it's a refusal to
| compromise. Lack of frame pointers prevents many groups
| from using software built by distros altogether. If a
| distro decides that they'd rather make things go 1% faster
| for grandma, at the cost of alienating thousands of
| engineers at places like Netflix and Google who simply want
 | to volunteer millions of dollars of their employers'
| resources helping distros to find 10x performance
| improvements, then the distros are doing a great disservice
| to both grandma and themselves.
| quotemstr wrote:
| Presenting people with false dichotomies is no way to
| build something worthwhile
| alerighi wrote:
 | I mean, if you need to do performance analysis on a piece
 | of software, just recompile it. Why is it such a big deal?
 |
 | In the end, 2% of the performance of every application is a
 | big deal. On a single computer it may not be that
 | significant, but think about all the computers, servers and
 | clusters that run a Linux distro. And yes, I would ask a
 | Google engineer whether, scaled across the I-don't-know-how-
 | many servers and computers that Google has, a 2% increase in
 | CPU usage is really not a big deal: we are probably talking
 | about hundreds of kW more in energy consumption!
 |
 | We talk a lot about energy efficiency these days; to me,
 | wasting 2% only to make performance analysis of some
 | software easier (that is, being able to analyze the package
 | shipped by the distro directly, without recompiling it) is
 | stupid.
| samatman wrote:
| I would say the question here is what should be the
| default, and that the answer is clearly "frame pointers",
| from my point of view.
|
| Code eking out every possible cycle of performance can
| enable a no-frame-pointer optimization and see if it helps.
| But it's a bad default for libc, and for the kernel.
| loeg wrote:
| It's usually a lot less than 2%.
| inglor_cz wrote:
| The performance cost in your case may be much smaller than 2
| per cent.
|
| Don't completely trust the benchmarks on this; they are a bit
| synthetic and real-world applications tend to produce very
| different results.
|
| Plus, profiling is important. I was able to speed up various
| segments of my code by up to 20 per cent by profiling them
| carefully.
|
| And, at the end of the day, if your application is so sensitive
| about any loss of performance, you _can_ simply profile your
| code in your lab using frame pointers, then omit them in the
| version released to your customers.
| rwmj wrote:
| The measured overhead is slightly less than 1%. There have been
| some rare historical cases where frame pointers have caused
| performance to blow up but those are fixed.
 | rwmj wrote:
 | You can turn it on/off per function by attaching one of these
 | GCC attributes to the function declaration (although it
 | doesn't work on LLVM):
 |
 | __attribute__((optimize("no-omit-frame-pointer")))
 | __attribute__((optimize("omit-frame-pointer")))
| ndesaulniers wrote:
| The optimize fn attr causes other unintended side effects.
| Its usage is banned on the Linux kernel.
| tdullien wrote:
| As much as the return of frame pointers is a good thing, it's
| largely unnecessary -- it arrives at a point where multiple eBPF-
| based profilers are available that do fine using .eh_frame and
 | also manually unwinding high-level language runtime stacks:
 | both Parca from Polar Signals as well as the artist formerly
 | known as Prodfiler (now Elastic Universal Profiling) do fine.
|
| So this is a solution for a problem, and it arrives just at the
| moment that people have solved the problem more generically ;)
|
| (Prodfiler coauthor here, we had solved all of this by the time
| we launched in Summer 2021)
| Tomte wrote:
| You mean we don't need accessible profiling in free software
| because there are companies selling it to us. Cool.
| brancz wrote:
| Parca's user-space code is apache2 and the eBPF code is GPL.
| tdullien wrote:
| Parca is open-source, Prodfiler's eBPF code is GPL, and the
| rest of Prodfiler is currently going through OTel donation,
| so my point is: There's now multiple FOSS implementations of
| a more generic and powerful technique.
| int_19h wrote:
| PolarSignals is specifically discussed in the linked threads,
| and they conclude that their approach is not good enough for
| perf reasons.
| tdullien wrote:
| Oh nice, I can't find that - can you post a link?
| javierhonduco wrote:
| Curious to hear more about this. Full disclosure: I designed
| and implemented .eh_frame unwinding when I worked at Polar
| Signals.
| int_19h wrote:
| I think I might have confused two unrelated posts. The one
| that references Polar Signals is this one:
|
 | https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/issues/...
|
| So not a perf issue there, but they don't think the
| workflow is suitable for whole-system profiling. Perf
| issues were in the context of `perf` using DWARF:
|
 | https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/issues/...
| weinzierl wrote:
| Also I've heard that the whole .eh_frame unwinding is more
| fragile than a simple frame pointer. I've seen enough broken
 | stack traces myself, but honestly I never tried whether
 | -fno-omit-frame-pointer would have helped.
| tdullien wrote:
| Yes and no. A simple frame pointer needs to be present in all
| libraries, and depending on build settings, this might not be
| the case. .eh_frame tends to be emitted almost everywhere...
|
| So it's both similarly fragile, but one is almost never
| disabled.
|
| The broader point is: For HLL runtimes you need to be able to
| switch between native and interpreted unwinds anyhow, so
| you'll always do some amount of lifting in eBPF land.
|
 | And yes, having frame pointers removes a _lot_ of
 | complexity, so it's a net very good thing. It's just that
 | the situation wasn't nearly as dire as described, because
 | people who care about profiling had built solutions.
| quotemstr wrote:
| Forget eBPF even -- why do the job of userspace in the
| kernel? Instead of unwinding via eBPF, we should ask
| userspace to unwind itself using a synchronous signal
| delivered to userspace whenever we've requested a stack
| sample.
| bregma wrote:
| Context switches are incredibly expensive. Given the
| sampling rate of eBPF profilers all the useful
| information would get lost in the context switch noise.
|
 | Things get even more complicated because context switches
 | can mean CPU migrations, making much of your data useless.
| quotemstr wrote:
| What makes you think doing unwinding in userspace would
| do any more context switches (by which I think you mean
| privilege level transitions) than we do today? See my
| other comment on the subject.
|
| > Things get even more complicated because context
| switches can mean CPU migrations, making many of your
| data useless.
|
 | No it doesn't. If a user space thread is blocked doing
 | kernel work, its stack isn't going to change, not even if
 | that thread ends up resuming on a different CPU.
| searealist wrote:
| I'm under the impression that eh_frame stack traces are much
| slower than frame pointer stack traces, which makes always-on
| profiling, such as seen in tcmalloc, impractical.
| felixge wrote:
| First of all, I think the .eh_frame unwinding y'all pioneered
| is great.
|
| But I think you're only thinking about CPU profiling at <= 100
| Hz / core. However, Brendan's article is also talking about
| Off-CPU profiling, and as far as I can tell, all known
| techniques (scheduler tracing, wall clock sampling) require
| stack unwinding to occur 1-3 orders of magnitude more often
| than for CPU profiling.
|
| For those use cases, I don't think .eh_frame unwinding will be
| good enough, at least not for continuous profiling. E.g. see
| [1][2] for an example of how frame pointer unwinding allowed
| the Go runtime to lower execution tracing overhead from 10-20%
 | to 1-2%, even though it was already using a relatively fast
 | lookup table approach.
|
| [1] https://go.dev/blog/execution-traces-2024
|
 | [2] https://blog.felixge.de/reducing-gos-execution-tracer-overhe...
| nemetroid wrote:
| If you're sufficiently in control of your deployment details to
| ensure that BPF is available at all. CAP_SYS_PTRACE is
| available ~everywhere for everyone.
| 5- wrote:
| so what is the downside to using e.g. dwarf-based stack walking
| (supported by perf) for libc, which was the original stated
| problem?
|
| in the discussion the issue gets conflated with jit-ted
| languages, but that has nothing to do with the crusade to enable
| frame pointer for system libraries.
|
| and if you care that much for dwarf overhead... just cache the
| unwind information in your system-level profiler? no need to
| rebuild everything.
| yxhuvud wrote:
| The article explains why DWARF is not an option.
| menaerus wrote:
| Extremely light on the details, and also conflates it with
| the JIT which makes it harder to understand the point, so I
| was wondering about the same thing as well.
| brancz wrote:
| The way perf does it is slow, as the entire stack is copied
| into user-space and is then asynchronously unwound.
|
 | This is solvable, as Brendan calls out: we've created an
 | eBPF-based profiler at Polar Signals that essentially does
 | what you said - it optimizes the unwind tables, caches them
 | in BPF maps, and then synchronously unwinds, as opposed to
 | copying the whole stack into user-space.
| stefan_ wrote:
| This conveniently sidesteps the whole issue of getting DWARF
| data in the first place, which is also still a broken
| disjointed mess on Linux. Hell, _Windows_ solved this many
| many years ago.
| bregma wrote:
| You'd need a pretty special distro to have enabled -fno-
| asynchronous-unwind-tables by default in its toolchain.
|
| By default on most Linux distros the frame tables are built
| into all the binaries, and end up in the GNU_EH_FRAME
| segment, which is always available in any running process.
| Doesn't sound a broken and disjointed mess to me. Sounds
| more like a smoothly running solved problem.
| Sesse__ wrote:
| It should also be said that you need some sort of DWARF-like
| information to understand inlining. If I have a function A
| that inlines B that in turn inlines C, I'd often like to
| understand that C takes a bunch of time, and with frame
| pointers only, that information gets lost.
| javierhonduco wrote:
| Inlined functions can be symbolized using DWARF line
| information[0] while unwinding requires DWARF unwind
| information (CFI), which the x86_64 ABI mandates in every
 | single ELF in the `.eh_frame` section.
|
| - [0] This line information might or might not be present
| in an executable but luckily there's debuginfod
| (https://sourceware.org/elfutils/Debuginfod.html)
| rwmj wrote:
| The downside is it doesn't work at all:
| https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwar...
| ReleaseCandidat wrote:
| That's one thing Apple did do right on ARM:
|
| > The frame pointer register (x29) must always address a valid
| frame record. Some functions -- such as leaf functions or tail
| calls -- may opt not to create an entry in this list. As a
| result, stack traces are always meaningful, even without debug
| information.
|
| https://developer.apple.com/documentation/xcode/writing-arm6...
| microtherion wrote:
| On Apple platforms, there is often an interpretability problem
| of another kind: Because of the prevalence of deeply nested
| blocks / closures, backtraces for Objective C / Swift apps are
| often spread across numerous threads. I don't know of a good
| solution for that yet.
| felixge wrote:
| I'm not very familiar with Objective C and Swift, so this
| might not make sense. But JS used to have a similar problem
| with async/await. The v8 engine solved it by walking the
| chain of JS promises to recover the "logical stack"
| developers are interested in [1].
|
| [1] https://v8.dev/blog/fast-async
| astrange wrote:
| Swift concurrency does a similar thing. For the older
| dispatch blocks, Xcode injects a library that records
| backtraces over thread hops.
| eqvinox wrote:
| This doesn't detract from the content at all but the register
| counts are off; SI and DI count as GPRs on i686 bringing it to
| 6+BP (not 4+BP) meanwhile x86_64 has 14+BP (not 16+BP).
| cesarb wrote:
| > [...] on i686 bringing it to 6+BP (not 4+BP) meanwhile x86_64
| has 14+BP (not 16+BP).
|
| That is, on i686 you have 7 GPRs without frame pointers, while
| on x86_64 you have 14 GPRs even with frame pointers.
|
| Copying a comment of mine from an older related discussion
| (https://news.ycombinator.com/item?id=38632848):
|
| "To emphasize this point: on 64-bit x86 with frame pointers,
| you have twice as many registers as on 32-bit x86 without frame
| pointers, and these registers are twice as wide. A 64-bit value
| (more common than you'd expect even when pointers are 32 bits)
| takes two registers on 32-bit x86, but only a single register
| on 64-bit x86."
| brendangregg wrote:
| Thanks!
| benreesman wrote:
| Brendan is such a treasure to the community (buy his book it's
| great).
|
| I wasn't doing extreme performance stuff when -fomit-frame-
| pointer became the norm, so maybe it was a big win for enough
| people to be a sane default, but even that seems dubious: "just
| works" profiling is how you figure out when you're in an extreme
| performance scenario (if you're an SG14 WG type, you know it and
| are used to all the defaults being wrong for you).
|
| I'm deeply grateful for all the legends who have worked on
| libunwind, gperf stuff, perftool, DTrace, eBPF: these are the
| too-often-unsung heroes of software that is still fast after
| decades of Moore's law free-riding.
|
| But they've been fighting an uphill battle against a weird
| alliance of people trying to game compiler benchmarks and the
| really irresponsible posture that "developer time is more
| expensive" which is only sometimes true and never true if you
| care about people on low-spec gear, the community of users that
| is already the least-resourced part of the global community.
|
| I'm fortunate enough to have a fairly modern desktop, laptop, and
| phone: for me it's merely annoying that chat applications and
| music players and windowing systems offer nothing new except
| enshittification in terms of features while needing 10-100x the
| resources they did a decade ago.
|
| But for half of my career and 2/3rds of my time coding, I was on
| low-spec gear most of the time, and I would have been largely
| excluded if people didn't care a lot about old computers back
| then.
|
| I'm trying to help a couple of aspiring hackers get started right
| now, and it's a real struggle to get their environments set up with
| limitations like Intel Macs and WSL2 as the Linux option (WSL2 is
| very cool but it's not loved up enough by e.g. yarn projects).
|
| If you want new hackers, you need to make things work well on
| older computers.
|
| Thanks again Brendan et al!
| sesm wrote:
| glibc is only 2 MB, so why does Chrome rely on the system glibc
| instead of statically linking its own version with frame pointers
| enabled?
| wruza wrote:
| https://stackoverflow.com/questions/57476533/why-is-statical...
|
| I guess it's a similar situation with msvcrt.
| nolist_policy wrote:
| At the very least Chrome needs to link to the system libGL.so
| and friends for gpu acceleration, libva.so for video
| acceleration, and so on. And these are linked against glibc of
| course.
| dooglius wrote:
| having/omitting frame pointers doesn't change the ABI; it
| will work if you compile against glibc-nofp and link against
| glibc-withfp
| dsign wrote:
| I remember when the omission of stack frame pointers started
| spreading at the beginning of the 2000s. I was in college at the
| time, studying computer sciences in a very poor third-world
| country. Our computers were old and far from powerful. So, for
| most course projects, we would eschew interpreters and use
| compilers. Mind you, what my college lacked in money it
| compensated for with interesting course work. We studied and
| implemented low level data-structures, compilers, assembly-code
| numerical routines and even a device driver for Minix.
|
| During my first two years in college, if one of our programs did
| something funny, I would attach gdb and see what was happening at
| assembly level. I got used to "walking the stack" manually,
| though the debugger often helped a lot. Happy times, until all of
| a sudden, "-fomit-frame-pointer" was all the rage, and stack
| traces stopped making sense. Just like that, debugging that
| segfault or illegal instruction became exponentially harder. A
| short time later, I started using Python for almost everything to
| avoid broken debugging sessions. So, I lost an order of magnitude
| or two with "-fomit-frame-pointer". But learning Python served me
| well for other adventures.
| rwmj wrote:
| I'm glad he mentioned Fedora because it's been a tiresome battle
| to keep frame pointers enabled in the whole distribution (eg
| https://pagure.io/fesco/issue/3084).
|
| There's a persistent myth that frame pointers have a huge
| overhead, because there was a single Python case that had a +10%
| slow down (now fixed). The actual measured overhead is under 1%,
| which is far outweighed by the benefits we've been able to make
| in certain applications.
| brendangregg wrote:
| Thanks; what was the Python fix?
| rwmj wrote:
| This was the investigation:
| https://discuss.python.org/t/python-3-11-performance-with-
| fr...
|
| Initially we just turned off frame pointers for the Python
| 3.9 interpreter in Fedora. They are back on in Python 3.12
| where it seems the upstream bug has been fixed, although I
| can't find the actual fix right now.
|
| Fedora tracking bug: https://bugzilla.redhat.com/2158729
|
| Fedora change in Python 3.9 to disable frame pointers: https:
| //src.fedoraproject.org/rpms/python3.9/c/9b71f8369141c...
| brendangregg wrote:
| Ah right, thanks, I remember I saw Andrii's analysis in the
| other thread.
| https://pagure.io/fesco/issue/2817#comment-826636
| menaerus wrote:
| I believe it's a misrepresentation to say that "actual measured
| overhead is under 1%". I don't think such a claim can be
| universally applied because this depends on the very workload
| you're measuring the overhead with.
|
| FWIW your results don't quite match the measurements from Linux
| kernel folks who claim that the overhead is anywhere between
| 5-10%. Source:
| https://lore.kernel.org/lkml/20170602104048.jkkzssljsompjdwy...
| I didn't preserve the data involved but in a variety of
| workloads including netperf, page allocator microbenchmark,
| pgbench and sqlite, enabling framepointer introduced overhead
| of around the 5-10% mark.
|
| The significance of their results, IMO, is that they
| measured the impact using PostgreSQL and SQLite. If
| anything, DBMS are one of the best ways to really stress out
| the system.
| brendangregg wrote:
| Those are microbenchmarks.
| menaerus wrote:
| pgbench is not a microbenchmark.
| brendangregg wrote:
| From the docs: "pgbench is a simple program for running
| benchmark tests on PostgreSQL. It runs the same sequence
| of SQL commands over and over"
|
| While it might call itself a benchmark, it behaves very
| microbenchmark-y.
|
| The other numbers I and others have shared have been from
| actual production workloads. Not a simple program that
| tests the same sequence of commands over and over.
| menaerus wrote:
| While pgbench might be "simple" program, as in a test-
| runner, workloads that are run by it are far from it. It
| runs TPC-B by default but can also run your own arbitrary
| script that defines whatever the workload is. It also
| allows running queries concurrently, so I fail to
| understand the reasoning behind calling it "simple" or
| "microbenchmark-y". That's far from the truth, I think.
| weebull wrote:
| Anything running a full database server is not micro.
| brendangregg wrote:
| If I call the same "get statistics" command over and over
| in a loop (with zero queries), or 100% the same invalid
| query (to test the error path performance), I believe
| we'd call that a micro-benchmark, despite involving a
| full database. It's a completely unrealistic artificial
| workload to test a particular type of operation.
|
| The pgbench docs make it sound microbenchmark-y by
| describing making the same call over and over. If people
| find that this simulates actual production workloads,
| then yes, it can be considered a macro-benchmark.
| anarazel wrote:
| There are loads of real-world workloads that have similar
| patterns to pgbench, particularly read only pgbench.
| babel_ wrote:
| Those are numbers from 7 years ago, so they're beginning to
| get a bit stale as people start to put more weight behind
| having frame pointers and make upstream contributions to
| their compilers to improve their output. People put it at <1%
| from much more recent testing by the very R.W.M. Jones you're
| replying to [0] and separate testing by others like Brendan
| Gregg [1b], whose post this is commenting on (and included
| [1b] in the Appendix as well), with similar accounts by
| others in the last couple years. Oh, and if you use
| flamegraph, you might want to check the repo for a familiar
| name.
|
| Some programs, like Python, have reported worse, 2-7% [2],
| but there is traction on tackling that [1a] (see both rwmj's
| and brendangregg's replies to sibling comments, they've both
| done a lot of upstreamed work wrt. frame pointers,
| performance, and profiling).
|
| As has been frequently pointed out, the benefits from
| improved profiling cannot be overstated; even a 10% cost to
| having frame pointers can be well worth it when you leverage
| that information to target the actual bottlenecks that are
| eating up your cycles. Plus, you can always disable it in
| specific hotspots later when needed, which is much easier
| than the reverse.
|
| Something, something, premature optimisation -- though in
| seriousness, this information benefits actual optimisation,
| exactly because we don't have the information and
| understanding that would allow truly universal claims,
| precisely because things like this haven't been available,
| and so haven't been widely used. We know frame pointers, from
| additional register pressure and extended function
| prologue/epilogue, can be a detriment in certain hotspots;
| that's why we have granular control. But without them, we
| often don't know which hotspots are actually affected, so I'm
| sure even the databases would benefit... though the "my
| database is the fastest database" problem has always been the
| result of endless micro-benchmarking, rather than actual end-
| to-end program performance and latency, so even a claimed
| "10%" drop there probably doesn't impact actual real-world
| usage, but that's a reason why some of the most interesting
| profiling work lately has been from ideas like causal
| profilers and continuous profilers, which answer exactly
| that.
|
| [0]: https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-
| dwar... [1a]:
| https://pagure.io/fesco/issue/2817#comment-826636 [1b]:
| https://pagure.io/fesco/issue/2817#comment-826805 [2]:
| https://discuss.python.org/t/the-performance-of-python-
| with-...
| adrian_b wrote:
| While improved profiling is useful, achieving it by wasting
| a register is annoying, because it is just a very dumb
| solution.
|
| The choice Intel made when designing the 8086 to
| use 2 separate registers for the stack pointer and for the
| frame pointer was a big mistake.
|
| It is very easy to use a single register as both the stack
| pointer and the frame pointer, as it is standard for
| instance in IBM POWER.
|
| Unfortunately in the Intel/AMD CPUs using a single register
| is difficult, because the simplest implementation is
| unreliable since interrupts may occur between 2
| instructions that must form an atomic sequence (and they
| may clobber the stack before new space is allocated after
| writing the old frame pointer value in the stack).
|
| It would have been very easy to correct this in new CPUs by
| detecting that instruction sequence and blocking the
| interrupts between them.
|
| Intel had already done this once early in the history of
| the x86 CPUs, when they discovered a mistake in the
| design of the ISA: interrupts could occur between
| updating the stack segment and the stack pointer. They
| corrected this by detecting such an instruction
| sequence and blocking the interrupts at the boundary
| between those instructions.
|
| The same could have been done now, to enable the use of the
| stack pointer as also the frame pointer. (This would be
| done by always saving the stack pointer in the top of the
| stack whenever stack space is allocated, so that the stack
| pointer always points to the previous frame pointer, i.e.
| to the start of the linked list containing all stack
| frames.)
| doctorpangloss wrote:
| > As has been frequently pointed out, the benefits from
| improved profiling cannot be overstated; even a 10% cost
| to having frame pointers can be well worth it when you
| leverage that information to target the actual bottlenecks
| that are eating up your cycles.
|
| Few can leverage that information because the open source
| software you are talking about lacks telemetry in the self
| hosted case.
|
| The profiling issue really comes down to the cultural
| opposition in these communities to collecting telemetry and
| opening it for anyone to see and use. The average user
| struggles to ally with a trustworthy actor who will share
| the information like profiling freely and anonymize it at a
| per-user level, the level that is actually useful. Such
| things exist, like the Linux hardware site, but only
| because they have not attracted the attention of agitators.
|
| Basically users are okay with profiling, so long as it is
| quietly done by Amazon or Microsoft or Google, and not by
| the guy actually writing the code and giving it out for
| everyone to use for free. It's one of the most moronic
| cultural trends, and blame can be put squarely on product
| growth grifters who equivocate telemetry with privacy
| violations; open source maintainers, who have enough
| responsibilities as is, besides educating their users; and
| Apple, who have made their essentially vaporous claims
| about privacy a central part of their brand.
|
| Of course people know the answer to your question. Why
| doesn't Google publish every profile of every piece of open
| source software? What exactly is sensitive about their
| workloads? Meta publishes a whole library about every
| single one of its customers, for anyone to freely read. I
| don't buy into the holiness of the backend developer's
| "cleverness" or whatever is deemed sensitive, and it's so
| hypocritical.
| yjftsjthsd-h wrote:
| > Basically users are okay with profiling, so long as it
| is quietly done by Amazon or Microsoft or Google, and not
| by the guy actually writing the code and giving it out
| for everyone to use for free.
|
| No; the groups are approximately "cares whether software
| respects the user, including privacy", or "doesn't know
| or doesn't care". I seriously doubt that any meaningful
| number of people are okay with companies invading their
| privacy but not smaller projects.
| babel_ wrote:
| I think the kind of profiling information you're
| imagining is a little different from what I am.
|
| Continuous profiling of your system that gets relayed to
| someone else by telemetry is very different from
| continuous profiling of your own system, handled only by
| yourself (or, generalising, your
| community/group/company). You seem to be imagining we're
| operating more in the former, whereas I am imagining more
| in the latter.
|
| When it's our own system, better instrumented for our own
| uses, and we're the only ones getting the information,
| then there's nothing to worry about, and we can get much
| more meaningful and informative profiling done when more
| information about the system is available. I don't even
| need telemetry. When it's "someone else's" system, in
| other words, when we have no say in telemetry (or have to
| exercise a right to opt-out, rather than a more self-
| executing contract around opt-in), then we start to have
| exactly the kinds of issues you're envisaging.
|
| When it's not completely out of our hands, then we need
| to recognise different users, different demands,
| different contexts. Catering to the user matters, and
| when it comes to sensitive information, well, people have
| different priorities and threat models.
|
| If I'm opening a calendar on my phone, I don't expect it
| to be heavily instrumented and relaying all of that, I
| just want to see my calendar. When I open a calendar on
| my phone, and it is unreasonably slow, then I might want
| to submit relevant telemetry back in some capacity.
| Meanwhile, if I'm running the calendar server, I'm
| absolutely wanting to have all my instrumentation
| available and recording every morsel I reasonably can
| about that server, otherwise improving it or fixing it
| becomes much harder.
|
| From the other side, if I'm running the server, I may
| _want_ telemetry from users, but if it's not essential,
| then I can "make do" with only the occasional opt-in
| telemetry. I also have other means of profiling real
| usage, not just scooping it all up from unknowing users
| (or begrudging users). Those often have some other
| "cost", but in turn, they don't have the "cost" of
| demanding it from users. For people to freely choose
| requires acknowledging the asymmetries present, and that
| means we can't just take the path of least resistance, as
| we may have to pay for it later.
|
| In short, it's a consent issue. Many violate that,
| knowingly, because they care not for the consequences.
| Many others don't even seem to think about it, and just
| go ahead regardless. And it's so much easier behind
| closed doors. Open source in comparison, even if not
| everything is public, must contend with the fact that the
| actions and consequences are (PRs, telemetry traffic,
| etc), so it inhabits a space in which violating consent
| is much more easily held accountable (though no
| guarantee).
|
| Of course, this does not mean it's always done properly
| in open source. It's often an uphill battle to get
| telemetry that's off-by-default, where users explicitly
| consent via opt-in, as people see how that could easily
| be undermined, or later invalidated. Many opt-in
| mechanisms (e.g. a toggle in the settings menu) often do
| not have expiration built in, so fail to check at a later
| point that someone still consents. Not to say that's the
| way you must do it, just giving an example of a way that
| people seem to be more in favour of, as with the
| generally favourable response to such features making
| their way into "permissions" on mobile.
|
| We can see how the suspicion creeps in, informed by
| experience... but that's also known by another word:
| vigilance.
|
| So, users are not "okay" with it. There's a power
| imbalance where these companies are afforded the impunity
| because many are left to conclude they have no choice but
| to let them get away with it. That hasn't formed in a
| vacuum, and it's not so simple that we just pull back the
| curtain and reveal the wizard for what he is. Most seem
| to already know.
|
| It's proven extremely difficult to push alternatives. One
| reason is that information is frequently not ready-to-
| hand for more typical users, but another is that said
| alternatives may not actually fulfil the needs of some
| users: notably, accessibility remains hugely inconsistent
| in open source, and is usually not funded on par with,
| say, projects that affect "backend" performance.
|
| The result? Many people just give their grandma an
| iPhone. That's what's telling about the state of open
| source, and of the actual cultural trends that made it
| this way. The threat model is fraudsters and scammers,
| not nation-state actors or corporate malfeasance. This
| app has tons of profiling and privacy issues? So what? At
| least grandma can use it, and we can stay in contact,
| dealing with the very real cultural trends towards
| isolation. On a certain level, it's just pragmatic.
| They'd choose differently if they could, but they don't
| feel like they can, and they've got bigger worries.
|
| Unless we _do_ different, the status quo will remain. If
| there's any agitation to be had, it's in getting more
| people to care about improving things and then actually
| doing them, even if it's just taking small steps. There
| won't be a perfect solution that appears out of nowhere
| tomorrow, but we only have a low bar to clear. Besides,
| we've all thought "I could do better than that", so why
| not? Why not just aim for better?
|
| Who knows, we might actually achieve it.
| matheusmoreira wrote:
| "Agitators". We don't trust telemetry precisely because
| of comments like that. World is full of people like you
| who apparently see absolutely nothing wrong with
| exfiltrating identifying information from other people's
| computers. We have to actively resist such attempts, they
| are constant, never ending and it only seems to get worse
| over time but you dismiss it all as "cultural opposition"
| to telemetry.
|
| For the record I'm NOT OK with being profiled, measured
| or otherwise studied in any way without my explicit
| consent. That even extends to the unethical human
| experiments that corporations run on people and which
| they euphemistically call A/B tests. I don't care if it's
| Google or a hobbyist developer, I will block it if I can
| and I will not lose a second of sleep over it.
| rstuart4133 wrote:
| > World is full of people like you who apparently see
| absolutely nothing wrong with exfiltrating identifying
| information from other people's computers.
|
| True. But such people are like cockroaches. They know
| what they are doing will be unpopular with their targets,
| so they keep it hidden. This is easy enough to do in
| closed designs, car manufacturers selling your driving
| habits to insurance companies and health monitoring app
| selling menstrual cycle data to retailers selling to
| women.
|
| Compare that to, say, Debian and RedHat. They too collect
| performance data. But the code is open source, Debian has
| repeatable builds so you can be 100% sure that is the code
| in use, and every so often someone takes a look at it.
| Guess what, the data they send back is so unidentifiable
| it satisfies even the most paranoid of their 1000's of
| members.
|
| All it takes is a little bit of sunlight to keep the
| cockroaches at bay, and then we can safely let the devs
| collect the data they need to improve code. And everyone
| benefits.
| barrkel wrote:
| This isn't an argument for a default.
| menaerus wrote:
| I was not even trying to make one. I was questioning the
| validity of "1% overhead" claim by providing the counter-
| example from respectable source.
| edwintorok wrote:
| You probably already know, but with OCaml 5 the only way to get
| flamegraphs working is to either:
|
| * use framepointers [1]
|
| * use LBR (but LBR has a limited depth, and may not work on
| all CPUs, I'm assuming due to bugs in perf)
|
| * implement some deep changes in how perf works to handle the 2
| stacks in OCaml (I don't even know if this would be possible),
| or write/adapt some eBPF code to do it
|
| OCaml 5 has a separate stack for OCaml code and C code, and
| although GDB can link them based on DWARF info, perf DWARF
| call-graphs cannot (https://github.com/ocaml/ocaml/issues/12563
| #issuecomment-193...)
|
| If you need more evidence to keep it enabled in future
| releases, you can use OCaml 5 as an example (unfortunately
| there aren't many OCaml applications, so that may not carry too
| much weight on its own).
|
| [1]: I haven't actually realised that Fedora39 has already
| enabled FP by default, nice! (I still do most of my day-to-day
| profiling on an ~CentOS 7 system with 'perf record --call-graph
| dwarf -F 47 -a', I was aware that there was a discussion to
| enable FP by default, but haven't noticed it has actually been
| done already)
| namibj wrote:
| No, LBR is an Intel-only feature.
| awaythrow999 wrote:
| Frame pointers are still a no-go on 32-bit, so anything that
| is IoT today.
|
| The reason we removed them was not a myth but comes from the
| pre-64 bit days. Not that long ago actually.
|
| Even today, if you want to repurpose older 64-bit systems and
| give them a new life, then this kind of optimization still
| makes sense.
|
| Ideally it should be the default also for security critical
| systems because not everything needs to be optimized for
| "observability"
| Narishma wrote:
| > Frame pointers are still a no-go on 32bit so anything that
| is IoT today.
|
| Isn't that just 32-bit x86, which isn't used in IoT? The
| other 32-bit ISAs aren't register-starved like x86.
| weebull wrote:
| It would be, yes. x86 had very few registers, so anything
| you could do to free them up was vital. Arm 32bit has 32
| general purpose registers I think, and RISC V certainly
| does. In fact there's no difference between 32 and 64 bit
| in that respect. If anything, 64-bit frame pointers make it
| marginally worse.
| CountSessine wrote:
| Sadly, no. 32-bit ARM only has 16 GPR's (two of which are
| zero and link), mostly because of the stupid predication
| bits in the instruction encoding.
|
| That said, I don't know how valuable getting rid of FP on
| ARM is - I once benchmarked ffmpeg on 32-bit x86 before
| and after enabling FP and PIC (basically removing 2 GPRs)
| and the difference was huge (>10%) but that's an extreme
| example.
| fanf2 wrote:
| Arm32 doesn't have a zero-value register. Its non-
| general-purpose registers are PC, LR, SP, FP - tho the
| link register can be used for temporary values.
| zzbn00 wrote:
| Nix (and I assume Guix) are very convenient for this as it is
| fairly easy to turn frame pointers on or off for parts or whole
| of the system.
| tkiolp4 wrote:
| Are his books (the one about Systems Performance and eBPF)
| relevant for normal software engineers who want to improve
| performance in normal services? I don't work for faang, and our
| usual performance issues are solved by adding indexes here and
| there, caching, and simple code analysis. Tools like Datadog help
| a lot already.
| wavemode wrote:
| Diving into flame graphs being worthwhile for optimization,
| assumes that your workload is CPU-bound. Most business software
| does not have such workloads, and rather (as you yourself have
| noted) spend most of their time waiting for I/O (database,
| network, filesystem, etc).
|
| And so, (as you again have noted), your best bet is to just use
| plain old logging and tracing (like what datadog provides) to
| find out where the waiting is happening.
| polio wrote:
| Profiling is a pretty basic technique that is applicable to all
| software engineering. I'm not sure what a "normal" service is
| here, but I think we all have an obligation to understand
| what's happening in the systems we own.
|
| Some people may believe that 100ms latency is acceptable for a
| CLI tool, but what if it could be 3ms? On some aesthetic level,
| it also feels good to be able to eliminate excess. Finally, you
| should learn it because you won't necessarily have that job
| forever.
| mgaunard wrote:
| You don't need frame pointers, all the relevant info is stored in
| dwarf debug data.
| DaveFlater wrote:
| GCC optimization causes the frame pointer push to move around,
| resulting in wrong call stacks. "Wontfix"
|
| https://news.ycombinator.com/item?id=38896343
| rwmj wrote:
| That was in 2012. Does it still occur on modern GCC?
|
| There definitely have been regressions with frame pointers
| being enabled, although we've fixed all the ones we've found in
| current (2024) Fedora.
| jart wrote:
| I think so and I vaguely seem to recall -fno-schedule-insns2
| being the only thing that fixes it. To get the full power of
| frame pointers and hackable binary, what I use is:
|         -fno-schedule-insns2
|         -fno-omit-frame-pointer
|         -fno-optimize-sibling-calls
|         -mno-omit-leaf-frame-pointer
|         -fpatchable-function-entry=18,16
|         -fno-inline-functions-called-once
|
| The only flag that's potentially problematic is -fno-
| optimize-sibling-calls since it breaks the optimal approach
| to writing interpreters and slows down code that's written in
| a more mathematical style.
| ndesaulniers wrote:
| Pretty sure unwinding thumb generated by GCC is still non-
| unwindable via FPs. That's been a pain.
| tzot wrote:
| I am not sure, but I believe -fomit-frame-pointer in x86-64
| allows the compiler to use a _thirteenth_ register, not a
| _seventeenth_ .
| cesarb wrote:
| I disagree with this sentence of the article:
|
| "I could say that times have changed and now the original 2004
| reasons for omitting frame pointers are no longer valid in 2024."
|
| The original 2004 reason for omitting frame pointers is still
| valid in 2024: it's still a big performance win on the register-
| starved 32-bit x86 architecture. What has changed is that the
| 32-bit x86 architecture is much less relevant nowadays (other
| than legacy software, for most people it's only used for a small
| instant while starting up the firmware), and other common 32-bit
| architectures (like embedded 32-bit ARM) are not as register-
| starved as the 32-bit x86.
| IshKebab wrote:
| That's exactly what they were saying. You're not disagreeing at
| all.
| shaggie76 wrote:
| I thought we'd been using /Oy (Frame-Pointer Omission) for years
| on Windows and that there was a pdata section on x64 that was
| used for stack-walking however to my great surprise I just read
| on MSDN that "In x64 compilers, /Oy and /Oy- are not available."
|
| Does this mean Microsoft decided they weren't going to support
| breaking profilers and debuggers OR is there some magic in the
| pdata section that makes it work even if you omit the frame-
| pointer?
| MarkSweep wrote:
| Some Google found this:
| https://devblogs.microsoft.com/oldnewthing/20130906-00/?p=33...
|
| "Recovering a broken stack on x64 machines on Windows is
| trickier because the x64 uses unwind codes for stack walking
| rather than a frame pointer chain."
|
| More details are here: https://learn.microsoft.com/en-
| us/cpp/build/exception-handli...
| quotemstr wrote:
| Microsoft has had excellent universal unwinding support for
| decades now. I'm disappointed to see someone as prominent as
| this article's author present as infeasible what Microsoft has
| had working for so long.
| musjleman wrote:
| > In x64 compilers
|
| The default is omission. If you have a Windows machine, in all
| likelihood almost no 64 bit code running on it has frame
| pointers.
|
| > OR is there some magic in the pdata section that makes it
| work even if you omit the frame-pointer
|
| You haven't ever needed frame pointers to unwind using ...
| unwind information. The same thing exists for linux as
| `.eh_frame` section.
| javierhonduco wrote:
| Overall, I am for frame pointers, but after some years working in
| this space, I thought I would share some thoughts:
|
| * Many frame pointer unwinders don't account for a problem they
| have that DWARF unwind info doesn't have: the fact that the frame
| set-up is not atomic: it's done in two instructions, `push %rbp`
| and `mov %rsp, %rbp`, and if a snapshot is taken while we are on
| the `push`, we'll miss the parent frame. This might be fixable
| by inspecting the code, but probably only as a heuristic, as
| there could be other `push %rbp` instructions unrelated to the
| stack frame. I would love to hear if there's a better approach!
|
| * I developed the solution Brendan mentions which allows faster,
| in-kernel unwinding without frame pointers using BPF [0]. This
| doesn't use DWARF CFI (the unwind info) as-is but converts it
| into a random-access format that we can use in BPF. He mentions
| not supporting JVM languages, and while it's true that right now
| it only supports JIT sections that have frame pointers, I planned
| to implement a full JVM interpreter unwinder. I have since left
| Polar Signals and shifted priorities, but it's feasible to get a
| JVM unwinder to work in lockstep with the native unwinder.
|
| * In an ideal world, enabling frame pointers should be decided
| on a case-by-case basis. Benchmarking is key, and the tradeoffs
| you make might change a lot depending on the industry you are in
| and what your software is doing. In the past I have seen large
| projects enable or disable frame pointers without an in-depth
| assessment of the losses/gains in performance and observability,
| and how they connect to business metrics. The Fedora folks have
| done a superb and rigorous job here.
|
| * Related to the previous point, having a build system that
| enables you to change this system-wide, including the libraries
| your software depends on, can be great not only to test these
| changes but also to put them in production.
|
| * Lastly, I am quite excited about SFrame that Indu is working
| on. It's going to solve a lot of the problems we are facing right
| now while letting users decide whether they use frame pointers. I
| can't wait for it, but I am afraid it might take several years
| until all the infrastructure is in place and everybody upgrades
| to it.
|
| - [0]:
| https://web.archive.org/web/20231222054207/https://www.polar...
| rwmj wrote:
| On the third point, you have to do frame pointers across the
| whole Linux distro in order to be able to get good flamegraphs.
| You have to do whole system analysis to really understand
| what's going on. The way that current binary Linux distros
| (like Fedora and Debian) work makes any alternative
| impossible.
| felixge wrote:
| Great comments, thanks for sharing. The non-atomic frame setup
| is indeed problematic for CPU profilers, but it's not an issue
| for allocation profiling, off-CPU profiling or other types of
| non-interrupt driven profiling. But as you mentioned, there
| might be ways to solve that problem.
| brancz wrote:
| Great comment! Just want to add we are making good progress on
| the JVM unwinder!
| secondcoming wrote:
| Can they not be disabled on a per-function basis?
| weebull wrote:
| Just as a general comment on this topic...
|
| The fact that people complain about the performance of the
| mechanism that enables the system to be profiled, and so
| performance problems be identified, is beyond ironic. Surely the
| epitome of premature optimisation.
| doubloon wrote:
| I'm sure in ancient Mesopotamia there was somebody arguing that
| you could brew beer faster if you stopped measuring the hops so
| carefully, but then someone else saying yes, but if you don't
| measure the hops carefully then you don't know the efficiency of
| your overall beer-making process, so you can't isolate the
| bottlenecks.
|
| The funny thing is, I'm not sure the world would actually work
| properly if we didn't have both of these kinds of people.
| AtlasBarfed wrote:
| So what are these other techniques the 2004 migration away from
| frame pointers assumed would work for stack walking? Why don't
| they work today? I get that x86_64 has a lot more registers, so
| there's minimal value in freeing up one more?
| loeg wrote:
| In 2004, the assumption made by the GCC developers was that
| you would be walking stacks very infrequently, in a debugger
| like GDB. Not sampling stacks 1000s of times a second for
| profiling.
| Cold_Miserable wrote:
| Not interesting. Enter/leave also does the same thing as your
| save/restore rbp.
|
| Far more interesting I recall there might be an instruction where
| rbp isn't allowed.
| titzer wrote:
| Virgil doesn't use frame pointers. If you don't have dynamic
| stack allocation, the frame of a given function has a fixed size
| and can be found with a simple (binary-search) table lookup.
| Virgil's technique uses an additional page-indexed range that
| further restricts the lookup to a few comparisons on average
| (O(log(# retpoints per page))). It combines the unwind info with
| stackmaps for GC. It takes very little space.
|
| The main driver is in
| (https://github.com/titzer/virgil/blob/master/rt/native/Nativ...
| the rest of the code in the directory implements the decoding of
| metadata.
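[Editor's note: the fixed-frame-size lookup described above can be sketched generically. This is not Virgil's actual encoding; the table contents are illustrative, and a real implementation compresses the table and adds the page-indexed range mentioned above to narrow the search.]

```python
import bisect

# A table sorted by return-point address, binary-searched to find
# the frame size of the function containing a given return address.

# (return_point_address, frame_size_in_slots) - illustrative values
UNWIND_TABLE = [(0x1000, 2), (0x1040, 5), (0x10a0, 3), (0x1100, 8)]
ADDRS = [addr for addr, _ in UNWIND_TABLE]

def frame_size(return_addr):
    """The entry with the largest address <= return_addr wins."""
    i = bisect.bisect_right(ADDRS, return_addr) - 1
    if i < 0:
        raise KeyError(hex(return_addr))
    return UNWIND_TABLE[i][1]

assert frame_size(0x1045) == 5   # falls in the 0x1040 function
assert frame_size(0x10a0) == 3   # exact return point
```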
|
| I think frame pointers only make sense if frames are dynamically-
| sized (i.e. have stack allocation of data). Otherwise it seems
| weird to me that a dynamic mechanism is used when a static
| mechanism would suffice; mostly because no one agreed on an ABI
| for the metadata encoding, or an unwind routine.
|
| I believe the 1-2% measurement number. That's in the same
| ballpark as pervasive array bounds checks. It's weird that the
| odd debugging and profiling task gets special pleading for a 1%
| cost but adding a layer of security gets the finger. Very
| bizarre priorities.
| codeflo wrote:
| All of this information is static, there's no need to sacrifice a
| whole CPU register only to store data that's already known. A
| simple lookup data structure that maps an instruction address
| range to the stack offset of the return address should be enough
| to recover the stack layout. On Windows, you'd precompute that
| from PDB files, I'm sure you can do the same thing with whatever
| the equivalent debug data structure is on Linux.
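[Editor's note: a toy sketch of the lookup-based walk described above. Real unwind info (DWARF CFI, Windows pdata/xdata) is pc-dependent within a function; here each code range gets a single fixed offset to keep the sketch short, and all addresses are made up.]

```python
# Walk a stack with no frame pointers, using only a static map from
# code range to the offset of the return address from the stack
# pointer (measured in machine-word slots).

# (code_start, code_end, retaddr_offset_in_slots) - made-up layout
RANGES = [
    (0x400000, 0x400100, 3),
    (0x400100, 0x400180, 1),
    (0x400180, 0x400200, 5),
]

def retaddr_offset(pc):
    for start, end, off in RANGES:
        if start <= pc < end:
            return off
    return None  # pc not in known code

def unwind(pc, sp, stack):
    trace = [pc]
    while True:
        off = retaddr_offset(pc)
        if off is None:
            break
        retaddr = stack[sp + off]
        if retaddr_offset(retaddr) is None:
            break  # walked off the known call chain
        trace.append(retaddr)
        sp += off + 1  # pop the frame, including the retaddr slot
        pc = retaddr
    return trace

stack = [0] * 12
stack[1] = 0x400050   # frame at sp=0 (offset 1) returns into range 0
stack[5] = 0x400190   # next frame (offset 3) returns into range 2
assert unwind(0x400110, 0, stack) == [0x400110, 0x400050, 0x400190]
```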
| fsmv wrote:
| [deleted]
| loeg wrote:
| It isn't entirely static because of alloca().
| mikewarot wrote:
| I started programming in 1979, and I can't believe I've managed
| to avoid learning about stack frames and all those EBP register
| tricks until now. I always had parameters to functions in
| registers, not on the stack, for the most part. The compiler hid
| a lot of things from me.
|
| Is it because I avoided Linux and C most of my life? Perhaps it's
| because I used debug, and Periscope before that... and never gdb?
| boulos wrote:
| JIT'ed code is sadly poorly supported, but LLVM has had great
| hooks for noting each method that is produced and its address. So
| you can build a simple mixed-mode unwinder, pretty easily, but
| mostly in process.
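[Editor's note: one practical way to get symbolized JIT frames out of Linux `perf`, complementing the in-process unwinder mentioned above, is the perf map file convention: `perf` symbolizes anonymous executable pages from `/tmp/perf-<pid>.map`, one `START SIZE name` line per JIT'd method, addresses in hex. A JIT could append a line per emitted method, e.g. from an LLVM event-listener hook. The helper names below are illustrative.]

```python
import os

def perf_map_line(start, size, name):
    """Format one perf map entry: hex start, hex size, symbol name."""
    return f"{start:x} {size:x} {name}\n"

def append_perf_map(entries, pid=None):
    """Append (start_addr, size, symbol_name) entries to the map file
    that perf looks for when it sees samples in anonymous pages."""
    path = f"/tmp/perf-{pid or os.getpid()}.map"
    with open(path, "a") as f:
        for start, size, name in entries:
            f.write(perf_map_line(start, size, name))
    return path

assert perf_map_line(0x7f2a00001000, 0x80, "jit::hot_loop") == \
    "7f2a00001000 80 jit::hot_loop\n"
```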
|
| I think Intel's DNN things dump their info out to some common
| file that perf can read instead, but because the *kernels*
| themselves reuse rbp throughout oneDNN, it's totally useless.
|
| Finally, can any JVM folks explain this claim about DWARF info
| from the article:
|
| > Doesn't exist for JIT'd runtimes like the Java JVM
|
| that just sounds surprising to me. Is it off by default or
| literally not available? (Google searches have mostly pointed to
| people wanting to include the JNI/C side of a JVM stack, like
| https://github.com/async-profiler/async-profiler/issues/215).
| olliej wrote:
| Have compilers (or I guess x86?) gotten better at dealing with
| the frame pointer? Or are we just saying that taking a
| significant perf hit is acceptable if it lets you find other tiny
| perf problems? Because I recall -fomit-frame-pointer being a
| significant performance win, bigger than most of the things that
| you need a perfect profiler to spot.
| vlovich123 wrote:
| To this day I still believe that there should be a dedicated
| protected separate stack region for the call stack that only the
| CPU can write to/read from. Walking the stack then becomes
| trivially fast because you just need to do a very small memcpy.
| And stack memory overflows can never overwrite the return
| address.
| ndesaulniers wrote:
| This is a thing; it's called shadow call stack. Both ARM and
| now Intel have extensions for it.
| vlovich123 wrote:
| But the shadow stack concept seems much dumber to me. Why
| write the address to the regular stack and the shadow stack
| and then compare? Why not use only the shadow stack and not
| put return addresses on the main stack at all?
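[Editor's note: a toy model of the separate-return-stack idea in this subthread. Real hardware schemes (Intel CET shadow stacks, ARM's pointer authentication / guarded stacks) differ in detail; one reason CET keeps the return address on the normal stack too is compatibility with existing ABIs and code that reads its own return address.]

```python
# If return addresses live only on a protected shadow stack, an
# out-of-bounds write on the data stack can't redirect control flow,
# and "walking the stack" is just reading the shadow stack.

class Machine:
    def __init__(self):
        self.data_stack = [0] * 16   # locals and arguments only
        self.shadow_stack = []       # return addresses, CPU-managed

    def call(self, return_addr):
        self.shadow_stack.append(return_addr)

    def ret(self):
        return self.shadow_stack.pop()

    def overflow_buffer(self, start, values):
        # Simulate an out-of-bounds write on the data stack.
        for i, v in enumerate(values):
            self.data_stack[start + i] = v

m = Machine()
m.call(0x401234)
m.overflow_buffer(0, [0xDEADBEEF] * 16)  # smash the whole data stack
assert m.ret() == 0x401234               # return address untouched
assert m.shadow_stack == []              # and the walk is trivial
```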
| BinaryRage wrote:
| I remember talking to Brendan about the PreserveFramePointer
| patch during my first months at Netflix in 2015. As of JDK 21,
| unfortunately it is no longer a general purpose solution for the
| JVM, because it prevents a fast path being taken for stack
| thawing for virtual threads:
| https://github.com/openjdk/jdk/blob/d32ce65781c1d7815a69ceac...
___________________________________________________________________
(page generated 2024-03-17 23:01 UTC)