[HN Gopher] The return of the frame pointers
       ___________________________________________________________________
        
       The return of the frame pointers
        
       Author : mfiguiere
       Score  : 517 points
       Date   : 2024-03-17 03:59 UTC (19 hours ago)
        
 (HTM) web link (www.brendangregg.com)
 (TXT) w3m dump (www.brendangregg.com)
        
       | adsharma wrote:
       | I was at Google in 2005 on the other side of the argument. My
       | view back then was simple:
       | 
       | Even if $BIG_COMPANY makes a decision to compile everything with
       | frame pointers, the rest of the community is not. So we'll be
       | stuck fighting an unwinnable argument with a much larger
       | community. Turns out that it was a ~20 year argument.
       | 
       | I ended up writing some patches to make libunwind work for
       | gperftools and maintained libunwind for some number of years as a
       | consequence of that work.
       | 
       | Having moved on to other areas of computing, I'm now a passive
       | observer. But it's fascinating to read history from the other
       | perspective.
        
         | starspangled wrote:
         | > So we'll be stuck fighting an unwinnable argument with a much
         | larger community.
         | 
         | In what way would you be stuck? What functional problems does
         | _adding_ frame pointers introduce?
        
           | tempay wrote:
            | It "wastes" a register when you're not actively using it.
            | On x86 that can make a big difference, though with the
            | added registers of x86_64 it's much less significant.
        
             | nlewycky wrote:
             | It caused a problem when building inline assembly heavy
             | code that tried to use all the registers, frame pointer
             | register included.
        
             | starspangled wrote:
             | Right, but I was asking about functional problems (being
             | "stuck"), which sounded like a big issue for the choice.
        
             | charleshn wrote:
             | It's not just the loss of an architectural register, it's
             | also the added cost to the prologue/epilogue. Even on
             | x86_64, it can make a difference, in particular for small
             | functions, which might not be inlined for a variety of
             | reasons.
        
               | Asooka wrote:
               | If your small function is not getting inlined, you should
               | investigate why that is instead of globally breaking
               | performance analysis of your code.
        
               | Sesse__ wrote:
               | A typical case would be C++ virtual member functions.
               | (They can sometimes be devirtualized, or speculatively
               | partially devirtualized, using LTO+PGO, but there are
               | lots of legitimate cases where they cannot.)
        
               | kaba0 wrote:
                | CPUs spend an enormous amount of time waiting for IO
                | and memory, and push/pop and similar instructions are
                | insanely well optimized. As the article also claims, I
                | would be very surprised to see any effect, unless the
                | extra instructions themselves spill the I-cache.
        
               | charleshn wrote:
               | I've seen around 1-3% on non micro benchmarks, real
               | applications.
               | 
                | See also this benchmark from Phoronix [0]:
               | Of the 100 tests carried out for this article, when
               | taking the geometric mean of all these benchmarks it
               | equated to about a 14% performance penalty of the
               | software with -O2 compared to when adding -fno-omit-
               | frame-pointer.
               | 
               | I'm not saying these benchmarks or the workloads I've
               | seen are representative of the "real world", but people
               | keep repeating that frame pointers are basically free,
               | which is just not the case.
               | 
               | [0] https://www.phoronix.com/review/fedora-frame-pointer
        
             | inkyoto wrote:
              | Wasting a register is not a concern on comparatively
              | modern ISAs (PA-RISC 2.0, MIPS64, POWER, aarch64, etc.),
              | which all have an abundance of general purpose registers.
             | 
              | The actual <<wastage>> is in having to generate a
              | prologue and an epilogue for each function - 2x
              | instructions to preserve the old frame pointer and set a
              | new one up, and 2x instructions at the point of return to
              | restore the previous frame pointer.
              | 
              | Generally, it is not a big deal, with the exception of
              | the pathological case of a very large number of very
              | small functions calling each other frequently, where the
              | extra 4x instructions per function will fill up the L1
              | instruction cache <<unnecessarily>>.
        
               | weebull wrote:
               | Those pathological cases are really what inlining is for,
               | with the exception of any tiny recursive functions that
               | can't be tail call optimised.
        
           | adsharma wrote:
           | I wasn't talking about functional problems. It was a simple
           | observation that big companies were not going to convince
           | Linux distributors to add frame pointers anytime soon and
           | that what those distributors do is relevant.
           | 
            | All of the companies involved believed that they were
            | special and decided to build their own (poorly managed)
            | distribution called "third party code"; having to deal with
            | it was not my best experience working at these companies.
        
             | starspangled wrote:
             | Oh, I just assumed you were talking about Google's Linux
             | distribution and applications it runs on its fleet. I must
             | have mis-assumed. Re-reading... maybe you weren't talking
             | about any builds but just whether or not to oppose kernel
             | and toolchain defaulting to omit frame pointers?
        
               | adsharma wrote:
               | Google didn't have a Linux distribution for a long time
               | (the one everyone used on the desktop was an outdated rpm
               | based distro, we mostly ignored it for development
               | purposes).
               | 
               | What existed was a x86 to x86 cross compilation
               | environment and the libraries involved were manually
               | imported by developers who needed that particular
               | library.
               | 
               | My argument was about the cost of ensuring that those
               | libraries were compiled with frame pointers when much of
               | the open source community was defaulting to omit-fp.
        
               | dooglius wrote:
                | Would it not be easier to patch compilers to always
                | assume the equivalent of -fno-omit-frame-pointer?
        
               | adsharma wrote:
               | That was done in 2005. But the task of auditing the
               | supply chain to ensure that every single shared library
               | you ever linked with was compiled a certain way was still
               | hard. Nothing prevented an intern or a new employee from
               | checking in a library without frame pointers into the
               | third-party repo.
               | 
               | In 2024, you'd probably create a "build container" that
               | all developers are required to use to build binaries or
               | pay a linux distributor to build that container.
               | 
               | But cross compilation was the preferred approach back
               | then. So all binaries had a rpath (run time search path
               | to look for shared library) that ignored the distributor
               | supplied libraries.
               | 
                | Having come from an open source background, I found this
               | system hard to digest. But there was a lot of social
               | pressure to work as a bee in a system that thousands of
               | other very competent engineers are using (quite
               | successfully).
               | 
               | I remember briefly talking to a chrome OS related group
               | who were using the "build your own custom distro"
               | approach, before deciding to move to another faang.
        
               | cruffle_duffle wrote:
               | > or pay a linux distributor to build that container.
               | 
               | What does this mean?
        
               | adsharma wrote:
               | I didn't mean anything nefarious here :)
               | 
               | Since Google would rather have the best brains in the
               | industry build the next search indexing algorithm or the
               | browser, they didn't have the time to invest human
               | capital into building a better open source friendly dev
               | environment.
               | 
               | A natural alternative is to contract out the work. Linux
               | distributors were good candidates for such contract work.
               | 
               | But the vibe back then was Google could build better
               | alternatives to some of these libraries and therefore
               | bridging the gap between dev experience as an open source
               | dev vs in house software engineer wasn't important.
               | 
               | You could see the same argument play out in git vs
               | monorepo etc, where people take up strong positions on a
               | particular narrow tech topic, whereas the larger issue
               | gets ignored as a result of these belief systems.
        
           | rwmj wrote:
           | You do get occasional regressions. eg. We found an extremely
           | obscure bug involving enabling frame pointers, valgrind,
           | glibc ifuncs and inlining (all at the same time):
           | 
           | https://bugzilla.redhat.com/show_bug.cgi?id=2267598
           | https://github.com/tukaani-
           | project/xz/commit/82ecc538193b380...
        
         | brcmthrowaway wrote:
         | What area?
        
       | pajko wrote:
       | There's another option: https://lesenechal.fr/en/linux/unwinding-
       | the-stack-the-hard-...
        
         | loeg wrote:
         | Brendan mentions DWARF unwinding, actually, and briefly
         | mentions why he considers it insufficient.
        
           | haberman wrote:
           | The biggest objection seems to be the Java/JIT case. eh_frame
           | supports a "personality function" which is AIUI basically a
           | callback for performing custom unwinding. If the personality
           | function could also support custom logic for producing
           | backtraces, then the profiling sampler could effectively read
           | the JVM's own metadata about the JIT'ted code, which I assume
           | it must have in order to produce backtraces for the JVM
           | itself.
        
             | loeg wrote:
             | This also seems like a big objection:
             | 
             | > The overhead to walk DWARF is also too high, as it was
             | designed for non-realtime use.
        
               | kouteiheika wrote:
               | Not a problem in practice. The way you solve it is to
               | just translate DWARF into a simpler representation that
               | doesn't require you to walk anything. (But I understand
               | why people don't want to do it. DWARF is insanely complex
               | and annoying to deal with.)
               | 
               | Source: I wrote multiple profilers.
        
               | loeg wrote:
               | In this thread[1] we're discussing problems with using
               | DWARF directly for unwinding, not possible translations
               | of the metadata into other formats (like ORC or
               | whatever).
               | 
               | [1]: https://news.ycombinator.com/item?id=39732010
        
               | kouteiheika wrote:
               | I wasn't talking about other formats. I was talking about
               | preloading the information contained in DWARF into a more
               | efficient in-memory representation once when your
               | profiler starts, and then the problem of "the overhead is
               | too high for realtime use" disappears.
        
               | menaerus wrote:
                | From https://fzn.fr/projects/frdwarf/frdwarf-oopsla19.pdf:
                | 
                |     DWARF-based unwinding can be a bottleneck for
                |     time-sensitive program analysis tools. For instance
                |     the perf profiler is forced to copy the whole stack
                |     on taking each sample and to build the backtraces
                |     offline: this solution has a memory and time
                |     overhead but also serious confidentiality and
                |     security flaws.
                | 
                | So if I get this correctly, the problem with DWARF is
                | that building the backtrace online (on each sample) is,
                | in comparison to frame pointers, an expensive operation
                | which, however, can be mitigated by building the
                | backtrace offline at the expense of copying the stack.
                | 
                | However, the paper also mentions:
                | 
                |     Similarly, the Linux kernel by default relies on a
                |     frame pointer to provide reliable backtraces. This
                |     incurs in a space and time overhead; for instance it
                |     has been reported (https://lwn.net/Articles/727553/)
                |     that the kernel's .text size increases by about
                |     3.2%, resulting in a broad kernel-wide slowdown.
                | 
                | and:
                | 
                |     Measurements have shown a slowdown of 5-10% for some
                |     workloads (https://lore.kernel.org/lkml/20170602104048.jkkzssljsompjdwy@suse.de/T/#u).
        
               | haberman wrote:
               | But that one has at least some potential mitigation. Per
               | his analysis, the Java/JIT case is the only one that has
               | no mitigation:
               | 
               | > Javier Honduvilla Coto (Polar Signals) did some
               | interesting work using an eBPF walker to reduce the
               | overhead, but...Java.
        
         | rwmj wrote:
         | DWARF unwinding isn't practical:
         | https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwar...
        
           | rfoo wrote:
           | TBH this sounds more like perf's implementation is bad.
           | 
           | I'm waiting for this to happen: https://github.com/open-
           | telemetry/community/issues/1918
        
             | javierhonduco wrote:
             | There's always room for improvement, for example, Samply
             | [0] is a wonderful profiler that uses the same APIs that
             | `perf` uses, but unwinds the stacks as they come rather
             | than dumping them all to disk and then having to process
             | them in bulk.
             | 
             | Samply unwinds significantly faster than `perf` because it
             | caches unwind information.
             | 
             | That being said, this approach still has some limitations,
             | such as that very deep stacks won't be unwound, as the size
             | of the process stack the kernel sends is quite limited.
             | 
             | - [0]: https://github.com/mstange/samply
        
       | dap wrote:
       | Good post!
       | 
       | > Profiling has been broken for 20 years and we've only now just
       | fixed it.
       | 
       | It was a shame when they went away. Lots of people, certainly on
       | other systems and probably Linux too, have found the absence of
       | frame pointers painful this whole time and tried to keep them
       | available in as many environments as possible. It's validating
       | (if also kind of frustrating) to see mainstream Linux bring them
       | back.
        
         | trws wrote:
         | I'm sincerely curious. While I realize that using dwarf for
         | unwinding is annoying, why is it so bad that it's worth
         | pessimizing all code on the system? It's slow on Debian
         | derivatives because they package only the slow unwinding path
         | for perf for example, for license reasons, but with decent
         | tooling I barely notice the difference. What am I missing?
        
       | ngcc_hk wrote:
        | The article says GCC. I noted that LLVM is said to have
        | defaulted to frame pointers since 2011. Is this mainly a GCC
        | issue?
        
         | bawolff wrote:
         | It doesn't really matter what the default of the compiler is,
         | but what distros chose.
        
       | WalterBright wrote:
       | Guess I'll add it back in to the DMD code generator!
        
       | Joker_vD wrote:
       | Of course, if you cede RBP to be a frame pointer, you may as well
       | have two stacks, one which is pointed into by RBP and stores the
       | activation frames, and the other one which is pointed into by RSP
       | and stores the return addresses only. At this point, you don't
       | even need to "walk the stack" because the call stack is literally
       | just a flat array of return addresses.
       | 
       | Why do we normally store the return addresses near to the local
       | variables in the first place, again? There are so many downsides.
        
         | naasking wrote:
          | It simplifies storage management. A stack frame is a simple
          | bump pointer which is always in cache, with only one guard
          | page for overflow; in your proposal you need two guard pages,
          | double the stack manipulations, and double the chance of a
          | cache miss.
        
           | Joker_vD wrote:
           | Yes, two guard pages are needed. No, the stack management
           | stays the same: it's just "CALL func" at the call site, "SUB
           | RBP, <frame_size>" at the prologue and "ADD RBP,
           | <frame_size>; RET" at the epilogue. As for chances of a cache
           | miss... probably, but I guess you also double them up when
           | you enable CFET/Shadow Stack so eh.
           | 
           | In exchange, it becomes very difficult for the stack smashing
           | to corrupt the return address.
        
           | imtringued wrote:
            | The Reduceron had five stacks and was faster because of it.
        
         | dan-robertson wrote:
         | Note the 'shadow stacks' CPU feature mentioned briefly in the
         | article, though it's more for security reasons. It's pretty
         | similar to what you describe.
        
           | rwmj wrote:
           | Shadow stacks have been proposed as an alternative, although
           | it's my understanding that in current CPUs they hold only a
           | limited number of frames, like 16 or 32?
        
             | amluto wrote:
             | You may be thinking of the return stack buffer. The shadow
             | stack holds every return address.
        
         | astrobe_ wrote:
         | You may be ready for Forth [1] ;-). Strangely, the Wikipedia
         | article apparently doesn't put forward that Forth allows access
         | both to the parameter and the return stack, which is a major
         | feature of the model.
         | 
         | [1] https://en.wikipedia.org/wiki/Forth_(programming_language)
        
           | mikewarot wrote:
           | Forth has a parameter stack, return stack, vocabulary stack
           | 
           | STOIC, a variant of Forth, includes a file stack when loading
           | words
        
             | samatman wrote:
             | I'm not sure what you're referring to with "vocabulary
             | stack" here, perhaps the dictionary? More of a linked list,
             | really a distinctive data structure of its own.
        
               | astrobe_ wrote:
               | Maybe OP refers to vocabulary search order manipulation
               | [1]. It's sort of like namespaces, but "stacked". There's
               | also the old MARKER and FORGET pair [2].
               | 
               | The dictionary pointer can also be manipulated in some
               | dialects. That can be used directly as the stack variant
               | of the arena allocator idea. It is particularly useful
               | for text concatenation.
               | 
               | [1] https://forth-standard.org/standard/search [2]
               | https://forth-standard.org/standard/core/MARKER
        
           | samatman wrote:
           | That does seem like a significant oversight. >r and r>, and
           | cousins, are part of ANSI Forth, and I've never used a Forth
           | which doesn't have them.
        
         | stefan_ wrote:
          | While we're here, why do we grow the stack the wrong way, so
          | that misbehaving programs cause security issues? I know the
          | reason, of course; like so many things, it last made sense 30
          | years ago, but the effects have been interesting.
        
         | sweetjuly wrote:
         | >Why do we normally store the return addresses near to the
         | local variables in the first place, again? There are so many
         | downsides.
         | 
         | The advantage of storing them elsewhere is not quite clear
         | (unless you have hardware support for things like shadow
         | stacks).
         | 
         | You'd have to argue that the cost of moving things to this
         | other page and managing two pointers (where one is less
         | powerful in the ISA) is meaningfully cheaper than the other
         | equally effective mitigation of stack cookies/protectors which
         | are already able to provide protection only where needed. There
         | is no real security benefit to doing this over what we
         | currently have with stack protectors since an arbitrary
         | read/write will still lead to a CFI bypass.
        
           | weebull wrote:
           | > The advantage of storing them elsewhere is not quite clear
           | (unless you have hardware support for things like shadow
           | stacks).
           | 
            | The classic buffer overflow issue should spring immediately
            | to mind. With a separate return address stack, the return
            | address is far less vulnerable to corruption through
            | overflows of your data structures. This stops a bunch of
            | attacks which purposely place crafted return addresses so
            | as to jump the program to malicious code.
           | 
           | It's not a panacea, but generally keeping code pointers away
           | from data structures is a good idea.
        
       | claytonwramsey wrote:
       | That's very interesting to me - I had seen the `[unknown]`
       | mountain in my profiles but never knew why. I think it's a tough
       | thing to justify: 2% performance is actually a pretty big
       | difference.
       | 
        | It would be really nice to have fine-grained control over frame
        | pointer inclusion: given fine-grained profiling, we could
        | determine whether we need the frame pointers for a given
        | function or compilation unit. I wouldn't be surprised if we saw
        | that only a handful of operations are dramatically slowed by
        | frame pointer inclusion while the rest don't really care.
        
         | naasking wrote:
         | > 2% performance is actually a pretty big difference.
         | 
         | No it's not, particularly when it can help you identify
         | hotspots via profiling that can net you improvements of 10% or
         | more.
        
           | pm215 wrote:
           | Sure, but how many of the people running distro compiled code
           | do perf analysis? And how many of the people who need to do
           | perf analysis are unable to use a with-frame-pointers version
           | when they need to? And how many of those 10% perf
           | improvements are in common distro code that get upstreamed to
           | improve general user experience, as opposed to being in
           | private application code?
           | 
           | If you're netflix then "enable frame pointers" is a no-
           | brainer. But if you're a distro who's building code for
           | millions of users, many of whom will likely never need to
           | fire up a profiler, I think the question is at least a little
           | trickier. The overall best tradeoff might end up being still
           | to enable frame pointers, but I can see the other side too.
        
             | jart wrote:
             | It's not a technical tradeoff, it's a refusal to
             | compromise. Lack of frame pointers prevents many groups
             | from using software built by distros altogether. If a
             | distro decides that they'd rather make things go 1% faster
             | for grandma, at the cost of alienating thousands of
             | engineers at places like Netflix and Google who simply want
                | to volunteer millions of dollars of their employers'
                | resources helping distros find 10x performance
             | improvements, then the distros are doing a great disservice
             | to both grandma and themselves.
        
               | quotemstr wrote:
               | Presenting people with false dichotomies is no way to
               | build something worthwhile
        
               | alerighi wrote:
                | I mean, if you need to do performance analysis on some
                | software, just recompile it. Why is it such a big deal?
                | 
                | In the end, 2% of the performance of every application
                | is a big deal. On a single computer it may not be that
                | significant, but think about all the computers, servers,
                | and clusters that run a Linux distro. And yes, I would
                | tell a Google engineer that, scaled across however many
                | servers and computers Google has, a 2% increase in CPU
                | usage is a big deal: we are probably talking about
                | hundreds of kW more of energy consumption!
                | 
                | We talk a lot about energy efficiency these days; to me,
                | wasting 2% only to make the performance analysis of some
                | software easier (i.e. so that you can analyze directly
                | the package shipped by the distro without having to
                | recompile it) is stupid.
        
             | samatman wrote:
             | I would say the question here is what should be the
             | default, and that the answer is clearly "frame pointers",
             | from my point of view.
             | 
             | Code eking out every possible cycle of performance can
             | enable a no-frame-pointer optimization and see if it helps.
             | But it's a bad default for libc, and for the kernel.
        
         | loeg wrote:
         | It's usually a lot less than 2%.
        
         | inglor_cz wrote:
         | The performance cost in your case may be much smaller than 2
         | per cent.
         | 
         | Don't completely trust the benchmarks on this; they are a bit
         | synthetic and real-world applications tend to produce very
         | different results.
         | 
         | Plus, profiling is important. I was able to speed up various
         | segments of my code by up to 20 per cent by profiling them
         | carefully.
         | 
         | And, at the end of the day, if your application is so sensitive
         | about any loss of performance, you _can_ simply profile your
         | code in your lab using frame pointers, then omit them in the
         | version released to your customers.
        
         | rwmj wrote:
         | The measured overhead is slightly less than 1%. There have been
         | some rare historical cases where frame pointers have caused
         | performance to blow up but those are fixed.
        
         | rwmj wrote:
          | You can turn it on/off per function by attaching one of these
          | GCC attributes to the function declaration (although this
          | doesn't work on LLVM):
          | 
          |     __attribute__((optimize("no-omit-frame-pointer")))
          |     __attribute__((optimize("omit-frame-pointer")))
        
           | ndesaulniers wrote:
            | The optimize fn attr causes other unintended side effects.
            | Its usage is banned in the Linux kernel.
        
       | tdullien wrote:
       | As much as the return of frame pointers is a good thing, it's
       | largely unnecessary -- it arrives at a point where multiple eBPF-
       | based profilers are available that do fine using .eh_frame and
       | also manually unwinding high level language runtime stacks: Both
       | Parca from PolarSignals as well the artist formerly known as
       | Prodfiler (now Elastic Universal Profiling) do fine.
       | 
       | So this is a solution for a problem, and it arrives just at the
       | moment that people have solved the problem more generically ;)
       | 
       | (Prodfiler coauthor here, we had solved all of this by the time
       | we launched in Summer 2021)
        
         | Tomte wrote:
         | You mean we don't need accessible profiling in free software
         | because there are companies selling it to us. Cool.
        
           | brancz wrote:
           | Parca's user-space code is apache2 and the eBPF code is GPL.
        
           | tdullien wrote:
           | Parca is open-source, Prodfiler's eBPF code is GPL, and the
           | rest of Prodfiler is currently going through OTel donation,
           | so my point is: There's now multiple FOSS implementations of
           | a more generic and powerful technique.
        
         | int_19h wrote:
         | PolarSignals is specifically discussed in the linked threads,
         | and they conclude that their approach is not good enough for
         | perf reasons.
        
           | tdullien wrote:
           | Oh nice, I can't find that - can you post a link?
        
           | javierhonduco wrote:
           | Curious to hear more about this. Full disclosure: I designed
           | and implemented .eh_frame unwinding when I worked at Polar
           | Signals.
        
             | int_19h wrote:
             | I think I might have confused two unrelated posts. The one
             | that references Polar Signals is this one:
             | 
             | https://gitlab.com/freedesktop-sdk/freedesktop-
             | sdk/-/issues/...
             | 
             | So not a perf issue there, but they don't think the
             | workflow is suitable for whole-system profiling. Perf
             | issues were in the context of `perf` using DWARF:
             | 
             | https://gitlab.com/freedesktop-sdk/freedesktop-
             | sdk/-/issues/...
        
         | weinzierl wrote:
         | Also I've heard that the whole .eh_frame unwinding is more
         | fragile than a simple frame pointer. I've seen enough broken
          | stack traces myself, but honestly I never checked whether
          | -fno-omit-frame-pointer would have helped.
        
           | tdullien wrote:
           | Yes and no. A simple frame pointer needs to be present in all
           | libraries, and depending on build settings, this might not be
           | the case. .eh_frame tends to be emitted almost everywhere...
           | 
            | So both are similarly fragile, but one of them is almost
            | never disabled.
           | 
           | The broader point is: For HLL runtimes you need to be able to
           | switch between native and interpreted unwinds anyhow, so
           | you'll always do some amount of lifting in eBPF land.
           | 
            | And yes, having frame pointers removes a _lot_ of
            | complexity, so it's net a very good thing. It's just that
            | the situation wasn't nearly as dire as described, because
            | people who care about profiling had built solutions.
        
             | quotemstr wrote:
             | Forget eBPF even -- why do the job of userspace in the
             | kernel? Instead of unwinding via eBPF, we should ask
             | userspace to unwind itself using a synchronous signal
             | delivered to userspace whenever we've requested a stack
             | sample.
        
               | bregma wrote:
               | Context switches are incredibly expensive. Given the
               | sampling rate of eBPF profilers all the useful
               | information would get lost in the context switch noise.
               | 
               | Things get even more complicated because context switches
                | can mean CPU migrations, making much of your data
                | useless.
        
               | quotemstr wrote:
               | What makes you think doing unwinding in userspace would
               | do any more context switches (by which I think you mean
               | privilege level transitions) than we do today? See my
               | other comment on the subject.
               | 
               | > Things get even more complicated because context
               | switches can mean CPU migrations, making many of your
               | data useless.
               | 
                | No it doesn't. If a user space thread is blocked doing
                | kernel work, its stack isn't going to change, not even
                | if that thread ends up resuming on a different CPU.
        
         | searealist wrote:
         | I'm under the impression that eh_frame stack traces are much
         | slower than frame pointer stack traces, which makes always-on
         | profiling, such as seen in tcmalloc, impractical.
        
         | felixge wrote:
         | First of all, I think the .eh_frame unwinding y'all pioneered
         | is great.
         | 
         | But I think you're only thinking about CPU profiling at <= 100
         | Hz / core. However, Brendan's article is also talking about
         | Off-CPU profiling, and as far as I can tell, all known
         | techniques (scheduler tracing, wall clock sampling) require
         | stack unwinding to occur 1-3 orders of magnitude more often
         | than for CPU profiling.
         | 
         | For those use cases, I don't think .eh_frame unwinding will be
         | good enough, at least not for continuous profiling. E.g. see
         | [1][2] for an example of how frame pointer unwinding allowed
         | the Go runtime to lower execution tracing overhead from 10-20%
          | to 1-2%, even though it was already using a relatively fast lookup
         | table approach.
         | 
         | [1] https://go.dev/blog/execution-traces-2024
         | 
         | [2] https://blog.felixge.de/reducing-gos-execution-tracer-
         | overhe...
        
         | nemetroid wrote:
         | If you're sufficiently in control of your deployment details to
         | ensure that BPF is available at all. CAP_SYS_PTRACE is
         | available ~everywhere for everyone.
        
       | 5- wrote:
       | so what is the downside to using e.g. dwarf-based stack walking
       | (supported by perf) for libc, which was the original stated
       | problem?
       | 
       | in the discussion the issue gets conflated with jit-ted
       | languages, but that has nothing to do with the crusade to enable
       | frame pointer for system libraries.
       | 
       | and if you care that much for dwarf overhead... just cache the
       | unwind information in your system-level profiler? no need to
       | rebuild everything.
        
         | yxhuvud wrote:
         | The article explains why DWARF is not an option.
        
           | menaerus wrote:
           | Extremely light on the details, and also conflates it with
           | the JIT which makes it harder to understand the point, so I
           | was wondering about the same thing as well.
        
         | brancz wrote:
          | The way perf does it is slow, as the entire stack is copied
          | into user-space and then asynchronously unwound.
          | 
          | This is solvable, as Brendan calls out. We've created an
          | eBPF-based profiler at Polar Signals that essentially does
          | what you said: it optimizes the unwind tables, caches them
          | in BPF maps, and then synchronously unwinds instead of
          | copying the whole stack into user-space.
        
           | stefan_ wrote:
           | This conveniently sidesteps the whole issue of getting DWARF
           | data in the first place, which is also still a broken
           | disjointed mess on Linux. Hell, _Windows_ solved this many
           | many years ago.
        
             | bregma wrote:
             | You'd need a pretty special distro to have enabled -fno-
             | asynchronous-unwind-tables by default in its toolchain.
             | 
             | By default on most Linux distros the frame tables are built
             | into all the binaries, and end up in the GNU_EH_FRAME
             | segment, which is always available in any running process.
             | Doesn't sound a broken and disjointed mess to me. Sounds
             | more like a smoothly running solved problem.
        
           | Sesse__ wrote:
           | It should also be said that you need some sort of DWARF-like
           | information to understand inlining. If I have a function A
           | that inlines B that in turn inlines C, I'd often like to
           | understand that C takes a bunch of time, and with frame
           | pointers only, that information gets lost.
        
             | javierhonduco wrote:
             | Inlined functions can be symbolized using DWARF line
             | information[0] while unwinding requires DWARF unwind
             | information (CFI), which the x86_64 ABI mandates in every
              | single ELF in the `.eh_frame` section.
             | 
             | - [0] This line information might or might not be present
             | in an executable but luckily there's debuginfod
             | (https://sourceware.org/elfutils/Debuginfod.html)
        
         | rwmj wrote:
         | The downside is it doesn't work at all:
         | https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwar...
        
       | ReleaseCandidat wrote:
       | That's one thing Apple did do right on ARM:
       | 
       | > The frame pointer register (x29) must always address a valid
       | frame record. Some functions -- such as leaf functions or tail
       | calls -- may opt not to create an entry in this list. As a
       | result, stack traces are always meaningful, even without debug
       | information.
       | 
       | https://developer.apple.com/documentation/xcode/writing-arm6...
        
         | microtherion wrote:
         | On Apple platforms, there is often an interpretability problem
         | of another kind: Because of the prevalence of deeply nested
         | blocks / closures, backtraces for Objective C / Swift apps are
         | often spread across numerous threads. I don't know of a good
         | solution for that yet.
        
           | felixge wrote:
           | I'm not very familiar with Objective C and Swift, so this
           | might not make sense. But JS used to have a similar problem
           | with async/await. The v8 engine solved it by walking the
           | chain of JS promises to recover the "logical stack"
           | developers are interested in [1].
           | 
           | [1] https://v8.dev/blog/fast-async
        
             | astrange wrote:
             | Swift concurrency does a similar thing. For the older
             | dispatch blocks, Xcode injects a library that records
             | backtraces over thread hops.
        
       | eqvinox wrote:
       | This doesn't detract from the content at all but the register
       | counts are off; SI and DI count as GPRs on i686 bringing it to
       | 6+BP (not 4+BP) meanwhile x86_64 has 14+BP (not 16+BP).
        
         | cesarb wrote:
         | > [...] on i686 bringing it to 6+BP (not 4+BP) meanwhile x86_64
         | has 14+BP (not 16+BP).
         | 
         | That is, on i686 you have 7 GPRs without frame pointers, while
         | on x86_64 you have 14 GPRs even with frame pointers.
         | 
         | Copying a comment of mine from an older related discussion
         | (https://news.ycombinator.com/item?id=38632848):
         | 
         | "To emphasize this point: on 64-bit x86 with frame pointers,
         | you have twice as many registers as on 32-bit x86 without frame
         | pointers, and these registers are twice as wide. A 64-bit value
         | (more common than you'd expect even when pointers are 32 bits)
         | takes two registers on 32-bit x86, but only a single register
         | on 64-bit x86."
        
           | brendangregg wrote:
           | Thanks!
        
       | benreesman wrote:
       | Brendan is such a treasure to the community (buy his book it's
       | great).
       | 
       | I wasn't doing extreme performance stuff when -fomit-frame-
       | pointer became the norm, so maybe it was a big win for enough
       | people to be a sane default, but even that seems dubious: "just
       | works" profiling is how you figure out when you're in an extreme
       | performance scenario (if you're an SG14 WG type, you know it and
       | are used to all the defaults being wrong for you).
       | 
       | I'm deeply grateful for all the legends who have worked on
       | libunwind, gperf stuff, perftool, DTrace, eBPF: these are the
       | too-often-unsung heroes of software that is still fast after
       | decades of Moore's law free-riding.
       | 
       | But they've been fighting an uphill battle against a weird
       | alliance of people trying to game compiler benchmarks and the
       | really irresponsible posture that "developer time is more
       | expensive", which is only sometimes true, and never true if you
       | care about people on low-spec gear -- the part of the global
       | community of users that is already the least resourced.
       | 
       | I'm fortunate enough to have a fairly modern desktop, laptop, and
       | phone: for me it's merely annoying that chat applications and
       | music players and windowing systems offer nothing new except
       | enshittification in terms of features while needing 10-100x the
       | resources they did a decade ago.
       | 
       | But for half of my career and 2/3rds of my time coding, I was on
       | low-spec gear most of the time, and I would have been largely
       | excluded if people didn't care a lot about old computers back
       | then.
       | 
       | I'm trying to help a couple of aspiring hackers get started
       | right now, and it's a real struggle to get their environments
       | set up with limitations like Intel Macs and WSL2 as the Linux
       | option (WSL2 is very cool but it's not loved enough by e.g.
       | yarn projects).
       | 
       | If you want new hackers, you need to make things work well on
       | older computers.
       | 
       | Thanks again Brendan et al!
        
       | sesm wrote:
        | glibc is only 2 MB; why does Chrome rely on the system glibc
        | instead of statically linking its own version with frame
        | pointers enabled?
        
         | wruza wrote:
         | https://stackoverflow.com/questions/57476533/why-is-statical...
         | 
          | I guess it's a similar situation with msvcrt.
        
         | nolist_policy wrote:
         | At the very least Chrome needs to link to the system libGL.so
         | and friends for gpu acceleration, libva.so for video
         | acceleration, and so on. And these are linked against glibc of
         | course.
        
           | dooglius wrote:
           | having/omitting frame pointers doesn't change the ABI; it
           | will work if you compile against glibc-nofp and link against
           | glibc-withfp
        
       | dsign wrote:
       | I remember when the omission of stack frame pointers started
       | spreading at the beginning of the 2000s. I was in college at the
       | time, studying computer sciences in a very poor third-world
       | country. Our computers were old and far from powerful. So, for
        | most course projects, we would eschew interpreters and use
       | compilers. Mind you, what my college lacked in money it
       | compensated by having interesting course work. We studied and
       | implemented low level data-structures, compilers, assembly-code
       | numerical routines and even a device driver for Minix.
       | 
       | During my first two years in college, if one of our programs did
       | something funny, I would attach gdb and see what was happening at
       | assembly level. I got used to "walking the stack" manually,
       | though the debugger often helped a lot. Happy times, until all of
       | the sudden, "-fomit-frame-pointer" was all the rage, and stack
       | traces stopped making sense. Just like that, debugging that
       | segfault or illegal instruction became exponentially harder. A
       | short time later, I started using Python for almost everything to
       | avoid broken debugging sessions. So, I lost an order of magnitude
       | or two with "-fomit-frame-pointer". But learning Python served me
       | well for other adventures.
        
       | rwmj wrote:
       | I'm glad he mentioned Fedora because it's been a tiresome battle
       | to keep frame pointers enabled in the whole distribution (eg
       | https://pagure.io/fesco/issue/3084).
       | 
       | There's a persistent myth that frame pointers have a huge
       | overhead, because there was a single Python case that had a +10%
        | slowdown (now fixed). The actual measured overhead is under 1%,
       | which is far outweighed by the benefits we've been able to make
       | in certain applications.
        
         | brendangregg wrote:
         | Thanks; what was the Python fix?
        
           | rwmj wrote:
           | This was the investigation:
           | https://discuss.python.org/t/python-3-11-performance-with-
           | fr...
           | 
           | Initially we just turned off frame pointers for the Python
           | 3.9 interpreter in Fedora. They are back on in Python 3.12
           | where it seems the upstream bug has been fixed, although I
           | can't find the actual fix right now.
           | 
           | Fedora tracking bug: https://bugzilla.redhat.com/2158729
           | 
           | Fedora change in Python 3.9 to disable frame pointers: https:
           | //src.fedoraproject.org/rpms/python3.9/c/9b71f8369141c...
        
             | brendangregg wrote:
             | Ah right, thanks, I remember I saw Andrii's analysis in the
             | other thread.
             | https://pagure.io/fesco/issue/2817#comment-826636
        
         | menaerus wrote:
         | I believe it's a misrepresentation to say that "actual measured
         | overhead is under 1%". I don't think such a claim can be
         | universally applied because this depends on the very workload
         | you're measuring the overhead with.
         | 
         | FWIW your results don't quite match the measurements from Linux
         | kernel folks who claim that the overhead is anywhere between
         | 5-10%. Source:
         | https://lore.kernel.org/lkml/20170602104048.jkkzssljsompjdwy...
          | I didn't preserve the data involved, but in a variety of
          | workloads including netperf, a page allocator microbenchmark,
          | pgbench, and sqlite, enabling frame pointers introduced
          | overhead of around the 5-10% mark.
         | 
          | The significance of their results IMO is that they measured
          | the impact using PostgreSQL and SQLite. If anything, DBMSes
          | are one of the best ways to really stress the system.
        
           | brendangregg wrote:
           | Those are microbenchmarks.
        
             | menaerus wrote:
             | pgbench is not a microbenchmark.
        
               | brendangregg wrote:
               | From the docs: "pgbench is a simple program for running
               | benchmark tests on PostgreSQL. It runs the same sequence
               | of SQL commands over and over"
               | 
               | While it might call itself a benchmark, it behaves very
               | microbenchmark-y.
               | 
               | The other numbers I and others have shared have been from
               | actual production workloads. Not a simple program that
               | tests same sequence of commands over and over.
        
               | menaerus wrote:
                | While pgbench might be a "simple" program, as in a
                | test-runner, the workloads it runs are far from simple.
                | It runs TPC-B by default but can also run an arbitrary
                | script defining whatever workload you want. It also
                | allows queries to run concurrently, so I fail to
                | understand the reasoning behind calling it "simple" or
                | "microbenchmark-y". It's far from the truth, I think.
        
               | weebull wrote:
               | Anything running a full database server is not micro.
        
               | brendangregg wrote:
               | If I call the same "get statistics" command over and over
               | in a loop (with zero queries), or 100% the same invalid
               | query (to test the error path performance), I believe
               | we'd call that a micro-benchmark, despite involving a
               | full database. It's a completely unrealistic artificial
               | workload to test a particular type of operation.
               | 
               | The pgbench docs make it sound microbenchmark-y by
               | describing making the same call over and over. If people
               | find that this simulates actual production workloads,
               | then yes, it can be considered a macro-benchmark.
        
               | anarazel wrote:
                | There are loads of real world workloads that have similar
               | patterns to pgbench, particularly read only pgbench.
        
           | babel_ wrote:
           | Those are numbers from 7 years ago, so they're beginning to
           | get a bit stale as people start to put more weight behind
           | having frame pointers and make upstream contributions to
           | their compilers to improve their output. People put it at <1%
           | from much more recent testing by the very R.W.M. Jones you're
           | replying to [0] and separate testing by others like Brendan
           | Gregg [1b], whose post this is commenting on (and included
           | [1b] in the Appendix as well), with similar accounts by
           | others in the last couple years. Oh, and if you use
           | flamegraph, you might want to check the repo for a familiar
           | name.
           | 
           | Some programs, like Python, have reported worse, 2-7% [2],
           | but there is traction on tackling that [1a] (see both rwmj's
           | and brendangregg's replies to sibling comments, they've both
           | done a lot of upstreamed work wrt. frame pointers,
           | performance, and profiling).
           | 
           | As has been frequently pointed out, the benefits from
            | improved profiling cannot be overstated; even a 10% cost to
           | having frame pointers can be well worth it when you leverage
           | that information to target the actual bottlenecks that are
           | eating up your cycles. Plus, you can always disable it in
           | specific hotspots later when needed, which is much easier
           | than the reverse.
           | 
           | Something, something, premature optimisation -- though in
           | seriousness, this information benefits actual optimisation,
           | exactly because we don't have the information and
           | understanding that would allow truly universal claims,
           | precisely because things like this haven't been available,
           | and so haven't been widely used. We know frame pointers, from
           | additional register pressure and extended function
           | prologue/epilogue, can be a detriment in certain hotspots;
           | that's why we have granular control. But without them, we
           | often don't know which hotspots are actually affected, so I'm
           | sure even the databases would benefit... though the "my
           | database is the fastest database" problem has always been the
           | result of endless micro-benchmarking, rather than actual end-
           | to-end program performance and latency, so even a claimed
           | "10%" drop there probably doesn't impact actual real-world
           | usage, but that's a reason why some of the most interesting
           | profiling work lately has been from ideas like causal
           | profilers and continuous profilers, which answer exactly
           | that.
           | 
           | [0]: https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-
           | dwar... [1a]:
           | https://pagure.io/fesco/issue/2817#comment-826636 [1b]:
           | https://pagure.io/fesco/issue/2817#comment-826805 [2]:
           | https://discuss.python.org/t/the-performance-of-python-
           | with-...
        
             | adrian_b wrote:
             | While improved profiling is useful, achieving it by wasting
             | a register is annoying, because it is just a very dumb
             | solution.
             | 
              | The choice Intel made when they designed the 8086 to use
              | 2 separate registers for the stack pointer and for the
              | frame pointer was a big mistake.
             | 
             | It is very easy to use a single register as both the stack
             | pointer and the frame pointer, as it is standard for
             | instance in IBM POWER.
             | 
             | Unfortunately in the Intel/AMD CPUs using a single register
             | is difficult, because the simplest implementation is
             | unreliable since interrupts may occur between 2
             | instructions that must form an atomic sequence (and they
             | may clobber the stack before new space is allocated after
             | writing the old frame pointer value in the stack).
             | 
             | It would have been very easy to correct this in new CPUs by
             | detecting that instruction sequence and blocking the
             | interrupts between them.
             | 
              | Intel had already done this once early in the history of
              | the x86 CPUs, when they discovered a mistake in the
              | design of the ISA: interrupts could occur between
              | updating the stack segment and the stack pointer. They
              | then corrected this by detecting such an instruction
              | sequence and blocking interrupts at the boundary between
              | those instructions.
             | 
             | The same could have been done now, to enable the use of the
             | stack pointer as also the frame pointer. (This would be
             | done by always saving the stack pointer in the top of the
             | stack whenever stack space is allocated, so that the stack
             | pointer always points to the previous frame pointer, i.e.
             | to the start of the linked list containing all stack
             | frames.)
        
             | doctorpangloss wrote:
             | > As has been frequently pointed out, the benefits from
              | improved profiling cannot be overstated; even a 10% cost
             | to having frame pointers can be well worth it when you
             | leverage that information to target the actual bottlenecks
             | that are eating up your cycles.
             | 
             | Few can leverage that information because the open source
             | software you are talking about lacks telemetry in the self
             | hosted case.
             | 
             | The profiling issue really comes down to the cultural
             | opposition in these communities to collecting telemetry and
             | opening it for anyone to see and use. The average user
             | struggles to ally with a trustworthy actor who will share
             | the information like profiling freely and anonymize it at a
             | per-user level, the level that is actually useful. Such
             | things exist, like the Linux hardware site, but only
             | because they have not attracted the attention of agitators.
             | 
             | Basically users are okay with profiling, so long as it is
             | quietly done by Amazon or Microsoft or Google, and not by
             | the guy actually writing the code and giving it out for
             | everyone to use for free. It's one of the most moronic
             | cultural trends, and blame can be put squarely on product
              | growth grifters who equate telemetry with privacy
             | violations; open source maintainers, who have enough
             | responsibilities as is, besides educating their users; and
             | Apple, who have made their essentially vaporous claims
             | about privacy a central part of their brand.
             | 
             | Of course people know the answer to your question. Why
             | doesn't Google publish every profile of every piece of open
             | source software? What exactly is sensitive about their
             | workloads? Meta publishes a whole library about every
             | single one of its customers, for anyone to freely read. I
             | don't buy into the holiness of the backend developer's
             | "cleverness" or whatever is deemed sensitive, and it's so
             | hypocritical.
        
               | yjftsjthsd-h wrote:
               | > Basically users are okay with profiling, so long as it
               | is quietly done by Amazon or Microsoft or Google, and not
               | by the guy actually writing the code and giving it out
               | for everyone to use for free.
               | 
               | No; the groups are approximately "cares whether software
               | respects the user, including privacy", or "doesn't know
               | or doesn't care". I seriously doubt that any meaningful
               | number of people are okay with companies invading their
               | privacy but not smaller projects.
        
               | babel_ wrote:
               | I think the kind of profiling information you're
               | imagining is a little different from what I am.
               | 
               | Continuous profiling of your system that gets relayed to
               | someone else by telemetry is very different from
               | continuous profiling of your own system, handled only by
               | yourself (or, generalising, your
               | community/group/company). You seem to be imagining we're
               | operating more in the former, whereas I am imagining more
               | in the latter.
               | 
               | When it's our own system, better instrumented for our own
               | uses, and we're the only ones getting the information,
               | then there's nothing to worry about, and we can get much
               | more meaningful and informative profiling done when more
               | information about the system is available. I don't even
               | need telemetry. When it's "someone else's" system, in
               | other words, when we have no say in telemetry (or have to
               | exercise a right to opt-out, rather than a more self-
               | executing contract around opt-in), then we start to have
               | exactly the kinds of issues you're envisaging.
               | 
               | When it's not completely out of our hands, then we need
               | to recognise different users, different demands,
               | different contexts. Catering to the user matters, and
               | when it comes to sensitive information, well, people have
               | different priorities and threat models.
               | 
               | If I'm opening a calendar on my phone, I don't expect it
               | to be heavily instrumented and relaying all of that, I
               | just want to see my calendar. When I open a calendar on
               | my phone, and it is unreasonably slow, then I might want
               | to submit relevant telemetry back in some capacity.
               | Meanwhile, if I'm running the calendar server, I'm
               | absolutely wanting to have all my instrumentation
               | available and recording every morsel I reasonably can
               | about that server, otherwise improving it or fixing it
               | becomes much harder.
               | 
               | From the other side, if I'm running the server, I may
                | _want_ telemetry from users, but if it's not essential,
               | then I can "make do" with only the occasional opt-in
               | telemetry. I also have other means of profiling real
               | usage, not just scooping it all up from unknowing users
               | (or begrudging users). Those often have some other
               | "cost", but in turn, they don't have the "cost" of
               | demanding it from users. For people to freely choose
               | requires acknowledging the asymmetries present, and that
               | means we can't just take the path of least resistance, as
               | we may have to pay for it later.
               | 
               | In short, it's a consent issue. Many violate that,
               | knowingly, because they care not for the consequences.
               | Many others don't even seem to think about it, and just
               | go ahead regardless. And it's so much easier behind
               | closed doors. Open source in comparison, even if not
               | everything is public, must contend with the fact that its
               | actions and consequences are visible (PRs, telemetry
               | traffic, etc.), so it inhabits a space in which violating
               | consent
               | is much more easily held accountable (though no
               | guarantee).
               | 
               | Of course, this does not mean it's always done properly
               | in open source. It's often an uphill battle to get
               | telemetry that's off-by-default, where users explicitly
               | consent via opt-in, as people see how that could easily
               | be undermined, or later invalidated. Many opt-in
               | mechanisms (e.g. a toggle in the settings menu) often do
               | not have expiration built in, so fail to check at a later
               | point that someone still consents. Not to say that's the
               | way you must do it, just giving an example of a way that
               | people seem to be more in favour of, as with the
               | generally favourable response to such features making
               | their way into "permissions" on mobile.
               | 
               | We can see how the suspicion creeps in, informed by
               | experience... but that's also known by another word:
               | vigilance.
               | 
               | So, users are not "okay" with it. There's a power
               | imbalance where these companies are afforded the impunity
               | because many are left to conclude they have no choice but
               | to let them get away with it. That hasn't formed in a
               | vacuum, and it's not so simple that we just pull back the
               | curtain and reveal the wizard for what he is. Most seem
               | to already know.
               | 
               | It's proven extremely difficult to push alternatives. One
               | reason is that information is frequently not ready-to-
               | hand for more typical users, but another is that said
               | alternatives may not actually fulfil the needs of some
               | users: notably, accessibility remains hugely inconsistent
               | in open source, and is usually not funded on par with,
               | say, projects that affect "backend" performance.
               | 
               | The result? Many people just give their grandma an
               | iPhone. That's what's telling about the state of open
               | source, and of the actual cultural trends that made it
               | this way. The threat model is fraudsters and scammers,
               | not nation-state actors or corporate malfeasance. This
               | app has tons of profiling and privacy issues? So what? At
               | least grandma can use it, and we can stay in contact,
               | dealing with the very real cultural trends towards
               | isolation. On a certain level, it's just pragmatic.
               | They'd choose differently if they could, but they don't
               | feel like they can, and they've got bigger worries.
               | 
               | Unless we _do_ different, the status quo will remain. If
               | there's any agitation to be had, it's in getting more
               | people to care about improving things and then actually
               | doing them, even if it's just taking small steps. There
               | won't be a perfect solution that appears out of nowhere
               | tomorrow, but we only have a low bar to clear. Besides,
               | we've all thought "I could do better than that", so why
               | not? Why not just aim for better?
               | 
               | Who knows, we might actually achieve it.
        
               | matheusmoreira wrote:
               | "Agitators". We don't trust telemetry precisely because
               | of comments like that. The world is full of people like
               | you who apparently see absolutely nothing wrong with
               | exfiltrating identifying information from other people's
               | computers. We have to actively resist such attempts;
               | they are constant, never-ending, and it only seems to get
               | worse over time, yet you dismiss it all as "cultural
               | opposition" to telemetry.
               | 
               | For the record I'm NOT OK with being profiled, measured
               | or otherwise studied in any way without my explicit
               | consent. That even extends to the unethical human
               | experiments that corporations run on people and which
               | they euphemistically call A/B tests. I don't care if it's
               | Google or a hobbyist developer, I will block it if I can
               | and I will not lose a second of sleep over it.
        
               | rstuart4133 wrote:
               | > World is full of people like you who apparently see
               | absolutely nothing wrong with exfiltrating identifying
               | information from other people's computers.
               | 
               | True. But such people are like cockroaches. They know
               | what they are doing will be unpopular with their targets,
               | so they keep it hidden. This is easy enough to do in
               | closed designs: car manufacturers selling your driving
               | habits to insurance companies, and health-monitoring apps
               | selling menstrual-cycle data to retailers that market to
               | women.
               | 
               | Compare that to, say, Debian and Red Hat. They too
               | collect performance data. But the code is open source,
               | Debian has reproducible builds so you can be 100% sure
               | that is the code in use, and every so often someone takes
               | a look at it. Guess what: the data they send back is so
               | unidentifiable that it satisfies even the most paranoid
               | of their thousands of members.
               | 
               | All it takes is a little bit of sunlight to keep the
               | cockroaches at bay, and then we can safely let the devs
               | collect the data they need to improve code. And everyone
               | benefits.
        
           | barrkel wrote:
           | This isn't an argument for a default.
        
             | menaerus wrote:
             | I was not even trying to make one. I was questioning the
             | validity of the "1% overhead" claim by providing a
             | counter-example from a respectable source.
        
         | edwintorok wrote:
         | You probably already know, but with OCaml 5 the only way to get
         | flamegraphs working is to either:
         | 
         | * use framepointers [1]
         | 
         | * use LBR (but LBR has a limited depth, and may not work on
         | all CPUs, I'm assuming due to bugs in perf)
         | 
         | * implement some deep changes in how perf works to handle the 2
         | stacks in OCaml (I don't even know if this would be possible),
         | or write/adapt some eBPF code to do it
         | 
         | OCaml 5 has a separate stack for OCaml code and C code, and
         | although GDB can link them based on DWARF info, perf DWARF
         | call-graphs cannot (https://github.com/ocaml/ocaml/issues/12563
         | #issuecomment-193...)
         | 
         | If you need more evidence to keep it enabled in future
         | releases, you can use OCaml 5 as an example (unfortunately
         | there aren't many OCaml applications, so that may not carry too
         | much weight on its own).
         | 
         | [1]: I hadn't actually realised that Fedora 39 has already
         | enabled FP by default, nice! (I still do most of my day-to-day
         | profiling on a ~CentOS 7 system with 'perf record --call-graph
         | dwarf -F 47 -a'; I was aware that there was a discussion about
         | enabling FP by default, but hadn't noticed it had actually
         | been done already)
        
           | namibj wrote:
           | No, LBR is an Intel-only feature.
        
         | awaythrow999 wrote:
         | Frame pointers are still a no-go on 32bit so anything that is
         | IoT today.
         | 
         | The reason we removed them was not a myth, but comes from the
         | pre-64-bit days. Not that long ago, actually.
         | 
         | Even today, if you want to repurpose older 64-bit systems with
         | a new life, then this kind of optimization still makes sense.
         | 
         | Ideally it should be the default also for security-critical
         | systems, because not everything needs to be optimized for
         | "observability".
        
           | Narishma wrote:
           | > Frame pointers are still a no-go on 32bit so anything that
           | is IoT today.
           | 
           | Isn't that just 32-bit x86, which isn't used in IoT? The
           | other 32-bit ISAs aren't register-starved like x86.
        
             | weebull wrote:
             | It would be, yes. x86 had very few registers, so anything
             | you could do to free them up was vital. Arm 32bit has 32
             | general purpose registers I think, and RISC V certainly
             | does. In fact there's no difference between 32 and 64 bit
             | in that respect. If anything, 64-bit frame pointers make it
             | marginally worse.
        
               | CountSessine wrote:
               | Sadly, no. 32-bit ARM only has 16 GPR's (two of which are
               | zero and link), mostly because of the stupid predication
               | bits in the instruction encoding.
               | 
               | That said, I don't know how valuable getting rid of FP on
               | ARM is - I once benchmarked ffmpeg on 32-bit x86 before
               | and after enabling FP and PIC (basically removing 2 GPRs)
               | and the difference was huge (>10%) but that's an extreme
               | example.
        
               | fanf2 wrote:
               | Arm32 doesn't have a zero-value register. Its non-
               | general-purpose registers are PC, LR, SP, FP - tho the
               | link register can be used for temporary values.
        
       | zzbn00 wrote:
       | Nix (and I assume Guix) is very convenient for this, as it is
       | fairly easy to turn frame pointers on or off for parts or the
       | whole of the system.
        
       | tkiolp4 wrote:
       | Are his books (the one about Systems Performance and eBPF)
       | relevant for normal software engineers who want to improve
       | performance in normal services? I don't work for faang, and our
       | usual performance issues are solved by adding indexes here and
       | there, caching, and simple code analysis. Tools like Datadog help
       | a lot already.
        
         | wavemode wrote:
         | Diving into flame graphs being worthwhile for optimization
         | assumes that your workload is CPU-bound. Most business
         | software does not have such workloads, and rather (as you
         | yourself have noted) spends most of its time waiting for I/O
         | (database, network, filesystem, etc).
         | 
         | And so, (as you again have noted), your best bet is to just use
         | plain old logging and tracing (like what datadog provides) to
         | find out where the waiting is happening.
        
         | polio wrote:
         | Profiling is a pretty basic technique that is applicable to all
         | software engineering. I'm not sure what a "normal" service is
         | here, but I think we all have an obligation to understand
         | what's happening in the systems we own.
         | 
         | Some people may believe that 100ms latency is acceptable for a
         | CLI tool, but what if it could be 3ms? On some aesthetic level,
         | it also feels good to be able to eliminate excess. Finally, you
         | should learn it because you won't necessarily have that job
         | forever.
        
       | mgaunard wrote:
       | You don't need frame pointers, all the relevant info is stored in
       | dwarf debug data.
        
       | DaveFlater wrote:
       | GCC optimization causes the frame pointer push to move around,
       | resulting in wrong call stacks. "Wontfix"
       | 
       | https://news.ycombinator.com/item?id=38896343
        
         | rwmj wrote:
         | That was in 2012. Does it still occur on modern GCC?
         | 
         | There definitely have been regressions with frame pointers
         | being enabled, although we've fixed all the ones we've found in
         | current (2024) Fedora.
        
           | jart wrote:
           | I think so, and I vaguely seem to recall -fno-schedule-insns2
           | being the only thing that fixes it. To get the full power of
           | frame pointers and a hackable binary, what I use is:
           | 
           |     -fno-schedule-insns2
           |     -fno-omit-frame-pointer
           |     -fno-optimize-sibling-calls
           |     -mno-omit-leaf-frame-pointer
           |     -fpatchable-function-entry=18,16
           |     -fno-inline-functions-called-once
           | 
           | The only flag that's potentially problematic is
           | -fno-optimize-sibling-calls, since it breaks the optimal
           | approach to writing interpreters and slows down code that's
           | written in a more mathematical style.
        
           | ndesaulniers wrote:
           | Pretty sure Thumb code generated by GCC is still
           | non-unwindable via FPs. That's been a pain.
        
       | tzot wrote:
       | I am not sure, but I believe -fomit-frame-pointer on x86-64
       | allows the compiler to use a _thirteenth_ register, not a
       | _seventeenth_.
        
       | cesarb wrote:
       | I disagree with this sentence of the article:
       | 
       | "I could say that times have changed and now the original 2004
       | reasons for omitting frame pointers are no longer valid in 2024."
       | 
       | The original 2004 reason for omitting frame pointers is still
       | valid in 2024: it's still a big performance win on the register-
       | starved 32-bit x86 architecture. What has changed is that the
       | 32-bit x86 architecture is much less relevant nowadays (other
       | than legacy software, for most people it's only used for a small
       | instant while starting up the firmware), and other common 32-bit
       | architectures (like embedded 32-bit ARM) are not as register-
       | starved as the 32-bit x86.
        
         | IshKebab wrote:
         | That's exactly what they were saying. You're not disagreeing at
         | all.
        
       | shaggie76 wrote:
       | I thought we'd been using /Oy (Frame-Pointer Omission) for years
       | on Windows and that there was a pdata section on x64 that was
       | used for stack-walking however to my great surprise I just read
       | on MSDN that "In x64 compilers, /Oy and /Oy- are not available."
       | 
       | Does this mean Microsoft decided they weren't going to support
       | breaking profilers and debuggers OR is there some magic in the
       | pdata section that makes it work even if you omit the frame-
       | pointer?
        
         | MarkSweep wrote:
         | Some Google found this:
         | https://devblogs.microsoft.com/oldnewthing/20130906-00/?p=33...
         | 
         | "Recovering a broken stack on x64 machines on Windows is
         | trickier because the x64 uses unwind codes for stack walking
         | rather than a frame pointer chain."
         | 
         | More details are here: https://learn.microsoft.com/en-
         | us/cpp/build/exception-handli...
        
         | quotemstr wrote:
         | Microsoft has had excellent universal unwinding support for
         | decades now. I'm disappointed to see someone as prominent as
         | this article's author present as infeasible what Microsoft has
         | had working for so long.
        
         | musjleman wrote:
         | > In x64 compilers
         | 
         | The default is omission. If you have a Windows machine, in all
         | likelihood almost no 64 bit code running on it has frame
         | pointers.
         | 
         | > OR is there some magic in the pdata section that makes it
         | work even if you omit the frame-pointer
         | 
         | You haven't ever needed frame pointers to unwind using ...
         | unwind information. The same thing exists for linux as
         | `.eh_frame` section.
        
       | javierhonduco wrote:
       | Overall, I am for frame pointers, but after some years working in
       | this space, I thought I would share some thoughts:
       | 
       | * Many frame pointer unwinders don't account for a problem they
       | have that DWARF unwind info doesn't have: the fact that the
       | frame set-up is not atomic. It's done in two instructions,
       | `push %rbp` and `mov %rsp, %rbp`, and if a snapshot is taken
       | between the `push` and the `mov`, we'll miss the parent frame.
       | I think this might be fixable by inspecting the code, but that
       | might only be as good as a heuristic, as there could be other
       | `push %rbp` instructions unrelated to the stack frame. I would
       | love to hear if there's a better approach!
       | 
       | * I developed the solution Brendan mentions which allows faster,
       | in-kernel unwinding without frame pointers using BPF [0]. This
       | doesn't use DWARF CFI (the unwind info) as-is but converts it
       | into a random-access format that we can use in BPF. He mentions
       | not supporting JVM languages, and while it's true that right now
       | it only supports JIT sections that have frame pointers, I planned
       | to implement a full JVM interpreter unwinder. I have left Polar
       | Signals since and shifted priorities but it's feasible to get a
       | JVM unwinder to work in lockstep with the native unwinder.
       | 
       | * In an ideal world, enabling frame pointers should be done on a
       | case-by-case basis. Benchmarking is key, and the tradeoffs you
       | make might change a lot depending on the industry you are in and
       | what your software is doing. In the past I have seen large
       | projects enable/disable frame pointers without an in-depth
       | assessment of the losses/gains in performance, observability,
       | and how they connect to business metrics. The Fedora folks have
       | done a superb and rigorous job here.
       | 
       | * Related to the previous point, having a build system that
       | enables you to change this system-wide, including the libraries
       | your software depends on, can be awesome not only to test these
       | changes but also to put them in production.
       | 
       | * Lastly, I am quite excited about SFrame that Indu is working
       | on. It's going to solve a lot of the problems we are facing right
       | now while letting users decide whether they use frame pointers. I
       | can't wait for it, but I am afraid it might take several years
       | until all the infrastructure is in place and everybody upgrades
       | to it.
       | 
       | - [0]:
       | https://web.archive.org/web/20231222054207/https://www.polar...
        
         | rwmj wrote:
         | On the third point, you have to enable frame pointers across
         | the whole Linux distro in order to get good flamegraphs. You
         | have to do whole-system analysis to really understand what's
         | going on. The way that current binary Linux distros (like
         | Fedora and Debian) work makes any alternative impossible.
        
         | felixge wrote:
         | Great comments, thanks for sharing. The non-atomic frame setup
         | is indeed problematic for CPU profilers, but it's not an issue
         | for allocation profiling, off-CPU profiling, or other types of
         | non-interrupt-driven profiling. But as you mentioned, there
         | might be ways to solve that problem.
        
         | brancz wrote:
         | Great comment! Just want to add we are making good progress on
         | the JVM unwinder!
        
       | secondcoming wrote:
       | Can they not be disabled on a per-function basis?
        
       | weebull wrote:
       | Just as a general comment on this topic...
       | 
       | The fact that people complain about the performance of the
       | mechanism that enables the system to be profiled, and so
       | performance problems be identified, is beyond ironic. Surely the
       | epitome of premature optimisation.
        
         | doubloon wrote:
         | I'm sure in ancient Mesopotamia there was somebody arguing
         | that you could brew beer faster if you stopped measuring the
         | hops so carefully, but then someone else was saying yes, but
         | if you don't measure the hops carefully then you don't know
         | the efficiency of your overall beer-making process, so you
         | can't isolate the bottlenecks.
         | 
         | The funny thing is I am not sure if the world would actually
         | work properly if we didn't have both of these kinds of people.
        
         | AtlasBarfed wrote:
         | So what are these other techniques the 2004 migration away
         | from frame pointers assumed would work for stack walking? Why
         | don't they work today? I get that x86_64 has a lot more
         | registers, so there's minimal value in freeing up one more?
        
           | loeg wrote:
           | In 2004, the assumption made by the GCC developers was that
           | you would be walking stacks very infrequently, in a debugger
           | like GDB. Not sampling stacks 1000s of times a second for
           | profiling.
        
       | Cold_Miserable wrote:
       | Not interesting. Enter/leave also do the same thing as your
       | save/restore of rbp.
       | 
       | Far more interesting: I recall there might be an instruction
       | where rbp isn't allowed.
        
       | titzer wrote:
       | Virgil doesn't use frame pointers. If you don't have dynamic
       | stack allocation, the frame of a given function has a fixed size
       | and can be found with a simple (binary-search) table lookup.
       | Virgil's technique uses an additional page-indexed range that
       | further restricts the lookup to a few comparisons on average
       | (O(log(# retpoints per page))). It combines the unwind info with
       | stackmaps for GC. It takes very little space.
       | 
       | The main driver is in
       | (https://github.com/titzer/virgil/blob/master/rt/native/Nativ...
       | the rest of the code in the directory implements the decoding of
       | metadata.
       | 
       | I think frame pointers only make sense if frames are dynamically-
       | sized (i.e. have stack allocation of data). Otherwise it seems
       | weird to me that a dynamic mechanism is used when a static
       | mechanism would suffice; mostly because no one agreed on an ABI
       | for the metadata encoding, or an unwind routine.
       | 
       | I believe the 1-2% measurement number. That's in the same
       | ballpark as pervasive array bounds checks. It's weird that the
       | odd debugging and profiling task gets special pleading for a 1%
       | cost but adding a layer of security gets the finger. Very
       | bizarre priorities.
        
       | codeflo wrote:
       | All of this information is static, there's no need to sacrifice a
       | whole CPU register only to store data that's already known. A
       | simple lookup data structure that maps an instruction address
       | range to the stack offset of the return address should be enough
       | to recover the stack layout. On Windows, you'd precompute that
       | from PDB files, I'm sure you can do the same thing with whatever
       | the equivalent debug data structure is on Linux.
        
         | fsmv wrote:
         | [deleted]
        
         | loeg wrote:
         | It isn't entirely static because of alloca().
        
       | mikewarot wrote:
       | I started programming in 1979, and I can't believe I've managed
       | to avoid learning about stack frames and all those EBP register
       | tricks until now. I always had parameters to functions in
       | registers, not on the stack, for the most part. The compiler hid
       | a lot of things from me.
       | 
       | Is it because I avoided Linux and C most of my life? Perhaps it's
       | because I used debug, and Periscope before that... and never gdb?
        
       | boulos wrote:
       | JIT'ed code is sadly poorly supported, but LLVM has had great
       | hooks for noting each method that is produced and its address. So
       | you can build a simple mixed-mode unwinder, pretty easily, but
       | mostly in process.
       | 
       | I think Intel's DNN things dump their info out to some common
       | file that perf can read instead, but because the *kernels*
       | themselves reuse rbp throughout oneDNN, it's totally useless.
       | 
       | Finally, can any JVM folks explain this claim about DWARF info
       | from the article:
       | 
       | > Doesn't exist for JIT'd runtimes like the Java JVM
       | 
       | that just sounds surprising to me. Is it off by default or
       | literally not available? (Google searches have mostly pointed to
       | people wanting to include the JNI/C side of a JVM stack, like
       | https://github.com/async-profiler/async-profiler/issues/215).
        
       | olliej wrote:
       | Have compilers (or I guess x86?) gotten better at dealing with
       | the frame pointer? Or are we just saying that taking a
       | significant perf hit is acceptable if it lets you find other tiny
       | perf problems? Because I recall -fomit-frame-pointer being a
       | significant performance win, bigger than most of the things that
       | you need a perfect profiler to spot.
        
       | vlovich123 wrote:
       | To this day I still believe that there should be a dedicated
       | protected separate stack region for the call stack that only the
       | CPU can write to/read from. Walking the stack then becomes
       | trivially fast because you just need to do a very small memcpy.
       | And stack memory overflows can never overwrite the return
       | address.
        
         | ndesaulniers wrote:
         | This is a thing; it's called shadow call stack. Both ARM and
         | now Intel have extensions for it.
        
           | vlovich123 wrote:
           | But the shadow stack concept seems much dumber to me. Why
           | write the address to the regular stack and the shadow stack
           | and then compare? Why not use only the shadow stack, and not
           | put return addresses on the main stack at all?
        
       | BinaryRage wrote:
       | I remember talking to Brendan about the PreserveFramePointer
       | patch during my first months at Netflix in 2015. As of JDK 21,
       | unfortunately it is no longer a general purpose solution for the
       | JVM, because it prevents a fast path being taken for stack
       | thawing for virtual threads:
       | https://github.com/openjdk/jdk/blob/d32ce65781c1d7815a69ceac...
        
       ___________________________________________________________________
       (page generated 2024-03-17 23:01 UTC)