[HN Gopher] Frame pointers vs. DWARF - my verdict
       ___________________________________________________________________
        
       Frame pointers vs. DWARF - my verdict
        
       Author : rwmj
       Score  : 88 points
       Date   : 2023-02-14 13:54 UTC (1 day ago)
        
 (HTM) web link (rwmj.wordpress.com)
 (TXT) w3m dump (rwmj.wordpress.com)
        
       | fooblaster wrote:
       | There is an option far better than either suggested. In my
       | experience, using the --call-graph=lbr option produces far more
       | reliable call stacks than relying on frame pointers. Sadly, it's
       | only available on Intel CPUs at the moment.
        
         | irogers wrote:
         | AMD will have support in Zen4 and Linux 6.1 (which is LTS):
         | 
         | https://lore.kernel.org/lkml/Yz%2FcpNTSacRMh1FK@gmail.com/
         | 
         | Further, precise events are fixed in Linux 6.2:
         | 
         | https://lore.kernel.org/lkml/Y5eQeR2tpZ%2FBos49@gmail.com/
        
       | sylware wrote:
       | dwarf = brain convolution
       | 
       | frame pointer = one fewer register in a register set already
       | under heavy strain (x86_64). I do envy RISC-V with all its
       | registers: even the load-store model of RISC-V won't use enough
       | extra registers to end up with the same amount of strain as on
       | x86_64.
       | 
       | If I ever needed DWARF, I would build the tables manually, only
       | for what I wish to debug... if that is possible (without
       | specialized assembler directives, because they significantly
       | increase the technical cost of the assembler). That said, does a
       | real, accurate and complete specification of DWARF exist?
       | Because when I look at the SysV ABI or ELF, what a mess.
        
         | masklinn wrote:
         | > frame pointer = 1 less register on an already set of
         | registers under heavy strain (x86_64).
         | 
         | x86-64 is _not_ under heavy register pressure. That idea is a
         | legacy from x86 (plain, 32-bit).
         | 
         | x86-64 has the same register count as ARMv7 and few bothered
         | disabling the frame pointer there, even though it's a load-
         | store architecture.
        
           | tialaramex wrote:
           | Right, x86-64 offers eight extra register names+ (r8 through
           | r15). If you choose to go from x86-without-frame-pointer to
           | x86-64-with-frame-pointer, you gain 7 register names, which
           | is huge.
           | 
           | This makes the case where that one extra register name makes
           | all the difference much rarer, arguably turning it from "I
           | demand a compiler flag" to "Let's just hand-write the machine
           | code for this one very special routine if our performance
           | data suggests it's worth it".
           | 
           | + Internally a modern CPU has far more actual registers, to
           | enable a feature called "register renaming". But we can only
           | talk about them using their canonical names, and x86-64 adds
           | eight more of those.
        
             | the_mitsuhiko wrote:
             | I think the talk about "there are not enough registers"
             | generally ignores pipelining and register renaming way too
             | much. The performance loss from reserving the frame pointer
             | register is not that problematic even on x86, and on x86_64
             | it's completely negligible unless you're in a tight,
             | switch-heavy interpreter loop.
        
               | zznzz wrote:
               | Register renaming doesn't significantly address the
               | impact of reducing the number of architectural registers
               | available to the compiler. With fewer register names
               | available, the compiler will spill locals to stack more
               | often, and register renaming doesn't help - memory
               | renaming is needed to really mitigate this.
               | 
               | But I agree the impact of preserving frame pointers is
               | generally quite small and doesn't often actually need
               | mitigation - on amd64 there's not much impact from losing
               | 1 more of 16 arch registers.
        
               | irogers wrote:
               | Agner speaks about memory renaming back on Zen 2:
               | 
               | https://www.agner.org/forum/viewtopic.php?t=41
               | 
               | Intel Alder Lake has performance events for tracking it:
               | 
               | https://github.com/intel/perfmon/blob/974c69919b2a9dfd827
               | 8cf...
               | 
               | But even before this you had store to load forwarding on
               | x86. I'm not saying you have, but before inventing a
               | performance problem it is worth spending time trying to
               | diagnose it with thorough profiling (e.g. [1]). The
               | Fedora frame pointer patch did a thorough performance
               | analysis and performance will be revisited again.
               | Unfortunately there are a lot of armchair performance
               | experts who haven't spent time looking into the details.
               | 
               | [1] https://perf.wiki.kernel.org/index.php/Top-Down_Analysis
        
             | masklinn wrote:
             | And that's before accounting for SP which the compiler will
             | usually avoid touching. So for an x86 target, omitting the
             | frame pointer means going from 6 to 7 usable registers, on
             | x86-64 it's 14 to 15, a lot less necessary outside of very
             | specific workloads.
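             | 
             | A minimal sketch of that arithmetic, assuming the register
             | totals quoted in this thread (8 names on x86, 16 on x86-64)
             | and that only SP and, optionally, the frame pointer are
             | reserved. The helper name is made up for illustration.

```python
def usable_registers(total: int, keep_frame_pointer: bool) -> int:
    """General-purpose register names left once SP (and optionally FP) are reserved."""
    reserved = 1 + (1 if keep_frame_pointer else 0)  # SP is always reserved
    return total - reserved


for arch, total in (("x86", 8), ("x86-64", 16)):
    with_fp = usable_registers(total, keep_frame_pointer=True)
    without_fp = usable_registers(total, keep_frame_pointer=False)
    gain = (without_fp - with_fp) / with_fp  # relative gain from omitting FP
    print(f"{arch}: {with_fp} -> {without_fp} usable registers (+{gain:.0%})")
```

             | The relative gain from omitting the frame pointer is much
             | smaller on x86-64, which is the point being made above.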
        
       | DaveFlater wrote:
       | Uh... [Begin flashback] 2012: A change made in GCC 4.7 allowed
       | optimization to reschedule and defer the push of the frame
       | pointer that previously occurred in the function prologue
       | whenever frame pointers were enabled. When binaries are profiled
       | using frame pointers, incorrect call chains are derived whenever
       | a sample is taken between the top of the function and the
       | instruction that pushes the frame pointer. I complained, but got
       | an immediate WONTFIX:
       | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55667
       | 
       | So did they fix that eventually or is everyone just oblivious?
       | 
       | How I found the problem in 2012: Configuration of profiling tools
       | for C/C++ applications under 64-bit Linux
       | https://doi.org/10.6028/NIST.TN.1790
        
         | zznzz wrote:
         | I cannot comment on whether "everyone" is oblivious but yes,
         | this is still the case - frame pointer based unwinding
         | sometimes skips the caller when the IP is sampled before the
         | callee sets up a frame.
         | 
         | This is also common for samples in leaf functions.
         | 
         | Compiler and toolchain folks tend to think (quite justifiably,
         | IMO) that this and similar stuff is fine because DWARF allows
         | reconstructing everything perfectly. The problem is just that
         | the user experience of DWARF-based unwinding is poor, because
         | the only implemented method in Linux is sampling the contents
         | of the stack and doing the unwind in post-processing.
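         | 
         | To make the skipped-caller case concrete, here is a toy model
         | (not real unwinder code). It assumes the usual x86-64 layout
         | where a frame record holds the saved caller FP with the return
         | address 8 bytes above it; the addresses and the use of function
         | names as "return addresses" are made up for readability.

```python
def walk_frame_pointers(pc, fp, memory):
    """Return the call chain, innermost first, by following saved frame pointers."""
    chain = [pc]
    while fp != 0:
        chain.append(memory[fp + 8])  # return address sits above the saved FP
        fp = memory[fp]               # move to the caller's frame record
    return chain


# Call chain: _start -> main -> a -> b
memory = {
    0x1000: 0,      0x1008: "_start",  # main's frame record (end of chain)
    0x0F00: 0x1000, 0x0F08: "main",    # a's frame record
    0x0E00: 0x0F00, 0x0E08: "a",       # b's frame record, once b has pushed it
}

# Sample taken after b's prologue: FP points at b's own frame record.
after_prologue = walk_frame_pointers("b", 0x0E00, memory)

# Sample taken before b pushes its frame pointer: FP still points at a's
# frame record, so the walk starts from a's record and never reports a.
before_prologue = walk_frame_pointers("b", 0x0F00, memory)

print(after_prologue)
print(before_prologue)
```

         | The second walk goes straight from b to main, which is the
         | missing-caller effect described above.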
        
           | DaveFlater wrote:
           | AFAIK, the optimization undermined the only use case that
           | -fno-omit-frame-pointer actually had on x86_64. Is there a
           | real use case that benefitted from allowing the frame pointer
           | push to wander? Why why why
        
             | sitkack wrote:
             | That is too bad, it does mention `-fno-schedule-insns2` in
             | the comments. I'll have to see how well that works in
             | practice.
             | 
             | Having high-quality, low-cost stack traces is important for
             | continuous profiling. ref Knuth.
        
       | nneonneo wrote:
       | The article points out that the kernel uses ORC instead of DWARF
       | for unwinding. I wonder if that could ever become an option in
       | userspace? I imagine that if all you're interested in is stack
       | traces, instead of debugging (which DWARF is designed for), ORC
       | would be a very nice performance win. And it's not as if they're
       | mutually exclusive, either: there's no reason why a binary or
       | debuginfo couldn't just ship both.
       | 
       | In any case, the runtime performance hit of frame pointers is
       | quite high, and since we do have things like DWARF (and maybe ORC
       | someday?), I'd still argue that frame pointers aren't necessary.
       | It's nice to have that extra register free!
        
         | amluto wrote:
         | Native ORC generation from GCC or LLVM or userspace tooling
         | like objtool would be nifty.
         | 
         | FWIW, ORC was specifically designed to be efficient, but it was
         | not designed to be future-proof against complex toolchain
         | changes. Since all the ORC tooling is in the kernel, it can
         | evolve together if needed.
         | 
         | (I was a bit involved in the design -- I helped optimize the
         | format to reduce cache misses on lookup.)
        
         | the_mitsuhiko wrote:
         | > In any case, the runtime performance hit of frame pointers is
         | quite high
         | 
         | That's not true at all. It's only true in _some very rare
         | cases_. Firefox, I think, ships with frame pointers enabled at
         | this point, and so does every M1/M2 Mac app and all iOS
         | applications, as it's mandatory for the calling convention on
         | Apple platforms.
        
           | remexre wrote:
           | On AArch64 I think it's cheaper because losing one register
           | doesn't hurt as much, since you've got twice as many.
        
             | yvdriess wrote:
               | nitpick: it's the number of physical registers that matters
               | more here, not ISA registers
        
               | remexre wrote:
               | Why wouldn't the lower ISA register count lead to more
               | spills, regardless of the physical count? Is store
               | forwarding really that effective?
        
               | gpderetta wrote:
               | It isn't.
        
           | nneonneo wrote:
           | The ORC documentation [1] cites a 5-10% perf hit [2] on x86.
           | Now, granted, we're talking about an architecture that has
           | only 8 logical registers to begin with, so losing one is
           | going to hurt a lot more than x86-64 (16 logical registers)
           | or AARCH64 (32 logical registers). But, 5-10% is definitely a
           | noticeable perf hit.
           | 
           | Phoronix [3] did a test of "-fno-omit-frame-pointer" during
           | the discussion on whether to enable the flag for Fedora, this
           | time on x86-64. They found an average 14% performance hit on
           | a wide variety of benchmarks.
           | 
           | [1] https://www.kernel.org/doc/html/latest/x86/orc-unwinder.html
           | [2] https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@...
           | [3] https://www.phoronix.com/review/fedora-frame-pointer/5
        
             | zznzz wrote:
             | See also
             | https://lists.fedoraproject.org/archives/list/devel@lists.fe...
             | - the Botan Phoronix results with frame pointers were
             | probably measuring debug builds.
        
         | tdullien wrote:
         | For prodfiler.com we convert .eh_frame (DWARF) unwinding format
         | to something that is more similar to ORC and optimized for
         | lookup in eBPF data structures.
        
       | ot wrote:
       | > Frame pointers have some corner cases which they don't handle
       | well (certain leaf and most inlined functions aren't collected),
       | but these don't matter a great deal in reality.
       | 
       | > DWARF unwinding can show inlined functions as if they are
       | separate stack frames. (Opinions differ as to whether or not this
       | is an advantage.)
       | 
       | This conflates unwinding and symbolization. Unwinding collects
       | the list of frames, which by definition cannot have inlined
       | functions (functions that don't have their own frame); the
       | unwinding mechanism does not matter here.
       | 
       | DWARF can be used to resolve the "stack" of inlined functions for
       | an instruction address, even if that was collected via frame
       | pointers. That can be done in post-processing, possibly on a
       | different machine, so the cost does not affect the workload being
       | profiled.
       | 
       | For example, using addr2line from the LLVM distribution:
       | 
       |   $ llvm-addr2line -pfi --demangle -e /path/to/unstripped/binary
       |   # Enter instruction address in the form 0x...
       |   # Outputs inlined stack
       | 
       | To summarize, unwinding via frame pointers does not miss any
       | information that would be collected with DWARF unwinding.
       | Everything can be recovered later at symbolization time.
       | 
       | And on whether reporting inlined functions is important or not,
       | at least for C++, which heavily relies on inlining to mitigate
       | abstraction penalty, I'd say it is crucial.
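       | 
       | As a rough sketch of that split between unwinding and
       | symbolization: the unwinder yields one address per physical
       | frame, and a post-processing step expands each address into its
       | inlined call chain. The address ranges and function names in the
       | table below are entirely hypothetical; a real tool would read
       | them from the DWARF debug info.

```python
import bisect

# Hypothetical inline table: (start, end, frames), innermost function first.
INLINE_TABLE = [
    (0x1000, 0x1020, ["main"]),
    (0x1020, 0x1040, ["std::vector::push_back", "fill", "main"]),
    (0x1040, 0x1080, ["fill", "main"]),
    (0x2000, 0x2040, ["_start"]),
]
STARTS = [start for start, _, _ in INLINE_TABLE]


def symbolize(addr):
    """Map one instruction address to its (possibly inlined) function chain."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        return ["??"]
    _, end, frames = INLINE_TABLE[i]
    return frames if addr < end else ["??"]


def expand(raw_frames):
    """Expand addresses collected by any unwinder into a full logical stack."""
    logical = []
    for addr in raw_frames:
        logical.extend(symbolize(addr))
    return logical


# Two physical frames, as a frame-pointer walk might have collected them.
print(expand([0x1030, 0x2010]))
```

       | The raw sample has two frames, but the expanded stack has four
       | entries, regardless of how the two addresses were collected.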
        
         | the_mitsuhiko wrote:
         | > To summarize, unwinding via frame pointers does not miss any
         | information that would be collected with DWARF unwinding.
         | Everything can be recovered later at symbolization time.
         | 
         | The real issue with DWARF-based unwinding/symbolication is that
         | the user experience around it is really complex. We at
         | Sentry support stack walking from minidumps, yet we often
         | cannot unwind on Linux platforms on the server because
         | executables and object files are typically not available.
         | 
         | Despite us supporting debuginfod (we're hitting the canonical
         | service today only) we are unable to collect executables/object
         | files from there, making it completely impossible to produce
         | proper stack traces if frame pointers are omitted.
         | 
         | In a world where DWARF unwinding is a thing that people want to
         | use, there has to be an ecosystem for sharing binaries too.
         | This issue is particularly bad on Android, where there are
         | millions of devices out there with tons of different
         | proprietary system libraries linked in, destroying stacks.
        
           | conconpon wrote:
           | Couldn't this be solved by uploading the binaries along with
           | the crash dumps, if you don't already have a copy of it, as
           | determined by checking hashes or something?
        
             | the_mitsuhiko wrote:
             | Exactly, except that the debuginfod servers from Canonical
             | and some others lack the executables.
        
             | account42 wrote:
             | The debug info is usually stripped out of the binaries for
             | Linux distros and has to be installed separately.
             | debuginfod is supposed to make it possible to download the
             | debug info in a distro-agnostic way based on the build ID
             | embedded in the binaries.
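             | 
             | A small sketch of how such a lookup is addressed, assuming
             | the debuginfod web API paths /buildid/<id>/debuginfo and
             | /buildid/<id>/executable. The server URL and build ID
             | below are placeholders, and the helper name is made up.

```python
def debuginfod_url(server: str, build_id: str, artifact: str = "debuginfo") -> str:
    """Build the request URL for a debuginfod artifact identified by build ID."""
    if artifact not in ("debuginfo", "executable"):
        raise ValueError(f"unsupported artifact type: {artifact}")
    build_id = build_id.lower()
    if not build_id or any(c not in "0123456789abcdef" for c in build_id):
        raise ValueError("build ID must be a non-empty hex string")
    return f"{server.rstrip('/')}/buildid/{build_id}/{artifact}"


# Placeholder server and build ID, for illustration only.
server = "https://debuginfod.example.org/"
build_id = "0123456789ABCDEF0123456789abcdef01234567"
print(debuginfod_url(server, build_id))
print(debuginfod_url(server, build_id, "executable"))
```

             | A server that only answers the first kind of request is
             | the situation described above: debug info is available,
             | but the executable needed for unwinding is not.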
        
       | pjc50 wrote:
       | Relevant discussion on profiling without frame pointers:
       | https://twitter.com/halvarflake/status/1577644229853151233 /
       | https://prodfiler.com/blog/introducing-prodfiler/ via the
       | "prodfiler" tool.
        
         | irogers wrote:
         | Twitter: "No need to recompile with frame pointers."
         | 
         | Web page: "always-on profiling powered by eBPF technology."
         | 
         | BPF stack traces are gathered using frame pointers:
         | 
         | https://github.com/torvalds/linux/blob/master/kernel/bpf/sta...
         | 
         | https://github.com/torvalds/linux/blob/master/arch/x86/event...
        
           | pjc50 wrote:
           | Someone from prodfiler appears to be explaining in this
           | thread https://news.ycombinator.com/item?id=34806693
        
             | irogers wrote:
             | For every running application turn DWARF data into BPF
             | maps. Does this scale?
        
               | javierhonduco wrote:
               | As surprising as it seems, it does! My colleague and I
               | spoke about this in our talk at FOSDEM:
               | https://fosdem.org/2023/schedule/event/walking_stack_without....
               | Let us know if you have any questions or feedback :)
        
       | ishitatsuyuki wrote:
       | IMHO, perf's decision to write whole stacks directly to the disk
       | and unwinding them as a post-process is a really bad design. It
       | wastes disk space, and as the author pointed out, it also has a
       | lot of IO overhead.
       | 
       | As an alternative approach, https://github.com/mstange/samply
       | processes data streamed from perf and unwinds it in realtime. The
       | unwinding overhead is surprisingly low: it only takes around 1%
       | of (single) CPU per CPU profiled. Solving the disk waste alone
       | has been a tremendous improvement of profiling experience. As a
       | bonus, the unwinding and symbolization works reliably while I
       | frequently had postprocessing not terminating when using the perf
       | CLI directly.
        
         | irogers wrote:
         | This could be a great Linux perf GSoC project. Projects and
         | mentors are being looked for:
         | https://wiki.linuxfoundation.org/gsoc/2023-gsoc-perf
        
         | sitkack wrote:
         | Are you saying that DWARF information should be unwound in
         | realtime or that it should use framepointers and debug
         | information to trivially sample the stacks and record the
         | symbols?
         | 
         | If you have framepointers and debug information, it is both
         | high resolution and fast. DWARF is a fallback for not having
         | framepointers.
         | 
         | If you are saying the DWARF information should be processed at
         | the point of use and not copied and processed later, then I
         | concur. But we should also encourage folks to compile WITH
         | `-fno-omit-frame-pointer` and `-g`
        
       | js2 wrote:
       | Not having frame pointers is a complete PITA on Android.
       | 
       | I built my company's in-house mobile crash reporter ~ 2013, based
       | on experience building a similar system at an earlier startup. At
       | the startup I used Google breakpad on both Android and iOS, doing
       | all the unwinding and symbolication on the backend. iOS at least
       | made this easy because Apple makes dSYMs readily available. On
       | Android, you simply can't get system symbols. So you essentially
       | can't unwind OS symbols on Android.
       | 
       | At current company I used PLCrashReporter on iOS. Unwinding
       | occurs on device. Symbolication on the backend.
       | 
       | I tried everything with Android for native code crashes, starting
       | with breakpad minidumps, then using every available unwinder
       | option on Android: corkscrew, the Android fork of libunwind, the
       | official libunwind, whatever custom unwinder Android eventually
       | wrote. None of them work reliably for native code crashes. And
       | good luck tracing from native code back into the ART frames.
       | 
       | In the end, what ended up being most reliable was including the
       | last few thousand lines of logcat (grabbed upon the app
       | restarting after a crash since you can't reliably grab it inside
       | the crash handler). Android's OOB crash handler for some crashes
       | (with recent Android versions) dumps full stacks of every thread
       | including native code and ART frames to logcat. So the crash SDK
       | looks for that in the logcat output and includes it in the
       | report. That at least provides a stack. Symbolicating anything
       | but application frames is still impossible though.
       | 
       | And this isn't even going into esoteric things Android has done
       | over the years just for Chrome like relocation packing:
       | 
       | https://android.googlesource.com/platform/bionic/+/f5e0ba94d...
       | 
       | To this day, I don't understand why both Apple and Google make it
       | so difficult for an application to get access to the stack traces
       | of its own crashes. And no, the reporting built into Google Play
       | and iTunes Connect (and Xcode) is not sufficient for large-usage
       | apps or companies like mine that have lots of apps with shared
       | SDKs and need to correlate crashes across apps.
        
       | titzer wrote:
       | > But collecting the whole stack would consume far too much
       | storage, so by default it only collects the first 8K. Many
       | userspace stacks will be larger than this, in which case the data
       | collection will simply be incomplete - it will never be possible
       | to recover the full stack trace.
       | 
       | Yeah, doing an 8KB memcopy for every profile sample sounds like a
       | lot of overhead. Is DWARF unwinding so slow that that is actually
       | faster?
       | 
       | Virgil doesn't use a frame pointer, and I sort of regret it. It
       | uses custom unwinding information that is used during GC or
       | throwing an exception (i.e. controlled crash). I spent
       | considerable time optimizing both the space and time of that
       | lookup, to the point where it's only 32 bits of metadata per call
       | site and a few dozen instructions to walk each frame. But that
       | was a major, major pain to debug, and I found a bug in it as late
       | as last year.
       | 
       | A frame pointer is also required for stack allocation of objects
       | that aren't fully scalar-replaceable. Virgil doesn't do that yet,
       | but aspires to someday.
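       | 
       | For a feel of the 8K-per-sample cost quoted above, here is a
       | back-of-the-envelope calculation. The sampling rate, CPU count,
       | duration and the 32-frame stack depth are assumptions picked for
       | illustration; only the 8K copy size comes from the article.

```python
def stack_copy_bytes(sample_hz, cpus, seconds, bytes_per_sample):
    """Upper bound on stack payload recorded if every CPU is sampled every tick."""
    return sample_hz * cpus * seconds * bytes_per_sample


# Assumed workload: 1000 samples/s on 8 busy CPUs for 60 seconds.
dwarf_bytes = stack_copy_bytes(1000, 8, 60, 8192)        # 8K stack copy per sample
frame_ptr_bytes = stack_copy_bytes(1000, 8, 60, 32 * 8)  # 32 return addresses of 8 bytes

print(f"stack copies:   {dwarf_bytes / 2**30:.2f} GiB")
print(f"frame pointers: {frame_ptr_bytes / 2**20:.2f} MiB")
print(f"ratio:          {dwarf_bytes // frame_ptr_bytes}x")
```

       | Under these assumptions the stack copies come to a few GiB per
       | minute, against roughly a hundred MiB for address-only samples.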
        
         | zznzz wrote:
         | > Is DWARF unwinding so slow that that is actually faster?
         | 
         | No. The only reason it works like this is that the upstream
         | Linux kernel has thus far rejected in-kernel DWARF unwinders,
         | whereas copying the stack is simpler and already implemented.
        
           | irogers wrote:
           | DWARF bytecode is a full VM. Do compiler writers test their
           | DWARF output? (In my experience they don't, especially for
           | architectures outside the big 2 or 3.) How does the kernel
           | access the ELF file pages holding the DWARF information when
           | in an NMI handler? You could mlock all your debug information
           | when a program loads, but the memory overhead wouldn't be
           | nice. It is hard enough getting a build ID.
           | 
           | The elephant in the room btw is LBR call stacks, but they
           | aren't exposed in the kernel/BPF yet. Userland perf has them
           | and they recently became available on AMD.
        
             | zznzz wrote:
             | It is not required to unwind the user space stack in the
             | NMI handler. It can be done later before returning to user
             | space in a context that can handle faults.
        
               | irogers wrote:
               | Allowing processes to sniff each other's stacks has some
               | fairly obvious security issues.
        
               | zznzz wrote:
               | I don't understand your concern - what about this would
               | involve one process sniffing another process's memory?
               | The kernel would still be doing the unwinding, just not
               | in the NMI handler.
        
               | irogers wrote:
               | Wouldn't all your kernel stacks then end up in whatever
               | this handler is? Why not implement your approach and mail
               | it to LKML :-)
        
               | zznzz wrote:
               | Yes, this only works for user space stacks, but that is
               | sufficient since with ORC kernel stacks are solved (IMO)
               | and it avoids all the issues with trying to mlock
               | debuginfo of all processes that you mentioned. The NMI
               | handler would still unwind the kernel stack.
               | 
               | > Why not implement your approach and mail it to LKML :-)
               | 
               | Because this would still be an in-kernel DWARF unwinder
               | and I would expect an instant reject, and because I am
               | lazy and/or don't care enough about this problem or Linux
               | to work on it. Even if people could be persuaded, I don't
               | have the interest or temperance to debate this with LKML.
        
           | titzer wrote:
           | Why is profiling done in the kernel for userspace stacks?
        
             | zznzz wrote:
             | Because this is about PMU-based sampling, which involves
             | triggering interrupts at some interval and doing the
             | sampling while handling the interrupt.
        
       | tdullien wrote:
       | What everybody in this discussion seems to miss is that _you
       | don't need to unwind the DWARF data structures during profiling
       | time_; you are free to convert DWARF to a fast-lookup data
       | structure on the machine.
       | 
       | DWARF needs to support every CPU under the sun. Every unwinder on
       | the other hand is CPU-specific. For prodfiler.com's continuous
       | in-production unwinding, we convert DWARF into something compact
       | and fast-to-lookup that is then placed in eBPF maps.
       | 
       | It all works like a charm. We can have our cake (e.g. use RBP as
       | GPR) and eat it too (e.g. use .eh_frame, converted at runtime
       | into a fast-to-lookup format) to do reliable whole-system
       | unwinding in production.
        
         | irogers wrote:
         | prodfiler clearly has a market. It would be interesting to see
         | the approach as something standard in the kernel tree, perhaps
         | it can be added to perf's synthesis, etc. There is already BPF
         | based profiling within perf to avoid file descriptor overheads.
         | If engineering resources are the issue then this could be a
         | good GSoC project:
         | https://wiki.linuxfoundation.org/gsoc/2023-gsoc-perf
        
           | javierhonduco wrote:
           | This would be ideal. There's some great work by folks at
           | Oracle in this space: SFrame
           | (https://www.phoronix.com/news/GNU-Binutils-SFrame) nee
           | ctf_frame that I hope will be integrated in the kernel.
           | 
           | As this will take a few years, in the meantime I've developed a
           | DWARF-based unwinder in BPF [0]. Some perf maintainers showed
           | interest in this, so thanks for bringing up the GSoC project
           | idea; it hadn't occurred to me!
           | 
           | [0]: https://news.ycombinator.com/item?id=33788794
        
       | javierhonduco wrote:
       | The pervasive lack of frame pointers is the reason why we've
       | developed a custom format derived from DWARF unwind information,
       | thanks to some insights: DWARF unwind information is incredibly
       | flexible; it supports many architectures and allows restoring
       | any arbitrary register. But we only need 3: the frame pointer,
       | the stack pointer, and, on non-x86, the return address.
       | 
       | DWARF unwind info doesn't use that many bytes, but
       | unfortunately reading and parsing that information is quite
       | expensive.
       | 
       | For that reason I've developed a new unwinder that uses custom
       | unwind information derived from DWARF
       | (https://www.polarsignals.com/blog/posts/2022/11/29/profiling...,
       | previously discussed in
       | https://news.ycombinator.com/item?id=33788794) that runs in BPF.
       | This new compact representation can be binary searched easily and
       | each unwind row has a size of 16 bytes. I am currently working on
       | reducing it down to ~10 bytes.
       | 
       | All the code is fully OSS (Apache 2.0 for userspace and GPL for
       | BPF), and part of the Parca project
       | (https://github.com/parca-dev/parca-agent). We also gave a talk
       | at FOSDEM this year going deeper into how we made it scale for
       | many big processes.
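       | 
       | To illustrate the idea (this is not the actual Parca format),
       | here is a toy table with 16-byte rows that is binary searched by
       | program counter and then used for one x86-64 unwind step. The
       | field layout, constant names and sample addresses are all
       | assumptions made up for this sketch.

```python
import struct

# Hypothetical 16-byte row: pc (u64), CFA base register (u8), RBP rule (u8),
# CFA offset (i16), RBP offset from CFA (i16), 2 bytes of padding.
ROW = struct.Struct("<QBBhh2x")
CFA_FROM_RSP, CFA_FROM_RBP = 0, 1
RBP_UNCHANGED, RBP_SAVED_AT_CFA_OFFSET = 0, 1


def pack_table(rows):
    """Serialize rows sorted by pc so the table can be binary searched."""
    return b"".join(ROW.pack(*row) for row in sorted(rows))


def find_row(table, pc):
    """Binary search for the last row whose pc is <= the sampled pc."""
    lo, hi = 0, len(table) // ROW.size
    while lo < hi:
        mid = (lo + hi) // 2
        (row_pc,) = struct.unpack_from("<Q", table, mid * ROW.size)
        if row_pc <= pc:
            lo = mid + 1
        else:
            hi = mid
    return ROW.unpack_from(table, (lo - 1) * ROW.size) if lo else None


def unwind_step(table, memory, pc, rsp, rbp):
    """Recover the caller's (pc, rsp, rbp) on x86-64, or None if pc is unknown."""
    row = find_row(table, pc)
    if row is None:
        return None
    _, cfa_base, rbp_rule, cfa_offset, rbp_offset = row
    cfa = (rsp if cfa_base == CFA_FROM_RSP else rbp) + cfa_offset
    return_address = memory[cfa - 8]  # pushed by the call instruction
    if rbp_rule == RBP_SAVED_AT_CFA_OFFSET:
        rbp = memory[cfa + rbp_offset]
    return return_address, cfa, rbp


# A function compiled without a frame pointer: `sub rsp, 0x18` at 0x400000.
table = pack_table([
    (0x400004, CFA_FROM_RSP, RBP_UNCHANGED, 0x20, 0),  # after the sub
    (0x400000, CFA_FROM_RSP, RBP_UNCHANGED, 0x08, 0),  # at function entry
])
memory = {0x7018: 0x401234}  # return address slot, just below the CFA
caller = unwind_step(table, memory, pc=0x400010, rsp=0x7000, rbp=0)
print([hex(value) for value in caller])
```

       | A real table would also need end-of-function markers so that an
       | address past the last row is not matched by mistake.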
        
       | dezgeg wrote:
       | Nice analysis.
       | 
       | It would be interesting to see comparison to Intel LBR.
       | 
       | Also would be nice to know how profiling unwinding is done on
       | Windows (maybe someone knows how to summon Bruce Dawson).
        
         | rwmj wrote:
         | AIUI the problems with LBR are two-fold. It only works on
         | newish CPUs, and it only handles a limited number of stack
         | frames (I heard 8, but maybe more on recent CPUs).
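         | 
         | A toy model of the depth limit, assuming a call-stack mode in
         | which calls push an entry, returns pop one, and the oldest
         | entry is dropped when the buffer is full. The depth of 8 is
         | just the figure mentioned above, and the class is a
         | simplification, not a description of the real hardware.

```python
from collections import deque


class LbrCallStack:
    """Fixed-size record of call branches; the oldest entries fall off when full."""

    def __init__(self, depth):
        self.entries = deque(maxlen=depth)

    def call(self, caller, callee):
        self.entries.append((caller, callee))  # drops the oldest entry if full

    def ret(self):
        if self.entries:
            self.entries.pop()

    def stack(self):
        """Visible callees, innermost first."""
        return [callee for _, callee in reversed(self.entries)]


lbr = LbrCallStack(depth=8)
for i in range(12):  # f0 calls f1, f1 calls f2, ... f11 calls f12
    lbr.call(f"f{i}", f"f{i + 1}")

visible = lbr.stack()
print(visible)
```

         | With twelve nested calls and eight entries, only the
         | innermost frames survive and the outer callers are lost.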
        
           | amluto wrote:
           | CET's shadow stacks, on the other hand, will solve this
           | entire problem, both exactly and extremely efficiently.
        
         | zznzz wrote:
         | IIUC Windows implements in-kernel unwinding using FPO debug
         | data embedded in every executable.
        
       | bregma wrote:
       | Big advantage for DWARF2 over frame pointers is that it actually
       | works for unwinding on aarch64.
        
         | zznzz wrote:
         | Can you give more detail?
         | 
         | AFAIK frame pointers work fine for unwinding on aarch64. And on
         | aarch64 the gcc default is not to omit frame pointers and IIRC
         | when the default was switched at some point it was treated as a
         | bug and reverted (not sure if required in the ABI or just
         | strongly preferred by the community). So IME generally
         | unwinding with frame pointers on aarch64 works more often than
         | on amd64 since you don't have to recompile the world.
        
         | jmrjmr wrote:
         | Why don't frame pointers work on aarch64?
        
         | the_mitsuhiko wrote:
         | Apple's version of AArch64 requires that the frame pointer
         | register point to a frame record. DWARF information is not
         | necessary to unwind there. It's lovely and I applaud Apple
         | for that decision.
        
       ___________________________________________________________________
       (page generated 2023-02-15 23:02 UTC)