[HN Gopher] Frame pointers vs. DWARF - my verdict
___________________________________________________________________
Frame pointers vs. DWARF - my verdict
Author : rwmj
Score : 88 points
Date : 2023-02-14 13:54 UTC (1 days ago)
(HTM) web link (rwmj.wordpress.com)
(TXT) w3m dump (rwmj.wordpress.com)
| fooblaster wrote:
| There is an option far better than either suggested. In my
| experience using the --call-graph=lbr option produces far more
| reliable callstacks than relying on frame pointers. Sadly it's
| only available on Intel cpus at the moment.
| irogers wrote:
| AMD will have support in Zen4 and Linux 6.1 (which is LTS):
|
| https://lore.kernel.org/lkml/Yz%2FcpNTSacRMh1FK@gmail.com/
|
| Further, precise events are fixed in Linux 6.2:
|
| https://lore.kernel.org/lkml/Y5eQeR2tpZ%2FBos49@gmail.com/
| sylware wrote:
| dwarf = brain convolution
|
| frame pointer = 1 less register in a register set that is
| already under heavy strain (x86_64). I do envy risc-v with all
| its registers: even the load-store model of risc-v won't use
| enough extra registers to end up with the same amount of strain
| as on x86_64.
|
| If I ever needed dwarf, I would build the tables manually, only
| for what I wish to debug... if that is possible (without
| specialized assembler directives, because they significantly
| increase the technical cost of the assembler). That said, does a
| real, accurate and complete specification of dwarf exist?
| Because when I look at the sysv ABI or ELF, what a mess.
| masklinn wrote:
| > frame pointer = 1 less register on an already set of
| registers under heavy strain (x86_64).
|
| X86-64 is _not_ under heavy register pressure strain. That idea
| is a legacy from x86 (plain, 32b).
|
| x86-64 has the same register count as ARMv7 and few bothered
| disabling the frame pointer there, even though it's a load-
| store architecture.
| tialaramex wrote:
| Right, x86-64 offers eight extra register names+ (r8 through
| r15). If you choose to go from x86-without-frame-pointer to
| x86-64-with-frame-pointer you gained 7 register names which
| is huge.
|
| This makes the case where that one extra register name makes
| all the difference much rarer, arguably turning it from "I
| demand a compiler flag" to "Let's just hand-write the machine
| code for this one very special routine if our performance
| data suggests it's worth it".
|
| + Internally a modern CPU has far more actual registers, to
| enable a feature called "register renaming". But we can only
| talk about them using their canonical names, and x86-64 adds
| eight more of those.
| the_mitsuhiko wrote:
| I think generally the talk about "there are not enough
| registers" ignores pipelining and register renaming way too
| much. The loss of performance of the frame pointer register
| even on x86 is not that problematic, and on x86_64 it's
| completely negligible unless you're in a tight switch heavy
| interpreter loop.
| zznzz wrote:
| Register renaming doesn't significantly address the
| impact of reducing the number of architectural registers
| available to the compiler. With fewer register names
| available, the compiler will spill locals to stack more
| often, and register renaming doesn't help - memory
| renaming is needed to really mitigate this.
|
| But I agree the impact of preserving frame pointers is
| generally quite small and doesn't often actually need
| mitigation - on amd64 there's not much impact from losing
| 1 more of 16 arch registers.
| irogers wrote:
| Agner speaks about memory renaming back on Zen 2:
|
| https://www.agner.org/forum/viewtopic.php?t=41
|
| Intel Alderlake has performance events for tracking it:
|
| https://github.com/intel/perfmon/blob/974c69919b2a9dfd8278cf...
|
| But even before this you had store to load forwarding on
| x86. I'm not saying you have, but before inventing a
| performance problem it is worth spending time trying to
| diagnose it with thorough profiling (e.g. [1]). The
| Fedora frame pointer patch did a thorough performance
| analysis and performance will be revisited again.
| Unfortunately there are a lot of armchair performance
| experts who haven't spent time looking into the details.
|
| [1] https://perf.wiki.kernel.org/index.php/Top-Down_Analysis
| masklinn wrote:
| And that's before accounting for SP which the compiler will
| usually avoid touching. So for an x86 target, omitting the
| frame pointer means going from 6 to 7 usable registers, on
| x86-64 it's 14 to 15, a lot less necessary outside of very
| specific work loads.
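|
| (Illustrative sketch, not part of the original comment: the
| register arithmetic above, worked out. It assumes only what the
| comment states, namely that the stack pointer is always reserved
| and the frame pointer is reserved when it is kept. The function
| names are made up for this example.)

```python
def usable_registers(architectural: int, keep_frame_pointer: bool) -> int:
    """General-purpose registers left for the compiler to allocate."""
    reserved = 1  # the stack pointer is always reserved
    if keep_frame_pointer:
        reserved += 1  # the frame pointer is reserved as well
    return architectural - reserved


def relative_gain(architectural: int) -> float:
    """Fractional increase in usable registers from omitting the frame pointer."""
    with_fp = usable_registers(architectural, keep_frame_pointer=True)
    without_fp = usable_registers(architectural, keep_frame_pointer=False)
    return (without_fp - with_fp) / with_fp


for name, count in (("x86", 8), ("x86-64", 16)):
    print(
        name,
        usable_registers(count, True),
        usable_registers(count, False),
        f"{relative_gain(count):.1%}",
    )
```

| Under these assumptions, omitting the frame pointer adds about
| one sixth more usable registers on x86 but only about one
| fourteenth more on x86-64.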
| DaveFlater wrote:
| Uh... [Begin flashback] 2012: A change made in GCC 4.7 allowed
| optimization to reschedule and defer the push of the frame
| pointer that previously occurred in the function prologue
| whenever frame pointers were enabled. When binaries are profiled
| using frame pointers, incorrect call chains are derived whenever
| a sample is taken between the top of the function and the
| instruction that pushes the frame pointer. I complained, but got
| an immediate WONTFIX:
| https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55667
|
| So did they fix that eventually or is everyone just oblivious?
|
| How I found the problem in 2012: Configuration of profiling tools
| for C/C++ applications under 64-bit Linux
| https://doi.org/10.6028/NIST.TN.1790
| zznzz wrote:
| I cannot comment on whether "everyone" is oblivious but yes,
| this is still the case - frame pointer based unwinding
| sometimes skips the caller when the IP is sampled before the
| callee sets up a frame.
|
| This is also common for samples in leaf functions.
|
| compiler & tool chain folks tend to think (quite justifiably
| imo) that this and similar stuff is fine because dwarf allows
| reconstructing everything perfectly. The problem is just that
| the user experience of dwarf-based unwinding is poor, because
| the only implemented method in Linux is sampling the contents
| of the stack and doing the unwind in post processing.
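|
| (Illustrative sketch, not part of the original comment: a toy
| simulation of why a frame-pointer unwinder skips the caller
| when the sample lands before the callee has set up its frame.
| All addresses, the symbol table and the stack contents are
| hypothetical; the layout assumed is the usual x86-64 one, with
| the saved frame pointer at [fp] and the return address at
| [fp + 8].)

```python
import bisect

# Hypothetical function start addresses, sorted, used to name each PC.
SYMBOLS = [(0x0500, "_start"), (0x1000, "main"), (0x2000, "f"), (0x3000, "g")]
STARTS = [start for start, _ in SYMBOLS]


def symbolize(pc):
    """Map a PC to the function whose start address precedes it."""
    return SYMBOLS[bisect.bisect_right(STARTS, pc) - 1][1]


def unwind(memory, ip, fp):
    """Walk the frame-pointer chain: [fp] = caller's fp, [fp + 8] = return address."""
    pcs = [ip]
    while fp != 0:
        pcs.append(memory[fp + 8])
        fp = memory[fp]
    return [symbolize(pc) for pc in pcs]


# Simulated stack while g (called from f, called from main) is running.
memory = {
    0x7008: 0x0510,  # return address into _start
    0x7000: 0x0000,  # main's frame record: end of the chain
    0x6FE8: 0x1010,  # return address into main
    0x6FE0: 0x7000,  # f's frame record: saved fp of main
    0x6FC8: 0x2010,  # return address into f, pushed by `call g`
    0x6FC0: 0x6FE0,  # g's frame record, only valid after g's prologue
}

# Sample at g's first instruction: fp still points at f's frame record.
before_prologue = unwind(memory, ip=0x3000, fp=0x6FE0)
# Sample after `push rbp; mov rbp, rsp` has run in g.
after_prologue = unwind(memory, ip=0x3004, fp=0x6FC0)

print(before_prologue)  # f is missing from this chain
print(after_prologue)   # f is present once g has its own frame record
```

| In the first sample the return address into f sits on the stack
| but is not yet linked into the chain, so the walk jumps straight
| from g to main.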
| DaveFlater wrote:
| AFAIK, the optimization undermined the only use case that
| -fno-omit-frame-pointer actually had on x86_64. Is there a
| real use case that benefitted from allowing the frame pointer
| push to wander? Why why why
| sitkack wrote:
| That is too bad, it does mention `-fno-schedule-insns2` in
| the comments. I'll have to see how well that works in
| practice.
|
| Having high quality low cost stack traces is important for
| continuous profiling. ref Knuth.
| nneonneo wrote:
| The article points out that the kernel uses ORC instead of DWARF
| for unwinding. I wonder if that could ever become an option in
| userspace? I imagine that if all you're interested in is stack
| traces, instead of debugging (which DWARF is designed for), ORC
| would be a very nice performance win. And it's not as if they're
| mutually exclusive, either: there's no reason why a binary or
| debuginfo couldn't just ship both.
|
| In any case, the runtime performance hit of frame pointers is
| quite high, and since we do have things like DWARF (and maybe ORC
| someday?), I'd still argue that frame pointers aren't necessary.
| It's nice to have that extra register free!
| amluto wrote:
| Native ORC generation from GCC or LLVM or userspace tooling
| like objtool would be nifty.
|
| FWIW, ORC was specifically designed to be efficient, but it was
| not designed to be future-proof against complex toolchain
| changes. Since all the ORC tooling is in the kernel, it can
| evolve together if needed.
|
| (I was a bit involved in the design -- I helped optimize the
| format to reduce cache misses on lookup.)
| the_mitsuhiko wrote:
| > In any case, the runtime performance hit of frame pointers is
| quite high
|
| That's not true at all. It's only true in _some very rare
| cases_. Firefox, I think, ships with frame pointers enabled at
| this point, and so does every M1/M2 Mac app and all iOS
| applications, as it's mandatory for the calling convention on
| Apple.
| remexre wrote:
| On AArch64 I think it's cheaper because losing one register
| doesn't hurt as much, since you've got twice as many.
| yvdriess wrote:
| nitpick: it's the number of physical registers that matters
| more here, not ISA registers
| remexre wrote:
| Why wouldn't the lower ISA register count lead to more
| spills, regardless of the physical count? Is store
| forwarding really that effective?
| gpderetta wrote:
| It isn't.
| nneonneo wrote:
| The ORC documentation [1] cites a 5-10% perf hit [2] on x86.
| Now, granted, we're talking about an architecture that has
| only 8 logical registers to begin with, so losing one is
| going to hurt a lot more than x86-64 (16 logical registers)
| or AARCH64 (32 logical registers). But, 5-10% is definitely a
| noticeable perf hit.
|
| Phoronix [3] did a test of "-fno-omit-frame-pointer" during
| the discussion on whether to enable the flag for Fedora, this
| time on x86-64. They found an average 14% performance hit on
| a wide variety of benchmarks.
|
| [1] https://www.kernel.org/doc/html/latest/x86/orc-unwinder.html
| [2] https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@...
| [3] https://www.phoronix.com/review/fedora-frame-pointer/5
| zznzz wrote:
| See also
| https://lists.fedoraproject.org/archives/list/devel@lists.fe...
| - the botan phoronix results with frame pointers were probably
| measuring debug builds.
| tdullien wrote:
| For prodfiler.com we convert .eh_frame (DWARF) unwinding format
| to something that is more similar to ORC and optimized for
| lookup in eBPF data structures.
| ot wrote:
| > Frame pointers have some corner cases which they don't handle
| well (certain leaf and most inlined functions aren't collected),
| but these don't matter a great deal in reality.
|
| > DWARF unwinding can show inlined functions as if they are
| separate stack frames. (Opinions differ as to whether or not this
| is an advantage.)
|
| This conflates unwinding and symbolization. Unwinding collects
| the list of frames, which by definition cannot have inlined
| functions (functions that don't have their own frame); the
| unwinding mechanism does not matter here.
|
| DWARF can be used to resolve the "stack" of inlined functions for
| an instruction address, even if that was collected via frame
| pointers. That can be done in post-processing, possibly on a
| different machine, so the cost does not affect the workload being
| profiled.
|
| For example, using addr2line from the LLVM distribution:
|
|     $ llvm-addr2line -pfi --demangle -e /path/to/unstripped/binary
|     # Enter instruction address in the form 0x...
|     # Outputs inlined stack
|
| To summarize, unwinding via frame pointers does not miss any
| information that would be collected with DWARF unwinding.
| Everything can be recovered later at symbolization time.
|
| And on whether reporting inlined functions is important or not,
| at least for C++, which heavily relies on inlining to mitigate
| abstraction penalty, I'd say it is crucial.
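|
| (Illustrative sketch, not part of the original comment: how
| symbolization can expand each physical frame into its inlined
| "stack" after the fact. The address ranges and function names
| are hypothetical stand-ins for what the DWARF debug info of a
| real binary would provide.)

```python
# Hypothetical debug-info table:
# (start, end, [innermost inlined function, ..., physical function]).
INLINE_RANGES = [
    (0x1000, 0x1020, ["main"]),
    (0x1020, 0x1040, ["vector::push_back", "fill", "main"]),
    (0x2000, 0x2030, ["allocate"]),
]


def expand(pc):
    """Return the logical frames (innermost first) covering one address."""
    for start, end, chain in INLINE_RANGES:
        if start <= pc < end:
            return chain
    return [hex(pc)]  # no debug info for this address: keep it raw


def symbolize(pcs):
    """Expand a list of physical-frame addresses into logical frames."""
    frames = []
    for pc in pcs:
        frames.extend(expand(pc))
    return frames


# Two physical frames, as a frame-pointer unwinder would collect them.
physical = [0x2010, 0x1028]
logical = symbolize(physical)
print(logical)
```

| The unwinder only ever saw two frames; the two extra inlined
| entries come purely from the lookup, which can run later and on
| a different machine.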
| the_mitsuhiko wrote:
| > To summarize, unwinding via frame pointers does not miss any
| information that would be collected with DWARF unwinding.
| Everything can be recovered later at symbolization time.
|
| The real issue with DWARF based unwinding/symbolication is that
| the user experience around it is really complex. We at
| Sentry support stack walking from minidumps, yet we often
| cannot unwind on Linux platforms on the server because
| executables and object files are typically not available.
|
| Despite us supporting debuginfod (we're hitting the canonical
| service today only) we are unable to collect executables/object
| files from there, making it completely impossible to produce
| proper stack traces if frame pointers are omitted.
|
| In a world where DWARF unwinding is a thing and people want to
| use it, there has to be an ecosystem of sharing binaries too. This
| issue is particularly bad on Android where there are millions of
| devices out there with tons of different proprietary system
| libraries linked in, destroying stacks.
| conconpon wrote:
| Couldn't this be solved by uploading the binaries along with
| the crash dumps, if you don't already have a copy of it, as
| determined by checking hashes or something?
| the_mitsuhiko wrote:
| Exactly, except debuginfod from Canonical and some others
| misses the executables.
| account42 wrote:
| The debug info is usually stripped out of the binaries for
| Linux distros and has to be installed separately.
| debuginfod is supposed to make it possible to download the
| debug info in a distro-agnostic way based on the build ID
| embedded in the binaries.
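|
| (Illustrative sketch, not part of the original comment: how a
| client might build debuginfod request URLs from a build ID. The
| /buildid/<id>/debuginfo and /buildid/<id>/executable paths are
| the debuginfod web API; the server name and the build ID below
| are made up, and the helper name is hypothetical.)

```python
HEX_DIGITS = set("0123456789abcdef")
ARTIFACTS = ("debuginfo", "executable")


def debuginfod_url(server, build_id, artifact="debuginfo"):
    """Build the URL for one artifact of the binary with this build ID."""
    if artifact not in ARTIFACTS:
        raise ValueError(f"unsupported artifact: {artifact!r}")
    build_id = build_id.lower()
    if not build_id or not set(build_id) <= HEX_DIGITS:
        raise ValueError("build ID must be a non-empty hex string")
    return f"{server.rstrip('/')}/buildid/{build_id}/{artifact}"


# Hypothetical server and build ID, for illustration only.
SERVER = "https://debuginfod.example.org/"
BUILD_ID = "0123456789ABCDEF0123456789abcdef01234567"

print(debuginfod_url(SERVER, BUILD_ID))
print(debuginfod_url(SERVER, BUILD_ID, "executable"))
```

| The second URL is the kind of request the comments above
| describe as failing on servers that only publish debug info.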
| pjc50 wrote:
| Relevant discussion on profiling without frame pointers:
| https://twitter.com/halvarflake/status/1577644229853151233 /
| https://prodfiler.com/blog/introducing-prodfiler/ via the
| "prodfiler" tool.
| irogers wrote:
| Twitter: "No need to recompile with frame pointers."
|
| Web page: "always-on profiling powered by eBPF technology."
|
| BPF stack traces are gathered using frame pointers:
|
| https://github.com/torvalds/linux/blob/master/kernel/bpf/sta...
|
| https://github.com/torvalds/linux/blob/master/arch/x86/event...
| pjc50 wrote:
| Someone from prodfiler appears to be explaining in this
| thread https://news.ycombinator.com/item?id=34806693
| irogers wrote:
| For every running application turn DWARF data into BPF
| maps. Does this scale?
| javierhonduco wrote:
| As surprising as it seems, it does! My colleague and I
| spoke about this in our project at FOSDEM
| https://fosdem.org/2023/schedule/event/walking_stack_without....
| Let us know if you have any questions or feedback :)
| ishitatsuyuki wrote:
| IMHO, perf's decision to write whole stacks directly to the disk
| and unwinding them as a post-process is a really bad design. It
| wastes disk space, and as the author pointed out, it also has a
| lot of IO overhead.
|
| As an alternative approach, https://github.com/mstange/samply
| processes data streamed from perf and unwinds it in realtime. The
| unwinding overhead is surprisingly low: it only takes around 1%
| of (single) CPU per CPU profiled. Solving the disk waste alone
| has been a tremendous improvement of profiling experience. As a
| bonus, the unwinding and symbolization works reliably while I
| frequently had postprocessing not terminating when using the perf
| CLI directly.
| irogers wrote:
| This could be a great Linux perf GSoC project. Projects and
| mentors are being looked for:
| https://wiki.linuxfoundation.org/gsoc/2023-gsoc-perf
| sitkack wrote:
| Are you saying that Dwarf information should be unwound in
| realtime or that it should use framepointers and debug
| information to trivially sample the stacks and record the
| symbols?
|
| If you have framepointers and debug information, it is both
| high resolution and fast. DWARF is a fallback for not having
| framepointers.
|
| If you are saying the DWARF information should be processed at
| the point of use and not copied and processed later, then I
| concur. But we should also encourage folks to compile WITH
| `-fno-omit-frame-pointer` and `-g`
| js2 wrote:
| Not having frame pointers is a complete PITA on Android.
|
| I built my company's in-house mobile crash reporter ~ 2013, based
| on experience building a similar system at an earlier startup. At
| the startup I used Google breakpad on both Android and iOS, doing
| all the unwinding and symbolication on the backend. iOS at least
| made this easy because Apple makes dSYMs readily available. On
| Android, you simply can't get system symbols. So you essentially
| can't unwind OS symbols on Android.
|
| At current company I used PLCrashReporter on iOS. Unwinding
| occurs on device. Symbolication on the backend.
|
| I tried everything with Android for native code crashes, starting
| with breakpad minidumps, then using every available unwinder
| option on Android: corkscrew, the Android fork of libunwind, the
| official libunwind, whatever custom unwinder Android eventually
| wrote. None of them work reliably for native code crashes. And
| good luck tracing from native code back into the ART frames.
|
| In the end what ended up being most reliable was including the
| last few thousand lines of logcat (grabbed upon the app
| restarting after a crash since you can't reliably grab it inside
| the crash handler). Android's OOB crash handler for some crashes
| (with recent Android versions) dumps full stacks of every thread
| including native code and ART frames to logcat. So the crash SDK
| looks for that in the logcat output and includes it in the
| report. That at least provides a stack. Symbolicating anything
| but application frames is still impossible though.
|
| And this isn't even going into esoteric things Android has done
| over the years just for Chrome like relocation packing:
|
| https://android.googlesource.com/platform/bionic/+/f5e0ba94d...
|
| To this day, I don't understand why both Apple and Google make it
| so difficult for an application to get access to the stack traces
| of its own crashes. And no, the reporting built-in to Google Play
| and iTunes Connect (and Xcode) are not sufficient for large usage
| apps or companies like mine that have lots of apps with shared
| SDKs and need to correlate crashes across apps.
| titzer wrote:
| > But collecting the whole stack would consume far too much
| storage, so by default it only collects the first 8K. Many
| userspace stacks will be larger than this, in which case the data
| collection will simply be incomplete - it will never be possible
| to recover the full stack trace.
|
| Yeah, doing an 8KB memcopy for every profile sample sounds like a
| lot of overhead. Is DWARF unwinding so slow that that is actually
| faster?
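|
| (Illustrative sketch, not part of the original comment: rough
| arithmetic for how much data the per-sample stack copy
| produces. Only the 8K figure comes from the quoted article; the
| sampling rate, CPU count, duration and the 30-frame call chain
| used for comparison are assumptions picked for this example.)

```python
STACK_DUMP_BYTES = 8 * 1024  # per-sample stack copy, from the quoted article


def sample_volume_bytes(sample_hz, cpus, seconds, bytes_per_sample):
    """Total payload recorded if every CPU is sampled for the whole run."""
    return sample_hz * cpus * seconds * bytes_per_sample


# Assumed workload: 99 samples/s per CPU, 8 busy CPUs, 60 seconds.
HZ, CPUS, SECONDS = 99, 8, 60

stack_copy = sample_volume_bytes(HZ, CPUS, SECONDS, STACK_DUMP_BYTES)
# Frame-pointer call chain: assumed 30 frames of 8-byte return addresses.
call_chain = sample_volume_bytes(HZ, CPUS, SECONDS, 30 * 8)

print(f"stack copies: {stack_copy / 2**20:.2f} MiB")
print(f"call chains:  {call_chain / 2**20:.2f} MiB")
print(f"ratio:        {stack_copy / call_chain:.1f}x")
```

| The ratio depends only on the per-sample sizes, so it holds for
| any sampling rate as long as the assumed call-chain depth does.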
|
| Virgil doesn't use a frame pointer, and I sort of regret it. It
| uses custom unwinding information that is used during GC or
| throwing an exception (i.e. controlled crash). I spent
| considerable time optimizing both the space and time of that
| lookup, to the point where it's only 32 bits of metadata per call
| site and a few dozen instructions to walk each frame. But that
| was a major, major pain to debug and I found a bug in it as
| late as last year.
|
| A frame pointer is also required for stack allocation of objects
| that aren't fully scalar-replaceable. Virgil doesn't do that yet,
| but aspires to someday.
| zznzz wrote:
| > Is DWARF unwinding so slow that that is actually faster?
|
| No. The only reason it works like this is because the upstream
| Linux kernel has thus far rejected in-kernel dwarf unwinders,
| but copying the stack is simpler and available / implemented.
| irogers wrote:
| DWARF bytecode is a full VM. Do compiler writers test their
| DWARF output? (my experience is not - especially for
| architectures out of the big 2 or 3) How does the kernel
| access the ELF file pages with the DWARF information in when
| in an NMI handler? You could mlock all your debug information
| when a program loads but the memory overhead wouldn't be
| nice. It is hard enough getting a build ID.
|
| The elephant in the room btw is LBR call stacks, but they
| aren't exposed in the kernel/BPF yet. Userland perf has them
| and they recently became available on AMD.
| zznzz wrote:
| It is not required to unwind the user space stack in the
| NMI handler. It can be done later before returning to user
| space in a context that can handle faults.
| irogers wrote:
| Allowing processes to sniff each other's stacks has some
| fairly obvious security issues.
| zznzz wrote:
| I don't understand your concern - what about this would
| involve one process sniffing another process's memory?
| The kernel would still be doing the unwinding, just not
| in the NMI handler.
| irogers wrote:
| Wouldn't all your kernel stacks then end up in whatever
| this handler is? Why not implement your approach and mail
| it to LKML :-)
| zznzz wrote:
| Yes, this only works for user space stacks, but that is
| sufficient since with ORC kernel stacks are solved (IMO)
| and it avoids all the issues with trying to mlock
| debuginfo of all processes that you mentioned. The NMI
| handler would still unwind the kernel stack.
|
| > Why not implement your approach and mail it to LKML :-)
|
| because this would still be an in-kernel dwarf unwinder
| and I would expect an instant reject, and because I am
| lazy and/or don't care enough about this problem or linux
| to work on it. Even if people could be persuaded, I don't
| have the interest or temperance to debate this with LKML.
| titzer wrote:
| Why is profiling done in the kernel for userspace stacks?
| zznzz wrote:
| because this is about PMU based sampling, which involves
| triggering interrupts at some interval and doing the
| sampling while handling the interrupt
| tdullien wrote:
| What everybody in this discussion seems to miss is that _you
| don't need to unwind the DWARF data structures during profiling
| time_, you are free to convert DWARF to a fast-lookup data
| structure on the machine.
|
| DWARF needs to support every CPU under the sun. Every unwinder on
| the other hand is CPU-specific. For prodfiler.com's continuous
| in-production unwinding, we convert DWARF into something compact
| and fast-to-lookup that is then placed in eBPF maps.
|
| It all works like a charm. We can have our cake (e.g. use RBP as
| GPR) and eat it too (e.g. use .eh_frame, converted at runtime
| into a fast-to-lookup format) to do reliable whole-system
| unwinding in production.
| irogers wrote:
| prodfiler clearly has a market. It would be interesting to see
| the approach as something standard in the kernel tree, perhaps
| it can be added to perf's synthesis, etc. There is already BPF
| based profiling within perf to avoid file descriptor overheads.
| If engineering resources are the issue then this could be a
| good GSoC project:
| https://wiki.linuxfoundation.org/gsoc/2023-gsoc-perf
| javierhonduco wrote:
| This would be ideal. There's some great work by folks at
| Oracle in this space: SFrame
| (https://www.phoronix.com/news/GNU-Binutils-SFrame) nee
| ctf_frame that I hope will be integrated in the kernel.
|
| As this will take a few years, in the meantime I've developed a
| DWARF-based unwinder in BPF [0]. Some perf maintainers showed
| interest in this, so thanks for bringing up the GSoC project
| idea, didn't occur to me!
|
| [0]: https://news.ycombinator.com/item?id=33788794
| javierhonduco wrote:
| The pervasive lack of frame pointers is the reason why we've
| developed a custom format derived from DWARF unwind information
| thanks to some insights: DWARF unwind information is incredibly
| flexible, it supports many architectures and allows restoring any
| arbitrary register. But we only need 3: the frame pointer, the
| stack pointer, and in non-x86 the return address.
|
| While DWARF unwind info doesn't use that many bytes, reading
| and parsing that information is unfortunately quite expensive.
|
| For that reason I've developed a new unwinder that uses custom
| unwind information derived from DWARF
| (https://www.polarsignals.com/blog/posts/2022/11/29/profiling...,
| previously discussed in
| https://news.ycombinator.com/item?id=33788794) that runs in BPF.
| This new compact representation can be binary searched easily and
| each unwind row has a size of 16 bytes. I am currently working on
| reducing it down to ~10 bytes.
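|
| (Illustrative sketch, not part of the original comment and not
| the Parca format: a toy 16-byte unwind row that can be binary
| searched by program counter. The field layout, the register
| codes and the example addresses are all assumptions made for
| this example; x86-64 is assumed, so the return address sits at
| CFA - 8.)

```python
import bisect
import struct

# Hypothetical 16-byte row: pc (u64), CFA base register (u8), pad,
# CFA offset (i16), saved-rbp offset from CFA (i16, 0 = not saved), pad.
ROW = struct.Struct("<QBxhhxx")
RSP, RBP = 0, 1


def build_table(rows):
    """Pack rows sorted by pc; also return the pcs as the search keys."""
    rows = sorted(rows)
    blob = b"".join(ROW.pack(*row) for row in rows)
    return blob, [row[0] for row in rows]


def lookup(blob, pcs, pc):
    """Binary search for the last row whose pc is <= the queried pc."""
    index = bisect.bisect_right(pcs, pc) - 1
    if index < 0:
        return None
    return ROW.unpack_from(blob, index * ROW.size)


def step(blob, pcs, memory, pc, rsp, rbp):
    """Recover the caller's (pc, rsp, rbp) for one frame."""
    row = lookup(blob, pcs, pc)
    if row is None:
        return None
    _, base, cfa_offset, rbp_offset = row
    cfa = (rsp if base == RSP else rbp) + cfa_offset
    caller_rbp = memory[cfa + rbp_offset] if rbp_offset else rbp
    return memory[cfa - 8], cfa, caller_rbp


# A function at 0x3000 built without a frame pointer:
# at entry CFA = rsp + 8; after `sub rsp, 24` CFA = rsp + 32.
blob, pcs = build_table([(0x3004, RSP, 32, 0), (0x3000, RSP, 8, 0)])
memory = {0x6FC8: 0x2010}  # return address pushed by the caller

at_entry = step(blob, pcs, memory, pc=0x3000, rsp=0x6FC8, rbp=0x1234)
in_body = step(blob, pcs, memory, pc=0x3008, rsp=0x6FB0, rbp=0x1234)
print([hex(v) for v in at_entry], [hex(v) for v in in_body])
```

| Both samples recover the same caller state even though rbp is
| never used as a frame pointer; only the stack pointer offset in
| the matching row differs.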
|
| All the code is fully OSS (Apache 2.0 for userspace and GPL for
| BPF), and part of the Parca project
| (https://github.com/parca-dev/parca-agent). We've also given a
| talk this year at FOSDEM going deeper into how we made it scale
| for many big processes.
| dezgeg wrote:
| Nice analysis.
|
| It would be interesting to see comparison to Intel LBR.
|
| Also would be nice to know how profiling unwinding is done on
| Windows (maybe someone knows how to summon Bruce Dawson).
| rwmj wrote:
| AIUI the problems with LBR are two-fold. It only works on
| newish CPUs, and it only handles a limited number of stack
| frames (I heard 8, but maybe more on recent CPUs).
| amluto wrote:
| CET's shadow stacks, on the other hand, will solve this
| entire problem, both exactly and extremely efficiently.
| zznzz wrote:
| IIUC Windows implements in-kernel unwinding using FPO debug
| data embedded in every executable.
| bregma wrote:
| Big advantage for DWARF2 over frame pointers is that it actually
| works for unwinding on aarch64.
| zznzz wrote:
| Can you give more detail?
|
| AFAIK frame pointers work fine for unwinding on aarch64. And on
| aarch64 the gcc default is not to omit frame pointers and IIRC
| when the default was switched at some point it was treated as a
| bug and reverted (not sure if required in the ABI or just
| strongly preferred by the community). So IME generally
| unwinding with frame pointers on aarch64 works more often than
| on amd64 since you don't have to recompile the world.
| jmrjmr wrote:
| Why don't frame pointers work on aarch64?
| the_mitsuhiko wrote:
| Apple's version of AARCH64 requires that the frame pointer
| register always address a valid frame record. DWARF information
| is not necessary to unwind there. It's lovely and I applaud
| Apple for that decision.
___________________________________________________________________
(page generated 2023-02-15 23:02 UTC)