[HN Gopher] Unwinding the Stack the Hard Way
       ___________________________________________________________________
        
       Unwinding the Stack the Hard Way
        
       Author : todsacerdoti
       Score  : 63 points
       Date   : 2023-04-16 17:51 UTC (5 hours ago)
        
 (HTM) web link (lesenechal.fr)
 (TXT) w3m dump (lesenechal.fr)
        
       | zX41ZdbW wrote:
       | We have implemented async-signal-safe, in-process stack unwinding
       | for the always-on profiler in ClickHouse:
       | https://clickhouse.com/docs/en/operations/optimizing-perform...
       | 
       | The downside - it required many patches to LLVM's libunwind, and
       | not all of them are accepted yet:
       | https://bugs.llvm.org/show_bug.cgi?id=48186
       | 
       | ClickHouse source code: https://github.com/ClickHouse/ClickHouse
        
       | brancz wrote:
       | For the always-on open-source profiler we work on, we had to do
       | similar things, and it was even more involved: we base the whole
       | thing on eBPF to lower the overhead, so we also needed to get the
       | verifier to accept it. [1]
       | 
       | We really wish frame pointers were always present, but here we
       | are.
       | 
       | [1]
       | https://www.polarsignals.com/blog/posts/2022/11/29/profiling...
        
       | the_mitsuhiko wrote:
       | Please just stop omitting frame pointers. You might lose out on 6
       | months of CPU speed advances, but from then on out you will reap
       | the benefits of better production observability for years to
       | come.
        
       | rwmj wrote:
       | I would suggest _not_ omitting the frame pointer. Fedora recently
       | changed the default and it makes collecting stack traces vastly
       | simpler, leading to better profiling support
       | (https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwar...)
       | 
       | Since then I gave a short (15 min) talk about producing and
       | understanding flame graphs:
       | http://oirase.annexia.org/tmp/2023-03-08-flamegraphs.mp4
        
         | fweimer wrote:
         | I still hope we can revert that once we have complete, CPU-
         | verified backtraces. As a linked list, frame pointers are still
         | somewhat slow. Just copying the hardware address stack should
         | be quite a bit faster, and more importantly, the CPU will
         | enforce that the addresses are correct.
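
Editorial sketch: the "linked list" described above can be modeled in a few lines. This is a toy simulation, not how any profiler in this thread is implemented. It assumes the conventional x86-64 layout in which `[rbp]` holds the caller's saved `rbp` and `[rbp + 8]` holds the return address; the `memory` dict, the function name, and every address are made up for illustration.

```python
# Toy model of a frame-pointer walk. Memory is a dict of address -> 64-bit word.
# Assumed layout (x86-64 convention): [rbp] = caller's saved rbp,
# [rbp + 8] = return address. All addresses below are hypothetical.
WORD = 8


def walk_frame_pointers(memory, rbp, max_frames=64):
    """Follow the saved-rbp chain and collect return addresses."""
    return_addresses = []
    for _ in range(max_frames):  # the cap guards against cycles in a corrupt stack
        if rbp == 0 or rbp not in memory or rbp + WORD not in memory:
            break
        return_addresses.append(memory[rbp + WORD])
        next_rbp = memory[rbp]
        if next_rbp != 0 and next_rbp <= rbp:
            break  # the stack grows down, so a caller's frame must sit higher
        rbp = next_rbp
    return return_addresses


# Three hypothetical frames: leaf -> middle -> main.
memory = {
    0x7000: 0x7020, 0x7008: 0x401234,  # leaf: saved rbp, return into middle
    0x7020: 0x7040, 0x7028: 0x401100,  # middle: saved rbp, return into main
    0x7040: 0x0,    0x7048: 0x401050,  # main: chain terminates with rbp = 0
}
frames = walk_frame_pointers(memory, 0x7000)
print([hex(a) for a in frames])
```

Each step is a dependent load (the next `rbp` is only known after reading the current one), which is one way to read the "somewhat slow" remark; the sanity checks are needed because nothing forces the chain to be well formed.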
        
         | boulos wrote:
         | Yeah, mandating rbp on amd64 would have saved years of
         | headache. There are still cases like tail calls that wouldn't
         | work, but the _vast majority_ of code doesn't need to care
         | about dedicating a register to having usable backtraces.
        
       | rco8786 wrote:
       | I figured this article would be about Golang.
       | 
       | if err != nil { return err }
        
       | userbinator wrote:
       | There's a far simpler and also _very_ generic method that I'm
       | surprised no one has really mentioned much, although it's well-
       | known in some corners of the RE/Asm community and I believe some
       | debuggers (but not gdb?) use it: scan the stack for values that
       | look like valid code addresses, then disassemble backwards from
       | there to see if they were actually return addresses written there
       | by a call instruction. You will find a chain that leads back to
       | the entry point.
       | 
       | Of course it won't work in edge cases like handwritten Asm that
       | uses the stack in more clever ways, but when you're dealing with
       | compiler output, it'll be fine. No need for all this complexity,
       | and works in almost all cases.
        
         | loeg wrote:
         | Do you analyze the dynamic loader segments (to determine code-
         | like addresses) or just do some hard-coded approximations per
         | platform? Pretty cute idea and I agree it will often work.
        
           | userbinator wrote:
           | It just needs to be on an executable page. The processor
           | doesn't care about anything else, so this will work with
           | things like JIT-generated code too. The key idea is that in
           | essentially all cases of compiler-generated code, every
           | return address will be immediately after the call that wrote
           | it.
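
Editorial sketch: a toy version of the scan described above, under loud assumptions. It models x86-64 only, recognizes just two call encodings (direct `E8 rel32` and register-indirect `FF D0`..`FF D7`), and treats one contiguous byte string as the only executable region. `CODE_BASE`, the function names, and all bytes and addresses are hypothetical; a real scanner would need a proper disassembler and the process's actual executable mappings.

```python
# Toy model of stack scanning on x86-64. All addresses and bytes are made up.
CODE_BASE = 0x1000


def preceded_by_call(code, addr):
    """True if the bytes just before addr decode as a call (small subset only)."""
    off = addr - CODE_BASE
    # Direct near call: opcode E8 followed by a 4-byte relative displacement.
    if off >= 5 and code[off - 5] == 0xE8:
        return True
    # Indirect call through a register: FF /2 with mod=11, i.e. FF D0..D7.
    if off >= 2 and code[off - 2] == 0xFF and 0xD0 <= code[off - 1] <= 0xD7:
        return True
    return False


def scan_stack(stack_words, code):
    """Return stack values that point into code and directly follow a call."""
    lo, hi = CODE_BASE, CODE_BASE + len(code)
    return [w for w in stack_words if lo < w <= hi and preceded_by_call(code, w)]


# nop*3, call rel32 (returns to 0x1008), nop*4, call rax (returns to 0x100E), nop*2
code = bytes([0x90] * 3 + [0xE8, 0, 0, 0, 0] + [0x90] * 4 + [0xFF, 0xD0] + [0x90] * 2)
stack = [0x1008, 0x2A, 0x100A, 0x100E, 0x7FFF0000]
candidates = scan_stack(stack, code)
print([hex(w) for w in candidates])
```

Note that this check only says a value could have been written by a call. It cannot tell a live return address from a stale one left behind by an earlier call, and a data byte that happens to equal `E8` would also pass.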
        
             | loeg wrote:
             | Yeah, but how do you look up executable page mappings?
             | (Dynamic loader segments will have at least one executable
             | Seg mapped somewhere.)
        
               | userbinator wrote:
               | Linux: /proc/$pid/maps
               | 
               | Windows: VirtualQuery()
               | 
               | Mac: vm_region()
               | 
               | Your own OS (like this article): you should know.
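
Editorial sketch: for the Linux case, the executable ranges can be pulled out of the `maps` text with a short parser. This assumes the usual `address perms offset dev inode pathname` line layout of `/proc/<pid>/maps`; the function name and the sample lines are made up for illustration.

```python
def executable_ranges(maps_text):
    """Parse /proc/<pid>/maps content; return (start, end) of executable mappings."""
    ranges = []
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        addr_range, perms = fields[0], fields[1]
        if "x" in perms:  # perms look like "r-xp"; 'x' marks an executable mapping
            start, end = (int(part, 16) for part in addr_range.split("-"))
            ranges.append((start, end))
    return ranges


# Hypothetical sample in the /proc/<pid>/maps format.
SAMPLE = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
00651000-00652000 r--p 00051000 08:02 173521 /usr/bin/example
7f2c8a000000-7f2c8a021000 rw-p 00000000 00:00 0
7ffd1b5f2000-7ffd1b5f4000 r-xp 00000000 00:00 0 [vdso]
"""
ranges = executable_ranges(SAMPLE)
print([(hex(start), hex(end)) for start, end in ranges])
```

On a Linux machine the same function could be fed the text of `/proc/self/maps` read with `open(...)`. The end address in each pair is exclusive.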
        
               | loeg wrote:
               | Ok, some platform-specific mechanisms. Cool.
        
         | the_mitsuhiko wrote:
         | That method is pretty much guaranteed to give you completely
         | broken stacks. We (Sentry) use it as a fallback if nothing else
         | works, and the success rate is awful.
        
           | loeg wrote:
           | Please elaborate on what breaks.
        
             | xuhu wrote:
             | Probably most valid code addresses on the stack are from
             | older calls that are no longer part of the current call
             | chain.
        
             | the_mitsuhiko wrote:
             | The act of stackwalking breaks. You often end up in
             | completely nonsensical stacks halfway through and then all
             | is lost.
        
               | loeg wrote:
               | Is this just an artifact of old return addresses on the
               | stack not being overwritten?
        
           | userbinator wrote:
           | Are you also doing back-disassembly to link the call chain?
           | I've had very good success with it. I agree that without
           | analysing the potential call sites and making sure that they
           | are actually possible, it won't work.
        
             | the_mitsuhiko wrote:
             | Generally what we do when we fall back to scanning depends
             | on the architecture. For x86 you can find the logic here:
             | https://github.com/rust-minidump/rust-minidump/blob/77638ab7...
             | 
             | Since we do post-hoc stack walking our ability to actually
             | look at the assembly is limited. In most cases where we
             | have no CFI we also do not have the binary to begin with,
             | so we're in random memory land.
        
       ___________________________________________________________________
       (page generated 2023-04-16 23:00 UTC)