[HN Gopher] Fast Shadow Stacks for Go
       ___________________________________________________________________
        
       Fast Shadow Stacks for Go
        
       Author : ingve
       Score  : 117 points
       Date   : 2024-05-29 07:52 UTC (1 day ago)
        
 (HTM) web link (blog.felixge.de)
 (TXT) w3m dump (blog.felixge.de)
        
       | felixge wrote:
        | OP here, happy to answer any questions.
        
         | nickcw wrote:
         | Very interesting article thank you.
         | 
         | Do you see this speeding up real world Go programs?
        
           | felixge wrote:
            | Thanks! And to answer your question: no, it won't speed up
            | Go programs for now. This was mostly a fun research project
            | for me.
           | 
            | The low-hanging fruit for speeding up stack unwinding in the
            | Go runtime is to switch to frame pointer unwinding in more
            | places. In go1.21 we contributed patches to do this for the
            | execution tracer. For the upcoming go1.23 release, my
            | colleague Nick contributed patches to upgrade the block and
            | mutex profiler. Once the go1.24 tree opens, we're hoping to
            | tackle the memory profiler as well as copystack. The latter
            | would benefit all Go programs, even those not using
            | profiling. But it's likely going to be a relatively small
            | win (<= 1%).
           | 
           | Once all of this is done, shadow stacks have the potential to
           | make things even faster. But the problem is that we'll be
           | deeply in diminishing returns territory at that point.
           | Speeding up stack capturing is great when it makes up 80-90%
           | of your overhead (this was the case for the execution tracer
           | before frame pointers). But once we're down to 1-2% (the
           | current situation for the execution tracer), another 8x
           | speedup is not going to buy us much, especially when it has
           | downsides.
           | 
           | The only future in which shadow stacks could speed up real Go
           | programs is one where we decide to drop frame pointer support
           | in the compiler, which could provide 1-2% speedup for all Go
           | programs. Once hardware shadow stacks become widely available
           | and accessible, I think that would be worth considering. But
           | that's likely to be a few years down the road from now.
        
             | aerfio wrote:
              | Do you think of, or know of, any areas in the Go codebase
              | that could enable a performance jump bigger than, say,
              | 10%? I'm very grateful for any work done on the Go
              | codebase; for me the language is plenty fast. I'm just
              | curious about the state of Go's internals: are there any
              | techniques left to speed it up significantly, or are
              | there parts of the codebase or old architectural
              | decisions holding it back? And thank you for your work!
        
               | felixge wrote:
               | I don't think any obvious 10%+ opportunities have been
               | overlooked. Go is optimizing for fast and simple builds,
               | which is a bit at odds with optimal code gen. So I think
               | the biggest opportunity is to use Go implementations that
               | are based on aggressively optimizing compilers such as
               | LLVM and GCC. But those implementations tend to be a few
               | major versions behind and are likely to be less stable
               | than the official compiler.
               | 
                | That being said, I'm sure there are a lot of remaining
                | incremental optimization opportunities that could add up
                | to 10% over time. For example, a faster map
                | implementation [1]. I'm sure there is more.
               | 
               | Another recent perf opportunity is using pgo [2] which
               | can get you 10% in some cases. Shameless plug: We
               | recently GA'ed our support for it at Datadog [3].
               | 
                | [1] https://github.com/golang/go/issues/54766
                | [2] https://go.dev/doc/pgo
                | [3] https://www.datadoghq.com/blog/datadog-pgo-go/
        
               | dolmen wrote:
                | Reminder: the Go team considers optimizations in code
                | generation only insofar as the compiler is kept fast.
                | That's why the Go compiler doesn't have as many
                | optimization phases as C/C++/Rust compilers.
                | 
                | As a developer I like that approach, as it keeps a
                | great developer experience, helps me stay focused, and
                | gives me great productivity.
        
               | neonsunset wrote:
                | Go's limitation is that it's a high-level language with
                | a very simple compiler. Providing true zero-cost
                | abstractions (full monomorphization rather than GC
                | shape stenciling) and advanced optimizations is a
                | bridge it won't cross, because it would mean much
                | greater engineering effort spent on the compiler and
                | increasing its LOCs by a factor of 5, especially if a
                | compiler wants to preserve its throughput.
                | 
                | Though I find it unfortunate that the industry
                | considers Go as a choice for _performance-sensitive_
                | scenarios when C# exists, which went the above route
                | and does not sacrifice performance or the ability to
                | offer performance-specific APIs (like cross-platform
                | SIMD), paying the price of a higher-effort, more
                | complex compiler implementation. It also does
                | in-runtime PGO (DynamicPGO): since long-running server
                | workloads usually use the JIT where it's available, you
                | don't need to carefully craft a sample workload and
                | hope it matches production behavior - the JIT does it
                | for you, and it yields anything from 10% to 35%
                | depending on how abstraction-heavy the codebase is.
        
         | nikolayasdf123 wrote:
          | Did you get any response from the Go core team? It might make
          | sense to open an issue on GitHub
          | (https://github.com/golang/go/issues/) and start a thread on
          | https://groups.google.com/g/golang-dev
        
           | felixge wrote:
           | I know that at least two engineers from the runtime team have
           | seen the post in the #darkarts channel of gopher slack. One
           | of them left a fire emoji :).
           | 
            | I'll probably bring it up in the biweekly Go runtime
            | diagnostics sync [1] next Thursday, but my guess is that
           | they'll have the same conclusion as me: Neat trick, but not a
           | good idea for the runtime until hardware shadow stacks become
           | widely available and accessible.
           | 
           | [1] https://github.com/golang/go/issues/57175
        
       | 10000truths wrote:
       | In an ideal world, we'd be using an ABI that keeps the code
       | (return addresses) and data (spilled function params/locals) in
       | two separate stacks. Then the profiling would be as
       | straightforward as a memory copy of the code stack.
        
         | felixge wrote:
          | That's what hardware shadow stacks in modern Intel/ARM CPUs
          | can do! It just needs to be exposed to user space and become
          | widely available.
        
           | fweimer wrote:
           | Fedora 40 and later have shadow stack support in userspace.
           | It's currently opt-in with glibc (`export
           | GLIBC_TUNABLES=glibc.cpu.x86_shstk=on` is one way to switch
           | it on I believe). The plan is to make this self-tuning
           | eventually in glibc upstream, once the quirks have been
           | ironed out.
           | 
              | It will not work with Go as-is: the Go scheduler will
              | have to be taught to switch the shadow stack along with
              | the regular stack, and panic/recover needs to walk the
              | shadow stack.
           | But for binaries that do not use CGO, it would be possible to
           | enable this fairly quickly. Hardware support is already
           | widely available. The SHSTK-specific code paths are well-
           | isolated. You would not lose compatibility with older CPUs or
           | kernels.
        
             | felixge wrote:
             | Thanks for the reply!
             | 
             | What does the API for accessing the shadow stack from user
             | space look like? I didn't see anything for it in the kernel
             | docs [1].
             | 
             | I agree about the need for switching the shadow stacks in
             | the Go scheduler. But this would probably require an API
             | that is a bit at odds with the security goals of the kernel
             | feature.
             | 
             | I'm not sure I follow your thoughts on CGO and how this
             | would work on older CPUs and kernels.
             | 
             | [1] https://docs.kernel.org/next/x86/shstk.html
        
               | fweimer wrote:
               | You can get the shadow stack pointer using the RDSSPQ
               | instruction. The kernel documentation shows how the
               | shadow stack size is obtained for the main thread.
               | Threads created explicitly using clone3 have the
               | specified shadow stack size. I think this is sufficient
               | to determine the shadow stack boundaries.
               | 
               | Regarding older CPUs, what I wanted to point out is that
               | the code to enable and maintain shadow stacks will not be
               | smeared across the instruction stream (unlike using APX
               | instructions, or direct use of LSE atomics on AArch64).
               | It's possible to execute the shadow stack code only
               | conditionally.
        
               | felixge wrote:
               | Thank you so much, this is very helpful and interesting.
               | I'll try to experiment with this at some point.
        
               | fweimer wrote:
               | Glad to be of help.
               | 
               | Note that it's actually
               | GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK to enable it with
               | Fedora 40 glibc (and the program needs to be built with
               | -fcf-protection).
        
         | mmaniac wrote:
         | Would that not complicate function prologues & epilogues?
         | 
         | I suppose hardware support (e.g. dedicated call/ret
         | instructions accessing a second stack pointer) is also
         | desirable if you want to do this sort of thing. I recall that
         | Itanium had something like this.
        
           | Joker_vD wrote:
            | How would it complicate the prologue/epilogue? It would
            | still be
            | 
            |     push    RBP
            |     mov     RBP, RDSP          ; DSP stands for "data-stack pointer"
            |     sub     RDSP, <frame-size>
            |     ...                        ; locals are [RBP-n], args are [RBP+n]
            |     add     RDSP, <frame-size>
            |     pop     RBP
            |     ret                        ; pops address from [RCSP], where CSP
            |                                ; stands for "call-stack pointer"
            | 
            | or even, if you don't want a specific base pointer, like
            | this:
            | 
            |     sub     RDSP, <frame-size>
            |     ...                        ; locals are [RDSP+n], args are
            |                                ; [RDSP+frame_size+n]
            |     add     RDSP, <frame-size>
            |     ret
           | 
           | In fact, you can do this even now on x86-64: make a second
           | stack segment, point RBP there, and use RBP as if it were
           | that hypothetical RDSP register. The return addresses
           | naturally go into the RSP-pointed area, as they always did.
            | The only tricky part is when you need to call code that
            | uses the SysV x86-64 ABI: you'll need to conditionally
            | re-align the RSP-pointed area to 16 bytes, which is a bit
            | annoying (in your own code, keeping RBP aligned at 16
            | bytes is trivial since function calls don't break that
            | alignment).
           | 
            | There are two main benefits of this layout, as I see it:
            | first, smashing the return address is now way harder than
            | it normally is; second, "the act of traversing all call
            | frames on the stack and collecting the return addresses
            | that are stored inside of them" is now trivial, since the
            | call stack at RSP is just a dense array of return
            | addresses, no pointer chasing needed.
        
             | jerf wrote:
              | And Go is already using a non-standard ABI and has to
              | take real action to convert back to standard ABIs if you
              | want to call external code. That's part of why cgo calls
              | are expensive, the other major part being that they have
              | to interact with OS threads differently than native Go
              | code does.
        
         | friendzis wrote:
         | Isn't that what mainly differentiates Harvard from Von Neumann?
        
           | immibis wrote:
           | No. They store code and data in separate address spaces. That
           | isn't the same as storing code addresses and data addresses
           | in separate parts of the data space.
        
             | eru wrote:
             | Yes. And all four combinations of Harvard / von-Neumann X
             | single stack / double stack are potentially viable options.
             | 
             | Btw, I think Forth has a separate data stack and return
             | address stack?
        
               | p_l wrote:
               | Yes. You can also freely use the return stack, IIRC, so
               | long as you don't mess up a return itself.
        
       | malkia wrote:
       | You can totally access these stacks from userspace -
       | https://github.com/gabriellandau/ShadowStackWalk
        
         | felixge wrote:
         | That seems to be windows only? My main target OS is Linux.
        
           | malkia wrote:
            | Ah - sorry, now I understand. Also TIL - this might not be
            | accessible right away on Linux unless certain configuration
            | is set. Thanks!
        
       | naasking wrote:
        | How common is it for call stacks to be deeper than 32 frames,
        | the point where these techniques cross over in the benchmark? I
        | don't program in Go, but it doesn't seem that common in the
        | languages I've used, so I'm skeptical it's worth it.
        
         | malkia wrote:
          | On Windows, I tend to often see callstack frames going down a
          | lot, especially in C++ debug builds where functions are no
          | longer inlined.
          | 
          | On top of that, if you have a work-stealing task scheduler
          | that reassigns your current thread to another task while it
          | waits on I/O, the stack may balloon even more.
        
           | naasking wrote:
           | Can you clarify what you mean by "going down a lot"? I
           | interpret that to mean that the stacks are flatter, but then
           | debug builds without inlining would grow the stack, not
           | flatten it, so I'm not clear on what you're saying.
           | 
           | Scheduling tasks on a work-stealing queue tends to flatten
           | the stack first, eg. a halted task has a nearly empty stack
            | if it's an async coroutine. The scheduler itself may be 3-4
            | calls deep before resuming the task itself, so that doesn't
            | seem too bad. Unless you're talking about full-blown
            | threads.
        
             | malkia wrote:
              | Sorry, what I meant to say is that inlined functions are
              | now individual calls, so the number of callstack entries
              | is going to grow.
              | 
              | But then, if you capture the stack (using sampling) and
              | catch, for example, a thread-context switch while
              | CreateFileA is happening, you can expect lots of
              | callstack items going all the way down to the kernel.
        
         | felixge wrote:
          | I'm skeptical that it's worth it myself; this was just a fun
          | research project for me. But once hardware shadow stacks are
          | available, I think this could be great.
         | 
         | To answer your first question: For most Go applications, the
         | average stack trace depth for profiling/execution tracing is
         | below 32 frames. But some applications use heavy middleware
         | layers that can push the average above this limit.
         | 
         | That being said, I think this technique will amortize much
         | earlier when the fixed cost per frame walk is higher, e.g. when
         | using DWARF or gopclntab unwinding. For Go that doesn't really
         | matter while the compiler emits frame pointers. But it's always
         | good to have options when it comes to evolving the compiler and
         | runtime ...
        
       ___________________________________________________________________
       (page generated 2024-05-30 23:01 UTC)