[HN Gopher] Fast Shadow Stacks for Go
___________________________________________________________________
Fast Shadow Stacks for Go
Author : ingve
Score : 117 points
Date   : 2024-05-29 07:52 UTC (1 day ago)
(HTM) web link (blog.felixge.de)
(TXT) w3m dump (blog.felixge.de)
| felixge wrote:
| OP here, happy to answer any question.
| nickcw wrote:
| Very interesting article thank you.
|
| Do you see this speeding up real world Go programs?
| felixge wrote:
| Thanks! And to answer your question: No, it won't speed up Go
| programs for now. This was mostly a fun research project for
| me.
|
| The low-hanging fruit for speeding up stack unwinding in the
| Go runtime is switching to frame pointer unwinding in more
| places. In go1.21 we contributed patches to do this for the
| execution tracer. For the upcoming go1.23 release, my
| colleague Nick contributed patches to upgrade the block and
| mutex profiler. Once the go1.24 tree opens, we're hoping to
| tackle the memory profiler as well as copystack. The latter
| would benefit all Go programs, even those not using
| profiling. But it's likely going to be a relatively small win
| (<= 1%).
|
| Once all of this is done, shadow stacks have the potential to
| make things even faster. But the problem is that we'll be
| deeply in diminishing returns territory at that point.
| Speeding up stack capturing is great when it makes up 80-90%
| of your overhead (this was the case for the execution tracer
| before frame pointers). But once we're down to 1-2% (the
| current situation for the execution tracer), another 8x
| speedup is not going to buy us much, especially when it has
| downsides.
|
| The only future in which shadow stacks could speed up real Go
| programs is one where we decide to drop frame pointer support
| in the compiler, which could provide 1-2% speedup for all Go
| programs. Once hardware shadow stacks become widely available
| and accessible, I think that would be worth considering. But
| that's likely to be a few years down the road from now.
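For context, the portable way to capture a stack trace in Go today is `runtime.Callers`; it is this operation that frame pointer unwinding (and, hypothetically, a shadow stack) speeds up inside the runtime. A minimal sketch, not taken from the article:

```go
package main

import (
	"fmt"
	"runtime"
)

// captureStack grabs up to max return addresses for the calling
// goroutine. This is the portable API; how the runtime walks the
// stack underneath (gopclntab, frame pointers, ...) is an
// implementation detail.
func captureStack(max int) []uintptr {
	pcs := make([]uintptr, max)
	// Skip runtime.Callers itself and captureStack.
	n := runtime.Callers(2, pcs)
	return pcs[:n]
}

func main() {
	// Expand the raw PCs into symbolic frames, inlined calls included.
	frames := runtime.CallersFrames(captureStack(32))
	for {
		frame, more := frames.Next()
		fmt.Println(frame.Function)
		if !more {
			break
		}
	}
}
```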
| aerfio wrote:
| Do you think/know of any areas in the Go codebase that would
| enable a jump in performance bigger than, e.g., 10%? I'm very
| grateful for any work done on the Go codebase; for me this
| language is plenty fast. I'm just curious about the state of
| Go's internals: are there any techniques left to speed it up
| significantly, or are some parts of the codebase/old
| architectures holding it back? And thank you for your work!
| felixge wrote:
| I don't think any obvious 10%+ opportunities have been
| overlooked. Go is optimizing for fast and simple builds,
| which is a bit at odds with optimal code gen. So I think
| the biggest opportunity is to use Go implementations that
| are based on aggressively optimizing compilers such as
| LLVM and GCC. But those implementations tend to be a few
| major versions behind and are likely to be less stable
| than the official compiler.
|
| That being said, I'm sure there are a lot of remaining
| incremental optimization opportunities that could add up
| to 10% over time. For example, a faster map implementation
| [1]. I'm sure there is more.
|
| Another recent perf opportunity is using pgo [2] which
| can get you 10% in some cases. Shameless plug: We
| recently GA'ed our support for it at Datadog [3].
|
| [1] https://github.com/golang/go/issues/54766 [2]
| https://go.dev/doc/pgo [3]
| https://www.datadoghq.com/blog/datadog-pgo-go/
| dolmen wrote:
| Reminder: the Go team considers code-generation optimizations
| only insofar as the compiler stays fast. That's why the Go
| compiler doesn't have as many optimization phases as
| C/C++/Rust compilers.
|
| As a developer I like that approach: it keeps the developer
| experience great, helps me stay focused, and keeps me
| productive.
| neonsunset wrote:
| Go's limitation is that it's a high-level language with a very
| simple compiler where providing true zero-cost
| abstractions (full monomorphization rather than GC
| stenciling) and advanced optimizations is a bridge it
| wouldn't cross because it means much greater engineering
| effort spent on the compiler and increasing LOCs by a
| factor of 5, especially if a compiler wants to preserve
| its throughput.
|
| Though I find it unfortunate that the industry considers
| Go as a choice for _performance-sensitive_ scenarios when
| C# exists, which went the above route and does not
| sacrifice performance or the ability to offer performance-
| specific APIs (like cross-platform SIMD), at the price of a
| higher-effort/complexity compiler implementation. It
| also does in-runtime PGO (DynamicPGO) given long-running
| server workloads are usually using JIT where it's
| available, so you don't need to carefully craft a sample
| workload hoping it would match production behavior - JIT
| does it for you and it yields anything from 10% to 35%
| depending on how abstraction-heavy the codebase is.
| nikolayasdf123 wrote:
| Did you get any response from the Go core team? It might make
| sense to open an issue on GitHub
| (https://github.com/golang/go/issues/) and start a thread in
| https://groups.google.com/g/golang-dev
| felixge wrote:
| I know that at least two engineers from the runtime team have
| seen the post in the #darkarts channel of gopher slack. One
| of them left a fire emoji :).
|
| I'll probably bring it up in the bi-weekly Go runtime
| diagnostics sync [1] next Thursday, but my guess is that
| they'll have the same conclusion as me: Neat trick, but not a
| good idea for the runtime until hardware shadow stacks become
| widely available and accessible.
|
| [1] https://github.com/golang/go/issues/57175
| 10000truths wrote:
| In an ideal world, we'd be using an ABI that keeps the code
| (return addresses) and data (spilled function params/locals) in
| two separate stacks. Then the profiling would be as
| straightforward as a memory copy of the code stack.
| felixge wrote:
| That's what hardware shadow stacks in modern Intel/Arm CPUs can
| do! It just needs to be exposed to user space and become widely
| available.
| fweimer wrote:
| Fedora 40 and later have shadow stack support in userspace.
| It's currently opt-in with glibc (`export
| GLIBC_TUNABLES=glibc.cpu.x86_shstk=on` is one way to switch
| it on I believe). The plan is to make this self-tuning
| eventually in glibc upstream, once the quirks have been
| ironed out.
|
| It will not work with Go as-is because the Go scheduler will
| have to be taught to switch the shadow stack along with the
| regular stack, and panic/recover needs to walk the shadow
| stack.
| But for binaries that do not use CGO, it would be possible to
| enable this fairly quickly. Hardware support is already
| widely available. The SHSTK-specific code paths are well-
| isolated. You would not lose compatibility with older CPUs or
| kernels.
| felixge wrote:
| Thanks for the reply!
|
| What does the API for accessing the shadow stack from user
| space look like? I didn't see anything for it in the kernel
| docs [1].
|
| I agree about the need for switching the shadow stacks in
| the Go scheduler. But this would probably require an API
| that is a bit at odds with the security goals of the kernel
| feature.
|
| I'm not sure I follow your thoughts on CGO and how this
| would work on older CPUs and kernels.
|
| [1] https://docs.kernel.org/next/x86/shstk.html
| fweimer wrote:
| You can get the shadow stack pointer using the RDSSPQ
| instruction. The kernel documentation shows how the
| shadow stack size is obtained for the main thread.
| Threads created explicitly using clone3 have the
| specified shadow stack size. I think this is sufficient
| to determine the shadow stack boundaries.
|
| Regarding older CPUs, what I wanted to point out is that
| the code to enable and maintain shadow stacks will not be
| smeared across the instruction stream (unlike using APX
| instructions, or direct use of LSE atomics on AArch64).
| It's possible to execute the shadow stack code only
| conditionally.
| felixge wrote:
| Thank you so much, this is very helpful and interesting.
| I'll try to experiment with this at some point.
| fweimer wrote:
| Glad to be of help.
|
| Note that it's actually
| GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK to enable it with
| Fedora 40 glibc (and the program needs to be built with
| -fcf-protection).
| mmaniac wrote:
| Would that not complicate function prologues & epilogues?
|
| I suppose hardware support (e.g. dedicated call/ret
| instructions accessing a second stack pointer) is also
| desirable if you want to do this sort of thing. I recall that
| Itanium had something like this.
| Joker_vD wrote:
| How would it complicate the prologue/epilogue? It would still
| be:
|
|     push RBP
|     mov  RBP, RDSP         ; DSP stands for "data-stack pointer"
|     sub  RDSP, <frame-size>
|     ...                    ; locals are [RBP-n], args are [RBP+n]
|     add  RDSP, <frame-size>
|     pop  RBP
|     ret                    ; pops address from [RCSP], where
|                            ; CSP stands for "call-stack pointer"
|
| or even, if you don't want a specific base pointer, like this:
|
|     sub  RDSP, <frame-size>
|     ...                    ; locals are [RDSP+n],
|                            ; args are [RDSP+frame_size+n]
|     add  RDSP, <frame-size>
|     ret
|
| In fact, you can do this even now on x86-64: make a second
| stack segment, point RBP there, and use RBP as if it were
| that hypothetical RDSP register. The return addresses
| naturally go into the RSP-pointed area, as they always did.
| The only tricky part is when you need to call the code that
| uses SysV x64 ABI, you'll need to conditionally re-align that
| RSP-pointed area to be 16-byte aligned, which is a bit
| annoying (in your own code, having RBP always aligned at 16
| bytes is trivial since function calls don't break that
| alignment).
|
| There are two main benefits of this layout, as I see it:
| first, smashing the return address is now way harder than it
| normally is; second, "the act of traversing all call frames
| on the stack and collecting the return addresses that are
| stored inside of them" is now trivial, since the call stack
| at RSP is now just a dense array of just the return
| addresses, no pointer chasing needed.
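The "dense array vs. pointer chasing" distinction above can be sketched in Go (a simulation of the two layouts, not actual unwinder code): a frame-pointer walk does one dependent load per frame, while a shadow-stack capture is a single copy.

```go
package main

import "fmt"

// frame mimics a conventional stack frame: the saved base pointer
// links frames together, so unwinding must chase pointers.
type frame struct {
	caller  *frame // saved RBP of the calling frame
	retAddr uintptr
}

// walkFramePointers collects return addresses by chasing the
// saved-base-pointer chain: one dependent load per frame.
func walkFramePointers(top *frame) []uintptr {
	var pcs []uintptr
	for f := top; f != nil; f = f.caller {
		pcs = append(pcs, f.retAddr)
	}
	return pcs
}

// walkShadowStack models the two-stack layout from the comment
// above: the call stack is a dense array of return addresses, so
// capturing it is a single memory copy.
func walkShadowStack(callStack []uintptr) []uintptr {
	pcs := make([]uintptr, len(callStack))
	copy(pcs, callStack)
	return pcs
}

func main() {
	// Build an equivalent 3-deep call chain in both layouts.
	f3 := &frame{caller: nil, retAddr: 0x100}
	f2 := &frame{caller: f3, retAddr: 0x200}
	f1 := &frame{caller: f2, retAddr: 0x300}
	fmt.Println(walkFramePointers(f1))                           // [768 512 256]
	fmt.Println(walkShadowStack([]uintptr{0x300, 0x200, 0x100})) // [768 512 256]
}
```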
| jerf wrote:
| And Go is already using a non-standard ABI and has to take
| real action to convert back to standard ABIs if you want to
| call external code. That's part of why cgo calls are
| expensive; the other major bit is having to interact with OS
| threads differently than native Go code does.
| friendzis wrote:
| Isn't that what mainly differentiates Harvard from Von Neumann?
| immibis wrote:
| No. They store code and data in separate address spaces. That
| isn't the same as storing code addresses and data addresses
| in separate parts of the data space.
| eru wrote:
| Yes. And all four combinations of Harvard / von-Neumann X
| single stack / double stack are potentially viable options.
|
| Btw, I think Forth has a separate data stack and return
| address stack?
| p_l wrote:
| Yes. You can also freely use the return stack, IIRC, so
| long as you don't mess up a return itself.
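A toy illustration of that two-stack discipline in Go (hypothetical, not a real Forth): operands live on a data stack, while `>R`/`R>` park values on the return stack, which is safe as long as usage is balanced before a return.

```go
package main

import "fmt"

// vm is a minimal Forth-flavoured sketch with the two stacks kept
// separate, as in the Forth comment above.
type vm struct {
	data []int // operand (data) stack
	ret  []int // return stack, also usable as scratch space
}

func (m *vm) push(v int) { m.data = append(m.data, v) }

func (m *vm) pop() int {
	v := m.data[len(m.data)-1]
	m.data = m.data[:len(m.data)-1]
	return v
}

// toR and fromR mirror Forth's >R and R>: temporarily stashing a
// value on the return stack is fine as long as it's popped back
// before the enclosing word returns.
func (m *vm) toR() { m.ret = append(m.ret, m.pop()) }

func (m *vm) fromR() {
	v := m.ret[len(m.ret)-1]
	m.ret = m.ret[:len(m.ret)-1]
	m.push(v)
}

func main() {
	m := &vm{}
	m.push(2)
	m.push(3)
	m.toR() // stash 3 on the return stack
	m.push(40)
	m.fromR()                                // bring 3 back
	fmt.Println(m.pop() + m.pop() + m.pop()) // 3 + 40 + 2 = 45
}
```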
| malkia wrote:
| You can totally access these stacks from userspace -
| https://github.com/gabriellandau/ShadowStackWalk
| felixge wrote:
| That seems to be windows only? My main target OS is Linux.
| malkia wrote:
| Ah - sorry, now I understand. Also TIL: this might not be
| accessible right away on Linux unless certain configuration
| is set. Thanks!
| naasking wrote:
| How common is it for call stacks to be deeper than 32 frames,
| where these techniques cross over in the benchmark? I don't
| program in Go, but it doesn't seem that common in the
| languages I've used, so I'm skeptical it's worth it.
| malkia wrote:
| On Windows, I often see call stacks getting quite deep,
| especially in C++ debug builds where calls are no longer
| inlined.
|
| On top of that, a work-stealing task scheduler that reassigns
| your thread to another task while the current one waits on
| I/O can make the stack balloon even more.
| naasking wrote:
| Can you clarify what you mean by "going down a lot"? I
| interpret that to mean that the stacks are flatter, but then
| debug builds without inlining would grow the stack, not
| flatten it, so I'm not clear on what you're saying.
|
| Scheduling tasks on a work-stealing queue tends to flatten
| the stack first, eg. a halted task has a nearly empty stack
| if it's an async coroutine. The scheduler itself may have 3-4
| calls deep before resuming the task itself, so that doesn't
| seem too bad. Unless you're talking about full blown threads.
| malkia wrote:
| Sorry, what I meant to say is that inlined calls become
| individual functions, so the number of call-stack entries
| grows.
|
| But then, if you capture the stack (using sampling) and
| catch, for example, a thread context switch while CreateFileA
| is executing, you can expect many call-stack entries going
| all the way down to the kernel.
| felixge wrote:
| I'm skeptical that it's worth it myself, this was just a fun
| research project for me. But once hardware shadow stacks are
| available, I think this could be great.
|
| To answer your first question: For most Go applications, the
| average stack trace depth for profiling/execution tracing is
| below 32 frames. But some applications use heavy middleware
| layers that can push the average above this limit.
|
| That being said, I think this technique will amortize much
| earlier when the fixed cost per frame walk is higher, e.g. when
| using DWARF or gopclntab unwinding. For Go that doesn't really
| matter while the compiler emits frame pointers. But it's always
| good to have options when it comes to evolving the compiler and
| runtime ...
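A quick way to check where your own workload sits relative to that ~32-frame crossover is to count frames with `runtime.Callers`. A small sketch (assumption: logical frame counts, with inlined frames expanded, are the relevant measure):

```go
package main

import (
	"fmt"
	"runtime"
)

// depth counts the frames runtime.Callers sees from the current
// call site (inline-expanded frames included).
func depth() int {
	pcs := make([]uintptr, 128)
	return runtime.Callers(0, pcs)
}

// recurse adds n extra frames on top of its own before measuring,
// simulating deeper middleware stacks.
func recurse(n int) int {
	if n == 0 {
		return depth()
	}
	return recurse(n - 1)
}

func main() {
	base := depth()
	deep := recurse(10)
	fmt.Println(base, deep, deep-base)
}
```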
___________________________________________________________________
(page generated 2024-05-30 23:01 UTC)