[HN Gopher] CPU cache-friendly data structures in Go
       ___________________________________________________________________
        
       CPU cache-friendly data structures in Go
        
       Author : g0xA52A2A
       Score  : 150 points
       Date   : 2025-10-06 06:23 UTC (3 days ago)
        
 (HTM) web link (skoredin.pro)
 (TXT) w3m dump (skoredin.pro)
        
       | truth_seeker wrote:
       | > False Sharing : "Pad for concurrent access: Separate goroutine
       | data by cache lines"
       | 
        | This would be worth adding to the Go race detector as a
        | mechanism to warn developers.
        
         | solatic wrote:
          | Most modern processor architectures have 64-byte CPU cache
          | lines, but not all of them. Once you start adding performance
          | optimizations like tuning for cache line size, you're
          | fundamentally optimizing for a particular processor
          | architecture.
         | 
         | That's fine for _most_ deployments, since the vast majority of
         | deployments will go to x86_64 or arm64 these days. But Go
          | supports PowerPC, Sparc, RISCV, S390X... I don't know enough
         | about them, but I wouldn't be surprised if they weren't all
         | 64-byte CPU cache lines. I can understand how a language
         | runtime that is designed for architecture independence has
         | difficulty with that.
        
           | dadkins wrote:
           | The big two, x86_64 and arm64, have 64-byte cache lines, so
           | that's a reasonable assumption in practice. But I was
           | surprised to discover that Apple's M-series laptops have
           | 128-byte cache lines, and that's something a lot of people
           | have and run, albeit not as a server.
        
           | pstuart wrote:
           | Seems like judicious build tag/file extensions would allow
           | for such optimizations with a fallback to no optimization.
        
           | danudey wrote:
           | Something like C++17's
           | `std::hardware_destructive_interference_size` would be nice;
           | being able to just say "Align this variable to whatever the
           | cache line size is on the architecture I'm building for".
           | 
           | If you use these tricks to align everything to 64-byte
           | boundaries you'll see those speedups on most common systems
           | but lose them on e.g. Apple's ARM64 chips, and POWER7, 8, and
           | 9 chips (128 byte cache line), s390x (256 byte cache line),
           | etc. Having some way of doing the alignment dynamically based
           | on the build target would be optimal.
        
             | loeg wrote:
             | Apple arm64 supposedly has 64-byte L1 cache line size and
             | 128-byte L2? How does that work? Presumably the lines are
             | independent in L1, but can different cores have exclusive
             | access to adjacent lines? What's the point of narrower
             | lines in L1?
        
             | senderista wrote:
             | Only landed in clang last year:
             | https://github.com/llvm/llvm-project/pull/89446
        
           | loeg wrote:
           | On which architecture are cache lines not 64 bytes? It's
           | almost universal.
        
       | readthenotes1 wrote:
       | I wonder how many nanoseconds it'll take for the next maintainer
       | to obliterate the savings?
        
         | hu3 wrote:
         | That's just one prompt away!
        
         | jasonthorsness wrote:
         | For low-level small things tests can help. Go has good
         | benchmarking built-in and you can use a tool that passes/fails
         | based on statistically-significant regressions, benchstat
         | (https://pkg.go.dev/golang.org/x/perf/cmd/benchstat).
        
       | wy1981 wrote:
       | Looks nice. Some explanation for those of us not familiar with Go
       | would've been more educational. Could be future posts, I suppose.
        
         | danudey wrote:
         | Honestly I don't think there's much in here that's Go-specific
         | other than the syntax itself. I've seen basically the same
         | tricks used in C or C++.
         | 
         | Was there any particular part that felt like it needed more
         | explanation?
        
       | gethly wrote:
       | Most of this should be handled by the compiler already. But it is
       | only 2025, I guess we're just not ready for it.
        
         | pphysch wrote:
         | Not really, virtually all these patterns involve tradeoffs that
         | require understanding the data access patterns.
         | 
         | I don't want my compiler adding more padding than bare minimum
         | to every struct. I don't want it transforming an AoS to SoA
         | when I choose AoS to match data access patterns. And so on...
         | 
         | At best Go could add some local directives for compiling these
         | optimizations, but these code changes are really minimal
         | anyways. I would rather see the padding explicitly than some
         | abstract directive.
        
           | danudey wrote:
            | I could imagine some kind of compiler directive in C that
           | would do something like specify break points - sort of like
           | page breaks - for structs, or tell the compiler to
           | automatically pad structs out so that components are on page
           | boundaries, cache line boundaries, etc. Sort of "If we're not
           | properly aligned, add whatever padding you think is best
           | here".
           | 
           | I guess this is largely provided by
           | std::hardware_destructive_interference_size in C++17, but I'm
           | not sure if there are other language equivalents.
           | 
            | https://en.cppreference.com/w/cpp/thread/hardware_destructiv...
        
             | mtklein wrote:
              | I think this is _Alignas/alignas.
              | 
              |     struct foo {
              |         _Alignas(64) float x, y;
              |         _Alignas(64) int z;
              |     };
              |     _Static_assert(sizeof(struct foo) == 192, "");
        
         | truth_seeker wrote:
          | or at least a linter should catch this
         | 
         | https://golangci-lint.run/docs/linters/
        
           | CamouflagedKiwi wrote:
           | I think it's really beyond the power of a linter to
           | understand when this would matter. It'd be issuing warnings
           | on almost every struct out there saying "these two members
           | share a cache line" which you almost never care about.
        
         | jerf wrote:
         | Are you thinking of some sort of annotation the compiler could
         | read and handle?
         | 
         | Because if a compiler starts automatically padding all my
         | structures to put all of the members on their own cache line
         | I'm going to be quite peeved. It would be easy for it to do,
         | yes, but it would be wrong 99%+ of the time.
         | 
          | A far more trenchant complaint is that Go won't automatically
          | sort struct members to shrink them; you have to apply a linter
          | at lint time if you want that.
        
         | danudey wrote:
         | I'm not sure if golang has the same fundamental issues in
         | common use, but in e.g. C you don't want the compiler
         | reordering your structs or adding arbitrary padding because
         | that makes it incompatible with other in-memory representations
         | - e.g. if you're using shared memory with another process that
         | hasn't received the same optimizations, if you're loading raw
         | data into memory/using mmap, etc.
         | 
         | Likewise, one of the examples is moving from an array of
         | structs to a struct of arrays; that's a lot more complex of a
         | code reorganization than you'd want a compiler doing.
         | 
         | It would be good to have a static analyzer that could suggest
         | these changes, but, at least in many cases, you don't want them
         | done automatically.
        
         | Thaxll wrote:
          | Do compilers (GCC/LLVM) actually do that?
        
         | jandrewrogers wrote:
          | How would that even work? The layout of data structures is
          | constrained by many invariants not visible to the compiler (see
         | also: auto-vectorization). It would be more work and
         | boilerplate to add sufficient annotations to a data structure
         | to enable the compiler to safely modify the layout than just
         | using the layout you want.
        
       | tuetuopay wrote:
       | Overall great article, applicable to other languages too.
       | 
       | I'm curious about the Goroutine pinning though:
        | 
        |     // Pin goroutine to specific CPU
        |     func PinToCPU(cpuID int) {
        |         runtime.LockOSThread()
        |         // ...
        |         tid := unix.Gettid()
        |         unix.SchedSetaffinity(tid, &cpuSet)
        |     }
       | 
       | The way I read this snippet is it pins the go runtime thread that
       | happens to run this goroutine to a cpu, not the goroutine itself.
       | Afaik a goroutine can move from one thread to another, decided by
       | the go scheduler. This obviously has some merits, however without
       | pinning the actual goroutine...
        
         | superb_dev wrote:
            | `runtime.LockOSThread()` will pin the current goroutine to
            | the OS thread it's currently running on
        
           | tuetuopay wrote:
           | Oooh, that's what is happening. I assumed it locked some
           | structure about the thread while touching it, to prevent
           | races with the runtime. That's what I get for not RTFM'ing
            | (in fairness, why Lock and not Pin, when both of these words
            | have pretty well-defined meanings in programming?)
           | 
           | Thank you
        
           | Someone wrote:
           | And prevents other goroutines from running on that thread. I
           | think that's crucial.
        
       | hu3 wrote:
       | > False sharing occurs when multiple cores update different
       | variables in the same cache line.
       | 
       | I got hit by this. In a trading algorithm backtest, I shared a
       | struct pointer between threads that changed different members of
       | the same struct.
       | 
       | Once I split this struct in 2, one per core, I got almost 10x
       | speedup.
        
         | spockz wrote:
         | Interesting! Did you find out a way to bench this with the
         | built in benchmarking suite?
        
           | hu3 wrote:
            | No, it was just a hunch derived from ancient times of SoA vs
           | AoS game dev optimization. I didn't need the perf gain but
           | tried and it worked.
        
       | tapirl wrote:
       | Source code of the benchmarks?
       | 
        | At least, the False Sharing and AddVectors tricks don't work on
        | my computer. (I only benchmarked those two. The "Data-Oriented
        | Design" trick is a joke to me, so I stopped benchmarking.)
       | 
        | And I've never heard of the following trick. Can anyone explain it?
        | 
        |     // Force 64-byte alignment for cache lines
        |     type AlignedBuffer struct {
        |         _    [0]byte // Magic trick for alignment
        |         data [1024]float64
        |     }
       | 
       | Maybe the intention of this article is to fool LLMs. :D
        
         | kbolino wrote:
         | I can't find any claim anywhere else about the [0]byte trick,
         | and in fact my own experiments in the playground show that it
         | doesn't do anything.
         | 
         | If you embed an AlignedBuffer in another struct type, with
         | smaller fields in front of it, it doesn't get 64-byte
         | alignment.
         | 
         | If you directly allocate an AlignedBuffer (as a stack var or
         | with new), it seems to end up page-aligned (the allocator
         | probably has size classes) regardless of the presence of the
         | [0]byte field.
         | 
         | https://go.dev/play/p/Ok7fFk3uhDn
         | 
          | Example output (w is a wrapper, w.b is the field in the
          | wrapper, x is an allocated int32 to try to push the heap base
          | forward, b is an allocated AlignedBuffer):
          | 
          |     &w   = 0xc000126000
          |     &w.b = 0xc000126008
          |     &x   = 0xc00010e020
          |     &b   = 0xc000138000
         | 
         | Take out the [0]byte field and the results look similar.
        
           | skinowski wrote:
            | They probably meant "[0]uint64", not "[0]byte".
        
             | tapirl wrote:
             | "[0]uint64" only guarantees 8-byte alignment (and only
             | under certain scenarios), not 64-byte.
        
       | OutOfHere wrote:
        | If you are worrying about cache structure latencies in Go, maybe
        | you should just be using Rust or Zig instead, which implicitly
        | handle this better.
        
         | nasretdinov wrote:
         | Not necessarily: you can go quite far with Go alone. It also
         | makes it trivial to run "green threads" code, so if you need
         | both (decent) performance and easy async code then Go still
          | might be a good fit. Despite Go being a pretty high-level GC'd
          | language on the surface, it actually allows you to control
          | stuff like struct layout, CPU affinity, etc., which typically
          | matter more for performance than the choice of language
          | itself. There's a reason why e.g. VictoriaMetrics is still in
          | Go even though they could've easily chosen any other language.
        
           | OutOfHere wrote:
           | In what way does Go have async?
        
             | all2 wrote:
             | Aren't goroutines by their nature asynchronous? Am I
             | misunderstanding what you mean by 'async'?
        
               | OutOfHere wrote:
               | _Asynchronous_ is a programming style; it does NOT apply
                | to Go. The goroutines run in parallel. Also, don't use
               | complicated words when simple words will do.
        
               | all2 wrote:
                | > Asynchronous is a programming style; it does NOT apply
                | > to Go.
                | 
                | Ok, good to know. I guess I jammed threading and async
                | into the same slot in my brain.
                | 
                | > Also, don't use complicated words when simple words
                | > will do.
                | 
                | I'm not sure what you mean by this in relation to my
                | above comment.
        
               | aforwardslash wrote:
               | In golang, there is no guarantee that goroutines will run
                | in parallel; also, it is quite common to use channels as
                | a means of synchronizing results, akin to common async
                | programming patterns.
        
               | OutOfHere wrote:
               | > In golang, there is no guarantee that goroutines will
               | run in parallel;
               | 
                | That's not specific to Go. Most of the constraints
                | you're thinking of apply to all parallel programming
                | languages; it goes with the territory. Each imposes its
                | own flavor of parallelism management.
        
       | citizenpaul wrote:
        | Really cool article, this is the kind of stuff I still come to HN
       | for.
        
       | jasonthorsness wrote:
       | If you are sweating this level of performance, are larger gains
       | possible by switching to C, C++, Rust? How is Rust for micro-
       | managing memory layouts?
        
         | loeg wrote:
          | You need to do the exact same kinds of things in C/C++/Rust. I
         | believe Rust struct layout is not guaranteed to match program
         | order unless you use an annotation forcing it (repr(C)). (So to
         | answer the question: it's great; as good as any other language
         | for micromanaging layout.)
        
           | 0x457 wrote:
            | Yes, without repr(C), order and padding aren't guaranteed.
            | You would use
            | https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struc...
            | or similar to keep fields off the same cache line.
        
       | luispa wrote:
       | great article!
        
       | ls-a wrote:
       | reminds me of cache oblivious data structures
        
       | gr4vityWall wrote:
       | Good article.
       | 
       | Regarding AoS vs SoA, I'm curious about the impact in JS engines.
        | I believe there would be a significant compute performance
       | difference in favor of SoA if you use typed arrays.
        
       | matheusmoreira wrote:
       | Structure of arrays makes a lot of sense, reminds me of how old
       | video games worked under the hood. It seems very difficult to
       | work with though. I'm so used to packing things into neat little
       | objects. Maybe I just need to tough it out.
        
       | ardanur wrote:
       | "Data Oriented Design" is more than just for performant code.
       | 
       | You can and perhaps should also use it to reason about and design
       | software in general. All software is just the transformation of
       | data structures. Even when generating side-effects is the goal,
       | those side-effects consume data structures.
       | 
        | I generally start a project by sketching out the data structures
        | all the way from input to output. That gets much harder when the
        | inputs and outputs become series of differing size and temporal
        | order, with other complexities in what the software is supposed
        | to be doing.
        
       | loeg wrote:
       | This is a really dense whirlwind summary of some common
       | performance pitfalls. It's a nice overview in a sort of terse
       | way. The same optimizations / patterns apply in other languages
       | as well.
        
       ___________________________________________________________________
       (page generated 2025-10-09 23:01 UTC)