[HN Gopher] CPU cache-friendly data structures in Go
___________________________________________________________________
CPU cache-friendly data structures in Go
Author : g0xA52A2A
Score : 150 points
Date : 2025-10-06 06:23 UTC (3 days ago)
(HTM) web link (skoredin.pro)
(TXT) w3m dump (skoredin.pro)
| truth_seeker wrote:
| > False Sharing : "Pad for concurrent access: Separate goroutine
| data by cache lines"
|
 | This would be worth adding to the Go race detector as a
 | warning for developers.
| solatic wrote:
 | Most modern processor architectures use 64-byte CPU cache
 | lines, but not all of them. Once you start applying
 | performance optimizations like tuning for cache line size,
 | you're fundamentally optimizing for a particular processor
| architecture.
|
| That's fine for _most_ deployments, since the vast majority of
| deployments will go to x86_64 or arm64 these days. But Go
| supports PowerPC, Sparc, RISCV, S390X... I don 't know enough
| about them, but I wouldn't be surprised if they weren't all
| 64-byte CPU cache lines. I can understand how a language
| runtime that is designed for architecture independence has
| difficulty with that.
| dadkins wrote:
| The big two, x86_64 and arm64, have 64-byte cache lines, so
| that's a reasonable assumption in practice. But I was
| surprised to discover that Apple's M-series laptops have
| 128-byte cache lines, and that's something a lot of people
| have and run, albeit not as a server.
| pstuart wrote:
| Seems like judicious build tag/file extensions would allow
| for such optimizations with a fallback to no optimization.
| danudey wrote:
| Something like C++17's
| `std::hardware_destructive_interference_size` would be nice;
| being able to just say "Align this variable to whatever the
| cache line size is on the architecture I'm building for".
|
| If you use these tricks to align everything to 64-byte
| boundaries you'll see those speedups on most common systems
| but lose them on e.g. Apple's ARM64 chips, and POWER7, 8, and
| 9 chips (128 byte cache line), s390x (256 byte cache line),
| etc. Having some way of doing the alignment dynamically based
| on the build target would be optimal.
| loeg wrote:
| Apple arm64 supposedly has 64-byte L1 cache line size and
| 128-byte L2? How does that work? Presumably the lines are
| independent in L1, but can different cores have exclusive
| access to adjacent lines? What's the point of narrower
| lines in L1?
| senderista wrote:
| Only landed in clang last year:
| https://github.com/llvm/llvm-project/pull/89446
| loeg wrote:
| On which architecture are cache lines not 64 bytes? It's
| almost universal.
| readthenotes1 wrote:
| I wonder how many nanoseconds it'll take for the next maintainer
| to obliterate the savings?
| hu3 wrote:
| That's just one prompt away!
| jasonthorsness wrote:
| For low-level small things tests can help. Go has good
| benchmarking built-in and you can use a tool that passes/fails
| based on statistically-significant regressions, benchstat
| (https://pkg.go.dev/golang.org/x/perf/cmd/benchstat).
| wy1981 wrote:
| Looks nice. Some explanation for those of us not familiar with Go
| would've been more educational. Could be future posts, I suppose.
| danudey wrote:
| Honestly I don't think there's much in here that's Go-specific
| other than the syntax itself. I've seen basically the same
| tricks used in C or C++.
|
| Was there any particular part that felt like it needed more
| explanation?
| gethly wrote:
| Most of this should be handled by the compiler already. But it is
| only 2025, I guess we're just not ready for it.
| pphysch wrote:
| Not really, virtually all these patterns involve tradeoffs that
| require understanding the data access patterns.
|
| I don't want my compiler adding more padding than bare minimum
| to every struct. I don't want it transforming an AoS to SoA
| when I choose AoS to match data access patterns. And so on...
|
 | At best Go could add some local directives for these
 | optimizations, but the code changes are really minimal
 | anyway. I would rather see the padding explicitly than some
| abstract directive.
| danudey wrote:
 | I could imagine some kind of compiler directive in C that
| would do something like specify break points - sort of like
| page breaks - for structs, or tell the compiler to
| automatically pad structs out so that components are on page
| boundaries, cache line boundaries, etc. Sort of "If we're not
| properly aligned, add whatever padding you think is best
| here".
|
| I guess this is largely provided by
| std::hardware_destructive_interference_size in C++17, but I'm
| not sure if there are other language equivalents.
|
 | https://en.cppreference.com/w/cpp/thread/hardware_destructiv...
| mtklein wrote:
 | I think this is _Alignas/alignas:
 |
 |     struct foo {
 |         _Alignas(64) float x, y;
 |         _Alignas(64) int z;
 |     };
 |     _Static_assert(sizeof(struct foo) == 192, "");
| truth_seeker wrote:
| or at least Linter should catch this
|
| https://golangci-lint.run/docs/linters/
| CamouflagedKiwi wrote:
| I think it's really beyond the power of a linter to
| understand when this would matter. It'd be issuing warnings
| on almost every struct out there saying "these two members
| share a cache line" which you almost never care about.
| jerf wrote:
| Are you thinking of some sort of annotation the compiler could
| read and handle?
|
| Because if a compiler starts automatically padding all my
| structures to put all of the members on their own cache line
| I'm going to be quite peeved. It would be easy for it to do,
| yes, but it would be wrong 99%+ of the time.
|
| A far more trenchant complaint is that Go won't automatically
| sort struct members if necessary to shrink them and you have to
| apply a linter to get that at linting time if you want it.
| danudey wrote:
| I'm not sure if golang has the same fundamental issues in
| common use, but in e.g. C you don't want the compiler
| reordering your structs or adding arbitrary padding because
| that makes it incompatible with other in-memory representations
| - e.g. if you're using shared memory with another process that
| hasn't received the same optimizations, if you're loading raw
| data into memory/using mmap, etc.
|
| Likewise, one of the examples is moving from an array of
| structs to a struct of arrays; that's a lot more complex of a
| code reorganization than you'd want a compiler doing.
|
| It would be good to have a static analyzer that could suggest
| these changes, but, at least in many cases, you don't want them
| done automatically.
| Thaxll wrote:
 | Do compilers (GCC/LLVM) actually do that?
| jandrewrogers wrote:
 | How would that even work? The layout of data structures is
 | constrained by many invariants not visible to the compiler (see
| also: auto-vectorization). It would be more work and
| boilerplate to add sufficient annotations to a data structure
| to enable the compiler to safely modify the layout than just
| using the layout you want.
| tuetuopay wrote:
| Overall great article, applicable to other languages too.
|
| I'm curious about the Goroutine pinning though:
 |     // Pin goroutine to specific CPU
 |     func PinToCPU(cpuID int) {
 |         runtime.LockOSThread()
 |         // ...
 |         tid := unix.Gettid()
 |         unix.SchedSetaffinity(tid, &cpuSet)
 |     }
|
| The way I read this snippet is it pins the go runtime thread that
| happens to run this goroutine to a cpu, not the goroutine itself.
| Afaik a goroutine can move from one thread to another, decided by
| the go scheduler. This obviously has some merits, however without
| pinning the actual goroutine...
| superb_dev wrote:
| `runtime.LockOSThread()` will pin the current goroutine to the
| os thread that its currently running on
| tuetuopay wrote:
| Oooh, that's what is happening. I assumed it locked some
| structure about the thread while touching it, to prevent
| races with the runtime. That's what I get for not RTFM'ing
| (in fairness, why Lock and not Pin, when both of these words
| have pretty well defined meanings in programming?)
|
| Thank you
| Someone wrote:
| And prevents other goroutines from running on that thread. I
| think that's crucial.
| hu3 wrote:
| > False sharing occurs when multiple cores update different
| variables in the same cache line.
|
| I got hit by this. In a trading algorithm backtest, I shared a
| struct pointer between threads that changed different members of
| the same struct.
|
| Once I split this struct in 2, one per core, I got almost 10x
| speedup.
| spockz wrote:
| Interesting! Did you find out a way to bench this with the
| built in benchmarking suite?
| hu3 wrote:
 | No, it was just a hunch derived from ancient times of SoA vs
| AoS game dev optimization. I didn't need the perf gain but
| tried and it worked.
| tapirl wrote:
| Source code of the benchmarks?
|
 | At least, the False Sharing and AddVectors tricks don't work on my
 | computer. (I only benchmarked those two. The "Data-Oriented Design"
| trick is a joke to me, so I stopped benchmarking more.)
|
 | And I've never heard of the following trick. Can anyone explain it?
 |     // Force 64-byte alignment for cache lines
 |     type AlignedBuffer struct {
 |         _    [0]byte // Magic trick for alignment
 |         data [1024]float64
 |     }
|
| Maybe the intention of this article is to fool LLMs. :D
| kbolino wrote:
| I can't find any claim anywhere else about the [0]byte trick,
| and in fact my own experiments in the playground show that it
| doesn't do anything.
|
| If you embed an AlignedBuffer in another struct type, with
| smaller fields in front of it, it doesn't get 64-byte
| alignment.
|
| If you directly allocate an AlignedBuffer (as a stack var or
| with new), it seems to end up page-aligned (the allocator
| probably has size classes) regardless of the presence of the
| [0]byte field.
|
| https://go.dev/play/p/Ok7fFk3uhDn
|
 | Example output (w is a wrapper, w.b is the field in the
 | wrapper, x is an allocated int32 to try to push the heap base
 | forward, b is an allocated AlignedBuffer):
 |
 |     &w   = 0xc000126000
 |     &w.b = 0xc000126008
 |     &x   = 0xc00010e020
 |     &b   = 0xc000138000
|
| Take out the [0]byte field and the results look similar.
| skinowski wrote:
 | They meant "[0]uint64" probably, not "[0]byte".
| tapirl wrote:
| "[0]uint64" only guarantees 8-byte alignment (and only
| under certain scenarios), not 64-byte.
| OutOfHere wrote:
 | If you are worrying about cache structure latencies in Go,
 | maybe you should just be using Rust or Zig instead, which
 | handle this better.
| nasretdinov wrote:
| Not necessarily: you can go quite far with Go alone. It also
| makes it trivial to run "green threads" code, so if you need
| both (decent) performance and easy async code then Go still
 | might be a good fit. Despite Go being a pretty high-level GC
 | language on the surface, it actually allows you to control stuff
| like struct layout, CPU affinity, etc, which typically matter
| more for performance than just a programming language of
| choice. There's a reason why e.g. VictoriaMetrics is still in
| Go even though they could've easily chosen any other language
| too
| OutOfHere wrote:
| In what way does Go have async?
| all2 wrote:
| Aren't goroutines by their nature asynchronous? Am I
| misunderstanding what you mean by 'async'?
| OutOfHere wrote:
| _Asynchronous_ is a programming style; it does NOT apply
 | to Go. The goroutines run in parallel. Also, don't use
| complicated words when simple words will do.
| all2 wrote:
 | > Asynchronous is a programming style; it does NOT apply to
 | > Go.
 |
 | Ok, good to know. I guess I jammed threading and async
 | into the same slot in my brain.
 |
 | > Also, don't use complicated words when simple words will
 | > do.
 |
 | I'm not sure what you mean by this in relation to my
 | above comment.
| aforwardslash wrote:
| In golang, there is no guarantee that goroutines will run
| in parallel; also, it is quite common to use channels as
| means of synchronization of results, akin to common async
| programming patterns.
| OutOfHere wrote:
| > In golang, there is no guarantee that goroutines will
| run in parallel;
|
| That's not specific to Go lang. Most of the constraints
| you're thinking of apply to all parallel programming
| languages. It goes with the territory. All parallel
| programming languages impose certain flavors of their
| management of parallelism.
| citizenpaul wrote:
 | Really cool article, this is the kind of stuff I still come to HN
| for.
| jasonthorsness wrote:
| If you are sweating this level of performance, are larger gains
| possible by switching to C, C++, Rust? How is Rust for micro-
| managing memory layouts?
| loeg wrote:
| You need to do the exact same kinds of thing in C/C++/Rust. I
| believe Rust struct layout is not guaranteed to match program
| order unless you use an annotation forcing it (repr(C)). (So to
| answer the question: it's great; as good as any other language
| for micromanaging layout.)
| 0x457 wrote:
 | Yes, without repr(C), order and padding aren't guaranteed. You would
 | use https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struc...
 | or similar to force fields not to share a cache line.
| luispa wrote:
| great article!
| ls-a wrote:
| reminds me of cache oblivious data structures
| gr4vityWall wrote:
| Good article.
|
| Regarding AoS vs SoA, I'm curious about the impact in JS engines.
| I believe it would be a significant compute performance
| difference in favor of SoA if you use typed arrays.
| matheusmoreira wrote:
| Structure of arrays makes a lot of sense, reminds me of how old
| video games worked under the hood. It seems very difficult to
| work with though. I'm so used to packing things into neat little
| objects. Maybe I just need to tough it out.
| ardanur wrote:
| "Data Oriented Design" is more than just for performant code.
|
| You can and perhaps should also use it to reason about and design
| software in general. All software is just the transformation of
| data structures. Even when generating side-effects is the goal,
| those side-effects consume data structures.
|
| I generally always start a project by sketching out data
 | structures all the way from the input to the output. This may
 | get much harder when the input and output become series of
 | different sizes and temporal orders, and when there are other
 | complexities in what the software is supposed to be doing.
| loeg wrote:
| This is a really dense whirlwind summary of some common
| performance pitfalls. It's a nice overview in a sort of terse
| way. The same optimizations / patterns apply in other languages
| as well.
___________________________________________________________________
(page generated 2025-10-09 23:01 UTC)