[HN Gopher] Implementing a Struct of Arrays
       ___________________________________________________________________
        
       Implementing a Struct of Arrays
        
       Author : mpweiher
       Score  : 110 points
       Date   : 2025-05-09 10:52 UTC (12 hours ago)
        
 (HTM) web link (brevzin.github.io)
 (TXT) w3m dump (brevzin.github.io)
        
       | TheHideout wrote:
       | So, basically it's allowing you to use a struct like you would in
       | OOP, but get the array benefits of ECS when that struct is in a
       | vector?
        
         | throwawaymaths wrote:
         | Isn't ECS often Type-Heterogeneous? I think you mean DoD.
        
         | dgb23 wrote:
         | It's not necessary to think about the data interface in terms
         | of object orientation.
         | 
         | You can think about it as being a composition of fields, which
         | are individually stored in their respective array.
         | 
          | (Slightly beside the point: often they are also stored in pairs
          | or larger groups; for example, coordinates, slices and so on are
          | almost always operated on in the same function or block.)
         | 
         | The benefit comes from execution. If you have functions that
         | iterate over these structs, they only need to load the arrays
         | that contain the fields they need.
         | 
         | Zig does some magic here based on comptime to make this happen
         | automatically.
         | 
         | An ECS does something similar at a fundamental level. But
         | usually there's a whole bunch of additional stuff on top, such
         | as mapping ids to indices, registering individual components on
          | the fly, selecting components for entities and so on. So it
         | can be a lot more complicated than what is presented here and
         | more stuff happens at runtime. It is also a bit of a one size
         | fits all kind of deal.
         | 
          | The article recommends watching Andrew Kelley's talk on DoD,
          | which inspired the post. I agree wholeheartedly; it's a very
          | fun and interesting one.
         | 
          | One of the key takeaways for me is that he didn't just slap on
          | a design pattern (like ECS), but went through each piece
          | individually, thinking about memory layout, execution, and the
          | trade-offs of storing information versus recalculating it,
          | doing measurements and back-of-the-envelope calculations, etc.
         | 
         | So the end result is a conglomerate of cleverly applied
         | principles and learnings.
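          | 
          | As a rough illustration of that execution benefit, here is a
          | minimal hand-written C++ sketch (hypothetical names, not the
          | article's reflection-based code). Iterating over health only
          | streams the health column through the cache; the position and
          | name columns are never touched:
          | 
          |     #include <cstddef>
          |     #include <string>
          |     #include <vector>
          | 
          |     // Hand-written struct of arrays: one array per field.
          |     struct Monsters {
          |         std::vector<float>       x, y;   // position columns
          |         std::vector<int>         health;
          |         std::vector<std::string> name;
          |     };
          | 
          |     // Only the `health` array is loaded here.
          |     std::size_t count_dead(const Monsters& m) {
          |         std::size_t n = 0;
          |         for (int hp : m.health)
          |             if (hp <= 0) ++n;
          |         return n;
          |     }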
        
         | jayd16 wrote:
         | More like they used reflection to take a struct and generate a
         | SOA collection for that type. Funnily enough, they skip the
         | part where you can actually get at the arrays and focus on the
         | struct type deconstruction and construction.
        
       | monkeyelite wrote:
       | Interest in SOA is bringing to mind the "art of the meta object
       | protocol" which argues for a stage between class definition and
       | implementation that would allow you to choose the layout and
       | access method for instances of a class.
        
         | corysama wrote:
         | Yep. We're in a situation where C-like languages couple layout
         | and access interface very tightly. But, now cache is such an
         | overriding issue in optimization, you really want to rapidly
         | experiment with different layouts without rewriting your whole
         | algorithm every time. AOS, SOA, AOSOA, hot/cold data for
         | different stages, etc...
         | 
         | Jon Blow's Jai language famously added a feature to references
         | that allowed you to easily experiment with moving data members
         | between hot/cold arrays of structs.
         | 
         | https://halide-lang.org/ tackles a related problem. It
         | decouples the math to be done from the access order so as to
         | allow you to rapidly test looping over data in complicated ways
         | to achieve cache-friendly access patterns for your specific
         | hardware target without rewriting your whole core loop every
         | time.
         | 
          | Halide is primarily about image-processing convolution
          | kernels. I'm not sure how general purpose it can get.
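          | 
          | For a concrete picture of the layouts being juggled, a
          | hand-written sketch (illustrative only, not Jai or Halide code)
          | of the same three fields in AOS, SOA, and AOSOA form:
          | 
          |     #include <vector>
          | 
          |     // AOS: one array of complete structs.
          |     struct ParticleAOS { float x, y, z; };
          |     // std::vector<ParticleAOS> particles;
          | 
          |     // SOA: one array per field.
          |     struct ParticlesSOA {
          |         std::vector<float> x, y, z;
          |     };
          | 
          |     // AOSOA: an array of fixed-size blocks with SIMD-width
          |     // columns inside; 8 is an assumed lane count.
          |     struct ParticleBlock {
          |         float x[8], y[8], z[8];
          |     };
          |     // std::vector<ParticleBlock> particles;
          |     // element i lives in block i/8, lane i%8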
        
           | tialaramex wrote:
            | AIUI Jai no longer has that SOA-versus-AOS functionality
            | which Jon was very proud of when he invented it. I expect
            | it's one of those "hot idea" things where for one week it
            | seems as though this is a breakthrough with applications
            | everywhere, and then a week later you realise it was not as
            | fundamental and you were just seeing applications everywhere
            | due to the recency illusion.
           | 
           | The week after I first saw a Bloom Filter I imagined lots of
           | places this would surely be great, in the decades since I
           | probably have one bona fide use for a Bloom Filter per year,
           | maybe less.
        
             | monkeyelite wrote:
             | Exactly. The SOA thing makes sense in so few cases -
             | usually the one big data type your program operates on.
        
           | jayd16 wrote:
            | I suppose something like this would be considered a violation
            | of OOP...
            | 
            | That said, Java might be able to naturally decompose record
            | types into SoA collections without a big lift. Same for C#
            | structs.
            | 
            | You might even be able to expose view types that you still
            | code against like objects but that point to the backing fields
            | in the SoA collection.
        
         | mpweiher wrote:
         | Yep. Objective-S[1] takes the ideas of "the art of the
         | metaobject protocol" and expands on them...using software
         | architectural principles [2].
         | 
         | One early result was the idea of _storage combinators_ [3], and
         | with those, AoS/SoA pretty much just falls out.
         | 
         | Storage Combinators are basically an interface for data. Any
         | kind of data, so generalized a bit from the variations of
         | "instances of a class" that you get in CLOS's MOP.
         | 
         | If you code against that interface, it doesn't matter how your
         | data is actually stored: objects, dictionaries, computed on-
         | demand, environment variables, files, http servers, whatever.
         | And the "combinator" part means that these can be
         | stacked/combined.
         | 
         | While you can do this using a library, and a lot is in fact
         | just implemented in libraries, you need the language support so
         | that things like objects/structs/variables can be accessed
         | quickly using this mechanism.
         | 
         | SoA storage for tables has been on the list of things to do for
         | a long time. I just started to work on some table ideas, so
         | might actually get around to it Real Soon Now(tm).
         | 
          | Currently I am working on other aspects of the table
          | abstraction; for example, I just integrated queries, so I can
          | write the following:
          | 
          |     invoices[{Amount>30}]
         | 
         | And have that query work the same against an array of objects
         | (also a kind of table) and a SQL database.
         | 
         | [1] https://objective.st/
         | 
         | [2] https://dl.acm.org/doi/10.1145/3689492.3690052
         | 
         | [3] https://2019.splashcon.org/details/splash-2019-Onward-
         | papers...
        
       | sph wrote:
        |     define_aggregate(^^Pointers, nsdms(^^T)
        |       | std::views::transform([](std::meta::info member){
        |           return data_member_spec(add_pointer(type_of(member)),
        |                                   {.name = identifier_of(member)});
        |         }));
       | 
       | What operator is ^^type?
        
         | diath wrote:
         | https://isocpp.org/files/papers/P2996R4.html#the-reflection-...
        
           | tialaramex wrote:
            | Back in R4 this operator was spelled ^, but in the accepted
            | draft it is spelled ^^.
        
         | tialaramex wrote:
         | This is the reflection operator. The committee+ spent some time
         | bikeshedding different squiggles for this operator, but
         | eventually it looks like ^^ won because it was least awful.
         | 
         | + WG21 of SC22 of JTC1 between ISO and the IEC, "the C++
         | committee".
         | 
         | See P2996R11 for the latest draft of the work -
         | https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p29...
        
       | PessimalDecimal wrote:
       | I started thinking this would be a rehashing of how column-
       | oriented storage formats can be far more efficient for certain
       | access patterns. But it really was a showcase of what a bloated
       | mess C++ has become.
        
         | monkeyelite wrote:
         | And you just do not need to be doing this. Just write out which
         | columns you need and use an index. You can write generic sort
         | and remove, and just overload the swap function for your SOA
         | thing.
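          | 
          | A minimal sketch of that manual approach (names are just
          | illustrative, not from the article): keep plain columns, give
          | the container a row swap, and sort through an index when you
          | need an ordering.
          | 
          |     #include <algorithm>
          |     #include <cstddef>
          |     #include <numeric>
          |     #include <utility>
          |     #include <vector>
          | 
          |     struct People {
          |         std::vector<int>   age;
          |         std::vector<float> height;
          | 
          |         // The "overloaded swap": exchange one row across all columns.
          |         void swap_rows(std::size_t i, std::size_t j) {
          |             std::swap(age[i], age[j]);
          |             std::swap(height[i], height[j]);
          |         }
          |     };
          | 
          |     // "Use an index": order rows by one column without moving data.
          |     std::vector<std::size_t> order_by_age(const People& p) {
          |         std::vector<std::size_t> idx(p.age.size());
          |         std::iota(idx.begin(), idx.end(), 0);
          |         std::sort(idx.begin(), idx.end(),
          |                   [&](std::size_t a, std::size_t b) {
          |                       return p.age[a] < p.age[b];
          |                   });
          |         return idx;
          |     }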
        
           | pragma_x wrote:
            | That was my takeaway as well. Pivoting a conventional
            | structure into a column-major storage system just smells like
            | a look-aside of some kind (a database). Plus, you lose the
            | benefits of bulk copy/move that the compiler will likely give
            | you for a standard struct (e.g. memcpy()), should it be on the
            | big side. From there, we can use lots of other tricks to
            | further speed things up: hashes, trees, bloom filters...
            | 
            | At the same time, I don't completely understand how such a
            | pivot would result in a speedup for random access. I suppose
            | it mainly speeds up sequential access; for random access, the
            | multiple-array storage scheme might force many more cache-line
            | updates.
        
             | thesz wrote:
              | These days one's program rarely does purely sequential or
              | purely random access. Usually, there are several threads or
              | requests, all going into the same data structure.
              | 
              | In the extreme case, where all memory accesses of an
              | immensely parallel program are deferred, the result is a
              | complete absence of cache misses [1].
              | 
              | [1] https://disneyanimation.com/technology/hyperion/
              | 
              | While our programs usually don't run at Hyperion scale, they
              | can still benefit from accesses shared between the
              | processing of several requests.
             | 
              | For one such example, consider speech recognition with the
              | HCLG graph. That graph is created through composition of
              | four WFST [2] graphs, the last one being Grammar, derived
              | from an SRILM n-gram language model. This HCLG graph has the
              | scale-free property [3] because all G FST states have to
              | back off to the order-0 model, and some states are visited
              | exponentially more often than others.
              | 
              | [2] https://en.wikipedia.org/wiki/Finite-state_transducer#Weighted_automata
              | [3] https://en.wikipedia.org/wiki/Scale-free_network
             | 
              | By sorting HCLG graph states by their fan-in (the number of
              | states referring to a particular one), we sped the
              | recognition process up by 5%. E.g., the most-referred-to
              | state gets index 0, the second most referred to gets 1,
              | etc., and that's it.
              | 
              | No code change, just a preprocessing step that, I believe,
              | lessens the pressure on the CPU's memory protection
              | subsystem.
             | 
              | The speech recognition process with the HCLG graph uses
              | beam search [4]; there are several (9-20) hundred states in
              | the beam front to be evaluated. Most of them have an
              | outgoing arc to some of the most-visited (highest fan-in)
              | states. By having these states close to each other we incur
              | less page-fault exception handling during processing.
              | 
              | [4] https://en.wikipedia.org/wiki/Beam_search
              | 
              | Basically, we shared more MMU information between the
              | requests in the beam front.
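              | 
              | A rough C++ sketch of that reordering step (illustrative
              | only; not the actual OpenFST/HCLG code, and the arc
              | representation here is assumed):
              | 
              |     #include <algorithm>
              |     #include <cstddef>
              |     #include <numeric>
              |     #include <vector>
              | 
              |     // Relabel states so the most-referenced state becomes 0,
              |     // the next most-referenced becomes 1, and so on.
              |     // arcs[s] holds the target states of state s.
              |     std::vector<std::size_t> relabel_by_fan_in(
              |             const std::vector<std::vector<std::size_t>>& arcs) {
              |         std::vector<std::size_t> fan_in(arcs.size(), 0);
              |         for (const auto& targets : arcs)
              |             for (std::size_t t : targets) ++fan_in[t];
              | 
              |         std::vector<std::size_t> by_popularity(arcs.size());
              |         std::iota(by_popularity.begin(), by_popularity.end(), 0);
              |         std::sort(by_popularity.begin(), by_popularity.end(),
              |                   [&](std::size_t a, std::size_t b) {
              |                       return fan_in[a] > fan_in[b];
              |                   });
              | 
              |         // old_to_new[old_state] = new index; the caller then
              |         // rewrites the graph's arcs with this mapping.
              |         std::vector<std::size_t> old_to_new(arcs.size());
              |         for (std::size_t i = 0; i < by_popularity.size(); ++i)
              |             old_to_new[by_popularity[i]] = i;
              |         return old_to_new;
              |     }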
             | 
             | PS
             | 
             | Fun fact: WFST created by OpenFST has "structure of arrays"
             | layout. ;)
        
         | dexterous8339 wrote:
         | I recently went down this rabbit hole myself and came out with
         | my own column-major array type.
         | 
         | The explanation drifts into thinking of 2D byte arrays as 3D
         | bit matrices, but in the end it was a 20-30x improvement in
         | speed and binary size.
         | 
          | I was honestly surprised that C++ doesn't have anything built
          | in for this, but at least it's trivial to write your own array
          | type.
        
         | thechao wrote:
         | I showed up to the C++ rodeo right as C++98 was taking off, and
         | got to work with Jaakko & Gaby & Bjarne (at TAMU) on what would
         | be C++11 (C++0x). Man -- those were great times. I still can't
          | decide which craziness -- reflection or coroutines -- I detest
          | more in "modern" C++. In either case, I really feel like I'm
         | channeling my inner/outer old-man-shakes-fist-at-clouds. On the
         | other hand, I see languages like Nim, Zig, Rust, Swift, Go, ...
         | and, I mean, _seriously_.
        
       | mac3n wrote:
       | JOVIAL language had TABLE structures, which could be declared
       | SERIAL or PARALLEL
       | (https://ntrl.ntis.gov/NTRL/dashboard/searchResults/titleDeta...
       | page 281).
        
       | gr4vityWall wrote:
       | Interesting article. It does show how modern C++ can be quite
       | scary.
       | 
       | It reminded me of a Haxe macro used by the Dune: Spice Wars devs
       | to transform an AoS into a SoA at compile time to increase
       | performance: https://youtu.be/pZcKyqLcjzc?t=941
       | 
       | The end result is quite cool, though those compile time type
       | generation macros always look too magical to me. Makes me wonder
       | if just getting values using an index wouldn't end up being more
       | readable.
        
       | Enthouan wrote:
       | I'm definitely missing the point, but reading the article, I kept
       | thinking "This would've been so much easier in C."
        
         | trealira wrote:
         | I don't see how it would be easier. Whatever you can do in C,
         | you can do in C++. That said, I was also a bit confused, since
         | I guess I don't keep up with the latest C++ updates enough.
        
         | quotemstr wrote:
         | No, it would be impossible in C, the "it" here being the
          | _automatic_ conversion of a struct to an array of the struct's
         | fields. Sure, you can do that pretty easily _by hand_ in C, but
         | you can also do it easily _by hand_ in C++, so whatever. The
         | point of the article is the metaprogramming.
        
           | zahlman wrote:
           | I get the impression that ("pure") C programmers say things
           | like that because they don't get themselves into situations
           | where they feel they'd _want_ the metaprogramming.
        
             | quotemstr wrote:
              | They might think that they don't need metaprogramming, but
              | like everyone else, they eventually do.
              | 
              | That's how we get monstrosities like
              | https://github.com/freebsd/freebsd-src/blob/master/sys/sys/q...
              | (plenty of Linux equivalents --- e.g. the tracepoint stuff)
             | 
             | Better to use a language with thoughtful metaprogramming
             | than tack it on badly later.
        
       | gpderetta wrote:
       | The use of reflection is interesting, but is there a significant
       | advantage, compared to something like this:
        |     template<template<class> class G>
        |     struct Point {
        |         G<int> x;
        |         G<int> y;
        |         auto get() { return std::tie(x, y); }
        |     };
        | 
        |     template<template<template<class> class> class C>
        |     struct SOA {
        |         template<class T> using Id = T;
        |         template<class T> using Ref = T&;
        | 
        |         C<std::vector> vs;
        | 
        |         void push_back(C<Id> x) {
        |             std::apply([&](auto&&... r) {
        |                 std::apply([&](auto&&... v) { ((r.push_back(v)), ...); }, x.get());
        |             }, vs.get());
        |         }
        | 
        |         C<Ref> operator[](size_t i) {
        |             return std::apply([&](auto&&... r) { return C<Ref>{ r[i]... }; }, vs.get());
        |         }
        |     };
        | 
        |     int main() {
        |         SOA<Point> soa_point;
        |         soa_point.push_back({1, 2});
        |         auto [x, y] = soa_point[0];
        |     }
        
         | OskarS wrote:
         | > template<template<template<class> class> class C>
         | 
         | Boy howdy!
        
       | the__alchemist wrote:
       | I think I've lost the thread on the abstractions. (Me not being
       | very familiar with Zig outside of its most basic syntax is
        | probably why.) I've been doing a lot of SoA work in Rust lately,
        | specifically because I have numerical/scientific code that uses
        | CPU SIMD and CUDA; SoA works great for these.
       | 
        | The workflow is: I set up Vec3x8 and Quaternionx8, which wrap a
        | SIMD intrinsic for each field (x: f32x8, y: f32x8, ...) etc.
       | 
        | For the GPU and general flattening, I just pack the args as Vecs,
        | so the fn signature contains them like eps: &[f32], sigma: &[f32],
        | etc. The (C++-like CUDA) kernel then sees these as float3* params
        | and so on. So I'm having trouble mapping this SoA approach to the
        | abstractions used in the article. It also feels like a complexity
        | reversal of the Rust/Zig stereotypes...
       | 
        | Examples:
        | 
        |     struct Vec3x8 {
        |         x: f32x8,
        |         y: f32x8,
        |         z: f32x8,
        |     } // appropriate operator overloads...
        | 
        |     struct Setup {
        |         eps: Vec<f32>,
        |         sigma: Vec<f32>,
        |     }
       | 
        | So, structs of arrays, plainly. Are the abstractions used here
        | something like what Jai is attempting, where the internal
        | implementation is decoupled from the API, so you don't compromise
        | on performance versus ergonomics?
        
       | jayd16 wrote:
       | Towards the middle you realize this is about the reflection more
       | than anything else.
       | 
       | I do like how directly accessing the fields individually (the
       | whole reason you would do this) is a hypothetical presented as an
        | afterthought. Enjoyably absurd.
        
       | perihelions wrote:
        | I attempted to write a minimal version of the idea in Common
        | Lisp, in case anyone was curious what it'd look like in that
        | language:
       | 
       | https://peri.pages.dev/struct-of-arrays-snippet
        
       | Dwedit wrote:
       | Struct of Arrays vs Array of Structs have different performance,
       | depending on how they are accessed, mainly due to how memory will
       | be cached.
       | 
       | If you are iterating over objects sequentially, and looking at
       | every field, they will perform about the same. If you are
       | iterating over objects sequentially, and only looking at a few
       | fields, Struct of Arrays performs better. If you are accessing
       | objects at random, and reading every field, Array of Structs
       | performs better.
       | 
       | Array of Structs has a multiplication step when you calculate the
       | address of an element. Struct of Arrays basically guarantees that
       | the multiplication will be by a power of 2, so it's a bitshift
       | rather than a multiplication. (Unless your struct contains a type
       | whose size is not a power of 2). Multiplication is still very
       | fast though on most architectures, so that's not all that much of
       | a difference.
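        | 
        | To make the address math concrete (an illustrative sketch, not
        | from the article): with a 12-byte struct, AoS indexing needs a
        | multiply by 12, while each SoA column is indexed with a
        | power-of-two stride.
        | 
        |     #include <cstddef>
        |     #include <cstdint>
        | 
        |     struct Vec3 { int32_t x, y, z; };  // sizeof(Vec3) == 12
        | 
        |     int32_t aos_x(const Vec3* v, std::size_t i) {
        |         return v[i].x;   // address: base + i*12 (a multiply)
        |     }
        | 
        |     int32_t soa_x(const int32_t* xs, std::size_t i) {
        |         return xs[i];    // address: base + i*4 (a shift by 2)
        |     }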
        
         | bombela wrote:
         | How could it not be a power of 2?
         | 
         | The compiler makes sure to align fields and pad structs to
         | power of 2 anyways.
        
           | tomsmeding wrote:
           | If your struct contains a field that is itself a struct, e.g.
           | one containing three int32_t's, then that field has size 12,
           | which is not a power of two.
        
             | compiler-guy wrote:
             | Doesn't even need to be a struct within a struct:
             | 
              |     struct Foo {
              |         int8_t a;
              |         int8_t b;
              |         int8_t c;
              |     };
             | 
              | will have a size of 3 and an alignment of 1. Alignment is
              | always a power of 2, and size modulo alignment will always
              | be zero. But alignment can be less than size, and size need
              | not be a power of two, just a multiple of the (power-of-two)
              | alignment.
        
             | bombela wrote:
             | Brain was not fully loaded yet, thank you.
             | 
              | I don't think the struct-in-struct is relevant in your
              | example, by the way. i32 requires 4-byte alignment. An array
              | of struct{3xi32} will have a stride of 12 bytes, which
              | maintains the 4-byte alignment of the fields.
             | 
              | If this structure is embedded into another one, it will be
              | padded according to the alignment requirement of the outer
              | field.
             | 
             | { { 3xi32 } i32 } will be 16 bytes.
             | 
             | { { 3xi32 } i64 } will be 12, padding 4, 8. Totalling 24.
             | 
             | I am assuming x86_64.
        
         | compiler-guy wrote:
          | A struct of arrays has the potential to be much more memory
          | efficient if the original struct had padding.
         | 
          |     struct Foo {
          |         int64_t i;
          |         int8_t  c;
          |         // 7 bytes of padding for alignment.
          |     };
         | 
          | A Foo array will have seven bytes of wasted space per element.
          | That can add up, and can even blow away what would otherwise be
          | nice caching effects when looking at both fields.
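          | 
          | If I have the usual 64-bit ABI numbers right (an assumption;
          | check your target), the arithmetic looks like this:
          | 
          |     static_assert(sizeof(Foo) == 16);  // 9 bytes of data + 7 of padding
          |     // SoA: a vector<int64_t> plus a vector<int8_t> store
          |     // only 9 bytes per element.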
         | 
         | Access patterns do matter a lot to this, but it isn't as easy
         | as describing one. Always profile.
         | 
         | [Edit, I wish hn supported markdown.]
        
           | WalterGR wrote:
           | Lines that start with 2 spaces are monospaced.
           | 
           | https://news.ycombinator.com/formatdoc
        
         | silisili wrote:
         | I haven't benched them, but it feels like in some languages
         | bounds checking would hamper SoA performance, no?
        
       | adzm wrote:
       | Neat to see the reflection / unibrow operator ^^ in the wild!
        
       | gwenzek wrote:
        | Interesting to see the new C++ at work. But also, I'm sure it's
        | easier to learn the full Zig language than just the new C++
        | metaprogramming.
        
       ___________________________________________________________________
       (page generated 2025-05-09 23:01 UTC)