[HN Gopher] Flattening ASTs and other compiler data structures (...
___________________________________________________________________
Flattening ASTs and other compiler data structures (2023)
Author : aw1621107
Score : 144 points
Date : 2025-01-10 19:23 UTC (1 day ago)
(HTM) web link (www.cs.cornell.edu)
(TXT) w3m dump (www.cs.cornell.edu)
| ww520 wrote:
| This is a fantastic idea. An AST works well in an array-based
| allocation block since there is no need to free individual
| nodes. It's add-only allocation.
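The add-only arena pattern described above can be sketched in a few lines; all names here (`Arena`, `interp`, the tag strings) are illustrative, not taken from the article:

```python
# Minimal sketch of an add-only AST arena: nodes are appended to one list
# and referenced by index, so nothing is ever freed individually and the
# whole tree is dropped at once.

class Arena:
    def __init__(self):
        self.nodes = []  # each node: (tag, operands...)

    def add(self, tag, *args):
        """Append a node and return its index (playing the role of an ExprRef)."""
        self.nodes.append((tag, *args))
        return len(self.nodes) - 1

def interp(arena, ref):
    """Evaluate the expression rooted at `ref`."""
    tag, *args = arena.nodes[ref]
    if tag == "lit":
        return args[0]
    if tag == "add":
        return interp(arena, args[0]) + interp(arena, args[1])
    if tag == "mul":
        return interp(arena, args[0]) * interp(arena, args[1])
    raise ValueError(tag)

a = Arena()
one = a.add("lit", 1)
two = a.add("lit", 2)
three = a.add("lit", 3)
sum_ = a.add("add", one, two)
root = a.add("mul", sum_, three)  # (1 + 2) * 3
print(interp(a, root))  # -> 9
```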
| JonChesterfield wrote:
| What about transforming the AST after it is built, or deriving
| a new tree which partly aliases the original in persistent
| structure fashion?
| dmagyari wrote:
| "Instead of allocating Expr objects willy-nilly on the heap,
| we'll pack them into a single, contiguous array." Zig compiler
| pipeline (AST, Zir, Air, Sema) does exactly this on all layers.
| Not only contiguous, but instead of array-of-structs it is
| struct-of-arrays, so walking the tree is even more cache
| friendly. For AST see:
| https://github.com/ziglang/zig/blob/master/lib/std/zig/Ast.z...
| agumonkey wrote:
| Makes me wonder whether people in the APL/J/K community have
| influenced or been influenced by this kind of technique. IIRC
| Aaron Hsu does tree processing through arrays (but I'm not
| skilled enough to analyze his code).
| gsf_emergency wrote:
| Do you have a link to such an example of Aaron's code? Thank
| you in advance!
| 082349872349872 wrote:
| https://scholarworks.iu.edu/dspace/items/3ab772c9-92c9-4f59
| -...
|
| Iverson's 1962 book also mentions tree representations, see
| pp45-62: https://archive.org/details/aprogramminglanguage19
| 62/page/n6...
| agumonkey wrote:
| Can't remember where exactly but he did demo his code in
| talks/conferences with links.
| gritzko wrote:
| I work on a C dialect where _everything_ is flattened. JSON and
| other trees in particular. Binary heaps are flat, merge sort
| and iterator heaps are absolutely great, can build LSM
| databases with that. Stacks, circular buffers, hash maps, etc,
| all flat. Templated output (PHP like) is done by a flat data
| structure.
|
| https://github.com/gritzko/librdx/blob/master/abc/B.md
|
| Apart from locality and lifetimes, these flat data structures
| improve _composability_. When every data structure is a flat
| buffer, you can mmap them or zip them or send them by the
| network, all by the same routine. They are uniform like bricks,
| in a sense.
| rurban wrote:
| I worked in a language where all data structures were
| "flattened", could be trivially serialized to disk, and read
| in again -- called print and read. The language was called
| Lisp. All flat, just parens.
|
| Some of my compilers export the AST as lisp trees. Much
| smaller and more readable than json, and it can be executed.
| Uniform like bricks
| vanderZwan wrote:
| > _All flat, just parens._
|
| So not flat then. Prefix is not postfix. Forth, and most
| concatenative languages, are much closer to _actually_ being
| flat.
|
| Lisp is trivial to flatten, but that's not the same thing.
| cardanome wrote:
| Amazing article, very good advice to keep your data structures
| flat.
|
| Adding to that, it also makes editing the AST vastly more
| efficient.
|
| I discovered that principle on my own when I worked on an
| editor that operated directly on the AST instead of text. I
| found manipulating the tree-style AST painful, constantly
| traversing the tree and all. Once I made it flat, my life got
| a lot easier. You can directly index any part of the AST in
| constant time.
| emptysea wrote:
| Rust-analyzer uses a similar technique for parsing
| https://github.com/rust-lang/rust-analyzer/blob/master/crate...
| which then gets fed into https://github.com/rust-analyzer/rowan
| (lossless syntax tree)
| bradrn wrote:
| There is a longer overview of it here: https://github.com/rust-
| lang/rust-analyzer/blob/master/docs/...
| ndesaulniers wrote:
| Cool! Carbon is doing exactly this. I had asked leads if there
| was a paper on this approach, but they didn't have anything for
| me. I'll send them this post!
| benatkin wrote:
| Zig uses a MultiArrayList which sounds similar
| https://mitchellh.com/zig/parser
| wrsh07 wrote:
| Chandler discusses it in this video though!
| https://youtu.be/ZI198eFghJk
|
| You get some traversals for free with this layout (preorder,
| reverse post order). Can search for subtrees with string
| searching algorithms or more complex things with regex.
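The subtree-search point above can be illustrated concretely: in a preorder-flattened tree each subtree occupies a contiguous slice, so finding a repeated subtree reduces to substring search over the array. The tuple encoding below is a hypothetical sketch, not the layout from the talk:

```python
# Sketch: preorder-flatten a nested-tuple tree so each subtree is a
# contiguous slice, then find a subtree with plain sliding-window search.

def flatten(node, out):
    """Preorder-flatten a tree of ('add'/'mul', left, right) or ('lit', n)."""
    out.append(node[0])
    if node[0] == "lit":
        out.append(node[1])
    else:
        flatten(node[1], out)
        flatten(node[2], out)
    return out

tree = ("add", ("mul", ("lit", 2), ("lit", 3)),
               ("mul", ("lit", 2), ("lit", 3)))
flat = flatten(tree, [])

# search for the repeated subtree (2 * 3) as a contiguous pattern
pattern = flatten(("mul", ("lit", 2), ("lit", 3)), [])
hits = [i for i in range(len(flat) - len(pattern) + 1)
        if flat[i:i + len(pattern)] == pattern]
print(hits)  # -> [1, 6]: both occurrences of the shared subtree
```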
| kazinator wrote:
| > _Instead of allocating Expr objects willy-nilly on the heap,
| we'll pack them into a single, contiguous array._
|
| This happens naturally if you bump-allocate them in a garbage-
| collected run-time, particularly under a copying collector. Free
| lists also tend to co-locate because they are produced during
| sweep phases of GC which run through heaps in order of address.
|
| Don't make me bring out the L word for the billionth time.
|
| > _A flat array of Exprs can make it fun and easy to implement
| hash consing_
|
| OK, it's not a case of L-ignorance, just willful neglect.
| samps wrote:
| FWIW I did acknowledge this in the article:
|
| > A sufficiently smart memory allocator might achieve the same
| thing, especially if you allocate the whole AST up front and
| never add to it
|
| > Again, a really fast malloc might be hard to compete with--
| but you basically can't beat bump allocation on sheer
| simplicity.
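To make the simplicity claim concrete: a bump allocator is little more than a pointer increment into a preallocated buffer, with the whole arena freed at once. This toy sketch (names illustrative) models pointers as offsets:

```python
# Sketch of bump allocation: alloc is "round up to alignment, then bump
# the top offset". No free list, no per-object bookkeeping.

class BumpArena:
    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.top = 0

    def alloc(self, size, align=8):
        # round the current top up to the alignment, then bump past `size`
        start = (self.top + align - 1) & ~(align - 1)
        if start + size > len(self.buf):
            raise MemoryError("arena exhausted")
        self.top = start + size
        return start  # offset into buf, playing the role of a pointer

arena = BumpArena(1024)
a = arena.alloc(12)  # -> 0
b = arena.alloc(7)   # -> 16 (12 rounded up to the next multiple of 8)
print(a, b, arena.top)
```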
| layer8 wrote:
| And if you don't need more than 32 GB of heap space, the JVM
| also gives you the ability to reduce reference sizes to 32
| bits, with compressed references. (Due to alignment
| requirements, the lower 3 bits of a pointer are zero and hence
| do not need to be stored, which effectively gives you a 35-bit
| address space.) Of course, the presence of object headers
| counteracts this to a certain extent.
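The arithmetic behind the 35-bit figure is easy to check: with 8-byte alignment the low 3 bits of every address are zero, so a 32-bit reference shifted left by 3 spans 2^35 bytes = 32 GiB. A small sketch (illustrative, not the JVM's actual implementation):

```python
# Compressed-reference arithmetic: store addr >> 3, recover addr << 3.

def compress(addr):
    assert addr % 8 == 0  # 8-byte alignment guarantees the low 3 bits are 0
    return addr >> 3

def decompress(ref):
    return ref << 3

# the largest 32-bit reference decodes to the last 8-byte slot below 2**35
max_addr = decompress(2**32 - 1)
print(max_addr + 8 == 2**35)  # -> True: exactly 32 GiB addressable
```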
| hencq wrote:
| As the article mentions, this makes it quite similar to a
| bytecode vm. I think the traditional wisdom is that an AST walker
| is easy to write, but for speed you'd want a bytecode
| interpreter. It'd be interesting to see how close the performance
| gets with this flattened AST.
|
| In practice I think there are more differences. E.g. AST
| interpreters tend to pass environments around while bytecode
| interpreters often store these on a stack (though I guess there's
| nothing stopping you from doing this with an AST either). I
| wonder if there's some goldilocks zone for ease of implementation
| with decent performance.
| kazinator wrote:
| If you instead flatten the expression tree into RPN, then you
| can execute it like that, with a stack machine.
|
| I seem to recall that the Red Dragon Book ( _Compilers:
| Principles, Techniques and Tools_ , Aho, Sethi, Ullman [1988])
| describes a technique whereby intermediate code is represented
| in RPN, and transformations are performed by pattern matches on
| it.
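Executing a flattened RPN expression with a stack machine, as described above, takes only a few lines; the list encoding here is an illustrative sketch:

```python
# Sketch: evaluate an expression flattened to RPN (postfix) with a stack.
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_rpn(code):
    stack = []
    for item in code:
        if item in OPS:
            rhs = stack.pop()  # note: pop right operand first
            lhs = stack.pop()
            stack.append(OPS[item](lhs, rhs))
        else:
            stack.append(item)  # literal operand
    return stack.pop()

# (1 + 2) * 3 flattened to postfix
print(eval_rpn([1, 2, "+", 3, "*"]))  # -> 9
```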
| finnh wrote:
| The sample flat program in the post is exactly RPN, no?
| samps wrote:
| I think it would be more like RPN if it used a stack, and
| operands were specified as relative offsets (i.e., stack
| offsets). In the version I wrote, operands are still
| represented as absolute offsets in the expression table.
| dang wrote:
| Related:
|
| _Flattening ASTs and other compiler data structures (2023)_ -
| https://news.ycombinator.com/item?id=42181603 - Nov 2024 (2
| comments)
|
| _Flattening ASTs and other compiler data structures_ -
| https://news.ycombinator.com/item?id=36559346 - July 2023 (81
| comments)
| jgrowl wrote:
| I thought a reddit comment on this article had an interesting
| point:
|
| https://www.reddit.com/r/rust/comments/1d3b356/my_new_favori...
|
| Timzhy0, 3 points, 7 months ago:
|
| Btw I think one can go a step further than the author, there is
| no need to keep two explicit ExprRef baked in a binary node (lhs,
| rhs). You can exploit locality: basically the AST, seen in the
| LISP way, is just an arbitrarily nestable list, where elements
| are atoms or other lists. Hence all you need to know is where
| each list ends (and if it's an atom you can assume it spans one
| node) and actually one bit to know if it is the last entry in the
| list is quite ergonomic as well (because then you can distinguish
| whether moving next slot in the AST means there is a sibling).
| Basically it's easier to keep it sync while constructing and
| takes up less memory per node. I pay 40 bits per node, stored
| interleaved for best cache locality (some unaligned accesses but
| I think it's still worthwhile), 8 bits for the tag, 32 for the
| data, if data is bigger, 32 is an index into some auxiliary
| segment (basically a ptr).
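The end-of-list-bit scheme in the quoted comment can be sketched as follows; the tuple encoding is hypothetical, chosen only to show how that one bit is enough to skip a subtree and find the next sibling:

```python
# Sketch: nodes stored in preorder; each carries a tag, a payload, and one
# bit marking whether it is the last entry of its enclosing list. Atoms
# span one node; a list's children are the nodes that follow it.

# node = (tag, payload, last)

def skip(nodes, i):
    """Return the index just past the subtree starting at i."""
    tag, _, _ = nodes[i]
    i += 1
    if tag == "list":
        while True:
            last = nodes[i][2]  # the child's own last-entry bit
            i = skip(nodes, i)
            if last:
                break
    return i

# (a (b c) d) in preorder
nodes = [
    ("list", None, True),
    ("atom", "a",  False),
    ("list", None, False),
    ("atom", "b",  False),
    ("atom", "c",  True),
    ("atom", "d",  True),
]
print(skip(nodes, 2))  # -> 5: the sublist (b c) spans nodes[2:5]
```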
| catgary wrote:
| An arbitrarily nestable list is a tree, no?
| userbinator wrote:
| Rediscovering techniques that were somewhat well-known in the 70s
| and 80s.
|
| See also: https://en.wikipedia.org/wiki/Binary_heap
| Taniwha wrote:
| heh - I built compilers like this back in the 70s because the
| machine I was working on (B6700 Algol) didn't really do
| pointers as a first-class data structure. It's not really
| surprising to find someone doing something similar in another
| language that makes pointers difficult to deal with.
| torginus wrote:
| Yup, and chances are that if you're using a good C++ STL
| implementation, most containers already use packed storage
| internally. The STL doesn't even have a dedicated heap data
| structure; it uses an std::vector with a set of helper
| functions.
| torginus wrote:
| I guess Rust's contribution to high performance programming is
| that its borrow checker is so loathsome that it pushes people
| into using ideas like ECS or arenas just to not have to bother
| with it.
| Agraillo wrote:
| About 10 years ago, working with ASTs, I (re)invented a flat
| structure representing trees in a flat array. It reminds me of
| what is described here, but not exactly. In my case I needed
| only two indices per node: the total sub-region length of all
| the children and sub-children, and the parent index (so no
| need to store the indices of all children). The total
| sub-length can basically be interpreted as the index of the
| next sibling. With such a structure it's easy/cheap to
| implement FindFirstChild/FindNextSibling.
|
| Later I learned the theory behind such structures, the "nested
| set model" [1]. The article doesn't seem to mention the
| internal representation, but I think the implementation should
| use something like my solution, i.e. a fixed number of
| references per node.
|
| [1] https://en.wikipedia.org/wiki/Nested_set_model
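The two-fields-per-node layout described above can be sketched like this (the field packing is illustrative): the next sibling of node `i` is simply `i + subtree_size[i]`, and the first child, if any, sits at `i + 1`:

```python
# Sketch: preorder node array where each node stores
# (label, subtree_size, parent_index). Tree: root(a(b, c), d).

nodes = [
    ("root", 5, -1),
    ("a",    3,  0),
    ("b",    1,  1),
    ("c",    1,  1),
    ("d",    1,  0),
]

def first_child(i):
    # in preorder, a node's first child (if any) immediately follows it
    child = i + 1
    return child if child < len(nodes) and nodes[child][2] == i else None

def next_sibling(i):
    # the subtree size is exactly the distance to the next sibling
    sib = i + nodes[i][1]
    return sib if sib < len(nodes) and nodes[sib][2] == nodes[i][2] else None

print(first_child(1), next_sibling(1))  # -> 2 4
```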
| efitz wrote:
| Did the author just reinvent Forth?
| jonstewart wrote:
| You can also be smart about memory with lexing, for great
| winnage. Have a struct for your tokens that has an enum for
| your token type and either a pointer, or indices, or a
| string_view (or a &str, but lol lotsa luck with the borrow
| checker). You can then have a vector of your token structs for
| fast allocation and iteration, and each token holds a slice
| back into the original input, so there is no substring
| copying.
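The span-based tokens described above can be sketched as (offset, length) pairs into the source; the token kinds and lexer rules here are illustrative:

```python
# Sketch: tokens store (kind, start, length) into the original input, so
# the lexer never copies substrings; callers slice back in when needed.
from enum import Enum, auto

class Tok(Enum):
    IDENT = auto()
    NUMBER = auto()

def lex(src):
    tokens = []  # (kind, start, length) triples into src
    i = 0
    while i < len(src):
        if src[i].isspace():
            i += 1
        elif src[i].isdigit():
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append((Tok.NUMBER, i, j - i))
            i = j
        else:
            j = i
            while j < len(src) and not src[j].isspace():
                j += 1
            tokens.append((Tok.IDENT, i, j - i))
            i = j
    return tokens

src = "foo 42 bar"
toks = lex(src)
# slice back into the input only when the text is actually needed
print([src[s:s + n] for _, s, n in toks])  # -> ['foo', '42', 'bar']
```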
| uecker wrote:
| Yes, the C FE I write (in C) does exactly this and then
| outputs, in one pass, a flattened intermediate code. I did not
| see it as an AST because it does semantic analysis during
| parsing, but it still has all the information.
| pfdietz wrote:
| One advantage to this is the ease with which it handles ephemeral
| annotations.
|
| Suppose you want to attach additional information to some of the
| nodes of the AST. Different algorithms on the AST will attach
| different information; you don't necessarily need them all at the
| same time or know ahead of time what you'll need.
|
| With nodes, you have to have some sort of node/value hash table,
| or hang a key/value map off each node. But with this flattened
| representation, each datum gets its own flat array as well, which
| can be independently allocated and deallocated.
|
| One other thing I noticed about this flat representation is
| that it throws static typing into a cocked hat! All you have
| to refer to other nodes with is indices, and all the different
| kinds of nodes are stored in the same array.
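The side-table idea above falls out directly from dense node indices: an ephemeral annotation is just a parallel array that can be allocated and discarded independently of the AST. A small sketch (node encoding illustrative):

```python
# Sketch: flat AST plus independently-owned side tables for annotations.

nodes = [("lit", 1), ("lit", 2), ("add", 0, 1)]  # flat AST, index-linked

# one analysis attaches types in its own parallel array...
types = ["int"] * len(nodes)

# ...another, independently, attaches constant-folding results; a single
# left-to-right pass works because operands always precede their users
const_value = [None] * len(nodes)
for i, (tag, *args) in enumerate(nodes):
    if tag == "lit":
        const_value[i] = args[0]
    elif tag == "add":
        a, b = const_value[args[0]], const_value[args[1]]
        const_value[i] = a + b if a is not None and b is not None else None

print(const_value)  # -> [1, 2, 3]; drop the list when the pass is done
```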
| Tarean wrote:
| Twee (an equational theorem prover in Haskell used by quickspec)
| has an interesting take on this. Terms are slices of arrays, but
| you get a normal interface including pattern matching via
| synonyms. It can also be nice to use phantom types of your
| references (array offsets), so if you project them into flat view
| types you can do so type safely
|
| Requires the language to have something equivalent to pattern
| synonyms to be as invisible as twee, though.
|
| In twee a TermList is a slice of a bytearray (two ints for
| offset/length plus a pointer).
|
| And a term is an int for the function symbol and an unpacked
| TermList for the arguments.
|
| The pattern match synonyms load a flat representation from the
| array into a view type, and the allocation of the view type
| cancels out with the pattern matching so everything remains
| allocation free.
|
| https://hackage.haskell.org/package/twee-lib-2.4.2/docs/Twee...
| Tarean wrote:
| Forgot to mention: In the twee style, the int for the function
| id contains metadata (is it a unification variable or constant
| name? how many args does it take?). That way f1(f3(f5(), f7()))
| would be serialised as something like [1,3,5,7], without even
| references to other offsets
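The twee-style encoding sketched above can be demonstrated by packing the arity into each symbol's int and decoding the flat array recursively; the bit layout below (arity in the low 4 bits) is purely illustrative, not twee's actual format:

```python
# Sketch: each int encodes a function id plus its arity, so a term
# flattens to a plain preorder int array with no offsets at all.

def sym(fid, arity):
    return (fid << 4) | arity  # illustrative packing

def decode(code, i=0):
    """Rebuild a nested (fid, args) term from the flat preorder array."""
    fid, arity = code[i] >> 4, code[i] & 0xF
    i += 1
    args = []
    for _ in range(arity):
        arg, i = decode(code, i)
        args.append(arg)
    return (fid, args), i

# f1(f3(f5(), f7())) with arities 1, 2, 0, 0
flat = [sym(1, 1), sym(3, 2), sym(5, 0), sym(7, 0)]
term, _ = decode(flat)
print(term)  # -> (1, [(3, [(5, []), (7, [])])])
```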
___________________________________________________________________
(page generated 2025-01-11 23:02 UTC)