[HN Gopher] Arena-based parsers
       ___________________________________________________________________
        
       Arena-based parsers
        
       Author : ibylich
       Score  : 121 points
       Date   : 2024-05-06 15:59 UTC (2 days ago)
        
 (HTM) web link (iliabylich.github.io)
 (TXT) w3m dump (iliabylich.github.io)
        
       | justinpombrio wrote:
       | Oh yeah, arena allocation is great. I've used it a few times and
       | it's always a huge performance improvement. The requirements are
       | that (i) the objects you're allocating all have the same type,
       | and (ii) they all have the same lifetime (i.e. it's okay not to
       | free any of them while others still live).
        
         | knome wrote:
         | Why would a rust arena allocator be required to have objects
         | all of the same type in it? No C arena allocator would have
         | such a restriction.
        
           | Rusky wrote:
           | It wouldn't. For some reason in Rust people tend to apply the
           | term "arena" to "vector + indices" as well as the proper
           | definition, which does have that restriction.
        
         | o11c wrote:
         | "All objects of the same(-sized) type" is typically a
         | constraint of bitmap allocators, not arena allocators.
        
         | kragen wrote:
         | your (i) is incorrect (arena allocators can allocate objects of
         | different types, and indeed one of their strong points is that
         | they don't suffer from fragmentation) and your (ii) doesn't
         | make sense because _not_ freeing an object is _always_ okay,
         | according to conventional programming language semantics anyway
         | 
         | this particular arena allocator _does_ allocate objects of
         | different types, almost arbitrary types in fact:
         | fn alloc_uninitialized<T>(&self) -> NonNull<T> { ... }
         | 
         | the safety concern is that nothing _outside_ the arena survive
         | the arena if it contains a pointer into the arena. as the
         | linked micro-book puts it,  'it shouldn't be possible for any
         | arena-allocated [variable] to outlive the memory that it[s
         | data]'s located on.'
        
           | hansvm wrote:
           | > not freeing an object is always okay, according to
           | conventional programming language semantics anyway
           | 
           | Conventional semantics have to hit reality eventually. Long-
           | running programs can't just alloc forever without free, hence
           | an arena not being the only allocator in most long-running
           | programs.
        
             | kragen wrote:
             | yes, this is certainly a weak point in conventional
             | semantics, but not relevant to this issue, which is about
             | ensuring no dangling pointers, not about memory leaks
             | 
             | if you're running low on memory, switching to an arena
             | allocator is probably a bad idea, because it will usually
             | increase your memory requirements (by delaying deallocation
             | longer), not decrease them
        
       | asplake wrote:
       | > To release memory we simply release the blob itself, so it's
       | like a batched deallocation, sort of. The consequence here is
       | that all of our types must be:
       | 
       | > 1. trivially copyable (: Copy if we speak in Rust)
       | 
       | > 2. trivially destructible (i.e. they can't have impl Drop)
       | 
       | 2 (Drop) seems obvious enough, but why 1 (Copy)? There's a bit of
       | Rust knowledge implicit there that I'm unaware of, and I might
       | not be alone in that. Could someone explain?
        
         | mmastrac wrote:
         | Trivially copyable and destructible means that they are plain-
         | old data, with no allocations, associated resources or any
         | other side-effects.
        
           | LegionMammal978 wrote:
           | Trivial destructibility alone is generally sufficient for
           | that in Rust. Recall that all values (outside a Pin) are
           | trivially moveable, so pretty much the only reason you'd want
           | trivial copying is if you actually need more than one valid
           | copy at once. (One exception is using Copy to express a poor
           | man's Freeze bound, which is quite limited, and dubious in
           | any case.)
        
             | mmastrac wrote:
             | A type that is not Copy may have associated memory
             | allocations, however. You cannot simply forget a range of
             | !Copy structs without the risk of memory leaks.
        
               | LegionMammal978 wrote:
               | If those !Copy structs deallocate their memory when
               | dropped properly, then they aren't trivially
               | destructible. If they _were_ trivially destructible, then
               | they would leak their memory regardless of whether you
               | dropped or forgot them. That 's what it means for a type
               | to be trivially destructible: dropping it is a no-op.
        
               | mmastrac wrote:
               | I think we're talking about the same thing -- the
               | original post in the thread used "no impl Drop" as
               | "trivially destructable", not "no functional difference
               | between drop and forgetting".
        
               | LegionMammal978 wrote:
               | Well, I interpret "trivially destructible (i.e. they
               | can't have impl Drop)" as meaning "the type itself can't
               | impl Drop, nor can any of its fields or subfields", which
               | is what it means for a type to be "trivially
               | destructible" in Rust.
               | 
               | But what do you mean by "trivially destructible" being
               | different from "no functional difference between drop and
               | forgetting"? Unless I'm mistaken, they are the same
               | thing: if none of a type's subfields impl Drop, then
               | dropping it does absolutely nothing; whereas if any
               | subfield _does_ impl Drop, then it can 't be a no-op,
               | since the compiler has to form a &mut reference to pass
               | it into Drop::drop() (a distinction very relevant for
               | unsafe code).
               | 
               | Could you give me an example of a !Copy type that is
               | "trivially destructible", but can't be forgotten without
               | leaking memory?
        
         | LegionMammal978 wrote:
         | I'm pretty sure being trivially destructible is sufficient for
         | this use case. The only reason you might want a Copy bound when
         | you're not actually copying values is if you're trying to
         | disallow interior mutability, but that doesn't seem to be the
         | case here. Perhaps the author was misinformed.
         | 
         | Or maybe they're just using Copy as a type-level bound that
         | guarantees a trivial destructor, even if it's more restrictive
         | than strictly necessary.
        
           | MindSpunk wrote:
           | You can't have a negative trait bound to say "I take anything
           | that _doesn't_ implement Drop". It's not a stabilized feature
           | in Rust because it would allow adding a trait to a struct to
           | become a breaking API change where it currently is not. Copy
           | guarantees !Drop so it's currently the only way to do it
           | without opting into nightly features.
           | 
           | You can remove the requirement for trivially destructible
           | types by maintaining a singly linked list of (ptr to alloc'd
           | object, destructorFn) pairs with the links allocated from the
           | arena itself. The arena can just walk the list and call all
           | the drop functions when it gets reset. You can even
           | specialize for types with trivial destructors so they don't
           | append to the list and so you only pay the extra cost if you
           | allocate Drop types (using std::mem::needs_drop).
        
             | LegionMammal978 wrote:
             | > You can't have a negative trait bound to say "I take
             | anything that _doesn't_ implement Drop". It's not a
             | stabilized feature in Rust because it would allow adding a
             | trait to a struct to become a breaking API change where it
             | currently is not. Copy guarantees !Drop so it's currently
             | the only way to do it without opting into nightly features.
             | 
             | Alternatively, you could just do an std::mem::needs_drop()
             | check when inserting an element, and fail if it indicates
             | that it would need to be dropped. (Yeah, its documentation
             | protests that it can have false positives, but IIRC you
             | have to be doing super wacky type-level stuff to trigger
             | that for real.)
             | 
             | That's why I said that Copy is sufficient (but restrictive)
             | as a _type-level_ bound, since needs_drop() can 't be done
             | on that level. At least, not without the trick of throwing
             | a post-mono error with an associated const.
        
         | GrantMoyer wrote:
         | I think it's a mistake. _Allocator Implementation_ [1] shows
         | values being moved into the arena, which can be done without
         | Copy or even Clone. Maybe the author got confused, since moving
         | an object in Rust just _copies_ the bytes and invalidates the
         | moved from binding.
         | 
         | [1]: https://iliabylich.github.io/arena-based-
         | parsers/allocator_i...
        
       | Cloudef wrote:
       | Arena allocation is very trivial in zig
        
         | himujjal wrote:
         | This. Writing a parser in Zig is so simple. Just allocate once,
         | and then start writing your parser.
         | 
         | One allocator for parser, one for scanner. One for type
         | allocation. Keep them all for semantic analysis. Write the
         | output renderer (binary/language). Deallocate.
         | 
         | In this whole process, it makes it so easy to not think about
         | memory anymore. Just enjoy writing your program.
        
           | JonChesterfield wrote:
           | The scanner can probably be zero allocation though - zig
           | should be able to express an iterator over the input byte
           | array that returns a token on dereference without needing
           | more state than the position in the array
        
       | kragen wrote:
       | probably worth noting that the default peg backend for the c
       | combinator parser 'hammer' uses an arena allocator for speed and
       | because it's a good match for packrat memoization. there are
       | hammer bindings for several languages including python and java,
       | not sure if there's a rust one yet. hammer also has some other
       | backends including lalr, and in some cases you can write a
       | grammar such that you can parse it with either packrat or lalr
       | 
       | hammer's arena allocator is not very fast, though. it's faster
       | than any malloc i've seen, but a good arena allocator is faster
       | than calling an empty function, so you have to inline the fast
       | path. the open-source prototype arena allocator i wrote in
       | http://canonical.org/~kragen/sw/dev3/kmregion.h (and .c) doesn't
       | quite reach that level, but it takes about 1.9 nanoseconds per
       | allocation (520 million allocations per second on one core),
       | while glibc's malloc takes about 19 nanoseconds per allocation.
       | more recent versions of glibc (on a faster cpu) have reduced that
       | to 14 nanoseconds in that particular microbenchmark. details are
       | at the link. your mileage may vary
       | 
       | chris wellons wrote an excellent article last year about his
       | approach to arena allocation in c in
       | https://nullprogram.com/blog/2023/09/27/
       | 
       | for those who aren't aware, garbage collectors for languages like
       | js, c#, ocaml, or java are usually generational copying
       | collectors for efficiency, which means they use what is
       | effectively an arena allocator for most new allocations. often
       | they reserve a cpu register or two for this, which my c
       | implementation above can't
       | 
       | for c, the apache portable runtime 'apr' has a generic arena
       | allocator in it called pools, which also supports destructors
       | (the drop trait, in rust) and nested pools
       | 
       | the arena allocator in gcc is called 'obstacks' and got moved
       | into libiberty so you can use it in other c programs. or
       | hypothetically in rust
        
         | fanf2 wrote:
         | obstacks are in glibc so they are very widely available
         | https://sourceware.org/glibc/manual/latest/html_node/Obstack...
        
           | kragen wrote:
           | oh thanks, i didn't realize
        
       | a_t48 wrote:
       | protobuf for C++ has this feature -
       | https://protobuf.dev/reference/cpp/arenas/ (not so much parsing
       | but (de)serializing)
        
       | 5kg wrote:
       | Somewhat related talk: Modernizing Compiler Design for Carbon
       | Toolchain - CppNow 2023
       | (https://www.youtube.com/watch?v=ZI198eFghJk)
        
       | willvarfar wrote:
       | I enjoy messing around parsing things etc. Although I started
       | with handmade unschooled attempts many decades ago, I later went
       | through the classic yak/bison phase etc before firmly getting
       | back into the hand-rolling custom side of things where I'm much
       | happier.
       | 
       | My main motivation is speed, e.g. I have enjoyed handcrafting
       | wickedly fast custom JSON and SQL parsers and bits of code that
       | sit on top of them.
       | 
       | My general approach now is to use a tokenizer that generates an
       | int that represents each token, where the bits in the int tell me
       | the location of the token in the source buffer and its type etc.
       | 
       | In languages with a lot of per-object overhead like Java this is
       | literally a long; but in the C/C++/rust camp it can look and feel
       | like an object or struct or whatever because, underneath, it ends
       | up still being an int that gets passed in registers and on the
       | stack etc.
       | 
       | Sometimes the parsing is one-pass and the tokens don't need to be
       | stored or anything; its usually the memory allocations that kill
       | parsing performance, and a once-through json decoder can
       | completely eliminate bottlenecks on hot paths in data processing
       | etc.
       | 
       | Other times I run through once and store these tokens in an
       | array, particularly if I'm going to be going back over them etc.
       | Its actually easy to make a function that, given an 'int' token,
       | finds the next one, so if you are going through the data several
       | times you don't need any allocations. But other times you want to
       | go backwards or you are really going to be going through the data
       | a lot so it makes sense to store the offsets of everything.
       | 
       | Sometimes future steps will be reordering things and transforming
       | the AST etc; in those cases, I generally have a writeable arena
       | where I append the text of new tokens, and a bit in the token
       | ints discriminate between the read-only source buffer and this
       | transformed buffer. This is all particularly cool when it comes
       | to generating sensible error messages with context, which is a
       | finesse most handmade parser makers rue later that they had
       | overlooked :)
       | 
       | I would be interested to know just how unmainstream this kind of
       | approach is? Please weigh in, would love to learn new tricks :)
        
         | mightee_pirate wrote:
         | I am very interested in parsing and optimizing them. Currently
         | I have an html parser written in go. I have used Go's html
         | parser but it is waaay slower than mine. Here what I do: -
         | stream the content (either from a file or from network) and
         | send chunks to the tokenizer
         | 
         | - tokenizer is dead simple with a state machine
         | 
         | - when an html tag is found, emit the token with positions
         | (problem is someone needs to hold the whole incoming buffer.
         | currently its copying the incoming bytes but I plan to hold the
         | bytes until a token is found. after emitting the token with the
         | buffer (copying) the inner buffer will be freed)
         | 
         | - parser does its thing and creates a DOM
         | 
         | as you can see I am not well versed in optimizing to the last
         | bit but I am very interested in diving deep into the
         | optimizations.
         | 
         | Currently I am looking at zero-copy networking. there is
         | supposedly a library for go (fasthttp) that would not copy
         | bytes from networking interface but I havent tried it yet.
         | 
         | What kind of algorithms would you recommend for tokenizing &
         | parsing html / xml very efficiently ?
        
           | kragen wrote:
           | you may want to add blank lines between your bullets
        
             | mightee_pirate wrote:
             | thx
        
               | kragen wrote:
               | happy to help
        
           | willvarfar wrote:
           | There are so many 'it depends' here, based on your use-cases
           | and perhaps details specific to go optimisation that I am
           | unaware of.
           | 
           | The first step, tokenization, might well be very go-specific
           | around zero-copy networking and things, sorry. The general
           | idea, though, would be to use a nice little finite-state-
           | machine to eat characters and recognise when the tags open
           | and close, the name of the tag, the attributes and values,
           | the comments and the body etc. And if you can be avoiding
           | allocating any kind of object on a heap at this stage, it'll
           | almost always be a big win.
           | 
           | But you need to then create a DOM, and you're going to need
           | memory allocation for that.
           | 
           | But the arena approach in the article is good for this; use a
           | big byte array as an arena, and be writing a cleaned-up
           | normalized copy of the parsed html into it, with length-
           | prefixes and distance-to-next-tag etc baked in. Typically a
           | DOM has nodes with parent, previous sibling and next sibling
           | pointers. You'll probably still want these, but these can be
           | offsets written into the byte buffer in between fragments of
           | the parsed html, rather than maintained as a separate
           | structure.
           | 
           | Then, if the user of your DOM can modify it, you can have a
           | node navigation API that seamlessly stores rewritten DOM
           | parts in a new byte arena or appended on the end of the
           | original arena.
        
           | neonsunset wrote:
           | It may seem unexpected given all the hype around Go, but it
           | is a surprisingly poor choice for this. If you want a more
           | convenient language than C++ or Rust but retain the ability
           | to reach optimal performance, C# could serve you much better.
           | 
           | Go underperforms at trivial XML parsing:
           | https://news.ycombinator.com/item?id=40283721 (and the parent
           | comment)
           | 
           | If you push it, C# can extract optimal HW utilization when
           | parsing text, beating C++:
           | https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-
           | among-... (Go was not on the list because it was that much
           | slower)
        
             | willvarfar wrote:
             | A quick google finds https://github.com/ffenix113/fastxml
             | which seems to be doing the 'tips and tricks' of arena
             | parsing and things. Any idea how fast it compares when you
             | get away from memory allocations and things and end up just
             | seeing how the compiler does basic byte manipulation?
        
               | neonsunset wrote:
               | The description indicates it is not production ready, and
               | is archived at the same time.
               | 
               | If you pull all stops in each respective language, C#
               | will always end up winning at parsing text as it offers C
               | structs, pointers, zero-cost interop, Rust-style struct
               | generics, cross-platform SIMD API and simply has better
               | compiler. You can win back some performance in Go by
               | writing hot parts in Go's ASM dialect at much greater
               | effort for a specific platform.
               | 
               | For example, Go has to resort to this https://github.com/
               | golang/go/blob/4ed358b57efdad9ed710be7f4f... in order to
               | efficiently scan memory, while in C# you write the
               | following once and it compiles to all supported ISAs with
               | their respective SIMD instructions for a given vector
               | width: https://github.com/dotnet/runtime/blob/56e67a7aacb
               | 8a644cc6b8... (there is a lot of code because C# covers
               | much wider range of scenarios and does not accept
               | sacrificing performance in odd lengths and edge cases,
               | which Go does).
               | 
               | Another example is computing CRC32: you have to write ASM
               | for Go https://github.com/golang/go/blob/4ed358b57efdad9e
               | d710be7f4f..., in C# you simply write standard vectorized
               | routine once https://github.com/dotnet/runtime/blob/56e67
               | a7aacb8a644cc6b8... (its codegen is competitive with
               | hand-intrinsified C++ code).
               | 
               | There is a lot more of this. Performance and low-level
               | primitives to achieve it have been an area of focus of
               | .NET for a long time, so it is disheartening to see one
               | tenth of effort in Go to receive so much more spotlight.
        
         | kragen wrote:
         | this sounds awesome, can i see?
        
         | PythagoRascal wrote:
         | I would be very interested in a more detailed write-up, if you
         | have the time (or have one already).
        
       | JonChesterfield wrote:
       | This is a good idea. The tree constructed by a parser is
       | relatively likely to want similar lifetimes for most of the
       | nodes.
       | 
       | If the nodes are mutable you can link them together as you go,
       | meaning some amount of overhead in the structure for links that
       | aren't in use, but no garbage left behind.
       | 
       | Storing pointers as distance from the start of a contiguous arena
       | means a structure you can mmap later without relocating.
       | 
       | All good stuff.
        
       | olliej wrote:
       | Parsing is fun, but assuming a good baseline allocator arena
       | allocation doesn't get you a whole lot these days (maybe it's
       | better under a GC environment where tear down does not require a
       | tree traversal?), especially if you're able to lazily build the
       | AST (this is what I did in JSC years ago).
       | 
       | I still feel that there's a lot of practical parsing stuff that
       | isn't considered in academia which leads to a near constant
       | preference for LALR and LR parser generators despite them being
       | honestly worse than recursive descent in almost every important
       | metric if you want to do anything at all other than producing a
       | completely generic AST for known good data.
        
       ___________________________________________________________________
       (page generated 2024-05-08 23:01 UTC)