[HN Gopher] Arena-based parsers
___________________________________________________________________
Arena-based parsers
Author : ibylich
Score : 121 points
Date : 2024-05-06 15:59 UTC (2 days ago)
(HTM) web link (iliabylich.github.io)
(TXT) w3m dump (iliabylich.github.io)
| justinpombrio wrote:
| Oh yeah, arena allocation is great. I've used it a few times and
| it's always a huge performance improvement. The requirements are
| that (i) the objects you're allocating all have the same type,
| and (ii) they all have the same lifetime (i.e. it's okay not to
| free any of them while others still live).
| knome wrote:
| Why would a rust arena allocator be required to have objects
| all of the same type in it? No C arena allocator would have
| such a restriction.
| Rusky wrote:
| It wouldn't. For some reason, people in Rust tend to apply the
| term "arena" to "vector + indices" (which does have that
| restriction) as well as to the proper definition.
| o11c wrote:
| "All objects of the same(-sized) type" is typically a
| constraint of bitmap allocators, not arena allocators.
| kragen wrote:
| your (i) is incorrect (arena allocators can allocate objects of
| different types, and indeed one of their strong points is that
| they don't suffer from fragmentation) and your (ii) doesn't
| make sense because _not_ freeing an object is _always_ okay,
| according to conventional programming language semantics anyway
|
| this particular arena allocator _does_ allocate objects of
| different types, almost arbitrary types in fact:
| fn alloc_uninitialized<T>(&self) -> NonNull<T> { ... }
|
| the safety concern is that nothing _outside_ the arena may
| outlive the arena if it contains a pointer into the arena. as the
| linked micro-book puts it, 'it shouldn't be possible for any
| arena-allocated [variable] to outlive the memory that it[s
| data]'s located on.'
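|
| in rust terms that usually falls out of the lifetime on the
| allocation method: the reference you get back borrows the arena,
| so the borrow checker won't let it outlive the arena. a toy
| sketch (made-up api, not the article's; each value is boxed here
| only to keep it short):
|
|     use std::cell::RefCell;
|
|     struct Arena {
|         items: RefCell<Vec<Box<dyn std::any::Any>>>,
|     }
|
|     impl Arena {
|         // the elided lifetime ties the returned &T to the borrow of the arena
|         fn alloc<T: 'static>(&self, value: T) -> &T {
|             let boxed = Box::new(value);
|             let ptr: *const T = &*boxed;
|             self.items.borrow_mut().push(boxed);
|             // the box is owned by the arena and its heap slot never moves,
|             // so the pointer stays valid for as long as the arena lives
|             unsafe { &*ptr }
|         }
|     }
|
|     fn main() {
|         let arena = Arena { items: RefCell::new(Vec::new()) };
|         let n: &u32 = arena.alloc(42);
|         println!("{n}");
|         // drop(arena); println!("{n}");   // <- rejected by the borrow checker
|     }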
| hansvm wrote:
| > not freeing an object is always okay, according to
| conventional programming language semantics anyway
|
| Conventional semantics have to hit reality eventually. Long-
| running programs can't just alloc forever without free, hence
| an arena not being the only allocator in most long-running
| programs.
| kragen wrote:
| yes, this is certainly a weak point in conventional
| semantics, but not relevant to this issue, which is about
| ensuring no dangling pointers, not about memory leaks
|
| if you're running low on memory, switching to an arena
| allocator is probably a bad idea, because it will usually
| increase your memory requirements (by delaying deallocation
| longer), not decrease them
| asplake wrote:
| > To release memory we simply release the blob itself, so it's
| like a batched deallocation, sort of. The consequence here is
| that all of our types must be:
|
| > 1. trivially copyable (: Copy if we speak in Rust)
|
| > 2. trivially destructible (i.e. they can't have impl Drop)
|
| 2 (Drop) seems obvious enough, but why 1 (Copy)? There's a bit of
| Rust knowledge implicit there that I'm unaware of, and I might
| not be alone in that. Could someone explain?
| mmastrac wrote:
| Trivially copyable and destructible means that they are plain-
| old data, with no allocations, associated resources or any
| other side-effects.
| LegionMammal978 wrote:
| Trivial destructibility alone is generally sufficient for
| that in Rust. Recall that all values (outside a Pin) are
| trivially moveable, so pretty much the only reason you'd want
| trivial copying is if you actually need more than one valid
| copy at once. (One exception is using Copy to express a poor
| man's Freeze bound, which is quite limited, and dubious in
| any case.)
| mmastrac wrote:
| A type that is not Copy may have associated memory
| allocations, however. You cannot simply forget a range of
| !Copy structs without the risk of memory leaks.
| LegionMammal978 wrote:
| If those !Copy structs deallocate their memory when
| dropped properly, then they aren't trivially
| destructible. If they _were_ trivially destructible, then
| they would leak their memory regardless of whether you
| dropped or forgot them. That's what it means for a type
| to be trivially destructible: dropping it is a no-op.
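|
| For example (made-up types, just to illustrate the distinction):
|
|     // !Copy (no derive), but trivially destructible: no field impls Drop,
|     // so dropping it is a no-op and forgetting it leaks nothing.
|     struct Span { lo: u32, hi: u32 }
|
|     // Not trivially destructible: Vec<u8> impls Drop, so dropping frees
|     // the buffer and forgetting it leaks the allocation.
|     struct Owned { bytes: Vec<u8> }
|
|     fn main() {
|         std::mem::forget(Span { lo: 0, hi: 4 });
|         std::mem::forget(Owned { bytes: vec![1, 2, 3] });  // leaks
|     }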
| mmastrac wrote:
| I think we're talking about the same thing -- the
| original post in the thread used "no impl Drop" as
| "trivially destructable", not "no functional difference
| between drop and forgetting".
| LegionMammal978 wrote:
| Well, I interpret "trivially destructible (i.e. they
| can't have impl Drop)" as meaning "the type itself can't
| impl Drop, nor can any of its fields or subfields", which
| is what it means for a type to be "trivially
| destructible" in Rust.
|
| But what do you mean by "trivially destructible" being
| different from "no functional difference between drop and
| forgetting"? Unless I'm mistaken, they are the same
| thing: if none of a type's subfields impl Drop, then
| dropping it does absolutely nothing; whereas if any
| subfield _does_ impl Drop, then it can't be a no-op,
| since the compiler has to form a &mut reference to pass
| it into Drop::drop() (a distinction very relevant for
| unsafe code).
|
| Could you give me an example of a !Copy type that is
| "trivially destructible", but can't be forgotten without
| leaking memory?
| LegionMammal978 wrote:
| I'm pretty sure being trivially destructible is sufficient for
| this use case. The only reason you might want a Copy bound when
| you're not actually copying values is if you're trying to
| disallow interior mutability, but that doesn't seem to be the
| case here. Perhaps the author was misinformed.
|
| Or maybe they're just using Copy as a type-level bound that
| guarantees a trivial destructor, even if it's more restrictive
| than strictly necessary.
| MindSpunk wrote:
| You can't have a negative trait bound to say "I take anything
| that _doesn't_ implement Drop". It's not a stabilized feature
| in Rust because it would make adding a trait impl to a struct
| a breaking API change, where it currently is not. Copy
| guarantees !Drop, so it's currently the only way to do it
| without opting into nightly features.
|
| You can remove the requirement for trivially destructible
| types by maintaining a singly linked list of (ptr to alloc'd
| object, destructorFn) pairs with the links allocated from the
| arena itself. The arena can just walk the list and call all
| the drop functions when it gets reset. You can even
| specialize for types with trivial destructors so they don't
| append to the list and so you only pay the extra cost if you
| allocate Drop types (using std::mem::needs_drop).
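|
| Roughly like this (untested sketch, made-up names; one fixed-size
| chunk and a panic on exhaustion just to keep it short):
|
|     use std::alloc::{alloc, dealloc, Layout};
|     use std::cell::Cell;
|     use std::{mem, ptr};
|
|     // One entry per allocation that actually needs a destructor call.
|     struct DropEntry {
|         object: *mut u8,
|         dropper: unsafe fn(*mut u8),
|         next: *mut DropEntry,
|     }
|
|     pub struct Arena {
|         base: *mut u8,
|         cap: usize,
|         used: Cell<usize>,
|         drops: Cell<*mut DropEntry>, // head of the drop list
|     }
|
|     impl Arena {
|         pub fn with_capacity(cap: usize) -> Self {
|             let base = unsafe { alloc(Layout::from_size_align(cap, 16).unwrap()) };
|             assert!(!base.is_null());
|             Arena { base, cap, used: Cell::new(0), drops: Cell::new(ptr::null_mut()) }
|         }
|
|         fn bump(&self, layout: Layout) -> *mut u8 {
|             assert!(layout.align() <= 16, "sketch only handles align <= 16");
|             let start = (self.used.get() + layout.align() - 1) & !(layout.align() - 1);
|             let end = start + layout.size();
|             assert!(end <= self.cap, "arena exhausted");
|             self.used.set(end);
|             unsafe { self.base.add(start) }
|         }
|
|         pub fn alloc<T>(&self, value: T) -> &mut T {
|             let p = self.bump(Layout::new::<T>()) as *mut T;
|             unsafe { p.write(value) };
|             // Only pay for a drop-list link when T actually has a destructor.
|             if mem::needs_drop::<T>() {
|                 unsafe fn drop_erased<T>(p: *mut u8) {
|                     unsafe { ptr::drop_in_place(p as *mut T) }
|                 }
|                 let entry = self.bump(Layout::new::<DropEntry>()) as *mut DropEntry;
|                 unsafe {
|                     entry.write(DropEntry {
|                         object: p as *mut u8,
|                         dropper: drop_erased::<T>,
|                         next: self.drops.get(),
|                     });
|                 }
|                 self.drops.set(entry);
|             }
|             unsafe { &mut *p }
|         }
|     }
|
|     impl Drop for Arena {
|         fn drop(&mut self) {
|             // Walk the list and run the destructors, newest first, then free the chunk.
|             let mut entry = self.drops.get();
|             while !entry.is_null() {
|                 unsafe {
|                     ((*entry).dropper)((*entry).object);
|                     entry = (*entry).next;
|                 }
|             }
|             unsafe { dealloc(self.base, Layout::from_size_align(self.cap, 16).unwrap()) };
|         }
|     }
|
|     fn main() {
|         let arena = Arena::with_capacity(4096);
|         let s: &mut String = arena.alloc(String::from("needs a destructor"));
|         let n: &mut u32 = arena.alloc(7); // trivially destructible: no list entry
|         println!("{s} / {n}");
|     } // dropping `arena` runs the String's destructor, then frees the chunk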
| LegionMammal978 wrote:
| > You can't have a negative trait bound to say "I take
| anything that _doesn't_ implement Drop". It's not a
| stabilized feature in Rust because it would make adding a
| trait impl to a struct a breaking API change, where it
| currently is not. Copy guarantees !Drop, so it's currently
| the only way to do it without opting into nightly features.
|
| Alternatively, you could just do an std::mem::needs_drop()
| check when inserting an element, and fail if it indicates
| that it would need to be dropped. (Yeah, its documentation
| protests that it can have false positives, but IIRC you
| have to be doing super wacky type-level stuff to trigger
| that for real.)
|
| That's why I said that Copy is sufficient (but restrictive)
| as a _type-level_ bound, since needs_drop() can't be done
| on that level. At least, not without the trick of throwing
| a post-mono error with an associated const.
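|
| Something like this at the top of the arena's alloc::<T>() (made-up
| helper name):
|
|     use std::mem::needs_drop;
|
|     fn assert_trivially_destructible<T>() {
|         // needs_drop::<T>() is a compile-time constant the optimizer folds away
|         assert!(!needs_drop::<T>(), "arena only accepts types without destructors");
|     }
|
|     fn main() {
|         assert_trivially_destructible::<(u32, &str)>(); // fine: nothing impls Drop
|         // assert_trivially_destructible::<Vec<u8>>();  // would panic at runtime
|     }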
| GrantMoyer wrote:
| I think it's a mistake. _Allocator Implementation_ [1] shows
| values being moved into the arena, which can be done without
| Copy or even Clone. Maybe the author got confused, since moving
| an object in Rust just _copies_ the bytes and invalidates the
| moved-from binding.
|
| [1]: https://iliabylich.github.io/arena-based-
| parsers/allocator_i...
| Cloudef wrote:
| Arena allocation is very trivial in zig
| himujjal wrote:
| This. Writing a parser in Zig is so simple. Just allocate once,
| and then start writing your parser.
|
| One allocator for parser, one for scanner. One for type
| allocation. Keep them all for semantic analysis. Write the
| output renderer (binary/language). Deallocate.
|
| In this whole process, it makes it so easy to not think about
| memory anymore. Just enjoy writing your program.
| JonChesterfield wrote:
| The scanner can probably be zero allocation though - zig
| should be able to express an iterator over the input byte
| array that returns a token on dereference without needing
| more state than the position in the array
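|
| Roughly this shape (sketched in Rust rather than Zig here, since
| that's what the article uses; tokenisation rules invented):
|
|     struct Scanner<'a> {
|         input: &'a [u8],
|         pos: usize, // the only state is a position into the input
|     }
|
|     #[derive(Debug)]
|     enum Token<'a> {
|         Word(&'a [u8]),
|         Punct(u8),
|     }
|
|     impl<'a> Iterator for Scanner<'a> {
|         type Item = Token<'a>;
|
|         fn next(&mut self) -> Option<Token<'a>> {
|             while self.pos < self.input.len() && self.input[self.pos].is_ascii_whitespace() {
|                 self.pos += 1;
|             }
|             let start = self.pos;
|             match self.input.get(self.pos)? {
|                 b if b.is_ascii_alphanumeric() => {
|                     while self.pos < self.input.len() && self.input[self.pos].is_ascii_alphanumeric() {
|                         self.pos += 1;
|                     }
|                     Some(Token::Word(&self.input[start..self.pos]))
|                 }
|                 &other => {
|                     self.pos += 1;
|                     Some(Token::Punct(other))
|                 }
|             }
|         }
|     }
|
|     fn main() {
|         let scanner = Scanner { input: &b"foo(bar)"[..], pos: 0 };
|         for token in scanner {
|             println!("{token:?}"); // no allocation anywhere on this path
|         }
|     }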
| kragen wrote:
| probably worth noting that the default peg backend for the c
| combinator parser 'hammer' uses an arena allocator for speed and
| because it's a good match for packrat memoization. there are
| hammer bindings for several languages including python and java,
| not sure if there's a rust one yet. hammer also has some other
| backends including lalr, and in some cases you can write a
| grammar such that you can parse it with either packrat or lalr
|
| hammer's arena allocator is not very fast, though. it's faster
| than any malloc i've seen, but a good arena allocator is faster
| than calling an empty function, so you have to inline the fast
| path. the open-source prototype arena allocator i wrote in
| http://canonical.org/~kragen/sw/dev3/kmregion.h (and .c) doesn't
| quite reach that level, but it takes about 1.9 nanoseconds per
| allocation (520 million allocations per second on one core),
| while glibc's malloc takes about 19 nanoseconds per allocation.
| more recent versions of glibc (on a faster cpu) have reduced that
| to 14 nanoseconds in that particular microbenchmark. details are
| at the link. your mileage may vary
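|
| the shape of the fast path is roughly this (rust-flavoured
| sketch, not the actual kmregion code): a pointer bump plus one
| compare, with the chunk-chaining slow path kept out of line so
| the whole thing inlines to a handful of instructions
|
|     pub struct Bump {
|         ptr: *mut u8, // next free byte in the current chunk
|         end: *mut u8, // one past the end of the current chunk
|     }
|
|     impl Bump {
|         #[inline(always)]
|         pub fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
|             let aligned = (self.ptr as usize + align - 1) & !(align - 1);
|             if aligned + size <= self.end as usize {
|                 self.ptr = (aligned + size) as *mut u8; // fast path: bump and go
|                 aligned as *mut u8
|             } else {
|                 self.grow_and_alloc(size, align)
|             }
|         }
|
|         #[inline(never)]
|         fn grow_and_alloc(&mut self, _size: usize, _align: usize) -> *mut u8 {
|             unimplemented!("chunk chaining elided from this sketch")
|         }
|     }
|
|     fn main() {
|         let chunk = Box::leak(vec![0u8; 4096].into_boxed_slice());
|         let mut bump = Bump {
|             ptr: chunk.as_mut_ptr(),
|             end: unsafe { chunk.as_mut_ptr().add(chunk.len()) },
|         };
|         let p = bump.alloc(std::mem::size_of::<u64>(), std::mem::align_of::<u64>());
|         unsafe { (p as *mut u64).write(42) };
|     }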
|
| chris wellons wrote an excellent article last year about his
| approach to arena allocation in c in
| https://nullprogram.com/blog/2023/09/27/
|
| for those who aren't aware, garbage collectors for languages like
| js, c#, ocaml, or java are usually generational copying
| collectors for efficiency, which means they use what is
| effectively an arena allocator for most new allocations. often
| they reserve a cpu register or two for this, which my c
| implementation above can't
|
| for c, the apache portable runtime 'apr' has a generic arena
| allocator in it called pools, which also supports destructors
| (the drop trait, in rust) and nested pools
|
| the arena allocator in gcc is called 'obstacks' and got moved
| into libiberty so you can use it in other c programs. or
| hypothetically in rust
| fanf2 wrote:
| obstacks are in glibc so they are very widely available
| https://sourceware.org/glibc/manual/latest/html_node/Obstack...
| kragen wrote:
| oh thanks, i didn't realize
| a_t48 wrote:
| protobuf for C++ has this feature -
| https://protobuf.dev/reference/cpp/arenas/ (not so much parsing
| but (de)serializing)
| 5kg wrote:
| Somewhat related talk: Modernizing Compiler Design for Carbon
| Toolchain - CppNow 2023
| (https://www.youtube.com/watch?v=ZI198eFghJk)
| willvarfar wrote:
| I enjoy messing around with parsing things etc. Although I started
| with handmade, unschooled attempts many decades ago, I later went
| through the classic yacc/bison phase etc before firmly getting
| back into the hand-rolling custom side of things where I'm much
| happier.
|
| My main motivation is speed, e.g. I have enjoyed handcrafting
| wickedly fast custom JSON and SQL parsers and bits of code that
| sit on top of them.
|
| My general approach now is to use a tokenizer that generates an
| int that represents each token, where the bits in the int tell me
| the location of the token in the source buffer and its type etc.
|
| In languages with a lot of per-object overhead like Java this is
| literally a long; but in the C/C++/rust camp it can look and feel
| like an object or struct or whatever because, underneath, it ends
| up still being an int that gets passed in registers and on the
| stack etc.
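|
| Concretely, something like this (field widths and names purely
| illustrative):
|
|     // offset and length into the source buffer plus a small kind tag,
|     // all packed into one u64 that travels in a register
|     #[derive(Clone, Copy)]
|     struct Token(u64);
|
|     #[repr(u8)]
|     #[derive(Clone, Copy)]
|     enum Kind { Ident = 0, Number = 1, Punct = 2 }
|
|     impl Token {
|         fn new(kind: Kind, offset: u32, len: u32) -> Token {
|             Token(((kind as u64) << 56) | ((len as u64 & 0xff_ffff) << 32) | offset as u64)
|         }
|         fn kind(self) -> u8 { (self.0 >> 56) as u8 }
|         fn len(self) -> u32 { ((self.0 >> 32) & 0xff_ffff) as u32 }
|         fn offset(self) -> u32 { self.0 as u32 }
|         fn text(self, src: &str) -> &str {
|             &src[self.offset() as usize..(self.offset() + self.len()) as usize]
|         }
|     }
|
|     fn main() {
|         let src = "select 42";
|         let t = Token::new(Kind::Number, 7, 2);
|         assert_eq!(t.text(src), "42");
|         assert_eq!(t.kind(), Kind::Number as u8);
|     }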
|
| Sometimes the parsing is one-pass and the tokens don't need to be
| stored or anything; it's usually the memory allocations that kill
| parsing performance, and a once-through json decoder can
| completely eliminate bottlenecks on hot paths in data processing
| etc.
|
| Other times I run through once and store these tokens in an
| array, particularly if I'm going to be going back over them etc.
| It's actually easy to make a function that, given an 'int' token,
| finds the next one, so if you are going through the data several
| times you don't need any allocations. But other times you want to
| go backwards or you are really going to be going through the data
| a lot so it makes sense to store the offsets of everything.
|
| Sometimes future steps will be reordering things and transforming
| the AST etc; in those cases, I generally have a writeable arena
| where I append the text of new tokens, and a bit in the token
| ints discriminates between the read-only source buffer and this
| transformed buffer. This is all particularly cool when it comes
| to generating sensible error messages with context, which is a
| finesse most handmade-parser authors later rue having
| overlooked :)
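|
| The source-vs-transformed discrimination is roughly this (layout
| invented again):
|
|     // top bit of the packed token says which buffer the offset points
|     // into: the read-only source, or the arena of rewritten text
|     const REWRITTEN: u64 = 1 << 63;
|
|     fn resolve<'a>(token: u64, source: &'a str, rewritten: &'a str) -> &'a str {
|         let offset = (token & 0xffff_ffff) as usize;
|         let len = ((token >> 32) & 0x7fff_ffff) as usize;
|         let buf = if token & REWRITTEN != 0 { rewritten } else { source };
|         &buf[offset..offset + len]
|     }
|
|     fn main() {
|         let source = "select id from t";
|         let rewritten = "renamed_id";
|         let tok = REWRITTEN | (10u64 << 32); // offset 0, len 10, rewritten buffer
|         assert_eq!(resolve(tok, source, rewritten), "renamed_id");
|     }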
|
| I would be interested to know just how unmainstream this kind of
| approach is. Please weigh in, would love to learn new tricks :)
| mightee_pirate wrote:
| I am very interested in parsers and in optimizing them.
| Currently I have an HTML parser written in Go. I have used Go's
| html parser, but it is waaay slower than mine. Here's what I do:
|
| - stream the content (either from a file or from the network)
| and send chunks to the tokenizer
|
| - tokenizer is dead simple with a state machine
|
| - when an html tag is found, emit the token with positions
| (problem is someone needs to hold the whole incoming buffer.
| currently it's copying the incoming bytes, but I plan to hold
| the bytes until a token is found; after emitting the token with
| the buffer (copying), the inner buffer will be freed)
|
| - parser does its thing and creates a DOM
|
| As you can see, I am not well versed in optimizing to the last
| bit, but I am very interested in diving deep into the
| optimizations.
|
| Currently I am looking at zero-copy networking. There is
| supposedly a library for Go (fasthttp) that does not copy
| bytes from the networking interface, but I haven't tried it yet.
|
| What kind of algorithms would you recommend for tokenizing and
| parsing HTML/XML very efficiently?
| kragen wrote:
| you may want to add blank lines between your bullets
| mightee_pirate wrote:
| thx
| kragen wrote:
| happy to help
| willvarfar wrote:
| There are so many 'it depends' here, based on your use-cases
| and perhaps details specific to go optimisation that I am
| unaware of.
|
| The first step, tokenization, might well be very go-specific
| around zero-copy networking and things, sorry. The general
| idea, though, would be to use a nice little finite-state-
| machine to eat characters and recognise when the tags open
| and close, the name of the tag, the attributes and values,
| the comments and the body etc. And if you can avoid allocating
| any kind of object on the heap at this stage, it'll almost
| always be a big win.
|
| But you need to then create a DOM, and you're going to need
| memory allocation for that.
|
| But the arena approach in the article is good for this; use a
| big byte array as an arena, and write a cleaned-up, normalized
| copy of the parsed html into it, with length-
| prefixes and distance-to-next-tag etc baked in. Typically a
| DOM has nodes with parent, previous sibling and next sibling
| pointers. You'll probably still want these, but these can be
| offsets written into the byte buffer in between fragments of
| the parsed html, rather than maintained as a separate
| structure.
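|
| A tiny sketch of that shape (record layout invented: four
| little-endian u32s per node, followed by the name bytes, all in
| one Vec<u8>, so the whole tree could be written out and read
| back with no pointer fixups):
|
|     const NIL: u32 = u32::MAX;
|
|     struct Dom {
|         arena: Vec<u8>, // node records and tag-name bytes both live here
|     }
|
|     impl Dom {
|         /// Append a record [parent, next_sibling, name_off, name_len] plus the
|         /// name bytes, and return the record's offset, which acts as the node id.
|         fn push_node(&mut self, parent: u32, next_sibling: u32, name: &str) -> u32 {
|             let off = self.arena.len() as u32;
|             for field in [parent, next_sibling, off + 16, name.len() as u32] {
|                 self.arena.extend_from_slice(&field.to_le_bytes());
|             }
|             self.arena.extend_from_slice(name.as_bytes());
|             off
|         }
|
|         fn name_of(&self, node: u32) -> &str {
|             let u32_at = |o: usize| u32::from_le_bytes(self.arena[o..o + 4].try_into().unwrap());
|             let name_off = u32_at(node as usize + 8) as usize;
|             let name_len = u32_at(node as usize + 12) as usize;
|             std::str::from_utf8(&self.arena[name_off..name_off + name_len]).unwrap()
|         }
|     }
|
|     fn main() {
|         let mut dom = Dom { arena: Vec::new() };
|         let html = dom.push_node(NIL, NIL, "html");
|         let body = dom.push_node(html, NIL, "body");
|         assert_eq!(dom.name_of(body), "body");
|     }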
|
| Then, if the user of your DOM can modify it, you can have a
| node navigation API that seamlessly stores rewritten DOM
| parts in a new byte arena or appends them to the end of the
| original arena.
| neonsunset wrote:
| It may seem unexpected given all the hype around Go, but it
| is a surprisingly poor choice for this. If you want a more
| convenient language than C++ or Rust but want to retain the ability
| to reach optimal performance, C# could serve you much better.
|
| Go underperforms at trivial XML parsing:
| https://news.ycombinator.com/item?id=40283721 (and the parent
| comment)
|
| If you push it, C# can extract optimal HW utilization when
| parsing text, beating C++:
| https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-
| among-... (Go was not on the list because it was that much
| slower)
| willvarfar wrote:
| A quick google finds https://github.com/ffenix113/fastxml
| which seems to be doing the 'tips and tricks' of arena
| parsing and things. Any idea how fast it compares when you
| get away from memory allocations and things and end up just
| seeing how the compiler does basic byte manipulation?
| neonsunset wrote:
| The description indicates it is not production-ready, and
| the repository is archived.
|
| If you pull out all the stops in each respective language, C#
| will always end up winning at parsing text, as it offers C
| structs, pointers, zero-cost interop, Rust-style struct
| generics, a cross-platform SIMD API, and simply a better
| compiler. You can win back some performance in Go by
| writing hot parts in Go's ASM dialect, at much greater
| effort and only for a specific platform.
|
| For example, Go has to resort to this https://github.com/
| golang/go/blob/4ed358b57efdad9ed710be7f4f... in order to
| efficiently scan memory, while in C# you write the
| following once and it compiles to all supported ISAs with
| their respective SIMD instructions for a given vector
| width: https://github.com/dotnet/runtime/blob/56e67a7aacb
| 8a644cc6b8... (there is a lot of code because C# covers
| a much wider range of scenarios and does not accept
| sacrificing performance in odd lengths and edge cases,
| which Go does).
|
| Another example is computing CRC32: you have to write ASM
| for Go https://github.com/golang/go/blob/4ed358b57efdad9e
| d710be7f4f..., in C# you simply write a standard vectorized
| routine once https://github.com/dotnet/runtime/blob/56e67
| a7aacb8a644cc6b8... (its codegen is competitive with
| hand-intrinsified C++ code).
|
| There is a lot more of this. Performance, and the low-level
| primitives to achieve it, have been an area of focus for
| .NET for a long time, so it is disheartening to see a tenth
| of the effort in Go receive so much more spotlight.
| kragen wrote:
| this sounds awesome, can i see?
| PythagoRascal wrote:
| I would be very interested in a more detailed write-up, if you
| have the time (or have one already).
| JonChesterfield wrote:
| This is a good idea. The tree constructed by a parser is
| relatively likely to want similar lifetimes for most of the
| nodes.
|
| If the nodes are mutable you can link them together as you go,
| meaning some amount of overhead in the structure for links that
| aren't in use, but no garbage left behind.
|
| Storing pointers as distance from the start of a contiguous arena
| means a structure you can mmap later without relocating.
|
| All good stuff.
| olliej wrote:
| Parsing is fun, but assuming a good baseline allocator, arena
| allocation doesn't get you a whole lot these days (maybe it's
| better in a GC environment, where teardown does not require a
| tree traversal?), especially if you're able to lazily build the
| AST (this is what I did in JSC years ago).
|
| I still feel that there's a lot of practical parsing stuff that
| isn't considered in academia, which leads to a near-constant
| preference for LALR and LR parser generators despite them being
| honestly worse than recursive descent in almost every important
| metric if you want to do anything at all other than producing a
| completely generic AST for known-good data.
___________________________________________________________________
(page generated 2024-05-08 23:01 UTC)